This blog post presents the latest paper list retrieved from Arxiv.org on 2025-07-11, updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org and updated automatically at around 12:00 every day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-07-11)

A total of 465 papers were updated today, including:

  • Natural Language Processing: 66 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 118 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 118 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 150 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

【Quick Read】: This paper addresses the lack of a comprehensive benchmark for evaluating visual grounded reasoning models, in particular how to assess perception of subtle targets in complex scenes, acquisition of traceable evidence, and second-order reasoning. The key to its solution is the TreeBench benchmark, built on three core principles: focused perception of subtle targets in complex scenes, evidence traceability via bounding-box evaluation, and tests of object interactions and spatial hierarchies that go beyond simple object localization. The paper also introduces TreeVGR, a training paradigm that jointly supervises localization and reasoning with reinforcement learning, improving localization accuracy and the explainability of reasoning.

Link: https://arxiv.org/abs/2507.07999
Authors: Haochen Wang,Xiangtai Li,Zilong Huang,Anran Wang,Jiacong Wang,Tao Zhang,Jiani Zheng,Sule Bai,Zijian Kang,Jiashi Feng,Zhuochen Wang,Zhaoxiang Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human “thinking with images”. However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs, even the most advanced models struggle with this benchmark, where none of them reach 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at this https URL.

[NLP-1] PyVision: Agentic Vision with Dynamic Tooling

【Quick Read】: This paper targets the problem that prior approaches to visual reasoning are limited by predefined workflows and static toolsets. The key to its solution is PyVision, an interactive, multi-turn framework that enables multimodal large language models (MLLMs) to autonomously generate, execute, and refine Python-based tools, unlocking a flexible and interpretable problem-solving process.

Link: https://arxiv.org/abs/2507.07998
Authors: Shitian Zhao,Haoquan Zhang,Shaoheng Lin,Ming Li,Qilong Wu,Kaipeng Zhang,Chen Wei
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 Pages, 10 Figures, Technical report

Abstract:LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.

[NLP-2] Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

【Quick Read】: This paper addresses the problem that existing evaluations of large language models' (LLMs) medical reasoning either assess it inadequately or scale poorly, with no rigorous benchmark available. The key to its solution is MedThink-Bench, a benchmark for rigorous, explainable, and scalable assessment of LLMs' medical reasoning, comprising 500 challenging questions across ten medical domains, each paired with expert-crafted step-by-step rationales. The paper also proposes LLM-w-Ref, an evaluation framework that leverages these fine-grained rationales and an LLM-as-a-Judge mechanism to assess intermediate reasoning with expert-level fidelity while remaining scalable.

Link: https://arxiv.org/abs/2507.07988
Authors: Shuang Zhou,Wenya Xie,Jiaxi Li,Zaifu Zhan,Meijia Song,Han Yang,Cheyenna Espinoza,Lindsay Welton,Xinnie Mai,Yanwei Jin,Zidu Xu,Yuen-Hei Chung,Yiyun Xing,Meng-Han Tsai,Emma Schaffer,Yucheng Shi,Ninghao Liu,Zirui Liu,Rui Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 6 figures

Abstract:As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs’ medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs’ medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs’ medical reasoning, advancing their safe and responsible deployment in clinical practice.

[NLP-3] Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology

【Quick Read】: This paper asks how to effectively support clinical decision-making in complex domains such as rheumatology. The key to its solution is to pair smaller language models (SLMs) with retrieval-augmented generation (RAG), which preserves strong diagnostic and therapeutic performance while substantially reducing energy consumption and enabling cost-efficient, local deployment.

Link: https://arxiv.org/abs/2507.07983
Authors: Sabine Felde,Rüdiger Buchkremer,Gamal Chehab,Christian Thielscher,Jörg HW Distler,Matthias Schneider,Jutta G. Richter
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.

[NLP-4] Why is Your Language Model a Poor Implicit Reward Model?

【Quick Read】: This paper investigates the generalization gap between implicit reward models (IM-RMs) and explicit reward models (EX-RMs). The key finding is that IM-RMs rely more heavily on superficial token-level cues, which makes them generalize worse both under token-level distribution shifts and in-distribution. Through theory and experiments, the paper shows that this design difference is the main cause of the gap, ruling out alternative hypotheses such as the claim that IM-RMs struggle in tasks where generation is harder than verification.

Link: https://arxiv.org/abs/2507.07981
Authors: Noam Razin,Yong Lin,Jiarui Yao,Sanjeev Arora
Affiliations: Princeton Language and Intelligence, Princeton University; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
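
The contrast the abstract draws can be made concrete in a few lines. Below is a minimal sketch, assuming a DPO-style implicit reward (a beta-scaled policy-vs-reference log-probability ratio) and random toy tensors in place of a real language model; all names and dimensions are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

vocab, hidden, seq = 100, 32, 6  # toy sizes

def explicit_reward(last_hidden: torch.Tensor, head: torch.Tensor) -> torch.Tensor:
    """EX-RM: a dedicated linear head over the LM's final hidden state."""
    return last_hidden @ head  # scalar reward for the sequence

def implicit_reward(policy_logits, ref_logits, response_ids, beta=0.1):
    """IM-RM (DPO-style): reward is the scaled log-prob ratio of the
    response under the policy vs. a reference model."""
    lp_pol = F.log_softmax(policy_logits, dim=-1)
    lp_ref = F.log_softmax(ref_logits, dim=-1)
    tok = response_ids.unsqueeze(-1)
    pol = lp_pol.gather(-1, tok).squeeze(-1).sum(-1)
    ref = lp_ref.gather(-1, tok).squeeze(-1).sum(-1)
    return beta * (pol - ref)

h = torch.randn(hidden)       # final hidden state of a sequence
w = torch.randn(hidden)       # linear reward head
print(explicit_reward(h, w))

logits_pol = torch.randn(seq, vocab)
logits_ref = torch.randn(seq, vocab)
ids = torch.randint(0, vocab, (seq,))
print(implicit_reward(logits_pol, logits_ref, ids))
```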

[NLP-5] Scaling RL to Long Videos

【Quick Read】: This paper addresses the scalability of vision-language models (VLMs) on long-video reasoning tasks, i.e., how to process and understand long video content effectively. The key to its solution is a full-stack framework with three core components: LongVideo-Reason, a large-scale, high-quality long-video QA dataset; a two-stage training pipeline that combines chain-of-thought supervised fine-tuning with reinforcement learning; and Multi-modal Reinforcement Sequence Parallelism (MR-SP), a training infrastructure designed for long-video RL that caches video embeddings for efficient rollout and prefilling, markedly improving training efficiency.

Link: https://arxiv.org/abs/2507.07966
Authors: Yukang Chen,Wei Huang,Baifeng Shi,Qinghao Hu,Hanrong Ye,Ligeng Zhu,Zhijian Liu,Pavlo Molchanov,Jan Kautz,Xiaojuan Qi,Sifei Liu,Hongxu Yin,Yao Lu,Song Han
Affiliations: NVIDIA; MIT; HKU; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code and models are available at this https URL

Abstract:We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).

[NLP-6] MIRIX: Multi-Agent Memory System for LLM-Based Agents

【Quick Read】: This paper addresses the limitations of current AI agents' memory, especially their weak ability to personalize, abstract, and reliably recall user-specific information. The key to its solution is MIRIX, a modular, multi-agent memory system with six distinct, carefully structured memory types (Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault) and a multi-agent framework that enables persistent storage, reasoning over, and accurate retrieval of diverse, long-term user data. MIRIX moves beyond text-only memory to incorporate rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios.

Link: https://arxiv.org/abs/2507.07957
Authors: Yu Wang,Xi Chen
Affiliations: MIRIX AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field’s most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
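
The abstract names six memory types coordinated by a multi-agent framework. Here is a hypothetical sketch of such a typed store plus a router, assuming substring matching as a stand-in for real retrieval; the type names come from the abstract, but the routing logic is purely illustrative and not MIRIX's actual design.

```python
from dataclasses import dataclass, field

# Six memory types named in the abstract; routing below is hypothetical.
MEMORY_TYPES = ["core", "episodic", "semantic",
                "procedural", "resource", "knowledge_vault"]

@dataclass
class MemoryStore:
    name: str
    entries: list = field(default_factory=list)

    def add(self, item: str) -> None:
        self.entries.append(item)

    def search(self, query: str) -> list:
        return [e for e in self.entries if query.lower() in e.lower()]

class MemoryRouter:
    """Dispatches writes and reads across typed stores, loosely mimicking
    a multi-agent coordinator."""
    def __init__(self):
        self.stores = {t: MemoryStore(t) for t in MEMORY_TYPES}

    def write(self, memory_type: str, item: str) -> None:
        self.stores[memory_type].add(item)

    def retrieve(self, query: str) -> dict:
        hits = {t: s.search(query) for t, s in self.stores.items()}
        return {t: h for t, h in hits.items() if h}

router = MemoryRouter()
router.write("episodic", "User opened the quarterly report at 9:14")
router.write("semantic", "The user works as a financial analyst")
print(router.retrieve("user"))
```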

[NLP-7] SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

【Quick Read】: This paper targets anomaly detection and reasoning in industrial settings, in particular the shortcomings of existing vision-language models (VLMs) in providing interpretable explanations and generalizing to unseen categories. The key to its solution is the SAGE framework, which strengthens anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO): SFE integrates domain knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. The paper also contributes AD-PL, a preference-optimized dataset for industrial anomaly reasoning, and proposes Multiscale Logical Evaluation (MLE) to quantitatively assess model logic and consistency.

Link: https://arxiv.org/abs/2507.07939
Authors: Guoxin Zang,Xue Li,Donglin Di,Lanshun Nie,Dechen Zhan,Yang Song,Lei Fan
Affiliations: Harbin Institute of Technology; DZ Matrix; University of New South Wales
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACMMM2025

Abstract:While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at this https URL.

[NLP-8] DTECT: Dynamic Topic Explorer & Context Tracker

【Quick Read】: This paper tackles the problem of uncovering evolving themes and trends amid the explosive growth of textual data. Existing dynamic topic modeling techniques, while powerful, typically live in fragmented pipelines with little support for interpretation or user-friendly exploration. The key to its solution is DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system with a unified workflow covering data preprocessing, multiple model architectures, and dedicated metrics for evaluating the topic quality of temporal topic models. DTECT substantially improves interpretability through LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface.

Link: https://arxiv.org/abs/2507.07910
Authors: Suman Adhya,Debarshi Kumar Sanyal
Affiliations: Indian Association for the Cultivation of Science
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Code: this https URL | Demo: this https URL | Video: this https URL

Abstract:The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at this https URL.

[NLP-9] Automating MD simulations for Proteins using Large Language Models: NAMD-Agent

【Quick Read】: This paper addresses the time-consuming and error-prone preparation of high-quality input files for molecular dynamics (MD) simulations. The key to its solution is an automated pipeline that combines large language models (LLMs), specifically Gemini 2.0 Flash, with Python scripting and Selenium-based web automation to streamline the generation of NAMD simulation inputs. The pipeline exploits CHARMM GUI's comprehensive web interface and uses Gemini's code generation and iterative refinement to automatically write, execute, and revise simulation scripts that navigate CHARMM GUI, extract the appropriate parameters, and produce the required NAMD input files.

Link: https://arxiv.org/abs/2507.07887
Authors: Achuth Chandrasekhar,Amir Barati Farimani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 34 pages

Abstract:Molecular dynamics simulations are an essential tool in understanding protein structure, dynamics, and function at the atomic level. However, preparing high quality input files for MD simulations can be a time consuming and error prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with python scripting and Selenium based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI’s comprehensive web-based interface for preparing simulation-ready inputs for NAMD. By integrating Gemini’s code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. Post processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.

[NLP-10] DocCHA: Towards LLM-Augmented Interactive Online Diagnosis System

【Quick Read】: This paper addresses the poor performance of existing conversational health agents (CHAs) in clinical diagnosis: they are static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. The key to its solution is the DocCHA framework, which decomposes the diagnostic process into three stages (symptom elicitation, history acquisition, and causal graph construction) and uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links, yielding structured, transparent, and efficient diagnostic conversations.

Link: https://arxiv.org/abs/2507.07870
Authors: Xinyi Liu,Dachun Sun,Yi R. Fung,Dilek Hakkani-Tür,Tarek Abdelzaher
Affiliations: University of Illinois Urbana-Champaign; The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations – paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.

[NLP-11] Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation

【Quick Read】: This paper tackles semantic alignment between AI systems and documents within a multi-layered semantic game architecture, where the core challenge is achieving layer-by-layer convergence of hierarchical sub-games inside a self-referential framework. The key to its solution is a nested game-theoretic structure in which a composite operator \phi(\cdot, \gamma(\cdot)) separates the main semantic convergence from the resolution of local sub-games, so that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. The approach builds on the empathetic embedding concept of Alpay Algebra IV and is grounded in category theory, information theory, and realistic models of AI cognition, ensuring both practical applicability and mathematical rigor.

Link: https://arxiv.org/abs/2507.07868
Authors: Bugra Kilictas,Faruk Alpay
Affiliations: Bahçeşehir University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 2 figures

Abstract:This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV’s empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator \phi(\cdot, \gamma(\cdot)) where \phi drives the main semantic convergence while \gamma resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach’s fixed-point theorem to transfinite contexts, a novel \phi -topology based on the Kozlov-Maz’ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces – a deliberate instantiation of the “semantic virus” concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.
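
The composite operator lends itself to a compact restatement. Below is a sketch in LaTeX, assuming a Banach-style contraction on an ordinary metric space; the paper itself works in a transfinite setting, so this is only the finite-iteration shadow of its construction, with the constant q and metric d as illustrative assumptions.

```latex
% Sketch of the nested fixed-point iteration described in the abstract.
\begin{align}
  x_{n+1} &= \phi\bigl(x_n, \gamma(x_n)\bigr), \\
  d\bigl(\phi(x,\gamma(x)),\, \phi(y,\gamma(y))\bigr) &\le q\, d(x,y),
  \qquad 0 \le q < 1, \\
  x^{*} &= \lim_{n \to \infty} x_n, \qquad
  x^{*} = \phi\bigl(x^{*}, \gamma(x^{*})\bigr).
\end{align}
```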

[NLP-12] From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

【Quick Read】: This paper addresses the degradation of document retrieval and generation in retrieval-augmented generation (RAG) systems caused by coreferential complexity in retrieved documents. The key to its solution is applying coreference resolution, which improves retrieval relevance, contextual understanding, and overall question-answering (QA) performance, with smaller models benefiting the most from the disambiguation.

Link: https://arxiv.org/abs/2507.07847
Authors: Youngjoon Jang,Seongtae Hong,Junyoung Son,Sungjin Park,Chanjun Park,Heuiseok Lim
Affiliations: Korea University; Naver Corp
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
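
The pooling comparison in the abstract is easy to pin down. Here is a minimal sketch of mean pooling over token embeddings, assuming coreference resolution has already rewritten the text (e.g., pronouns replaced by their antecedents) before the embeddings are computed; shapes are toy values.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean pooling over non-padding tokens, the strategy the paper found
    to capture context best after coreference resolution."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

# Toy example: 4 tokens, 8-dim embeddings, last token is padding.
emb = np.random.rand(4, 8)
mask = np.array([1, 1, 1, 0])
print(mean_pool(emb, mask).shape)  # (8,)
```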

[NLP-13] Conditional Unigram Tokenization with Parallel Data ICML2025

【Quick Read】: This paper addresses semantic alignment in cross-lingual tokenization, aiming to improve semantic consistency across languages by conditioning target-token probabilities on the source language. The key to its solution is conditional unigram tokenization: given a fixed source-language tokenizer, it learns a target-language tokenizer that maximizes cross-lingual semantic alignment.

Link: https://arxiv.org/abs/2507.07824
Authors: Gianluca Vico,Jindřich Libovický
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 4 figures, submitted to Tokenization Workshop (TokShop) at ICML 2025

Abstract:We introduce conditional unigram tokenization, a novel approach that extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, our method learns a target tokenizer that maximizes cross-lingual semantic alignment. We evaluate our tokenizer on four language pairs across different families and resource levels, examining intrinsic properties and downstream performance on machine translation and language modeling. While our conditional tokenizer maintains comparable statistical properties to standard unigram tokenizers, results are mixed: we observe no improvements in machine translation quality, but find consistent perplexity reductions in language modeling. We hypothesize that quadratic scaling of conditional probability estimation with respect to the vocabulary size creates a data efficiency bottleneck. Our findings suggest that alternative parameterizations may be necessary for practical cross-lingual tokenization.
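
The core conditioning idea can be illustrated with toy counts. A sketch follows, assuming a crude co-occurrence "alignment" over sentence pairs as a stand-in for the paper's estimator; it also makes visible why estimating p(target | source) scales quadratically with vocabulary size, the bottleneck the abstract hypothesizes.

```python
from collections import defaultdict

# Toy parallel corpus: (source tokens, target tokens).
parallel = [
    (["the", "cat"], ["die", "katze"]),
    (["the", "dog"], ["der", "hund"]),
]

pair_counts = defaultdict(float)  # one entry per (target, source) pair
src_counts = defaultdict(float)
for src, tgt in parallel:
    for s in src:
        for t in tgt:
            pair_counts[(t, s)] += 1.0
            src_counts[s] += 1.0

def p_target_given_source(t: str, s: str) -> float:
    """Conditional target-token probability from co-occurrence counts."""
    return pair_counts[(t, s)] / src_counts[s] if src_counts[s] else 0.0

print(p_target_given_source("katze", "cat"))  # 0.5 under these toy counts
```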

[NLP-14] On the Effect of Instruction Tuning Loss on Generalization ACL

【Quick Read】: This paper questions whether the loss function used in instruction tuning is adequately optimized, and in particular whether the conventional auto-regressive objective (computing loss only on response tokens while ignoring prompt tokens) is truly optimal. The key to its solution is Weighted Instruction Tuning (WIT), which assigns different weights to prompt and response tokens to improve performance and robustness across settings. Experiments show that a suitable weight ratio yields markedly better models and provides a better starting point for subsequent preference-alignment training.

Link: https://arxiv.org/abs/2507.07817
Authors: Anwoy Chatterjee,H S V N S Kowndinya Renduchintala,Sumit Bhatia,Tanmoy Chakraborty
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Transactions of the Association for Computational Linguistics (TACL)

Abstract:Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective - where loss is computed only on response tokens, excluding prompt tokens - is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scale, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings and also serve as better starting points for the subsequent preference alignment training. These findings highlight the need to reconsider instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at this https URL.
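
The differential weighting the abstract describes amounts to a re-weighted token-level cross-entropy. A minimal sketch, with illustrative weight values (the paper sweeps prompt/response weights rather than fixing them; a low-to-moderate prompt weight with a moderate-to-high response weight worked best in its experiments):

```python
import torch
import torch.nn.functional as F

def weighted_instruction_loss(logits, labels, prompt_mask,
                              w_prompt=0.2, w_response=1.0):
    """WIT-style objective: per-token cross-entropy, with separate
    weights for prompt tokens and response tokens."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    ).view(labels.shape)
    pm = prompt_mask.float()
    weights = w_prompt * pm + w_response * (1.0 - pm)
    return (per_token * weights).sum() / weights.sum()

vocab, seq = 50, 8
logits = torch.randn(1, seq, vocab)
labels = torch.randint(0, vocab, (1, seq))
prompt_mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0]])  # first 3 tokens = prompt
print(weighted_instruction_loss(logits, labels, prompt_mask))
```

Note that w_prompt=0 recovers the conventional response-only objective, so the standard loss is a special case of this parameterization.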

[NLP-15] Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

【Quick Read】: This paper examines the relationship between large language models' (LLMs) in-context learning (ICL) performance and their ability to recognize repetitive input patterns. Its core contribution is to analyze this relationship from the perspective of skill neurons, specifically repetition neurons, rather than the attention heads studied in prior work. The key finding is that the impact of repetition neurons on ICL performance varies with the depth of the layer in which they reside; by comparing repetition neurons with induction heads, the paper identifies strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

Link: https://arxiv.org/abs/2507.07810
Authors: Nhi Hoai Doan,Tatsuya Hiraoka,Kentaro Inui
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

[NLP-16] Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers ECML-PKDD

【Quick Read】: This paper addresses how to translate semantically consistent continuous representations of logic formulae back into concrete requirements, i.e., making semantic embeddings invertible. The key to its solution is training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae, generating valid formulae that are often simpler in length and nesting while remaining semantically close (or equivalent) to the originals.

Link: https://arxiv.org/abs/2507.07808
Authors: Sara Candussio,Gaia Saveri,Gabriele Sarti,Luca Bortolussi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures, to be published in ECML-PKDD

Abstract:Continuous representations of logic formulae allow us to integrate symbolic knowledge into data-driven learning algorithms. If such embeddings are semantically consistent, i.e. if similar specifications are mapped into nearby vectors, they enable continuous learning and optimization directly in the semantic space of formulae. However, to translate the optimal continuous representation into a concrete requirement, such embeddings must be invertible. We tackle this issue by training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a powerful formalism that allows us to describe properties of signals varying over time in an expressive yet concise way. By constructing a small vocabulary from STL syntax, we demonstrate that our proposed model is able to generate valid formulae after only 1 epoch and to generalize to the semantics of the logic in about 10 epochs. Additionally, the model is able to decode a given embedding into formulae that are often simpler in terms of length and nesting while remaining semantically close (or equivalent) to gold references. We show the effectiveness of our methodology across various levels of training formulae complexity to assess the impact of training data on the model’s ability to effectively capture the semantic information contained in the embeddings and generalize out-of-distribution. Finally, we deploy our model for solving a requirement mining task, i.e. inferring STL specifications that solve a classification task on trajectories, performing the optimization directly in the semantic space.

[NLP-17] StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

【Quick Read】: This paper addresses two problems in streaming speech translation (StreamST): existing methods rely on sentence-level speech segments, which constrain policy decisions and degrade translation quality, and SimulST models struggle to learn effective policies given complex speech inputs and cross-lingual generation. The key to its solution is StreamUni, which achieves StreamST with a unified Large Speech-Language Model (LSLM): a speech Chain-of-Thought (CoT) guides the LSLM to produce multi-stage outputs that simultaneously accomplish speech segmentation, policy decision, and translation generation, without massive policy-specific training. The paper also proposes a streaming CoT training method that uses limited CoT data to improve low-latency policy decisions and generation.

Link: https://arxiv.org/abs/2507.07803
Authors: Shoutao Guo,Xiang Li,Shaolei Zhang,Mengge Liu,Wei Chen,Yang Feng
Affiliations: Li Auto
Categories: Computation and Language (cs.CL)
Comments: The code is at this https URL; the model is at this https URL

Abstract:Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.

[NLP-18] When Large Language Models Meet Law: Dual-Lens Taxonomy Technical Advances and Ethical Governance

【Quick Read】: This paper addresses the systemic challenges of applying large language models (LLMs) to the legal domain, including dynamically capturing legal semantics, unifying evidence reasoning, improving task generalization, and core issues in text processing, knowledge integration, and evaluation rigor. The key to its solution is an innovative dual-lens taxonomy that integrates legal reasoning frameworks with professional ontologies to systematically unify historical research and recent breakthroughs, alongside technical advances such as sparse attention mechanisms and mixture-of-experts architectures for task generalization, reasoning formalization, and workflow integration. The paper further proposes a taxonomy that maps legal roles to NLP subtasks, computationally implements the Toulmin argumentation framework, and thereby systematizes progress in reasoning, retrieval, prediction, and dispute resolution.

Link: https://arxiv.org/abs/2507.07748
Authors: Peizhang Shao,Linrui Xu,Jinxi Wang,Wei Zhou,Xingyu Wu
Affiliations: China University of Political Science and Law; Zhejiang University of Finance and Economics Dongfang College; Chung-Ang University; The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLM introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: this https URL.

[NLP-19] Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review

【Quick Read】: This paper addresses how code-switching (CS) is handled in end-to-end automatic speech recognition (ASR) models. The key contribution is a systematic literature review covering the languages considered, datasets, evaluation metrics, model choices, and performance, together with a discussion of the challenges of handling code-switching in end-to-end ASR. By summarizing current research efforts and available resources, the study aims to guide future work in the area.

Link: https://arxiv.org/abs/2507.07741
Authors: Maha Tufail Agro,Atharva Kulkarni,Karima Kadaoui,Zeerak Talat,Hanan Aldarmaki
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Motivated by a growing research interest into automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer reviewed venues. We document the languages considered, datasets, metrics, model choices, and performance, and present a discussion of challenges in end-to-end ASR for code-switching. Our analysis thus provides insights on current research efforts and available resources as well as opportunities and gaps to guide future research.

[NLP-20] GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing

【Quick Read】: This paper addresses the safety vulnerabilities that jailbreak attacks expose in large language models (LLMs) and the limited dynamism and effectiveness of existing evaluation methods. The key to its solution is the GuardVal evaluation protocol, which dynamically generates and refines jailbreak prompts based on the defender LLM's state, providing a more accurate assessment of how LLMs handle safety-critical situations, together with a new optimization method that prevents stagnation during prompt refinement and keeps surfacing deeper weaknesses in the models.

Link: https://arxiv.org/abs/2507.07735
Authors: Peiyan Zhang,Haibo Jin,Liying Kang,Haohan Wang
Affiliations: University of Illinois at Urbana-Champaign; Hong Kong Polytechnic University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 24 pages

Abstract:Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM’s state, providing a more accurate assessment of defender LLMs’ capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.

[NLP-21] Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

【Quick Read】: This paper addresses a challenge in post-training alignment of large language models (LLMs): not all tokens contribute equally to model performance. The key to its solution is a selective alignment strategy that prioritizes high-impact tokens within preference pairs, using token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, the approach reduces computational overhead and improves alignment fidelity.

Link: https://arxiv.org/abs/2507.07725
Authors: Zhijin Dong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within preference pairs, leveraging token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, our approach reduces computational overhead and enhances alignment fidelity. We further explore the role of reference model quality, demonstrating that stronger reference models significantly improve token selection accuracy and overall optimization effectiveness. Comprehensive experiments on benchmarks such as Arena-Hard and MT-Bench validate the superiority of our Selective-DPO method over standard DPO and distillation-based baselines. Our findings highlight the importance of token-level optimization and reference model selection in advancing preference alignment for LLMs. The code is available at this https URL.
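
The token-selection step can be sketched directly. A minimal illustration, assuming per-token response log-probs are already computed and that "high-impact" means the largest absolute policy-vs-reference difference; k and the ranking rule are assumptions, since the abstract does not pin them down.

```python
import torch

def select_high_impact_tokens(policy_logps, ref_logps, k=4):
    """Rank response tokens by the absolute policy-vs-reference log-prob
    difference and keep the top-k; a DPO-style loss would then be
    computed only over the selected positions."""
    diff = (policy_logps - ref_logps).abs()
    topk = torch.topk(diff, k=min(k, diff.numel()))
    mask = torch.zeros_like(diff, dtype=torch.bool)
    mask[topk.indices] = True
    return mask

policy_logps = torch.randn(10)  # per-token log-probs of a response
ref_logps = torch.randn(10)     # same tokens under the reference model
mask = select_high_impact_tokens(policy_logps, ref_logps)
print(mask.nonzero().flatten())  # indices of the selected tokens
```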

[NLP-22] Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text” RECSYS2025

【Quick Read】: This paper examines the privacy risks of text embeddings: transmitting embeddings instead of raw text has traditionally been considered privacy-preserving, but recent methods such as Vec2Text show that controlled decoding can reconstruct the original text from black-box embeddings. The core of this work is to reproduce and verify the Vec2Text framework, assess its effectiveness and limitations, and explore defenses such as embedding quantization and noise addition. The key takeaway is that text embeddings can leak sensitive information under certain conditions, and that simple, effective defenses can reduce the risk.

Link: https://arxiv.org/abs/2507.07700
Authors: Dominykas Seputis,Yongkang Li,Karsten Langerak,Serghei Mihailov
Affiliations: University of Amsterdam
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: This paper has been accepted for oral presentation in the reproducibility track at RecSys 2025

Abstract:Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution. Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems.
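
Both defenses the study evaluates are simple to express. Below is a minimal sketch of Gaussian noising and uniform per-vector quantization of an embedding; the noise scale and bit width are illustrative values, not the paper's settings.

```python
import numpy as np

def add_gaussian_noise(emb: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Perturb an embedding before transmission."""
    return emb + np.random.normal(0.0, sigma, size=emb.shape)

def quantize(emb: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Uniformly quantize each dimension into 2**n_bits levels, then map
    back to the original range (what an attacker would receive)."""
    lo, hi = emb.min(), emb.max()
    levels = 2 ** n_bits - 1
    q = np.round((emb - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

emb = np.random.randn(768)
print(np.abs(quantize(emb) - emb).max())      # small reconstruction error
print(np.abs(add_gaussian_noise(emb) - emb).max())
```

Quantization also shrinks the payload, which matches the paper's observation that it is the simpler and more widely applicable of the two defenses.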

[NLP-23] KeyKnowledgeRAG (K2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

【Quick Read】: This paper addresses the heavy resource cost of fine-tuning as a way to expand the knowledge of large language models (LLMs), and the scalability and answer-accuracy limits of naive retrieval-augmented generation (RAG). The key to its solution is the KeyKnowledgeRAG (K2RAG) framework, which integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency, and adds a preprocessing step that summarizes the training data to substantially reduce training time, yielding a more efficient, accurate, and scalable knowledge-augmented generation pipeline.

Link: https://arxiv.org/abs/2507.07695
Authors: Hruday Markondapatnaikuni,Basem Suleiman,Abdelkarim Erradi,Shijing Chen
Affiliations: University of Sydney; University of New South Wales; Qatar University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, 14 figures

Abstract:Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.
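
One ingredient, hybrid dense-plus-sparse retrieval, can be sketched as score fusion. A minimal illustration follows, assuming min-max normalization and a linear mix with weight alpha; K2RAG's actual fusion rule is not specified in the abstract, so treat this as a generic hybrid-retrieval pattern.

```python
def fuse_scores(dense, sparse, alpha=0.5):
    """Min-max normalize dense (e.g., cosine) and sparse (e.g., BM25)
    scores per query, then mix linearly and rank."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    d, s = norm(dense), norm(sparse)
    docs = set(d) | set(s)
    return sorted(
        ((doc, alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0))
         for doc in docs),
        key=lambda pair: -pair[1],
    )

dense = {"doc1": 0.82, "doc2": 0.40, "doc3": 0.77}   # cosine similarities
sparse = {"doc2": 11.2, "doc3": 9.8, "doc4": 4.1}    # BM25 scores
print(fuse_scores(dense, sparse)[:3])
```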

[NLP-24] SAS: Simulated Attention Score

【Quick Read】: This paper addresses the balance between attention-mechanism performance and model parameter growth in Transformer architectures. The key to its solution is the Simulated Attention Score (SAS), which projects low-dimensional head representations into a higher-dimensional space, simulating more attention heads and larger hidden feature dimensions per head without increasing the parameter count, thereby boosting attention capacity and model expressiveness.

Link: https://arxiv.org/abs/2507.07694
Authors: Chuanyang Zheng,Jiankai Sun,Yihang Gao,Yuehao Wang,Peihao Wang,Jing Xiong,Liliang Ren,Hao Cheng,Janardhan Kulkarni,Yelong Shen,Atlas Wang,Mac Schwager,Anderson Schneider,Xiaodong Liu,Jianfeng Gao
Affiliations: Morgan Stanley; Stanford; Microsoft Research; NUS; UT Austin; HKU
Categories: Computation and Language (cs.CL)
Comments: Tech Report

Abstract:The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
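
The up-projection trick is compact to sketch. Below is a minimal, hypothetical version in which a shared linear map lifts each head's queries and keys to a higher "simulated" dimension before the attention scores are formed; the dimensions and the use of a single shared projection are assumptions, not SAS's exact architecture.

```python
import torch
import torch.nn as nn

class SimulatedHeads(nn.Module):
    """Project low-dimensional head representations into a larger space,
    so attention is computed as if heads had a bigger hidden size."""
    def __init__(self, head_dim=16, sim_dim=64):
        super().__init__()
        self.up = nn.Linear(head_dim, sim_dim, bias=False)  # shared up-projection

    def forward(self, q, k):
        # q, k: (batch, heads, seq, head_dim)
        q_hi, k_hi = self.up(q), self.up(k)
        scale = q_hi.size(-1) ** 0.5
        return torch.softmax(q_hi @ k_hi.transpose(-2, -1) / scale, dim=-1)

m = SimulatedHeads()
q = torch.randn(2, 4, 10, 16)
k = torch.randn(2, 4, 10, 16)
print(m(q, k).shape)  # (2, 4, 10, 10) attention maps
```

The parameter cost is a single head_dim x sim_dim matrix, which is the kind of minimal overhead the abstract argues for (and which its PEAA component is designed to control further).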

[NLP-25] An Automated Length-Aware Quality Metric for Summarization

【Quick Read】: This paper addresses the evaluation of summarization quality, particularly the recall-compression tradeoff: how to compress summary length while retaining semantic content. The key to its solution is NOIR (NOrmed Index of Retention), a quantitative objective metric that combines semantic retention with summary-length compression, using language-model embeddings to measure semantic similarity, thus providing an automated assessment that does not rely on human-generated reference summaries.

Link: https://arxiv.org/abs/2507.07653
Authors: Andrew D. Foland
Affiliations: Sonnetiq
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper proposes NOrmed Index of Retention (NOIR), a quantitative objective metric for evaluating summarization quality of arbitrary texts that relies on both the retention of semantic meaning and the summary length compression. This gives a measure of how well the recall-compression tradeoff is managed, the most important skill in summarization. Experiments demonstrate that NOIR effectively captures the token-length / semantic retention tradeoff of a summarizer and correlates to human perception of sumarization quality. Using a language model-embedding to measure semantic similarity, it provides an automated alternative for assessing summarization quality without relying on time-consuming human-generated reference summaries. The proposed metric can be applied to various summarization tasks, offering an automated tool for evaluating and improving summarization algorithms, summarization prompts, and synthetically-generated summaries.
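
A hypothetical reconstruction of a NOIR-like score follows, assuming retention is the cosine similarity of document and summary embeddings and that it is combined multiplicatively with a compression term; the paper's exact formula is not given in the abstract, so the combination rule here is an assumption throughout.

```python
import numpy as np

def noir_like(doc_emb: np.ndarray, summary_emb: np.ndarray,
              doc_tokens: int, summary_tokens: int) -> float:
    """Illustrative retention-vs-compression score: cosine similarity of
    embeddings (semantic retention) times the fraction of length saved.
    NOTE: assumed combination rule, not the published NOIR formula."""
    cos = float(doc_emb @ summary_emb /
                (np.linalg.norm(doc_emb) * np.linalg.norm(summary_emb)))
    compression = 1.0 - summary_tokens / doc_tokens
    return cos * compression

doc_emb, sum_emb = np.random.rand(384), np.random.rand(384)
print(noir_like(doc_emb, sum_emb, doc_tokens=500, summary_tokens=60))
```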

[NLP-26] Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

【Quick Read】: This paper addresses the challenge that Phonetic Cloaking Replacement (PCR) poses for Chinese content moderation, where users hide harmful intent behind homophonic or near-homophonic variants. The key contributions are a four-way surface-form taxonomy of PCR and a realistic dataset, \ours, of 500 naturally occurring phonetically cloaked offensive posts collected from the RedNote platform. The study also finds that a Pinyin-based prompting strategy can substantially improve detection under certain conditions, offering a lightweight mitigation for robust toxicity detection.

Link: https://arxiv.org/abs/2507.07640
Authors: Haotan Guo,Jianfei He,Jiayuan Ma,Hongbin Na,Zimu Wang,Haiyang Zhang,Qi Chen,Wei Wang,Zijing Shi,Tao Shen,Ling Chen
Affiliations: School of Computer Science, The University of Sydney; Business School, The Hong Kong University of Science and Technology; Australian AI Institute, University of Technology Sydney; School of Advanced Technology, Xi'an Jiaotong-Liverpool University
Categories: Computation and Language (cs.CL)
Comments: In progress

Abstract:Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile \ours, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors’ limits, and a lightweight mitigation technique that advances research on robust toxicity detection.
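
The Pinyin-based prompting strategy the paper revisits is lightweight to reproduce. A sketch follows, assuming the pypinyin package (pip install pypinyin) for romanization; the prompt wording is illustrative, not the paper's template. Surfacing the romanization lets the model match homophones that differ in surface characters.

```python
from pypinyin import lazy_pinyin  # tone-free romanization

def with_pinyin(text: str) -> str:
    """Build a detection prompt that shows both the raw text and its
    Pinyin, so phonetically cloaked variants collapse to the same form."""
    pinyin = " ".join(lazy_pinyin(text))
    return (f"Text: {text}\n"
            f"Pinyin: {pinyin}\n"
            f"Is this text offensive? Answer yes or no.")

print(with_pinyin("你好"))  # Pinyin line reads: ni hao
```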

[NLP-27] FrugalRAG: Learning to retrieve and reason for multi-hop QA

【Quick Read】: This paper addresses answering complex questions over large unstructured document corpora, focusing on making retrieval-augmented generation (RAG) both effective and efficient. Contrary to recent claims, it shows that large-scale fine-tuning is not necessary to improve RAG metrics: a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. From the perspective of frugality (reducing the number of retrieval searches at inference time), supervised and RL-based fine-tuning can make RAG markedly more resource-efficient while remaining competitive.

Link: https://arxiv.org/abs/2507.07634
Authors: Abhinav Java,Srivathsan Koundinyan,Nagarajan Natarajan,Amit Sharma
Affiliations: Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: Accepted at ICML Workshop: Efficient Systems for Foundation Models

Abstract:We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).

[NLP-28] Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

【Quick Read】: This paper addresses the excessive computational demands of deploying large language models (LLMs) in resource-constrained environments. The key to its solution is compressing LLMs with knowledge distillation (KD) while preserving strong performance on question answering (QA). Evaluating student models distilled from the Pythia and Qwen2.5 families on the SQuAD and MLQA benchmarks, the study finds that students retain over 90% of their teachers' performance with up to 57.1% fewer parameters, and that one-shot prompting yields further gains over zero-shot. KD combined with minimal prompting thus balances model efficiency against task performance.

Link: https://arxiv.org/abs/2507.07630
Authors: Joyeeta Datta,Niclas Doll,Qusai Ramadan,Zeyd Boukhers
Affiliations: Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Germany; University of Southern Denmark, Denmark; Fraunhofer Institute for Applied Information Technology FIT, Germany; University Hospital of Cologne, Germany
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication at the 26th Meeting of the Special Interest Group on Discourse and Dialogue

Abstract:Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks, however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models’ performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for resource-constrained applications.
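
The distillation setup can be summarized by the standard KD objective. A minimal sketch of Hinton-style distillation (temperature-softened KL to the teacher plus cross-entropy to the labels); the temperature and mixing weight are generic defaults, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """KD objective: soft targets from the teacher (scaled by T^2 to keep
    gradient magnitudes comparable) blended with hard-label cross-entropy."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(4, 100)   # batch of 4, 100-way output
teacher = torch.randn(4, 100)
labels = torch.randint(0, 100, (4,))
print(distillation_loss(student, teacher, labels))
```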
zh
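
【代码示意】: 作为摘要的背景补充,下面给出知识蒸馏中最常见的软标签损失(Hinton 式 KD)的最小示意。论文未公开具体训练配置,温度 T 与权重 alpha 均为假设超参数,仅用于说明 KD 的一般形式:

```python
import torch.nn.functional as F

# 示意性草图:软标签 KL 项 + 硬标签交叉熵项的加权组合。
def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                              # 乘以 T^2 以保持梯度尺度
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```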

[NLP-29] SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

【速读】: 该论文试图解决当前多模态大语言模型(MLLMs)在空间可视化任务中的评估不足问题,特别是现有评估方法依赖于可能与训练数据重叠的IQ测试或数学竞赛,导致评估结果不可靠。解决方案的关键是引入SpatialViz-Bench,这是一个包含12项任务、覆盖4种子能力的综合性多模态基准,共包含1,180个自动生成的问题,能够有效评估模型的空间可视化能力,并揭示了当前先进MLLMs在该领域仍存在显著缺陷。

链接: https://arxiv.org/abs/2507.07610
作者: Siting Wang,Luoyang Sun,Cheng Deng,Kun Shao,Minnan Pei,Zheng Tian,Haifeng Zhang,Jun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.
zh

[NLP-30] Bayesian Discrete Diffusion Beats Autoregressive Perplexity

【速读】: 该论文试图解决离散扩散语言模型在生成过程中缺乏精确后验估计的问题,从而提升模型对干净标记的不确定性建模能力。其解决方案的关键在于揭示了前向掩码分布下去噪器期望输出与干净标记后验分布之间的理论联系,并通过蒙特卡洛边际化方法在最小假设下收敛至该后验分布,从而实现了无需额外训练成本的后验感知的token概率和不确定性估计。

链接: https://arxiv.org/abs/2507.07586
作者: Cooper Doyle
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 2 figures, 2 tables

点击查看摘要

Abstract:We reveal a hidden Bayesian core of discrete-diffusion language models by showing that the expected denoiser output under the forward masking distribution recovers the exact posterior over clean tokens. Under minimal assumptions, Monte Carlo marginalization over K independent corruptions converges to this posterior at rate O(1/sqrt(K)), yielding a simple proof of consistency and finite-sample error bounds. Building on this insight, we introduce a lightweight inference-time ensemble that averages K mask-and-denoise passes to obtain posterior-aware token probabilities and uncertainty estimates at no extra training cost. On WikiText-2, our method achieves test perplexity 8.8 with K=8, versus 20.3 for GPT-2 Small, despite using a model of comparable size. Code is available at this https URL.
zh
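
【代码示意】: 摘要的核心操作可以用几行代码说明:对同一序列做 K 次独立的"掩码-去噪",再对去噪器输出的分布取平均,即得到干净 token 后验的蒙特卡洛估计(误差约以 O(1/sqrt(K)) 收敛)。以下为示意草图,denoiser 与 mask_tokens 为假设接口:

```python
import torch

# 示意性草图:K 次 mask-and-denoise 的推理期集成,输出后验概率与逐位置熵。
@torch.no_grad()
def posterior_estimate(tokens, denoiser, mask_tokens, K=8):
    probs = []
    for _ in range(K):
        corrupted = mask_tokens(tokens)        # 前向掩码:随机遮住部分 token
        logits = denoiser(corrupted)           # 去噪器对每个位置预测干净 token
        probs.append(logits.softmax(dim=-1))
    post = torch.stack(probs).mean(dim=0)      # 蒙特卡洛边际化,近似真实后验
    entropy = -(post * post.clamp_min(1e-9).log()).sum(dim=-1)  # 不确定性估计
    return post, entropy
```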

[NLP-31] Improving Clustering on Occupational Text Data through Dimensionality Reduction

【速读】: 该论文试图解决如何为基于美国的O*NET职业数据库中的职业定义建立一个有效的映射机制,以便在不同组织和国家间扩展已收集的职业数据。解决方案的关键在于提出了一种结合多种基于BERT的技术与聚类方法的流程,并通过维度约简技术优化聚类性能,最终利用一种专门的轮廓系数方法提升结果质量,从而实现职业的自动区分与映射。

链接: https://arxiv.org/abs/2507.07582
作者: Iago Xabier Vázquez García,Damla Partanaz,Emrullah Fatih Yetkin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Preprint, 10 figures

点击查看摘要

Abstract:In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary for different firms and countries. Hence, if one wants to expand the data that is already collected in O*NET for the occupations defined with different tasks, a map between the definitions will be a vital requirement. We proposed a pipeline using several BERT-based techniques with various clustering approaches to obtain such a map. We also examined the effect of dimensionality reduction approaches on several metrics used to measure the performance of clustering algorithms. Finally, we improved our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish the occupations automatically, creating new paths for people wanting to change their careers.
zh
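
【代码示意】: 按摘要描述的流程,一个"句向量 → 降维 → 聚类 → 轮廓系数评估"的最小示意如下。模型名、维度与簇数均为示例性假设;论文实际比较了多种 BERT 变体、降维与聚类组合,并使用了一种专门的轮廓系数方法:

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 虚构的职业任务描述,仅作演示
texts = [
    "Install and repair electrical wiring",
    "Maintain industrial electrical equipment",
    "Prepare monthly financial reports",
    "Audit corporate accounting records",
    "Diagnose and treat animal illnesses",
    "Perform routine veterinary checkups",
]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

emb_low = PCA(n_components=2).fit_transform(emb)      # 降维缓解高维距离失真
labels = KMeans(n_clusters=3, n_init=10).fit_predict(emb_low)
print(silhouette_score(emb_low, labels))              # 越接近 1,簇结构越清晰
```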

[NLP-32] COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

【速读】: 该论文旨在解决现有基于上下文感知低秩逼近的方法在神经网络压缩和微调中因依赖显式Gram矩阵计算及其逆运算而导致的数值不稳定性问题。其解决方案的关键在于提出一种完全基于稳定分解的无逆正则化框架,从而克服了先前方法的数值缺陷,并能够处理诸如校准矩阵超出GPU内存容量、输入激活矩阵接近奇异以及数据不足导致无法唯一逼近等挑战场景。

链接: https://arxiv.org/abs/2507.07580
作者: Uliana Parkina,Maxim Rakhuba
机构: HSE University (高等经济大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Recent studies suggest that context-aware low-rank approximation is a useful tool for compression and fine-tuning of modern large-scale neural networks. In this type of approximation, a norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas involving explicit Gram matrix computation and their subsequent inversion. We demonstrate that this can degrade the approximation quality or cause numerically singular matrices. To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle possible challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.
zh
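
【代码示意】: 下面用一个极简示例说明"无逆、基于稳定分解"的思路:在输入激活加权范数 ||(W - What)X||_F 下求 rank-r 近似时,经典公式需要构造并求逆 Gram 矩阵 XX^T,而改用 X^T 的 QR 分解加一次 SVD 即可完全避开求逆。这只是体现该思想的示意草图,并非 COALA 的完整算法(例如未含其正则化与超内存场景的处理):

```python
import numpy as np

def context_aware_lowrank(W, X, r):
    # W: (d_out, d_in) 权重;X: (d_in, n) 校准输入激活
    Q, R = np.linalg.qr(X.T)                    # X^T = QR,全程不碰 (XX^T)^{-1}
    U, _, _ = np.linalg.svd(W @ R.T, full_matrices=False)
    Ur = U[:, :r]
    return Ur, Ur.T @ W                         # W_hat = Ur @ (Ur^T W),秩为 r

W = np.random.randn(64, 128)
X = np.random.randn(128, 512)
A, B = context_aware_lowrank(W, X, r=16)
err = np.linalg.norm((W - A @ B) @ X) / np.linalg.norm(W @ X)
print(f"relative weighted error: {err:.3f}")    # 在加权范数下度量近似质量
```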

[NLP-33] Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation ACL2025

【速读】: 该论文试图解决文档图像机器翻译(Document Image Machine Translation, DIMT)在有限训练数据和视觉与文本信息复杂交互下的泛化挑战。其解决方案的关键在于提出M4Doc,一种基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的单模态到多模态对齐框架。M4Doc通过将仅图像编码器与MLLM的多模态表示进行对齐,在训练过程中使轻量级DIMT模型学习关键的视觉-文本关联,而在推理阶段则绕过MLLM以保持计算效率,同时受益于其多模态知识。

链接: https://arxiv.org/abs/2507.07572
作者: Yupu Liang,Yaping Zhang,Zhiyang Zhang,Yang Zhao,Lu Xiang,Chengqing Zong,Yu Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACL 2025 Main

点击查看摘要

Abstract:Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
zh

[NLP-34] The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

【速读】: 该论文试图解决后训练技术(如长链式思维监督微调(long-CoT SFT)和强化学习(RL))在多模态视觉-语言模型(VLMs)中联合效果不确定的问题。研究发现,SFT通过深入的结构化推理提升复杂问题的性能,但导致冗长并降低简单问题的表现;而RL则促进泛化与简洁性,在所有难度水平上均带来稳定提升,但在最困难的问题上提升效果不如SFT显著。然而,将两者结合的多种策略(如两阶段、交错或渐进训练)未能产生叠加优势,反而引发准确率、推理风格和响应长度之间的权衡。该研究揭示了“协同困境”,强调需要更无缝和自适应的方法以充分发挥组合后训练技术在推理VLMs中的潜力。

链接: https://arxiv.org/abs/2507.07562
作者: Jierun Chen,Tiezheng Yu,Haoli Bai,Lewei Yao,Jiannan Wu,Kaican Li,Fei Mi,Chaofan Tao,Lei Zhu,Manyi Zhang,Xiaohui Li,Lu Hou,Lifeng Shang,Qun Liu
机构: Huawei Technologies(华为技术); HKUST(香港科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fail to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
zh

[NLP-35] The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

【速读】: 该论文试图解决跨语言检索增强生成(Cross-lingual Retrieval-Augmented Generation, RAG)在领域特定场景下的性能瓶颈问题,特别是在用户查询语言与支持文档语言不一致时导致的显著性能下降。其解决方案的关键在于识别出检索失败的主要原因在于检索器在跨语言文档排序上的困难,并提出一种简单的检索策略,通过强制从两种语言中均等检索,从而显著提升跨语言和整体性能。

链接: https://arxiv.org/abs/2507.07543
作者: Chen Amiraz,Yaroslav Fyodorov,Elad Haramaty,Zohar Karnin,Liane Lewin-Eytan
机构: Technology Innovation Institute (技术创新研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever’s difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
zh
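
【代码示意】: 论文提出的缓解策略本身很简单:不让检索器跨语言比较分数,而是强制从两种语言中各取同样数量的文档。以下为示意草图,retriever.search(query, lang, top_k) 为假设接口,两语结果的合并顺序属于设计选择:

```python
# 示意性草图:阿拉伯语与英语各检索 k//2 篇,规避跨语言排序偏差。
def bilingual_retrieve(query: str, retriever, k: int = 10):
    ar_hits = retriever.search(query, lang="ar", top_k=k // 2)
    en_hits = retriever.search(query, lang="en", top_k=k // 2)
    # 每种语言内部仍按相关性排序;此处直接拼接,把跨语证据一并交给生成器
    return ar_hits + en_hits
```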

[NLP-36] CEA-LIST at CheckThat! 2025: Evaluating LLM s as Detectors of Bias and Opinion in Text

【速读】: 该论文旨在解决多语言主观性检测(multilingual subjectivity detection)问题,特别是在标注数据稀缺或质量较低的情况下。其解决方案的关键在于利用大型语言模型(LLMs)结合少量示例提示(few-shot prompting)的方法,通过精心设计的提示工程实现对多种语言的主观性检测任务的有效处理。研究结果显示,该方法在多个语言任务中表现优异,甚至优于微调的小型语言模型(SLMs),表明基于LLM的少样本学习在多语言情感任务中具有高效性和适应性。

链接: https://arxiv.org/abs/2507.07539
作者: Akram Elbouanani,Evan Dufraisse,Aboubacar Tuo,Adrian Popescu
机构: Université Paris-Saclay, CEA-List, F-91120, Palaiseau, France
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Notebook for the CheckThat! Lab at CLEF 2025

点击查看摘要

Abstract:This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.
zh
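
【代码示意】: 论文的结论之一是"精心设计的标准 few-shot 提示已足够有竞争力"。下面给出一个构造此类提示的最小示意,示例句与标签均为虚构,仅演示提示结构:

```python
# 示意性草图:few-shot 主观性判别提示(SUBJ=主观,OBJ=客观)。
FEW_SHOT = [
    ("The new policy was announced on Tuesday.", "OBJ"),
    ("This policy is a disastrous mistake.", "SUBJ"),
]

def build_prompt(sentence: str) -> str:
    lines = ["Classify each sentence as SUBJ (opinionated) or OBJ (factual)."]
    for text, label in FEW_SHOT:
        lines.append(f"Sentence: {text}\nLabel: {label}")
    lines.append(f"Sentence: {sentence}\nLabel:")   # 留空待模型补全
    return "\n\n".join(lines)

print(build_prompt("The annotation quality of this dataset is questionable."))
```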

[NLP-37] Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems INTERSPEECH2025

【速读】: 该论文试图解决多轮对话中三元对话场景下的发言权转换(turn-taking)预测问题,传统研究主要集中在两人对话设置中。解决方案的关键在于应用语音活动投影(Voice Activity Projection, VAP)技术,通过仅利用声学数据预测每个说话者的未来语音活动,从而实现对三元对话中发言权转换的预测。

链接: https://arxiv.org/abs/2507.07518
作者: Mikey Elmers,Koji Inoue,Divesh Lala,Tatsuya Kawahara
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models, but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
zh

[NLP-38] Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

【速读】: 该论文旨在解决由于日益增长的压力导致的心理支持需求与相关数据集稀缺之间的矛盾,特别是在非英语语言中的数据不足问题。其解决方案的关键在于提出一种框架,该框架利用有限的真实数据和专家知识对两个大型语言模型——对话生成器(Dialog Generator)和对话修改器(Dialog Modifier)进行微调。生成器基于预定义路径生成大规模心理辅导对话,而修改器则用于优化对话以符合真实数据的质量标准,最终构建了中文心理支持对话数据集(Chinese Psychological support Dialogue Dataset, CPsDD),并引入了综合代理对话支持系统(Comprehensive Agent Dialogue Support System, CADSS)以实现高效的心理支持任务。

链接: https://arxiv.org/abs/2507.07509
作者: Yuanchen Shi,Longyin Zhang,Fang Kong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non-English languages. To address this, we propose a framework that leverages limited real-world data and expert knowledge to fine-tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large-scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real-world data quality. Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state-of-the-art performance on both CPsDD and ESConv datasets.
zh

[NLP-39] Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models AAAI-26

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂计算和自主代理任务时的局限性问题,特别是其在面对超出一定复杂度的任务时无法正确执行或验证任务准确性的现象。解决方案的关键在于从计算复杂性的角度分析LLMs的推理过程,证明LLMs在处理超过特定复杂度的任务时存在能力瓶颈,并进一步指出它们无法有效验证任务的准确性。

链接: https://arxiv.org/abs/2507.07505
作者: Varin Sikka,Vishal Sikka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages; to be submitted to AAAI-26 after reviews

点击查看摘要

Abstract:With widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLMs capabilities, specifically so-called hallucinations, occurrences in which LLMs provide spurious, factually incorrect or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs - that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.
zh

[NLP-40] Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

【速读】: 该论文试图解决从大量科学文献中提取氧气还原反应(ORR)催化剂相关结构化信息的挑战,这一过程因文本数据的复杂性和多样性而显得尤为困难。解决方案的关键在于采用命名实体识别(NER)与关系抽取(RE)方法,结合多种预训练BERT模型(如MatSciBERT和PubMedBERT)进行模型微调,以提高信息提取的准确性。通过构建包含12个关键实体及两种关系类型的综合性数据集,并评估不同BERT变体对提取性能的影响,最终证明领域特定的BERT模型在ORR催化剂提取任务中优于通用科学模型。

链接: https://arxiv.org/abs/2507.07499
作者: Hein Htet,Amgad Ahmed Ali Ibrahim,Yutaka Sasaki,Ryoji Asahi
机构: 未知
类目: Computation and Language (cs.CL); Data Analysis, Statistics and Probability (physics.data-an)
备注: 28 pages, 12 figures, 6 tables

点击查看摘要

Abstract:The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.
zh

[NLP-41] Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code

【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)在推理能力上的不足,特别是在面对简单任务时过度依赖复杂数据结构和算法,导致过拟合于算法模式而非核心推理结构的问题。解决方案的关键在于提出TeaR方法,该方法通过精心设计的数据整理和强化学习机制,引导模型在与代码相关的任务中发现最优的推理路径,从而提升模型的通用推理能力。

链接: https://arxiv.org/abs/2507.07498
作者: Keqin Bao,Nuo Chen,Xiaoyuan Li,Binyuan Hui,Bowen Yu,Fuli Feng,Junyang Lin,Xiangnan He,Dayiheng Liu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Enhancing reasoning capabilities remains a central focus in the LLM research community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.
zh

[NLP-42] PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

【速读】: 该论文试图解决如何在后训练阶段提升小型开放源代码大语言模型(LLM)性能的问题,特别是在复杂推理任务中的表现。其解决方案的关键在于提出PLAN-TUNING框架,该框架通过从大规模LLM中提炼合成的任务分解(称为“规划轨迹”),并利用监督学习和强化学习目标对小型模型进行微调,以模仿这些规划过程,从而增强模型的复杂推理能力。

链接: https://arxiv.org/abs/2507.07495
作者: Mihir Parmar,Palash Goyal,Xin Liu,Yiwen Song,Mingyang Ling,Chitta Baral,Hamid Palangi,Tomas Pfister
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 Pages

点击查看摘要

Abstract:Recently, decomposing complex problems into simple subtasks–a crucial part of human-like natural planning–to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed "planning trajectories") from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average ~10% and ~12% performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.
zh

[NLP-43] Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

【速读】: 该论文试图解决生成式人工智能(Generative AI)在运行过程中出现的失真或非真实性问题,即机器谎言(machine bullshit)现象。其核心问题是理解并量化大型语言模型(LLM)在生成内容时对真实性的漠视,并揭示其背后的机制。解决方案的关键在于提出“机器谎言指数”(Bullshit Index)这一新度量标准,用于量化LLM对真理的 indifference,并构建一个互补的分类法,分析四种定性形式的机器谎言:空洞修辞、模棱两可、含糊其辞和未经验证的陈述。通过在多个数据集上的实证评估,研究揭示了模型微调与人类反馈强化学习(RLHF)及推理时思维链提示(CoT)对特定类型机器谎言的显著影响。

链接: https://arxiv.org/abs/2507.07484
作者: Kaiqu Liang,Haimin Hu,Xuandong Zhao,Dawn Song,Thomas L. Griffiths,Jaime Fernández Fisac
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page, code & data: this https URL

点击查看摘要

Abstract:Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit, and that inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.
zh

[NLP-44] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

【速读】: 该论文试图解决大型语言模型在强化学习(Reinforcement Learning, RL)过程中存在的训练不稳定和策略逐渐偏离预训练权重的问题。其解决方案的关键在于提出一种两阶段框架——RLEP(Reinforcement Learning with Experience RePlay),该框架首先收集经过验证的轨迹,然后在后续训练中对其进行重放。在每个更新步骤中,策略在混合新生成的 rollout 与重放的成功示例的迷你批次上进行优化,通过重放高质量示例引导模型远离无益探索,聚焦于有前景的推理路径,从而实现更快的收敛和更强的最终性能。

链接: https://arxiv.org/abs/2507.07451
作者: Hongzhi Zhang,Jia Fu,Jingyuan Zhang,Kai Fu,Qi Wang,Fuzheng Zhang,Guorui Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注: this https URL

点击查看摘要

Abstract:Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at this https URL to facilitate reproducibility and further research.
zh
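
【代码示意】: RLEP 第二阶段的关键动作是构造"混合 mini-batch":把新采样的 rollout 与经验池中已验证成功的轨迹按比例拼接后再做策略更新。以下为示意草图,replay_ratio 等取值为假设:

```python
import random

# 示意性草图:按 replay_ratio 混合新 rollout 与回放的成功轨迹。
def build_minibatch(new_rollouts, replay_pool, batch_size=32, replay_ratio=0.25):
    n_replay = int(batch_size * replay_ratio)
    replayed = random.sample(replay_pool, min(n_replay, len(replay_pool)))
    fresh = new_rollouts[: batch_size - len(replayed)]
    batch = fresh + replayed
    random.shuffle(batch)
    return batch            # 交给 PPO/GRPO 等策略优化步骤
```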

[NLP-45] SAND: Boosting LLM Agents with Self-Taught Action Deliberation

【速读】: 该论文试图解决大型语言模型(Large Language Model, LLM)代理在经过监督微调或偏好优化后,由于有限的动作空间探索而可能过度依赖看似合理但次优动作的问题。解决方案的关键在于提出Self-taught ActioN Deliberation (SAND)框架,该框架使LLM代理在决定采取某个动作之前能够对候选动作进行显式推理和权衡。为应对大规模动作空间和步骤级动作评估带来的挑战,SAND引入了自一致性动作采样和执行引导的动作批判机制,以利用LLM代理的基础模型合成步骤级的动作推理过程,并通过迭代方式使用推理轨迹对LLM代理进行微调。

链接: https://arxiv.org/abs/2507.07441
作者: Yu Xia,Yiran Jenny Shen,Junda Wu,Tong Yu,Sungchul Kim,Ryan A. Rossi,Lina Yao,Julian McAuley
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning over and comparing alternative actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
zh

[NLP-46] Towards Interpretable Time Series Foundation Models ICML

【速读】: 该论文试图解决如何将时间序列推理能力压缩到小型指令调优语言模型中,以构建可解释的时间序列基础模型。其解决方案的关键在于利用一个具有系统性变化趋势和噪声水平的均值回归时间序列合成数据集,通过大型多模态模型生成自然语言注释,并以此监督紧凑型Qwen模型的微调,从而实现对时间序列特征(如趋势方向、噪声强度和极值定位)的有效推理能力迁移。

链接: https://arxiv.org/abs/2507.07439
作者: Matthieu Boileau,Philippe Helluy,Jeremy Pawlus,Svitlana Vyetrenko
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: International Conference on Machine Leaning (ICML) 2025 Workshop on Foundation Models for Structured Data

点击查看摘要

Abstract:In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning - focusing on trend direction, noise intensity, and extremum localization - and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.
zh

[NLP-47] SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

【速读】: 该论文试图解决从电子健康记录(EHR)中提取与驱逐(eviction)相关的社会决定因素健康(SDoH)信息的难题,因为驱逐在结构化字段中很少被编码,限制了后续应用。解决方案的关键是提出SynthEHR-Eviction管道,该管道结合了大语言模型(LLM)、人机协同标注和自动化提示优化(APO),有效提取临床笔记中的驱逐状态,并构建了目前最大的公开驱逐相关SDoH数据集。

链接: https://arxiv.org/abs/2507.07421
作者: Zonghai Yao,Youxia Zhao,Avijit Mitra,David A. Levy,Emily Druhl,Jack Tsai,Hong Yu
机构: Center for Healthcare Organization and Implementation Research, VA Bedford Health Care, MA, USA(医疗保健组织与实施研究中心,贝福德医疗中心,马萨诸塞州,美国); Manning College of Information and Computer Sciences, UMass Amherst, MA, USA(信息与计算机科学学院,马萨诸塞大学阿默斯特分校,马萨诸塞州,美国); Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA(医学系,马萨诸塞大学医学院,伍斯特,马萨诸塞州,美国); National Center on Homelessness among Veterans, VA Homeless Programs Office, Washington, DC, USA(退伍军人无家可归中心,退伍军人无家可归项目办公室,华盛顿特区,美国); School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA(公共卫生学院,德克萨斯大学休斯顿健康科学中心,休斯顿,德克萨斯州,美国); Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA(精神病学系,耶鲁大学医学院,纽黑文,康涅狄格州,美国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Equal contribution for the first two authors

点击查看摘要

Abstract:Eviction is a significant yet understudied social determinants of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
zh

[NLP-48] MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning

【速读】: 该论文试图解决医疗领域中生成式AI(Generative AI)在与人类有效沟通时面临的挑战,即如何在保持语义完整性的同时实现输出内容的个性化和可理解性。解决方案的关键在于提出MedReadCtrl,这是一个可读性控制的指令微调框架,使大语言模型(LLM)能够在不牺牲医学意图的前提下调整输出复杂度,从而提升患者对医疗信息的理解和使用效果。

链接: https://arxiv.org/abs/2507.07419
作者: Hieu Tran,Zonghai Yao,Won Seok Jang,Sharmin Sultana,Allen Chang,Yuan Zhang,Hong Yu
机构: Center for Healthcare Organization and Implementation Research, VA Bedford Health Care, MA, USA(医疗保健组织与实施研究中心,贝德福德医疗中心,马萨诸塞州,美国); Miner School of Computer and Information Sciences, UMass Lowell, MA, USA(计算机与信息科学学院,马萨诸塞大学洛厄尔分校,马萨诸塞州,美国); Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA(医学系,马萨诸塞大学医学院,伍斯特,马萨诸塞州,美国); School of Nursing, Zuckerberg College of Health Sciences, UMass Lowell, MA, USA(护理学院,祖克伯格健康科学学院,马萨诸塞大学洛厄尔分校,马萨诸塞州,美国); Manning College of Information and Computer Sciences, UMass Amherst, MA, USA(信息与计算机科学学院,马萨诸塞大学阿默斯特分校,马萨诸塞州,美国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Equal contribution for the first two authors. arXiv admin note: text overlap with arXiv:2406.09205

点击查看摘要

Abstract:Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations on nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p < 0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl’s ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.
zh

[NLP-49] May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks

【速读】: 该论文试图解决生成式 AI (Generative AI) 在白盒设置下对提示注入攻击(prompt injection attacks)的防御机制的鲁棒性问题。其解决方案的关键在于通过构造基于优化的强攻击方法,验证现有防御机制是否能够有效抵御此类攻击。研究者提出了一种基于注意力机制的新攻击算法,并将其应用于两种最新的白盒防御系统 SecAlign 和 StruQ,结果表明这些防御在增加少量攻击者预算的情况下,攻击成功率可达70%,从而揭示了现有防御方案在安全属性上的不足。

链接: https://arxiv.org/abs/2507.07417
作者: Nishit V. Pandya,Andrey Labunets,Sicun Gao,Earlence Fernandes
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present with data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at this https URL
zh

[NLP-50] GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

【速读】: 该论文旨在解决深度学习在处理长文本时面临的时间、成本和能效问题,尤其是传统Transformer模型由于其与输入长度呈二次复杂度的关系,导致在处理长文档时效率低下。解决方案的关键在于提出一种结合图神经网络(GNN)和卷积神经网络(CNN)的新模型架构,并集成实时端到端图生成机制,通过字符级输入的紧凑批次处理避免了填充或截断操作,同时利用高效字典查找整合大型语言模型(LLM)的信息,如词嵌入和情感极性,以提升性能并保持高速与高效。

链接: https://arxiv.org/abs/2507.07414
作者: Fardin Rastakhiz
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time, cost, and energy efficiency are critical considerations in Deep-Learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length ranging between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news-categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model’s efficiency and competitive performance.
zh
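
【代码示意】: 摘要报告生成图的平均聚类系数约 0.45、平均最短路径 4~5,这是典型的小世界结构。下面用 networkx 的小世界图直观复现这两个统计量;参数 n/k/p 为假设取值,并非论文实际的建图方式(论文是在字符级格状图上叠加小世界边):

```python
import networkx as nx

# 示意性草图:生成连通的 Watts-Strogatz 小世界图并测量两个结构指标。
G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.05, seed=0)
print("avg clustering:", nx.average_clustering(G))               # 高聚类系数
print("avg shortest path:", nx.average_shortest_path_length(G))  # 短平均路径
```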

[NLP-51] Bradley-Terry and Multi-Objective Reward Modeling Are Complementary

【速读】: 该论文试图解决强化学习中基于人类反馈(RLHF)框架下奖励模型易受奖励黑客攻击的问题,特别是在分布外(OOD)设置中,现有方法表现不佳。其解决方案的关键在于提出一种统一的奖励建模框架,通过共享嵌入空间联合训练Bradley-Terry(BT)单目标和基于回归的多目标奖励函数,理论分析表明BT损失与回归目标具有互补优势,从而提升奖励模型在挑战性OOD场景下的鲁棒性和评分性能。

链接: https://arxiv.org/abs/2507.07375
作者: Zhiwei Zhang,Hui Liu,Xiaomin Li,Zhenwei Dai,Jingying Zeng,Fali Wang,Minhua Lin,Ramraj Chandradevan,Zhen Li,Chen Luo,Xianfeng Tang,Qi He,Suhang Wang
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Amazon (亚马逊); Havard University (哈佛大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley–Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function’s ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline. Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
zh
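
【代码示意】: 摘要所述框架可概括为:在共享嵌入之上,一只头输出标量偏好分并用 Bradley-Terry 成对损失 -log σ(r_chosen - r_rejected) 训练,另一只头对多属性得分做回归,两者联合优化。以下为示意草图,维度、属性数与权重 lam 均为假设:

```python
import torch.nn as nn
import torch.nn.functional as F

class UnifiedRewardModel(nn.Module):
    def __init__(self, hidden=768, n_attrs=5):
        super().__init__()
        self.bt_head = nn.Linear(hidden, 1)         # 标量偏好得分
        self.reg_head = nn.Linear(hidden, n_attrs)  # 细粒度多属性得分

    def forward(self, emb):                         # emb: 共享骨干输出的句向量
        return self.bt_head(emb).squeeze(-1), self.reg_head(emb)

def joint_loss(model, emb_c, emb_r, attr_c, attr_r, lam=0.5):
    s_c, a_c = model(emb_c)                         # chosen 回复
    s_r, a_r = model(emb_r)                         # rejected 回复
    bt = -F.logsigmoid(s_c - s_r).mean()            # Bradley-Terry 成对损失
    reg = F.mse_loss(a_c, attr_c) + F.mse_loss(a_r, attr_r)
    return bt + lam * reg                           # 两个目标共享同一嵌入空间
```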

[NLP-52] Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

【速读】: 该论文试图解决健康虚假信息对抗性言论生成中证据有限且输出控制不足的问题。其解决方案的关键在于提出一种多智能体检索增强框架,通过整合多个大型语言模型来优化知识检索、证据增强和回应精炼,同时结合静态与动态证据,确保生成的对抗性言论具有相关性、依据充分且内容更新及时。

链接: https://arxiv.org/abs/2507.07307
作者: Anirban Saha Anik,Xiaoying Song,Elliott Wang,Bryan Wang,Bengisu Yarimbas,Lingzi Hong
机构: Department of Data Science, University of North Texas (数据科学系,北德克萨斯大学); Texas Academy of Mathematics and Science, University of North Texas (数学与科学学院,北德克萨斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.
zh

[NLP-53] ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

【速读】: 该论文旨在解决传统基于大语言模型(Large Language Model, LLM)的翻译代理在处理多模态输入时的局限性,即其通常仅支持文本输入,无法有效利用视觉和上下文背景信息。解决方案的关键在于提出ViDove系统,该系统模仿人类译者的流程,整合多模态记忆系统与结合领域知识的长短时记忆模块,从而提升翻译的准确性和适应性。此外,论文还引入了DoveBench基准测试集,以推动长篇自动视频字幕生成与翻译的研究。

链接: https://arxiv.org/abs/2507.07306
作者: Yichen Lu,Wei Dai,Jiaen Liu,Ching Wing Kwok,Zongheng Wu,Xudong Xiao,Ao Sun,Sheng Fu,Jianyuan Zhan,Yian Wang,Takatomo Saito,Sicheng Lai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available here: this https URL
zh

[NLP-54] The Impact of Background Speech on Interruption Detection in Collaborative Groups

【速读】: 该论文试图解决在多组协同学习环境中中断检测的问题,特别是在存在重叠语音的情况下如何准确识别中断。传统方法主要针对单对话环境设计,而实际课堂场景中存在多个并发对话,导致重叠语音普遍,因此需要新的方法来应对这一挑战。解决方案的关键在于开发一种对重叠语音具有鲁棒性的中断识别方法,从而能够在真实课堂环境中有效部署,并揭示中断在协作群体互动中的语言和韵律特征。

链接: https://arxiv.org/abs/2507.07280
作者: Mariah Bradford,Nikhil Krishnaswamy,Nathaniel Blanchard
机构: 未知
类目: Computation and Language (cs.CL)
备注: Long Paper AIED 2025

点击查看摘要

Abstract:Interruption plays a crucial role in collaborative learning, shaping group interactions and influencing knowledge construction. AI-driven support can assist teachers in monitoring these interactions. However, most previous work on interruption detection and interpretation has been conducted in single-conversation environments with relatively clean audio. AI agents deployed in classrooms for collaborative learning within small groups will need to contend with multiple concurrent conversations – in this context, overlapping speech will be ubiquitous, and interruptions will need to be identified in other ways. In this work, we analyze interruption detection in single-conversation and multi-group dialogue settings. We then create a state-of-the-art method for interruption identification that is robust to overlapping speech, and thus could be deployed in classrooms. Further, our work highlights meaningful linguistic and prosodic information about how interruptions manifest in collaborative group interactions. Our investigation also paves the way for future works to account for the influence of overlapping speech from multiple groups when tracking group dialog.
zh

[NLP-55] LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

【速读】: 该论文试图解决大型多模态模型(Large Multimodal Models, LMMs)在语言覆盖范围上的局限性,导致跨语言输出存在偏见和不公平的问题。其解决方案的关键是引入LinguaMark,这是一个用于评估最先进的LMMs在多语言视觉问答(Multilingual Visual Question Answering, VQA)任务中的基准数据集,通过涵盖11种语言和五种社会属性的6,875对图像-文本数据,以及使用Bias、Answer Relevancy和Faithfulness三个关键指标进行模型评估,以促进对多语言能力的系统性研究。

链接: https://arxiv.org/abs/2507.07274
作者: Ananya Raval,Aravind Narayanan,Vahid Reza Khazaie,Shaina Raza
机构: Vector Institute for AI (Vector Institute for AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ASONAM’25

点击查看摘要

Abstract:Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.
zh

[NLP-56] Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery ICML2025

【速读】: 该论文试图解决科学研究所需任务的自动化问题,旨在通过多智能体系统实现无需人工干预的全流程自动化。解决方案的关键在于构建了一个由约30个大型语言模型(Large Language Model, LLM)代理组成的系统,采用规划控制策略协调代理工作流,每个代理专注于不同的任务(如科学论文和代码库的检索、代码编写、结果解释、对其他代理输出的批判性评估),并能够本地执行代码。该系统在宇宙学领域的博士级别任务中表现出色,展现出优于现有最先进LLM的性能。

链接: https://arxiv.org/abs/2507.07257
作者: Licong Xu,Milind Sarkar,Anto I. Lonappan,Íñigo Zubeldia,Pablo Villanueva-Domingo,Santiago Casas,Christian Fidler,Chetana Amancharla,Ujjwal Tiwari,Adrian Bayer,Chadi Ait Ekiou,Miles Cranmer,Adrian Dimitrov,James Fergusson,Kahaan Gandhi,Sven Krippendorf,Andrew Laverick,Julien Lesgourgues,Antony Lewis,Thomas Meier,Blake Sherwin,Kristen Surrao,Francisco Villaescusa-Navarro,Chi Wang,Xueqing Xu,Boris Bolliet
机构: 未知
类目: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Accepted contribution to the ICML 2025 Workshop on Machine Learning for Astrophysics. Code: this https URL Videos: this https URL HuggingFace: this https URL Cloud: this https URL

点击查看摘要

Abstract:We present a multi-agent system for automation of scientific research tasks, cmbagent. The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.
zh

[NLP-57] A Language-Driven Framework for Improving Personalized Recommendations: Merging LLMs with Traditional Algorithms

【速读】: 该论文试图解决传统推荐算法在处理基于文本的用户偏好信息(如“我喜欢轻松幽默的喜剧”)时缺乏个性化推荐能力的问题。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)来增强电影推荐系统,通过细化传统算法的输出并整合基于语言的用户偏好输入,从而提升推荐的准确性和个性化程度。

链接: https://arxiv.org/abs/2507.07251
作者: Aaron Goldstein,Ayan Dutta
机构: University of North Florida(北佛罗里达大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Traditional recommendation algorithms are not designed to provide personalized recommendations based on user preferences provided through text, e.g., “I enjoy light-hearted comedies with a lot of humor”. Large Language Models (LLMs) have emerged as one of the most promising tools for natural language processing in recent years. This research proposes a novel framework that mimics how a close friend would recommend items based on their knowledge of an individual’s tastes. We leverage LLMs to enhance movie recommendation systems by refining traditional algorithm outputs and integrating them with language-based user preference inputs. We employ Singular Value Decomposition (SVD) or SVD++ algorithms to generate initial movie recommendations, implemented using the Surprise Python library and trained on the MovieLens-Latest-Small dataset. We compare the performance of the base algorithms with our LLM-enhanced versions using leave-one-out validation hit rates and cumulative hit rates. Additionally, to compare the performance of our framework against the current state-of-the-art recommendation systems, we use rating and ranking metrics with an item-based stratified 0.75 train, 0.25 test split. Our framework can generate preference profiles automatically based on users’ favorite movies or allow manual preference specification for more personalized results. Using an automated approach, our framework overwhelmingly surpassed SVD and SVD++ on every evaluation metric used (e.g., improvements of up to ~6x in cumulative hit rate, ~3.7x in NDCG, etc.), albeit at the cost of a slight increase in computational overhead.
zh
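
【代码示意】: 论文的基线部分使用 Surprise 库的 SVD/SVD++。下面是一个可运行的最小基线示例(以内置的 ml-100k 代替论文的 MovieLens-Latest-Small;LLM 依据文本化偏好做精排的一步在此仅以注释示意):

```python
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin("ml-100k")            # 首次运行会提示下载数据集
trainset, testset = train_test_split(data, test_size=0.25)

algo = SVD()
algo.fit(trainset)
accuracy.rmse(algo.test(testset))                 # 评估基线预测误差

# 取每个用户的 Top-N 候选后,可连同 "我喜欢轻松幽默的喜剧" 这类
# 文本化偏好一并交给 LLM 重新排序——即论文框架中 LLM 增强的部分。
print(algo.predict(uid="196", iid="302").est)     # 预测某用户对某电影的评分
```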

[NLP-58] Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings

【速读】: 该论文试图解决医疗领域中大型语言模型(Large Language Models, LLMs)的安全性评估不足的问题,特别是在患者和临床医生等不同用户角色下的安全风险。现有研究多集中于通用安全基准,而缺乏针对医疗场景的系统性安全评估框架。解决方案的关键在于提出一个针对医疗领域的安全评估协议,从患者、临床医生及一般用户三个视角进行红队测试(red-teaming),并构建了PatientSafetyBench数据集,包含466个样本,覆盖5个关键类别,以量化评估医疗LLMs的安全性,从而为医疗领域的安全部署奠定基础。

链接: https://arxiv.org/abs/2507.07248
作者: Minseon Kim,Jean-Philippe Corbeil,Alessandro Sordoni,Francois Beaulieu,Paul Vozila
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g. patients and clinicians, and the potential for model outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks. In this paper, we introduce a safety evaluation protocol tailored to the medical domain from both patient and clinician user perspectives, alongside general safety assessments, and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building the PatientSafetyBench containing 466 samples over 5 critical categories to measure safety from the perspective of the patient. We apply our red-teaming protocols on the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view - patient, clinician, and general user - establishing a foundation for safer deployment in medical domains.
zh

[NLP-59] An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在不同输入下行为不一致的问题,这种不一致性表明了模型的不确定性,而在高风险场景中需要对其量化。解决方案的关键在于利用模型多样性,通过聚合多个LLMs的输出来获得更可靠的不确定性估计。论文提出MUSE(Multi-LLM Uncertainty via Subset Ensembles),这是一种基于信息论的方法,使用Jensen-Shannon散度识别并聚合校准良好的LLMs子集,从而提升校准性和预测性能。

链接: https://arxiv.org/abs/2507.07236
作者: Maya Kruse,Majid Afshar,Saksham Khatwani,Anoop Mayampurath,Guanhua Chen,Yanjun Gao
机构: University of Colorado Anschutz Medical Campus (科罗拉多大学安舒茨医学校区); University of Colorado Boulder (科罗拉多大学博尔德分校); University of Wisconsin Madison (威斯康星大学麦迪逊分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines.
zh
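The core information-theoretic step is easy to illustrate. The toy sketch below scores each model by its mean pairwise Jensen-Shannon divergence against the others, keeps the most mutually consistent half, and averages their distributions; the selection rule and the example probabilities are simplifying assumptions, not MUSE's exact algorithm.

```python
# Toy sketch of JSD-based subset selection and ensembling.
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(p, q):
    # SciPy returns the JS *distance* (square root of the divergence).
    return jensenshannon(p, q, base=2) ** 2

# Binary-task probabilities from four hypothetical LLMs for one input.
preds = np.array([[0.80, 0.20], [0.75, 0.25], [0.90, 0.10], [0.40, 0.60]])

n = len(preds)
disagreement = [
    np.mean([js_divergence(preds[i], preds[j]) for j in range(n) if j != i])
    for i in range(n)
]

subset = np.argsort(disagreement)[: n // 2]   # most consistent half
ensemble = preds[subset].mean(axis=0)
print("selected models:", subset, "ensemble distribution:", ensemble)
```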

[NLP-60] SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

【速读】: 该论文试图解决合成文本在高风险领域(如医疗和法律)中应用时的评估问题,旨在通过全面评估合成文本的流畅性、下游系统的效用、公平性、隐私泄露风险、分布差异以及领域专家的定性反馈,提升其可行性和隐私保护能力。解决方案的关键在于提出SynthTextEval工具包,该工具包能够对用户上传或通过其生成模块创建的合成数据进行多维度的一致性评估,并通过整合和标准化评估指标来提高合成文本的可靠性与实用性。

链接: https://arxiv.org/abs/2507.07229
作者: Krithika Ramesh,Daniel Smolyak,Zihao Zhao,Nupoor Gandhi,Ritu Agarwal,Margrét Bjarnadóttir,Anjalie Field
机构: Johns Hopkins University (约翰霍普金斯大学); University of Maryland, College Park (马里兰大学学院公园分校); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled, consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit's generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text, and in turn, privacy preservation in AI development.
zh

[NLP-61] Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在社会科学研究调查中作为人类受试者代理的可靠性问题,特别是其对已知响应偏差的敏感性。解决方案的关键在于通过应用一系列11种扰动来测试九种不同的LLMs在世界价值观调查(World Values Survey, WVS)问题上的响应鲁棒性,从而揭示LLMs在面对问题表述和选项结构变化时的脆弱性,以及所有模型均表现出的一致的近期偏差(recency bias)。研究还表明,尽管较大模型通常更具鲁棒性,但所有模型仍对语义变化如改写和组合扰动敏感,强调了在使用LLMs生成合成调查数据时进行提示设计和鲁棒性测试的重要性。

链接: https://arxiv.org/abs/2507.07188
作者: Jens Rupprecht(1),Georg Ahnert(1),Markus Strohmaier(1 and 2 and 3) ((1) University of Mannheim, (2) GESIS - Leibniz Institute for the Social Sciences, (3) Complexity Science Hub)
机构: University of Mannheim (曼海姆大学); GESIS - Leibniz Institute for the Social Sciences (德国社会科学研究中心); Complexity Science Hub (复杂性科学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 18 pages, 17 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts – we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also reveal that all tested models exhibit a consistent recency bias varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
zh
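To make the perturbation idea concrete, here is a small illustrative sketch of two option-structure perturbations on a WVS-style item; the question text and labels are invented for illustration. Reversing the option order while holding the wording fixed is exactly the kind of probe that exposes a recency bias.

```python
# Illustrative option-structure perturbations for a survey item.
import random

def reverse_options(options):
    return list(reversed(options))

def shuffle_options(options, seed=0):
    options = list(options)
    random.Random(seed).shuffle(options)
    return options

question = "How important is family in your life?"
options = ["Very important", "Rather important",
           "Not very important", "Not at all important"]

for variant in (options, reverse_options(options), shuffle_options(options)):
    prompt = question + "\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip("ABCD", variant))
    print(prompt, end="\n\n")
```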

[NLP-62] Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中认知偏差的来源问题,即这些偏差是源于预训练阶段、微调阶段,还是训练中的随机噪声。其解决方案的关键在于提出一种两步因果实验方法:首先通过多次使用不同随机种子进行微调,研究训练随机性对30多种认知偏差的影响;其次引入跨微调(cross-tuning)方法,通过在模型间交换指令数据集来隔离偏差来源,从而直接测试偏差是否依赖于数据集。该方法揭示了预训练阶段对模型偏差模式的主导作用,为理解微调模型中的偏差提供了新的视角。

链接: https://arxiv.org/abs/2507.07186
作者: Itay Itzhak,Yonatan Belinkov,Gabriel Stanovsky
机构: Technion – Israel Institute of Technology (以色列理工学院); The Hebrew University of Jerusalem (希伯来大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CoLM 2025

点击查看摘要

Abstract:Large language models (LLMs) exhibit cognitive biases – systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning – swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.
zh

[NLP-63] Robust Multimodal Large Language Models Against Modality Conflict ICML2025

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在现实场景中容易产生幻觉的问题,其核心在于分析由模态冲突(modality conflict)引发的幻觉现象。论文提出的关键解决方案是通过三种方法——基于提示工程、监督微调和强化学习——来缓解由模态冲突导致的幻觉问题,其中强化学习方法在减少幻觉方面表现最佳,而监督微调方法则展现出良好的稳定性和潜力。

链接: https://arxiv.org/abs/2507.07151
作者: Zongmeng Zhang,Wengang Zhou,Jie Zhao,Houqiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICML 2025

点击查看摘要

Abstract:Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.
zh

[NLP-64] Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)扩展过程中存在的资源密集且缺乏灵活性的问题,传统方法依赖于单一的端到端训练流程。其解决方案的关键在于构建一种基于不可训练、确定性输入嵌入的构造性模型开发方法,利用固定表征基础作为通用“对接端口”,从而实现无缝模块化组合和逐层增长的高效扩展范式。

链接: https://arxiv.org/abs/2507.07129
作者: A. Bochkov
机构: Moscow Institute of Physics and Technology (MIPT)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal “docking port,” enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is “grown” by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
zh
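The merging step itself is strikingly simple. Below is a minimal sketch of post-training composition by output-logit averaging, under the assumption that the specialist models share the same frozen embedding substrate and vocabulary; `expert_ru` and `expert_zh` are placeholders for the trained models.

```python
# Minimal sketch: compose specialists by averaging output logits.
import torch

@torch.no_grad()
def merged_logits(models, input_ids):
    # Each model maps token ids to logits over the shared vocabulary.
    return torch.stack([m(input_ids) for m in models]).mean(dim=0)

# next_id = merged_logits([expert_ru, expert_zh], ids)[:, -1, :].argmax(-1)
```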

[NLP-65] Multi-level Mixture of Experts for Multimodal Entity Linking KDD2025

【速读】: 该论文旨在解决多模态实体链接(Multimodal Entity Linking, MEL)中的两个关键问题:提及歧义(mention ambiguity),即由于提及文本上下文的简略和关键信息缺失导致的语义内容不足;以及动态选择模态内容(dynamic selection of modal content),即对不同模态信息部分的重要性进行动态区分。解决方案的关键在于提出一种多层级专家混合模型(Multi-level Mixture of Experts, MMoE),其核心组件包括:描述感知的提及增强模块、多模态特征提取模块、层内专家混合模块和层间专家混合模块,通过引入开关专家机制实现对相关信息区域特征的动态自适应选择。

链接: https://arxiv.org/abs/2507.07108
作者: Zhiwei Hu,Víctor Gutiérrez-Basulto,Zhiliang Xiang,Ru Li,Jeff Z. Pan
机构: Shanxi University(山西大学); Cardiff University(卡迪夫大学); University of Edinburgh(爱丁堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted at KDD 2025

点击查看摘要

Abstract:Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention’s textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention’s textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE’s code is available at: this https URL.
zh
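For readers unfamiliar with switch routing, the sketch below shows a generic top-1 switch mixture-of-experts layer of the kind the intra-/inter-level modules build on. This is a textbook formulation with linear experts, not MMoE's exact implementation.

```python
# Generic top-1 ("switch") mixture-of-experts layer.
import torch
import torch.nn as nn

class SwitchMoE(nn.Module):
    def __init__(self, dim, num_experts):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):                          # x: (batch, dim)
        gate = self.router(x).softmax(dim=-1)      # routing probabilities
        idx = gate.argmax(dim=-1)                  # one expert per input
        out = torch.stack([self.experts[int(i)](x[b]) for b, i in enumerate(idx)])
        return out * gate.gather(-1, idx.unsqueeze(-1))  # scale by gate prob
```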

计算机视觉

[CV-0] Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models

【速读】:该论文试图解决多模态模型(如CLIP和大型多模态模型LMMs)在组合泛化能力上的局限性,特别是当图像中出现常见物体与不常见物体的组合时,模型性能如何变化的问题。其解决方案的关键在于通过点互信息(PMI)衡量预训练数据集中词频共现统计,从而揭示概念组合对模型性能的影响。研究发现,PMI值与零样本准确率之间存在显著相关性,表明即使是对常见概念的识别也受到图像中概念组合的影响。这一发现为改进多模态模型的组合泛化能力提供了新的视角。

链接: https://arxiv.org/abs/2507.08000
作者: Helen Qu,Sang Michael Xie
机构: Flatiron Institute(弗拉蒂隆研究所); Stanford University(斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear – for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impacts CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at this https URL.
zh
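The paper's central statistic is standard and easy to reproduce. The sketch below estimates document-level PMI between two caption words, PMI(x, y) = log2 p(x, y) / (p(x) p(y)), on a toy corpus; real use would run over the full pretraining caption set.

```python
# Document-level PMI between caption words on a toy corpus.
import math
from collections import Counter

captions = [
    "a yellow submarine and a blue bus",
    "a blue bus on the street",
    "a yellow taxi near a bus",
]

word_counts, pair_counts = Counter(), Counter()
for cap in captions:
    words = set(cap.split())
    word_counts.update(words)
    pair_counts.update(frozenset((a, b)) for a in words for b in words if a < b)

def pmi(w1, w2, n=len(captions)):
    p_joint = pair_counts[frozenset((w1, w2))] / n
    return math.log2(p_joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

print(pmi("yellow", "submarine"))  # ~0.585: co-occur more often than chance
```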

[CV-1] MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization

【速读】:该论文旨在解决向量量化变分自编码器(VQ-VAEs)在重建质量上与变分自编码器(VAEs)之间存在的显著差距问题。其解决方案的关键在于提出一种名为MGVQ的新方法,通过增强离散代码本的表示能力,保留潜在维度以维持编码特征,并引入一组子代码本进行量化,从而简化代码本的优化过程并减少信息损失,最终提升重建质量。

链接: https://arxiv.org/abs/2507.07997
作者: Mingkai Jia,Wei Yin,Xiaotao Hu,Jiaxin Guo,Xiaoyang Guo,Qian Zhang,Xiao-Xiao Long,Ping Tan
机构: The Hong Kong University of Science and Technology (香港科技大学); Horizon Robotics (地平线机器人); The Chinese University of Hong Kong (香港中文大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality, however, there still exists a large gap between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. MGVQ achieves the state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared with SD-VAE, we outperform it on ImageNet significantly, with rFID 0.49 vs. 0.91, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at this https URL.
zh
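A schematic sketch of the multi-group idea follows: the latent vector is split along the channel dimension and each chunk is quantized by nearest neighbor against its own sub-codebook. Shapes and codebook sizes are illustrative assumptions; the straight-through estimator and training losses are omitted.

```python
# Schematic multi-group quantization with per-group sub-codebooks.
import torch

def multi_group_quantize(z, sub_codebooks):
    # z: (batch, dim); each sub-codebook: (codebook_size, dim // n_groups)
    groups = torch.chunk(z, len(sub_codebooks), dim=-1)
    quantized = []
    for g, book in zip(groups, sub_codebooks):
        nearest = torch.cdist(g, book).argmin(dim=-1)  # nearest code index
        quantized.append(book[nearest])
    return torch.cat(quantized, dim=-1)

z = torch.randn(4, 16)
books = [torch.randn(256, 4) for _ in range(4)]
print(multi_group_quantize(z, books).shape)  # torch.Size([4, 16])
```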

[CV-2] Single-pass Adaptive Image Tokenization for Minimum Program Search

【速读】:该论文试图解决视觉表示学习系统中固定长度表示无法适应输入复杂性或熟悉度差异的问题。传统方法通常采用固定长度的表示,而缺乏对数据复杂性的动态适应能力。解决方案的关键在于提出一种单次通过的自适应分词器KARL,该方法基于Kolmogorov Complexity(KC)原则,在一次前向传播中预测图像所需的适当标记数量,并在接近其近似KC时停止。这种方法通过token数量作为最小描述长度的代理,实现了对输入复杂性的自适应建模,同时保持了与最新自适应分词器相当的性能。

链接: https://arxiv.org/abs/2507.07995
作者: Shivam Duggal,Sanghyun Byun,William T. Freeman,Antonio Torralba,Phillip Isola
机构: Massachusetts Institute of Technology (麻省理工学院); LG Electronics (LG电子)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code at: this https URL Keywords: Representation Learning, Adaptive Tokenization, Compression, Algorithmic Information Theory, Kolmogorov Complexity, Upside-Down RL

点击查看摘要

Abstract:According to Algorithmic Information Theory (AIT) – Intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. KARL’s training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity – revealing alignment with human intuition.
zh

[CV-3] Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection ICCV2025

【速读】:该论文试图解决在少样本学习(few-shot learning)中,由于缺乏与查询数据来自同一分布的源数据而导致的关键点检测(keypoint detection)性能下降问题。其解决方案的关键在于利用草图(sketch)作为无需源数据的替代信息源,并通过原型框架(prototypical setup)、基于网格的定位器(grid-based locator)以及原型域适应(prototypical domain adaptation)来克服跨模态嵌入和用户特定草图风格带来的挑战。

链接: https://arxiv.org/abs/2507.07994
作者: Subhajit Maity,Ayan Kumar Bhunia,Subhadeep Koley,Pinaki Nath Chowdhury,Aneeshan Sain,Yi-Zhe Song
机构: University of Central Florida (佛罗里达中部大学); SketchX, CVSSP, University of Surrey (SketchX, CVSSP, 萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025. Project Page: this https URL

点击查看摘要

Abstract:Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
zh

[CV-4] Multigranular Evaluation for Brain Visual Decoding

【速读】:该论文试图解决现有脑视觉解码评估协议依赖粗粒度指标、缺乏神经科学基础以及无法捕捉精细视觉差异的问题。其解决方案的关键在于提出BASIC,一个统一的多粒度评估框架,该框架联合量化解码图像与真实图像之间的结构保真度、推断对齐性和上下文一致性,通过分层的基于分割的度量、结构化场景表示的提取等方法,提供更具有区分性、可解释性和全面性的评估基础。

链接: https://arxiv.org/abs/2507.07993
作者: Weihao Xia,Cengiz Oztireli
机构: University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
备注: Project: this https URL

点击查看摘要

Abstract:Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods.
zh

[CV-5] Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs ICCV2025 WWW

【速读】:该论文试图解决视频大语言模型(Video LLM)在处理大量时空令牌时面临的二次计算复杂度问题。解决方案的关键在于提出一种无需训练的时空令牌合并方法(STTM),其核心思想是利用视频数据中被先前工作忽视的局部空间和时间冗余性。STTM首先通过四叉树结构的粗到细搜索将每帧转换为多粒度空间令牌,然后在时间维度上进行定向成对合并,从而有效减少令牌数量并提升计算效率。

链接: https://arxiv.org/abs/2507.07990
作者: Jeongseok Hyun,Sukjun Hwang,Su Ho Han,Taeoh Kim,Inwoong Lee,Dongyoon Wee,Joon-Young Lee,Seon Joo Kim,Minho Shim
机构: Yonsei University (延世大学); Carnegie Mellon University (卡内基梅隆大学); NAVER Cloud (NAVER云); Adobe Research (Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ICCV2025; Project page: this https URL

点击查看摘要

Abstract:Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at this https URL.
zh
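The temporal half of the method can be sketched in a few lines: a token in frame t whose cosine similarity to some token in frame t-1 exceeds a threshold is dropped in favor of the earlier one. The quadtree spatial stage, the directed-merge bookkeeping, and the threshold value are omitted or assumed here.

```python
# Simplified directed temporal merge between consecutive frames.
import torch
import torch.nn.functional as F

def temporal_merge(prev_tokens, cur_tokens, threshold=0.9):
    # prev_tokens: (n_prev, dim); cur_tokens: (n_cur, dim)
    sim = F.normalize(cur_tokens, dim=-1) @ F.normalize(prev_tokens, dim=-1).T
    novel = sim.max(dim=-1).values < threshold   # keep only novel tokens
    return cur_tokens[novel]
```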

[CV-6] CLIP Won't Learn Object-Attribute Binding from Natural Data and Here is Why

【速读】:该论文试图解决对比视觉-语言模型(如CLIP)在学习对象属性绑定(binding)方面的局限性,即模型无法正确区分图像中不同对象的属性组合。解决方案的关键在于识别数据属性对CLIP学习绑定能力的影响,研究发现自然数据的常见特性(如低属性密度、不完整的描述和显著性偏差)对绑定性能有负面影响,而只有当数据表现出所识别的数据属性时,CLIP才能实现几乎完美的绑定。

链接: https://arxiv.org/abs/2507.07985
作者: Bijay Gurung,David T. Hoffmann,Thomas Brox
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as vision encoder for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of “a yellow submarine and a blue bus” or “a blue submarine and a yellow bus”. Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights to solve the binding problem for CLIP are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP’s ability to learn binding using a synthetic dataset. We find that common properties of natural data such as low attribute density, incomplete captions, and the saliency bias (a tendency of human captioners to describe the object that is “most salient” to them) have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.
zh

[CV-7] OST-Bench: Evaluating the Capabilities of MLLM s in Online Spatio-temporal Scene Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在在线时空理解任务中的性能不足问题,特别是在动态场景中进行持续观察与推理时的表现。其解决方案的关键在于构建OST-Bench基准,该基准从代理主动探索场景的角度出发,评估模型在在线设置下的时空理解能力,强调对逐步获取的观察进行处理与推理,并结合当前视觉输入与历史记忆以支持动态空间推理。通过这一基准,研究者能够更真实地反映现实世界中具身感知的挑战,并揭示MLLMs在复杂时空推理任务中的局限性。

链接: https://arxiv.org/abs/2507.07984
作者: JingLi Lin,Chenming Zhu,Runsen Xu,Xiaohan Mao,Xihui Liu,Tai Wang,Jiangmiao Pang
机构: Shanghai AI Laboratory(上海人工智能实验室); Shanghai Jiao Tong University(上海交通大学); The University of Hong Kong(香港大学); The Chinese University of Hong Kong(香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. Project Page: this https URL

点击查看摘要

Abstract:Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: this https URL
zh

[CV-8] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

【速读】:该论文试图解决视频扩散模型在仅基于原始视频数据训练时,难以捕捉有意义的几何感知结构的问题。解决方案的关键在于提出一种名为Geometry Forcing的方法,该方法通过将模型的中间表示与预训练几何基础模型的特征对齐,引导模型内部化潜在的3D表示。其核心思想是利用两个互补的对齐目标:Angular Alignment通过余弦相似性强制方向一致性,Scale Alignment通过从归一化的扩散表示回归未归一化的几何特征来保留尺度相关信息。

链接: https://arxiv.org/abs/2507.07982
作者: Haoyu Wu,Diankun Wu,Tianyu He,Junliang Guo,Yang Ye,Yueqi Duan,Jiang Bian
机构: Microsoft Research(微软研究院); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, project page: this https URL

点击查看摘要

Abstract:Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model’s intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: this https URL.
zh
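The two objectives translate directly into loss terms. The sketch below assumes the diffusion features have already been projected to the geometry model's feature space (the linear head and the feature size are assumptions); weighting and normalization details follow the paper.

```python
# Sketch of the two alignment losses in Geometry Forcing.
import torch
import torch.nn.functional as F

def angular_alignment(diff_feats, geo_feats):
    # Directional consistency: 1 minus mean cosine similarity.
    return 1 - F.cosine_similarity(diff_feats, geo_feats, dim=-1).mean()

def scale_alignment(diff_feats_normed, geo_feats, head):
    # Regress *unnormalized* geometric features from *normalized* diffusion
    # features so that scale information is preserved.
    return F.mse_loss(head(diff_feats_normed), geo_feats)

dim = 768                                   # illustrative feature size
head = torch.nn.Linear(dim, dim)            # assumed projection head
x, g = torch.randn(8, dim), torch.randn(8, dim)
loss = angular_alignment(x, g) + scale_alignment(F.normalize(x, dim=-1), g, head)
```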

[CV-9] Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions

【速读】:该论文试图解决生成真实感火星地貌视频的问题,这一任务面临的主要挑战包括高质量火星数据的稀缺性以及火星与地球图像之间的领域差距。解决方案的关键在于提出一个由两个核心组件组成的综合方案:一是多模态火星合成数据整理管道M3arsSynth,它从真实的立体导航图像重建三维火星环境并渲染高保真多视角3D视频序列;二是火星地形视频生成器MarsGen,它能够生成在视觉上逼真且几何上与数据中编码的三维结构一致的新视频。

链接: https://arxiv.org/abs/2507.07978
作者: Longfei Li,Zhiwen Fan,Wenyan Cong,Xinhang Liu,Yuyang Yin,Matt Foutter,Panwang Pan,Chenyu You,Yue Wang,Zhangyang Wang,Yao Zhao,Marco Pavone,Yunchao Wei
机构: BJTU; UT Austin; HKUST; Stanford University; XMU; SBU; USC; NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA’s Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
zh

[CV-10] Input Conditioned Layer Dropping in Speech Foundation Models

【速读】:该论文试图解决在边缘和物联网环境中,由于计算资源动态变化,需要具有自适应缩减策略的动态架构来优化基础语音模型的问题。其解决方案的关键在于提出一种输入驱动的层跳过(input-driven LD\mathcal{LD}),该方法利用网络的输入特征和一个轻量级的层选择网络来确定最优的处理层组合,从而在不显著修改神经架构的前提下实现高效的计算负载调整。

链接: https://arxiv.org/abs/2507.07954
作者: Abdul Hannan,Daniele Falavigna,Alessio Brutti
机构: 未知
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: Accepted at IEEE MLSP 2025

点击查看摘要

Abstract:Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping (LD), which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows transforming static models into dynamic ones. However, existing approaches exhibit limitations either in the mode of selecting layers or by significantly modifying the neural architecture. To this end, we propose input-driven LD, which employs the network’s input features and a lightweight layer selecting network to determine the optimum combination of processing layers. Extensive experimentation on 4 speech and audio public benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, thoroughly outperforming random dropping and producing on-par (or better) results relative to early exit.
zh
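A schematic sketch of the input-driven idea: a lightweight linear selector scores the backbone layers from pooled input features, and only the top-k layers are executed at inference. The selector architecture, the pooling, and the batch handling are assumptions, not the paper's exact design.

```python
# Schematic input-driven layer dropping with a lightweight selector.
import torch
import torch.nn as nn

class InputDrivenLD(nn.Module):
    def __init__(self, layers, feat_dim, keep_k):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.selector = nn.Linear(feat_dim, len(layers))  # lightweight gate
        self.keep_k = keep_k

    def forward(self, x):                 # x: (1, time, feat_dim), batch of 1
        scores = self.selector(x.mean(dim=1))[0]          # one score per layer
        keep = scores.topk(self.keep_k).indices.sort().values
        for i in keep.tolist():           # run the selected layers in order
            x = self.layers[i](x)
        return x
```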

[CV-11] TinierHAR: Towards Ultra-Lightweight Deep Learning Models for Efficient Human Activity Recognition on Edge Devices

【速读】:该论文旨在解决在资源受限的可穿戴设备上实现高效的人类活动识别(Human Activity Recognition, HAR)问题,其核心挑战是平衡模型的准确性与计算效率。解决方案的关键在于提出一种超轻量级深度学习架构TinierHAR,该架构通过融合残差深度可分离卷积、门控循环单元(Gated Recurrent Units, GRUs)和时间聚合技术,在不牺牲性能的前提下显著降低了模型参数量和乘加操作次数(MACs)。

链接: https://arxiv.org/abs/2507.07949
作者: Sizhen Bian,Mengxi Liu,Vitor Fortes Rey,Daniel Geissler,Paul Lukowicz
机构: DFKI, Kaiserslautern, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) on resource-constrained wearable devices demands inference models that harmonize accuracy with computational efficiency. This paper introduces TinierHAR, an ultra-lightweight deep learning architecture that synergizes residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to achieve SOTA efficiency without compromising performance. Evaluated across 14 public HAR datasets, TinierHAR reduces parameters by 2.7x (vs. TinyHAR) and 43.3x (vs. DeepConvLSTM), and MACs by 6.4x and 58.6x, respectively, while maintaining the averaged F1-scores. Beyond quantitative gains, this work provides the first systematic ablation study dissecting the contributions of spatial-temporal components across the proposed TinierHAR, the prior SOTA TinyHAR, and the classical DeepConvLSTM, offering actionable insights for designing efficient HAR systems. We finally discussed the findings and suggested principled design guidelines for future efficient HAR. To catalyze edge-HAR research, we open-source all materials in this work for future benchmarking (this https URL).
zh

[CV-12] Towards Continuous Home Cage Monitoring: An Evaluation of Tracking and Identification Strategies for Laboratory Mice

【速读】:该论文试图解决在高密度饲养环境下对实验小鼠进行个体识别的难题,这一问题由于小鼠外观相似、移动频繁及群体互动频繁而尤为突出。解决方案的关键在于开发一种实时识别算法,该算法结合了定制的多目标跟踪器(MouseTracks)、基于Transformer的ID分类器(Mouseformer)以及轨迹片段关联线性规划模块(MouseMap),从而实现对佩戴定制耳标的个体小鼠在数字家笼中的高效、准确识别与跟踪。

链接: https://arxiv.org/abs/2507.07929
作者: Juan Pablo Oberhauser,Daniel Grzenda
机构: TLR Ventures(TLR风投); University of Chicago(芝加哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Continuous, automated monitoring of laboratory mice enables more accurate data collection and improves animal welfare through real-time insights. Researchers can achieve a more dynamic and clinically relevant characterization of disease progression and therapeutic effects by integrating behavioral and physiological monitoring in the home cage. However, providing individual mouse metrics is difficult because of their housing density, similar appearances, high mobility, and frequent interactions. To address these challenges, we develop a real-time identification (ID) algorithm that accurately assigns ID predictions to mice wearing custom ear tags in digital home cages monitored by cameras. Our pipeline consists of three parts: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues from mice; (2) a transformer-based ID classifier (Mouseformer); and (3) a tracklet associator linear program to assign final ID predictions to tracklets (MouseMap). Our models assign an animal ID based on custom ear tags at 30 frames per second with 24/7 cage coverage. We show that our custom tracking and ID pipeline improves tracking efficiency and lowers ID switches across mouse strains and various environmental factors compared to current mouse tracking methods.
zh

[CV-13] ArteryX: Advancing Brain Artery Feature Extraction with Vessel-Fused Networks and a Robust Validation Framework

【速读】:该论文旨在解决现有三维时间飞跃磁共振血管成像(3D TOF MRA)在临床评估中对微小血管变化敏感性不足的问题,以及传统手动或自动化方法在提取动脉结构、几何和形态特征时存在的用户依赖性、学习曲线陡峭和缺乏标准化定量验证的挑战。其解决方案的关键在于提出一种名为ArteryX的半监督动脉评估框架,该框架基于MATLAB的工具箱,通过结合血管融合网络的地标定位方法,实现了对血管的可靠追踪与管理,有效解决了分支/断开血管的问题,并通过集成体内类似动脉仿真框架实现了定量特征的验证,从而提高了对细微血管变化的敏感性和评估效率。

链接: https://arxiv.org/abs/2507.07920
作者: Abrar Faiyaz,Nhat Hoang,Giovanni Schifitto,Md Nasir Uddin
机构: University of Rochester(罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 Pages, 8 Figures, Preliminary version of the toolbox was presented at the ISMRM 2025 Conference in Hawaii at the “Software Tools” Session

点击查看摘要

Abstract:Cerebrovascular pathology significantly contributes to cognitive decline and neurological disorders, underscoring the need for advanced tools to assess vascular integrity. Three-dimensional Time-of-Flight Magnetic Resonance Angiography (3D TOF MRA) is widely used to visualize cerebral vasculature, however, clinical evaluations generally focus on major arterial abnormalities, overlooking quantitative metrics critical for understanding subtle vascular changes. Existing methods for extracting structural, geometrical and morphological arterial features from MRA - whether manual or automated - face challenges including user-dependent variability, steep learning curves, and lack of standardized quantitative validations. We propose a novel semi-supervised artery evaluation framework, named ArteryX, a MATLAB-based toolbox that quantifies vascular features with high accuracy and efficiency, achieving processing times ~10-15 minutes per subject at 0.5 mm resolution with minimal user intervention. ArteryX employs a vessel-fused network based landmarking approach to reliably track and manage tracings, effectively addressing the issue of dangling/disconnected vessels. Validation on human subjects with cerebral small vessel disease demonstrated its improved sensitivity to subtle vascular changes and better performance than an existing semi-automated method. Importantly, the ArteryX toolbox enables quantitative feature validation by integrating an in-vivo like artery simulation framework utilizing vessel-fused graph nodes and predefined ground-truth features for specific artery types. Thus, the ArteryX framework holds promise for benchmarking feature extraction toolboxes and for seamless integration into clinical workflows, enabling early detection of cerebrovascular pathology and standardized comparisons across patient cohorts to advance understanding of vascular contributions to brain health.
zh

[CV-14] Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement

【速读】:该论文旨在解决远程光电容积描记术(rPPG)在实际部署环境中因隐私问题和实时适应性限制而难以应用的问题。其解决方案的关键在于提出一种全新的全测试时适应(TTA)策略,通过结合生理学先验知识与观察到的rPPG信号在频域中的时空一致性及时间域中的显著不一致性,设计了一个基于专家知识的自监督一致性整合(CiCi)框架,以增强模型在推理过程中的适应能力,并引入梯度动态控制机制以缓解先验之间的潜在冲突,从而实现稳定且高效的实时自监督适应。

链接: https://arxiv.org/abs/2507.07908
作者: Xiao Yang,Yuxuan Fan,Can Liu,Houcheng Su,Weichen Guo,Jiyao Wang,Dengbo He
机构: Hong Kong University of Science and Technology (Guangzhou); Sichuan Agricultural University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for monitoring physiological signals using the camera. Although various domain adaptation and generalization methods were proposed to promote the adaptability of deep-based rPPG models in unseen deployment environments, considerations in aspects like privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, we aim to propose a novel fully Test-Time Adaptation (TTA) strategy tailored for rPPG tasks in this work. Specifically, based on prior knowledge in physiology and our observations, we noticed that not only is there spatio-temporal consistency in the frequency domain of rPPG signals, but also significant inconsistency in the time domain. Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert knowledge-based self-supervised Consistency-inConsistency-integration (CiCi) framework to enhance model adaptation during inference. Besides, our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.
zh

[CV-15] Hardware-Aware Feature Extraction Quantisation for Real-Time Visual Odometry on FPGA Platforms

【速读】:该论文旨在解决在资源受限的嵌入式平台(如移动设备或FPGA系统)上实现高效且准确的视觉定位与建图(VSLAM)问题,特别是针对特征点检测与描述的计算需求过高这一挑战。解决方案的关键在于提出一种基于量化后的SuperPoint卷积神经网络的无监督架构,并将其部署在FPGA系统上,通过模型量化和硬件感知优化,显著降低了计算需求,同时保持了较高的特征点检测质量,从而实现了在640 x 480像素图像下高达54帧/秒的处理速度。

链接: https://arxiv.org/abs/2507.07903
作者: Mateusz Wasala,Mateusz Smolarczyk,Michal Danilowicz,Tomasz Kryjak
机构: AGH University of Krakow (AGH大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted for the DSD 2025 conference in Salerno, Italy

点击查看摘要

Abstract:Accurate position estimation is essential for modern navigation systems deployed in autonomous platforms, including ground vehicles, marine vessels, and aerial drones. In this context, Visual Simultaneous Localisation and Mapping (VSLAM) - which includes Visual Odometry - relies heavily on the reliable extraction of salient feature points from the visual input data. In this work, we propose an embedded implementation of an unsupervised architecture capable of detecting and describing feature points. It is based on a quantised SuperPoint convolutional neural network. Our objective is to minimise the computational demands of the model while preserving high detection quality, thus facilitating efficient deployment on platforms with limited resources, such as mobile or embedded systems. We implemented the solution on an FPGA System-on-Chip (SoC) platform, specifically the AMD/Xilinx Zynq UltraScale+, where we evaluated the performance of Deep Learning Processing Units (DPUs) and we also used the Brevitas library and the FINN framework to perform model quantisation and hardware-aware optimisation. This allowed us to process 640 x 480 pixel images at up to 54 fps on an FPGA platform, outperforming state-of-the-art solutions in the field. We conducted experiments on the TUM dataset to demonstrate and discuss the impact of different quantisation techniques on the accuracy and performance of the model in a visual odometry task.
zh

[CV-16] MIRA: A Novel Framework for Fusing Modalities in Medical RAG

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗诊断中生成事实不一致响应的问题,以及检索增强生成(Retrieval-Augmented Generation, RAG)方法中存在的检索不足或过度检索导致的准确性下降问题。解决方案的关键在于提出的多模态智能检索与增强框架(Multimodal Intelligent Retrieval and Augmentation, MIRA),其核心包括:(1)一个校准的再思考与重排模块,用于动态调整检索上下文数量以管理事实风险;(2)一个结合图像嵌入和医学知识库的医学RAG框架,配备查询重写模块以实现高效的多模态推理。

链接: https://arxiv.org/abs/2507.07902
作者: Jinhong Wang,Tajamul Ashraf,Zongyan Han,Jorma Laaksonen,Rao Mohammad Anwer
机构: MBZUAI(穆巴达拉人工智能大学); Aalto University(阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM Multimedia 2025

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLM. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) A medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation of publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at this https URL.
zh

[CV-17] Single-Step Latent Diffusion for Underwater Image Restoration

【速读】:该论文旨在解决水下图像恢复算法在处理具有复杂几何结构和显著深度变化的场景时存在的计算成本高和生成不真实伪影的问题。其解决方案的关键在于结合一种新型网络架构(SLURPP)与精确的合成数据生成流程,SLURPP通过融合预训练潜在扩散模型(具备对场景几何和深度的强先验知识)与显式场景分解,从而能够建模并补偿光衰减和后向散射的影响。

链接: https://arxiv.org/abs/2507.07878
作者: Jiayi Wu,Tianfu Wang,Md Abu Bakr Siddique,Md Jahidul Islam,Cornelia Fermuller,Yiannis Aloimonos,Christopher A. Metzler
机构: University of Maryland (马里兰大学); University of Florida (佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models – which encode strong priors on the geometry and depth of scenes – with an explicit scene decomposition – which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~ 3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website this https URL.
zh
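A physics-based synthesis pipeline of this kind typically builds on the standard underwater image-formation model, I = J · exp(-beta_a · d) + B · (1 - exp(-beta_s · d)), combining per-channel attenuation with backscatter. The sketch below applies it to an RGB-D pair; the coefficients are illustrative, not SLURPP's calibrated values.

```python
# Standard underwater image-formation model applied to an RGB-D pair.
import numpy as np

def underwater_degrade(J, depth, beta_atten, beta_scatter, backscatter):
    # J: (H, W, 3) clean image in [0, 1]; depth: (H, W) range in meters.
    d = depth[..., None]
    direct = J * np.exp(-beta_atten * d)                 # attenuated signal
    veil = backscatter * (1 - np.exp(-beta_scatter * d)) # backscatter term
    return direct + veil

J = np.random.rand(64, 64, 3)
depth = np.full((64, 64), 5.0)
out = underwater_degrade(
    J, depth,
    beta_atten=np.array([0.40, 0.10, 0.05]),    # red attenuates fastest
    beta_scatter=np.array([0.35, 0.08, 0.04]),
    backscatter=np.array([0.05, 0.25, 0.30]),   # blue-green veiling light
)
```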

[CV-18] THUNDER: Tile-level Histopathology image UNDERstanding benchmark

【速读】:该论文试图解决数字病理学领域中,由于大量并行提出的基础模型难以有效评估其性能和特性的问题。解决方案的关键在于引入THUNDER,这是一个面向数字病理学基础模型的tile级基准,能够高效地在多种数据集上对多个模型进行比较,并通过一系列下游任务研究其特征空间,同时评估预测的鲁棒性和不确定性。THUNDER提供了一个快速、易用且动态的平台,支持多种最先进的基础模型以及用户自定义模型的直接tile级比较。

链接: https://arxiv.org/abs/2507.07860
作者: Pierre Marza,Leo Fillioux,Sofiène Boutaj,Kunal Mahatha,Christian Desrosiers,Pablo Piantanida,Jose Dolz,Stergios Christodoulidis,Maria Vakalopoulou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at this https URL.
zh

[CV-19] MeD-3D: A Multimodal Deep Learning Framework for Precise Recurrence Prediction in Clear Cell Renal Cell Carcinoma (ccRCC)

【速读】:该论文试图解决透明细胞肾癌(clear cell renal cell carcinoma, ccRCC)复发预测的准确性问题,这一问题因疾病在分子、病理和临床方面的高度异质性而变得复杂。传统基于单一数据模态(如影像学、组织病理学或基因组学)的预后模型难以全面捕捉疾病复杂性,导致预测效果不佳。该研究的关键解决方案是提出一种深度学习(deep learning, DL)框架,整合包括CT、MRI、组织病理学全切片图像(whole slide images, WSI)、临床数据和基因组信息在内的多模态数据,通过领域特定模型提取各模态的深度特征嵌入,并采用早期和晚期融合架构进行信息整合,从而提升ccRCC复发预测的准确性。此外,该框架还具备处理临床环境中常见数据不完整问题的能力。

链接: https://arxiv.org/abs/2507.07839
作者: Hasaan Maqsood,Saif Ur Rehman Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate prediction of recurrence in clear cell renal cell carcinoma (ccRCC) remains a major clinical challenge due to the disease's complex molecular, pathological, and clinical heterogeneity. Traditional prognostic models, which rely on single data modalities such as radiology, histopathology, or genomics, often fail to capture the full spectrum of disease complexity, resulting in suboptimal predictive accuracy. This study aims to overcome these limitations by proposing a deep learning (DL) framework that integrates multimodal data, including CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic profiles, to improve the prediction of ccRCC recurrence and enhance clinical decision-making. The proposed framework utilizes a comprehensive dataset curated from multiple publicly available sources, including TCGA, TCIA, and CPTAC. To process the diverse modalities, domain-specific models are employed: CLAM, a ResNet50-based model, is used for histopathology WSIs, while MeD-3D, a pre-trained 3D-ResNet18 model, processes CT and MRI images. For structured clinical and genomic data, a multi-layer perceptron (MLP) is used. These models are designed to extract deep feature embeddings from each modality, which are then fused through an early and late integration architecture. This fusion strategy enables the model to combine complementary information from multiple sources. Additionally, the framework is designed to handle incomplete data, a common challenge in clinical settings, by enabling inference even when certain modalities are missing.
zh

[CV-20] 3D-ADAM: A Dataset for 3D Anomaly Detection in Advanced Manufacturing

【速读】:该论文旨在解决工业制造中表面缺陷检测的准确性与可靠性问题,尤其是在现实工业环境中,现有方法在处理复杂场景时表现不足。其解决方案的关键在于构建3D-ADAM,这是首个大规模、高精度的工业相关3D异常检测数据集,包含14,120个高分辨率扫描、27,346个标注缺陷实例以及8,110个机械元件特征标注,且数据采集环境真实,涵盖了零件位置与方向变化、相机定位、环境光照及部分遮挡等实际工业场景因素,从而为当前先进模型提供了更具挑战性的基准。

链接: https://arxiv.org/abs/2507.07838
作者: Paul McHard,Florent P. Audonnet,Oliver Summerell,Sebastian Andraos,Paul Henderson,Gerardo Aragon-Camarasa
机构: University of Glasgow(格拉斯哥大学); HAL Robotics Ltd.(HAL机器人有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surface defects are one of the largest contributors to low yield in the manufacturing sector. Accurate and reliable detection of defects during the manufacturing process is therefore of great value across the sector. State-of-the-art approaches to automated defect detection yield impressive performance on current datasets, yet still fall short in real-world manufacturing settings and developing improved methods relies on large datasets representative of real-world scenarios. Unfortunately, high-quality, high-precision RGB+3D industrial anomaly detection datasets are scarce, and typically do not reflect real-world industrial deployment scenarios. To address this, we introduce 3D-ADAM, the first large-scale industry-relevant dataset for high-precision 3D Anomaly Detection. 3D-ADAM comprises 14,120 high-resolution scans across 217 unique parts, captured using 4 industrial depth imaging sensors. It includes 27,346 annotated defect instances from 12 categories, covering the breadth of industrial surface defects. 3D-ADAM uniquely captures an additional 8,110 annotations of machine element features, spanning the range of relevant mechanical design form factors. Unlike existing datasets, 3D-ADAM is captured in a real industrial environment with variations in part position and orientation, camera positioning, ambient lighting conditions, as well as partial occlusions. Our evaluation of SOTA models across various RGB+3D anomaly detection tasks demonstrates the significant challenge this dataset presents to current approaches. We further validated the industrial relevance and quality of the dataset through an expert labelling survey conducted by industry partners. By providing this challenging benchmark, 3D-ADAM aims to accelerate the development of robust 3D Anomaly Detection models capable of meeting the demands of modern manufacturing environments.
zh

[CV-21] Rethinking Query-based Transformer for Continual Image Segmentation CVPR2025

【速读】:该论文试图解决类增量/持续图像分割(Class-incremental/Continual Image Segmentation, CIS)中由于分阶段训练导致的灾难性遗忘问题,以及现有方法在解耦掩码生成与持续学习过程中所面临的可塑性丧失和对输入数据顺序依赖过高的问题。其解决方案的关键在于深入分析了基于查询的Transformer模型中内置的物体感知能力,并提出SimCIS,通过直接选择图像特征进行查询分配,实现“完美对齐”以保持物体感知,同时允许查询选择新类别以提升可塑性。此外,引入跨阶段一致性约束和基于“视觉查询”的重放机制,进一步缓解类别遗忘问题。

链接: https://arxiv.org/abs/2507.07831
作者: Yuchen Zhu,Cheng Shi,Dingyou Wang,Jiajin Tang,Zhengxuan Wei,Yu Wu,Guanbin Li,Sibei Yang
机构: ShanghaiTech University (上海科技大学); Sun Yat-sen University (中山大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is accepted by CVPR 2025

点击查看摘要

Abstract:Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring “perfect alignment” to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative “visual query”-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at this https URL.
zh

[CV-22] Benchmarking Content-Based Puzzle Solvers on Corrupted Jigsaw Puzzles

【速读】:该论文试图解决内容感知拼图求解器在现实世界应用中的鲁棒性不足问题,特别是在处理碎片化文物或碎纸文档等实际场景时的挑战。解决方案的关键在于引入三种拼图损坏类型(缺失片段、边缘侵蚀和内容侵蚀),并通过评估启发式与基于深度学习的求解器来分析其对这些损坏的处理能力,进而提出通过增强数据进行微调以提升深度学习模型鲁棒性的方法。

链接: https://arxiv.org/abs/2507.07828
作者: Richard Dirauf,Florian Wolz,Dario Zanca,Björn Eskofier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ICIAP 2025

点击查看摘要

Abstract:Content-based puzzle solvers have been extensively studied, demonstrating significant progress in computational techniques. However, their evaluation often lacks realistic challenges crucial for real-world applications, such as the reassembly of fragmented artefacts or shredded documents. In this work, we investigate the robustness of state-of-the-art content-based puzzle solvers by introducing three types of jigsaw puzzle corruptions: missing pieces, eroded edges, and eroded contents. Evaluating both heuristic and deep learning-based solvers, we analyse their ability to handle these corruptions and identify key limitations. Our results show that solvers developed for standard puzzles exhibit a rapid decline in performance as more pieces are corrupted. However, deep learning models can significantly improve their robustness through fine-tuning with augmented data. Notably, the advanced Positional Diffusion model adapts particularly well, outperforming its competitors in most experiments. Based on our findings, we highlight promising research directions for enhancing the automated reconstruction of real-world artefacts.
zh

[CV-23] Patient-specific vs Multi-Patient Vision Transformer for Markerless Tumor Motion Forecasting

【速读】:该论文试图解决肺癌肿瘤运动的精准预测问题,以提高质子治疗中的剂量精准交付。其解决方案的关键在于引入基于视觉Transformer(Vision Transformer, ViT)的无标记运动预测方法,并对比了两种训练策略:患者特异性(Patient-specific, PS)模型和多患者(Multi-patient, MP)模型。PS模型通过学习个体患者的运动模式实现更高精度,而MP模型则在有限的数据条件下表现出更强的鲁棒性和泛化能力,适用于临床时间约束场景。

链接: https://arxiv.org/abs/2507.07811
作者: Gauthier Rotsart de Hertaing,Dani Manjah,Benoit Macq
机构: UCLouvain(卢万大学); ICTEAM(信息与通信技术工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. While current markerless methods mostly rely on deep learning, transformer-based architectures remain unexplored in this domain, despite their proven performance in trajectory forecasting. Purpose: This work introduces a markerless forecasting approach for lung tumor motion using Vision Transformers (ViT). Two training strategies are evaluated under clinically realistic constraints: a patient-specific (PS) approach that learns individualized motion patterns, and a multi-patient (MP) model designed for generalization. The comparison explicitly accounts for the limited number of images that can be generated between planning and treatment sessions. Methods: Digitally reconstructed radiographs (DRRs) derived from planning 4DCT scans of 31 patients were used to train the MP model; a 32nd patient was held out for evaluation. PS models were trained using only the target patient’s planning data. Both models used 16 DRRs per input and predicted tumor motion over a 1-second horizon. Performance was assessed using Average Displacement Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and treatment (T2) data. Results: On T1 data, PS models outperformed MP models across all training set sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However, MP models demonstrated stronger robustness to inter-fractional anatomical variability and achieved comparable performance on T2 data without retraining. Conclusions: This is the first study to apply ViT architectures to markerless tumor motion forecasting. While PS models achieve higher precision, MP models offer robust out-of-the-box performance, well-suited for time-constrained clinical settings.
zh

[CV-24] Synergistic Prompting for Robust Visual Recognition with Missing Modalities

【速读】:该论文试图解决在现实应用场景中,由于缺失或不完整的模态输入导致大规模多模态模型性能显著下降的问题。其解决方案的关键在于提出一种名为Synergistic Prompting (SyP)的框架,该框架包含两个核心创新:(I) 动态适配器(Dynamic Adapter),用于计算自适应缩放因子以动态生成提示,替代静态参数实现灵活的多模态适配;(II) 协同提示策略(Synergistic Prompting Strategy),通过结合静态和动态提示来平衡各模态的信息,确保在关键模态缺失时仍能保持鲁棒的推理能力。
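
按上述两个核心组件的描述,下面给出一个最小化的 PyTorch 示意(非论文官方实现;模块结构、维度与初始化方式均为本文为说明而设的假设):

```python
import torch
import torch.nn as nn

class DynamicAdapter(nn.Module):
    """动态适配器示意:由输入特征计算自适应缩放因子,动态生成提示。"""
    def __init__(self, dim, prompt_len):
        super().__init__()
        self.scale_net = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, prompt_len))
        self.prompt_basis = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, feat):                          # feat: [B, dim] 模态特征
        scale = self.scale_net(feat).softmax(dim=-1)  # [B, prompt_len] 缩放因子
        return scale.unsqueeze(-1) * self.prompt_basis

class SynergisticPrompt(nn.Module):
    """静态提示与动态提示的协同组合(示意)。"""
    def __init__(self, dim=256, prompt_len=8):
        super().__init__()
        self.static_prompt = nn.Parameter(torch.zeros(prompt_len, dim))
        self.dynamic = DynamicAdapter(dim, prompt_len)

    def forward(self, feat):
        return self.static_prompt.unsqueeze(0) + self.dynamic(feat)

prompts = SynergisticPrompt()(torch.randn(4, 256))    # [4, 8, 256],可拼接到各模态 token 序列前
```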

链接: https://arxiv.org/abs/2507.07802
作者: Zhihui Zhang,Luanyuan Dai,Qika Lin,Yunfeng Diao,Guangyin Jin,Yufei Guo,Jing Zhang,Xiaoshuai Hao
机构: Beijing Institute of Technology (北京理工大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Nanjing University of Science and Technology (南京理工大学); National University of Singapore (新加坡国立大学); Hefei University of Technology (合肥工业大学); National Innovative Institute of Defense Technology (国防创新研究院); Intelligent Science & Technology Academy of CASIC (中国航天科工智能科学与技术研究院); School of Computer Science, Wuhan University (武汉大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability.
zh

[CV-25] Visual Instance-aware Prompt Tuning

【速读】:该论文试图解决视觉提示微调(Visual Prompt Tuning, VPT)在下游数据集中因数据方差大而导致性能不佳的问题。传统方法使用固定的数据集级提示,无法有效捕捉实例特定信息。解决方案的关键在于提出一种实例感知的视觉提示微调方法(ViaPT),该方法基于每个输入生成实例感知的提示,并将其与数据集级提示融合,利用主成分分析(PCA)保留关键提示信息,从而在平衡数据集级与实例级知识的同时,减少可学习参数数量。
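
针对“实例感知提示 + PCA 保留主成分 + 与数据集级提示融合”这一流程,下面给出一个最小化的 PyTorch 示意(非论文官方实现;池化方式、提示生成器结构与 PCA 秩均为本文假设):

```python
import torch
import torch.nn as nn

class ViaPTPromptFusion(nn.Module):
    """实例级提示生成 + PCA 低秩重构 + 数据集级提示融合(示意)。"""
    def __init__(self, dim=768, n_prompts=5, pca_rank=3):
        super().__init__()
        self.dataset_prompts = nn.Parameter(torch.zeros(n_prompts, dim))
        self.to_prompt = nn.Linear(dim, n_prompts * dim)
        self.n_prompts, self.dim, self.pca_rank = n_prompts, dim, pca_rank

    def forward(self, patch_tokens):                     # [B, N, dim]
        inst_feat = patch_tokens.mean(dim=1)             # 简单池化得到实例特征
        inst = self.to_prompt(inst_feat).view(-1, self.n_prompts, self.dim)
        flat = inst.flatten(0, 1)                        # [B*n_prompts, dim]
        # PCA 低秩重构,只保留实例提示的主要成分
        U, S, V = torch.pca_lowrank(flat, q=self.pca_rank, center=False)
        inst_pca = ((U * S) @ V.T).view_as(inst)
        return self.dataset_prompts.unsqueeze(0) + inst_pca

prompts = ViaPTPromptFusion()(torch.randn(2, 196, 768))  # [2, 5, 768]
```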

链接: https://arxiv.org/abs/2507.07796
作者: Xi Xiao,Yunbei Zhang,Xingjian Li,Tianyang Wang,Xiao Wang,Yuxiang Wei,Jihun Hamm,Min Xu
机构: University of Alabama at Birmingham(阿拉巴马大学伯明翰分校); Tulane University(图兰大学); Carnegie Mellon University(卡内基梅隆大学); Oak Ridge National Laboratory(橡树岭国家实验室); Georgia Institute of Technology(佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, from a conceptual standpoint, we reveal that VPT-Deep and VPT-Shallow represent two corner cases that fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.
zh

[CV-26] Robust and Generalizable Heart Rate Estimation via Deep Learning for Remote Photoplethysmography in Complex Scenarios

【速读】:该论文旨在解决非接触式远程光体积描记术(rPPG)在复杂场景下精度、鲁棒性和泛化能力不足的问题。其解决方案的关键在于提出一种端到端的rPPG提取网络,该网络采用3D卷积神经网络从原始面部视频中重建精确的rPPG信号,并引入差分帧融合模块以捕捉血容量脉冲(BVP)变化,同时结合时间位移模块(TSM)与自注意力机制,在计算开销最小的情况下有效增强rPPG特征。此外,还设计了一种动态混合损失函数,以提供更强的监督并缓解过拟合问题。
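
摘要中的差分帧融合模块可以用如下最小化的 PyTorch 示意说明(非论文官方实现;融合卷积的结构与补齐方式为本文假设):

```python
import torch
import torch.nn as nn

class DiffFrameFusion(nn.Module):
    """差分帧融合示意:相邻帧差分与原始帧在通道维拼接后做 3D 卷积融合。"""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.fuse = nn.Conv3d(in_ch * 2, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                         # x: [B, C, T, H, W] 原始面部视频
        diff = x[:, :, 1:] - x[:, :, :-1]         # 相邻帧差分,反映 BVP 变化
        diff = torch.cat([diff, diff[:, :, -1:]], dim=2)  # 复制末帧补齐时间长度
        return self.fuse(torch.cat([x, diff], dim=1))

feat = DiffFrameFusion()(torch.randn(1, 3, 32, 64, 64))   # [1, 16, 32, 64, 64]
```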

链接: https://arxiv.org/abs/2507.07795
作者: Kang Cen,Chang-Hong Fu,Hong Hong
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Non-contact remote photoplethysmography (rPPG) technology enables heart rate measurement from facial videos. However, existing network models still face challenges in accuracy, robustness, and generalization capability under complex scenarios. This paper proposes an end-to-end rPPG extraction network that employs 3D convolutional neural networks to reconstruct accurate rPPG signals from raw facial videos. We introduce a differential frame fusion module that integrates differential frames with original frames, enabling frame-level representations to capture blood volume pulse (BVP) variations. Additionally, we incorporate a Temporal Shift Module (TSM) with self-attention mechanisms, which effectively enhances rPPG features with minimal computational overhead. Furthermore, we propose a novel dynamic hybrid loss function that provides stronger supervision for the network, effectively mitigating overfitting. Comprehensive experiments were conducted on not only the PURE and UBFC-rPPG datasets but also the challenging MMPD dataset under complex scenarios, involving both intra-dataset and cross-dataset evaluations, which demonstrate the superior robustness and generalization capability of our network. Specifically, after training on PURE, our model achieved a mean absolute error (MAE) of 7.58 on the MMPD test set, outperforming the state-of-the-art models.
zh

[CV-27] SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes

【速读】:该论文旨在解决当前3D视觉-语言研究中对空间关系理解的不足,特别是现有数据集将语义线索与空间上下文混合,导致模型依赖表面捷径而非真正解析空间关系的问题。其解决方案的关键在于引入SURPRISE3D数据集,该数据集包含超过200k个视觉语言对和900多个详细室内场景,通过人工标注的不包含物体名称的空间查询,有效减少了空间理解中的捷径偏差,并全面覆盖了相对位置、叙述视角、参数视角和绝对距离等空间推理能力。

链接: https://arxiv.org/abs/2507.07781
作者: Jiaxin Huang,Ziwen Li,Hanlve Zhang,Runnan Chen,Xiao He,Yandong Guo,Wenping Wang,Tongliang Liu,Mingming Gong
机构: MBZUAI; The University of Sydney; AI2Robotic; Texas A&M University; The University of Melbourne
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object name) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce SURPRISE3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. SURPRISE3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. SURPRISE3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found in this https URL.
zh

[CV-28] Where are we with calibration under dataset shift in image classification?

【速读】:该论文试图解决在真实世界数据集偏移(dataset shift)条件下图像分类任务中模型校准(calibration)的问题。其关键解决方案是通过系统比较后训练(post-hoc)校准方法与训练中校准策略的交互,结合熵正则化和标签平滑(label smoothing)技术,以及利用少量语义无关的分布外数据(OOD)进行后训练校准,从而提升模型在数据集偏移下的校准鲁棒性。此外,研究还表明,在集成(ensembling)前进行校准比在集成后进行更有效,并指出微调基础模型可显著提升校准性能。
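
文中“熵正则化 + 标签平滑”的训练中校准组合可以用如下最小化的 PyTorch 损失示意说明(非论文官方实现;权重系数为假设的超参数):

```python
import torch
import torch.nn.functional as F

def calibrated_loss(logits, targets, smoothing=0.1, ent_weight=0.1):
    """交叉熵(含标签平滑)与熵正则化的组合损失示意。"""
    ce = F.cross_entropy(logits, targets, label_smoothing=smoothing)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return ce - ent_weight * entropy   # 减去熵即鼓励较高预测熵,缓解过度自信

loss = calibrated_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```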

链接: https://arxiv.org/abs/2507.07780
作者: Mélanie Roschewitz,Raghav Mehta,Fabio de Sousa Ribeiro,Ben Glocker
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code available at this https URL

点击查看摘要

Abstract:We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yield the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.
zh

[CV-29] Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training ICCV2025

【速读】:该论文试图解决在领域偏移情况下,传统测试时训练(Test-time Training, TTT)方法在处理多任务时出现的非同步任务行为问题,即不同任务所需的适应步骤不一致,导致性能下降。解决方案的关键在于提出一种名为Synchronizing Tasks for Test-time Training (S4T)的新方法,其核心思想是通过预测领域偏移下的任务关系,实现测试时多任务的同步处理。

链接: https://arxiv.org/abs/2507.07778
作者: Wooseong Jeong,Jegyeong Cho,Youngho Yoon,Kuk-Jin Yoon
机构: KAIST(韩国科学技术院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025

点击查看摘要

Abstract:Generalizing neural networks to unseen target domains is a significant challenge in real-world deployments. Test-time training (TTT) addresses this by using an auxiliary self-supervised task to reduce the domain gap caused by distribution shifts between the source and target. However, we find that when models are required to perform multiple tasks under domain shifts, conventional TTT methods suffer from unsynchronized task behavior, where the adaptation steps needed for optimal performance in one task may not align with the requirements of other tasks. To address this, we propose a novel TTT approach called Synchronizing Tasks for Test-time Training (S4T), which enables the concurrent handling of multiple tasks. The core idea behind S4T is that predicting task relations across domain shifts is key to synchronizing tasks during test time. To validate our approach, we apply S4T to conventional multi-task benchmarks, integrating it with traditional TTT protocols. Our empirical results show that S4T outperforms state-of-the-art TTT methods across various benchmarks.
zh

[CV-30] SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

【速读】:该论文试图解决无约束对抗攻击在视觉模型中难以保证人类感知不可察觉性的问题,从而使得传统基于范数的防御策略失效。其解决方案的关键在于提出SCOOTER框架,该框架提供了一套最佳实践指南、大规模的人类与模型对比实验、开源工具以及一个基于ImageNet的基准数据集,以系统化评估和比较无约束对抗样本的不可察觉性。

链接: https://arxiv.org/abs/2507.07776
作者: Dren Fazlija,Monty-Maximilian Zühlke,Johanna Schrader,Arkadij Orlov,Clara Stein,Iyiola E. Olatunji,Daniel Kudenko
机构: L3S.de(莱布尼茨实验室); EON.com(埃能矿); uni.lu(卢森堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 42 pages, 16 figures, 11 tables, Under Review, Code: this https URL , Data: this https URL

点击查看摘要

Abstract:Unrestricted adversarial attacks aim to fool computer vision models without being constrained by \ell_p -norm bounds to remain imperceptible to humans, for example, by changing an object’s color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: (i) best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; (ii) the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; (iii) open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; (iv) an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
zh

[CV-31] Rainbow Artifacts from Electromagnetic Signal Injection Attacks on Image Sensors

【速读】:该论文试图解决图像传感器在安全和安保关键系统中面临的电磁信号注入攻击问题,这类攻击通过操控图像传感器的模拟域来篡改原始视觉输入,从而绕过传统的数字完整性检查。解决方案的关键在于揭示了CMOS图像传感器上一种此前未被记录的攻击现象,即通过精心调整的电磁干扰在图像中引入类似彩虹的色彩伪影,并验证了这些伪影能够通过图像信号处理流程导致先进目标检测模型的重大误判。

链接: https://arxiv.org/abs/2507.07773
作者: Youqian Zhang,Xinyu Ji,Zhihao Wang,Qinhong Jiang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Image sensors are integral to a wide range of safety- and security-critical systems, including surveillance infrastructure, autonomous vehicles, and industrial automation. These systems rely on the integrity of visual data to make decisions. In this work, we investigate a novel class of electromagnetic signal injection attacks that target the analog domain of image sensors, allowing adversaries to manipulate raw visual inputs without triggering conventional digital integrity checks. We uncover a previously undocumented attack phenomenon on CMOS image sensors: rainbow-like color artifacts induced in images captured by image sensors through carefully tuned electromagnetic interference. We further evaluate the impact of these attacks on state-of-the-art object detection models, showing that the injected artifacts propagate through the image signal processing pipeline and lead to significant mispredictions. Our findings highlight a critical and underexplored vulnerability in the visual perception stack, highlighting the need for more robust defenses against physical-layer attacks in such systems.
zh

[CV-32] RIX- Trading Adversarial Fairness via Mixed Adversarial Training

【速读】:该论文试图解决对抗训练(Adversarial Training, AT)中存在的一种对抗不公平问题,即现有方法对所有类别采用统一的训练目标,导致具有明显区分特征的强类(strong classes)变得更加鲁棒,而具有重叠或共享特征的弱类(weak classes)则更容易受到对抗攻击。解决方案的关键在于提出TRIX框架,该框架通过自适应地为强类分配较弱的目标对抗样本并促进特征多样性,同时为弱类分配更强的无目标对抗样本以增强其鲁棒性,并结合类别感知的损失加权和扰动强度调整,以在优化过程中强调弱类的表现。
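
“弱类分配更强对抗者、强类分配更弱对抗者”这一自适应调度可以用如下最小化的 Python 示意说明(非论文官方实现;扰动强度区间与权重归一化方式均为本文假设):

```python
import torch

def trix_schedule(class_rob_acc, eps_min=4 / 255, eps_max=12 / 255):
    """按类脆弱度自适应分配扰动强度与类级损失权重(示意)。
    class_rob_acc: [C],各类在对抗样本上的精度估计,越低表示越脆弱。"""
    vuln = 1.0 - class_rob_acc                        # 脆弱度
    eps = eps_min + (eps_max - eps_min) * vuln        # 弱类获得更强的扰动预算
    loss_w = vuln / vuln.sum() * len(vuln)            # 归一化的类级损失权重
    return eps, loss_w

eps, w = trix_schedule(torch.tensor([0.9, 0.5, 0.2]))  # 第三类最脆弱,eps 与权重最大
```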

链接: https://arxiv.org/abs/2507.07768
作者: Tejaswini Medi,Steffen Jung,Margret Keuper
机构: University of Mannheim (马尔堡大学); MPI for Informatics (信息研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial Training (AT) is a widely adopted defense against adversarial examples. However, existing approaches typically apply a uniform training objective across all classes, overlooking disparities in class-wise vulnerability. This results in adversarial unfairness: classes with well distinguishable features (strong classes) tend to become more robust, while classes with overlapping or shared features (weak classes) remain disproportionately susceptible to adversarial attacks. We observe that strong classes do not require strong adversaries during training, as their non-robust features are quickly suppressed. In contrast, weak classes benefit from stronger adversaries to effectively reduce their vulnerabilities. Motivated by this, we introduce TRIX, a feature-aware adversarial training framework that adaptively assigns weaker targeted adversaries to strong classes, promoting feature diversity via uniformly sampled targets, and stronger untargeted adversaries to weak classes, enhancing their focused robustness. TRIX further incorporates per-class loss weighting and perturbation strength adjustments, building on prior work, to emphasize weak classes during the optimization. Comprehensive experiments on standard image classification benchmarks, including evaluations under strong attacks such as PGD and AutoAttack, demonstrate that TRIX significantly improves worst-case class accuracy on both clean and adversarial data, reduces inter-class robustness disparities, and preserves overall accuracy. Our results highlight TRIX as a practical step toward fair and effective adversarial defense.
zh

[CV-33] Deep Learning based 3D Volume Correlation for Additive Manufacturing Using High-Resolution Industrial X-ray Computed Tomography

【速读】:该论文试图解决增材制造(Additive Manufacturing, AM)中由于收缩和变形导致的几何误差问题,这些问题可能影响制造部件的寿命和性能。为了解决这一问题,研究提出了一种基于深度学习的方法,用于估计计算机辅助设计(CAD)与X射线计算机断层扫描(XCT)体积之间的体素级变形。该方法的关键在于采用动态块处理策略以应对高分辨率XCT数据的计算挑战,并引入二值差异图(Binary Difference Map, BDM)来量化二值化CAD与XCT体积之间的体素级不匹配,从而评估配准精度。
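
文中提出的二值差异图(BDM)定义直观,可用如下最小化的 NumPy 示意说明(非论文官方实现;阈值取法为本文假设):

```python
import numpy as np

def binary_difference_map(cad_vol, xct_vol, thresh=0.5):
    """二值差异图(BDM)示意:统计二值化 CAD 与 XCT 体积的体素级不匹配。"""
    cad_bin = cad_vol > thresh
    xct_bin = xct_vol > thresh
    bdm = np.logical_xor(cad_bin, xct_bin)      # True 表示该体素不匹配
    match_rate = 1.0 - bdm.mean()               # 体素匹配率,可用于评估配准精度
    return bdm, match_rate

bdm, rate = binary_difference_map(np.random.rand(64, 64, 64),
                                  np.random.rand(64, 64, 64))
```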

链接: https://arxiv.org/abs/2507.07757
作者: Keerthana Chand,Tobias Fritsch,Bardia Hejazi,Konstantin Poka,Giovanni Bruno
机构: Bundesanstalt Für Materialforschung Und -Prüfung (BAM, Federal Institute for Materials Research and Testing); Institute of Physics and Astronomy, University of Potsdam
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Quality control in additive manufacturing (AM) is vital for industrial applications in areas such as the automotive, medical and aerospace sectors. Geometric inaccuracies caused by shrinkage and deformations can compromise the life and performance of additively manufactured components. Such deviations can be quantified using Digital Volume Correlation (DVC), which compares the computer-aided design (CAD) model with the X-ray Computed Tomography (XCT) geometry of the components produced. However, accurate registration between the two modalities is challenging due to the absence of a ground truth or reference deformation field. In addition, the extremely large data size of high-resolution XCT volumes makes computation difficult. In this work, we present a deep learning-based approach for estimating voxel-wise deformations between CAD and XCT volumes. Our method uses a dynamic patch-based processing strategy to handle high-resolution volumes. In addition to the Dice Score, we introduce a Binary Difference Map (BDM) that quantifies voxel-wise mismatches between binarized CAD and XCT volumes to evaluate the accuracy of the registration. Our approach shows a 9.2% improvement in the Dice Score and a 9.9% improvement in the voxel match rate compared to classic DVC methods, while reducing the interaction time from days to minutes. This work sets the foundation for deep learning-based DVC methods to generate compensation meshes that can then be used in closed-loop correlations during the AM production process. Such a system would be of great interest to industries since the manufacturing process will become more reliable and efficient, saving time and material.
zh

[CV-34] X-RAFT: Cross-Modal Non-Rigid Registration of Blue and White Light Neurosurgical Hyperspectral Images

【速读】:该论文试图解决在荧光引导神经外科手术中,如何在不同光照条件(蓝光荧光模式与白光反射模式)下实现高精度的跨模态图像对应问题,从而支持实时定量荧光测量。解决方案的关键在于引入X-RAFT模型,该模型是对Recurrent All-Pairs Field Transforms (RAFT)光学流模型的改进,专门用于处理跨模态输入,并通过使用针对每对模态的独立图像编码器,在神经外科超光谱数据上以自监督方式基于流循环一致性进行微调,从而提高跨模态图像匹配的准确性。
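
文中用于自监督微调的“光流循环一致性”是一个标准构造,可用如下 PyTorch 示意说明(非论文官方实现;这里给出常见的正反向流一致性形式,作为本文的假设):

```python
import torch
import torch.nn.functional as F

def warp_by_flow(x, flow):
    """按光流对 x 反向采样。x: [B,C,H,W];flow: [B,2,H,W],通道为 (dx, dy) 像素位移。"""
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(flow.device)   # [2,H,W] 基准网格
    coords = base.unsqueeze(0) + flow                             # 采样坐标
    gx = 2 * coords[:, 0] / (W - 1) - 1                           # 归一化到 [-1,1]
    gy = 2 * coords[:, 1] / (H - 1) - 1
    grid = torch.stack([gx, gy], dim=-1)                          # [B,H,W,2]
    return F.grid_sample(x, grid, align_corners=True)

def flow_cycle_loss(flow_ab, flow_ba):
    """循环一致性:正向流与按其回扭的反向流之和应接近零。"""
    return (flow_ab + warp_by_flow(flow_ba, flow_ab)).abs().mean()

loss = flow_cycle_loss(torch.randn(1, 2, 32, 32), torch.randn(1, 2, 32, 32))
```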

链接: https://arxiv.org/abs/2507.07747
作者: Charlie Budd,Silvère Ségaud,Matthew Elliot,Graeme Stasiuk,Yijing Xie,Jonathan Shapey,Tom Vercauteren
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Integration of hyperspectral imaging into fluorescence-guided neurosurgery has the potential to improve surgical decision making by providing quantitative fluorescence measurements in real-time. Quantitative fluorescence requires paired spectral data in fluorescence (blue light) and reflectance (white light) mode. Blue and white image acquisition needs to be performed sequentially in a potentially dynamic surgical environment. A key component to the fluorescence quantification process is therefore the ability to find dense cross-modal image correspondences between two hyperspectral images taken under these drastically different lighting conditions. We address this challenge with the introduction of X-RAFT, a Recurrent All-Pairs Field Transforms (RAFT) optical flow model modified for cross-modal inputs. We propose using distinct image encoders for each modality pair, and fine-tune these in a self-supervised manner using flow-cycle-consistency on our neurosurgical hyperspectral data. We show an error reduction of 36.6% across our evaluation metrics when comparing to a naive baseline and 27.83% reduction compared to an existing cross-modal optical flow method (CrossRAFT). Our code and models will be made publicly available after the review process.
zh

[CV-35] Sparse-Dense Side-Tuner for efficient Video Temporal Grounding

【速读】:该论文试图解决视频时间定位(Video Temporal Grounding, VTG)任务中,现有方法依赖冻结的大型预训练主干网络最终层特征而导致的适应性不足问题。其关键解决方案是提出一种无锚点的侧调优架构——Sparse-Dense Side-Tuner (SDST),并引入基于参考的可变形自注意力机制,以增强上下文建模能力,同时首次将InternVideo2主干有效集成到侧调优框架中,从而在提升性能的同时显著减少参数量。

链接: https://arxiv.org/abs/2507.07744
作者: David Pujol-Perich,Sergio Escalera,Albert Clapés
机构: Universitat de Barcelona and Computer Vision Center, Barcelona, Spain
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning – and particularly side-tuning (ST) – has emerged as an effective alternative. However, prior ST methods approach this problem from a frame-level refinement perspective, overlooking the inherent sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of the deformable attention – a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound impact on performance. Overall, our method significantly improves upon existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. the existing SOTA methods. The code is publicly accessible at this https URL.
zh

[CV-36] EEvAct: Early Event-Based Action Recognition with High-Rate Two-Stream Spiking Neural Networks

【速读】:该论文试图解决在人机交互系统中早期识别人类活动的问题,以提高系统的安全性和响应性。传统方法通过将事件累积为低速率帧或时空体素进行处理,限制了早期预测能力,而尽管脉冲神经网络(Spiking Neural Networks, SNNs)能够以高频率处理事件实现早期预测,但其最终准确性仍有不足。该研究的关键在于提出一种高频率双流SNN架构,在大规模THU EACT-50数据集上实现了比之前工作高出2%的最终准确率,从而弥补了这一差距。

链接: https://arxiv.org/abs/2507.07734
作者: Michael Neumeier,Jules Lecomte,Nils Kazinski,Soubarna Banik,Bing Li,Axel von Arnim
机构: fortiss GmbH(fortiss有限公司); Technical University Munich(慕尼黑技术大学); Neuromorphic Computing(神经形态计算); Simi Reality Motion Systems GmbH(Simi现实运动系统有限公司); Digital Integrated Systems Group(数字集成系统组); University of Siegen(锡根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: International Conference on Neuromorphic Systems (ICONS) 2025

点击查看摘要

Abstract:Recognizing human activities early is crucial for the safety and responsiveness of human-robot and human-machine interfaces. Due to their high temporal resolution and low latency, event-based vision sensors are a perfect match for this early recognition demand. However, most existing processing approaches accumulate events to low-rate frames or space-time voxels which limits the early prediction capabilities. In contrast, spiking neural networks (SNNs) can process the events at a high-rate for early predictions, but most works still fall short on final accuracy. In this work, we introduce a high-rate two-stream SNN which closes this gap by outperforming previous work by 2% in final accuracy on the large-scale THU EACT-50 dataset. We benchmark the SNNs within a novel early event-based recognition framework by reporting Top-1 and Top-5 recognition scores for growing observation time. Finally, we exemplify the impact of these methods on a real-world task of early action triggering for human motion capture in sports.
zh

[CV-37] RTR-GS: 3D Gaussian Splatting for Inverse Rendering with Radiance Transfer and Reflection

【速读】:该论文试图解决在逆向渲染和再照明过程中对反射物体进行精确渲染的问题,尤其是处理任意反射属性的物体时的挑战。解决方案的关键在于提出一种名为RTR-GS的新型逆向渲染框架,该框架通过结合正向渲染与延迟渲染的混合渲染模型,有效恢复几何结构,并分离高频和低频外观,从而减轻因球面谐波过拟合导致的浮动伪影。此外,通过引入基于物理的延迟渲染分支进一步优化BRDF和光照分解,实现了可信的再照明结果。

链接: https://arxiv.org/abs/2507.07733
作者: Yongyang Zhou,Fang-Lue Zhang,Zichen Wang,Lei Zhang
机构: Beijing Institute of Technology(北京理工大学); Victoria University of Wellington(维多利亚大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has demonstrated impressive capabilities in novel view synthesis. However, rendering reflective objects remains a significant challenge, particularly in inverse rendering and relighting. We introduce RTR-GS, a novel inverse rendering framework capable of robustly rendering objects with arbitrary reflectance properties, decomposing BRDF and lighting, and delivering credible relighting results. Given a collection of multi-view images, our method effectively recovers geometric structure through a hybrid rendering model that combines forward rendering for radiance transfer with deferred rendering for reflections. This approach successfully separates high-frequency and low-frequency appearances, mitigating floating artifacts caused by spherical harmonic overfitting when handling high-frequency details. We further refine BRDF and lighting decomposition using an additional physically-based deferred rendering branch. Experimental results show that our method enhances novel view synthesis, normal estimation, decomposition, and relighting while maintaining an efficient training and inference process.
zh

[CV-38] Energy-Guided Decoding for Object Hallucination Mitigation

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中物体幻觉(object hallucination)的问题,特别是模型在视觉问答(VQA)任务中对“是”回答的比例(yes ratio)存在显著不平衡的现象。其解决方案的关键在于提出一种基于能量的解码方法,该方法通过动态选择具有最小能量得分的隐藏状态来优化解码过程,从而有效降低yes ratio的偏差并提升模型在多个基准测试(POPE、MME和MMVP)上的性能。
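
文中的能量得分通常取 E(x) = -T·logsumexp(logits/T)(能量式 OOD 检测中的标准定义);“选取能量最低的一层”可用如下 PyTorch 示意说明(非论文官方实现;hidden_states 与 lm_head 的接口以及逐层投影方式均为本文假设):

```python
import torch

def energy_score(logits, T=1.0):
    """能量得分:E(x) = -T * logsumexp(logits / T)。"""
    return -T * torch.logsumexp(logits / T, dim=-1)

def select_min_energy_logits(hidden_states, lm_head):
    """对每层最后位置的隐藏状态投影到词表,取平均能量最低的一层用于解码(示意)。"""
    energies = []
    for h in hidden_states:              # h: [B, dim],每层隐藏状态(假设)
        energies.append(energy_score(lm_head(h)).mean())
    best = int(torch.stack(energies).argmin())
    return lm_head(hidden_states[best])  # 该层的 logits 供后续解码

lm_head = torch.nn.Linear(64, 1000)
logits = select_min_energy_logits([torch.randn(2, 64) for _ in range(4)], lm_head)
```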

链接: https://arxiv.org/abs/2507.07731
作者: Xixi Liu,Ailin Deng,Christopher Zach
机构: Chalmers University of Technology(查尔姆斯理工大学); National University of Singapore(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mitigating object hallucination in large vision-language models (LVLMs) is critical to their safe deployment. Existing methods are either restricted to specific decoding methods, demand sophisticated modifications to visual inputs, or rely on knowledge from external models. In this work, we first reveal the phenomenon that VLMs exhibit significant imbalance in the "Yes" ratio (i.e., the fraction of "Yes" answers among the total number of questions) across three different visual question answering (VQA) datasets. Furthermore, we propose an energy-based decoding method, which dynamically selects the hidden states from the layer with minimal energy score. It is simple yet effective in reducing the bias in the "Yes" ratio while boosting performance across three benchmarks (POPE, MME, and MMVP). Our method consistently improves accuracy and F1 score on three VQA datasets across three commonly used VLMs over several baseline methods. The average accuracy improvement is 4.82% compared to greedy decoding. Moreover, the average yes-ratio gap reduction is 8.81%, meaning the proposed method is less biased, as shown in Figure 1.
zh

[CV-39] RAPS-3D: Efficient interactive segmentation for 3D radiological imaging

【速读】:该论文旨在解决将可提示分割(promptable segmentation)方法从2D图像扩展到3D医学影像数据时所面临的计算复杂度高、推理时间长以及需要复杂策略如滑动窗口(sliding-window)来管理内存的问题。其解决方案的关键在于提出一种简化的3D可提示分割方法,受SegVol启发,旨在减少推理时间并消除滑动窗口带来的提示管理复杂性,同时保持最先进的性能。

链接: https://arxiv.org/abs/2507.07730
作者: Théo Danielou,Daniel Tordjman,Pierre Manceron,Corentin Dancette
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Abstract accepted at MIUA 2025

点击查看摘要

Abstract:Promptable segmentation, introduced by the Segment Anything Model (SAM), is a promising approach for medical imaging, as it enables clinicians to guide and refine model predictions interactively. However, SAM’s architecture is designed for 2D images and does not extend naturally to 3D volumetric data such as CT or MRI scans. Adapting 2D models to 3D typically involves autoregressive strategies, where predictions are propagated slice by slice, resulting in increased inference complexity. Processing large 3D volumes also requires significant computational resources, often leading existing 3D methods to also adopt complex strategies like sliding-window inference to manage memory usage, at the cost of longer inference times and greater implementation complexity. In this paper, we present a simplified 3D promptable segmentation method, inspired by SegVol, designed to reduce inference time and eliminate prompt management complexities associated with sliding windows while achieving state-of-the-art performance.
zh

[CV-40] Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays

【速读】:该论文试图解决医学影像领域中开放数据集是否存在潜在偏差的问题,以及现代方法是否在AI应用中采取了捷径而非关注相关病理问题。其解决方案的关键在于通过对流行的开源胸部X光数据集(如NIH、CheXpert、MIMIC-CXR和PadChest)进行相同的“命名数据集”任务,并通过数据集变换增加任务难度,以检测数据集偏差的存在。研究者还实现了多种网络架构来评估模型是否依赖于数据集的偏差而非真正的病理特征。

链接: https://arxiv.org/abs/2507.07722
作者: Ethan Dack,Chengliang Dai
机构: University of Bern (伯尔尼大学); UCB Pharma Limited (UCB制药有限公司); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent work has revisited the infamous task "Name that dataset" and established that non-medical datasets contain an underlying bias, achieving high accuracies on the dataset-origin task. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release for open-source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. We apply simple transformations of the datasets to try to identify bias, deliberately increasing the difficulty of the task. Given the importance of AI applications in medical imaging, it’s vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more explainable research being performed in medical imaging and the creation of more open-source datasets in the medical domain. The corresponding code will be released upon acceptance.
zh

[CV-41] Breast Ultrasound Tumor Generation via Mask Generator and Text-Guided Network:A Clinically Controllable Framework with Downstream Evaluation

【速读】:该论文旨在解决乳腺超声(Breast Ultrasound, BUS)图像分析中深度学习模型因专家标注数据稀缺而难以构建鲁棒模型的问题。其解决方案的关键在于提出一种临床可控的生成框架,该框架通过整合临床描述与结构掩码来合成肿瘤,从而实现对肿瘤特征(如形态、回声性和形状)的细粒度控制,并设计了一种语义-曲率掩码生成器,以临床先验知识为指导生成结构多样的肿瘤掩码,进而生成具有真实世界形态多样性的个性化合成BUS图像。

链接: https://arxiv.org/abs/2507.07721
作者: Haoyu Pan,Hongxin Lin,Zetian Feng,Chuxuan Lin,Junyang Mo,Chu Zhang,Zijian Wu,Yi Wang,Qingqing Zheng
机构: Shenzhen University Medical School(深圳大学医学院); Shenzhen University(深圳大学); Smart Medical Imaging, Learning and Engineering (SMILE) Lab(智能医学影像、学习与工程实验室); the Seventh Affiliated Hospital, Sun Yat-Sen University(中山大学第七附属医院); Sun Yat-Sen University(中山大学); Shenzhen University of Advanced Technology(深圳技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:The development of robust deep learning models for breast ultrasound (BUS) image analysis is significantly constrained by the scarcity of expert-annotated data. To address this limitation, we propose a clinically controllable generative framework for synthesizing BUS images. This framework integrates clinical descriptions with structural masks to generate tumors, enabling fine-grained control over tumor characteristics such as morphology, echogenicity, and shape. Furthermore, we design a semantic-curvature mask generator, which synthesizes structurally diverse tumor masks guided by clinical priors. During inference, synthetic tumor masks serve as input to the generative framework, producing highly personalized synthetic BUS images with tumors that reflect real-world morphological diversity. Quantitative evaluations on six public BUS datasets demonstrate the significant clinical utility of our synthetic images, showing their effectiveness in enhancing downstream breast cancer diagnosis tasks. Furthermore, visual Turing tests conducted by experienced sonographers confirm the realism of the generated images, indicating the framework’s potential to support broader clinical applications.
zh

[CV-42] Balancing the Past and Present: A Coordinated Replay Framework for Federated Class-Incremental Learning

【速读】:该论文旨在解决联邦持续学习(Federated Class Incremental Learning, FCIL)中因类别不平衡导致的遗忘问题,特别是在数据分布异构和任务间类别不平衡的情况下。其解决方案的关键在于提出一种基于类别平衡的数据重放方法(FedCBDR),该方法通过全局协调机制构建类级别的记忆,并通过重新加权学习目标来缓解类别间的不平衡。FedCBDR包含两个核心组件:一是全局视角的数据重放模块,以隐私保护的方式重建先前任务的全局表征,并指导类感知和重要性敏感的采样策略;二是任务感知的温度缩放模块,根据任务动态自适应调整逻辑值的温度,从而降低模型对多数类的过度自信并增强对少数类的敏感性。
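
其中的类级温度缩放可以用如下最小化的 PyTorch 示意说明(非论文官方实现;按样本数设定温度只是本文假设的启发式,论文中温度还随任务动态与实例调整):

```python
import torch
import torch.nn.functional as F

def class_temp_ce(logits, targets, class_temp):
    """类级温度缩放交叉熵示意:多数类用较高温度抑制过度自信,
    少数类用较低温度增强敏感性。class_temp: [C]。"""
    scaled = logits / class_temp.unsqueeze(0)
    return F.cross_entropy(scaled, targets)

counts = torch.tensor([500.0, 200.0, 50.0])            # 各类样本数(示例)
temp = (counts / counts.min()).sqrt()                  # 多数类温度更高(假设)
loss = class_temp_ce(torch.randn(8, 3), torch.randint(0, 3, (8,)), temp)
```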

链接: https://arxiv.org/abs/2507.07712
作者: Zhuang Qi,Lei Meng,Han Yu
机构: Shandong University(山东大学); Shandong Research Institute of Industrial Technology(山东工业技术研究院); Nanyang Technological University(南洋理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class-wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior tasks in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) subsequently, to handle class imbalance across tasks, the task-aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model’s overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
zh

[CV-43] One Object Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models

【速读】:该论文试图解决统一视觉-语言模型(Unified VLMs)在面对跨任务对抗攻击时的安全性问题,即如何使恶意输入在不同任务指令下仍能有效误导模型。解决方案的关键在于提出CrossVLAD基准数据集和CRAFT(Cross-task Region-based Attack Framework with Token-alignment)攻击框架,其中CrossVLAD通过对象变化目标对模型进行系统评估,而CRAFT则基于区域的攻击方法结合了token对齐技术,以提高跨任务攻击的有效性和针对性。

链接: https://arxiv.org/abs/2507.07709
作者: Jiale Zhao,Xinyang Jiang,Junyao Gao,Yuhao Xue,Cairong Zhao
机构: Tongji University (同济大学); Microsoft Research Asia (亚洲微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unified vision-language models(VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective-consistently manipulating a target object’s classification across four downstream tasks-and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
zh

[CV-44] Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring

【速读】:该论文旨在解决数字图像中局部运动模糊的问题,该模糊源于动态物体与静态成像系统在曝光期间的相对运动。其解决方案的关键在于提出一种可训练的掩码预测器,用于识别图像中的模糊区域,并通过结构重参数化将3×3卷积转换为计算效率更高的1×1卷积,实现像素级的清晰区域剪枝以降低计算量。此外,还开发了一个帧内运动分析器,将相对像素位移转化为运动轨迹,为区域特定的模糊恢复提供自适应指导。
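
“清晰区域按像素剪枝、仅对模糊像素做等价 1×1 卷积”的推理优化思路可以用如下最小化的 PyTorch 示意说明(非论文官方实现;掩码形状与权重组织方式均为本文假设):

```python
import torch

def masked_pointwise_conv(x, weight, blur_mask):
    """按模糊掩码进行像素级剪枝的 1×1 卷积示意:清晰像素直接跳过计算。
    x: [B, C, H, W];weight: [C_out, C](等价于 1×1 卷积核);blur_mask: [B, H, W]。"""
    B, C, H, W = x.shape
    flat = x.permute(0, 2, 3, 1).reshape(-1, C)               # [B*H*W, C]
    idx = blur_mask.reshape(-1).bool()
    out = torch.zeros(flat.shape[0], weight.shape[0],
                      dtype=x.dtype, device=x.device)
    out[idx] = flat[idx] @ weight.T                           # 仅对模糊像素计算
    return out.view(B, H, W, -1).permute(0, 3, 1, 2)

y = masked_pointwise_conv(torch.randn(1, 16, 32, 32),
                          torch.randn(16, 16),
                          torch.rand(1, 32, 32) > 0.5)
```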

链接: https://arxiv.org/abs/2507.07708
作者: Wei Shang,Dongwei Ren,Wanying Zhang,Pengfei Zhu,Qinghua Hu,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学); City University of Hong Kong (香港城市大学); School of Computer Science and Technology, Harbin Institute of Technology (哈尔滨工业大学计算机科学与技术学院); Tianjin University (天津大学); Low-Altitude Intelligence Lab, Xiong’an National Innovation Center Technology Co., Ltd (低空智能实验室,雄安国家创新中心科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACMMM 2025

点击查看摘要

Abstract:Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting 3\times 3 convolutions to computationally efficient 1\times 1 convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at this https URL.
zh

[CV-45] Compressive Imaging Reconstruction via Tensor Decomposed Multi-Resolution Grid Encoding

【速读】:该论文旨在解决压缩成像(Compressive Imaging, CI)重建中现有无监督表示方法在表征能力和效率之间难以达到理想平衡的问题。其解决方案的关键在于提出一种名为GridTD的无监督连续表征框架,该框架通过优化轻量级神经网络与输入张量分解模型,并利用多分辨率哈希网格编码进行参数学习,结合了多分辨率网格编码的层次建模能力和张量分解的紧凑性,从而实现了高维图像的有效且高效的重建。
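
作为对“张量分解紧凑表示”一端的直观说明,下面给出一个最小化的 CP 分解体数据表示示意(非论文官方实现;GridTD 实际还结合了多分辨率哈希网格编码与轻量网络,此处仅示意低秩因子部分):

```python
import torch
import torch.nn as nn

class TinyCPField(nn.Module):
    """CP 分解表示示意:用三组因子向量低秩重建 3D 体数据。"""
    def __init__(self, shape, rank=8):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(s, rank) * 0.1) for s in shape])

    def forward(self):
        A, B, C = self.factors
        return torch.einsum("ir,jr,kr->ijk", A, B, C)   # 低秩重建的体数据

field = TinyCPField((32, 32, 32), rank=8)
volume = field()        # [32, 32, 32],可与观测算子组合去拟合压缩测量
```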

链接: https://arxiv.org/abs/2507.07707
作者: Zhenyu Jin,Yisi Luo,Xile Zhao,Deyu Meng
机构: Xi’an Jiaotong University (西安交通大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Compressive imaging (CI) reconstruction, such as snapshot compressive imaging (SCI) and compressive sensing magnetic resonance imaging (MRI), aims to recover high-dimensional images from low-dimensional compressed measurements. This process critically relies on learning an accurate representation of the underlying high-dimensional image. However, existing unsupervised representations may struggle to achieve a desired balance between representation ability and efficiency. To overcome this limitation, we propose Tensor Decomposed multi-resolution Grid encoding (GridTD), an unsupervised continuous representation framework for CI reconstruction. GridTD optimizes a lightweight neural network and the input tensor decomposition model whose parameters are learned via multi-resolution hash grid encoding. It inherently enjoys the hierarchical modeling ability of multi-resolution grid encoding and the compactness of tensor decomposition, enabling effective and efficient reconstruction of high-dimensional images. Theoretical analyses for the algorithm’s Lipschitz property, generalization error bound, and fixed-point convergence reveal the intrinsic superiority of GridTD as compared with existing continuous representation models. Extensive experiments across diverse CI tasks, including video SCI, spectral SCI, and compressive dynamic MRI reconstruction, consistently demonstrate the superiority of GridTD over existing methods, positioning GridTD as a versatile and state-of-the-art CI reconstruction method.
zh

[CV-46] D-CNN and VQ-VAE Autoencoders for Compression and Denoising of Industrial X-ray Computed Tomography Images

【速读】:该论文试图解决工业X射线计算机断层扫描(XCT)数据在数据量不断增长背景下高效可靠存储的问题。其解决方案的关键在于利用深度学习中的自编码器技术,包括深度卷积神经网络(D-CNN)和向量量化变分自编码器(VQ-VAE),对XCT数据进行压缩,并评估不同压缩率下解码数据的质量。研究还引入了一种对边缘保留敏感的度量标准,以提升三维数据分析中的图像解码质量,从而根据具体分析需求选择合适的架构和压缩率。

链接: https://arxiv.org/abs/2507.07704
作者: Bardia Hejazi,Keerthana Chand,Tobias Fritsch,Giovanni Bruno
机构: Department for X-Ray Imaging, Federal Institute for Materials Research and Testing (BAM), Berlin, Germany; Institute of Physics and Astronomy, University of Potsdam, Potsdam, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The ever-growing volume of data in imaging sciences, stemming from advancements in imaging technologies, necessitates efficient and reliable storage solutions for such large datasets. This study investigates the compression of industrial X-ray computed tomography (XCT) data using deep learning autoencoders and examines how these compression algorithms affect the quality of the recovered data. Two network architectures with different compression rates were used, a deep convolutional neural network (D-CNN) and a vector quantized variational autoencoder (VQ-VAE). The XCT data used was from a sandstone sample with a complex internal pore network. The quality of the decoded images obtained from the two different deep learning architectures with different compression rates was quantified and compared to the original input data. In addition, to improve image decoding quality metrics, we introduced a metric sensitive to edge preservation, which is crucial for three-dimensional data analysis. We showed that different architectures and compression rates are required depending on the specific characteristics needed to be preserved for later analysis. The findings presented here can aid scientists in determining the requirements and strategies for their data storage and analysis needs.
zh

[CV-47] ree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation

【速读】:该论文旨在解决水下单目深度估计(Underwater Monocular Depth Estimation, UMDE)中存在的两个关键问题:一是基于Mamba的方法由于缺乏灵活的状态扫描策略,难以有效建模水下图像的结构特征;二是现有UMDE数据集通常包含不可靠的深度标签,导致水下图像与其对应深度图之间的物体-深度关系错误。解决方案的关键在于提出一种树感知的Mamba方法(Tree-Mamba),其核心是引入一种基于特征相似性的最小生成树自适应构建策略,并通过自底向上和自顶向下的遍历方式灵活聚合树节点间的空间拓扑特征,从而增强多尺度特征表示能力。此外,论文还构建了一个名为BlueDepth的水下深度估计基准,包含38,162对具有可靠深度标签的水下图像,为训练深度学习方法提供基础数据支持。
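
“基于特征相似性自适应构建最小生成树”这一步可以用如下最小化的 SciPy 示意说明(非论文官方实现;以欧氏距离刻画相似性为本文假设,论文中的树遍历聚合未在此示意):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def build_feature_mst(feats):
    """按特征相似度构建最小生成树:距离越小越相似,树边连接相似节点。"""
    dist = cdist(feats, feats)                 # [N, N] 成对特征距离
    mst = minimum_spanning_tree(dist)          # 稀疏矩阵,非零项为树边
    return mst.toarray()

edges = build_feature_mst(np.random.rand(16, 64))   # 16 个特征节点的 MST 邻接阵
```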

链接: https://arxiv.org/abs/2507.07687
作者: Peixian Zhuang,Yijian Wang,Zhenqi Fu,Hongliang Zhang,Sam Kwong,Chongyi Li
机构: University of Science and Technology Beijing(北京科技大学); Wenzhou Medical University(温州医科大学); Tsinghua University(清华大学); Deepinfar Ocean Technology Inc.(深海科技有限公司); Lingnan University(岭南大学); Nankai University(南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater Monocular Depth Estimation (UMDE) is a critical task that aims to estimate high-precision depth maps from underwater degraded images caused by light absorption and scattering effects in marine environments. Recently, Mamba-based methods have achieved promising performance across various vision tasks; however, they struggle with the UMDE task because their inflexible state scanning strategies fail to model the structural features of underwater images effectively. Meanwhile, existing UMDE datasets usually contain unreliable depth labels, leading to incorrect object-depth relationships between underwater images and their corresponding depth maps. To overcome these limitations, we develop a novel tree-aware Mamba method, dubbed Tree-Mamba, for estimating accurate monocular depth maps from underwater degraded images. Specifically, we propose a tree-aware scanning strategy that adaptively constructs a minimum spanning tree based on feature similarity. The spatial topological features among the tree nodes are then flexibly aggregated through bottom-up and top-down traversals, enabling stronger multi-scale feature representation capabilities. Moreover, we construct an underwater depth estimation benchmark (called BlueDepth), which consists of 38,162 underwater image pairs with reliable depth labels. This benchmark serves as a foundational dataset for training existing deep learning-based UMDE methods to learn accurate object-depth relationships. Extensive experiments demonstrate the superiority of the proposed Tree-Mamba over several leading methods in both qualitative results and quantitative evaluations with competitive computational efficiency. Code and dataset will be available at this https URL.
zh

[CV-48] Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

【速读】:该论文试图解决生成式 AI (Generative AI) 中多模态链式推理(multi-modal chain-of-thought, CoT)过程中,现有大型视觉-语言模型(LVLMs)往往忽略生成的中间推理内容的问题。其解决方案的关键在于将多模态 CoT 推理重新建模为基于 KL 散度约束的奖励最大化问题,聚焦于基于推理内容的对数似然。为此,作者提出了一种称为推理增强解码(Rationale-Enhanced Decoding, RED)的新方法,通过乘积方式融合图像条件和推理条件的下一个词概率分布,从而有效整合视觉与推理信息,提升模型推理的准确性和一致性。
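
摘要中“图像条件与推理条件的下一词分布相乘”这一核心操作可以用如下最小化的 PyTorch 示意说明(非论文官方实现;此处采用贪心选取,实际解码策略可不同):

```python
import torch

def red_decode_step(logits_img, logits_rat):
    """RED 解码一步示意:两路条件分布逐元素相乘后重新归一化。"""
    p_img = logits_img.softmax(dim=-1)     # 图像条件的下一词分布
    p_rat = logits_rat.softmax(dim=-1)     # 推理条件的下一词分布
    fused = p_img * p_rat
    fused = fused / fused.sum(dim=-1, keepdim=True)
    return fused.argmax(dim=-1)            # 贪心选取下一 token

next_token = red_decode_step(torch.randn(1, 32000), torch.randn(1, 32000))
```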

链接: https://arxiv.org/abs/2507.07685
作者: Shin’ya Yamaguchi,Kosuke Nishida,Daiki Chijiwa
机构: NTT(日本电信电话株式会社); Kyoto University(京都大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.
zh
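
RED 的核心操作是:把图像条件与推理条件下的下一词分布逐元素相乘后重新归一化。下面给出一个最小草图(函数与变量名均为示意,假设两路 logits 已由两次前向得到):

```python
import torch
import torch.nn.functional as F

def red_next_token_probs(logits_image_cond: torch.Tensor,
                         logits_rationale_cond: torch.Tensor,
                         eps: float = 1e-8) -> torch.Tensor:
    """两次前向分别得到:仅以图像为条件、以图像+已生成推理链为条件的 logits。
    RED 将两个分布逐元素相乘并重新归一化,作为最终的采样分布。"""
    p_img = F.softmax(logits_image_cond, dim=-1)
    p_rat = F.softmax(logits_rationale_cond, dim=-1)
    fused = p_img * p_rat                              # 乘积融合
    return fused / (fused.sum(dim=-1, keepdim=True) + eps)

# 解码时逐步调用:next_token = torch.multinomial(red_next_token_probs(l1, l2), 1)
```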

[CV-49] Action Unit Enhance Dynamic Facial Expression Recognition

【速读】:该论文旨在解决动态面部表情识别(DFER)中深度学习模型效果受限的问题,特别是如何有效融合动作单元(AUs)的先验知识以提升模型性能。其解决方案的关键在于引入AU增强的DFER架构(AU-DFER),通过量化AUs对不同表情的贡献并设计权重矩阵来整合先验知识,同时通过引入AU损失函数将知识与传统深度学习网络的学习结果相结合,从而提升模型的有效性。

链接: https://arxiv.org/abs/2507.07678
作者: Feng Liu,Lingna Gu,Chen Shi,Xiaolan Fu
机构: Shanghai Jiao Tong University(上海交通大学); University of New South Wales(新南威尔士大学); East China Normal University(华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamic Facial Expression Recognition(DFER) is a rapidly evolving field of research that focuses on the recognition of time-series facial expressions. While previous research on DFER has concentrated on feature learning from a deep learning perspective, we put forward an AU-enhanced Dynamic Facial Expression Recognition architecture, namely AU-DFER, that incorporates AU-expression knowledge to enhance the effectiveness of deep learning modeling. In particular, the contribution of the Action Units(AUs) to different expressions is quantified, and a weight matrix is designed to incorporate a priori knowledge. Subsequently, the knowledge is integrated with the learning outcomes of a conventional deep learning network through the introduction of AU loss. The design is incorporated into the existing optimal model for dynamic expression recognition for the purpose of validation. Experiments are conducted on three recent mainstream open-source approaches to DFER on the principal datasets in this field. The results demonstrate that the proposed architecture outperforms the state-of-the-art(SOTA) methods without the need for additional arithmetic and generally produces improved results. Furthermore, we investigate the potential of AU loss function redesign to address data label imbalance issues in established dynamic expression datasets. To the best of our knowledge, this is the first attempt to integrate quantified AU-expression knowledge into various DFER models. We also devise strategies to tackle label imbalance, or minor class problems. Our findings suggest that employing a diverse strategy of loss function design can enhance the effectiveness of DFER. This underscores the criticality of addressing data imbalance challenges in mainstream datasets within this domain. The source code is available at this https URL.
zh
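
文中“以权重矩阵整合 AU 先验、再通过 AU 损失与常规分类损失联合训练”的思路,可用如下草图示意(W 的构造与权衡系数 lam 均为假设,且假定 AU 概率与先验均已归一化到 [0,1]):

```python
import torch
import torch.nn.functional as F

def au_dfer_loss(expr_logits, au_probs, expr_labels, W, lam=0.5):
    """expr_logits: (B, n_expr) 表情分类logits;au_probs: (B, n_au) 预测的AU激活概率;
    W: (n_expr, n_au) 先验权重矩阵,量化各AU对各表情的贡献(已归一化到[0,1])。"""
    ce = F.cross_entropy(expr_logits, expr_labels)        # 传统分类损失
    target_au = W[expr_labels]                            # 各样本对应表情的AU先验模式
    au = F.binary_cross_entropy(au_probs, target_au)      # AU损失:向先验模式对齐
    return ce + lam * au                                  # lam 为假设的权衡系数
```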

[CV-50] Attend-and-Refine: Interactive keypoint estimation and quantitative cervical vertebrae analysis for bone age assessment

【速读】:该论文旨在解决儿科正畸学中准确评估生长潜力的问题,以制定有效的治疗策略。其解决方案的关键在于通过侧位头影测量片全面分析颈椎骨成熟度(CVM)特征,从而预测生长高峰期。为提高关键点标注的效率与准确性,研究引入了Attend-and-Refine Network (ARNet),该模型通过交互引导的重校准网络和形态感知损失函数,实现对用户反馈的自适应响应,显著降低了手动标注的工作量。

链接: https://arxiv.org/abs/2507.07670
作者: Jinhee Kim,Taesung Kim,Taewoo Kim,Dong-Wook Kim,Byungduk Ahn,Yoon-Ji Kim,In-Seok Song,Jaegul Choo
机构: KAIST(韩国科学技术院); Korea University Anam Hospital(高丽大学安岩医院); Papa’s Dental Clinic(爸爸牙科诊所); Asan Medical Center, University of Ulsan College of Medicine(蔚山大学医学院峨山医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Medical Image Analysis (2025)

点击查看摘要

Abstract:In pediatric orthodontics, accurate estimation of growth potential is essential for developing effective treatment strategies. Our research aims to predict this potential by identifying the growth peak and analyzing cervical vertebra morphology solely through lateral cephalometric radiographs. We accomplish this by comprehensively analyzing cervical vertebral maturation (CVM) features from these radiographs. This methodology provides clinicians with a reliable and efficient tool to determine the optimal timings for orthodontic interventions, ultimately enhancing patient outcomes. A crucial aspect of this approach is the meticulous annotation of keypoints on the cervical vertebrae, a task often challenged by its labor-intensive nature. To mitigate this, we introduce Attend-and-Refine Network (ARNet), a user-interactive, deep learning-based model designed to streamline the annotation process. ARNet features Interaction-guided recalibration network, which adaptively recalibrates image features in response to user feedback, coupled with a morphology-aware loss function that preserves the structural consistency of keypoints. This novel approach substantially reduces manual effort in keypoint identification, thereby enhancing the efficiency and accuracy of the process. Extensively validated across various datasets, ARNet demonstrates remarkable performance and exhibits wide-ranging applicability in medical imaging. In conclusion, our research offers an effective AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics, marking a significant advancement in the field.
zh

[CV-51] MolCLIP: A Molecular-Auxiliary CLIP Framework for Identifying Drug Mechanism of Action Based on Time-Lapsed Mitochondrial Images

【速读】:该论文旨在解决现有深度学习方法在药物作用机制(MoA)识别中过于关注空间特征而忽视活细胞时间动态变化的问题。其关键解决方案是提出MolCLIP,这是首个将显微细胞视频与分子模态相结合的视觉语言模型,通过设计分子辅助的CLIP框架来引导视频特征学习分子潜在空间的分布,并结合度量学习策略优化视频特征的聚合,从而提升药物识别和MoA识别的性能。

链接: https://arxiv.org/abs/2507.07663
作者: Fengqian Pang(1),Chunyue Lei(1),Hongfei Zhao(2),Chenghao Liu(3),Zhiqiang Xing(1),Huafeng Wang(1),Chuyang Ye(3) ((1) North China University of Technology, (2) Beijing Neusoft Medical Equipment CO., Ltd, (3) Beijing Institute of Technology)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Drug Mechanism of Action (MoA) mainly investigates how drug molecules interact with cells, which is crucial for drug discovery and clinical application. Recently, deep learning models have been used to recognize MoA by relying on high-content and fluorescence images of cells exposed to various drugs. However, these methods focus on spatial characteristics while overlooking the temporal dynamics of live cells. Time-lapse imaging is more suitable for observing the cell response to drugs. Additionally, drug molecules can trigger cellular dynamic variations related to specific MoA. This indicates that the drug molecule modality may complement the image counterpart. This paper proposes MolCLIP, the first visual language model to combine microscopic cell video- and molecule-modalities. MolCLIP designs a molecule-auxiliary CLIP framework to guide video features in learning the distribution of the molecular latent space. Furthermore, we integrate a metric learning strategy with MolCLIP to optimize the aggregation of video features. Experimental results on the MitoDataset demonstrate that MolCLIP achieves improvements of 51.2% and 20.5% in mAP for drug identification and MoA recognition, respectively.
zh

[CV-52] Bridging the gap in FER: addressing age bias in deep learning

【速读】:该论文试图解决深度面部表情识别(FER)模型中存在的年龄相关偏差问题,特别是对老年人群体的识别不公平性。其解决方案的关键在于通过引入三种偏差缓解策略——多任务学习、多模态输入和年龄加权损失函数,结合带有自动估计年龄标签的大规模数据集进行训练,并在平衡的基准数据集上进行验证,从而提升模型在不同年龄群体中的识别准确性和公平性。

链接: https://arxiv.org/abs/2507.07638
作者: F. Xavier Gaya-Morey,Julia Sanchez-Perez,Cristina Manresa-Yee,Jose M. Buades-Rubio
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial Expression Recognition (FER) systems based on deep learning have achieved impressive performance in recent years. However, these models often exhibit demographic biases, particularly with respect to age, which can compromise their fairness and reliability. In this work, we present a comprehensive study of age-related bias in deep FER models, with a particular focus on the elderly population. We first investigate whether recognition performance varies across age groups, which expressions are most affected, and whether model attention differs depending on age. Using Explainable AI (XAI) techniques, we identify systematic disparities in expression recognition and attention patterns, especially for “neutral”, “sadness”, and “anger” in elderly individuals. Based on these findings, we propose and evaluate three bias mitigation strategies: Multi-task Learning, Multi-modal Input, and Age-weighted Loss. Our models are trained on a large-scale dataset, AffectNet, with automatically estimated age labels and validated on balanced benchmark datasets that include underrepresented age groups. Results show consistent improvements in recognition accuracy for elderly individuals, particularly for the most error-prone expressions. Saliency heatmap analysis reveals that models trained with age-aware strategies attend to more relevant facial regions for each age group, helping to explain the observed improvements. These findings suggest that age-related bias in FER can be effectively mitigated using simple training modifications, and that even approximate demographic labels can be valuable for promoting fairness in large-scale affective computing systems.
zh

[CV-53] T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates

【速读】:该论文试图解决在超低比特率(ULB)场景下,现有生成式视频编码方法因领域特定性或对高层文本指导的过度依赖而导致运动细节丢失和重建不真实的问题。其解决方案的关键在于提出一种轨迹引导的生成式视频编码框架(T-GVC),通过语义感知的稀疏运动采样流程,将低层运动跟踪与高层语义理解有效结合,利用基于语义重要性的像素级运动稀疏轨迹点来显著降低码率并保留关键时间语义信息,同时引入轨迹对齐损失约束以实现无需训练的潜在空间指导机制,确保物理上合理的运动模式。

链接: https://arxiv.org/abs/2507.07633
作者: Zhitao Wang,Hengyu Man,Wenrui Li,Xingtao Wang,Xiaopeng Fan,Debin Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, aiming to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline to effectively bridge low-level motion tracking with high-level semantic understanding by extracting pixel-wise motion as sparse trajectory points based on their semantic importance, not only significantly reducing the bitrate but also preserving critical temporal semantic information. In addition, by incorporating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free latent space guidance mechanism to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Furthermore, additional experiments confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.
zh

[CV-54] Capture Stage Environments: A Guide to Better Matting

【速读】:该论文试图解决在捕获阶段(capture stage)中图像抠像(matting)所面临的特殊挑战,这些挑战使得现有的通用抠像算法在电影、游戏等高精度媒体应用中表现不佳。论文的关键解决方案是提出一种针对捕获阶段内容特性的优化流程,通过分析和总结此类内容的独特属性,提供实践指南以改进工作流,并展示一种高效的管道,能够在无需大量标注数据的情况下,将最先进的方法适配到特定的捕获设置中,从而有效缓解未解难题。

链接: https://arxiv.org/abs/2507.07623
作者: Hannah Dröge,Janelle Pfeifer,Saskia Rabich,Markus Plack,Reinhard Klein,Matthias B. Hullin
机构: University of Bonn(波恩大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Capture stages are high-end sources of state-of-the-art recordings for downstream applications in movies, games, and other media. One crucial step in almost all pipelines is the matting of images to isolate the captured performances from the background. While common matting algorithms deliver remarkable performance in other applications like teleconferencing and mobile entertainment, we found that they struggle significantly with the peculiarities of capture stage content. The goal of our work is to share insights into those challenges as a curated list of those characteristics along with a constructive discussion for proactive intervention and present a guideline to practitioners for an improved workflow to mitigate unresolved challenges. To this end, we also demonstrate an efficient pipeline to adapt state-of-the-art approaches to such custom setups without the need of extensive annotations, both offline and real-time. For an objective evaluation, we propose a validation methodology based on a leading diffusion model that highlights the benefits of our approach.
zh

[CV-55] ViLU: Learning Vision-Language Uncertainties for Failure Prediction

【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)中可靠的不确定性量化(Uncertainty Quantification, UQ)与故障预测问题。其解决方案的关键在于提出ViLU框架,该框架通过交叉注意力机制整合视觉嵌入、预测的文本嵌入以及图像条件下的文本表示,构建具备不确定性感知能力的多模态表示。与传统基于损失预测的UQ方法不同,ViLU将不确定性预测器训练为一个二分类器,以加权二元交叉熵损失区分正确与错误预测,因而与具体任务损失解耦(loss-agnostic)。

链接: https://arxiv.org/abs/2507.07620
作者: Marc Lafon,Yannis Karmim,Julio Silva-Rodriguez,Paul Couairon,Clément Rambour,Raphaël Fournier-Sniehotta,Ismail Ben Ayed,Jose Dolz,Nicolas Thome
机构: Conservatoire National des Arts et Métiers, CEDRIC, F-75141 Paris, France; Sorbonne Université, CNRS, ISIR, F-75005 Paris, France; ETS Montreal, Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: this https URL.
zh
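
作为示意,下面以交叉注意力融合三路嵌入、再接加权 BCE 二分类头,勾勒 ViLU 式的故障预测器(类名、头数与 pos_weight 取值均为假设,结构细节以原文为准):

```python
import torch
import torch.nn as nn

class FailurePredictor(nn.Module):
    """查询为视觉嵌入,键/值为预测文本嵌入与图像条件文本表示,
    交叉注意力融合后由线性层输出“预测是否错误”的logit。"""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, img_emb, pred_txt_emb, cond_txt_emb):
        kv = torch.stack([pred_txt_emb, cond_txt_emb], dim=1)   # (B, 2, D)
        fused, _ = self.attn(img_emb.unsqueeze(1), kv, kv)      # (B, 1, D)
        return self.head(fused.squeeze(1)).squeeze(-1)          # (B,)

# 加权二元交叉熵:pos_weight 用于平衡“正确/错误”样本的不均衡(取值为假设)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(3.0))
```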

[CV-56] LOSC: LiDAR Open-voc Segmentation Consolidator

【速读】:该论文试图解决在驾驶场景中对激光雷达扫描进行开放词汇分割的问题:传统做法将图像语义反投影到3D点云上获得标签,但所得点标签既有噪声又稀疏。解决方案的关键在于对基于图像的视觉-语言模型(Vision-Language Models, VLMs)产生的标签进行整合,使其满足时空一致性并对图像级增强保持鲁棒,随后基于整合后的标签训练3D网络。该方法称为LOSC,在nuScenes和SemanticKITTI上均以明显优势超越现有最先进的零样本开放词汇语义分割与全景分割方法。

链接: https://arxiv.org/abs/2507.07605
作者: Nermin Samet,Gilles Puy,Renaud Marlet
机构: Valeo.ai(瓦莱奥人工智能); LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS(加斯帕尔·蒙日信息实验室,巴黎国立路桥学校,古斯塔夫·埃菲尔大学,法国国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.
zh
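
至于“把多帧反投影得到的噪声点标签整合为时空一致的伪标签”,可以用最朴素的多数投票来理解(草图;投票与阈值策略为摘要之外的假设,LOSC 的具体整合方式以原文为准):

```python
import numpy as np

def consolidate_labels(votes: np.ndarray, min_conf: float = 0.5) -> np.ndarray:
    """votes: (N, n_classes),每个3D点在不同帧/不同图像增强下累计的类别投票数。
    多数投票给出标签;无投票或占比过低的点标记为 -1(ignore),再用于训练3D网络。"""
    total = votes.sum(axis=1)
    labels = votes.argmax(axis=1)
    conf = votes.max(axis=1) / np.maximum(total, 1)
    labels[(total == 0) | (conf < min_conf)] = -1    # 阈值为假设值
    return labels
```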

[CV-57] HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking

【速读】:该论文旨在解决视频目标跟踪任务中的挑战,包括遮挡、背景杂乱和目标重新出现等问题。其解决方案的关键在于引入一种分层运动估计策略,结合轻量级线性预测与选择性非线性精调,以在不增加额外训练成本的情况下提升跟踪精度。此外,通过区分长短期记忆帧优化记忆库,增强了在长期遮挡和外观变化下的跟踪可靠性。

链接: https://arxiv.org/abs/2507.07603
作者: Ruixiang Chen,Guolei Sun,Yawei Li,Jie Qin,Luca Benini
机构: ETH Zurich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents enhancements to the SAM2 framework for video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our trainless, low-overhead improvements for boosting long-term tracking performance. The code is available at this https URL.
zh
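
“轻量级线性预测 + 选择性非线性精调”的分层运动估计大致如下(回调接口与阈值均为假设):

```python
import numpy as np

def hierarchical_predict(centers, conf, refine_fn=None, tau=0.6):
    """centers: [(x, y), ...] 最近若干帧目标框中心;conf: 当前跟踪置信度。
    第一层:匀速线性外推,零额外训练成本;置信度低于阈值时才触发第二层非线性精调。"""
    c = np.asarray(centers, dtype=np.float32)
    velocity = c[-1] - c[-2]              # 常速度假设
    pred = c[-1] + velocity               # 线性预测
    if refine_fn is not None and conf < tau:
        pred = refine_fn(c, pred)         # 选择性非线性精调(假设的回调接口)
    return pred

# 用法:pred_center = hierarchical_predict([(10, 12), (12, 13), (14, 14)], conf=0.9)
```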

[CV-58] Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model

【速读】:该论文试图解决多视角下生成一致且高质量的头发迁移问题,这是实现数字人类和虚拟角色等实际应用中的关键挑战。解决方案的关键在于提出Stable-Hair v2,这是一个基于扩散模型的多视角头发迁移框架,首次利用多视角扩散模型实现鲁棒、高保真和视角一致的头发迁移。其核心创新包括构建全面的多视角训练数据生成管道以及集成极角嵌入和时间注意力层以确保姿态条件下的稳定性和视角间的平滑过渡。

链接: https://arxiv.org/abs/2507.07591
作者: Kuiyuan Sun,Yuxuan Zhang,Jichao Zhang,Jiaming Liu,Wei Wang,Niculae Sebe,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); Ocean University of China (中国海洋大学); The Chinese University of Hong Kong (香港中文大学); Tiamat AI (天马AI); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:While diffusion-based methods have shown impressive capabilities in capturing diverse and complex hairstyles, their ability to generate consistent and high-quality multi-view outputs – crucial for real-world applications such as digital humans and virtual avatars – remains underexplored. In this paper, we propose Stable-Hair v2, a novel diffusion-based multi-view hair transfer framework. To the best of our knowledge, this is the first work to leverage multi-view diffusion models for robust, high-fidelity, and view-consistent hair transfer across multiple perspectives. We introduce a comprehensive multi-view training data generation pipeline comprising a diffusion-based Bald Converter, a data-augment inpainting model, and a face-finetuned multi-view diffusion model to generate high-quality triplet data, including bald images, reference hairstyles, and view-aligned source-bald pairs. Our multi-view hair transfer model integrates polar-azimuth embeddings for pose conditioning and temporal attention layers to ensure smooth transitions between views. To optimize this model, we design a novel multi-stage training strategy consisting of pose-controllable latent IdentityNet training, hair extractor training, and temporal attention training. Extensive experiments demonstrate that our method accurately transfers detailed and realistic hairstyles to source subjects while achieving seamless and consistent results across views, significantly outperforming existing methods and establishing a new benchmark in multi-view hair transfer. Code is publicly available at this https URL.
zh

[CV-59] HOTA: Hierarchical Overlap-Tiling Aggregation for Large-Area 3D Flood Mapping

【速读】:该论文旨在解决现有洪水监测产品在空间细节与覆盖范围之间的权衡问题,以及普遍忽略洪水深度信息的缺陷。其关键解决方案是提出HOTA(Hierarchical Overlap-Tiling Aggregation)方法,这是一种无需修改网络权重或重新训练的多尺度推理策略,通过在推理阶段对多光谱Sentinel-2图像应用不同尺寸的重叠瓦片,使SegFormer模型能够同时捕捉局部特征和公里级淹没范围,结合双约束深度估计模块,实现精确的三维洪水制图。

链接: https://arxiv.org/abs/2507.07585
作者: Wenfeng Jia,Bin Liang,Yuxi Lu,Attavit Wilaiwongsakul,Muhammad Arif Khan,Lihong Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Floods are among the most frequent natural hazards and cause significant social and economic damage. Timely, large-scale information on flood extent and depth is essential for disaster response; however, existing products often trade spatial detail for coverage or ignore flood depth altogether. To bridge this gap, this work presents HOTA: Hierarchical Overlap-Tiling Aggregation, a plug-and-play, multi-scale inference strategy. When combined with SegFormer and a dual-constraint depth estimation module, this approach forms a complete 3D flood-mapping pipeline. HOTA applies overlapping tiles of different sizes to multispectral Sentinel-2 images only during inference, enabling the SegFormer model to capture both local features and kilometre-scale inundation without changing the network weights or retraining. The subsequent depth module is based on a digital elevation model (DEM) differencing method, which refines the 2D mask and estimates flood depth by enforcing (i) zero depth along the flood boundary and (ii) near-constant flood volume with respect to the DEM. A case study on the March 2021 Kempsey (Australia) flood shows that HOTA, when coupled with SegFormer, improves IoU from 73% (U-Net baseline) to 84%. The resulting 3D surface achieves a mean absolute boundary error of less than 0.5 m. These results demonstrate that HOTA can produce accurate, large-area 3D flood maps suitable for rapid disaster response.
zh
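
HOTA 的要点是只在推理阶段对影像做多尺寸重叠切片,并把各切片的 logits 按覆盖次数平均,网络权重保持不变。下面是一个假设接口下的草图(假定影像各边不小于最大瓦片,model(tile) 返回逐像素 logits):

```python
import numpy as np

def hota_predict(image, model, tile_sizes=(512, 1024), overlap=0.5, n_classes=2):
    """image: (H, W, C) 多光谱影像;model(tile) 返回 (n_classes, t, t) 的逐像素logits。
    对每个尺度做重叠滑窗推理,重叠区域的logits累加后按覆盖次数平均。"""
    H, W = image.shape[:2]
    acc = np.zeros((n_classes, H, W), dtype=np.float32)
    cnt = np.zeros((1, H, W), dtype=np.float32)
    for t in tile_sizes:
        stride = max(int(t * (1 - overlap)), 1)
        ys = list(range(0, H - t + 1, stride))
        xs = list(range(0, W - t + 1, stride))
        if ys and ys[-1] != H - t: ys.append(H - t)   # 补齐底边
        if xs and xs[-1] != W - t: xs.append(W - t)   # 补齐右边
        for y in ys:
            for x in xs:
                acc[:, y:y + t, x:x + t] += model(image[y:y + t, x:x + t])
                cnt[:, y:y + t, x:x + t] += 1
    return acc / np.maximum(cnt, 1)    # (n_classes, H, W) 融合后的logits
```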

[CV-60] NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning

【速读】:该论文旨在解决工业异常检测中的领域偏移(domain-shift)问题,即模型在不同领域数据上的泛化能力不足。其解决方案的关键在于提出一种基于视觉基础模型的少样本跨领域异常检测框架NexViTAD,通过创新的共享子空间投影机制和多任务学习(MTL)模块实现有效的跨领域知识迁移与特征融合。

链接: https://arxiv.org/abs/2507.07579
作者: Tianwei Mu,Feiyu Duan,Bo Zhou,Dan Xue,Manhong Huang
机构: Guangzhou Institute of Industrial Intelligence (广州工业智能研究所); Shenyang University of Technology (沈阳工业大学); School of Information Science and Engineering (信息科学与工程学院); Donghua University (东华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a novel few-shot cross-domain anomaly detection framework, Nexus Vision Transformer for Anomaly Detection (NexViTAD), based on vision foundation models, which effectively addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms and a multi-task learning (MTL) module. The main innovations include: (1) a hierarchical adapter module that adaptively fuses complementary features from Hiera and DINO-v2 pre-trained models, constructing more robust feature representations; (2) a shared subspace projection strategy that enables effective cross-domain knowledge transfer through bottleneck dimension constraints and skip connection mechanisms; (3) an MTL decoder architecture that supports simultaneous processing of multiple source domains, significantly enhancing model generalization capabilities; (4) an anomaly score inference method based on Sinkhorn-K-means clustering, combined with Gaussian filtering and adaptive thresholding for precise pixel-level anomaly localization. Evaluated on the MVTec AD dataset, NexViTAD delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains, surpassing other recent models and marking a transformative advance in cross-domain defect detection.
zh

[CV-61] Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation

【速读】:该论文旨在解决弱监督语义分割在低光照环境下的性能下降问题,这一问题主要由图像质量退化(如低对比度、噪声和颜色失真)以及弱监督的固有约束共同导致,进而引发不可靠的类别激活图和语义模糊的伪标签。其解决方案的关键在于提出了一种名为DGKD-WLSS的框架,该框架通过结合扩散引导的知识蒸馏(DGKD)与深度引导的特征融合(DGF2),实现正常光照与低光照特征的对齐,并利用深度图作为光照不变的几何先验来增强结构特征学习。

链接: https://arxiv.org/abs/2507.07578
作者: Chunyan Wang,Dong Zhang,Jinhui Tang
机构: Nanjing University of Science and Technology(南京理工大学); The Hong Kong University of Science and Technology(香港科技大学); Nanjing Forestry University(南京林业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model’s ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at: this https URL.
zh

[CV-62] Beyond the Linear Separability Ceiling

【速读】:该论文试图解决视觉-语言模型(Visual-Language Models, VLMs)在抽象推理任务中受到的“线性可分性瓶颈”问题,即模型表现难以超越在其视觉嵌入上训练的简单线性分类器。解决方案的关键在于指出这一瓶颈并非源于感知能力不足,而是源于语言模型推理路径的缺陷,并且属于可解决的对齐问题:对于语义概念,只需激活模型中已有的(休眠)推理路径;而复杂关系推理则需要调整核心模型权重。此外,研究还表明,虽然提升表示质量可以改善嵌入的可分性,但模型在面对新的提示格式时仍可能失效,这说明鲁棒推理更依赖于有针对性的对齐,而非单纯的表示学习改进。

链接: https://arxiv.org/abs/2507.07574
作者: Enrico Vompa,Tanel Tammet,Mohit Vaishnav
机构: Tallinn University of Technology (塔林理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most state-of-the-art Visual-Language Models (VLMs) are seemingly limited by the linear separability of their visual embeddings on abstract reasoning tasks. This work investigates this “linear reasoning bottleneck” by introducing the Linear Separability Ceiling (LSC), the performance of a simple linear classifier on a VLM’s visual embeddings. We find this bottleneck is widespread and stems not from poor perception, but from failures in the language model’s reasoning pathways. We demonstrate this is a solvable alignment issue. The required intervention, however, is task-dependent: activating existing pathways suffices for semantic concepts, while complex relational reasoning requires adapting core model weights. Using postfix tuning as a methodological control, we find strong evidence for powerful, dormant reasoning pathways within VLMs. However, for complex relational tasks requiring deeper adaptation, explicitly improving representation quality causes the model to fail on new prompt formats despite its embeddings remaining well separated. Ultimately, this work provides a new lens for VLM analysis, showing that robust reasoning is a matter of targeted alignment, not simply improved representation learning.
zh

[CV-63] MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models

【速读】:该论文试图解决远程感知基础模型在应用时与任务模态之间的不匹配问题,以及大尺寸模型在特定任务小数据集上微调和部署的困难。其解决方案的关键在于提出MAPEX,一个基于模态专家混合(mixture-of-modality experts)的远程感知基础模型,该模型通过一种新颖的模态条件令牌路由机制,激发特定模态的专家。为应用于具体任务,论文还提出了一个模态感知剪枝技术,仅保留针对任务模态的专业化专家,从而实现高效的模态特定模型,并简化微调与部署过程。

链接: https://arxiv.org/abs/2507.07527
作者: Joelle Hanna,Linus Scheibenreif,Damian Borth
机构: AIML Lab, School of Computer Science, University of St.Gallen (AIML 实验室,计算机科学学院,圣加仑大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing data is commonly used for tasks such as flood mapping, wildfire detection, or land-use studies. For each task, scientists carefully choose appropriate modalities or leverage data from purpose-built instruments. Recent work on remote sensing foundation models pre-trains computer vision models on large amounts of remote sensing data. These large-scale models tend to focus on specific modalities, often optical RGB or multispectral data. For many important applications, this introduces a mismatch between the application modalities and the pre-training data. Moreover, the large size of foundation models makes them expensive and difficult to fine-tune on typically small datasets for each task. We address this mismatch with MAPEX, a remote sensing foundation model based on mixture-of-modality experts. MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism that elicits modality-specific experts. To apply the model on a specific task, we propose a modality aware pruning technique, which only retains experts specialized for the task modalities. This yields efficient modality-specific models while simplifying fine-tuning and deployment for the modalities of interest. We experimentally validate MAPEX on diverse remote sensing datasets and show strong performance compared to fully supervised training and state-of-the-art remote sensing foundation models. Code is available at this https URL.
zh

[CV-64] Spline Deformation Field

【速读】:该论文旨在解决密集点轨迹建模中由于神经网络固有归纳偏置导致的空间不一致问题,以及现有方法在处理稀疏时间信号插值时的不足。其解决方案的关键在于提出一种基于样条的轨迹表示方法,通过显式控制节点数量来确定自由度,从而实现速度和加速度的高效解析推导,保持空间一致性并减少时间波动,同时引入一种新颖的低秩时变空间编码以建模节点特征,替代传统的耦合时空技术。

链接: https://arxiv.org/abs/2507.07521
作者: Mingyang Song,Yang Zhang,Marko Mihajlovic,Siyu Tang,Markus Gross,Tunç Ozan Aydın
机构: Disney Research|Studios, Switzerland(迪士尼研究|工作室,瑞士); ETH Zürich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trajectory modeling of dense points usually employs implicit deformation fields, represented as neural networks that map coordinates to relate canonical spatial positions to temporal offsets. However, the inductive biases inherent in neural networks can hinder spatial coherence in ill-posed scenarios. Current methods focus either on enhancing encoding strategies for deformation fields, often resulting in opaque and less intuitive models, or adopt explicit techniques like linear blend skinning, which rely on heuristic-based node initialization. Additionally, the potential of implicit representations for interpolating sparse temporal signals remains under-explored. To address these challenges, we propose a spline-based trajectory representation, where the number of knots explicitly determines the degrees of freedom. This approach enables efficient analytical derivation of velocities, preserving spatial coherence and accelerations, while mitigating temporal fluctuations. To model knot characteristics in both spatial and temporal domains, we introduce a novel low-rank time-variant spatial encoding, replacing conventional coupled spatiotemporal techniques. Our method demonstrates superior performance in temporal interpolation for fitting continuous fields with sparse inputs. Furthermore, it achieves competitive dynamic scene reconstruction quality compared to state-of-the-art methods while enhancing motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints.
zh
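
“节点数显式决定自由度、速度与加速度可解析求导”这一点,可用三次样条直接演示(节点数与节点位移均为示例数据,并非论文的参数化细节):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# 6 个时间节点即 6 组控制自由度;节点处的3D位移为待优化参数(此处以随机数示意)
knot_t = np.linspace(0.0, 1.0, 6)
knot_xyz = np.random.rand(6, 3)

traj = CubicSpline(knot_t, knot_xyz, axis=0)
t = np.linspace(0.0, 1.0, 100)
position = traj(t)        # (100, 3) 轨迹
velocity = traj(t, 1)     # 一阶导:解析速度,无需数值差分
accel = traj(t, 2)        # 二阶导:解析加速度,可用于抑制时间波动
```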

[CV-65] MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation

【速读】:该论文旨在解决动态场景中4D物体分割的挑战,这一领域由于缺乏足够广泛且精确标注的多视角视频数据集而研究不足。解决方案的关键在于提出MUVOD数据集,这是一个用于训练和评估重建现实场景中物体分割的多视角视频数据集,包含17个场景、7830张RGB图像及其对应的4D运动分割掩码,能够支持跨时间帧或不同视角的物体追踪。此外,论文还引入了一种评估指标和基线分割方法,以促进该领域的进展,并提出了一个基于MUVOD数据集子集的3D物体分割新基准。

链接: https://arxiv.org/abs/2507.07519
作者: Bangning Wei,Joshua Maraval,Meriem Outtas,Kidiyo Kpalma,Nicolas Ramin,Lu Zhang
机构: Institute of Research and Technology b<>com(b<>com 技术研究院); Univ Rennes(雷恩大学); INSA Rennes(雷恩国立应用科学学院); CNRS(法国国家科学研究中心); IETR - UMR 6164(雷恩电子与电信研究所 - UMR 6164)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) have steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods. Our proposed MUVOD dataset is available at this https URL.
zh

[CV-66] GGMotion: Group Graph Dynamics-Kinematics Networks for Human Motion Prediction

【速读】:该论文旨在解决现有方法在表示人体姿态时忽略关节之间的内在物理依赖性,导致学习难度增加且模型容易生成不现实运动的问题。其解决方案的关键在于提出GGMotion,一种基于群体图动力学-运动学网络的模型,通过将人体拓扑结构分组建模,更好地利用动力学和运动学先验。该方法引入了一种新的径向场以保持三维空间中的几何等变性,并通过空间和时间边聚合关节特征来捕捉更全面的时空依赖关系,同时采用组间与组内交互模块以捕获不同尺度的关节依赖性,结合等变多层感知机(MLP)实现各组中关节位置特征的并行化动力学-运动学传播,从而提升物理合理性。

链接: https://arxiv.org/abs/2507.07515
作者: Shuaijin Wan,Huaijiang Sun
机构: School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human motion is a continuous physical process in 3D space, governed by complex dynamic and kinematic constraints. Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. In this paper, we propose GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. To preserve the geometric equivariance in 3D space, we propose a novel radial field for the graph network that captures more comprehensive spatio-temporal dependencies by aggregating joint features through spatial and temporal edges. Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Combined with equivariant multilayer perceptrons (MLP), joint position features are updated in each group through parallelized dynamics-kinematics propagation to improve physical plausibility. Meanwhile, we introduce an auxiliary loss to supervise motion priors during training. Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of our approach, achieving a significant performance margin in short-term motion prediction. The code is available at this https URL.
zh

[CV-67] Divergence Minimization Preference Optimization for Diffusion Model Alignment

【速读】:该论文试图解决扩散模型在生成过程中与人类偏好对齐的问题,特别是现有偏好优化方法容易陷入次优的均值导向优化。解决方案的关键在于引入一种名为Divergence Minimization Preference Optimization (DMPO)的新方法,通过最小化反向KL散度实现对齐,该方法在渐近意义上与原始强化学习具有相同的优化方向,从而更有效地引导生成行为与期望输出一致。

链接: https://arxiv.org/abs/2507.07510
作者: Binxu Li,Minkai Xu,Meihua Dang,Stefano Ermon
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 24 pages, 8 figures

点击查看摘要

Abstract:Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically outperforming all existing diffusion alignment baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating the method’s superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
zh
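
从散度最小化角度,DMPO 最小化反向 KL 散度 KL(p_model || p*),其中目标分布 p* 正比于 p_ref · exp(reward / beta)。下面给出一个 REINFORCE 形式的单步损失草图(仅示意优化方向,beta 取值与训练流程均为假设,非论文实现):

```python
import torch

def reverse_kl_loss(logp_model, logp_ref, reward, beta=0.1):
    """logp_model / logp_ref: 当前模型与参考模型对同一批自采样样本的对数概率 (B,);
    reward: 偏好奖励 (B,)。目标分布 p* ∝ p_ref * exp(reward / beta)。
    下式是最小化 KL(p_model || p*) 的策略梯度估计(基线项省略)。"""
    advantage = (reward / beta - (logp_model - logp_ref)).detach()
    return -(advantage * logp_model).mean()
```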

[CV-68] Semi-supervised learning and integration of multi-sequence MR-images for carotid vessel wall and plaque segmentation

【速读】:该论文旨在解决多序列磁共振成像(MRI)数据中颈动脉血管壁及斑块的准确分割问题,特别是在斑块复杂形态和标注数据稀缺的情况下。其解决方案的关键在于提出一种半监督深度学习方法,该方法结合了粗略定位模型与精细分割模型,并引入了多级多序列U-Net架构以有效融合不同MRI序列的互补信息。此外,通过在多种输入变换下强制模型保持一致性,解决了标注数据有限和MRI图像复杂性带来的挑战。

链接: https://arxiv.org/abs/2507.07496
作者: Marie-Christine Pali,Christina Schwaiger,Malik Galijasevic,Valentin K. Ladenhauf,Stephanie Mangesius,Elke R. Gizewski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The analysis of carotid arteries, particularly plaques, in multi-sequence Magnetic Resonance Imaging (MRI) data is crucial for assessing the risk of atherosclerosis and ischemic stroke. In order to evaluate metrics and radiomic features, quantifying the state of atherosclerosis, accurate segmentation is important. However, the complex morphology of plaques and the scarcity of labeled data poses significant challenges. In this work, we address these problems and propose a semi-supervised deep learning-based approach designed to effectively integrate multi-sequence MRI data for the segmentation of carotid artery vessel wall and plaque. The proposed algorithm consists of two networks: a coarse localization model identifies the region of interest guided by some prior knowledge on the position and number of carotid arteries, followed by a fine segmentation model for precise delineation of vessel walls and plaques. To effectively integrate complementary information across different MRI sequences, we investigate different fusion strategies and introduce a multi-level multi-sequence version of U-Net architecture. To address the challenges of limited labeled data and the complexity of carotid artery MRI, we propose a semi-supervised approach that enforces consistency under various input transformations. Our approach is evaluated on 52 patients with arteriosclerosis, each with five MRI sequences. Comprehensive experiments demonstrate the effectiveness of our approach and emphasize the role of fusion point selection in U-Net-based architectures. To validate the accuracy of our results, we also include an expert-based assessment of model performance. Our findings highlight the potential of fusion strategies and semi-supervised learning for improving carotid artery segmentation in data-limited MRI applications.
zh

[CV-69] Driving by Hybrid Navigation: An Online HD-SD Map Association Framework and Benchmark for Autonomous Vehicles

【速读】:该论文旨在解决自主车辆在混合导航中在线高精度地图(HD maps)与全局标准精度地图(SD maps)之间关联不足的问题,从而提升自主车辆的路径规划能力。其解决方案的关键在于提出了一种名为Map Association Transformer的新框架,该框架通过路径感知注意力和空间注意力机制,实现了对几何和拓扑对应关系的理解,以支持在线地图的高效关联。

链接: https://arxiv.org/abs/2507.07487
作者: Jiaxu Wan,Xu Wang,Mengwei Xie,Xinyuan Chang,Xinran Liu,Zheng Pan,Mu Xu,Ding Yuan
机构: Beihang University (北京航空航天大学); Amap, Alibaba Group (高德地图,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 10 figures, 9 tables

点击查看摘要

Abstract:Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, posing challenges to utilizing online HD maps in the real world. Observing the lack of this navigation capability in autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for the association of hybrid navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, OMA contains 480k roads and 260k lane paths and provides the corresponding metrics to evaluate the performance of the model. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at this https URL.
zh

[CV-70] Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning ICCV2025

【速读】:该论文试图解决多任务学习(Multi-Task Learning, MTL)中因任务目标差异导致的负迁移问题,即一个任务的学习会损害另一个任务的性能。现有预训练Transformer模型虽然提升了MTL性能,但其固定的网络容量和结构限制了适应性。该论文提出的解决方案——动态令牌调制与扩展(Dynamic Token Modulation and Expansion, DTME-MTL),关键在于通过识别令牌空间中的梯度冲突并根据冲突类型应用自适应解决方案,从而提高模型的适应性并减少过拟合,且完全在令牌空间内操作,避免了参数的过度增长。

链接: https://arxiv.org/abs/2507.07485
作者: Wooseong Jeong,Kuk-Jin Yoon
机构: KAIST(韩国科学技术院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025

点击查看摘要

Abstract:Multi-Task Learning (MTL) enables multiple tasks to be learned within a shared network, but differences in objectives across tasks can cause negative transfer, where the learning of one task degrades another task’s performance. While pre-trained transformers significantly improve MTL performance, their fixed network capacity and rigid structure limit adaptability. Previous dynamic network architectures attempt to address this but are inefficient as they directly convert shared parameters into task-specific ones. We propose Dynamic Token Modulation and Expansion (DTME-MTL), a framework applicable to any transformer-based MTL architecture. DTME-MTL enhances adaptability and reduces overfitting by identifying gradient conflicts in token space and applying adaptive solutions based on conflict type. Unlike prior methods that mitigate negative transfer by duplicating network parameters, DTME-MTL operates entirely in token space, enabling efficient adaptation without excessive parameter growth. Extensive experiments demonstrate that DTME-MTL consistently improves multi-task performance with minimal computational overhead, offering a scalable and effective solution for enhancing transformer-based MTL models.
zh
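
“在令牌空间中识别梯度冲突”通常可以用任务梯度的两两余弦相似度来判定,负相似度即冲突。下面是一个检测草图(判定准则为常见做法的假设,冲突后的令牌调制/扩展策略见原文):

```python
import torch
import torch.nn.functional as F

def detect_token_gradient_conflict(task_grads: torch.Tensor):
    """task_grads: (T, D),T 个任务损失对同一token表示的梯度。
    余弦相似度为负的任务对即存在梯度冲突,可据此选择调制或扩展。"""
    g = F.normalize(task_grads, dim=-1)
    sim = g @ g.t()                                            # (T, T) 任务间相似度
    conflict_pairs = (torch.triu(sim, diagonal=1) < 0).nonzero(as_tuple=False)
    return sim, conflict_pairs                                 # 相似度矩阵与冲突任务对索引

# 用法:sim, pairs = detect_token_gradient_conflict(torch.randn(3, 768))
```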

[CV-71] mporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking ICCV2025

【速读】:该论文试图解决视频数据隐私泄露问题,即在视觉目标跟踪(VOT)任务中,未经授权的个人视频数据被用于训练商业模型所引发的隐私风险。解决方案的关键在于提出一种新颖的生成框架,用于生成时间不可学习样本(Temporal Unlearnable Examples, TUEs),该框架通过高效计算实现大规模视频数据集的可扩展性。TUEs使跟踪器在训练过程中依赖于不可学习的噪声进行时间匹配,从而忽略原始数据结构,确保视频数据的隐私性。此外,引入的时间对比损失进一步增强了TUEs对现有跟踪器学习过程的干扰效果。

链接: https://arxiv.org/abs/2507.07483
作者: Qiangqiang Wu,Yi Yu,Chenqi Kong,Ziquan Liu,Jia Wan,Haoliang Li,Alex C. Kot,Antoni B. Chan
机构: City University of Hong Kong (香港城市大学); ROSE Lab, Nanyang Technological University (南洋理工大学ROSE实验室); Queen Mary University of London (伦敦玛丽女王大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:With the rise of social media, vast amounts of user-uploaded videos (e.g., YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used for training commercial models without authorization. To alleviate these issues, this paper presents the first investigation on preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification), directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs), and whose efficient computation makes it scalable for usage on large-scale video datasets. The trackers trained w/ TUEs heavily rely on unlearnable noises for temporal matching, ignoring the original data structure and thus ensuring training video data-privacy. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when using our TUEs for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.
zh

[CV-72] SD-GS: Structured Deformable 3D Gaussians for Efficient Dynamic Scene Reconstruction

【速读】:该论文试图解决当前4D高斯框架在动态场景重建中面临的存储成本与表征复杂物理运动能力之间的固有权衡问题,这一问题显著限制了方法的实际应用。其解决方案的关键在于提出SD-GS框架,该框架通过两个核心贡献实现高效动态场景重建:首先,引入可变形锚点网格,作为一种分层且内存高效的场景表示,每个锚点在局部时空区域内生成多个3D高斯分布,并作为3D场景的几何骨架;其次,提出一种感知形变的密集化策略,自适应地在高动态区域增加锚点并减少静态区域的冗余,从而在使用更少锚点的情况下实现更优的视觉质量。

链接: https://arxiv.org/abs/2507.07465
作者: Wei Yao,Shuzhao Xie,Letian Li,Weixiang Zhang,Zhixin Lai,Shiqi Dai,Ke Zhang,Zhi Wang
机构: SIGS, Tsinghua University (清华大学深圳研究生院); Google(谷歌); Department of CST, Tsinghua University (清华大学计算机科学与技术系); Soochow University (苏州大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current 4D Gaussian frameworks for dynamic scene reconstruction deliver impressive visual fidelity and rendering speed, however, the inherent trade-off between storage costs and the ability to characterize complex physical motions significantly limits the practical application of these methods. To tackle these problems, we propose SD-GS, a compact and efficient dynamic Gaussian splatting framework for complex dynamic scene reconstruction, featuring two key contributions. First, we introduce a deformable anchor grid, a hierarchical and memory-efficient scene representation where each anchor point derives multiple 3D Gaussians in its local spatiotemporal region and serves as the geometric backbone of the 3D scene. Second, to enhance modeling capability for complex motions, we present a deformation-aware densification strategy that adaptively grows anchors in under-reconstructed high-dynamic regions while reducing redundancy in static areas, achieving superior visual quality with fewer anchors. Experimental results demonstrate that, compared to state-of-the-art methods, SD-GS achieves an average of 60% reduction in model size and an average of 100% improvement in FPS, significantly enhancing computational efficiency while maintaining or even surpassing visual quality.
zh

[CV-73] Degradation-Agnostic Statistical Facial Feature Transformation for Blind Face Restoration in Adverse Weather Conditions

【速读】:该论文旨在解决在恶劣天气条件下,由于图像质量下降导致的人脸识别准确率降低的问题。现有基于生成对抗网络(GAN)和扩散模型的面部图像修复(FIR)方法因缺乏专门处理天气引起的退化模块,导致修复后的面部纹理和结构失真。论文提出的解决方案关键在于引入两个核心组件:局部统计面部特征变换(local SFFT)和无退化特征嵌入(DAFE),通过增强面部结构与色彩保真度以及在恶劣天气下实现鲁棒的统计面部特征提取,提升修复效果。

链接: https://arxiv.org/abs/2507.07464
作者: Chang-Hwan Son
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the increasing deployment of intelligent CCTV systems in outdoor environments, there is a growing demand for face recognition systems optimized for challenging weather conditions. Adverse weather significantly degrades image quality, which in turn reduces recognition accuracy. Although recent face image restoration (FIR) models based on generative adversarial networks (GANs) and diffusion models have shown progress, their performance remains limited due to the lack of dedicated modules that explicitly address weather-induced degradations. This leads to distorted facial textures and structures. To address these limitations, we propose a novel GAN-based blind FIR framework that integrates two key components: local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE). The local SFFT module enhances facial structure and color fidelity by aligning the local statistical distributions of low-quality (LQ) facial regions with those of high-quality (HQ) counterparts. Complementarily, the DAFE module enables robust statistical facial feature extraction under adverse weather conditions by aligning LQ and HQ encoder representations, thereby making the restoration process adaptive to severe weather-induced degradations. Experimental results demonstrate that the proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Furthermore, both the SFFT and DAFE modules are empirically validated in enhancing structural fidelity and perceptual quality in face restoration under challenging weather scenarios.
zh

[CV-74] Objectomaly: Objectness-Aware Refinement for OoD Segmentation with Structural Consistency and Boundary Precision

【速读】:该论文旨在解决分布外(Out-of-Distribution, OoD)分割在安全敏感应用(如自动驾驶)中的挑战,特别是现有基于掩码的方法存在的边界不精确、物体内部异常分数不一致以及背景噪声引起的误报问题。其解决方案的关键在于提出一种对象感知的精炼框架——Objectomaly,该框架通过三个阶段实现:首先利用现有的OoD骨干网络进行粗粒度异常评分,然后借助SAM生成的实例掩码进行对象级分数校准,最后通过拉普拉斯滤波和高斯平滑实现边界精细化。该方法在多个OoD分割基准测试中取得了最先进的性能。

链接: https://arxiv.org/abs/2507.07460
作者: Jeonghoon Song,Sunghun Kim,Jaegyun Im,Byeongjoon Noh
机构: Soonchunhyang University (顺天乡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
zh
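
三阶段中 OASC 与 MBP 的核心操作足够简单,可用如下草图表示(sigma 等参数为假设;MBP 在此简化为高斯平滑,原文还结合了拉普拉斯滤波;实例掩码假定来自 SAM):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def refine_anomaly_map(coarse_score, instance_masks, sigma=1.0):
    """coarse_score: (H, W) 粗粒度异常分数;instance_masks: 布尔掩码列表,每项 (H, W)。
    OASC:同一对象内部的分数统一为该对象均值,消除对象内不一致;
    MBP(简化):高斯平滑细化轮廓。"""
    refined = coarse_score.astype(np.float32).copy()
    for mask in instance_masks:                      # 假设掩码非空
        refined[mask] = coarse_score[mask].mean()    # 对象级分数校准
    return gaussian_filter(refined, sigma=sigma)     # 边界平滑
```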

[CV-75] Bluish Veil Detection and Lesion Classification using Custom Deep Learnable Layers with Explainable Artificial Intelligence (XAI)

【速读】:该论文旨在解决皮肤病变中蓝白幕(blue-white veil, BWV)的检测问题,BWV 是黑色素瘤(melanoma)诊断中的关键特征,但相关研究目前仍较为有限。其解决方案的关键在于利用基于颜色阈值技术的成像算法将未标注的皮肤病变数据集转换为标注数据集,并设计一种深度卷积神经网络(DCNN),以自定义层替代标准激活函数层,在多个独立及组合的皮肤镜数据集上分别训练,实现对 BWV 的分类。该方法在多个数据集上均优于传统的 BWV 检测模型,且结合可解释人工智能(XAI)算法提升了模型决策的透明度与诊断可靠性。

链接: https://arxiv.org/abs/2507.07453
作者: M. A. Rasel,Sameem Abdul Kareem,Zhenli Kwan,Shin Shen Yong,Unaizah Obaidellah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted version. Published in Computers in Biology and Medicine, 14 June 2024. DOI: https://doi.org/10.1016/j.compbiomed.2024.108758

点击查看摘要

Abstract:Melanoma, one of the deadliest types of skin cancer, accounts for thousands of fatalities globally. The bluish, blue-whitish, or blue-white veil (BWV) is a critical feature for diagnosing melanoma, yet research into detecting BWV in dermatological images is limited. This study utilizes a non-annotated skin lesion dataset, which is converted into an annotated dataset using a proposed imaging algorithm based on color threshold techniques on lesion patches and color palettes. A Deep Convolutional Neural Network (DCNN) is designed and trained separately on three individual and combined dermoscopic datasets, using custom layers instead of standard activation function layers. The model is developed to categorize skin lesions based on the presence of BWV. The proposed DCNN demonstrates superior performance compared to conventional BWV detection models across different datasets. The model achieves a testing accuracy of 85.71% on the augmented PH2 dataset, 95.00% on the augmented ISIC archive dataset, 95.05% on the combined augmented (PH2+ISIC archive) dataset, and 90.00% on the Derm7pt dataset. An explainable artificial intelligence (XAI) algorithm is subsequently applied to interpret the DCNN’s decision-making process regarding BWV detection. The proposed approach, coupled with XAI, significantly improves the detection of BWV in skin lesions, outperforming existing models and providing a robust tool for early melanoma diagnosis.
zh

[CV-76] Dual Semantic-Aware Network for Noise Suppressed Ultrasound Video Segmentation

【速读】:该论文旨在解决超声视频序列中由于固有特性导致的噪声问题,从而提升自动病灶或器官分割的准确性。其解决方案的关键在于提出双语义感知网络(DSANet),通过引入相邻帧语义感知模块(AFSA)和局部-全局语义感知模块(LGSA),实现局部与全局特征之间的语义互感知,有效抑制噪声干扰并增强模型的鲁棒性。

链接: https://arxiv.org/abs/2507.07443
作者: Ling Zhou,Runtian Yuan,Yi Liu,Yuejie Zhang,Rui Feng,Shang Gao
机构: Fudan University (复旦大学); Changzhou University (常州大学); Deakin University (迪肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non-invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic-Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent-Frame Semantic-Aware (AFSA) module, which constructs a channel-wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel-level relationships. Additionally, we propose a Local-and-Global Semantic-Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. This integration facilitates multi-level semantic representation, significantly improving the model’s resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state-of-the-art methods in segmentation accuracy. Moreover, since our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models. Code can be found at this https URL.
zh
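
AFSA 模块“以通道间相似度矩阵(而非像素级对应)指导相邻帧特征融合”的做法,可示意如下(归一化与残差连接等细节为假设):

```python
import torch
import torch.nn.functional as F

def afsa_fuse(feat_t: torch.Tensor, feat_prev: torch.Tensor) -> torch.Tensor:
    """feat_t / feat_prev: (B, C, H, W) 当前帧与相邻帧特征。
    按通道展平后计算 (C, C) 相似度矩阵,用其softmax权重重组相邻帧特征,
    再与当前帧残差相加;不依赖像素级对应,从而对随机噪声更鲁棒。"""
    B, C, H, W = feat_t.shape
    ft = F.normalize(feat_t.flatten(2), dim=-1)          # (B, C, HW)
    fp = F.normalize(feat_prev.flatten(2), dim=-1)
    sim = torch.bmm(ft, fp.transpose(1, 2))              # (B, C, C) 通道相似度矩阵
    fused = torch.bmm(sim.softmax(dim=-1), feat_prev.flatten(2))
    return feat_t + fused.view(B, C, H, W)
```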

[CV-77] owards High-Resolution 3D Anomaly Detection: A Scalable Dataset and Real-Time Framework for Subtle Industrial Defects

【速读】:该论文试图解决工业点云分析中对细微异常检测的需求与现有基准侧重低分辨率输入之间的矛盾,从而提升3D异常检测的精度与效率。解决方案的关键在于提出了一种可扩展的管道以生成真实且细微的3D异常,并构建了MiniShift数据集,该数据集包含高分辨率点云数据,同时引入了Simple3D框架,通过集成多尺度邻域描述符(Multi-scale Neighborhood Descriptors, MSND)和局部特征空间聚合(Local Feature Spatial Aggregation, LFSA)实现高效且精确的几何细节捕捉,从而在保持低计算开销的同时实现超过20 fps的实时推理。

链接: https://arxiv.org/abs/2507.07435
作者: Yuqi Cheng,Yihan Sun,Hui Zhang,Weiming Shen,Yunkang Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. Employing this pipeline, we developed MiniShift, the inaugural high-resolution 3D anomaly detection dataset, encompassing 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1% of the total. We further introduce Simple3D, an efficient framework integrating Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, achieving real-time inference exceeding 20 fps. Extensive evaluations on MiniShift and established benchmarks demonstrate that Simple3D surpasses state-of-the-art methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection.
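
以下为多尺度邻域描述子(MSND)思路的一个简化示意(笔者补充;描述子的具体构成为假设,此处以多尺度 kNN 相对偏移的均值与标准差为例,并为清晰起见使用暴力 kNN,50万点规模下应改用 KD-tree 等近邻结构):

```python
import torch

def msnd(points, ks=(8, 16, 32)):
    """Multi-scale neighborhood descriptor sketch.

    points: (N, 3). For each point, summarize the relative offsets of its
    k nearest neighbors at several scales; returns (N, 6 * len(ks)).
    """
    dists = torch.cdist(points, points)                      # (N, N), brute force
    descs = []
    for k in ks:
        idx = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self
        rel = points[idx] - points[:, None, :]               # (N, k, 3) offsets
        descs.append(torch.cat([rel.mean(dim=1), rel.std(dim=1)], dim=1))
    return torch.cat(descs, dim=1)
```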
zh

[CV-78] Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning ICCV2025

【速读】:该论文试图解决多模态大语言模型(MLLMs)在复杂和结构化推理任务中的性能不足问题,尤其是在需要深度推理的决策和问题解决任务中。其解决方案的关键在于提出Corvid模型,该模型通过引入增强的思维链(CoT)推理能力、混合视觉编码器以及精心设计的跨模态连接器(GateMixer),并结合高质量的多模态CoT指令跟随数据集MCoT-Instruct-287K进行两阶段CoT格式训练,从而提升模型的逐步推理能力。此外,还提出了一个有效的推理时缩放策略,以通过自验证减少过度推理和推理不足的问题。

链接: https://arxiv.org/abs/2507.07424
作者: Jingjing Jiang,Chao Ma,Xurui Song,Hanwang Zhang,Jun Luo
机构: Shanghai Jiao Tong University (上海交通大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid’s CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: this https URL.
zh

[CV-79] EPIC: Efficient Prompt Interaction for Text-Image Classification

【速读】:该论文试图解决大规模预训练多模态模型(LMMs)在下游任务微调过程中计算成本过高的问题。其解决方案的关键在于提出一种高效的基于提示的多模态交互策略,即针对文本-图像分类任务的高效提示交互(EPIC)。该方法通过在中间层引入时间提示,并利用基于相似性的提示交互来实现不同模态之间的充分信息交换,从而减少计算资源消耗和可训练参数数量(约为基础模型的1%),同时在多个数据集上表现出优越或相当的性能。

链接: https://arxiv.org/abs/2507.07415
作者: Xinyao Yu,Hao Sun,Zeyu Ling,Ziwei Niu,Zhenjia Bai,Rui Qin,Yen-Wei Chen,Lanfen Lin
机构: Zhejiang University (浙江大学); Ritsumeikan University (立命馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2401.14856

点击查看摘要

Abstract:In recent years, large-scale pre-trained multimodal models (LMMs) generally emerge to integrate the vision and language modalities, achieving considerable success in multimodal tasks, such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategy is studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers, and integrate different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. Utilizing this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
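
下面给出基于相似度的提示交互的一个最小示意(笔者补充;提示张量形状与交互形式均为假设,仅展示"以余弦相似度为权重在两模态提示之间交换信息"这一思路,训练时只更新这些小规模提示张量而冻结基础模型):

```python
import torch
import torch.nn.functional as F

def prompt_interaction(img_prompts, txt_prompts):
    """Similarity-weighted exchange between image and text prompts.

    img_prompts, txt_prompts: (B, P, D) learnable prompt tokens inserted at
    an intermediate layer; each set attends to the other modality.
    """
    sim = F.normalize(img_prompts, dim=-1) @ \
          F.normalize(txt_prompts, dim=-1).transpose(1, 2)    # (B, P, P)
    img_out = img_prompts + sim.softmax(dim=-1) @ txt_prompts
    txt_out = txt_prompts + sim.transpose(1, 2).softmax(dim=-1) @ img_prompts
    return img_out, txt_out
```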
zh

[CV-80] EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction

【速读】:该论文试图解决在零样本条件下合成物体新视角并具备非模态补全能力的问题,现有方法通过多阶段和复杂流程先幻化图像缺失部分再进行新视角合成,未能考虑跨视角依赖关系且需要冗余的存储与计算。解决方案的关键在于应用包括输入级和特征级掩码的掩码微调策略,以实现端到端模型,提升新视角合成和非模态补全能力。此外,该模型可与其它前馈图像到网格模型结合而无需额外训练,从而显著减少重建时间并提高3D重建速度。

链接: https://arxiv.org/abs/2507.07410
作者: Xinan Zhang,Muhammad Zubair Irshad,Anthony Yezzi,Yi-Chang Tsai,Zsolt Kira
机构: Georgia Institute of Technology (佐治亚理工学院); Toyota Research Institute (丰田研究机构)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose EscherNet++, a masked fine-tuned diffusion model that can synthesize novel views of objects in a zero-shot manner with amodal completion ability. Existing approaches utilize multiple stages and complex pipelines to first hallucinate missing parts of the image and then perform novel view synthesis, which fail to consider cross-view dependencies and require redundant storage and computing for separate stages. Instead, we apply masked fine-tuning including input-level and feature-level masking to enable an end-to-end model with the improved ability to synthesize novel views and conduct amodal completion. In addition, we empirically integrate our model with other feed-forward image-to-mesh models without extra training and achieve competitive results with reconstruction time decreased by 95%, thanks to its ability to synthesize arbitrary query views. Our method’s scalable nature further enhances fast 3D reconstruction. Despite fine-tuning on a smaller dataset and batch size, our method achieves state-of-the-art results, improving PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks in 10-input settings, while also generalizing to real-world occluded reconstruction.
zh

[CV-81] Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections

【速读】:该论文试图解决从互联网获取的非约束性图像集合中进行场景重建与分割的问题,这类图像由于光照不一致和瞬时遮挡导致分割难度较大。传统分割方法无法有效处理瞬时遮挡或准确恢复场景的光照条件。其解决方案的关键在于提出Seg-Wild方法,该方法基于3D Gaussian Splatting,通过为每个3D Gaussian集成多维特征嵌入,并计算特征嵌入与分割目标之间的相似性,实现交互式分割。此外,引入了Spiky 3D Gaussian Cutter (SGC) 来平滑异常的3D高斯分布,从而提升分割效果。

链接: https://arxiv.org/abs/2507.07395
作者: Yongtang Bao,Chengjie Tang,Yuze Wang,Haojie Li
机构: Shandong University of Science and Technology(山东科技大学); Beihang University(北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene’s lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at this https URL.
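
以下示意交互式分割中"按特征相似度筛选 3D Gaussians"这一步(笔者补充;嵌入维度、归一化与阈值均为假设):

```python
import torch
import torch.nn.functional as F

def select_gaussians(gaussian_feats, query_feat, tau=0.7):
    """Keep 3D Gaussians whose embedding matches the segmentation target.

    gaussian_feats: (N, D) per-Gaussian feature embeddings;
    query_feat: (D,) embedding of the user-selected target.
    Returns a boolean mask over the N Gaussians.
    """
    sim = F.normalize(gaussian_feats, dim=1) @ F.normalize(query_feat, dim=0)
    return sim > tau
```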
zh

[CV-82] Behave Your Motion: Habit-preserved Cross-category Animal Motion Transfer

【速读】:该论文试图解决跨类别动物运动迁移中忽视物种特有行为习惯的问题,现有方法主要关注人类运动的骨骼对齐或风格一致性,而忽略了动物独特的行为特征。解决方案的关键在于提出一种基于生成框架的习惯保留运动迁移方法,其核心是引入具有类别特定习惯编码器的习惯保留模块,以学习捕捉显著行为特征的运动先验,并结合大语言模型实现对未观察物种的运动迁移。

链接: https://arxiv.org/abs/2507.07394
作者: Zhimin Zhang,Bi’an Du,Caoyuan Ma,Zheng Wang,Wei Hu
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Animal motion embodies species-specific behavioral habits, making the transfer of motion across categories a critical yet complex task for applications in animation and virtual reality. Existing motion transfer methods, primarily focused on human motion, emphasize skeletal alignment (motion retargeting) or stylistic consistency (motion style transfer), often neglecting the preservation of distinct habitual behaviors in animals. To bridge this gap, we propose a novel habit-preserved motion transfer framework for cross-category animal motion. Built upon a generative framework, our model introduces a habit-preservation module with category-specific habit encoder, allowing it to learn motion priors that capture distinctive habitual characteristics. Furthermore, we integrate a large language model (LLM) to facilitate the motion transfer to previously unobserved species. To evaluate the effectiveness of our approach, we introduce the DeformingThings4D-skl dataset, a quadruped dataset with skeletal bindings, and conduct extensive experiments and quantitative analyses, which validate the superiority of our proposed model.
zh

[CV-83] KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos

【速读】:该论文试图解决视频行人再识别(Video-based Person Re-Identification, VPR)中的跨场景身份匹配问题,旨在提升模型在复杂环境下的鲁棒性和准确性。其解决方案的关键在于提出一种基于关键点引导的框架\textbf{KeyRe-ID},该框架通过全局分支和局部分支协同工作:全局分支利用基于Transformer的时间聚合机制捕捉整体身份语义,而局部分支则根据人体关键点动态分割身体区域,生成细粒度的部分感知特征,从而增强时空表征学习的效果。

链接: https://arxiv.org/abs/2507.07393
作者: Jinseong Kim,Junghoon Song,Gyeongseon Baek,Byeongjoon Noh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:We propose KeyRe-ID, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.
zh

[CV-84] ST-GRIT: Spatio-Temporal Graph Transformer For Internal Ice Layer Thickness Prediction ICIP

【速读】:该论文旨在解决雷达图像中内部冰层厚度及其变化性的准确识别问题,这对于监测积雪积累、评估冰川动力学以及减少气候模型中的不确定性具有重要意义。其解决方案的关键在于提出ST-GRIT,一种用于冰层厚度的时空图变换器,该方法通过引入归纳几何图学习框架提取局部空间特征,并利用独立的时间和空间注意力块分别建模长程依赖关系,从而有效捕捉浅层与深层冰层之间的时空关系。

链接: https://arxiv.org/abs/2507.07389
作者: Zesheng Liu,Maryam Rahnemoonfar
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for 2025 IEEE International Conference on Image Processing (ICIP)

点击查看摘要

Abstract:Understanding the thickness and variability of internal ice layers in radar imagery is crucial for monitoring snow accumulation, assessing ice dynamics, and reducing uncertainties in climate models. Radar sensors, capable of penetrating ice, provide detailed radargram images of these internal layers. In this work, we present ST-GRIT, a spatio-temporal graph transformer for ice layer thickness, designed to process these radargrams and capture the spatiotemporal relationships between shallow and deep ice layers. ST-GRIT leverages an inductive geometric graph learning framework to extract local spatial features as feature embeddings and employs a series of temporal and spatial attention blocks separately to model long-range dependencies effectively in both dimensions. Experimental evaluation on radargram data from the Greenland ice sheet demonstrates that ST-GRIT consistently outperforms current state-of-the-art methods and other baseline graph neural networks by achieving lower root mean-squared error. These results highlight the advantages of self-attention mechanisms on graphs over pure graph neural networks, including the ability to handle noise, avoid oversmoothing, and capture long-range dependencies. Moreover, the use of separate spatial and temporal attention blocks allows for distinct and robust learning of spatial relationships and temporal patterns, providing a more comprehensive and effective approach.
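
下面给出"空间与时间注意力分开建模"这一设计的一个 PyTorch 示意(笔者补充;ST-GRIT 的图构建与几何特征提取在此省略,输入假设为每个时间步 N 个图节点的特征,特征维需能被头数整除):

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Separate spatial and temporal attention over graph-node features."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: (B, T, N, D)
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)            # attention across nodes per step
        s, _ = self.spatial(s, s, s)
        t = s.reshape(B, T, N, D).transpose(1, 2).reshape(B * N, T, D)
        t, _ = self.temporal(t, t, t)         # attention across time per node
        return t.reshape(B, N, T, D).transpose(1, 2)
```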
zh

[CV-85] Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos

【速读】:该论文旨在解决体育视频中精确事件检测(Precise Event Spotting, PES)问题,即从单摄像头视频中进行细粒度动作的帧级识别。现有PES模型通常采用轻量级的时间模块如Gate Shift Module (GSM)或Gate Shift Fuse (GSF)来增强2D CNN特征提取器的时间上下文,但这些模块在时间感受野和空间适应性方面存在局限。解决方案的关键是提出多尺度注意力门控移位模块(Multi-Scale Attention Gate Shift Module, MSAGSM),该模块通过引入多尺度时间膨胀和多头空间注意力机制,提升了对短期和长期依赖关系的建模能力,同时聚焦显著区域,从而实现了高效的事件检测。

链接: https://arxiv.org/abs/2507.07381
作者: Hao Xu,Arbind Agrahari Baniya,Sam Wells,Mohamed Reda Bouadjenek,Richard Dazeley,Sunil Aryal
机构: Deakin University (迪肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.
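
以下为 MSAGSM 两个组成部分(多尺度时间膨胀卷积与多头空间注意力)的一个简化示意(笔者补充;模块的具体堆叠、门控与残差方式为假设,原文为可即插即用的完整设计):

```python
import torch
import torch.nn as nn

class MSAGSMSketch(nn.Module):
    """Multi-scale temporal dilations + multi-head spatial attention."""

    def __init__(self, channels, dilations=(1, 2, 4), heads=4):
        super().__init__()
        self.temporal = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, groups=channels)   # depthwise
            for d in dilations])
        self.spatial_attn = nn.MultiheadAttention(channels, heads,
                                                  batch_first=True)

    def forward(self, x):                      # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        t = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
        t = sum(conv(t) for conv in self.temporal) / len(self.temporal)
        t = t.reshape(B, H, W, C, T).permute(0, 4, 3, 1, 2)
        s = t.reshape(B * T, C, H * W).transpose(1, 2)   # pixels as tokens
        s, _ = self.spatial_attn(s, s, s)
        return x + s.transpose(1, 2).reshape(B, T, C, H, W)
```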
zh

[CV-86] Adaptive Particle-Based Shape Modeling for Anatomical Surface Correspondence

【速读】:该论文试图解决基于粒子的形状建模(Particle-based Shape Modeling, PSM)方法在自适应性方面的不足,即缺乏自动调整粒子配置以适应局部几何特征的能力,这限制了其对复杂解剖变异性的准确表示。解决方案的关键在于引入两种机制:(1)一种新颖的邻域对应损失,以实现高适应性;(2)一种测地对应算法,通过正则化优化来强制测地邻域一致性,从而在保持一致粒子配置的同时提升表面适应性。

链接: https://arxiv.org/abs/2507.07379
作者: Hong Xu,Shireen Y. Elhabian
机构: Scientific Computing and Imaging Institute, Kahlert School of Computing, University of Utah (科学计算与成像研究所,卡勒特计算机学院,犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Particle-based shape modeling (PSM) is a family of approaches that automatically quantifies shape variability across anatomical cohorts by positioning particles (pseudo landmarks) on shape surfaces in a consistent configuration. Recent advances incorporate implicit radial basis function representations as self-supervised signals to better capture the complex geometric properties of anatomical structures. However, these methods still lack self-adaptivity – that is, the ability to automatically adjust particle configurations to local geometric features of each surface, which is essential for accurately representing complex anatomical variability. This paper introduces two mechanisms to increase surface adaptivity while maintaining consistent particle configurations: (1) a novel neighborhood correspondence loss to enable high adaptivity and (2) a geodesic correspondence algorithm that regularizes optimization to enforce geodesic neighborhood consistency. We evaluate the efficacy and scalability of our approach on challenging datasets, providing a detailed analysis of the adaptivity-correspondence trade-off and benchmarking against existing methods on surface representation accuracy and correspondence metrics.
zh

[CV-87] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency ICCV2025

【速读】:该论文试图解决深度补全(depth completion)模型在未见过的环境中泛化能力不足的问题,特别是在缺乏大规模带度量深度标签的数据集的情况下,训练此类模型面临数据获取成本高和标注工作量大的挑战。解决方案的关键在于提出一种标签高效的技术——PacGDC,其核心是利用2D到3D投影过程中物体形状和位置的固有模糊性与一致性,合成大量伪几何结构,从而在不增加标注负担的前提下显著提升数据多样性。通过引入多个深度基础模型作为尺度操控器,PacGDC能够生成具有不同场景尺度的伪深度标签,并保持投影一致性以支持模型泛化。此外,结合插值、重定位策略及未标记图像进一步扩展了数据覆盖范围。

链接: https://arxiv.org/abs/2507.07374
作者: Haotian Wang,Aoran Xiao,Xiaoqin Zhang,Meng Yang,Shijian Lu
机构: Xi’an Jiaotong University (西安交通大学); Nanyang Technological University (南洋理工大学); Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: this https URL.
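
下面示意 PacGDC 所依赖的核心观察:沿相机光线缩放深度图不会改变其 2D 投影,因此一张图像配合基础模型的伪深度即可合成多个投影一致的伪几何(笔者补充;缩放取值与稀疏化比例均为假设,后者用于模拟深度补全的稀疏输入):

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudo_geometries(pseudo_depth, scales=(0.5, 1.0, 2.0)):
    """Scaling depth along camera rays keeps the 2D projection unchanged,
    so one foundation-model prediction yields several consistent labels."""
    return [pseudo_depth * s for s in scales]

def sparsify(depth, keep_ratio=0.05):
    """Simulate a sparse sensor input for depth-completion training."""
    mask = rng.random(depth.shape) < keep_ratio
    return np.where(mask, depth, 0.0)
```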
zh

[CV-88] Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning

【速读】:该论文试图解决视觉叙事系统中角色和物体身份一致性问题,即在不同帧之间难以保持实体的识别与关联,导致引用不一致和指代幻觉。解决方案的关键是提出一种对比强化学习方法,通过训练模型区分连贯的图像序列与无关图像,并利用带有双组件奖励函数的直接偏好优化来促进真实故事中的实体定位与再识别,同时惩罚合成情境中的错误实体连接。

链接: https://arxiv.org/abs/2507.07340
作者: Daniel A. P. Oliveira,David Martins de Matos
机构: INESC-ID Lisboa(INESC-ID里斯本); Instituto Superior Técnico, Universidade de Lisboa(高级技术研究所,里斯本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages

点击查看摘要

Abstract:Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in synthetic contexts. Using this contrastive framework, we fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%), F1 from 0.35 to 0.41 (+17.1%). Pronoun grounding accuracy improved across all pronoun types except "its", and cross-frame character and object persistence increased across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%). Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
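
下面给出标准 DPO 损失的实现示意(笔者补充;论文中的双组件奖励用于构造与排序偏好对,即连贯且有依据的真实故事为"被选中"样本、无关图像拼成的合成故事为"被拒绝"样本,此处仅展示 DPO 目标本身):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard Direct Preference Optimization objective.

    logp_*: summed token log-probs of the policy on the preferred and
    rejected story; ref_logp_*: the same under the frozen reference model.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()
```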
zh

[CV-89] Scalable and Realistic Virtual Try-on Application for Foundation Makeup with Kubelka-Munk Theory CVPR2025

【速读】:该论文试图解决虚拟试妆(Virtual Try-On, VTO)应用中基础化妆品与皮肤色调色彩融合的准确合成问题,同时确保方法在多种产品范围内的可扩展性。解决方案的关键在于提出一种新的方法,以近似已建立的Kubelka-Munk(KM)理论,从而实现更快的图像合成,同时保持基础化妆品与皮肤色调色彩融合的真实性。此外,该研究构建了一个仅依赖电子商务网站上产品信息的可扩展端到端框架,以实现逼真的基础化妆品虚拟试妆效果。

链接: https://arxiv.org/abs/2507.07333
作者: Hui Pang,Sunil Hadap,Violetta Shevchenko,Rahul Suresh,Amin Banitalebi-Dehkordi
机构: Amazon(亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at the workshop Three questions about virtual try-on at CVPR 2025

点击查看摘要

Abstract:Augmented reality is revolutionizing beauty industry with virtual try-on (VTO) applications, which empowers users to try a wide variety of products using their phones without the hassle of physically putting on real products. A critical technical challenge in foundation VTO applications is the accurate synthesis of foundation-skin tone color blending while maintaining the scalability of the method across diverse product ranges. In this work, we propose a novel method to approximate well-established Kubelka-Munk (KM) theory for faster image synthesis while preserving foundation-skin tone color blending realism. Additionally, we build a scalable end-to-end framework for realistic foundation makeup VTO solely depending on the product information available on e-commerce sites. We validate our method using real-world makeup images, demonstrating that our framework outperforms other techniques.
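
论文提出的是对 Kubelka-Munk 理论的快速近似;作为参照,下面给出其所近似的单常数 KM 混色原型(笔者补充;假设输入为 [0,1] 区间的线性反射率并逐通道混合):

```python
import numpy as np

def km_ks(reflectance):
    """Kubelka-Munk relation: reflectance R -> absorption/scattering K/S."""
    r = np.clip(reflectance, 1e-4, 1.0 - 1e-4)
    return (1.0 - r) ** 2 / (2.0 * r)

def km_reflectance(ks):
    """Inverse relation: K/S -> reflectance of an opaque layer."""
    return 1.0 + ks - np.sqrt(ks ** 2 + 2.0 * ks)

def blend_foundation(skin_rgb, foundation_rgb, coverage=0.5):
    """Single-constant KM mixing of skin and foundation, per channel."""
    ks = (1.0 - coverage) * km_ks(skin_rgb) + coverage * km_ks(foundation_rgb)
    return km_reflectance(ks)
```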
zh

[CV-90] ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation ICCV

【速读】:该论文旨在解决指令引导图像编辑的自动化评估问题,当前存在的挑战包括开源视觉-语言模型(VLM)在对齐性上的不足、专有模型的不透明性和成本效率问题,以及缺乏用于微调开源VLM的公开训练数据集。其解决方案的关键在于提出一种自动数据集创建方法ADIEE,通过生成大规模数据集(超过10万样本)并微调LLaVA-NeXT-8B模型,使其能够从自定义标记中解码数值评分,从而构建一个性能优于现有开源VLM和Gemini-Pro 1.5的评分模型。

链接: https://arxiv.org/abs/2507.07317
作者: Sherry X. Chen,Yi Wei,Luowei Zhou,Suren Kumar
机构: Samsung AI Center Mountain View(三星人工智能中心山景城); University of California, Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Computer Vision (ICCV) 2025

点击查看摘要

Abstract:Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model’s average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%).
zh

[CV-91] LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation

【速读】:该论文试图解决语言引导的语义导航中,如何评估智能体对自然语言指令中词汇的准确语义 grounding 问题。现有方法缺乏一个专注于语言的基准测试,难以全面评估智能体在处理不同细节层次描述时的能力。解决方案的关键在于构建 LangNav 数据集和 LangNavBench 基准,前者提供了一个低误差率的开放集数据,后者系统地评估当前语义导航方法在处理属性、空间和关系线索以及类别层次结构方面的表现。此外,论文还提出 Multi-Layered Feature Map (MLFM) 方法,通过构建可查询的多层语义地图,有效提升对小物体或涉及空间关系的指令的处理能力。

链接: https://arxiv.org/abs/2507.07299
作者: Sonia Raychaudhuri,Enrico Cancelli,Tommaso Campari,Lamberto Ballan,Manolis Savva,Angel X. Chang
机构: Simon Fraser University (西蒙弗雷泽大学); University of Padova (帕多瓦大学); Fondazione Bruno Kessler (FBK) (布鲁诺·凯塞尔基金会); Alberta Machine Intelligence Institute (Amii) (阿尔伯塔机器智能研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Despite these advances, we still lack a clear, language-focused benchmark for testing how well such agents ground the words in their instructions. We address this gap with LangNav, an open-set dataset specifically created to test an agent’s ability to locate objects described at different levels of detail, from broad category names to fine attributes and object-object relations. Every description in LangNav was manually checked, yielding a lower error rate than existing lifelong- and semantic-navigation datasets. On top of LangNav we build LangNavBench, a benchmark that measures how well current semantic-navigation methods understand and act on these descriptions while moving toward their targets. LangNavBench allows us to systematically compare models on their handling of attributes, spatial and relational cues, and category hierarchies, offering the first thorough, language-centric evaluation of embodied navigation systems. We also present Multi-Layered Feature Map (MLFM), a method that builds a queryable multi-layered semantic map, particularly effective when dealing with small objects or instructions involving spatial relations. MLFM outperforms state-of-the-art mapping-based navigation baselines on the LangNav dataset.
zh

[CV-92] MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning

【速读】:该论文旨在解决当前大型视觉-语言模型在视觉问答和多模态推理任务中是否真正具备基于视觉证据的推理能力,还是仅仅依赖于表面模式和数据集偏差的问题。其解决方案的关键在于提出MagiC基准,这是一个全面评估 grounded 多模态认知的测试平台,不仅关注答案准确性,还评估逐步推理的质量及其与视觉证据的一致性,并通过弱监督和人工标注的数据集进行多维度模型评估,同时引入新的度量标准如MagiScore和StepSense以揭示现有方法在基于视觉证据的推理中的关键局限与改进空间。

链接: https://arxiv.org/abs/2507.07297
作者: Chengfei Wu,Ronald Seoh,Bingxuan Li,Liqiang Zhang,Fengrong Han,Dan Goldwasser
机构: Purdue University (普渡大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of California Los Angeles (加州大学洛杉矶分校); University of Connecticut (康涅狄格大学); University of California Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability. MagiC further includes diagnostic settings to probe model robustness under adversarial visual cues and assess their capacity for introspective error correction. We introduce new metrics such as MagiScore and StepSense, and provide comprehensive analyses that reveal key limitations and opportunities in current approaches to grounded visual reasoning.
zh

[CV-93] DisenQ: Disentangling Q-Former for Activity-Biometrics ICCV2025

【速读】:该论文旨在解决活动生物特征识别(activity-biometrics)问题,即在多样化的活动中识别个体。与传统的人员识别不同,该设定下身份线索与运动动态和外观变化交织在一起,使得生物特征学习更加复杂。论文提出的解决方案的关键在于引入一种多模态语言引导框架,核心是DisenQ(Disentangling Q-Former),该框架通过结构化语言指导分离生物特征、运动和非生物特征,确保身份线索独立于外观和运动变化,从而避免误识别。

链接: https://arxiv.org/abs/2507.07262
作者: Shehreen Azad,Yogesh S Rawat
机构: University of Central Florida(中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ICCV 2025

点击查看摘要

Abstract:In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometrics feature learning more complex. While additional visual data like pose and/or silhouette help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce DisenQ (Disentangling Q-Former), a unified querying transformer that disentangles biometrics, motion, and non-biometrics features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentifications. We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to a complex real-world scenario with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.
zh

[CV-94] Automated Video Segmentation Machine Learning Pipeline

【速读】:该论文试图解决视觉特效(Visual Effects, VFX)制作中掩码生成过程缓慢且资源消耗大的问题。其解决方案的关键在于构建一个自动化的视频分割流程,该流程利用机器学习技术实现:(1)通过文本提示进行灵活的目标检测,(2)精细化的逐帧图像分割,以及(3)鲁棒的视频跟踪以确保时间一致性。此外,该流程通过容器化部署并采用结构化输出格式,显著降低了人工干预,提高了初步合成的创建速度,并提供了全面的分割数据,从而提升了整体VFX生产效率。

链接: https://arxiv.org/abs/2507.07242
作者: Johannes Merz,Lucien Fostier
机构: Image Engine Design Inc.(Image Engine Design Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual effects (VFX) production often struggles with slow, resource-intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.
zh

[CV-95] Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement ICCV’25

【速读】:该论文试图解决服装变化再识别(Clothes-Changing Re-Identification, CC-ReID)中由于衣物变化导致的外观偏差问题,旨在实现跨场景和时间的个体识别。其解决方案的关键在于利用颜色信息作为轻量级、无需标注的代理信号,通过提出Colors See, Colors Ignore (CSCI)方法,直接从原始图像或视频帧中提取颜色信息,并利用S2A自注意力机制防止颜色与身份线索之间的信息泄露,从而有效分离颜色相关的外观偏差与身份相关特征。

链接: https://arxiv.org/abs/2507.07230
作者: Priyank Pathak,Yogesh S. Rawat
机构: Center for Research in Computer Vision, University of Central Florida (计算机视觉研究中心,中央佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV’25 paper

点击查看摘要

Abstract:Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color - specifically foreground and background colors - as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias (‘Color See’) while disentangling it from identity-relevant ReID features (‘Color Ignore’). To achieve this, we introduce S2A self-attention, a novel self-attention to prevent information leak between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline by Top-1 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without relying on additional supervision. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID. Github: this https URL.
zh

[CV-96] A Survey on Long-Video Storytelling Generation: Architectures Consistency and Cinematic Quality

【速读】:该论文试图解决当前视频生成模型在生成长视频时面临的挑战,包括视频长度受限(通常仅能生成5-16秒的视频)、超过16秒后角色外观和场景布局的一致性难以维持,以及多主体长视频中角色一致性与动作连贯性不足的问题。此外,尽管某些方法能够生成长达150秒的视频,但常存在帧冗余和时间多样性低的问题。为了解决这些问题,该论文的关键在于通过系统分析32篇视频生成相关文献,识别出能够持续生成高质量长视频的架构组件和训练策略,并构建了一个全面的现有方法分类体系,以促进对视频生成技术的深入理解和进一步发展。

链接: https://arxiv.org/abs/2507.07202
作者: Mohamed Elmoghany,Ryan Rossi,Seunghyun Yoon,Subhojyoti Mukherjee,Eslam Bakr,Puneet Mathur,Gang Wu,Viet Dac Lai,Nedim Lipka,Ruiyi Zhang,Varun Manjunatha,Chien Nguyen,Daksh Dangi,Abel Salinas,Mohammad Taesiri,Hongjie Chen,Xiaolei Huang,Joe Barrow,Nesreen Ahmed,Hoda Eldardiry,Namyong Park,Yu Wang,Jaemin Cho,Anh Totti Nguyen,Zhengzhong Tu,Thien Nguyen,Dinesh Manocha,Mohamed Elhoseiny,Franck Dernoncourt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the significant progress that has been made in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, often labeled “long-form videos”. Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.
zh

[CV-97] Interpretable EEG-to-Image Generation with Semantic Prompts

【速读】:该论文试图解决从脑电信号(EEG)中解码视觉体验的问题,旨在实现可解释的视觉重建。其解决方案的关键在于通过将EEG信号与由大语言模型生成的多层级语义描述(从物体级到抽象主题)进行对齐,而非直接进行EEG到图像的生成。该方法利用基于Transformer的EEG编码器通过对比学习将脑活动映射到这些语义描述,并在推理阶段通过投影头检索描述嵌入以条件化预训练的潜在扩散模型进行图像生成。

链接: https://arxiv.org/abs/2507.07157
作者: Arshak Rezvani,Ali Akbari,Kosar Sanjar Arani,Maryam Mirian,Emad Arasteh,Martin J. McKeown
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: Actionable Interpretability Workshop (non-archival) at the 42 International Conference on Machine Learning

点击查看摘要

Abstract:Decoding visual experience from brain signals offers exciting possibilities for neuroscience and interpretable AI. While EEG is accessible and temporally precise, its limitations in spatial detail hinder image reconstruction. Our model bypasses direct EEG-to-image generation by aligning EEG signals with multilevel semantic captions – ranging from object-level to abstract themes – generated by a large language model. A transformer-based EEG encoder maps brain activity to these captions through contrastive learning. During inference, caption embeddings retrieved via projection heads condition a pretrained latent diffusion model for image generation. This text-mediated framework yields state-of-the-art visual decoding on the EEGCVPR dataset, with interpretable alignment to known neurocognitive pathways. Dominant EEG-caption associations reflected the importance of different semantic levels extracted from perceived images. Saliency maps and t-SNE projections reveal semantic topography across the scalp. Our model demonstrates how structured semantic mediation enables cognitively aligned visual decoding from EEG.
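
下面示意推理阶段的检索式条件生成:将投影后的 EEG 特征与描述嵌入库做余弦相似度检索,取近邻均值作为潜在扩散模型的条件向量(笔者补充;k 值与均值聚合方式均为假设):

```python
import torch
import torch.nn.functional as F

def retrieve_caption_embedding(eeg_emb, caption_bank, k=1):
    """Retrieve caption embeddings nearest to projected EEG features.

    eeg_emb: (B, D) outputs of the projection head;
    caption_bank: (M, D) precomputed caption embeddings.
    """
    sim = F.normalize(eeg_emb, dim=1) @ F.normalize(caption_bank, dim=1).t()
    topk = sim.topk(k, dim=1).indices               # (B, k)
    return caption_bank[topk].mean(dim=1)           # (B, D) condition vector
```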
zh

[CV-98] CL-Polyp: A Contrastive Learning-Enhanced Network for Accurate Polyp Segmentation

【速读】:该论文旨在解决结肠镜图像中息肉分割的准确性问题,以支持结直肠癌的早期诊断与治疗。现有基于深度学习的息肉分割方法通常采用编码器-解码器架构,并借助多任务框架引入辅助任务(如分类)以提升分割性能,但这些方法常需额外标注数据且依赖任务相似性,限制了其泛化能力。论文提出的CL-Polyp方法通过引入对比学习增强编码器的判别特征提取能力,利用从息肉图像中生成的正负样本对进行对比学习,从而在无需额外标注的情况下提升视觉表征。该解决方案的关键在于对比学习策略与两个轻量级模块——改进的空洞空间金字塔池化(MASPP)模块和通道拼接与元素相加(CA)模块,分别用于多尺度特征融合和低层与上采样特征的边界重建。

链接: https://arxiv.org/abs/2507.07154
作者: Desheng Li,Chaoliang Liu,Zhiyong Xiao
机构: Jiangnan University (江南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of polyps from colonoscopy images is crucial for the early diagnosis and treatment of colorectal cancer. Most existing deep learning-based polyp segmentation methods adopt an Encoder-Decoder architecture, and some utilize multi-task frameworks that incorporate auxiliary tasks such as classification to enhance segmentation performance. However, these approaches often require additional labeled data and rely on task similarity, which can limit their generalizability. To address these challenges, we propose CL-Polyp, a contrastive learning-enhanced polyp segmentation network. Our method leverages contrastive learning to improve the encoder’s ability to extract discriminative features by contrasting positive and negative sample pairs derived from polyp images. This self-supervised strategy enhances visual representation without requiring additional annotations. In addition, we introduce two lightweight and effective modules: the Modified Atrous Spatial Pyramid Pooling (MASPP) module for better multi-scale feature fusion, and the Channel Concatenate and Element Add (CA) module to fuse low-level and upsampled features for improved boundary reconstruction. Extensive experiments on five benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-300, and ETIS) demonstrate that CL-Polyp consistently outperforms state-of-the-art methods. Specifically, it improves the IoU metric by 0.011 and 0.020 on the Kvasir-SEG and CVC-ClinicDB datasets, respectively, validating its effectiveness in clinical polyp segmentation tasks.
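
下面给出自监督对比部分常用的 InfoNCE 损失示意(笔者补充;论文的正负样本构造细节以原文为准,此处以同一息肉图像两个增强视图的嵌入为正样本对、批内其余样本为负样本):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """InfoNCE on paired embeddings; matching pairs sit on the diagonal.

    z1, z2: (B, D) encoder embeddings of two views of the same patches.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```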
zh

[CV-99] Aerial Maritime Vessel Detection and Identification

【速读】:该论文试图解决在无全球导航卫星系统(GNSS)环境下,自主海上监视与目标船舶识别的问题。当目标船舶仅能通过视觉线索进行识别且其最后已知位置不可用时,无人飞行器(UAV)必须依靠机载视觉在计算资源受限的情况下扫描大范围搜索区域。解决方案的关键在于利用YOLOv8目标检测模型检测视场内的所有船舶,并通过特征匹配和色相直方图距离分析判断检测到的船舶是否为目标船舶,一旦确认则使用简单的几何原理进行目标定位。

链接: https://arxiv.org/abs/2507.07153
作者: Antonella Barisic Kulas,Frano Petric,Stjepan Bogdan
机构: University of Zagreb Faculty of Electrical Engineering and Computing (萨格勒布大学电气工程与计算学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Preprint. ICUAS 2025

点击查看摘要

Abstract:Autonomous maritime surveillance and target vessel identification in environments where Global Navigation Satellite Systems (GNSS) are not available is critical for a number of applications such as search and rescue and threat detection. When the target vessel is only described by visual cues and its last known position is not available, unmanned aerial vehicles (UAVs) must rely solely on on-board vision to scan a large search area under strict computational constraints. To address this challenge, we leverage the YOLOv8 object detection model to detect all vessels in the field of view. We then apply feature matching and hue histogram distance analysis to determine whether any detected vessel corresponds to the target. When found, we localize the target using simple geometric principles. We demonstrate the proposed method in real-world experiments during the MBZIRC2023 competition, integrated into a fully autonomous system with GNSS-denied navigation. We also evaluate the impact of perspective on detection accuracy and localization precision and compare it with the oracle approach.
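
下面示意基于色相直方图距离的目标船舶匹配(笔者补充;直方图 bin 数、Bhattacharyya 距离及其判定阈值均为假设值):

```python
import cv2
import numpy as np

def hue_histogram(bgr, bins=36):
    """Normalized hue histogram of a crop (OpenCV hue range is [0, 180))."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])
    return cv2.normalize(hist, hist).flatten().astype(np.float32)

def matches_target(candidate_crop, target_hist, max_dist=0.4):
    """Compare a detected vessel crop against the target's hue signature."""
    d = cv2.compareHist(hue_histogram(candidate_crop), target_hist,
                        cv2.HISTCMP_BHATTACHARYYA)
    return d < max_dist
```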
zh

[CV-100] Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey

【速读】:该论文试图解决当前可解释人工智能(Explainable AI, XAI)在生物医学图像分析中缺乏模态感知视角、忽视多模态和视觉-语言范式的最新进展以及缺乏实际指导的问题。其解决方案的关键在于通过系统分类和分析XAI方法,提出一种以模态为中心的分类体系,以适应不同成像类型,并探讨多模态学习和视觉-语言模型在可解释生物医学AI中的新兴作用。此外,还总结了常用的评估指标和开源框架,并对现存挑战和未来方向进行了深入讨论。

链接: https://arxiv.org/abs/2507.07148
作者: Getamesay Haile Dagnaw,Yanming Zhu,Muhammad Hassan Maqsood,Wencheng Yang,Xingshuai Dong,Xuefei Yin,Alan Wee-Chung Liew
机构: Griffith University (格里菲斯大学); University of Southern Queensland (南昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Explainable artificial intelligence (XAI) has become increasingly important in biomedical image analysis to promote transparency, trust, and clinical adoption of DL models. While several surveys have reviewed XAI techniques, they often lack a modality-aware perspective, overlook recent advances in multimodal and vision-language paradigms, and provide limited practical guidance. This survey addresses this gap through a comprehensive and structured synthesis of XAI methods tailored to biomedical image analysis. We systematically categorize XAI methods, analyzing their underlying principles, strengths, and limitations within biomedical contexts. A modality-centered taxonomy is proposed to align XAI methods with specific imaging types, highlighting the distinct interpretability challenges across modalities. We further examine the emerging role of multimodal learning and vision-language models in explainable biomedical AI, a topic largely underexplored in previous work. Our contributions also include a summary of widely used evaluation metrics and open-source frameworks, along with a critical discussion of persistent challenges and future directions. This survey offers a timely and in-depth foundation for advancing interpretable DL in biomedical image analysis.
zh

[CV-101] Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation ICLR2025

【速读】:该论文旨在解决预训练视觉语言模型(Vision Language Models, VLM)在适应下游任务时依赖文本描述作为提示信息所带来的高变异性和低可靠性问题。现有方法通过从大型语言模型(Large Language Models, LLM)中提取文本响应作为提示,但这一过程存在不稳定和不可靠的缺陷。论文提出的解决方案的关键在于Description-free Multi-prompt Learning (DeMul),该方法摒弃了提取文本描述的步骤,直接将LLM的知识蒸馏到提示中,从而在保持连续向量表示以供优化的同时,增强了提示的语义丰富性,并消除了对离散预定义模板的依赖。

链接: https://arxiv.org/abs/2507.07147
作者: Sua Lee,Kyubum Shin,Jung Ho Park
机构: Seoul National University (首尔国立大学); Naver AI (纳维AI)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Published as a conference paper at ICLR 2025

点击查看摘要

Abstract:Recent advances in pre-trained Vision Language Models (VLM) have shown promising potential for effectively adapting to downstream tasks through prompt learning, without the need for additional annotated paired datasets. To supplement the text information in VLM trained on correlations with vision data, new approaches leveraging Large Language Models (LLM) in prompts have been proposed, enhancing robustness to unseen and diverse data. Existing methods typically extract text-based responses (i.e., descriptions) from LLM to incorporate into prompts; however, this approach suffers from high variability and low reliability. In this work, we propose Description-free Multi-prompt Learning (DeMul), a novel method that eliminates the process of extracting descriptions and instead directly distills knowledge from LLM into prompts. By adopting a description-free approach, prompts can encapsulate richer semantics while still being represented as continuous vectors for optimization, thereby eliminating the need for discrete pre-defined templates. Additionally, in a multi-prompt setting, we empirically demonstrate the potential of prompt weighting in reflecting the importance of different prompts during training. Experimental results show that our approach achieves superior performance across 11 recognition datasets.
zh

[CV-102] Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning

【速读】:该论文试图解决生成式AI(Generative AI)模型在经过机器遗忘(Machine Unlearning, MU)处理后,其鲁棒性和有效性仍存在显著漏洞的问题,尤其是在面对多模态对抗输入时。论文提出的解决方案的关键在于设计了一个名为Recall的新型对抗框架,该框架通过利用扩散模型内在的多模态条件生成能力,高效地优化对抗图像提示,并借助单一语义相关的参考图像进行引导,从而有效削弱已遗忘模型的性能。

链接: https://arxiv.org/abs/2507.07139
作者: Renyang Liu,Guanlin Li,Tianwei Zhang,See-Kiong Ng
机构: Institute of Data Science, National University of Singapore; S-Lab, Nanyang Technological University; College of Computing and Data Science, Nanyang Technological University
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at this https URL.
zh

[CV-103] CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings ECCV2024

【速读】:该论文试图解决无监督域适应(Unsupervised Domain Adaptation, UDA)中图像分割任务的域不变特征学习问题,特别是在目标域未见的情况下如何利用源域的标注数据提升模型泛化能力。解决方案的关键在于提出了一种基于协方差的像素-文本损失(Covariance-based Pixel-Text loss, CoPT),该方法通过域无关的文本嵌入来引导图像分割编码器学习域不变特征,其中文本嵌入是通过LLM域模板过程生成的,即利用大语言模型(Large Language Model, LLM)生成源域和目标域描述,并将其输入冻结的CLIP模型进行融合。

链接: https://arxiv.org/abs/2507.07125
作者: Cristina Mata,Kanchana Ranasinghe,Michael S. Ryoo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: ECCV 2024

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves the new state of the art performance on UDA for segmentation. The code can be found at this https URL.
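
下面示意 LLM Domain Template 的组合步骤:将 LLM 生成的源域与目标域描述分别送入冻结的 CLIP 文本编码器后取均值,得到域无关的类别文本嵌入(笔者补充;提示模板与"取均值"的组合方式为假设,基于 OpenAI 的 clip 包):

```python
import torch
import clip  # OpenAI CLIP package, assumed installed

@torch.no_grad()
def domain_agnostic_embeddings(class_names, src_desc, tgt_desc, device="cpu"):
    """Embed each class under source/target domain descriptions and combine.

    src_desc / tgt_desc: LLM-generated domain descriptions, e.g.
    "a synthetic driving scene" / "a foggy real-world photo" (assumptions).
    """
    model, _ = clip.load("ViT-B/32", device=device)
    embeddings = []
    for name in class_names:
        tokens = clip.tokenize([f"a photo of a {name} in {src_desc}",
                                f"a photo of a {name} in {tgt_desc}"]).to(device)
        feats = model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        embeddings.append(feats.mean(dim=0))        # domain-agnostic embedding
    return torch.stack(embeddings)                  # (num_classes, D)
```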
zh

[CV-104] A Comprehensive Survey on Deep Learning Solutions for 3D Flood Mapping

【速读】:该论文试图解决洪水灾害管理中传统二维洪水制图技术提供的信息有限的问题,其核心挑战在于如何通过更精确的洪水范围和深度信息提升灾害应对与城市规划的效果。解决方案的关键在于利用深度学习(Deep Learning, DL)技术,实现对洪水的三维制图,从而整合洪水范围与深度信息,提高预测精度和计算效率。

链接: https://arxiv.org/abs/2506.13201
作者: Wenfeng Jia,Bin Liang,Yuxi Liu,Muhammad Arif Khan,Lihong Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Flooding remains a major global challenge, worsened by climate change and urbanization, demanding advanced solutions for effective disaster management. While traditional 2D flood mapping techniques provide limited insights, 3D flood mapping, powered by deep learning (DL), offers enhanced capabilities by integrating flood extent and depth. This paper presents a comprehensive survey of deep learning-based 3D flood mapping, emphasizing its advancements over 2D maps by integrating flood extent and depth for effective disaster management and urban planning. The survey categorizes deep learning techniques into task decomposition and end-to-end approaches, applicable to both static and dynamic flood features. We compare key DL architectures, highlighting their respective roles in enhancing prediction accuracy and computational efficiency. Additionally, this work explores diverse data sources such as digital elevation models, satellite imagery, rainfall, and simulated data, outlining their roles in 3D flood mapping. The applications reviewed range from real-time flood prediction to long-term urban planning and risk assessment. However, significant challenges persist, including data scarcity, model interpretability, and integration with traditional hydrodynamic models. This survey concludes by suggesting future directions to address these limitations, focusing on enhanced datasets, improved models, and policy implications for flood management. This survey aims to guide researchers and practitioners in leveraging DL techniques for more robust and reliable 3D flood mapping, fostering improved flood management strategies.
zh

[CV-105] Adaptive Attention Residual U-Net for curvilinear structure segmentation in fluorescence microscopy and biomedical images

【速读】:该论文旨在解决在荧光显微镜图像中分割曲线结构(curvilinear structures)的难题,尤其是在噪声环境和密集纤维网络中,这在活体成像中尤为常见。为了解决这一问题,研究者构建了两个原始数据集,包含数百张合成的荧光标记微管图像,并精确标注以模拟真实的显微图像特性。此外,还提出了一种新型深度学习架构——自适应挤压与激励残差U-Net(ASE_Res_UNet),该模型通过在编码器中引入残差块以及在解码器中集成自适应SE注意力机制,显著提升了标准U-Net的性能。其关键创新在于自适应SE注意力模块,有效增强了模型在噪声环境下的鲁棒性以及对细小、低强度结构的检测能力。

链接: https://arxiv.org/abs/2507.07800
作者: Achraf Ait Laydi,Louis Cueff,Mewen Crespo,Yousef El Mourabit,Hélène Bouvrais
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmenting curvilinear structures in fluorescence microscopy remains a challenging task, particularly under noisy conditions and in dense filament networks commonly seen in vivo. To address this, we created two original datasets consisting of hundreds of synthetic images of fluorescently labelled microtubules within cells. These datasets are precisely annotated and closely mimic real microscopy images, including realistic noise. The second dataset presents an additional challenge, by simulating varying fluorescence intensities along filaments that complicate segmentation. While deep learning has shown strong potential in biomedical image analysis, its performance often declines in noisy or low-contrast conditions. To overcome this limitation, we developed a novel advanced architecture: the Adaptive Squeeze-and-Excitation Residual U-Net (ASE_Res_UNet). This model enhanced the standard U-Net by integrating residual blocks in the encoder and adaptive SE attention mechanisms in the decoder. Through ablation studies and comprehensive visual and quantitative evaluations, ASE_Res_UNet consistently outperformed its variants, namely standard U-Net, ASE_UNet and Res_UNet architectures. These improvements, particularly in noise resilience and detecting fine, low-intensity structures, were largely attributed to the adaptive SE attention module that we created. We further benchmarked ASE_Res_UNet against various state-of-the-art models, and found it achieved superior performance on our most challenging dataset. Finally, the model also generalized well to real microscopy images of stained microtubules as well as to other curvilinear structures. Indeed, it successfully segmented retinal blood vessels and nerves in noisy or low-contrast biomedical images, demonstrating its strong potential for applications in disease diagnosis and treatment.
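
下面给出标准 Squeeze-and-Excitation 通道注意力的示意;论文的自适应 SE 模块即在此配方上改进(笔者补充;"自适应"部分的具体设计以原文为准):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel gating."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):                  # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pooling
        return x * w[:, :, None, None]     # excite: per-channel reweighting
```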
zh
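As a rough illustration of the two named ingredients, the sketch below combines a residual encoder block with a squeeze-and-excitation (SE) attention module in PyTorch. This is a minimal reading of the abstract, not the authors' released code; the module layout, reduction ratio, and the adaptive variant of SE are assumptions.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-excitation: reweight channels using globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: bottleneck MLP + sigmoid gate
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # channel-wise reweighting

class ResidualBlock(nn.Module):
    """Residual convolutional block, as used in the encoder."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

# A decoder stage could then pair a conv block with SE attention:
stage = nn.Sequential(ResidualBlock(64, 64), SEAttention(64))
print(stage(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```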

[CV-106] Computationally Efficient Information-Driven Optical Design with Interchanging Optimization

【速读】: This paper addresses the mismatched objective functions, high memory usage, and long runtimes that end-to-end differentiability requirements impose on information-driven optical design. The key to the proposed IDEAL-IO method is to decouple density estimation from optical parameter optimization by alternating between fitting density models to current measurements and updating optical parameters with the models held fixed, reducing computational cost while enabling more expressive density models and thus more efficient optimization of optical systems.

链接: https://arxiv.org/abs/2507.07789
作者: Eric Markley,Henry Pinkard,Leyla Kabuli,Nalini Singh,Laura Waller
机构: UC Berkeley(加州大学伯克利分校)
类目: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Recent work has demonstrated that imaging systems can be evaluated through the information content of their measurements alone, enabling application-agnostic optical design that avoids computational decoding challenges. Information-Driven Encoder Analysis Learning (IDEAL) was proposed to automate this process through gradient-based optimization. In this work, we study IDEAL across diverse imaging systems and find that it suffers from high memory usage, long runtimes, and a potentially mismatched objective function due to end-to-end differentiability requirements. We introduce IDEAL with Interchanging Optimization (IDEAL-IO), a method that decouples density estimation from optical parameter optimization by alternating between fitting models to current measurements and updating optical parameters using fixed models for information estimation. This approach reduces runtime and memory usage by up to 6x while enabling more expressive density models that guide optimization toward superior designs. We validate our method on diffractive optics, lensless imaging, and snapshot 3D microscopy applications, establishing information-theoretic optimization as a practical, scalable strategy for real-world imaging system design.
zh
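The alternating scheme can be pictured with a short sketch: fit a density model to measurements from the current optics, then update the optical parameters against that fixed model. Everything below (the forward model, the information objective, step counts) is a hypothetical stand-in for the paper's actual components.

```python
import torch

# Hypothetical stand-ins for the real components: a differentiable optics
# forward model and a learned density model used for information estimation.
optics_params = torch.randn(16, requires_grad=True)   # e.g., phase-mask coefficients
density_model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
opt_optics = torch.optim.Adam([optics_params], lr=1e-2)
opt_density = torch.optim.Adam(density_model.parameters(), lr=1e-3)

def simulate_measurements(params, n=256):
    scenes = torch.randn(n, 32)
    return scenes * params.mean()   # placeholder physics: measurements depend on optics

def info_objective(model, y):
    return -model(y).mean()         # placeholder for the density-based information estimate

for outer in range(10):
    # Phase 1: fit the density model to measurements from the *current* optics (optics frozen).
    y = simulate_measurements(optics_params).detach()
    for _ in range(50):
        opt_density.zero_grad()
        info_objective(density_model, y).backward()
        opt_density.step()
    # Phase 2: update the optics against the now-*fixed* density model.
    for p in density_model.parameters():
        p.requires_grad_(False)
    opt_optics.zero_grad()
    info_objective(density_model, simulate_measurements(optics_params)).backward()
    opt_optics.step()
    for p in density_model.parameters():
        p.requires_grad_(True)
```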

[CV-107] mmFlux: Crowd Flow Analytics with Commodity mmWave MIMO Radar

【速读】: This paper addresses the problem of extracting underlying crowd motion patterns and inferring crowd semantics from mmWave radar data. The key to the solution is a signal processing pipeline that combines optical flow estimation concepts from vision with novel statistical and morphological noise filtering to produce high-fidelity mmWave flow fields; these fields are then transformed into directed geometric graphs, and curl and divergence are computed via local Jacobian analysis to extract key semantics of both structured and diffuse crowds.

链接: https://arxiv.org/abs/2507.07331
作者: Anurag Pallaprolu,Winston Hurst,Yasamin Mostofi
机构: University of California Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present a novel framework for extracting underlying crowd motion patterns and inferring crowd semantics using mmWave radar. First, our proposed signal processing pipeline combines optical flow estimation concepts from vision with novel statistical and morphological noise filtering to generate high-fidelity mmWave flow fields - compact 2D vector representations of crowd motion. We then introduce a novel approach that transforms these fields into directed geometric graphs, where edges capture dominant flow currents, vertices mark crowd splitting or merging, and flow distribution is quantified across edges. Finally, we show that by analyzing the local Jacobian and computing the corresponding curl and divergence, we can extract key crowd semantics for both structured and diffused crowds. We conduct 21 experiments on crowds of up to (and including) 20 people across 3 areas, using commodity mmWave radar. Our framework achieves high-fidelity graph reconstruction of the underlying flow structure, even for complex crowd patterns, demonstrating strong spatial alignment and precise quantitative characterization of flow split ratios. Finally, our curl and divergence analysis accurately infers key crowd semantics, e.g., abrupt turns, boundaries where flow directions shift, dispersions, and gatherings. Overall, these findings validate our framework, underscoring its potential for various crowd analytics applications.
zh
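The curl/divergence step is standard vector calculus and can be reproduced directly. The sketch below computes both quantities from a toy 2D flow field via finite-difference Jacobians; the radar-derived flow estimation itself is not reproduced.

```python
import numpy as np

# A toy 2D flow field u(x, y), v(x, y) on a grid; in mmFlux this would be the
# mmWave flow field estimated from radar, which is not reproduced here.
ys, xs = np.mgrid[0:64, 0:64].astype(float)
u = -(ys - 32)   # a rotating (vortex-like) flow: high curl, near-zero divergence
v = (xs - 32)

# Local Jacobian entries via finite differences (np.gradient returns d/drow, d/dcol).
du_dy, du_dx = np.gradient(u)
dv_dy, dv_dx = np.gradient(v)

divergence = du_dx + dv_dy   # > 0 where a crowd disperses, < 0 where it gathers
curl = dv_dx - du_dy         # large |curl| indicates turning / rotational motion

print(f"mean divergence: {divergence.mean():.3f}, mean curl: {curl.mean():.3f}")
# For this vortex field, divergence ~ 0 and curl ~ 2, matching the paper's use of
# these quantities to flag gatherings/dispersions vs. abrupt turns.
```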

[CV-108] Label-Efficient Chest X-ray Diagnosis via Partial CLIP Adaptation

【速读】: This paper addresses the difficulty of obtaining annotated data in medical imaging due to privacy concerns, high costs, and case scarcity. The key to the solution is a label-efficient strategy that partially fine-tunes the visual encoder of a pre-trained CLIP ViT-B/32 model and evaluates it with zero-shot and few-shot learning (1-16 labeled examples per disease class), adapting effectively to few-shot medical imaging tasks and substantially improving the mean AUC score. The approach is designed to simulate real hospital workflows, where image archives exist but annotations are sparse.

链接: https://arxiv.org/abs/2507.07254
作者: Heet Nitinkumar Dalsania
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and then evaluated using zero-shot and few-shot learning with 1-16 labeled examples per disease class. The tests demonstrate that CLIP’s pre-trained vision-language features can be effectively adapted to few-shot medical imaging tasks, achieving over 20% improvement in mean AUC score compared to the zero-shot baseline. The key aspect of this work is to attempt to simulate internal hospital workflows, where image archives exist but annotations are sparse. This work evaluates a practical and scalable solution for both common and rare disease diagnosis. Additionally, this research is intended for academic and experimental purposes only and has not yet been peer reviewed. All code is found at this https URL.
zh
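Partial fine-tuning of a CLIP visual encoder can be sketched as freezing all weights and re-enabling gradients for the last few transformer blocks. The snippet below assumes the open_clip package and an illustrative cut of two blocks; the paper does not specify the exact layers here, and attribute paths may differ across library versions.

```python
import open_clip  # assumption: the open_clip package; attribute paths may vary by version
import torch

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")

# Freeze everything first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the last N residual blocks of the visual encoder ("partial fine-tuning").
# N = 2 is an illustrative choice, not the paper's stated cut.
for block in model.visual.transformer.resblocks[-2:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M params")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```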

[CV-109] Wrist bone segmentation in X-ray images using CT-based simulations

【速读】: This paper targets the challenging problem of wrist bone segmentation in X-ray images, where multiple small, mutually overlapping carpal bones make traditional methods dependent on large amounts of high-quality annotated data that are hard to obtain. The key to the solution is to train a deep learning model on a large number of simulated X-ray images generated from CT volumes, together with their corresponding 10 bone labels, enabling accurate wrist bone segmentation in real X-ray images. This effectively mitigates the annotation shortage and yields strong segmentation performance on both simulated and real datasets.

链接: https://arxiv.org/abs/2507.07131
作者: Youssef ElTantawy,Alexia Karantana,Xin Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages

点击查看摘要

Abstract:Plain X-ray is one of the most common image modalities for clinical diagnosis (e.g. bone fracture, pneumonia, cancer screening, etc.). X-ray image segmentation is an essential step for many computer-aided diagnostic systems, yet it remains challenging. Deep-learning-based methods have achieved superior performance in medical image segmentation tasks but often require a large amount of high-quality annotated data for model training. Providing such an annotated dataset is not only time-consuming but also requires a high level of expertise. This is particularly challenging in wrist bone segmentation in X-rays, due to the interposition of multiple small carpal bones in the image. To overcome the data annotation issue, this work utilizes a large number of simulated X-ray images generated from Computed Tomography (CT) volumes with their corresponding 10 bone labels to train a deep learning-based model for wrist bone segmentation in real X-ray images. The proposed method was evaluated using both simulated images and real images. The method achieved Dice scores ranging from 0.80 to 0.92 for the simulated dataset generated from different view angles. Qualitative analysis of the segmentation results of the real X-ray images also demonstrated the superior performance of the trained model. The trained model and X-ray simulation code are freely available for research purposes: the link will be provided upon acceptance.
zh

人工智能

[AI-0] EXPO: Stable Reinforcement Learning with Expressive Policies

【速读】: This paper addresses the problem of stable value maximization when training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Unlike the simple Gaussian policies commonly used in online RL, expressive policies such as diffusion and flow-matching policies are parameterized by long denoising chains, which hinder stable gradient propagation from actions to policy parameters during optimization against a value function. The key to the solution is to avoid optimizing the expressive policy directly against the value and instead construct an on-the-fly RL policy that maximizes the Q-value, thereby achieving stable value maximization.

链接: https://arxiv.org/abs/2507.07986
作者: Perry Dong,Qiyang Li,Dorsa Sadigh,Chelsea Finn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies: a larger expressive base policy trained with a stable imitation learning objective and a light-weight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.
zh
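The on-the-fly policy amounts to a small selection rule at action time: sample from the expressive base policy, apply the Gaussian edit, and keep the candidate with the highest Q-value. A minimal sketch, with all three networks as hypothetical callables:

```python
import torch

def expo_action(obs, base_policy, edit_policy, q_fn, n_samples=4):
    """On-the-fly policy sketch: sample base actions, edit them, keep the best by Q.

    base_policy(obs, n) -> (n, act_dim) samples from the expressive policy
    edit_policy(obs, a) -> small Gaussian edit applied to each action
    q_fn(obs, a)        -> (n,) Q-value estimates
    All three are hypothetical callables standing in for learned networks.
    """
    base = base_policy(obs, n_samples)        # actions from the imitation-trained base policy
    edited = base + edit_policy(obs, base)    # lightweight edits toward a higher-value distribution
    candidates = torch.cat([base, edited], dim=0)
    q = q_fn(obs.expand(candidates.shape[0], -1), candidates)
    return candidates[q.argmax()]             # value-maximizing action for acting and TD backup

# Toy demo with stand-in networks:
obs = torch.zeros(1, 8)
act = expo_action(
    obs,
    base_policy=lambda o, n: torch.randn(n, 2),
    edit_policy=lambda o, a: 0.1 * torch.randn_like(a),
    q_fn=lambda o, a: -(a ** 2).sum(dim=-1),
)
print(act)
```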

[AI-1] Reinforcement Learning with Action Chunking

【速读】: This paper aims to improve the sample efficiency of reinforcement learning (RL) algorithms on long-horizon, sparse-reward tasks, in particular how to exploit an offline prior dataset effectively in the offline-to-online RL setting. The key to the solution is Q-chunking, which applies action chunking by running RL directly in a "chunked" action space, enabling the agent to leverage temporally consistent behaviors from offline data for more effective online exploration and to use unbiased n-step backups for more stable and efficient temporal-difference (TD) learning.

链接: https://arxiv.org/abs/2507.07969
作者: Qiyang Li,Zhiyuan Zhou,Sergey Levine
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
备注: 25 pages, 15 figures

点击查看摘要

Abstract:We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a ‘chunked’ action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased n -step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
zh
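The "chunked" TD backup is the piece that is easy to make concrete: executing a length-h action chunk yields h rewards, and bootstrapping happens only at the chunk boundary, so the h-step return needs no off-policy correction. A small sketch, with the chunk length and values as toy inputs:

```python
import numpy as np

def chunked_td_target(rewards, next_chunk_value, gamma=0.99):
    """Unbiased n-step backup over one action chunk.

    rewards: the h per-step rewards collected while executing the chunk
    next_chunk_value: bootstrap value Q(s_{t+h}, a'_chunk) at the chunk boundary
    Because the whole chunk is treated as a single action, no off-policy
    correction is needed for the intermediate steps (hence "unbiased").
    """
    h = len(rewards)
    discounts = gamma ** np.arange(h)
    return float(np.dot(discounts, rewards) + gamma ** h * next_chunk_value)

# Example: a 4-step chunk with a sparse reward at the end.
print(chunked_td_target([0.0, 0.0, 0.0, 1.0], next_chunk_value=0.5))
```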

[AI-2] Low Resource Reconstruction Attacks Through Benign Prompts

【速读】: This paper addresses the risk that generative AI models leak sensitive training data, in particular via low-resource attacks that reconstruct training-set images without direct access to the training set. The key to the solution is to exploit an inherent vulnerability of data scraped from e-commerce platforms, where templated layouts and images are tied to pattern-like prompts, and to identify seemingly benign prompts that can trigger potentially risky image reconstruction.

链接: https://arxiv.org/abs/2507.07947
作者: Sol Yarkoni,Roi Livni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent advances in generative models such as diffusion models have raised several risks and concerns related to privacy, copyright infringements and data stewardship. To better understand and control the risks, various researchers have created techniques, experiments and attacks that reconstruct images, or part of images, from the training set. While these techniques already establish that data from the training set can be reconstructed, they often rely on high resources, access to the training set, as well as well-engineered and designed prompts. In this work, we devise a new attack that requires low resources, assumes little to no access to the actual training set, and identifies seemingly benign prompts that lead to potentially-risky image reconstruction. This highlights the risk that images might be reconstructed even by an uninformed user, and unintentionally. For example, we identified that, with regard to one existing model, the prompt "blue Unisex T-Shirt" can generate the face of a real-life human model. Our method builds on an intuition from previous works which leverages domain knowledge and identifies a fundamental vulnerability that stems from the use of scraped data from e-commerce platforms, where templated layouts and images are tied to pattern-like prompts.
zh

[AI-3] Working with AI: Measuring the Occupational Implications of Generative AI

【速读】: This paper seeks to assess the economic impact of generative AI (Generative AI), that is, to understand AI's applicability across occupations and its effect on work activities. The key to the solution is to analyze the types of work activities that appear in user interactions with an AI system, combine them with data on task success and scope of impact, compute an AI applicability score for each occupation, and use it to assess AI's application potential across occupational fields.

链接: https://arxiv.org/abs/2507.07935
作者: Kiran Tomlinson,Sonia Jaffe,Will Wang,Scott Counts,Siddharth Suri
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
备注: 40 pages

点击查看摘要

Abstract:Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society’s most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI and how successfully and broadly those activities are done, and combining that with data on which occupations perform those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself is performing are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.
zh

[AI-4] Meek Models Shall Inherit the Earth ICML2025

【速读】: This paper addresses the performance inequality that has emerged as a few companies scale AI systems with massive compute, leaving other competitors behind. The key of the argument is the diminishing marginal returns to compute scaling: under a fixed-distribution next-token objective, the marginal capability gains from raw compute shrink substantially. This diminishing trend will allow "meek models" with limited compute budgets to approach the performance of the best models, leading to a convergence of AI model capabilities.

链接: https://arxiv.org/abs/2507.07931
作者: Hans Gundlach,Jayson Lynch,Neil Thompson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages, 9 figures, longer version of the paper presented at TAIG ICML 2025

点击查看摘要

Abstract:The past decade has seen incredible scaling of AI systems by a few companies, leading to inequality in AI model performance. This paper argues that, contrary to prevailing intuition, the diminishing returns to compute scaling will lead to a convergence of AI model capabilities. In other words, meek models (those with limited computation budget) shall inherit the earth, approaching the performance level of the best models overall. We develop a model illustrating that under a fixed-distribution next-token objective, the marginal capability returns to raw compute shrink substantially. Given current scaling practices, we argue that these diminishing returns are strong enough that even companies that can scale their models exponentially faster than other organizations will eventually have little advantage in capabilities. As part of our argument, we give several reasons that proxies like training loss differences capture important capability measures using evidence from benchmark data and theoretical performance models. In addition, we analyze empirical data on the capability difference of AI models over time. Finally, in light of the increasing ability of meek models, we argue that AI strategy and policy require reexamination, and we outline the areas this shift will affect.
zh

[AI-5] Probing Experts Perspectives on AI-Assisted Public Speaking Training

【速读】: This paper examines the shortcomings of current commercial generative AI-powered public speaking training tools, particularly their efficacy and design flaws for professional speaking instruction. Through 16 semi-structured interviews and 2 focus groups, the study gathers speaking experts' assessments of these tools and proposes improvement guidelines. The key lies in developing systems that provide personalized, understandable, and carefully selected feedback, and in combining traditional coaching with AI-assisted exercises in a hybrid instructional model to improve training outcomes.

链接: https://arxiv.org/abs/2507.07930
作者: Nesrine Fourati,Alisa Barkar,Marion Dragée,Liv Danthon-Lefebvre,Mathieu Chollet
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background: Public speaking is a vital professional skill, yet it remains a source of significant anxiety for many individuals. Traditional training relies heavily on expert coaching, but recent advances in AI have led to novel types of commercial automated public speaking feedback tools. However, most research has focused on prototypes rather than commercial applications, and little is known about how public speaking experts perceive these tools. Objectives: This study aims to evaluate expert opinions on the efficacy and design of commercial AI-based public speaking training tools and to propose guidelines for their improvement. Methods: The research involved 16 semi-structured interviews and 2 focus groups with public speaking experts. Participants discussed their views on current commercial tools, their potential integration into traditional coaching, and suggestions for enhancing these systems. Results and Conclusions: Experts acknowledged the value of AI tools in handling repetitive, technical aspects of training, allowing coaches to focus on higher-level skills. However, they found key issues in current tools, emphasising the need for personalised, understandable, carefully selected feedback and clear instructional design. Overall, they supported a hybrid model combining traditional coaching with AI-supported exercises.
zh

[AI-6] Agent ic Retrieval of Topics and Insights from Earnings Calls SIGIR

【速读】: This paper addresses the inability of traditional topic modeling techniques to dynamically capture emerging topics and their relationships as industries evolve. The key to the solution is an LLM-agent-driven approach that extracts topics from quarterly earnings calls, structures them into a hierarchical ontology, and establishes relationships between new and existing topics through a topic ontology.

链接: https://arxiv.org/abs/2507.07906
作者: Anant Gupta,Rajarshi Bhowmik,Geoffrey Gunow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The 2nd Workshop on Financial Information Retrieval in the Era of Generative AI, The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval July 13-17, 2025 | Padua, Italy

点击查看摘要

Abstract:Tracking the strategic focus of companies through topics in their earnings calls is a key task in financial analysis. However, as industries evolve, traditional topic modeling techniques struggle to dynamically capture emerging topics and their relationships. In this work, we propose an LLM-agent driven approach to discover and retrieve emerging topics from quarterly earnings calls. The agent extracts topics from documents, structures them into a hierarchical ontology, and establishes relationships between new and existing topics through a topic ontology. We demonstrate the use of extracted topics to infer company-level insights and emerging trends over time. We evaluate our approach by measuring ontology coherence, topic evolution accuracy, and its ability to surface emerging financial trends.
zh

[AI-7] An Integrated Framework of Prompt Engineering and Multidimensional Knowledge Graphs for Legal Dispute Analysis

【速读】: This paper addresses the limitations of large language models in legal dispute analysis, including insufficient legal knowledge representation, limited concept understanding, and reasoning deficiencies. The key to the solution is an enhanced framework integrating prompt engineering with multidimensional knowledge graphs: a three-stage hierarchical prompt structure supplemented by legal-specific reasoning templates and dynamic optimization mechanisms, combined with a three-layer legal knowledge graph architecture, enabling precise legal concept retrieval and support for legal decision-making.

链接: https://arxiv.org/abs/2507.07893
作者: Mingda Zhang,Na Zhao,Jianglong Qing,Qing xu,Kaiwen Pan,Ting luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages,3 figures

点击查看摘要

Abstract:The rapid development of artificial intelligence has positioned large language models as fundamental components of intelligent legal systems. However, these models face significant limitations in legal dispute analysis, including insufficient legal knowledge representation, limited concept understanding, and reasoning deficiencies. This research proposes an enhanced framework integrating prompt engineering with multidimensional knowledge graphs. The framework introduces a three-stage hierarchical prompt structure comprising task definition, knowledge background, and reasoning guidance, supplemented by legal-specific reasoning templates and dynamic optimization mechanisms. A three-layer knowledge graph architecture is constructed with legal classification ontology, representation, and instance layers. Four complementary methods enable precise legal concept retrieval: direct legal norm code matching, domain-specific semantic vector similarity, ontology-based path reasoning, and specialized lexical segmentation. These components integrate with web search technology to establish a knowledge-enhanced framework for legal decision-making. Experimental results demonstrate significant performance improvements in legal dispute analysis, enabling accurate legal application analysis for complex cases while exhibiting nuanced understanding of judicial decision-making logic, providing a novel technical approach for implementing intelligent legal assistance systems.
zh

[AI-8] UnIT: Scalable Unstructured Inference-Time Pruning for MAC-efficient Neural Inference on MCUs

【速读】: This paper addresses the fact that, on devices without SIMD support or parallel compute, traditional structured pruning fails to exploit fine-grained efficiency optimizations. The key to the solution is UnIT, a lightweight unstructured inference-time pruning method that uses input-specific activation patterns to dynamically identify and skip unnecessary multiply-accumulate (MAC) operations, embracing irregular sparsity without retraining or hardware specialization. It turns pruning decisions into lightweight comparisons, reuses threshold computations across multiple connections, and applies layer- and group-specific pruning sensitivity to optimize compute.

链接: https://arxiv.org/abs/2507.07885
作者: Ashe Neth,Sawinder kaur,Mohammad Nur Hossain Khan,Subrata Biswas,Asif Salekin,Bashima Islam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to SenSys 2026 on July 1, 2025

点击查看摘要

Abstract:Existing pruning methods are typically applied during training or compile time and often rely on structured sparsity. While compatible with low-power microcontrollers (MCUs), structured pruning underutilizes the opportunity for fine-grained efficiency on devices without SIMD support or parallel compute. To address these limitations, we introduce UnIT (Unstructured Inference-Time pruning), a lightweight method that dynamically identifies and skips unnecessary multiply-accumulate (MAC) operations during inference, guided by input-specific activation patterns. Unlike structured pruning, UnIT embraces irregular sparsity and does not require retraining or hardware specialization. It transforms pruning decisions into lightweight comparisons, replacing multiplications with threshold checks and approximated divisions. UnIT further optimizes compute by reusing threshold computations across multiple connections and applying layer- and group-specific pruning sensitivity. We present three fast, hardware-friendly division approximations tailored to the capabilities of common embedded platforms. Demonstrated on the MSP430 microcontroller, UnIT achieves 11.02% to 82.03% MAC reduction, 27.30% to 84.19% faster inference, and 27.33% to 84.38% lower energy consumption compared to training-time pruned models, while maintaining accuracy within 0.48-7%. Under domain shift, UnIT matches or exceeds the accuracy of retrained models while requiring significantly fewer MACs. These results establish unstructured inference-time pruning as a viable and practical solution for efficient, retraining-free deployment of deep neural networks on MCUs.
zh
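The core trick, replacing multiplications with comparisons, can be illustrated on a plain dense layer. The skip rule and thresholds below are illustrative stand-ins for UnIT's per-layer thresholds and division approximations, not the paper's exact criterion:

```python
import numpy as np

def pruned_dense(x, W, thresholds):
    """Inference-time unstructured pruning sketch for y = W @ x.

    For each connection, the multiply-accumulate is skipped when the input
    activation falls below a per-row threshold; the comparison replaces the
    multiplication. This single reused per-row threshold is an illustrative
    stand-in for UnIT's threshold checks and division approximations.
    """
    y = np.zeros(W.shape[0])
    skipped = 0
    for i in range(W.shape[0]):
        t = thresholds[i]              # reused across all connections of row i
        for j in range(W.shape[1]):
            if abs(x[j]) < t:          # cheap comparison instead of a multiply
                skipped += 1
                continue
            y[i] += W[i, j] * x[j]     # MAC only for significant activations
    return y, skipped

rng = np.random.default_rng(0)
x, W = rng.normal(size=64), rng.normal(size=(16, 64))
y, skipped = pruned_dense(x, W, thresholds=np.full(16, 0.5))
print(f"skipped {skipped} of {W.size} MACs")
```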

[AI-9] Mitigating Watermark Stealing Attacks in Generative Models via Multi-Key Watermarking

【速读】: This paper addresses the watermark stealing attacks faced by generative AI providers, where users forge watermarks into content that was not generated by the provider's models, without access to the secret watermarking key, e.g., in order to falsely accuse the provider. The key to the solution is a multi-key extension that can be applied post-hoc to watermarking methods of any modality, substantially reducing the effectiveness of forged watermarks, supported by both theoretical guarantees and empirical validation.

链接: https://arxiv.org/abs/2507.07871
作者: Toluwani Aremu,Noor Hussein,Munachiso Nwadike,Samuele Poppi,Jie Zhang,Karthik Nandakumar,Neil Gong,Nils Lukas
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Watermarking offers a promising solution for GenAI providers to establish the provenance of their generated content. A watermark is a hidden signal embedded in the generated content, whose presence can later be verified using a secret watermarking key. A threat to GenAI providers is posed by watermark stealing attacks, where users forge a watermark into content that was not generated by the provider’s models without access to the secret key, e.g., to falsely accuse the provider. Stealing attacks collect harmless watermarked samples from the provider’s model and aim to maximize the expected success rate of generating harmful watermarked samples. Our work focuses on mitigating stealing attacks while treating the underlying watermark as a black-box. Our contributions are: (i) Proposing a multi-key extension to mitigate stealing attacks that can be applied post-hoc to any watermarking method across any modality. (ii) We provide theoretical guarantees and demonstrate empirically that our method makes forging substantially less effective across multiple datasets, and (iii) we formally define the threat of watermark forging as the task of generating harmful, watermarked content and model this threat via security games.
zh
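One way to picture the multi-key extension is at verification time: genuine provider output is watermarked under exactly one of the k secret keys, whereas a forgery distilled from pooled watermarked samples tends to trigger several per-key detectors at once (or none). The decision rule below is an illustrative guess at how such a check could look, with the per-key detectors as hypothetical black boxes, consistent with the paper's black-box treatment of the underlying watermark:

```python
def multi_key_verify(content, detectors, tau=0.95):
    """Multi-key verification sketch.

    detectors: list of per-key black-box functions content -> score in [0, 1]
    (hypothetical stand-ins). Genuine provider output should fire exactly one
    detector; content that fires several (or none) is treated as suspect.
    The exact decision rule here is an assumption, not the paper's scheme.
    """
    hits = [k for k, det in enumerate(detectors) if det(content) >= tau]
    if len(hits) == 1:
        return {"verdict": "watermarked", "key": hits[0]}
    return {"verdict": "forgery-suspected" if len(hits) > 1 else "unwatermarked"}

# Toy demo with stand-in detectors:
detectors = [lambda c: 0.99, lambda c: 0.10, lambda c: 0.05]
print(multi_key_verify("sample content", detectors))
```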

[AI-10] Searching for actual causes: Approximate algorithms with adjustable precision

【速读】: This paper addresses the problem of accurately identifying actual causes, which remains open in explainable AI (XAI). The classical XAI and causal inference literature focuses on which factors contribute to which outcomes, but such knowledge is insufficient for non-expert users, who instead expect a clear statement of the actual causes of the target outcome. Existing approaches are limited when handling non-boolean, black-box, and stochastic systems, and identifying actual causes has been shown to be NP-complete, with few effective approximations. The key contribution is a set of algorithms that identify actual causes in polynomial time and allow precision and exhaustiveness to be adjusted by spending more computation time, achieving an efficient and tunable cause identification mechanism.

链接: https://arxiv.org/abs/2507.07857
作者: Samuel Reyd,Ada Diaconescu,Jean-Louis Dessalles
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causality has gained popularity in recent years. It has helped improve the performance, reliability, and interpretability of machine learning models. However, recent literature on explainable artificial intelligence (XAI) has faced criticism. The classical XAI and causality literature focuses on understanding which factors contribute to which consequences. While such knowledge is valuable for researchers and engineers, it is not what non-expert users expect as explanations. Instead, these users often await facts that cause the target consequences, i.e., actual causes. Formalizing this notion is still an open problem. Additionally, identifying actual causes is reportedly an NP-complete problem, and there are too few practical solutions to approximate formal definitions. We propose a set of algorithms to identify actual causes with a polynomial complexity and an adjustable level of precision and exhaustiveness. Our experiments indicate that the algorithms (1) identify causes for different categories of systems that are not handled by existing approaches (i.e., non-boolean, black-box, and stochastic systems), (2) can be adjusted to gain more precision and exhaustiveness with more computation time.
zh

[AI-11] Optimization Guarantees for Square-Root Natural-Gradient Variational Inference

【速读】: This paper addresses the lack of theoretical convergence guarantees for natural-gradient descent in variational inference, even in the simplest cases involving concave log-likelihoods and a Gaussian approximation. The key to the solution is a square-root parameterization of the Gaussian covariance, which circumvents the difficulties of earlier theoretical analyses and establishes novel convergence guarantees for natural-gradient variational-Gaussian inference and its continuous-time gradient flow.

链接: https://arxiv.org/abs/2507.07853
作者: Navish Kumar,Thomas Möllenhoff,Mohammad Emtiyaz Khan,Aurelien Lucchi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Variational inference with natural-gradient descent often shows fast convergence in practice, but its theoretical convergence guarantees have been challenging to establish. This is true even for the simplest cases that involve concave log-likelihoods and use a Gaussian approximation. We show that the challenge can be circumvented for such cases using a square-root parameterization for the Gaussian covariance. This approach establishes novel convergence guarantees for natural-gradient variational-Gaussian inference and its continuous-time gradient flow. Our experiments demonstrate the effectiveness of natural gradient methods and highlight their advantages over algorithms that use Euclidean or Wasserstein geometries.
zh
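The square-root parameterization itself is simple to write down: keep Sigma = S S^T and optimize over S, which keeps the covariance positive semi-definite by construction. The sketch below runs plain reparameterized-gradient variational inference under this parameterization on a toy concave log-likelihood; the paper's analysis concerns natural-gradient updates, which are not reproduced here.

```python
import math
import torch

d = 2
mu = torch.ones(d, requires_grad=True)         # start away from the optimum
S = (0.5 * torch.eye(d)).requires_grad_(True)  # square root: Sigma = S @ S.T, PSD by construction

def log_target(z):
    return -0.5 * (z ** 2).sum(dim=-1)         # toy concave log-likelihood: standard normal

opt = torch.optim.Adam([mu, S], lr=0.05)
for step in range(300):
    eps = torch.randn(128, d)
    z = mu + eps @ S.T                         # reparameterized samples, z ~ N(mu, S S^T)
    # ELBO = E[log p(z)] + entropy of N(mu, S S^T)
    entropy = 0.5 * (d * math.log(2 * math.pi * math.e) + torch.logdet(S @ S.T))
    elbo = log_target(z).mean() + entropy
    opt.zero_grad()
    (-elbo).backward()
    opt.step()

print(mu.detach(), (S @ S.T).detach())         # should approach mean 0, identity covariance
```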

[AI-12] AI Should Sense Better Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift

【速读】: This paper seeks to reduce the high environmental, economic, and ethical costs of current AI, moving toward more sustainable and equitable AI systems. The key to the solution is adaptive sensing: proactively modulating sensor parameters (e.g., exposure, sensitivity, multimodal configurations) at the input level, which significantly mitigates covariate shift and improves efficiency. This allows small models to outperform much larger ones under resource constraints and opens a new technical path for real-world applications.

链接: https://arxiv.org/abs/2507.07820
作者: Eunsu Baek,Keondo Park,Jeonggil Ko,Min-hwan Oh,Taesik Gong,Hyung-Sin Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current AI advances largely rely on scaling neural models and expanding training datasets to achieve generalization and robustness. Despite notable successes, this paradigm incurs significant environmental, economic, and ethical costs, limiting sustainability and equitable access. Inspired by biological sensory systems, where adaptation occurs dynamically at the input (e.g., adjusting pupil size, refocusing vision), we advocate for adaptive sensing as a necessary and foundational shift. Adaptive sensing proactively modulates sensor parameters (e.g., exposure, sensitivity, multimodal configurations) at the input level, significantly mitigating covariate shifts and improving efficiency. Empirical evidence from recent studies demonstrates that adaptive sensing enables small models (e.g., EfficientNet-B0) to surpass substantially larger models (e.g., OpenCLIP-H) trained with significantly more data and compute. We (i) outline a roadmap for broadly integrating adaptive sensing into real-world applications spanning humanoid, healthcare, autonomous systems, agriculture, and environmental monitoring, (ii) critically assess technical and ethical integration challenges, and (iii) propose targeted research directions, such as standardized benchmarks, real-time adaptive algorithms, multimodal integration, and privacy-preserving methods. Collectively, these efforts aim to transition the AI community toward sustainable, robust, and equitable artificial intelligence systems.
zh

[AI-13] MoSE: Skill-by-Skill Mixture-of-Expert Learning for Autonomous Driving

【速读】: This paper addresses the limited generalization and interpretability of autonomous driving systems, as well as the extensive training data and complex optimization that conventional Mixture-of-Experts (MoE) models require. The key to the solution is MoSE, a skill-oriented MoE architecture that mimics how human drivers learn and reason, skill by skill and step by step: specific skills are defined and annotated, a hierarchical skill dataset is built, and the router is pretrained to enable step-by-step reasoning, improving corner-case reasoning without extra computational cost.

链接: https://arxiv.org/abs/2507.07818
作者: Lu Xu,Jiaqian Yu,Xiongfeng Peng,Yiwei Chen,Weiming Li,Jaewook Yoo,Sunghyun Chunag,Dongwook Lee,Daehyun Ji,Chao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies show large language models (LLMs) and vision language models (VLMs) trained using web-scale data can empower end-to-end autonomous driving systems for better generalization and interpretation. Specifically, by dynamically routing inputs to specialized subsets of parameters, the Mixture-of-Experts (MoE) technique enables general LLMs or VLMs to achieve substantial performance improvements while maintaining computational efficiency. However, general MoE models usually demand extensive training data and complex optimization. In this work, inspired by the learning process of human drivers, we propose a skill-oriented MoE, called MoSE, which mimics human drivers’ learning and reasoning process, skill by skill and step by step. We propose a skill-oriented routing mechanism that begins with defining and annotating specific skills, enabling experts to identify the necessary driving competencies for various scenarios and reasoning tasks, thereby facilitating skill-by-skill learning. To further align the driving process with the multi-step planning of human reasoning and end-to-end driving models, we build a hierarchical skill dataset and pretrain the router to encourage the model to think step by step. Unlike multi-round dialogs, MoSE integrates valuable auxiliary tasks (e.g., description, reasoning, planning) in one single forward process without introducing any extra computational cost. With less than 3B sparsely activated parameters, our model outperforms several 8B+ parameter models on the CODA AD corner-case reasoning task. Compared to existing methods based on open-source models and data, our approach achieves state-of-the-art performance with a significantly reduced activated model size (by at least 62.5%) using a single-turn conversation.
zh

[AI-14] Measuring AI Alignment with Human Flourishing

【速读】: This paper addresses the fact that current AI evaluation emphasizes technical capability or harm prevention while neglecting AI's positive contribution to holistic human flourishing. The key to the solution is the Flourishing AI Benchmark (FAI Benchmark), a new evaluation framework that assesses how much AI systems contribute to human flourishing across seven dimensions (Character and Virtue, Close Social Relationships, Happiness and Life Satisfaction, Meaning and Purpose, Mental and Physical Health, Financial and Material Stability, and Faith and Spirituality), using 1,229 objective and subjective questions and geometric mean scoring to ensure balanced performance across all dimensions.

链接: https://arxiv.org/abs/2507.07787
作者: Elizabeth Hilliard,Akshaya Jagadeesh,Alex Cook,Steele Billings,Nicholas Skytland,Alicia Llewellyn,Jackson Paull,Nathan Paull,Nolan Kurylo,Keatra Nesbitt,Robert Gruenewald,Anthony Jantzi,Omar Chavez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces the Flourishing AI Benchmark (FAI Benchmark), a novel evaluation framework that assesses AI alignment with human flourishing across seven dimensions: Character and Virtue, Close Social Relationships, Happiness and Life Satisfaction, Meaning and Purpose, Mental and Physical Health, Financial and Material Stability, and Faith and Spirituality. Unlike traditional benchmarks that focus on technical capabilities or harm prevention, the FAI Benchmark measures AI performance on how effectively models contribute to the flourishing of a person across these dimensions. The benchmark evaluates how effectively LLM AI systems align with current research models of holistic human well-being through a comprehensive methodology that incorporates 1,229 objective and subjective questions. Using specialized judge Large Language Models (LLMs) and cross-dimensional evaluation, the FAI Benchmark employs geometric mean scoring to ensure balanced performance across all flourishing dimensions. Initial testing of 28 leading language models reveals that while some models approach holistic alignment (with the highest-scoring models achieving 72/100), none are acceptably aligned across all dimensions, particularly in Faith and Spirituality, Character and Virtue, and Meaning and Purpose. This research establishes a framework for developing AI systems that actively support human flourishing rather than merely avoiding harm, offering significant implications for AI development, ethics, and evaluation.
zh

[AI-15] OPC: One-Point-Contraction Unlearning Toward Deep Feature Forgetting

【速读】: This paper addresses the "shallow forgetting" problem in machine unlearning: existing methods appear on the surface to remove the influence of specific data or classes, while the model's internal representations still retain enough information to restore the forgotten data or behavior. The key to the solution is a theoretical criterion for "deep forgetting" based on one-point-contraction of feature representations, together with an efficient approximation algorithm, yielding a new general-purpose unlearning algorithm, One-Point-Contraction (OPC). By enforcing deep feature forgetting, OPC substantially improves unlearning performance and robustness against attacks.

链接: https://arxiv.org/abs/2507.07754
作者: Jaeheun Jung,Bosung Jung,Suhyun Bae,Donghun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine unlearning seeks to remove the influence of particular data or classes from trained models to meet privacy, legal, or ethical requirements. Existing unlearning methods tend to forget shallowly: the unlearned model merely pretends to forget by adjusting only its responses, while its internal representations retain sufficient information to restore the forgotten data or behavior. We empirically confirm that this shallowness is widespread by reverting the forgetting effect of various unlearning methods via a training-free performance recovery attack and a gradient-inversion-based data reconstruction attack. To address this vulnerability fundamentally, we define a theoretical criterion of "deep forgetting" based on one-point-contraction of the feature representations of data to forget. We also propose an efficient approximation algorithm, and use it to construct a novel general-purpose unlearning algorithm: One-Point-Contraction (OPC). Empirical evaluations on image classification unlearning benchmarks show that OPC achieves not only effective unlearning performance but also superior resilience against both performance recovery attack and gradient-inversion attack. The distinctive unlearning performance of OPC arises from the deep feature forgetting enforced by its theoretical foundation, and underscores the need for improved robustness of machine unlearning methods.
zh
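The one-point-contraction criterion translates naturally into a loss term that pulls forget-set features toward a single point while a standard task loss preserves performance on retained data. The sketch below is a loose reading of that idea; the target point, weighting, and the paper's approximation algorithm are not reproduced:

```python
import torch

def opc_loss(feats_forget, feats_retain, labels_retain, classifier, c, lam=1.0):
    """One-point-contraction unlearning sketch.

    feats_forget: features f(x) of data to forget, pulled toward a single point c
    (the deep-forgetting criterion); the retain term keeps ordinary task accuracy.
    The choice of c and the weighting lam are illustrative assumptions, not the
    paper's exact approximation algorithm.
    """
    contraction = ((feats_forget - c) ** 2).sum(dim=-1).mean()
    task = torch.nn.functional.cross_entropy(classifier(feats_retain), labels_retain)
    return task + lam * contraction

# Toy usage with stand-in tensors and a linear head:
head = torch.nn.Linear(16, 10)
loss = opc_loss(torch.randn(8, 16), torch.randn(32, 16),
                torch.randint(0, 10, (32,)), head, c=torch.zeros(16))
loss.backward()
```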

[AI-16] Identification of Violin Reduction via Contour Lines Classification

【速读】: This paper asks whether geometric features can distinguish violins that have been reduced in size, in particular instruments that fell between standard sizes and were cut down. The key to the solution is to extract contour lines from 3D geometric meshes, fit them with a parametric curve (of the form y = alpha*abs(x)**beta) whose parameters describe how open (beta) and how vertically stretched (alpha) each curve is, and then apply classification methods to assess how well geometry alone predicts size reduction.

链接: https://arxiv.org/abs/2507.07743
作者: Philémon Beghin,Anne-Emmanuelle Ceulemans,François Glineur
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The first violins appeared in late 16th-century Italy. Over the next 200 years, they spread across Europe and luthiers of various royal courts, eager to experiment with new techniques, created a highly diverse family of instruments. Around 1750, size standards were introduced to unify violin making for orchestras and conservatories. Instruments that fell between two standards were then reduced to a smaller size by luthiers. These reductions have an impact on several characteristics of violins, in particular on the contour lines, i.e. lines of constant altitude, which look more like a U for non-reduced instruments and a V for reduced ones. While such differences are observed by experts, they have not been studied quantitatively. This paper presents a method for classifying violins as reduced or non-reduced based on their contour lines. We study a corpus of 25 instruments whose 3D geometric meshes were acquired via photogrammetry. For each instrument, we extract 10-20 contour lines regularly spaced every millimetre. Each line is fitted with a parabola-like curve (with an equation of the type y = alpha*abs(x)**beta) depending on two parameters, describing how open (beta) and how vertically stretched (alpha) the curve is. We compute additional features from those parameters, using regressions and counting how many values fall under some threshold. We also deal with outliers and unequal numbers of levels, and eventually obtain a numerical profile for each instrument. We then apply classification methods to assess whether geometry alone can predict size reduction. We find that distinguishing between reduced and non-reduced instruments is feasible to some degree, taking into account that a whole spectrum of more or less transformed violins exists, for which it is more difficult to quantify the reduction. We also find the opening parameter beta to be the most predictive.
zh
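The per-contour fit is stated explicitly in the abstract (y = alpha*abs(x)**beta), so it can be reproduced directly with scipy. The synthetic data below just illustrates the intended U-vs-V distinction; the photogrammetry and feature-aggregation steps are omitted:

```python
import numpy as np
from scipy.optimize import curve_fit

def contour_model(x, alpha, beta):
    return alpha * np.abs(x) ** beta   # the parabola-like curve from the paper

def fit_contour(x, y):
    (alpha, beta), _ = curve_fit(contour_model, x, y, p0=(1.0, 2.0))
    return alpha, beta

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)

# Synthetic example: a U-shaped line (beta ~ 2, non-reduced) vs a V-shaped one (beta ~ 1, reduced).
for true_beta, label in [(2.0, "U-like (non-reduced)"), (1.0, "V-like (reduced)")]:
    y = contour_model(x, 1.5, true_beta) + 0.01 * rng.normal(size=x.size)
    alpha, beta = fit_contour(x, y)
    print(f"{label}: alpha={alpha:.2f}, beta={beta:.2f}")

# Per instrument, the 10-20 fitted (alpha, beta) pairs are aggregated into a numerical
# profile and fed to a standard classifier; beta (openness) is the most predictive feature.
```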

[AI-17] Stable Preference Optimization for LLM s: A Bilevel Approach Beyond Direct Preference Optimization

【速读】: This paper investigates the theoretical properties and intrinsic limitations of Direct Preference Optimization (DPO) for language model alignment, in particular its high sensitivity to initialization and its tendency to misallocate probability mass, which can amplify model bias. The key to the solution is a theoretically grounded bilevel optimization framework, stable preference optimization, which tightly integrates supervised fine-tuning with an enhanced DPO objective and introduces a principled regularization scheme that explicitly encourages absolute probability improvement for preferred outputs while maintaining stable optimization dynamics.

链接: https://arxiv.org/abs/2507.07723
作者: Chengtao Jian,Kai Yang,Ye Ouyang,Xiaozhou Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has emerged as a popular and efficient alternative to reward modeling and reinforcement learning for aligning language models with human preferences. Despite its empirical success, the theoretical properties and intrinsic limitations of DPO remain underexplored. In this work, we first present a comprehensive analysis of DPO’s dynamics from a probability evolution perspective. Our analysis reveals that DPO is highly sensitive to initialization. It also tends to misallocate probability mass, which can inadvertently shift probability toward irrelevant or undesired responses. This misallocation may unintentionally reinforce model bias, thereby compromising both the stability of model alignment and the consistency with intended preferences. Motivated by these theoretical findings, we propose a theoretically grounded bilevel optimization framework that tightly integrates supervised fine-tuning with an enhanced DPO objective, a.k.a. stable preference optimization. Our approach introduces a principled regularization scheme to explicitly encourage absolute probability improvement for preferred outputs, while maintaining stable optimization dynamics. Experiments on challenging reasoning and summarization benchmarks show that our method consistently improves reasoning accuracy and better aligns output distributions with intended preferences, outperforming standard DPO. Stable preference optimization provides new insights into the design of preference-based alignment objectives and opens up new avenues towards more reliable and interpretable language model alignment.
zh
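A loose sketch of the enhanced objective: the standard DPO term plus a regularizer that explicitly rewards the absolute (reference-relative) log-probability of the preferred output. The exact form of the paper's regularization and its bilevel structure are not reproduced; lam and the clamp rule below are assumptions.

```python
import torch
import torch.nn.functional as F

def stable_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    beta=0.1, lam=0.05):
    """DPO with an absolute-probability regularizer (sketch).

    logp_*: sequence log-probs under the policy; ref_*: under the frozen reference.
    The first term is the standard DPO objective; the second penalizes the chosen
    output's probability falling below the reference, loosely mirroring the paper's
    regularization (this exact form and lam are assumptions, not the paper's).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    dpo = -F.logsigmoid(margin).mean()
    abs_reg = -(logp_chosen - ref_chosen).clamp(max=0).mean()  # active only on probability drops
    return dpo + lam * abs_reg

# Toy values: the chosen response is slightly *less* likely than under the
# reference model, so the regularizer contributes a positive penalty.
loss = stable_dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                       torch.tensor([-11.5]), torch.tensor([-14.0]))
print(loss)
```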

[AI-18] Adaptive Gaussian Mixture Models-based Anomaly Detection for under-constrained Cable-Driven Parallel Robots

【速读】: This paper addresses anomaly detection for cable-driven parallel robots (CDPRs) executing tasks with predefined toolpaths and intermediate stops, using motor torque data alone and no additional sensors. The key to the solution is an adaptive, unsupervised anomaly detection algorithm based on Gaussian Mixture Models (GMMs): a short calibration phase fits a model of normal behavior, real-time torque signals are then scored by their Mahalanobis distance from the GMM, statistically derived thresholds trigger anomaly flags, and the model parameters are periodically updated to adapt to changing environmental conditions.

链接: https://arxiv.org/abs/2507.07714
作者: Julio Garrido,Javier Vales,Diego Silva-Muñiz,Enrique Riveiro,Pablo López-Matencio,Josué Rivera-Andrade
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 8 figures, 1 table, to be submitted to Advanced Intelligent Systems

点击查看摘要

Abstract:Cable-Driven Parallel Robots (CDPRs) are increasingly used for load manipulation tasks involving predefined toolpaths with intermediate stops. At each stop, where the platform maintains a fixed pose and the motors keep the cables under tension, the system must evaluate whether it is safe to proceed by detecting anomalies that could compromise performance (e.g., wind gusts or cable impacts). This paper investigates whether anomalies can be detected using only motor torque data, without additional sensors. It introduces an adaptive, unsupervised outlier detection algorithm based on Gaussian Mixture Models (GMMs) to identify anomalies from torque signals. The method starts with a brief calibration period, just a few seconds, during which a GMM is fit on known anomaly-free data. Real-time torque measurements are then evaluated using Mahalanobis distance from the GMM, with statistically derived thresholds triggering anomaly flags. Model parameters are periodically updated using the latest segments identified as anomaly-free to adapt to changing conditions. Validation includes 14 long-duration test sessions simulating varied wind intensities. The proposed method achieves a 100% true positive rate and 95.4% average true negative rate, with 1-second detection latency. Comparative evaluation against power threshold and non-adaptive GMM methods indicates higher robustness to drift and environmental variation.
zh
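The detection loop described in the abstract maps almost line-for-line onto scikit-learn: fit a GMM on calibration data, score new samples by Mahalanobis distance to the nearest component, and threshold at a high calibration percentile. Component count, feature choice, and the percentile below are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Calibration: a few seconds of known anomaly-free torque windows (toy 2-D features here).
calib = rng.normal(loc=[1.0, -0.5], scale=0.1, size=(500, 2))
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(calib)
precisions = np.linalg.inv(gmm.covariances_)

def mahalanobis(x):
    # Distance to the closest mixture component.
    d = [np.sqrt((x - m) @ P @ (x - m)) for m, P in zip(gmm.means_, precisions)]
    return min(d)

# Statistically derived threshold from the calibration data (a high percentile here).
threshold = np.percentile([mahalanobis(x) for x in calib], 99.5)

def is_anomaly(x):
    return mahalanobis(x) > threshold

print(is_anomaly(np.array([1.0, -0.5])))   # False: nominal torque
print(is_anomaly(np.array([2.5,  1.0])))   # True: e.g., a wind gust or cable impact
# In deployment, the GMM is periodically refit on the latest anomaly-free segments
# so that slow drifts in operating conditions do not trigger false alarms.
```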

[AI-19] PlanQA: A Benchmark for Spatial Reasoning in LLM s using Structured Representations

【速读】: This paper addresses the shortcomings of large language models (LLMs) in geometric and spatial reasoning, especially their limitations when handling real-world layouts. The key to the solution is PlanQA, a diagnostic benchmark grounded in structured representations of indoor scenes such as kitchens, living rooms, and bedrooms, encoded in symbolic formats (e.g., JSON, XML layouts), with diverse question types covering metric and topological reasoning as well as interior design constraints. The benchmark exposes clear deficiencies of current LLMs in simulating physical constraints, maintaining spatial coherence, and generalizing under layout perturbations.

链接: https://arxiv.org/abs/2507.07644
作者: Fedor Rodionov,Abdelrahman Eldesokey,Michael Birsak,John Femiani,Bernard Ghanem,Peter Wonka
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 18 figures. Diagnostic benchmark for spatial reasoning in LLMs. Project page: this https URL

点击查看摘要

Abstract:We introduce PlanQA, a diagnostic benchmark for evaluating geometric and spatial reasoning in large-language models (LLMs). PlanQA is grounded in structured representations of indoor scenes, such as kitchens, living rooms, and bedrooms, encoded in a symbolic format (e.g., JSON, XML layouts). The benchmark includes diverse question types that test not only metric and topological reasoning (e.g., distance, visibility, shortest paths) but also interior design constraints such as affordance, clearance, balance, and usability. Our results across a variety of frontier open-source and commercial LLMs show that while models may succeed in shallow queries, they often fail to simulate physical constraints, preserve spatial coherence, or generalize under layout perturbation. PlanQA uncovers a clear blind spot in today’s LLMs: they do not consistently reason about real-world layouts. We hope that this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
zh

[AI-20] ransformEEG: Towards Improving Model Generalizability in Deep Learning-based EEG Parkinsons Disease Detection

【速读】: This paper addresses the poor generalizability of current deep learning electroencephalography (EEG) models for Parkinson's disease (PD) detection caused by high inter-subject variability. The key to the solution is TransformEEG, a hybrid Convolutional-Transformer architecture whose core innovation is a depthwise convolutional tokenizer: it generates tokens composed of channel-specific features, enabling more effective feature mixing within the self-attention layers of the transformer encoder.

链接: https://arxiv.org/abs/2507.07622
作者: Federico Del Pup,Riccardo Brun,Filippo Iotti,Edoardo Paccagnella,Mattia Pezzato,Sabrina Bertozzo,Andrea Zanola,Louis Fabrice Tshimanga,Henning Müller,Manfredo Atzori
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted for possible publication. GitHub repository: see this https URL

点击查看摘要

Abstract:Electroencephalography (EEG) is establishing itself as an important, low-cost, noninvasive diagnostic tool for the early detection of Parkinson’s Disease (PD). In this context, EEG-based Deep Learning (DL) models have shown promising results due to their ability to discover highly nonlinear patterns within the signal. However, current state-of-the-art DL models suffer from poor generalizability caused by high inter-subject variability. This high variability underscores the need for enhancing model generalizability by developing new architectures better tailored to EEG data. This paper introduces TransformEEG, a hybrid Convolutional-Transformer designed for Parkinson’s disease detection using EEG data. Unlike transformer models based on the EEGNet structure, TransformEEG incorporates a depthwise convolutional tokenizer. This tokenizer is specialized in generating tokens composed by channel-specific features, which enables more effective feature mixing within the self-attention layers of the transformer encoder. To evaluate the proposed model, four public datasets comprising 290 subjects (140 PD patients, 150 healthy controls) were harmonized and aggregated. A 10-outer, 10-inner Nested-Leave-N-Subjects-Out (N-LNSO) cross-validation was performed to provide an unbiased comparison against seven other consolidated EEG deep learning models. TransformEEG achieved the highest median balanced accuracy (78.45%) as well as the lowest interquartile range (6.37%) across all the N-LNSO partitions. When combined with data augmentation and threshold correction, median accuracy increased to 80.10%, with an interquartile range of 5.74%. In conclusion, TransformEEG produces more consistent and less skewed results. It demonstrates a substantial reduction in variability and more reliable PD detection using EEG data compared to the other investigated models.
zh

[AI-21] owards conservative inference in credal networks using belief functions: the case of credal chains

【速读】: This paper addresses belief inference in credal networks, propagating uncertainty via Dempster-Shafer theory. The key to the solution is a novel framework for efficiently propagating uncertainty through a subclass of credal networks, namely chains, which uses belief and plausibility functions to yield conservative intervals, combining computational speed with robust uncertainty representation.

链接: https://arxiv.org/abs/2507.07619
作者: Marco Sangalli,Thomas Krak,Cassio de Campos
机构: 未知
类目: Artificial Intelligence (cs.AI); Probability (math.PR)
备注:

点击查看摘要

Abstract:This paper explores belief inference in credal networks using Dempster-Shafer theory. By building on previous work, we propose a novel framework for propagating uncertainty through a subclass of credal networks, namely chains. The proposed approach efficiently yields conservative intervals through belief and plausibility functions, combining computational speed with robust uncertainty representation. Key contributions include formalizing belief-based inference methods and comparing belief-based inference against classical sensitivity analysis. Numerical results highlight the advantages and limitations of applying belief inference within this framework, providing insights into its practical utility for chains and for credal networks in general.
zh

[AI-22] Enhancing Vaccine Safety Surveillance: Extracting Vaccine Mentions from Emergency Department Triage Notes Using Fine-Tuned Large Language Models

【速读】: This paper addresses the extraction of vaccine-related information from emergency department triage notes to support near real-time vaccine safety surveillance. The key to the solution is to build a labeled dataset via prompt engineering, fine-tune Llama 3.2 models to improve the accuracy of vaccine name extraction, and use model quantization for efficient deployment in resource-constrained environments.

链接: https://arxiv.org/abs/2507.07599
作者: Sedigh Khademi,Jim Black,Christopher Palmer,Muhammad Javed,Hazel Clothier,Jim Buttery,Gerardo Luis Dimaguila
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages

点击查看摘要

Abstract:This study evaluates fine-tuned Llama 3.2 models for extracting vaccine-related information from emergency department triage notes to support near real-time vaccine safety surveillance. Prompt engineering was used to initially create a labeled dataset, which was then confirmed by human annotators. The performance of prompt-engineered models, fine-tuned models, and a rule-based approach was compared. The fine-tuned 3-billion-parameter Llama model outperformed the other models in the accuracy of vaccine name extraction. Model quantization enabled efficient deployment in resource-constrained environments. Findings demonstrate the potential of large language models in automating data extraction from emergency department notes, supporting efficient vaccine safety surveillance and early detection of emerging adverse events following immunization.
zh

[AI-23] Context Pooling: Query-specific Graph Pooling for Generic Inductive Link Prediction in Knowledge Graphs

【速读】: This paper addresses the finding that vanilla aggregation contributes little to the performance of Graph Neural Network (GNN)-based models for link prediction on Knowledge Graphs (KGs). The key to the solution is Context Pooling, the first method to apply graph pooling in KGs; it is also the first to enable the generation of query-specific graphs in inductive settings, effectively identifying logically relevant neighbors and improving link prediction accuracy.

链接: https://arxiv.org/abs/2507.07595
作者: Zhixiang Su,Di Wang,Chunyan Miao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent investigations on the effectiveness of Graph Neural Network (GNN)-based models for link prediction in Knowledge Graphs (KGs) show that vanilla aggregation does not significantly impact the model performance. In this paper, we introduce a novel method, named Context Pooling, to enhance GNN-based models’ efficacy for link prediction in KGs. To the best of our knowledge, Context Pooling is the first methodology that applies graph pooling in KGs. Additionally, Context Pooling is the first of its kind to enable the generation of query-specific graphs for inductive settings, where testing entities are unseen during training. Specifically, we devise two metrics, namely neighborhood precision and neighborhood recall, to assess the neighbors’ logical relevance regarding the given queries, thereby enabling the subsequent comprehensive identification of only the logically relevant neighbors for link prediction. Our method is generic and is assessed by applying it to two state-of-the-art (SOTA) models on three public transductive and inductive datasets, achieving SOTA performance in 42 out of 48 settings.
zh

[AI-24] On Trustworthy Rule-Based Models and Explanations

【速读】: This paper addresses the reliability of explanations for machine learning (ML) model predictions, especially the rigor of explanations in high-risk domains, where incorrect explanations can mislead human decision makers. The key to the solution is to analyze undesired facets of rule-based ML models, such as negative overlap and several forms of redundancy, and to develop algorithms that detect these facets. The study concludes that rule sets induced by widely used rule learning tools tend to exhibit one or more of these negative facets.

链接: https://arxiv.org/abs/2507.07576
作者: Mohamed Siala,Jordi Planes,Joao Marques-Silva
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:A task of interest in machine learning (ML) is that of ascribing explanations to the predictions made by ML models. Furthermore, in domains deemed high risk, the rigor of explanations is paramount. Indeed, incorrect explanations can and will mislead human decision makers. As a result, and even if interpretability is acknowledged as an elusive concept, so-called interpretable models are employed ubiquitously in high-risk uses of ML and data mining (DM). This is the case for rule-based ML models, which encompass decision trees, diagrams, sets and lists. This paper relates explanations with well-known undesired facets of rule-based ML models, which include negative overlap and several forms of redundancy. The paper develops algorithms for the analysis of these undesired facets of rule-based systems, and concludes that well-known and widely used tools for learning rule-based ML models will induce rule sets that exhibit one or more negative facets.
zh

[AI-25] ArchiveGPT : A human-centered evaluation of using a vision language model for image cataloguing

【速读】: This paper addresses the fact that the rapid growth of photographic collections has outpaced manual cataloguing, and explores the potential of generative AI for producing catalogue descriptions of archival and museum collections. The key to the solution is to use a vision language model (VLM) to automatically generate catalogue descriptions and to evaluate their quality and trustworthiness in experiments with experts and non-experts, thereby informing how AI can be integrated into cataloguing workflows. The study stresses that, despite reasonable accuracy and usefulness in some respects, AI-generated descriptions still require human review to ensure accuracy and reliability in specialized domains such as archaeology, and it identifies both technical progress and the establishment of professional trust as the core factors for effective integration.

链接: https://arxiv.org/abs/2507.07551
作者: Line Abele,Gerrit Anders,Tolgahan Aydın,Jürgen Buder,Helen Fischer,Dominik Kimmel,Markus Huff
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: 56 pages, 7 figures

点击查看摘要

Abstract:The accelerating growth of photographic collections has outpaced manual cataloguing, motivating the use of vision language models (VLMs) to automate metadata generation. This study examines whether AI-generated catalogue descriptions can approximate human-written quality and how generative AI might integrate into cataloguing workflows in archival and museum collections. A VLM (InternVL2) generated catalogue descriptions for photographic prints on labelled cardboard mounts with archaeological content, evaluated by archive and archaeology experts and non-experts in a human-centered, experimental framework. Participants classified descriptions as AI-generated or expert-written, rated quality, and reported willingness to use and trust in AI tools. Classification performance was above chance level, with both groups underestimating their ability to detect AI-generated descriptions. OCR errors and hallucinations limited perceived quality, yet descriptions rated higher in accuracy and usefulness were harder to classify, suggesting that human review is necessary to ensure the accuracy and quality of catalogue descriptions generated by the out-of-the-box model, particularly in specialized domains like archaeological cataloguing. Experts showed lower willingness to adopt AI tools, emphasizing concerns about preservation responsibility over technical performance. These findings advocate for a collaborative approach where AI supports draft generation but remains subordinate to human verification, ensuring alignment with curatorial values (e.g., provenance, transparency). The successful integration of this approach depends not only on technical advancements, such as domain-specific fine-tuning, but even more on establishing trust among professionals, both of which could be fostered through a transparent and explainable AI pipeline.
zh

[AI-26] Position: We Need An Algorithmic Understanding of Generative AI ICML2025

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在解决问题时所学习和使用的算法机制不明确的问题,旨在填补理论与实证研究的空白。其解决方案的关键在于提出AlgEval框架,通过系统性研究LLMs的隐层表示、注意力机制和推理阶段的计算,揭示其学习到的算法基元及其组合方式,从而理解模型如何解决特定任务。该框架强调了自上而下假设的形成与自下而上验证的结合,为深入解析模型内部计算提供了严谨的评估方法。

链接: https://arxiv.org/abs/2507.07544
作者: Oliver Eberle,Thomas McGee,Hamza Giaffar,Taylor Webb,Ida Momennejad
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICML 2025 as a Spotlight Position Paper

点击查看摘要

Abstract:What algorithms do LLMs actually learn and use to solve problems? Studies addressing this question are sparse, as research priorities are focused on improving performance through scale, leaving a theoretical and empirical gap in understanding emergent algorithms. This position paper proposes AlgEval: a framework for systematic research into the algorithms that LLMs learn and use. AlgEval aims to uncover algorithmic primitives, reflected in latent representations, attention, and inference-time compute, and their algorithmic composition to solve task-specific problems. We highlight potential methodological paths and a case study toward this goal, focusing on emergent search algorithms. Our case study illustrates both the formation of top-down hypotheses about candidate algorithms, and bottom-up tests of these hypotheses via circuit-level analysis of attention patterns and hidden states. The rigorous, systematic evaluation of how LLMs actually solve tasks provides an alternative to resource-intensive scaling, reorienting the field toward a principled understanding of underlying computations. Such algorithmic explanations offer a pathway to human-understandable interpretability, enabling comprehension of the model’s internal reasoning performance measures. This can in turn lead to more sample-efficient methods for training and improving performance, as well as novel architectures for end-to-end and multi-agent systems.
zh

[AI-27] Neural Concept Verifier: Scaling Prover-Verifier Games via Concept Encodings

【速读】:该论文试图解决在高维输入(如图像)中实现可解释且非线性的分类模型的问题,同时克服现有方法在可验证性和避免捷径行为方面的局限性。其解决方案的关键在于提出一种统一框架——神经概念验证器(Neural Concept Verifier, NCV),该框架结合了Prover-Verifier Games (PVGs)与概念编码,通过最小监督的概念发现模型从原始输入中提取结构化的概念编码,并由验证器(作为非线性预测器)仅基于这些编码进行决策,从而实现了可解释的非线性分类。

链接: https://arxiv.org/abs/2507.07532
作者: Berkant Turan,Suhrab Asadulla,David Steinmann,Wolfgang Stammer,Sebastian Pokutta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures, 8 tables

点击查看摘要

Abstract:While Prover-Verifier Games (PVGs) offer a promising path toward verifiability in nonlinear classification models, they have not yet been applied to complex inputs such as high-dimensional images. Conversely, Concept Bottleneck Models (CBMs) effectively translate such data into interpretable concepts but are limited by their reliance on low-capacity linear predictors. In this work, we introduce the Neural Concept Verifier (NCV), a unified framework combining PVGs with concept encodings for interpretable, nonlinear classification in high-dimensional settings. NCV achieves this by utilizing recent minimally supervised concept discovery models to extract structured concept encodings from raw inputs. A prover then selects a subset of these encodings, which a verifier – implemented as a nonlinear predictor – uses exclusively for decision-making. Our evaluations show that NCV outperforms CBM and pixel-based PVG classifier baselines on high-dimensional, logically complex datasets and also helps mitigate shortcut behavior. Overall, we demonstrate NCV as a promising step toward performative, verifiable AI.
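
为直观说明 Prover-Verifier 的分工(prover 只负责挑选概念编码,verifier 仅凭被选中的编码做非线性决策),下面给出一个极简的 PyTorch 草图。其中概念数、维度与硬性 top-k 选择均为演示而设的假设,并非论文的具体实现(真实方法还需可微的选择机制与博弈式训练目标):

```python
import torch, torch.nn as nn

N_CONCEPTS, D, K, N_CLASSES = 16, 8, 4, 3      # 概念数、维度、选取数、类别数(假设)

prover = nn.Linear(D, 1)                        # prover:为每个概念编码打分
verifier = nn.Sequential(                       # verifier:仅凭被选中的编码做非线性分类
    nn.Linear(N_CONCEPTS * D, 32), nn.ReLU(), nn.Linear(32, N_CLASSES))

def prover_verifier(concepts):                  # concepts: (B, N_CONCEPTS, D)
    scores = prover(concepts).squeeze(-1)       # (B, N_CONCEPTS)
    topk = scores.topk(K, dim=-1).indices
    mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
    selected = concepts * mask.unsqueeze(-1)    # 未被选中的概念置零(简化的硬选择)
    return verifier(selected.flatten(1))

logits = prover_verifier(torch.randn(2, N_CONCEPTS, D))
print(logits.shape)                             # torch.Size([2, 3])
```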
zh

[AI-28] StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

【速读】:该论文旨在解决当前基准测试很少同时评估自主代理在生产活动和社会互动方面的能力的问题,从而推动其在复杂生产-生活环境中实现更稳健的开放性代理研究。解决方案的关键在于引入StarDojo,这是一个基于Stardew Valley的新型基准,它通过设计涵盖五个关键领域(农业、手工艺、探索、战斗和社交互动)的1,000项精心策划任务,以及一个包含100项代表性任务的精简子集,来全面评估AI代理在开放式的生产-生活模拟中的表现。此外,StarDojo提供了一个统一且用户友好的界面,支持多平台运行和并行执行多个环境实例,以适应基于多模态大语言模型(MLLMs)的高级代理的评估需求。

链接: https://arxiv.org/abs/2507.07445
作者: Weihao Tan,Changjiu Jiang,Yu Duan,Mingcong Lei,Jiageng Li,Yitian Hong,Xinrun Wang,Bo An
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Project website: this https URL

点击查看摘要

Abstract:Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked to perform essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents, powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLMs agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production-living environments.
zh

[AI-29] DrugMCTS: a drug repurposing framework combining multi-agent RAG and Monte Carlo Tree Search

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在药物再利用任务中因推理超出预训练知识范围而效果受限的问题。传统方法如微调或检索增强生成(Retrieval-Augmented Generation, RAG)在计算开销或对结构化科学数据利用不足方面存在局限。论文提出的解决方案关键在于DrugMCTS框架,该框架通过融合RAG、多智能体协作和蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS),实现了结构化和迭代式的推理,无需领域特定微调即可显著提升模型性能。

链接: https://arxiv.org/abs/2507.07426
作者: Zerui Yang,Yuwei Wan,Yinqiao Li,Yudai Matsuda,Tong Xie,Linqi Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug discovery. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pretraining. Conventional approaches, such as fine-tuning or retrieval-augmented generation, face limitations in either imposing high computational overhead or failing to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that synergistically integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search for drug repurposing. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Without requiring domain-specific fine-tuning, DrugMCTS empowers Qwen2.5-7B-Instruct to outperform Deepseek-R1 by over 20%. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general-purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent-based collaboration, and feedback-driven search mechanisms in advancing LLM applications for drug discovery.
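
论文的核心搜索组件是 MCTS;下面用一个玩具例子示意其中 UCT(上置信界)选择项的计算方式。候选药物名与统计量均为假设数据,仅展示树搜索中"探索-利用"的打分逻辑,与论文的五智能体检索框架无关:

```python
import math

def uct(total_reward, visits, parent_visits, c=1.4):
    if visits == 0:
        return float("inf")                     # 未访问的子节点优先展开
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

# 候选分子节点:(累计奖励, 访问次数),均为假设数据
children = {"drug_A": (3.2, 5), "drug_B": (1.0, 1), "drug_C": (0.0, 0)}
N = sum(v for _, v in children.values()) + 1    # 父节点访问次数
best = max(children, key=lambda k: uct(*children[k], N))
print(best)                                     # drug_C:未访问,UCT 为 +inf
```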
zh

[AI-30] Optimal Auction Design in the Joint Advertising ICML2025

【速读】:该论文试图解决联合广告(joint advertising)中现有机制无法实现最优分配的问题,这些问题主要源于现有方法过于关注单个广告主而忽视了广告包(bundle)的结构。论文提出的关键解决方案是设计一种基于广告包的神经网络方法——BundleNet,该方法专门用于多槽位联合广告场景,通过深度学习模型优化广告分配策略,从而在提升平台收益的同时保证近似主导策略激励相容性和个体理性。

链接: https://arxiv.org/abs/2507.07418
作者: Yang Li,Yuchao Ma,Qi Qi
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025 (International Conference on Machine Learning). 17 pages, 4 figures

点击查看摘要

Abstract:Online advertising is a vital revenue source for major internet platforms. Recently, joint advertising, which assigns a bundle of two advertisers in an ad slot instead of allocating a single advertiser, has emerged as an effective method for enhancing allocation efficiency and revenue. However, existing mechanisms for joint advertising fail to realize the optimality, as they tend to focus on individual advertisers and overlook bundle structures. This paper identifies an optimal mechanism for joint advertising in a single-slot setting. For multi-slot joint advertising, we propose BundleNet, a novel bundle-based neural network approach specifically designed for joint advertising. Our extensive experiments demonstrate that the mechanisms generated by BundleNet approximate the theoretical analysis results in the single-slot setting and achieve state-of-the-art performance in the multi-slot setting. This significantly increases platform revenue while ensuring approximate dominant strategy incentive compatibility and individual rationality.
zh

[AI-31] Autonomous AI-based Cybersecurity Framework for Critical Infrastructure: Real-Time Threat Mitigation

【速读】:该论文试图解决关键基础设施系统在日益互联的环境下所面临的网络安全威胁问题,包括勒索软件、拒绝服务(DoS)攻击和高级持续性威胁(APTs)。解决方案的关键在于提出一种混合人工智能(AI)驱动的网络安全框架,以增强实时漏洞检测、威胁建模和自动化修复能力。该框架旨在提升关键基础设施系统对新兴网络威胁的安全性和韧性。

链接: https://arxiv.org/abs/2507.07416
作者: Jenifer Paulraj,Brindha Raghuraman,Nagarani Gopalakrishnan,Yazan Otoum
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 7 pages, IEEE conference

点击查看摘要

Abstract:Critical infrastructure systems, including energy grids, healthcare facilities, transportation networks, and water distribution systems, are pivotal to societal stability and economic resilience. However, the increasing interconnectivity of these systems exposes them to various cyber threats, including ransomware, Denial-of-Service (DoS) attacks, and Advanced Persistent Threats (APTs). This paper examines cybersecurity vulnerabilities in critical infrastructure, highlighting the threat landscape, attack vectors, and the role of Artificial Intelligence (AI) in mitigating these risks. We propose a hybrid AI-driven cybersecurity framework to enhance real-time vulnerability detection, threat modelling, and automated remediation. This study also addresses the complexities of adversarial AI, regulatory compliance, and integration. Our findings provide actionable insights to strengthen the security and resilience of critical infrastructure systems against emerging cyber threats.
zh

[AI-32] Hybrid LLM-Enhanced Intrusion Detection for Zero-Day Threats in IoT Networks

【速读】:该论文试图解决传统基于特征的入侵检测方法在面对日益复杂和未知攻击模式时的局限性,特别是在分布式、异构和资源受限环境中。解决方案的关键在于将传统签名检测方法与GPT-2大语言模型(Large Language Model, LLM)的上下文理解能力相结合,利用GPT-2处理非结构化数据和识别复杂语义关系的优势,提升对新型攻击向量的检测能力,从而构建一个动态、自适应的入侵检测系统(Intrusion Detection System, IDS)。

链接: https://arxiv.org/abs/2507.07413
作者: Mohammad F. Al-Hammouri,Yazan Otoum,Rasha Atwa,Amiya Nayak
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, IEEE conference

点击查看摘要

Abstract:This paper presents a novel approach to intrusion detection by integrating traditional signature-based methods with the contextual understanding capabilities of the GPT-2 Large Language Model (LLM). As cyber threats become increasingly sophisticated, particularly in distributed, heterogeneous, and resource-constrained environments such as those enabled by the Internet of Things (IoT), the need for dynamic and adaptive Intrusion Detection Systems (IDSs) becomes increasingly urgent. While traditional methods remain effective for detecting known threats, they often fail to recognize new and evolving attack patterns. In contrast, GPT-2 excels at processing unstructured data and identifying complex semantic relationships, making it well-suited to uncovering subtle, zero-day attack vectors. We propose a hybrid IDS framework that merges the robustness of signature-based techniques with the adaptability of GPT-2-driven semantic analysis. Experimental evaluations on a representative intrusion dataset demonstrate that our model enhances detection accuracy by 6.3%, reduces false positives by 9.0%, and maintains near real-time responsiveness. These results affirm the potential of language model integration to build intelligent, scalable, and resilient cybersecurity defences suited for modern connected environments.
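
混合检测的决策逻辑可以概括为:签名规则优先命中,未命中时退回语义通道兜底。下面是一个自包含的示意草图,其中签名与阈值均为假设,semantic_score 只是占位实现,实际系统中应替换为 GPT-2 的困惑度或分类头输出:

```python
import re

SIGNATURES = [r"union\s+select", r"\.\./\.\./", r"cmd\.exe"]   # 示例签名(假设)

def semantic_score(payload: str) -> float:
    """占位的语义异常评分;实际系统中可换成 GPT-2 困惑度或分类头输出。"""
    suspicious_tokens = {"token", "overflow", "shell", "base64"}
    words = payload.lower().split()
    return sum(w in suspicious_tokens for w in words) / max(len(words), 1)

def detect(payload: str, threshold: float = 0.3) -> str:
    if any(re.search(sig, payload, re.IGNORECASE) for sig in SIGNATURES):
        return "alert:signature"                 # 已知威胁:规则直接命中
    if semantic_score(payload) >= threshold:
        return "alert:semantic"                  # 未知/零日模式:语义通道兜底
    return "allow"

print(detect("GET /index.php?id=1 UNION SELECT password"))   # alert:signature
print(detect("run shell with base64 token payload"))         # alert:semantic
```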
zh

[AI-33] Phishing Detection in the Gen-AI Era: Quantized LLMs vs Classical Models

【速读】:该论文试图解决网络钓鱼攻击检测中高精度与计算效率之间的平衡问题。其解决方案的关键在于对比评估传统机器学习(ML)、深度学习(DL)以及量化小参数大语言模型(LLM)在该任务中的表现,并探索轻量级LLM在准确性和计算资源消耗方面的潜力。研究发现,尽管当前LLM在原始准确率上仍低于ML和DL方法,但其在识别基于上下文的隐蔽钓鱼线索方面展现出优势,并且通过优化可实现较高的准确率(如DeepSeek R1 Distill Qwen 14B模型达到80%以上),同时具备较低的显存占用和解释性,支持其在实际系统中的高效部署与实时决策应用。

链接: https://arxiv.org/abs/2507.07406
作者: Jikesh Thapa,Gurrehmat Chahal,Serban Voinea Gabreanu,Yazan Otoum
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 Pages, IEEE Conference

点击查看摘要

Abstract:Phishing attacks are becoming increasingly sophisticated, underscoring the need for detection systems that strike a balance between high accuracy and computational efficiency. This paper presents a comparative evaluation of traditional Machine Learning (ML), Deep Learning (DL), and quantized small-parameter Large Language Models (LLMs) for phishing detection. Through experiments on a curated dataset, we show that while LLMs currently underperform compared to ML and DL methods in terms of raw accuracy, they exhibit strong potential for identifying subtle, context-based phishing cues. We also investigate the impact of zero-shot and few-shot prompting strategies, revealing that LLM-rephrased emails can significantly degrade the performance of both ML and LLM-based detectors. Our benchmarking highlights that models like DeepSeek R1 Distill Qwen 14B (Q8_0) achieve competitive accuracy, above 80%, using only 17GB of VRAM, supporting their viability for cost-efficient deployment. We further assess the models’ adversarial robustness and cost-performance tradeoffs, and demonstrate how lightweight LLMs can provide concise, interpretable explanations to support real-time decision-making. These findings position optimized LLMs as promising components in phishing defence systems and offer a path forward for integrating explainable, efficient AI into modern cybersecurity frameworks.
zh

[AI-34] HGMP: Heterogeneous Graph Multi-Task Prompt Learning IJCAI2025

【速读】:该论文试图解决预训练与微调方法在异构图神经网络中因预训练模型与下游任务之间存在不匹配而导致性能不佳的问题,以及现有图提示学习方法在异构图领域难以有效整合对比预训练策略的局限性。其解决方案的关键在于提出一种名为HGMP的多任务提示框架,通过将所有下游任务统一为图级任务格式以弥合预训练模型与任务之间的差距,并设计一种图级对比预训练策略以更好地利用异构信息,同时引入异构特征提示以优化输入图特征的表示。

链接: https://arxiv.org/abs/2507.07405
作者: Pengfei Jiao,Jialong Ni,Di Jin,Xuan Guo,Huan Liu,Hongjiang Chen,Yanxian Bi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The 25th International Joint Conference on Artificial Intelligence (IJCAI-25)

点击查看摘要

Abstract:The pre-training and fine-tuning methods have gained widespread attention in the field of heterogeneous graph neural networks due to their ability to leverage large amounts of unlabeled data during the pre-training phase, allowing the model to learn rich structural features. However, these methods face the issue of a mismatch between the pre-trained model and downstream tasks, leading to suboptimal performance in certain application scenarios. Prompt learning methods have emerged as a new direction in heterogeneous graph tasks, as they allow flexible adaptation of task representations to address target inconsistency. Building on this idea, this paper proposes a novel multi-task prompt framework for the heterogeneous graph domain, named HGMP. First, to bridge the gap between the pre-trained model and downstream tasks, we reformulate all downstream tasks into a unified graph-level task format. Next, we address the limitations of existing graph prompt learning methods, which struggle to integrate contrastive pre-training strategies in the heterogeneous graph domain. We design a graph-level contrastive pre-training strategy to better leverage heterogeneous information and enhance performance in multi-task scenarios. Finally, we introduce heterogeneous feature prompts, which enhance model performance by refining the representation of input graph features. Experimental results on public datasets show that our proposed method adapts well to various tasks and significantly outperforms baseline methods.
zh

[AI-35] Generalized Tree Edit Distance (GTED): A Faithful Evaluation Metric for Statement Autoformalization ICML25

【速读】:该论文试图解决自动形式化(statement autoformalization)中缺乏鲁棒的自动化评估指标的问题。现有评估方法在语义理解、计算成本以及自动化定理证明进展方面存在局限。解决方案的关键在于提出一种名为GTED(Generalized Tree Edit Distance)的新评估框架,该框架首先将形式化语句标准化并转换为操作符树,然后通过GTED度量确定语义相似性,从而在miniF2F和ProofNet基准测试中实现了最高的准确率和Kappa分数。

链接: https://arxiv.org/abs/2507.07399
作者: Yuntian Liu,Tao Zhu,Xiaoyang Liu,Yu Chen,Zhaoxuan Liu,Qingfeng Guo,Jiashuo Zhang,Kangjie Bao,Tao Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to AI4Math@ICML25

点击查看摘要

Abstract:Statement autoformalization, the automated translation of statement from natural language into formal languages, has become a subject of extensive research, yet the development of robust automated evaluation metrics remains limited. Existing evaluation methods often lack semantic understanding, face challenges with high computational costs, and are constrained by the current progress of automated theorem proving. To address these issues, we propose GTED (Generalized Tree Edit Distance), a novel evaluation framework that first standardizes formal statements and converts them into operator trees, then determines the semantic similarity using the eponymous GTED metric. On the miniF2F and ProofNet benchmarks, GTED outperforms all baseline metrics by achieving the highest accuracy and Kappa scores, thus providing the community with a more faithful metric for automated evaluation. The code and experimental results are available at this https URL.
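
GTED 的主干步骤是"标准化语句 → 算子树 → 树编辑距离 → 归一化相似度"。下面用第三方库 zss(Zhang-Shasha 树编辑距离的一个实现)给出示意;树为手工构造的假设例子,归一化方式也只是演示,并非论文给出的确切度量:

```python
from zss import Node, simple_distance           # pip install zss

# 将 "a + b * c" 与 "a + c * b" 手工构造为算子树(实际应由解析器从标准化语句生成)
t1 = Node("+").addkid(Node("a")).addkid(Node("*").addkid(Node("b")).addkid(Node("c")))
t2 = Node("+").addkid(Node("a")).addkid(Node("*").addkid(Node("c")).addkid(Node("b")))

def tree_size(node):
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

dist = simple_distance(t1, t2)                  # Zhang-Shasha 树编辑距离
sim = 1 - dist / max(tree_size(t1), tree_size(t2))   # 演示用的朴素归一化
print(dist, round(sim, 3))
```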
zh

[AI-36] PILOC: A Pheromone Inverse Guidance Mechanism and Local-Communication Framework for Dynamic Target Search of Multi-Agent in Unknown Environments

【速读】:该论文旨在解决多智能体搜索与救援(MASAR)中动态和未知环境下的目标不可预测性和环境不确定性问题。其解决方案的关键在于提出PILOC框架,该框架不依赖全局先验知识,而是利用局部感知和通信,并引入信息素逆向引导机制,以实现高效的协作和动态目标定位。该机制将信息素概念嵌入深度强化学习(DRL)的观察空间,支持基于环境线索的间接智能体协调,从而提升系统的搜索效率、适应性和鲁棒性。

链接: https://arxiv.org/abs/2507.07376
作者: Hengrui Liu,Yi Feng,Qilong Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-Agent Search and Rescue (MASAR) plays a vital role in disaster response, exploration, and reconnaissance. However, dynamic and unknown environments pose significant challenges due to target unpredictability and environmental uncertainty. To tackle these issues, we propose PILOC, a framework that operates without global prior knowledge, leveraging local perception and communication. It introduces a pheromone inverse guidance mechanism to enable efficient coordination and dynamic target localization. PILOC promotes decentralized cooperation through local communication, significantly reducing reliance on global channels. Unlike conventional heuristics, the pheromone mechanism is embedded into the observation space of Deep Reinforcement Learning (DRL), supporting indirect agent coordination based on environmental cues. We further integrate this strategy into a DRL-based multi-agent architecture and conduct extensive experiments. Results show that combining local communication with pheromone-based guidance significantly boosts search efficiency, adaptability, and system robustness. Compared to existing methods, PILOC performs better under dynamic and communication-constrained scenarios, offering promising directions for future MASAR applications.
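
"信息素逆向引导"的直觉是:智能体在走过的位置沉积信息素,并偏好信息素浓度低(即较少被探索)的邻格,从而提升覆盖效率。下面把这一机制单独抽出来,给出一个最小网格示意(蒸发率、沉积量等参数均为假设;论文中该信号是嵌入 DRL 观察空间的,而非像这里一样直接用作移动规则):

```python
import numpy as np

GRID, RHO, DEPOSIT = 10, 0.05, 1.0            # 网格边长、蒸发率、沉积量(均为假设)
pher = np.zeros((GRID, GRID))
pos = (0, 0)

def step(pos, pher):
    pher *= (1 - RHO)                          # 全局蒸发
    pher[pos] += DEPOSIT                       # 在当前位置沉积
    x, y = pos
    moves = [(x + dx, y + dy) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
             if 0 <= x + dx < GRID and 0 <= y + dy < GRID]
    # 逆向引导:选择信息素最低的邻格,鼓励覆盖未探索区域
    return min(moves, key=lambda m: pher[m])

for _ in range(20):
    pos = step(pos, pher)
print(pos)
```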
zh

[AI-37] Atherosclerosis through Hierarchical Explainable Neural Network Analysis

【速读】:该论文旨在解决亚临床动脉粥样硬化个性化分类的问题,特别是如何有效整合患者群体层面的特征与个体分子数据,以提升分类性能并实现对患者亚型的机制性理解。其解决方案的关键在于提出ATHENA框架,该框架通过集成多模态学习构建层次化网络表示,并优化反映个体组学数据的患者特异性分子指纹,同时确保与群体层面模式的一致性。

链接: https://arxiv.org/abs/2507.07373
作者: Irsyad Adam,Steven Swee,Erika Yilin,Ethan Ji,William Speier,Dean Wang,Alex Bui,Wei Wang,Karol Watson,Peipei Ping
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we study the problem pertaining to personalized classification of subclinical atherosclerosis by developing a hierarchical graph neural network framework to leverage two characteristic modalities of a patient: clinical features within the context of the cohort, and molecular data unique to individual patients. Current graph-based methods for disease classification detect patient-specific molecular fingerprints, but lack consistency and comprehension regarding cohort-wide features, which are an essential requirement for understanding pathogenic phenotypes across diverse atherosclerotic trajectories. Furthermore, understanding patient subtypes often considers clinical feature similarity in isolation, without integration of shared pathogenic interdependencies among patients. To address these challenges, we introduce ATHENA: Atherosclerosis Through Hierarchical Explainable Neural Network Analysis, which constructs a novel hierarchical network representation through integrated modality learning; subsequently, it optimizes learned patient-specific molecular fingerprints that reflect individual omics data, enforcing consistency with cohort-wide patterns. With a primary clinical dataset of 391 patients, we demonstrate that this heterogeneous alignment of clinical features with molecular interaction patterns has significantly boosted subclinical atherosclerosis classification performance across various baselines by up to 13% in area under the receiver operating curve (AUC) and 20% in F1 score. Taken together, ATHENA enables mechanistically-informed patient subtype discovery through explainable AI (XAI)-driven subnetwork clustering; this novel integration framework strengthens personalized intervention strategies, thereby improving the prediction of atherosclerotic disease progression and management of their clinical actionable outcomes.
zh

[AI-38] Goal-Oriented Sequential Bayesian Experimental Design for Causal Learning

【速读】:该论文试图解决传统因果实验设计方法在有限实验预算和复杂因果机制下效率不足的问题,其核心在于如何更精准地获取用户关注的因果量信息。解决方案的关键是提出GO-CBED框架,该框架通过直接最大化用户指定因果量的期望信息增益(EIG),实现更针对性和高效的实验设计,并采用基于Transformer的策略网络与基于归一化流的变分后验联合优化的变分下界估计器,以应对精确EIG计算的不可行性,从而支持实时决策。

链接: https://arxiv.org/abs/2507.07359
作者: Zheyu Zhang,Jiayuan Dong,Jie Liu,Xun Huan (University of Michigan)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:We present GO-CBED, a goal-oriented Bayesian framework for sequential causal experimental design. Unlike conventional approaches that select interventions aimed at inferring the full causal model, GO-CBED directly maximizes the expected information gain (EIG) on user-specified causal quantities of interest, enabling more targeted and efficient experimentation. The framework is both non-myopic, optimizing over entire intervention sequences, and goal-oriented, targeting only model aspects relevant to the causal query. To address the intractability of exact EIG computation, we introduce a variational lower bound estimator, optimized jointly through a transformer-based policy network and normalizing flow-based variational posteriors. The resulting policy enables real-time decision-making via an amortized network. We demonstrate that GO-CBED consistently outperforms existing baselines across various causal reasoning and discovery tasks, including synthetic structural causal models and semi-synthetic gene regulatory networks, particularly in settings with limited experimental budgets and complex causal mechanisms. Our results highlight the benefits of aligning experimental design objectives with specific research goals and of forward-looking sequential planning.
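
GO-CBED 的优化目标是设计 d 上的期望信息增益 EIG(d) = E[log p(y|θ,d) − log p(y|d)]。论文用变分下界与策略网络来摊销这一计算;作为参照,下面给出线性高斯玩具模型上的嵌套蒙特卡洛 EIG 估计草图(模型、噪声与样本数均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.5                                     # 观测噪声标准差(假设)

def eig(design, n_outer=2000, n_inner=256):
    theta = rng.normal(size=n_outer)            # 先验 θ ~ N(0, 1)
    y = design * theta + SIGMA * rng.normal(size=n_outer)
    log_lik = -0.5 * ((y - design * theta) / SIGMA) ** 2     # 常数项在差中抵消
    theta_in = rng.normal(size=(n_inner, 1))
    log_marg = np.log(np.mean(                  # log p(y|d) ≈ log 平均 p(y|θ', d)
        np.exp(-0.5 * ((y - design * theta_in) / SIGMA) ** 2), axis=0))
    return np.mean(log_lik - log_marg)

for d in [0.1, 1.0, 3.0]:
    print(d, round(eig(d), 3))                  # 信号越强的设计,EIG 越大
```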
zh

[AI-39] Supply Chain Optimization via Generative Simulation and Iterative Decision Policies

【速读】:该论文试图解决供应链运输中高响应性和经济效率的问题,这些问题受到运输方式战略决策的影响。解决方案的关键在于提出一种名为Sim-to-Dec的集成框架,其核心包括一个生成式仿真模块,该模块利用自回归建模来模拟连续状态变化,从而减少对人工设计领域规则的依赖,并增强对数据波动的鲁棒性;以及一个历史-未来双感知决策模型,通过与仿真器的交互进行端到端优化迭代改进。

链接: https://arxiv.org/abs/2507.07355
作者: Haoyue Bai,Haoyu Wang,Nanxu Gong,Xinyuan Wang,Wangyang Ying,Haifeng Chen,Yanjie Fu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High responsiveness and economic efficiency are critical objectives in supply chain transportation, both of which are influenced by strategic decisions on shipping mode. An integrated framework combining an efficient simulator with an intelligent decision-making algorithm can provide an observable, low-risk environment for transportation strategy design. An ideal simulation-decision framework must (1) generalize effectively across various settings, (2) reflect fine-grained transportation dynamics, (3) integrate historical experience with predictive insights, and (4) maintain tight integration between simulation feedback and policy refinement. We propose Sim-to-Dec framework to satisfy these requirements. Specifically, Sim-to-Dec consists of a generative simulation module, which leverages autoregressive modeling to simulate continuous state changes, reducing dependence on handcrafted domain-specific rules and enhancing robustness against data fluctuations; and a history-future dual-aware decision model, refined iteratively through end-to-end optimization with simulator interactions. Extensive experiments conducted on three real-world datasets demonstrate that Sim-to-Dec significantly improves timely delivery rates and profit.
zh

[AI-40] On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在实际部署中可能被滥用生成有害内容的安全问题,重点研究了通过过滤器防止生成不安全信息的对齐挑战。其解决方案的关键在于揭示了在输入提示和输出内容上进行过滤所面临的计算难题,表明在某些情况下,高效的外部过滤机制无法有效识别和阻止有害内容的生成,尤其是在基于密码学难度假设的前提下,这种计算上的不可行性成为核心障碍。因此,论文指出,仅依靠模型外部的过滤器无法实现安全目标,强调了模型内部结构与判断能力的不可分割性。

链接: https://arxiv.org/abs/2507.07341
作者: Sarah Ball,Greg Gluch,Shafi Goldwasser,Frauke Kreuter,Omer Reingold,Guy N. Rothblum
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system’s intelligence cannot be separated from its judgment.
zh

[AI-41] Leveraging Manifold Embeddings for Enhanced Graph Transformer Representations and Learning

【速读】:该论文试图解决图变换器在嵌入节点时通常将其置于单一欧几里得空间中,从而模糊异构拓扑结构的问题。其解决方案的关键是在图变换器中引入一个轻量级的黎曼专家混合层(Riemannian mixture-of-experts layer),该层将每个节点路由到不同类型的流形(如球面、平坦、双曲),以最佳匹配其局部结构。这种显式的几何感知投影不仅提升了模型的预测能力,还增强了图表示的可解释性。

链接: https://arxiv.org/abs/2507.07335
作者: Ankit Jyothish,Ali Jannesari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph transformers typically embed every node in a single Euclidean space, blurring heterogeneous topologies. We prepend a lightweight Riemannian mixture-of-experts layer that routes each node to various kinds of manifold (a mixture of spherical, flat, and hyperbolic) best matching its local structure. These projections provide intrinsic geometric explanations to the latent space. Inserted into a state-of-the-art ensemble graph transformer, this projector lifts accuracy by up to 3% on four node-classification benchmarks. The ensemble makes sure that both Euclidean and non-Euclidean features are captured. Explicit, geometry-aware projection thus sharpens predictive power while making graph representations more interpretable.
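
下面的 PyTorch 草图示意"按节点把嵌入路由到不同几何专家"的基本形态:三个专家分别做球面归一化、恒等(平坦)与收缩进 Poincaré 球的投影,再按软路由权重混合。投影公式与软路由均为简化假设,并非论文的具体层设计:

```python
import torch, torch.nn as nn, torch.nn.functional as F

def to_sphere(x):                      # 投影到单位球面
    return F.normalize(x, dim=-1)

def to_poincare(x, eps=1e-5):          # 收缩进 Poincaré 单位球内部(简化写法)
    norm = x.norm(dim=-1, keepdim=True)
    return x * torch.tanh(norm) / (norm + eps)

experts = [to_sphere, lambda x: x, to_poincare]   # 球面、平坦、双曲三类"专家"
router = nn.Linear(16, len(experts))

def riemannian_moe(h):                 # h: (N, 16) 节点嵌入
    gate = router(h).softmax(dim=-1)   # 软路由权重
    outs = torch.stack([e(h) for e in experts], dim=1)   # (N, 3, 16)
    return (gate.unsqueeze(-1) * outs).sum(dim=1)        # 按权重混合各流形投影

z = riemannian_moe(torch.randn(32, 16))
print(z.shape)                         # torch.Size([32, 16])
```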
zh

[AI-42] Bridging the Plausibility-Validity Gap by Fine-Tuning a Reasoning-Enhanced LLM for Chemical Synthesis and Discovery

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在生成科学内容时存在的“可塑性-有效性差距”问题,即模型生成的内容在形式上看似合理,但在事实准确性上存在缺陷,尤其在化学等专业领域更为显著。解决方案的关键在于构建一个“双域数据集”,该数据集涵盖了分子性质和化学反应的标准化内容,并基于此对Magistral Small模型进行低秩适应(Low-Rank Adaptation, LoRA)微调,从而提升模型在化学有效性、合成路线可行性等方面的性能。

链接: https://arxiv.org/abs/2507.07328
作者: Malikussaid,Hilal Hudan Nuha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Chemical Physics (physics.chem-ph)
备注: 42 pages, 8 figures, 1 equation, 2 algorithms, 31 tables, to be published in ISPACS Conference 2025, unabridged version

点击查看摘要

Abstract:Large Language Models (LLMs) often generate scientifically plausible but factually invalid information, a challenge we term the “plausibility-validity gap,” particularly in specialized domains like chemistry. This paper presents a systematic methodology to bridge this gap by developing a specialized scientific assistant. We utilized the Magistral Small model, noted for its integrated reasoning capabilities, and fine-tuned it using Low-Rank Adaptation (LoRA). A key component of our approach was the creation of a “dual-domain dataset,” a comprehensive corpus curated from various sources encompassing both molecular properties and chemical reactions, which was standardized to ensure quality. Our evaluation demonstrates that the fine-tuned model achieves significant improvements over the baseline model in format adherence, chemical validity of generated molecules, and the feasibility of proposed synthesis routes. The results indicate a hierarchical learning pattern, where syntactic correctness is learned more readily than chemical possibility and synthesis feasibility. While a comparative analysis with human experts revealed competitive performance in areas like chemical creativity and reasoning, it also highlighted key limitations, including persistent errors in stereochemistry, a static knowledge cutoff, and occasional reference hallucination. This work establishes a viable framework for adapting generalist LLMs into reliable, specialized tools for chemical research, while also delineating critical areas for future improvement.
zh

[AI-43] SonicMotion: Dynamic Spatial Audio Soundscapes with Latent Diffusion Models

【速读】:该论文试图解决在沉浸式娱乐场景中生成具有动态声源的三维音频场景的问题,特别是针对第一阶全向声场(first-order Ambisonics, FOA)格式的生成。解决方案的关键在于提出一种端到端的模型SonicMotion,该模型通过不同的用户输入和声源定位精度实现对三维空间音频的生成,并结合一个新提出的模拟空间音频-描述配对数据集进行训练与评估,从而在保持语义对齐和音频质量的同时,准确捕捉所需的空间属性。

链接: https://arxiv.org/abs/2507.07318
作者: Christian Templin,Yanda Zhu,Hao Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Spatial audio is an integral part of immersive entertainment, such as VR/AR, and has seen increasing popularity in cinema and music as well. The most common format of spatial audio is described as first-order Ambisonics (FOA). We seek to extend recent advancements in FOA generative AI models to enable the generation of 3D scenes with dynamic sound sources. Our proposed end-to-end model, SonicMotion, comes in two variations which vary in their user input and level of precision in sound source localization. In addition to our model, we also present a new dataset of simulated spatial audio-caption pairs. Evaluation of our models demonstrates that they are capable of matching the semantic alignment and audio quality of state-of-the-art models while capturing the desired spatial attributes.
zh

[AI-44] Application of LLMs to Multi-Robot Path Planning and Task Allocation

【速读】:该论文试图解决多智能体强化学习中由于算法内在复杂性而加剧的高效探索问题。其解决方案的关键在于利用大语言模型作为专家规划器,以在基于规划的任务中为多个智能体提供高效的探索策略。

链接: https://arxiv.org/abs/2507.07302
作者: Ashish Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Efficient exploration is a well-known problem in deep reinforcement learning, and this problem is exacerbated in multi-agent reinforcement learning due to the intrinsic complexities of such algorithms. There are several approaches to efficiently explore an environment to learn to solve tasks by multiple agents operating in that environment; among these, the idea of expert exploration is investigated in this work. More specifically, this work investigates the application of large language models as expert planners for efficient exploration in planning-based tasks for multiple agents.
zh

[AI-45] Exploiting Edge Features for Transferable Adversarial Attacks in Distributed Machine Learning

【速读】:该论文试图解决分布式深度学习系统中因中间特征泄露而带来的安全风险问题。其关键解决方案是利用拦截到的中间特征重建原始张量形状,并适配替代模型架构以实现有效的特征蒸馏,从而生成高度可迁移的对抗样本,显著提升逃避攻击的可行性。

链接: https://arxiv.org/abs/2507.07259
作者: Giulio Rossolini,Fabio Brau,Alessandro Biondi,Battista Biggio,Giorgio Buttazzo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: under review

点击查看摘要

Abstract:As machine learning models become increasingly deployed across the edge of internet of things environments, a partitioned deep learning paradigm in which models are split across multiple computational nodes introduces a new dimension of security risk. Unlike traditional inference setups, these distributed pipelines span the model computation across heterogeneous nodes and communication layers, thereby exposing a broader attack surface to potential adversaries. Building on these motivations, this work explores a previously overlooked vulnerability: even when both the edge and cloud components of the model are inaccessible (i.e., black-box), an adversary who intercepts the intermediate features transmitted between them can still pose a serious threat. We demonstrate that, under these mild and realistic assumptions, an attacker can craft highly transferable proxy models, making the entire deep learning system significantly more vulnerable to evasion attacks. In particular, the intercepted features can be effectively analyzed and leveraged to distill surrogate models capable of crafting highly transferable adversarial examples against the target model. To this end, we propose an exploitation strategy specifically designed for distributed settings, which involves reconstructing the original tensor shape from vectorized transmitted features using simple statistical analysis, and adapting surrogate architectures accordingly to enable effective feature distillation. A comprehensive and systematic experimental evaluation has been conducted to demonstrate that surrogate models trained with the proposed strategy, i.e., leveraging intermediate features, tremendously improve the transferability of adversarial attacks. These findings underscore the urgent need to account for intermediate feature leakage in the design of secure distributed deep learning systems.
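
文中利用链的第一步是从被截获的展平特征推回原始张量形状。下面是一个假设性的最小草图:枚举 n = C×H×W 且 H = W 的整数分解得到候选形状,攻击者随后可结合特征的统计特性进一步筛选(论文所述的统计分析此处从略):

```python
import math

def candidate_shapes(n, max_c=512):
    """枚举 n = C*H*W 且 H == W 的分解;真实攻击会再用统计特性筛选候选。"""
    cands = []
    for c in range(1, max_c + 1):
        if n % c == 0:
            hw = n // c
            s = math.isqrt(hw)
            if s * s == hw:
                cands.append((c, s, s))
    return cands

print(candidate_shapes(256 * 14 * 14))   # 候选中包含真实形状 (256, 14, 14)
```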
zh

[AI-46] FedP3E: Privacy-Preserving Prototype Exchange for Non-IID IoT Malware Detection in Cross-Silo Federated Learning

【速读】:该论文旨在解决物联网(IoT)环境中由于数据异构性、类别不平衡以及罕见或不重叠的恶意软件类导致的传统联邦学习(FL)算法在检测恶意软件时效果不佳的问题。其解决方案的关键在于提出FedP3E(Privacy-Preserving Prototype Exchange)框架,该框架通过客户端使用高斯混合模型(GMMs)构建类别原型,并添加高斯噪声后仅上传这些紧凑的摘要至服务器,从而实现隐私保护下的跨客户端表示共享,同时利用SMOTE增强技术提升少数类别恶意软件的表征能力。

链接: https://arxiv.org/abs/2507.07258
作者: Rami Darwish,Mahmoud Abdelsalam,Sajad Khorsandroo,Kaushik Roy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As IoT ecosystems continue to expand across critical sectors, they have become prominent targets for increasingly sophisticated and large-scale malware attacks. The evolving threat landscape, combined with the sensitive nature of IoT-generated data, demands detection frameworks that are both privacy-preserving and resilient to data heterogeneity. Federated Learning (FL) offers a promising solution by enabling decentralized model training without exposing raw data. However, standard FL algorithms such as FedAvg and FedProx often fall short in real-world deployments characterized by class imbalance and non-IID data distributions – particularly in the presence of rare or disjoint malware classes. To address these challenges, we propose FedP3E (Privacy-Preserving Prototype Exchange), a novel FL framework that supports indirect cross-client representation sharing while maintaining data privacy. Each client constructs class-wise prototypes using Gaussian Mixture Models (GMMs), perturbs them with Gaussian noise, and transmits only these compact summaries to the server. The aggregated prototypes are then distributed back to clients and integrated into local training, supported by SMOTE-based augmentation to enhance representation of minority malware classes. Rather than relying solely on parameter averaging, our prototype-driven mechanism enables clients to enrich their local models with complementary structural patterns observed across the federation – without exchanging raw data or gradients. This targeted strategy reduces the adverse impact of statistical heterogeneity with minimal communication overhead. We evaluate FedP3E on the N-BaIoT dataset under realistic cross-silo scenarios with varying degrees of data imbalance.
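
FedP3E 的核心动作是"按类别拟合 GMM → 取分量均值为原型 → 加高斯噪声后仅上传原型"。下面用 scikit-learn 给出一个最小草图;数据、分量数与噪声强度均为演示假设,实际系统需按隐私预算校准噪声:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 本地特征(示例数据)
y = rng.integers(0, 2, size=200)         # 两类标签:0 = 良性,1 = 恶意

def make_prototypes(X, y, n_components=2, noise_std=0.1):
    prototypes = {}
    for c in np.unique(y):
        gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X[y == c])
        # 仅上传扰动后的分量均值,原始样本与梯度都不出本地
        prototypes[int(c)] = gmm.means_ + rng.normal(0, noise_std, gmm.means_.shape)
    return prototypes

protos = make_prototypes(X, y)
print({c: p.shape for c, p in protos.items()})   # 每类 (n_components, 8) 的原型
```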
zh

[AI-47] Attentions Under the Microscope: A Comparative Study of Resource Utilization for Variants of Self-Attention

【速读】:该论文试图解决大规模语言模型(Large Language Models, LLMs)和视觉语言模型(Visual Language Models, VLMs)中注意力机制因高内存和时间复杂度而导致的计算瓶颈问题。其解决方案的关键在于对多种注意力机制进行能量效率和硬件资源需求的系统性评估,发现通过优化内核实现的注意力机制,如Flash Attention、局部敏感哈希(Locality-Sensitive Hashing, LSH)Attention以及多头潜在注意力(Multi-Head Latent Attention, MLA),在训练过程中表现出最佳的能量效率。研究进一步表明,仅降低GPU功耗并不等同于减少整体能耗,训练时间同样起着关键作用。

链接: https://arxiv.org/abs/2507.07247
作者: Zhengyu Tian,Anantha Padmanaban Krishna Kumar,Hemant Krishnakumar,Reza Rawassizadeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 6 pages, 8 figures

点击查看摘要

Abstract:As large language models (LLMs) and visual language models (VLMs) grow in scale and application, attention mechanisms have become a central computational bottleneck due to their high memory and time complexity. While many efficient attention variants have been proposed, there remains a lack of rigorous evaluation on their actual energy usage and hardware resource demands during training. In this work, we benchmark eight attention mechanisms in training GPT-2 architecture, measuring key metrics including training time, GPU memory usage, FLOPS, CPU usage, and power consumption. Our results reveal that attention mechanisms with optimized kernel implementations, including Flash Attention, Locality-Sensitive Hashing (LSH) Attention, and Multi-Head Latent Attention (MLA), achieve the best energy efficiency. We further show that lower GPU power alone does not guarantee reduced energy use, as training time plays an equally important role. Our study highlights the importance of energy-aware benchmarking in attention design and provides a practical insight for selecting resource-efficient mechanisms. All our codes are available at GitHub.
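
下面是一个最小的计时基准草图,对比朴素注意力与 PyTorch 内置的 scaled_dot_product_attention(后者可调度到 FlashAttention 等融合内核)。这只覆盖论文指标中的时间维度;功耗与 FLOPS 需另接 NVML 等工具,张量规模也是演示假设:

```python
import time, torch
import torch.nn.functional as F

def bench(attn_fn, B=8, H=8, L=512, D=64, iters=10):
    q = k = v = torch.randn(B, H, L, D)
    attn_fn(q, k, v)                       # 预热
    t0 = time.perf_counter()
    for _ in range(iters):
        attn_fn(q, k, v)                   # GPU 上还需 torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

def naive(q, k, v):
    return torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1) @ v

print("naive:", bench(naive), "s/iter")
print("sdpa :", bench(F.scaled_dot_product_attention), "s/iter")
```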
zh

[AI-48] Neurosymbolic Feature Extraction for Identifying Forced Labor in Supply Chains

【速读】:该论文试图解决在供应链网络中识别非法活动(illicit activities)的问题,特别是在数据稀疏且可能被篡改的情况下。其解决方案的关键在于采用神经符号方法(neurosymbolic methods)以及利用大规模语言模型(LLM)的问答树方法,以自动检测与非法活动相关的新型模式,而无需依赖大量标注训练数据。

链接: https://arxiv.org/abs/2507.07217
作者: Zili Wang,Frank Montabon,Kristin Yvonne Rozier
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Supply chain networks are complex systems that are challenging to analyze; this problem is exacerbated when there are illicit activities involved in the supply chain, such as counterfeit parts, forced labor, or human trafficking. While machine learning (ML) can find patterns in complex systems like supply chains, traditional ML techniques require large training data sets. However, illicit supply chains are characterized by very sparse data, and the data that is available is often (purposely) corrupted or unreliable in order to hide the nature of the activities. We need to be able to automatically detect new patterns that correlate with such illegal activity over complex, even temporal data, without requiring large training data sets. We explore neurosymbolic methods for identifying instances of illicit activity in supply chains and compare the effectiveness of manual and automated feature extraction from news articles accurately describing illicit activities uncovered by authorities. We propose a question tree approach for querying a large language model (LLM) to identify and quantify the relevance of articles. This enables a systematic evaluation of the differences between human and machine classification of news articles related to forced labor in supply chains.
zh

[AI-49] Bias-Aware Mislabeling Detection via Decoupled Confident Learning

【速读】:该论文试图解决由标签偏差(label bias)引起的数据完整性问题,即在不同社会群体中标签质量存在系统性差异,这会影响定量分析的可靠性。解决方案的关键在于提出一种基于机器学习的框架——解耦置信学习(Decoupled Confident Learning, DeCoLe),该框架专门用于检测受标签偏差影响的数据集中的误标实例,从而实现偏差感知的误标检测并提升数据质量。

链接: https://arxiv.org/abs/2507.07216
作者: Yunyi Li,Maria De-Arteaga,Maytal Saar-Tsechansky
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Reliable data is a cornerstone of modern organizational systems. A notable data integrity challenge stems from label bias, which refers to systematic errors in a label, a covariate that is central to a quantitative analysis, such that its quality differs across social groups. This type of bias has been conceptually and empirically explored and is widely recognized as a pressing issue across critical domains. However, effective methodologies for addressing it remain scarce. In this work, we propose Decoupled Confident Learning (DeCoLe), a principled machine learning based framework specifically designed to detect mislabeled instances in datasets affected by label bias, enabling bias aware mislabelling detection and facilitating data quality improvement. We theoretically justify the effectiveness of DeCoLe and evaluate its performance in the impactful context of hate speech detection, a domain where label bias is a well documented challenge. Empirical results demonstrate that DeCoLe excels at bias aware mislabeling detection, consistently outperforming alternative approaches for label error detection. Our work identifies and addresses the challenge of bias aware mislabeling detection and offers guidance on how DeCoLe can be integrated into organizational data management practices as a powerful tool to enhance data reliability.
zh

[AI-50] State-Inference-Based Prompting for Natural Language Trading with Game NPCs KDD2025

【速读】:该论文试图解决大型语言模型在规则约束的交易系统中出现的规则违反问题,例如物品幻觉和计算错误,这些问题会损害玩家信任。其解决方案的关键在于基于状态推理的提示(State-Inference-Based Prompting, SIBP),通过自主对话状态推理和上下文特定的规则遵循实现可靠交易。该方法在统一提示框架内将交易分解为六个状态,并实现了上下文感知的物品引用和基于占位符的价格计算。

链接: https://arxiv.org/abs/2507.07203
作者: Minkyung Kim,Junsik Kim,Hwidong Bae,Woongcheol Yang,Sangdon Park,Sohee Bae
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages main content, 4 pages appendix, 3 figures. Accepted to the KDD 2025 Workshop on Prompt Optimization

点击查看摘要

Abstract:Large Language Models enable dynamic game interactions but struggle with rule-governed trading systems. Current implementations suffer from rule violations, such as item hallucinations and calculation errors, that erode player trust. Here, State-Inference-Based Prompting (SIBP) enables reliable trading through autonomous dialogue state inference and context-specific rule adherence. The approach decomposes trading into six states within a unified prompt framework, implementing context-aware item referencing and placeholder-based price calculations. Evaluation across 100 trading dialogues demonstrates 97% state compliance, 95% referencing accuracy, and 99.7% calculation precision. SIBP maintains computational efficiency while outperforming baseline approaches, establishing a practical foundation for trustworthy NPC interactions in commercial games.
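
SIBP 的两个关键约束(只允许沿 FSM 合法边转移、价格由程序计算后填入占位符,而不是让语言模型自己做乘法)可以浓缩成下面的草图。六个状态的命名与物品价格表均为假设,并非论文的确切状态划分:

```python
PRICE_TABLE = {"iron sword": 120, "herb": 8}      # 示例物品表,杜绝"物品幻觉"

FSM = {
    "greet":     ["browse"],
    "browse":    ["negotiate", "farewell"],
    "negotiate": ["confirm", "browse"],
    "confirm":   ["settle"],
    "settle":    ["farewell"],
    "farewell":  [],
}

def transition(state: str, nxt: str) -> str:
    if nxt not in FSM[state]:
        raise ValueError(f"illegal transition {state} -> {nxt}")
    return nxt

def quote(item: str, qty: int) -> str:
    unit = PRICE_TABLE[item]                      # 只允许引用表内物品
    template = "{item} x{qty} 合计 {total} 金币"   # 价格由程序算好再填占位符
    return template.format(item=item, qty=qty, total=unit * qty)

s = "greet"
for nxt in ["browse", "negotiate", "confirm", "settle", "farewell"]:
    s = transition(s, nxt)
print(quote("iron sword", 3))                     # iron sword x3 合计 360 金币
```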
zh

[AI-51] Combining Pre-Trained Models for Enhanced Feature Representation in Reinforcement Learning

【速读】:该论文试图解决如何有效结合和利用不同预训练模型的隐含信息以提升强化学习(Reinforcement Learning, RL)性能的问题。其解决方案的关键在于提出一种名为权重共享注意力(Weight Sharing Attention, WSA)的新架构,该架构通过融合多个预训练模型的嵌入表示,构建更丰富的状态表征,在效率与性能之间取得平衡。

链接: https://arxiv.org/abs/2507.07197
作者: Elia Piccoli,Malio Li,Giacomo Carfì,Vincenzo Lomonaco,Davide Bacciu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at 4th Conference on Lifelong Learning Agents (CoLLAs), 2025

点击查看摘要

Abstract:The recent focus and release of pre-trained models have been key components of several advancements in many fields (e.g., Natural Language Processing and Computer Vision); as a matter of fact, pre-trained models learn disparate latent embeddings sharing insightful representations. On the other hand, Reinforcement Learning (RL) focuses on maximizing the cumulative reward obtained via the agent’s interaction with the environment. RL agents do not have any prior knowledge about the world, and they either learn from scratch an end-to-end mapping between the observation and action spaces or, in more recent works, are paired with monolithic and computationally expensive Foundational Models. How to effectively combine and leverage the hidden information of different pre-trained models simultaneously in RL is still an open and understudied question. In this work, we propose Weight Sharing Attention (WSA), a new architecture to combine embeddings of multiple pre-trained models to shape an enriched state representation, balancing the tradeoff between efficiency and performance. We run an extensive comparison between several combination modes showing that WSA obtains comparable performance on multiple Atari games compared to end-to-end models. Furthermore, we study the generalization capabilities of this approach and analyze how scaling the number of models influences agents’ performance during and after training.
zh

[AI-52] Bridging the Last Mile of Prediction: Enhancing Time Series Forecasting with Conditional Guided Flow Matching

【速读】:该论文旨在解决扩散模型在时间序列预测中面临的局限性,如源分布刚性及采样路径有限等问题。其解决方案的关键在于提出条件引导流匹配(Conditional Guided Flow Matching, CGFM),通过引入辅助模型的输出,实现从辅助模型预测误差中学习的能力,并结合历史数据构建双向条件概率路径,利用通用仿射路径扩展概率路径空间,从而提升预测性能。

链接: https://arxiv.org/abs/2507.07192
作者: Huibo Xu,Runlong Yu,Likang Wu,Xianquan Wang,Qi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models, a type of generative model, have shown promise in time series forecasting. But they face limitations like rigid source distributions and limited sampling paths, which hinder their performance. Flow matching offers faster generation, higher-quality outputs, and greater flexibility, while also possessing the ability to utilize valuable information from the prediction errors of prior models, which were previously inaccessible yet critically important. To address these challenges and fully unlock the untapped potential of flow matching, we propose Conditional Guided Flow Matching (CGFM). CGFM extends flow matching by incorporating the outputs of an auxiliary model, enabling a previously unattainable capability in the field: learning from the errors of the auxiliary model. For time series forecasting tasks, it integrates historical data as conditions and guidance, constructs two-sided conditional probability paths, and uses a general affine path to expand the space of probability paths, ultimately leading to improved predictions. Extensive experiments show that CGFM consistently enhances and outperforms state-of-the-art models, highlighting its effectiveness in advancing forecasting methods.
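
把"直线概率路径 + 以历史与辅助模型预测为条件"写成训练损失,大致是下面这个形态的 PyTorch 草图:x_t = (1−t)·x0 + t·x1,网络回归目标速度 x1 − x0。网络结构、维度与条件拼接方式均为演示假设,并非论文双侧条件路径的具体实现:

```python
import torch, torch.nn as nn

class CondVelocityNet(nn.Module):
    def __init__(self, dim=1, hist=24, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + dim + hist + 1, hidden), nn.ReLU(), nn.Linear(hidden, dim))
    def forward(self, x_t, t, history, aux_pred):
        # 条件:历史窗口 + 辅助模型的预测(模型由此能从辅助模型的误差中学习)
        return self.net(torch.cat([x_t, aux_pred, history, t], dim=-1))

def cfm_loss(model, x1, history, aux_pred):
    x0 = torch.randn_like(x1)                 # 源分布取标准高斯
    t = torch.rand(x1.size(0), 1)
    x_t = (1 - t) * x0 + t * x1               # 直线概率路径
    v_target = x1 - x0                        # 对应的目标速度场
    return ((model(x_t, t, history, aux_pred) - v_target) ** 2).mean()

model, B = CondVelocityNet(), 16
loss = cfm_loss(model, torch.randn(B, 1), torch.randn(B, 24), torch.randn(B, 1))
loss.backward()
print(float(loss))
```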
zh

[AI-53] BOOST: Out-of-Distribution-Informed Adaptive Sampling for Bias Mitigation in Stylistic Convolutional Neural Networks

【速读】:该论文试图解决人工智能在艺术分类任务中因数据集不平衡导致的偏见问题,这种偏见影响了模型预测的公平性和准确性,尤其是在处理分布外(out-of-distribution, OOD)数据时。解决方案的关键在于提出一种名为BOOST(Bias-Oriented OOD Sampling and Tuning)的新型OOD感知模型偏差自适应采样方法,通过动态调整温度缩放和采样概率,促进所有类别的更公平表示。

链接: https://arxiv.org/abs/2507.07134
作者: Mridula Vijendran,Shuang Chen,Jingjing Deng,Hubert P. H. Shum
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 7 figures, 3 tables

点击查看摘要

Abstract:The pervasive issue of bias in AI presents a significant challenge to painting classification, and is getting more serious as these systems become increasingly integrated into tasks like art curation and restoration. Biases, often arising from imbalanced datasets where certain artistic styles dominate, compromise the fairness and accuracy of model predictions, i.e., classifiers are less accurate on rarely seen paintings. While prior research has made strides in improving classification performance, it has largely overlooked the critical need to address these underlying biases, that is, when dealing with out-of-distribution (OOD) data. Our insight highlights the necessity of a more robust approach to bias mitigation in AI models for art classification on biased training data. We propose a novel OOD-informed model bias adaptive sampling method called BOOST (Bias-Oriented OOD Sampling and Tuning). It addresses these challenges by dynamically adjusting temperature scaling and sampling probabilities, thereby promoting a more equitable representation of all classes. We evaluate our proposed approach on the KaoKore and PACS datasets, focusing on the model’s ability to reduce class-wise bias. We further propose a new metric, Same-Dataset OOD Detection Score (SODC), designed to assess class-wise separation and per-class bias reduction. Our method demonstrates the ability to balance high performance with fairness, making it a robust solution for unbiasing AI models in the art domain.
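
温度缩放的类别采样概率可以写成 p_c ∝ n_c^{1/T}:T 越大分布越平,稀有画风被采到的概率越高;T = 1 退化为按频率采样。下面的小例子(类别计数为假设数据)演示这一静态形态;论文中的温度是由 OOD 信号动态调节的,此处仅作示意:

```python
import numpy as np

counts = np.array([900, 80, 20])          # 各风格类别样本数(失衡示例)

def class_sampling_probs(counts, T):
    logits = np.log(counts) / T           # 等价于 counts ** (1/T) 再归一化
    p = np.exp(logits - logits.max())
    return p / p.sum()

for T in [1.0, 2.0, 5.0]:
    print(T, np.round(class_sampling_probs(counts, T), 3))
```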
zh

[AI-54] Generative Panoramic Image Stitching

【速读】:该论文试图解决生成式全景图像拼接问题(generative panoramic image stitching),即在存在视差效应、光照、相机捕捉设置或风格强烈变化的多参考图像中,合成无缝且忠实于内容的全景图。传统图像拼接流程在此类复杂场景下失效,产生鬼影和其他伪影;而现有生成模型虽能保持多参考图像内容的一致性,却难以合成大范围且连贯的全景区域。解决方案的关键在于对基于扩散的修复模型进行微调,以根据多参考图像保留场景的内容和布局,从而从单个参考图像中外推生成完整的全景图,实现高质量且结构一致的视觉结果。

链接: https://arxiv.org/abs/2507.07133
作者: Mathieu Tuli,Kaveh Kamali,David B. Lindell
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce the task of generative panoramic image stitching, which aims to synthesize seamless panoramas that are faithful to the content of multiple reference images containing parallax effects and strong variations in lighting, camera capture settings, or style. In this challenging setting, traditional image stitching pipelines fail, producing outputs with ghosting and other artifacts. While recent generative models are capable of outpainting content consistent with multiple reference images, they fail when tasked with synthesizing large, coherent regions of a panorama. To address these limitations, we propose a method that fine-tunes a diffusion-based inpainting model to preserve a scene’s content and layout based on multiple reference images. Once fine-tuned, the model outpaints a full panorama from a single reference image, producing a seamless and visually coherent result that faithfully integrates content from all reference images. Our approach significantly outperforms baselines for this task in terms of image quality and the consistency of image structure and scene layout when evaluated on captured datasets.
zh

[AI-55] Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在处理多百万token的键值(Key-Value, KV)历史时,在严格Token-to-Token Latency (TTL)约束下进行实时自回归解码所面临的性能瓶颈问题。核心问题包括访问前馈网络(Feed-Forward Network, FFN)权重的高开销以及读取长KV缓存的低效性。论文提出的解决方案是Helix Parallelism,其关键在于采用混合执行策略,在注意力计算阶段通过KV并行划分KV缓存到不同GPU,随后在同一组GPU上执行FFN计算中的张量并行(Tensor Parallelism, TP)或专家并行(TPxExpert Parallel, EP),从而提升整体GPU利用率并降低通信开销。

链接: https://arxiv.org/abs/2507.07120
作者: Nidhi Bhatia,Ankit More,Ritika Borkar,Tiyasa Mitra,Ramon Matas,Ritchie Zhao,Maximilian Golub,Dheevatsa Mudigere,Brian Pharris,Bita Darvish Rouhani
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As LLMs scale to multi-million-token KV histories, real-time autoregressive decoding under tight Token-to-Token Latency (TTL) constraints faces growing pressure. Two core bottlenecks dominate: accessing Feed-Forward Network (FFN) weights and reading long KV caches. While Tensor Parallelism (TP) helps mitigate the cost of FFN weight reads, it does not scale well for attention. When TP width exceeds the number of KV heads, it leads to inefficient KV duplication, limits parallelism, and constrains batch size. Simultaneously, DRAM reads for long KV histories scale linearly with batch size, further capping efficiency. We introduce Helix Parallelism, a hybrid execution strategy that applies KV parallelism during attention to shard KV caches across GPUs, then reuses the same GPUs for TP in dense LLMs or TPxExpert Parallel (EP) in MoEs during FFN computation. To preserve exact attention behavior, Helix includes a lightweight communication step. To minimize the exposed communication cost, we introduce Helix HOP-B. Helix HOP-B effectively minimizes communication overhead through batchwise overlap, preserving low TTL while improving GPU efficiency. Compared to conventional parallelism approaches, Helix reduces TTL by up to 1.5x at fixed batch sizes and supports up to 32x larger batches under the same latency budget for DeepSeek-R1, pushing forward the throughput-latency Pareto on Blackwell and making real-time inference with ultra-long-sequence practical.
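
KV 分片之所以能保持注意力在数值上精确,关键在于各分片的局部 softmax 统计量可以用 log-sum-exp 精确合并。下面的 NumPy 草图(分片方式与维度均为假设)验证两片 KV 的合并结果与整体计算一致,这也直观解释了为何只需一次轻量的统计量通信:

```python
import numpy as np

def partial_attn(q, K, V):
    s = K @ q                                    # 本分片的注意力分数
    m = s.max()
    w = np.exp(s - m)
    return m, w.sum(), w @ V                     # (局部最大值, 局部配分和, 未归一化输出)

def merge(parts):                                # 用 log-sum-exp 精确合并各分片统计量
    m = max(p[0] for p in parts)
    Z = sum(p[1] * np.exp(p[0] - m) for p in parts)
    out = sum(p[2] * np.exp(p[0] - m) for p in parts)
    return out / Z

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K, V = rng.normal(size=(1000, 8)), rng.normal(size=(1000, 8))
parts = [partial_attn(q, K[i::2], V[i::2]) for i in range(2)]   # 模拟 2 卡 KV 分片
s = K @ q
exact = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(merge(parts), exact))          # True:分片合并与整体计算一致
```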
zh

[AI-56] Collective Communication Profiling of Modern-day Machine Learning Workloads

【速读】:该论文试图解决在大规模分布式高性能系统中执行机器学习任务时,由于AllReduce、AllGather和Broadcast等集体通信操作引发的高带宽和突发流量模式导致的网络拥塞和数据包丢失问题。解决方案的关键在于对不同机器学习模型(如DeepSeek、GPT、Llama等)的集体通信行为进行深入分析,并通过调整并行度、节点数量和模型类型等配置参数来优化通信性能。研究利用Nvidia Collective Communication Library的日志功能获取更丰富的上下文信息,以揭示操作类型与数量、每操作传输大小及请求大小分布等关键指标,从而为重新设计集体通信框架和网络拓扑提供依据。

链接: https://arxiv.org/abs/2507.07117
作者: Jit Gupta,Andrew Li,Tarun Banka,Ariel Cohen,T. Sridhar,Raj Yavatkar
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Poster, USENIX NSDI 2025, April 2025, Philadelphia, PA, USA

点击查看摘要

Abstract:Machine Learning jobs, carried out on a large number of distributed high-performance systems, involve periodic communication using operations like AllReduce, AllGather, and Broadcast. These operations may create high bandwidth and bursty traffic patterns, leading to network congestion and packet loss, thus impacting the performance of these jobs. Hence it is imperative to analyze these patterns, which can be helpful in provisioning network resources depending on the type of machine learning workloads. In this poster we carry out extensive analysis of the collective communication behavior seen in a wide variety of models (e.g., DeepSeek, GPT, Llama). To achieve this we instrument Nvidia Collective Communication Library logging functionality for richer context about the collectives and workloads. We adjust configuration parameters that influence collective communication behavior, such as parallelism, number of nodes, and model type. This overview presents and discusses some of the results on the collective communication behavior for the open source DeepSeek V3 inferencing model, which includes operation type and count, transfer sizes per operation, and request size distribution. Our analysis shows that it makes sense to rethink current collective communication frameworks and network topologies so as to accommodate the effect of network anomalies on the mentioned workloads.
zh

[AI-57] Analysing semantic data storage in Distributed Ledger Technologies for Data Spaces

【速读】:该论文试图解决在数据空间中实现语义互操作性时,如何高效存储语义数据的问题。其解决方案的关键在于评估不同类型的分布式账本技术(DLT)在存储和管理语义数据方面的性能、存储效率、资源消耗以及更新和查询能力,从而为基于数据主权需求选择合适的DLT基础设施提供依据。

链接: https://arxiv.org/abs/2507.07116
作者: Juan Cano-Benito,Andrea Cimmino,Sven Hertling,Heiko Paulheim,Raúl García-Castro
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Data spaces are emerging as decentralised infrastructures that enable sovereign, secure, and trustworthy data exchange among multiple participants. To achieve semantic interoperability within these environments, the use of semantic web technologies and knowledge graphs has been proposed. Although distributed ledger technologies (DLT) fit as the underlying infrastructure for data spaces, there remains a significant gap in terms of the efficient storage of semantic data on these platforms. This paper presents a systematic evaluation of semantic data storage across different types of DLT (public, private, and hybrid), using a real-world knowledge graph as an experimental basis. The study compares performance, storage efficiency, resource consumption, and the capabilities to update and query semantic data. The results show that private DLTs are the most efficient for storing and managing semantic content, while hybrid DLTs offer a balanced trade-off between public auditability and operational efficiency. This research leads to a discussion on the selection of the most appropriate DLT infrastructure based on the data sovereignty requirements of decentralised data ecosystems.
zh

[AI-58] Autonomous Control Leveraging LLMs: An Agentic Framework for Next-Generation Industrial Automation

【速读】:该论文旨在解决现代化工过程中复杂性增加、人员短缺以及故障场景复杂所带来的自动化挑战,其解决方案的关键在于提出一种统一的智能体框架,该框架结合了符号推理与自适应控制。该框架利用大型语言模型(LLMs)在单一架构中实现离散故障恢复规划和连续过程控制,通过有限状态机(FSMs)作为可解释的操作边界,结合规划代理、仿真代理和验证-重提示循环,实现了对无效计划的迭代优化。

链接: https://arxiv.org/abs/2507.07115
作者: Javal Vyas,Mehmet Mercangoz
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The increasing complexity of modern chemical processes, coupled with workforce shortages and intricate fault scenarios, demands novel automation paradigms that blend symbolic reasoning with adaptive control. In this work, we introduce a unified agentic framework that leverages large language models (LLMs) for both discrete fault-recovery planning and continuous process control within a single architecture. We adopt Finite State Machines (FSMs) as interpretable operating envelopes: an LLM-driven planning agent proposes recovery sequences through the FSM, a Simulation Agent executes and checks each transition, and a Validator-Reprompting loop iteratively refines invalid plans. In Case Study 1, across 180 randomly generated FSMs of varying sizes (4-25 states, 4-300 transitions), GPT-4o and GPT-4o-mini achieve 100% valid-path success within five reprompts, outperforming open-source LLMs in both accuracy and latency. In Case Study 2, the same framework modulates dual-heater inputs on a laboratory TCLab platform (and its digital twin) to maintain a target average temperature under persistent asymmetric disturbances. Compared to classical PID control, our LLM-based controller attains similar performance, while ablation of the prompting loop reveals its critical role in handling nonlinear dynamics. We analyze key failure modes, such as instruction-following lapses and coarse ODE approximations. Our results demonstrate that, with structured feedback and modular agents, LLMs can unify high-level symbolic planning and low-level continuous control, paving the way towards resilient, language-driven automation in chemical engineering.
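下面用一个极简的 Python 草图示意文中的 Validator-Reprompting 循环:以 FSM 作为操作边界校验 LLM 提出的恢复计划,校验失败则把错误信息作为反馈重新提示。其中 llm_propose_plan 为假设的接口,非论文原始实现。

```python
def validate_plan(fsm, start, goal, plan):
    """检查计划中的每一步转移是否存在于 FSM 中,且最终到达目标状态。"""
    state = start
    for action in plan:
        if (state, action) not in fsm:
            return False, f"invalid transition {action!r} from state {state!r}"
        state = fsm[(state, action)]
    if state != goal:
        return False, f"plan ends at {state!r}, expected {goal!r}"
    return True, "ok"

def plan_with_reprompting(fsm, start, goal, llm_propose_plan, max_reprompts=5):
    feedback = ""
    for _ in range(max_reprompts):
        plan = llm_propose_plan(start, goal, feedback)  # 假设的 LLM 调用接口
        ok, msg = validate_plan(fsm, start, goal, plan)
        if ok:
            return plan
        feedback = msg                                   # 把校验错误回灌给 LLM
    raise RuntimeError("no valid plan within reprompt budget")

# 用法示例:一个 3 状态的故障恢复 FSM
fsm = {("fault", "isolate"): "safe", ("safe", "restart"): "nominal"}
dummy_llm = lambda s, g, fb: ["isolate", "restart"]
print(plan_with_reprompting(fsm, "fault", "nominal", dummy_llm))
```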
zh

[AI-59] Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks

【速读】:该论文试图解决传统异常检测方法在应对新型威胁时的不足,以及现有分布外(OOD)检测器难以区分隐蔽对抗攻击与真实OOD事件的问题。其解决方案的关键在于提出一种基于条件生成对抗网络(cGAN)的框架,用于生成能够逃避入侵检测系统(IDS)机制的隐蔽对抗样本,同时保持与OOD分布的统计相似性,并通过条件变分自编码器(CVAE)实现对这些扰动的有效检测。

链接: https://arxiv.org/abs/2506.21142
作者: Deepak Kumar Panda,Weisi Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing integration of UAVs into civilian airspace underscores the need for resilient and intelligent intrusion detection systems (IDS), as traditional anomaly detection methods often fail to identify novel threats. A common approach treats unfamiliar attacks as out-of-distribution (OOD) samples; however, this leaves systems vulnerable when mitigation is inadequate. Moreover, conventional OOD detectors struggle to distinguish stealthy adversarial attacks from genuine OOD events. This paper introduces a conditional generative adversarial network (cGAN)-based framework for crafting stealthy adversarial attacks that evade IDS mechanisms. We first design a robust multi-class IDS classifier trained on benign UAV telemetry and known cyber-attacks, including Denial of Service (DoS), false data injection (FDI), man-in-the-middle (MiTM), and replay attacks. Using this classifier, our cGAN perturbs known attacks to generate adversarial samples that misclassify as benign while retaining statistical resemblance to OOD distributions. These adversarial samples are iteratively refined to achieve high stealth and success rates. To detect such perturbations, we implement a conditional variational autoencoder (CVAE), leveraging negative log-likelihood to separate adversarial inputs from authentic OOD samples. Comparative evaluation shows that CVAE-based regret scores significantly outperform traditional Mahalanobis distance-based detectors in identifying stealthy adversarial threats. Our findings emphasize the importance of advanced probabilistic modeling to strengthen IDS capabilities against adaptive, generative-model-based cyber intrusions.
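以下是一个基于 PyTorch 的极简训练步骤草图,示意条件生成器如何在冻结的 IDS 分类器上制造“误判为良性”的有界扰动;cGAN 的判别器、OOD 统计相似性约束与 CVAE 检测端均从简省略,网络结构与超参数均为演示假设。

```python
import torch
import torch.nn as nn

class PerturbGen(nn.Module):
    """条件生成器:输入攻击样本特征与攻击类别,输出有界扰动方向。"""
    def __init__(self, feat_dim, n_classes, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_classes, 8)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 8, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim), nn.Tanh())

    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=-1))

def evasion_step(gen, ids, x_attack, y_attack, benign_label, opt, eps=0.1):
    """一步生成器更新:让冻结的 IDS 把扰动后的攻击样本判成良性,
    并用 L2 正则限制扰动幅度以保持隐蔽。ids 的参数需事先冻结:
    for p in ids.parameters(): p.requires_grad_(False)"""
    delta = eps * gen(x_attack, y_attack)      # 有界扰动
    logits = ids(x_attack + delta)             # 冻结的 IDS 分类器
    target = torch.full_like(y_attack, benign_label)
    loss = nn.functional.cross_entropy(logits, target) + 0.1 * delta.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```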
zh

[AI-60] Learning Pole Structures of Hadronic States using Predictive Uncertainty Estimation

【速读】:该论文试图解决在强子谱学中将理论预测与实验数据匹配的核心挑战,特别是新强子态的识别问题,因为阈值附近的异常信号可能由多种物理机制引起。解决方案的关键在于引入一种不确定性感知的机器学习方法,用于分类S矩阵元素中的极点结构,该方法基于集成分类器链,能够提供认知不确定性和随机不确定性的估计,并通过基于预测不确定性的拒绝准则实现高验证准确率,同时有效处理高不确定性预测。

链接: https://arxiv.org/abs/2507.07668
作者: Felix Frohnert,Denny Lane B. Sombrillo,Evert van Nieuwenburg,Patrick Emonts
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
备注:

点击查看摘要

Abstract:Matching theoretical predictions to experimental data remains a central challenge in hadron spectroscopy. In particular, the identification of new hadronic states is difficult, as exotic signals near threshold can arise from a variety of physical mechanisms. A key diagnostic in this context is the pole structure of the scattering amplitude, but different configurations can produce similar signatures. The mapping between pole configurations and line shapes is especially ambiguous near the mass threshold, where analytic control is limited. In this work, we introduce an uncertainty-aware machine learning approach for classifying pole structures in S-matrix elements. Our method is based on an ensemble of classifier chains that provide both epistemic and aleatoric uncertainty estimates. We apply a rejection criterion based on predictive uncertainty, achieving a validation accuracy of nearly 95% while discarding only a small fraction of high-uncertainty predictions. Trained on synthetic data with known pole structures, the model generalizes to previously unseen experimental data, including enhancements associated with the P_{c\bar{c}}(4312)^+ state observed by LHCb. For this state, we infer a four-pole structure, representing the presence of a genuine compact pentaquark in the presence of a higher channel virtual state pole with non-vanishing width. While evaluated on this particular state, our framework is broadly applicable to other candidate hadronic states and offers a scalable tool for pole structure inference in scattering amplitudes.
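下面给出一个基于 scikit-learn 的分类器链集成草图,示意“集成 + 不确定性拒绝”的思路:用不同链序成员间的预测方差近似认知不确定性,并拒绝最不确定的样本。具体的不确定性分解与拒绝准则与论文设定可能不同,数据集为合成示例。

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=500, n_classes=4, random_state=0)

# 10 条随机链序的分类器链组成集成
chains = [ClassifierChain(LogisticRegression(max_iter=1000),
                          order="random", random_state=i).fit(X, Y)
          for i in range(10)]

probs = np.stack([c.predict_proba(X) for c in chains])  # (n_chains, n, n_labels)
mean_p = probs.mean(axis=0)                             # 集成平均预测
epistemic = probs.var(axis=0).mean(axis=1)              # 成员间方差 ≈ 认知不确定性

threshold = np.quantile(epistemic, 0.9)                 # 拒绝最不确定的 10% 样本
accepted = epistemic <= threshold
preds = (mean_p[accepted] > 0.5).astype(int)
print(f"accepted {accepted.mean():.0%} of samples")
print("subset exact-match accuracy:", (preds == Y[accepted]).all(axis=1).mean())
```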
zh

[AI-61] MODA: A Unified 3D Diffusion Framework for Multi-Task Target-Aware Molecular Generation

【速读】:该论文试图解决当前基于扩散模型的三维分子生成方法在任务间碎片化的问题,包括SMILES-only输入、两阶段预训练-微调流程以及单任务单模型策略所带来的立体化学保真度低、任务对齐差和零样本迁移能力弱等问题。其解决方案的关键在于提出MODA框架,该框架通过贝叶斯掩码调度器统一了片段生长、连接子设计、骨架跳跃和侧链修饰等任务,并在训练过程中通过一次遍历掩码并去噪连续空间片段,使模型能够学习跨任务的共享几何与化学先验,从而实现多任务训练下的通用骨干模型。

链接: https://arxiv.org/abs/2507.07201
作者: Dong Xu,Zhangfan Yang,Sisi Yuan,Jenna Xinyi Yao,Jiangqiang Li,Junkai Ji
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Three-dimensional molecular generators based on diffusion models can now reach near-crystallographic accuracy, yet they remain fragmented across tasks. SMILES-only inputs, two-stage pretrain-finetune pipelines, and one-task-one-model practices hinder stereochemical fidelity, task alignment, and zero-shot transfer. We introduce MODA, a diffusion framework that unifies fragment growing, linker design, scaffold hopping, and side-chain decoration with a Bayesian mask scheduler. During training, a contiguous spatial fragment is masked and then denoised in one pass, enabling the model to learn shared geometric and chemical priors across tasks. Multi-task training yields a universal backbone that surpasses six diffusion baselines and three training paradigms on substructure, chemical property, interaction, and geometry. Model-C reduces ligand-protein clashes and substructure divergences while maintaining Lipinski compliance, whereas Model-B preserves similarity but trails in novelty and binding affinity. Zero-shot de novo design and lead-optimisation tests confirm stable negative Vina scores and high improvement rates without force-field refinement. These results demonstrate that a single-stage multi-task diffusion routine can replace two-stage workflows for structure-based molecular design.
zh

[AI-62] Evaluating Retrieval-Augmented Generation Agents for Autonomous Scientific Discovery in Astrophysics ICML2025

【速读】:该论文旨在解决如何在天体物理学领域中构建高效的检索增强生成(Retrieval Augmented Generation, RAG)代理系统,以支持自主科学发现的问题。其关键解决方案是通过手动评估9种RAG代理配置在105个专门构建的宇宙学问答(Cosmology Question-Answer, QA)对上的性能,最终确定使用OpenAI嵌入和生成模型的配置在人类专家评估中表现出最高的准确率(91.4%)。此外,研究还基于人类评估结果校准了LLM-as-a-Judge (LLMaaJ)系统,作为人类评估的稳健替代方案,从而为大规模宇宙学QA对的自动化评估提供了可行路径。

链接: https://arxiv.org/abs/2507.07155
作者: Xueqing Xu,Boris Bolliet,Adrian Dimitrov,Andrew Laverick,Francisco Villaescusa-Navarro,Licong Xu,Íñigo Zubeldia
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI)
备注: Accepted contribution (spotlight) to the ICML 2025 Workshop on Machine Learning for Astrophysics; codes: this https URL, this https URL, this https URL

点击查看摘要

Abstract:We evaluate 9 Retrieval Augmented Generation (RAG) agent configurations on 105 Cosmology Question-Answer (QA) pairs that we built specifically for this purpose. The RAG configurations are manually evaluated by a human expert, that is, a total of 945 generated answers were assessed. We find that currently the best RAG agent configuration is with OpenAI embedding and generative model, yielding 91.4% accuracy. Using our human evaluation results we calibrate an LLM-as-a-Judge (LLMaaJ) system which can be used as a robust proxy for human evaluation. These results allow us to systematically select the best RAG agent configuration for a multi-agent system for autonomous scientific discovery in astrophysics (e.g., cmbagent, presented in a companion paper) and provide us with an LLMaaJ system that can be scaled to thousands of cosmology QA pairs. We make our QA dataset, human evaluation results, RAG pipelines, and LLMaaJ system publicly available for further use by the astrophysics community.
zh

[AI-63] DpDNet: A Dual-Prompt-Driven Network for Universal PET-CT Segmentation

【速读】:该论文旨在解决PET-CT图像中病灶分割的挑战,包括对噪声敏感、病灶形态小且多变以及生理高代谢信号的干扰。现有方法将所有癌症视为单一任务进行多病种分割,忽略了不同癌症类型的独特特征。其解决方案的关键在于提出DpDNet,一种双提示驱动网络,通过引入特定提示捕捉癌症特异性特征,并利用通用提示保留共享知识;同时,在解码器后采用提示感知头以适应性地处理多个分割任务,从而缓解早期引入提示导致的信息遗忘问题。

链接: https://arxiv.org/abs/2507.07126
作者: Xinglong Liang,Jiaju Huang,Luyi Han,Tianyu Zhang,Xin Wang,Yuan Gao,Chunyao Lu,Lishan Cai,Tao Tan,Ritse Mann
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:PET-CT lesion segmentation is challenging due to noise sensitivity, small and variable lesion morphology, and interference from physiological high-metabolic signals. Current mainstream approaches follow the practice of one network solving the segmentation of multiple cancer lesions by treating all cancers as a single task. However, this overlooks the unique characteristics of different cancer types. Considering the specificity and similarity of different cancers in terms of metastatic patterns, organ preferences, and FDG uptake intensity, we propose DpDNet, a Dual-Prompt-Driven network that incorporates specific prompts to capture cancer-specific features and common prompts to retain shared knowledge. Additionally, to mitigate information forgetting caused by the early introduction of prompts, prompt-aware heads are employed after the decoder to adaptively handle multiple segmentation tasks. Experiments on a PET-CT dataset with four cancer types show that DpDNet outperforms state-of-the-art models. Finally, based on the segmentation results, we calculated MTV, TLG, and SUVmax for breast cancer survival analysis. The results suggest that DpDNet has the potential to serve as a valuable tool for personalized risk stratification, supporting clinicians in optimizing treatment strategies and improving outcomes. Code is available at this https URL.
zh

机器学习

[LG-0] Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs

链接: https://arxiv.org/abs/2507.07996
作者: Ziyue Li,Yang Li,Tianyi Zhou
类目: Machine Learning (cs.LG)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Can a pretrained neural network adapt its architecture to different inputs without any finetuning? Do we need all layers for simple tasks, and are they adequate for challenging tasks? We found that the layers of a pretrained large language model (LLM) can be manipulated as separate modules to build a better and even shallower model customized for each test sample. In particular, each layer from the pretrained model can be skipped/pruned or repeated multiple times as recurrent neural networks (RNN), and stacked with others in arbitrary orders, yielding a chain-of-layers (CoLa) per sample. This compositional space greatly expands the scope of existing works on looped/recurrent pretrained modules, layer pruning, or early-exit networks. We develop a Monte Carlo Tree Search (MCTS) protocol to explore and identify the optimal CoLa for each sample from math and commonsense reasoning benchmarks. Compared to a static model of a fixed depth, CoLa allows shortcut paths (fast thinking), recurrence of the same layer(s) (slow thinking), and combining both, offering more flexible, dynamic architectures for different inputs. We conduct an extensive analysis of the MCTS-optimized CoLa, which leads to two key findings: (1) For 75% of samples with correct predictions by the original LLM, we can find shorter CoLa, suggesting a large space for improving inference efficiency; (2) For 60% of samples with originally incorrect predictions, we can identify CoLa achieving correct predictions, suggesting a large space of performance enhancement. Our results highlight the shortcomings of using a fixed architecture of pre-trained LLMs for inference on different samples and pave the way to unlock the generalization power of test-time depth adaptation.
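下面的 PyTorch 小例子示意“层链”(CoLa)的执行方式:把各层视为可组合模块,按给定路径跳过或重复某些层。真实方法作用于预训练 LLM 的 transformer block 并用 MCTS 搜索最优路径,此处仅用残差 MLP 层演示组合机制。

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Sequential(nn.Linear(16, 16), nn.ReLU())
                        for _ in range(6)])

def run_cola(x, path):
    """按层链 path 执行:重复的索引对应"慢思考"(循环层),
    缺失的索引对应"快思考"(跳层)。"""
    for idx in path:
        x = x + layers[idx](x)   # 残差形式,保证跳层/重复后数值尺度稳定
    return x

x = torch.randn(4, 16)
print(run_cola(x, [0, 2, 2, 5]).shape)   # 跳过第 1、3、4 层,第 2 层重复两次
```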

[LG-1] Prospective Learning in Retrospect

链接: https://arxiv.org/abs/2507.07965
作者: Yuxin Bai,Cecelia Shuai,Ashwin De Silva,Siyu Yu,Pratik Chaudhari,Joshua T. Vogelstein
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted to AGI 2025

点击查看摘要

Abstract:In most real-world applications of artificial intelligence, the distributions of the data and the goals of the learners tend to change over time. The Probably Approximately Correct (PAC) learning framework, which underpins most machine learning algorithms, fails to account for dynamic data distributions and evolving objectives, often resulting in suboptimal performance. Prospective learning is a recently introduced mathematical framework that overcomes some of these limitations. We build on this framework to present preliminary results that improve the algorithm and numerical results, and extend prospective learning to sequential decision-making scenarios, specifically foraging. Code is available at: this https URL.

[LG-2] Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

链接: https://arxiv.org/abs/2507.07955
作者: Sukjun Hwang,Brandon Wang,Albert Gu
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net’s improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.

[LG-3] Plausible Counterfactual Explanations of Recommendations

链接: https://arxiv.org/abs/2507.07919
作者: Jakub Černý,Jiří Němeček,Ivan Dovica,Jakub Mareček
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注: 8 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Explanations play a variety of roles in various recommender systems, from a legally mandated afterthought, through an integral element of user experience, to a key to persuasiveness. A natural and useful form of an explanation is the Counterfactual Explanation (CE). We present a method for generating highly plausible CEs in recommender systems and evaluate it both numerically and with a user study.

[LG-4] Efficient Causal Discovery for Autoregressive Time Series

链接: https://arxiv.org/abs/2507.07898
作者: Mohammad Fesanghary,Achintya Gopal
类目: Machine Learning (cs.LG); Applications (stat.AP)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:In this study, we present a novel constraint-based algorithm for causal structure learning specifically designed for nonlinear autoregressive time series. Our algorithm significantly reduces computational complexity compared to existing methods, making it more efficient and scalable to larger problems. We rigorously evaluate its performance on synthetic datasets, demonstrating that our algorithm not only outperforms current techniques, but also excels in scenarios with limited data availability. These results highlight its potential for practical applications in fields requiring efficient and accurate causal inference from nonlinear time series data.

[LG-5] SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation

链接: https://arxiv.org/abs/2507.07883
作者: Hao Ban,Gokul Ram Subramani,Kaiyi Ji
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-task learning (MTL) enables a joint model to capture commonalities across multiple tasks, reducing computation costs and improving data efficiency. However, a major challenge in MTL optimization is task conflicts, where the task gradients differ in direction or magnitude, limiting model performance compared to single-task counterparts. Sharpness-aware minimization (SAM) minimizes task loss while simultaneously reducing the sharpness of the loss landscape. Our empirical observations show that SAM effectively mitigates task conflicts in MTL. Motivated by these findings, we explore integrating SAM into MTL but face two key challenges. While both the average loss gradient and individual task gradients, referred to as global and local information, contribute to SAM, how to combine them remains unclear. Moreover, directly computing each task gradient introduces significant computational and memory overheads. To address these challenges, we propose SAMO, a lightweight Sharpness-Aware Multi-task Optimization approach that leverages a joint global-local perturbation. The local perturbations are approximated using only forward passes and are layerwise normalized to improve efficiency. Extensive experiments on a suite of multi-task benchmarks demonstrate both the effectiveness and efficiency of our method. Code is available at this https URL.
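作为参照,下面给出标准 SAM 在多任务平均损失上的一步更新草图(PyTorch);SAMO 特有的仅前向近似局部扰动与分层归一化在此省略,属于简化示意而非论文实现。

```python
import torch

def sam_multitask_step(model, losses_fn, opt, rho=0.05):
    """对多任务平均损失执行一步标准 SAM;losses_fn() 需重新前向并
    返回各任务损失组成的列表。SAMO 的前向近似与分层归一化在此省略。"""
    opt.zero_grad()
    torch.stack(losses_fn()).mean().backward()        # 原始点的梯度
    eps = {}
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)    # 全局扰动(上升方向)
            p.add_(e)
            eps[p] = e
    opt.zero_grad()
    torch.stack(losses_fn()).mean().backward()        # 扰动点的梯度
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                                  # 恢复原始参数
    opt.step()                                         # 用扰动点梯度更新
```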

[LG-6] Can AI-predicted complexes teach machine learning to compute drug binding affinity?

链接: https://arxiv.org/abs/2507.07882
作者: Wei-Tse Hsu,Savva Grevtsev,Thomas Douglas,Aniket Magarkar,Philip C. Biggin
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We evaluate the feasibility of using co-folding models for synthetic data augmentation in training machine learning-based scoring functions (MLSFs) for binding affinity prediction. Our results show that performance gains depend critically on the structural quality of augmented data. In light of this, we established simple heuristics for identifying high-quality co-folding predictions without reference structures, enabling them to substitute for experimental structures in MLSF training. Our study informs future data augmentation strategies based on co-folding models.

[LG-7] Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models

链接: https://arxiv.org/abs/2507.07877
作者: Chen Feng,Yicheng Lin,Shaojie Zhuo,Chenzheng Su,Ramchalam Kinattinkara Ramakrishnan,Zhaocong Yuan,Xiaopeng Zhang
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource constrained edge devices (e.g., IoT device, wearables) still presents substantial challenges due to strict limits on memory, compute and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performances (i.e., accuracy, memory I/O and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high capacity models when using advanced PTQ techniques. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.
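下面是一个最朴素的逐通道对称 PTQ(round-to-nearest)示意,用于直观理解低比特量化对权重的影响;文中评测的八种先进 PTQ 方法在取整与校准策略上远比这复杂。

```python
import torch

def quantize_weight_per_channel(w, n_bits=3):
    """极简的对称逐通道 PTQ 示意(RTN 取整),非文中的先进方法。
    w: (out_features, in_features) 的权重矩阵。"""
    qmax = 2 ** (n_bits - 1) - 1                       # 3-bit 时取值范围 [-4, 3]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax   # 每个输出通道一个 scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                   # 反量化后的权重(用于评估)

w = torch.randn(8, 16)
w_q = quantize_weight_per_channel(w)
print("mean abs quantization error:", (w - w_q).abs().mean().item())
```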

[LG-8] Improving AEBS Validation Through Objective Intervention Classification Leveraging the Prediction Divergence Principle

链接: https://arxiv.org/abs/2507.07872
作者: Daniel Betschinske,Steven Peters
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注: This work has been accepted for publication at the 2025 IEEE International Automated Vehicle Validation Conference (IAVVC)

点击查看摘要

Abstract:The safety validation of automatic emergency braking system (AEBS) requires accurately distinguishing between false positive (FP) and true positive (TP) system activations. While simulations allow straightforward differentiation by comparing scenarios with and without interventions, analyzing activations from open-loop resimulations - such as those from field operational testing (FOT) - is more complex. This complexity arises from scenario parameter uncertainty and the influence of driver interventions in the recorded data. Human labeling is frequently used to address these challenges, relying on subjective assessments of intervention necessity or situational criticality, potentially introducing biases and limitations. This work proposes a rule-based classification approach leveraging the Prediction Divergence Principle (PDP) to address those issues. Applied to a simplified AEBS, the proposed method reveals key strengths, limitations, and system requirements for effective implementation. The findings suggest that combining this approach with human labeling may enhance the transparency and consistency of classification, thereby improving the overall validation process. While the rule set for classification derived in this work adopts a conservative approach, the paper outlines future directions for refinement and broader applicability. Finally, this work highlights the potential of such methods to complement existing practices, paving the way for more reliable and reproducible AEBS validation frameworks.

[LG-9] Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders

链接: https://arxiv.org/abs/2507.07867
作者: Dimitrios Bralios,Jonah Casebeer,Paris Smaragdis
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted at IEEE MLSP 2025

点击查看摘要

Abstract:Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a “Re-Bottleneck”, an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework’s effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.

[LG-10] Predicting and generating antibiotics against future pathogens with ApexOracle

链接: https://arxiv.org/abs/2507.07862
作者: Tianang Leng,Fangping Wan,Marcelo Der Torossian Torres,Cesar de la Fuente-Nunez
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 3 figures

点击查看摘要

Abstract:Antimicrobial resistance (AMR) is escalating and outpacing current antibiotic development. Thus, discovering antibiotics effective against emerging pathogens is becoming increasingly critical. However, existing approaches cannot rapidly identify effective molecules against novel pathogens or emerging drug-resistant strains. Here, we introduce ApexOracle, an artificial intelligence (AI) model that both predicts the antibacterial potency of existing compounds and designs de novo molecules active against strains it has never encountered. Departing from models that rely solely on molecular features, ApexOracle incorporates pathogen-specific context through the integration of molecular features captured via a foundational discrete diffusion language model and a dual-embedding framework that combines genomic- and literature-derived strain representations. Across diverse bacterial species and chemical modalities, ApexOracle consistently outperformed state-of-the-art approaches in activity prediction and demonstrated reliable transferability to novel pathogens with little or no antimicrobial data. Its unified representation-generation architecture further enables the in silico creation of “new-to-nature” molecules with high predicted efficacy against priority threats. By pairing rapid activity prediction with targeted molecular generation, ApexOracle offers a scalable strategy for countering AMR and preparing for future infectious-disease outbreaks.

[LG-11] Principled Foundations for Preference Optimization

链接: https://arxiv.org/abs/2507.07855
作者: Wenxuan Zhou,Shujian Zhang,Brice Magdalou,John Lambert,Ehsan Amid,Richard Nock,Andrew Hard
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we show that direct preference optimization (DPO) is a very specific form of a connection between two major theories in the ML context of learning from preferences: loss functions (Savage) and stochastic choice (Doignon-Falmagne and Machina). The connection is established for all of Savage’s losses and at this level of generality, (i) it includes support for abstention on the choice theory side, (ii) it includes support for non-convex objectives on the ML side, and (iii) it allows to frame for free some notable extensions of the DPO setting, including margins and corrections for length. Getting to understand how DPO operates from a general principled perspective is crucial because of the huge and diverse application landscape of models, because of the current momentum around DPO, but also – and importantly – because many state of the art variations on DPO definitely occupy a small region of the map that we cover. It also helps to understand the pitfalls of departing from this map, and figure out workarounds.

[LG-12] Credit Risk Analysis for SMEs Using Graph Neural Networks in Supply Chain

链接: https://arxiv.org/abs/2507.07854
作者: Zizhou Zhang,Qinyan Shen,Zhuohuan Hu,Qianying Liu,Huijie Shen
类目: Machine Learning (cs.LG)
备注: The paper will be published at the 2025 International Conference on Big Data, Artificial Intelligence and Digital Economy

点击查看摘要

Abstract:Small and Medium-sized Enterprises (SMEs) are vital to the modern economy, yet their credit risk analysis often struggles with scarce data, especially for online lenders lacking direct credit records. This paper introduces a Graph Neural Network (GNN)-based framework, leveraging SME interactions from transaction and social data to map spatial dependencies and predict loan default risks. Tests on real-world datasets from Discover and Ant Credit (23.4M nodes for supply chain analysis, 8.6M for default prediction) show the GNN surpasses traditional and other GNN baselines, with AUCs of 0.995 and 0.701 for supply chain mining and default prediction, respectively. It also helps regulators model supply chain disruption impacts on banks, accurately forecasting loan defaults from material shortages, and offers Federal Reserve stress testers key data for CCAR risk buffers. This approach provides a scalable, effective tool for assessing SME credit risk.

[LG-13] Pre-Trained AI Model Assisted Online Decision-Making under Missing Covariates: A Theoretical Perspective

链接: https://arxiv.org/abs/2507.07852
作者: Haichen Hu,David Simchi-Levi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study a sequential contextual decision-making problem in which certain covariates are missing but can be imputed using a pre-trained AI model. From a theoretical perspective, we analyze how the presence of such a model influences the regret of the decision-making process. We introduce a novel notion called “model elasticity”, which quantifies the sensitivity of the reward function to the discrepancy between the true covariate and its imputed counterpart. This concept provides a unified way to characterize the regret incurred due to model imputation, regardless of the underlying missingness mechanism. More surprisingly, we show that under the missing at random (MAR) setting, it is possible to sequentially calibrate the pre-trained model using tools from orthogonal statistical learning and doubly robust regression. This calibration significantly improves the quality of the imputed covariates, leading to much better regret guarantees. Our analysis highlights the practical value of having an accurate pre-trained model in sequential decision-making tasks and suggests that model elasticity may serve as a fundamental metric for understanding and improving the integration of pre-trained models in a wide range of data-driven decision-making problems.

[LG-14] “So Tell Me About Your Policy…”: Distillation of interpretable policies from Deep Reinforcement Learning agents

链接: https://arxiv.org/abs/2507.07848
作者: Giovanni Dispoto,Paolo Bonetti,Marcello Restelli
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in Reinforcement Learning (RL) largely benefit from the inclusion of Deep Neural Networks, boosting the number of novel approaches proposed in the field of Deep Reinforcement Learning (DRL). These techniques demonstrate the ability to tackle complex games such as Atari, Go, and other real-world applications, including financial trading. Nevertheless, a significant challenge emerges from the lack of interpretability, particularly when attempting to comprehend the underlying patterns learned, the relative importance of the state features, and how they are integrated to generate the policy’s output. For this reason, in mission-critical and real-world settings, it is often preferred to deploy a simpler and more interpretable algorithm, although at the cost of performance. In this paper, we propose a novel algorithm, supported by theoretical guarantees, that can extract an interpretable policy (e.g., a linear policy) without disregarding the peculiarities of expert behavior. This result is obtained by considering the advantage function, which includes information about why an action is superior to the others. In contrast to previous works, our approach enables the training of an interpretable policy using previously collected experience. The proposed algorithm is empirically evaluated on classic control environments and on a financial trading scenario, demonstrating its ability to extract meaningful information from complex expert policies.

[LG-15] Towards Benchmarking Foundation Models for Tabular Data With Text ICML2025

链接: https://arxiv.org/abs/2507.07829
作者: Martin Mráz,Breenda Das,Anshul Gupta,Lennart Purucker,Frank Hutter
类目: Machine Learning (cs.LG)
备注: Accepted at Foundation Models for Structured Data workshop at ICML 2025

点击查看摘要

Abstract:Foundation models for tabular data are rapidly evolving, with increasing interest in extending them to support additional modalities such as free-text features. However, existing benchmarks for tabular data rarely include textual columns, and identifying real-world tabular datasets with semantically rich text features is non-trivial. We propose a series of simple yet effective ablation-style strategies for incorporating text into conventional tabular pipelines. Moreover, we benchmark how state-of-the-art tabular foundation models can handle textual data by manually curating a collection of real-world tabular datasets with meaningful textual features. Our study is an important step towards improving benchmarking of foundation models for tabular data with text.

[LG-16] An Empirical Bernstein Inequality for Dependent Data in Hilbert Spaces and Applications

链接: https://arxiv.org/abs/2507.07826
作者: Erfan Mirzaei,Andreas Maurer,Vladimir R. Kostic,Massimiliano Pontil
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: In The 28th International Conference on Artificial Intelligence and Statistics (2025)

点击查看摘要

Abstract:Learning from non-independent and non-identically distributed data poses a persistent challenge in statistical learning. In this study, we introduce data-dependent Bernstein inequalities tailored for vector-valued processes in Hilbert space. Our inequalities apply to both stationary and non-stationary processes and exploit the potential rapid decay of correlations between temporally separated variables to improve estimation. We demonstrate the utility of these bounds by applying them to covariance operator estimation in the Hilbert-Schmidt norm and to operator learning in dynamical systems, achieving novel risk bounds. Finally, we perform numerical experiments to illustrate the practical implications of these bounds in both contexts.

[LG-17] Pay Attention to Attention Distribution: A New Local Lipschitz Bound for Transformers

链接: https://arxiv.org/abs/2507.07814
作者: Nikolay Yudin,Alexander Gaponov,Sergei Kudriashov,Maxim Rakhuba
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:We present a novel local Lipschitz bound for self-attention blocks of transformers. This bound is based on a refined closed-form expression for the spectral norm of the softmax function. The resulting bound is not only more accurate than in the prior art, but also unveils the dependence of the Lipschitz constant on attention score maps. Based on the new findings, we suggest an explanation of the way distributions inside the attention map affect the robustness from the Lipschitz constant perspective. We also introduce a new lightweight regularization term called JaSMin (Jacobian Softmax norm Minimization), which boosts the transformer’s robustness and decreases local Lipschitz constants of the whole network.
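softmax 的雅可比矩阵为 J = diag(p) - p p^T,其谱范数取决于注意力分布 p 的形状,这正是文中“Lipschitz 常数依赖注意力得分图”的直观来源。下面用 NumPy 做一个数值演示(论文给出的是更精细的闭式表达,此处仅为直接数值计算)。

```python
import numpy as np

def softmax_jacobian_norm(logits):
    """softmax 在 logits 处的雅可比 J = diag(p) - p p^T 的谱范数。"""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    J = np.diag(p) - np.outer(p, p)
    return np.linalg.norm(J, 2)          # 最大奇异值

print(softmax_jacobian_norm(np.array([0.0, 0.0, 0.0, 0.0])))   # 均匀分布:0.25
print(softmax_jacobian_norm(np.array([10.0, 0.0, 0.0, 0.0])))  # 尖锐分布:接近 0
```

可以看到注意力分布越接近均匀,该范数越大;JaSMin 这类正则项惩罚的正是这一量。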

[LG-18] Deep Survival Analysis in Multimodal Medical Data: A Parametric and Probabilistic Approach with Competing Risks

链接: https://arxiv.org/abs/2507.07804
作者: Alba Garrido,Alejandro Almodóvar,Patricia A. Apellániz,Juan Parras,Santiago Zazo
类目: Machine Learning (cs.LG)
备注: 29 pages, 9 figures

点击查看摘要

Abstract:Accurate survival prediction is critical in oncology for prognosis and treatment planning. Traditional approaches often rely on a single data modality, limiting their ability to capture the complexity of tumor biology. To address this challenge, we introduce a multimodal deep learning framework for survival analysis capable of modeling both single and competing risks scenarios, evaluating the impact of integrating multiple medical data sources on survival predictions. We propose SAMVAE (Survival Analysis Multimodal Variational Autoencoder), a novel deep learning architecture designed for survival prediction that integrates six data modalities: clinical variables, four molecular profiles, and histopathological images. SAMVAE leverages modality-specific encoders to project inputs into a shared latent space, enabling robust survival prediction while preserving modality-specific information. Its parametric formulation enables the derivation of clinically meaningful statistics from the output distributions, providing patient-specific insights through interactive multimedia that contribute to more informed clinical decision-making and establish a foundation for interpretable, data-driven survival analysis in oncology. We evaluate SAMVAE on two cancer cohorts, breast cancer and lower grade glioma, applying tailored preprocessing, dimensionality reduction, and hyperparameter optimization. The results demonstrate the successful integration of multimodal data for both standard survival analysis and competing risks scenarios across different datasets. Our model achieves competitive performance compared to state-of-the-art multimodal survival models. Notably, this is the first parametric multimodal deep learning architecture to incorporate competing risks while modeling continuous time to a specific event, using both tabular and image data.

[LG-19] Space-Filling Regularization for Robust and Interpretable Nonlinear State Space Models

链接: https://arxiv.org/abs/2507.07792
作者: Hermann Klein,Max Heinz Herkersdorf,Oliver Nelles
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The state space dynamics representation is the most general approach for nonlinear systems and often chosen for system identification. During training, the state trajectory can deform significantly leading to poor data coverage of the state space. This can cause significant issues for space-oriented training algorithms which e.g. rely on grid structures, tree partitioning, or similar. Besides hindering training, significant state trajectory deformations also deteriorate interpretability and robustness properties. This paper proposes a new type of space-filling regularization that ensures a favorable data distribution in state space via introducing a data-distribution-based penalty. This method is demonstrated in local model network architectures where good interpretability is a major concern. The proposed approach integrates ideas from modeling and design of experiments for state space structures. This is why we present two regularization techniques for the data point distributions of the state trajectories for local affine state space models. Beyond that, we demonstrate the results on a widely known system identification benchmark.

[LG-20] BEAVER: Building Environments with Assessable Variation for Evaluating Multi-Objective Reinforcement Learning ICML2025

链接: https://arxiv.org/abs/2507.07769
作者: Ruohong Liu,Jack Umenberger,Yize Chen
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Accepted at the Workshop on Computational Optimization of Buildings (ICML CO-BUILD), 42nd International Conference on Machine Learning (ICML 2025), Vancouver, Canada

点击查看摘要

Abstract:Recent years have seen significant advancements in designing reinforcement learning (RL)-based agents for building energy management. While individual success is observed in simulated or controlled environments, the scalability of RL approaches in terms of efficiency and generalization across building dynamics and operational scenarios remains an open question. In this work, we formally characterize the generalization space for the cross-environment, multi-objective building energy management task, and formulate the multi-objective contextual RL problem. Such a formulation helps understand the challenges of transferring learned policies across varied operational contexts such as climate and heat convection dynamics under multiple control objectives such as comfort level and energy consumption. We provide a principled way to parameterize such contextual information in realistic building RL environments, and construct a novel benchmark to facilitate the evaluation of generalizable RL algorithms in practical building control tasks. Our results show that existing multi-objective RL methods are capable of achieving reasonable trade-offs between conflicting objectives. However, their performance degrades under certain environment variations, underscoring the importance of incorporating dynamics-dependent contextual information into the policy learning process.

[LG-21] Distributed and Decentralised Training: Technical Governance Challenges in a Shifting AI Landscape ICML2025

链接: https://arxiv.org/abs/2507.07765
作者: Jakub Kryś,Yashvardhan Sharma,Janet Egan
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted as an oral presentation at the Technical AI Governance Workshop (ICML 2025)

点击查看摘要

Abstract:Advances in low-communication training algorithms are enabling a shift from centralised model training to compute setups that are either distributed across multiple clusters or decentralised via community-driven contributions. This paper distinguishes these two scenarios - distributed and decentralised training - which are little understood and often conflated in policy discourse. We discuss how they could impact technical AI governance through an increased risk of compute structuring, capability proliferation, and the erosion of detectability and shutdownability. While these trends foreshadow a possible new paradigm that could challenge key assumptions of compute governance, we emphasise that certain policy levers, like export controls, remain relevant. We also acknowledge potential benefits of decentralised AI, including privacy-preserving training runs that could unlock access to more data, and mitigating harmful power concentration. Our goal is to support more precise policymaking around compute, capability proliferation, and decentralised AI development.

[LG-22] Efficient and Scalable Estimation of Distributional Treatment Effects with Multi-Task Neural Networks

链接: https://arxiv.org/abs/2507.07738
作者: Tomu Hirata,Undral Byambadalai,Tatsushi Oka,Shota Yasui,Shingo Uto
类目: Machine Learning (cs.LG); Econometrics (econ.EM)
备注:

点击查看摘要

Abstract:We propose a novel multi-task neural network approach for estimating distributional treatment effects (DTE) in randomized experiments. While DTE provides more granular insights into the experiment outcomes over conventional methods focusing on the Average Treatment Effect (ATE), estimating it with regression adjustment methods presents significant challenges. Specifically, precision in the distribution tails suffers due to data imbalance, and computational inefficiencies arise from the need to solve numerous regression problems, particularly in large-scale datasets commonly encountered in industry. To address these limitations, our method leverages multi-task neural networks to estimate conditional outcome distributions while incorporating monotonic shape constraints and multi-threshold label learning to enhance accuracy. To demonstrate the practical effectiveness of our proposed method, we apply our method to both simulated and real-world datasets, including a randomized field experiment aimed at reducing water consumption in the US and a large-scale A/B test from a leading streaming platform in Japan. The experimental results consistently demonstrate superior performance across various datasets, establishing our method as a robust and practical solution for modern causal inference applications requiring a detailed understanding of treatment effect heterogeneity.
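下面的 PyTorch 草图示意“多阈值标签学习 + 单调形状约束”的一种常见实现方式:共享主干同时预测 P(Y ≤ τ_k | X),并用 softplus 增量的累加保证预测的 CDF 对阈值单调不减。阈值选取与网络结构均为演示假设;实际估计 DTE 时,处理组/对照组指示变量可作为输入特征加入,再对两组 CDF 作差。

```python
import torch
import torch.nn as nn

class MultiThresholdCDF(nn.Module):
    """共享主干,联合预测 P(Y <= tau_k | X),k = 1..K。"""
    def __init__(self, x_dim, n_thresholds, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.base = nn.Linear(hidden, 1)                   # 首个阈值的 logit
        self.deltas = nn.Linear(hidden, n_thresholds - 1)  # 相邻阈值间的增量

    def forward(self, x):
        h = self.trunk(x)
        b = self.base(h)
        inc = torch.cumsum(nn.functional.softplus(self.deltas(h)), dim=-1)
        return torch.sigmoid(torch.cat([b, b + inc], dim=-1))  # 对 k 单调不减

taus = torch.linspace(-2.0, 2.0, 9)
model = MultiThresholdCDF(x_dim=5, n_thresholds=len(taus))
x, y = torch.randn(32, 5), torch.randn(32)
labels = (y[:, None] <= taus[None, :]).float()    # 多阈值 0/1 标签
loss = nn.functional.binary_cross_entropy(model(x), labels)
loss.backward()
```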

[LG-23] Accelerating Transposed Convolutions on FPGA-based Edge Devices

链接: https://arxiv.org/abs/2507.07683
作者: Jude Haris,José Cano
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Accepted to 35th International Conference on Field-Programmable Logic and Applications (FPL) 2025

点击查看摘要

Abstract:Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2IM to process TCONV layers on resource-constrained edge devices efficiently. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.

[LG-24] Some Theoretical Results on Layerwise Effective Dimension Oscillations in Finite Width ReLU Networks

链接: https://arxiv.org/abs/2507.07675
作者: Darshan Makwana
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We analyze the layerwise effective dimension (rank of the feature matrix) in fully-connected ReLU networks of finite width. Specifically, for a fixed batch of m inputs and random Gaussian weights, we derive closed-form expressions for the expected rank of the m \times n hidden activation matrices. Our main result shows that \mathbb{E}[\mathrm{EDim}(\ell)] = m[1-(1-2/\pi)^\ell] + O(e^{-cm}), so that the rank deficit decays geometrically with ratio 1-2/\pi \approx 0.3634. We also prove a sub-Gaussian concentration bound, and identify the “revival” depths at which the expected rank attains local maxima. In particular, these peaks occur at depths \ell_k^* \approx (k+1/2)\pi/\log(1/\rho) with height \approx (1-e^{-\pi/2})m \approx 0.79m. We further show that this oscillatory rank behavior is a finite-width phenomenon: under orthogonal weight initialization or strong negative-slope leaky-ReLU, the rank remains (nearly) full. These results provide a precise characterization of how random ReLU layers alternately collapse and partially revive the subspace of input variations, adding nuance to prior work on expressivity of deep networks.
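按摘要中的闭式公式可以直接算出各层的期望有效维数与“复活”深度。下面的小脚本假设其中的 ρ 即衰减比率 1-2/π(摘要未明确给出这一点):

```python
import numpy as np

m = 64
rho = 1 - 2 / np.pi          # 假设:摘要中的 rho 即衰减比率 1 - 2/pi ≈ 0.3634
for l in [1, 2, 3, 5, 10]:
    edim = m * (1 - rho ** l)
    print(f"l={l:2d}: E[EDim] ≈ {edim:6.2f}, rank deficit ≈ {m * rho ** l:5.2f}")

# “复活”深度 l_k* ≈ (k+1/2)*pi/log(1/rho),峰值高度 ≈ (1 - e^{-pi/2}) * m ≈ 0.79 m
print("peak height ≈", (1 - np.exp(-np.pi / 2)) * m)
for k in range(3):
    print(f"k={k}: l_k* ≈ {(k + 0.5) * np.pi / np.log(1 / rho):.2f}")
```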

[LG-25] HLF-FSL. A Decentralized Federated Split Learning Solution for IoT on Hyperledger Fabric

链接: https://arxiv.org/abs/2507.07637
作者: Carlos Beis Penedo,Rebeca P. Díaz Redondo,Ana Fernández Vilas,Manuel Fernández Veiga,Francisco Troncoso Pastoriza
类目: Machine Learning (cs.LG)
备注: 19 pages, 7 figures and 6 tables

点击查看摘要

Abstract:Collaborative machine learning in sensitive domains demands scalable, privacy preserving solutions for enterprise deployment. Conventional Federated Learning (FL) relies on a central server, introducing single points of failure and privacy risks, while Split Learning (SL) partitions models for privacy but scales poorly due to sequential training. We present a decentralized architecture that combines Federated Split Learning (FSL) with the permissioned blockchain Hyperledger Fabric (HLF). Our chaincode orchestrates FSL’s split model execution and peer-to-peer aggregation without any central coordinator, leveraging HLF’s transient fields and Private Data Collections (PDCs) to keep raw data and model activations private. On CIFAR-10 and MNIST benchmarks, HLF-FSL matches centralized FSL accuracy while reducing per epoch training time compared to Ethereum-based works. Performance and scalability tests show minimal blockchain overhead and preserved accuracy, demonstrating enterprise grade viability.

[LG-26] Sparse Causal Discovery with Generative Intervention for Unsupervised Graph Domain Adaptation ICML2025

链接: https://arxiv.org/abs/2507.07621
作者: Junyu Luo,Yuhao Tang,Yiwei Fu,Xiao Luo,Zhizhuo Kou,Zhiping Xiao,Wei Ju,Wentao Zhang,Ming Zhang
类目: Machine Learning (cs.LG)
备注: ICML 2025

点击查看摘要

Abstract:Unsupervised Graph Domain Adaptation (UGDA) leverages labeled source domain graphs to achieve effective performance in unlabeled target domains despite distribution shifts. However, existing methods often yield suboptimal results due to the entanglement of causal-spurious features and the failure of global alignment strategies. We propose SLOGAN (Sparse Causal Discovery with Generative Intervention), a novel approach that achieves stable graph representation transfer through sparse causal modeling and dynamic intervention mechanisms. Specifically, SLOGAN first constructs a sparse causal graph structure, leveraging mutual information bottleneck constraints to disentangle sparse, stable causal features while compressing domain-dependent spurious correlations through variational inference. To address residual spurious correlations, we innovatively design a generative intervention mechanism that breaks local spurious couplings through cross-domain feature recombination while maintaining causal feature semantic consistency via covariance constraints. Furthermore, to mitigate error accumulation in target domain pseudo-labels, we introduce a category-adaptive dynamic calibration strategy, ensuring stable discriminative learning. Extensive experiments on multiple real-world datasets demonstrate that SLOGAN significantly outperforms existing baselines.

[LG-27] Sparse Self-Federated Learning for Energy Efficient Cooperative Intelligence in Society 5.0

链接: https://arxiv.org/abs/2507.07613
作者: Davide Domini,Laura Erhan,Gianluca Aguzzi,Lucia Cavallaro,Amirhossein Douzandeh Zenoozi,Antonio Liotta,Mirko Viroli
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Federated Learning offers privacy-preserving collaborative intelligence but struggles to meet the sustainability demands of emerging IoT ecosystems necessary for Society 5.0-a human-centered technological future balancing social advancement with environmental responsibility. The excessive communication bandwidth and computational resources required by traditional FL approaches make them environmentally unsustainable at scale, creating a fundamental conflict with green AI principles as billions of resource-constrained devices attempt to participate. To this end, we introduce Sparse Proximity-based Self-Federated Learning (SParSeFuL), a resource-aware approach that bridges this gap by combining aggregate computing for self-organization with neural network sparsification to reduce energy and bandwidth consumption.

[LG-28] Synthetic MC via Biological Transmitters: Therapeutic Modulation of the Gut-Brain Axis

链接: https://arxiv.org/abs/2507.07604
作者: Sebastian Lotter,Elisabeth Mohr,Andrina Rutsch,Lukas Brand,Francesca Ronchi,Laura Díaz-Marugán
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
备注:

点击查看摘要

Abstract:Synthetic molecular communication (SMC) is a key enabler for future healthcare systems in which Internet of Bio-Nano-Things (IoBNT) devices facilitate the continuous monitoring of a patient’s biochemical signals. To close the loop between sensing and actuation, both the detection and the generation of in-body molecular communication (MC) signals is key. However, generating signals inside the human body, e.g., via synthetic nanodevices, poses a challenge in SMC, due to technological obstacles as well as legal, safety, and ethical issues. Hence, this paper considers an SMC system in which signals are generated indirectly via the modulation of a natural in-body MC system, namely the gut-brain axis (GBA). Therapeutic GBA modulation is already established as treatment for neurological diseases, e.g., drug refractory epilepsy (DRE), and performed via the administration of nutritional supplements or specific diets. However, the molecular signaling pathways that mediate the effect of such treatments are mostly unknown. Consequently, existing treatments are standardized or designed heuristically and able to help only some patients while failing to help others. In this paper, we propose to leverage personal health data, e.g., gathered by in-body IoBNT devices, to design more versatile and robust GBA modulation-based treatments as compared to the existing ones. To show the feasibility of our approach, we define a catalog of theoretical requirements for therapeutic GBA modulation. Then, we propose a machine learning model to verify these requirements for practical scenarios when only limited data on the GBA modulation exists. By evaluating the proposed model on several datasets, we confirm its excellent accuracy in identifying different modulators of the GBA. Finally, we utilize the proposed model to identify specific modulatory pathways that play an important role for therapeutic GBA modulation.

[LG-29] Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data

链接: https://arxiv.org/abs/2507.07589
作者: Arpana Sinhal,Anay Sinhal,Amit Sinhal
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Healthcare professionals, particularly nurses, face elevated occupational stress, a concern amplified during the COVID-19 pandemic. While wearable sensors offer promising avenues for real-time stress monitoring, existing studies often lack comprehensive datasets and robust analytical frameworks. This study addresses these gaps by introducing a multimodal dataset comprising physiological signals: electrodermal activity, heart rate, and skin temperature. A systematic literature review identified limitations in prior stress-detection methodologies, particularly in handling class imbalance and optimizing model generalizability. To overcome these challenges, the dataset underwent preprocessing with the Synthetic Minority Oversampling Technique (SMOTE), ensuring balanced representation of stress states. Advanced machine learning models including Random Forest, XGBoost and a Multi-Layer Perceptron (MLP) were evaluated and combined into a Stacking Classifier to leverage their collective predictive strengths. By using a publicly accessible dataset and a reproducible analytical pipeline, this work advances the development of deployable stress-monitoring systems, offering practical implications for safeguarding healthcare workers’ mental health. Future research directions include expanding demographic diversity and exploring edge-computing implementations for low-latency stress alerts.
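
For readers who want to reproduce the general recipe, here is a minimal sketch of the described pipeline (SMOTE rebalancing followed by a stacking ensemble of Random Forest, XGBoost, and MLP). The data, feature dimensions, and hyperparameters are placeholders, not the authors' configuration:

```python
# Sketch: SMOTE rebalancing + stacking ensemble for stress classification.
# X stands in for windowed EDA / heart-rate / skin-temperature features.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X, y = np.random.randn(1000, 12), np.random.randint(0, 2, 1000)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance stress / no-stress classes on the training split only.
X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss")),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```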

[LG-30] CHOMET: Conditional Handovers via Meta-Learning

链接: https://arxiv.org/abs/2507.07581
作者: Michail Kalntis,Fernando A. Kuipers,George Iosifidis
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Handovers (HOs) are the cornerstone of modern cellular networks for enabling seamless connectivity to a vast and diverse number of mobile users. However, as mobile networks become more complex with more diverse users and smaller cells, traditional HOs face significant challenges, such as prolonged delays and increased failures. To mitigate these issues, 3GPP introduced conditional handovers (CHOs), a new type of HO that enables the preparation (i.e., resource allocation) of multiple cells for a single user to increase the chance of HO success and decrease the delays in the procedure. Despite its advantages, CHO introduces new challenges that must be addressed, including efficient resource allocation and managing signaling/communication overhead from frequent cell preparations and releases. This paper presents a novel framework aligned with the O-RAN paradigm that leverages meta-learning for CHO optimization, providing robust dynamic regret guarantees and demonstrating performance at least 180% better than other 3GPP benchmarks in volatile signal conditions.

[LG-31] Real-Time Decorrelation-Based Anomaly Detection for Multivariate Time Series

链接: https://arxiv.org/abs/2507.07559
作者: Amirhossein Sadough,Mahyar Shahsavari,Mark Wijtvliet,Marcel van Gerven
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Anomaly detection (AD) plays a vital role across a wide range of real-world domains by identifying data instances that deviate from expected patterns, potentially signaling critical events such as system failures, fraudulent activities, or rare medical conditions. The demand for real-time AD has surged with the rise of the (Industrial) Internet of Things, where massive volumes of multivariate sensor data must be processed instantaneously. Real-time AD requires methods that not only handle high-dimensional streaming data but also operate in a single-pass manner, without the burden of storing historical instances, thereby ensuring minimal memory usage and fast decision-making. We propose DAD, a novel real-time decorrelation-based anomaly detection method for multivariate time series, based on an online decorrelation learning approach. Unlike traditional proximity-based or reconstruction-based detectors that process entire data or windowed instances, DAD dynamically learns and monitors the correlation structure of data sample by sample in a single pass, enabling efficient and effective detection. To support more realistic benchmarking practices, we also introduce a practical hyperparameter tuning strategy tailored for real-time anomaly detection scenarios. Extensive experiments on widely used benchmark datasets demonstrate that DAD achieves the most consistent and superior performance across diverse anomaly types compared to state-of-the-art methods. Crucially, its robustness to increasing dimensionality makes it particularly well-suited for real-time, high-dimensional data streams. Ultimately, DAD not only strikes an optimal balance between detection efficacy and computational efficiency but also sets a new standard for real-time, memory-constrained anomaly detection.
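
DAD's exact update rule is not given in the abstract; the sketch below only illustrates the general single-pass, decorrelation-based idea it describes: maintain running moments sample by sample and score each point by its whitened residual. The update rule, decay rate, and scoring here are illustrative assumptions, not the paper's method:

```python
import numpy as np

class OnlineDecorrelationAD:
    """Single-pass anomaly scorer: tracks a running mean/covariance and
    scores each sample by its Mahalanobis-style whitened residual.
    (Illustrative stand-in for DAD, not the paper's exact algorithm.)"""
    def __init__(self, dim, alpha=0.01, eps=1e-6):
        self.mu = np.zeros(dim)
        self.cov = np.eye(dim)
        self.alpha, self.eps = alpha, eps

    def score_and_update(self, x):
        d = x - self.mu
        # Score before updating, so the sample cannot explain itself away.
        inv = np.linalg.inv(self.cov + self.eps * np.eye(len(x)))
        score = float(d @ inv @ d)
        # Exponential moving updates: O(dim^2) time/memory, no stored history.
        self.mu += self.alpha * d
        self.cov = (1 - self.alpha) * self.cov + self.alpha * np.outer(d, d)
        return score

stream = np.random.randn(5000, 8)
stream[2500] += 8.0  # inject one anomaly
ad = OnlineDecorrelationAD(dim=8)
scores = [ad.score_and_update(x) for x in stream]
print("most anomalous index:", int(np.argmax(scores)))
```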

[LG-32] Uncertainty Quantification for Motor Imagery BCI – Machine Learning vs. Deep Learning

链接: https://arxiv.org/abs/2507.07511
作者: Joris Suurmeijer,Ivo Pascal de Jong,Matias Valdenegro-Toro,Andreea Ioana Sburlea
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) turn brain signals into functionally useful output, but they are not always accurate. A good Machine Learning classifier should be able to indicate how confident it is about a given classification, by giving a probability for its classification. Standard classifiers for Motor Imagery BCIs do give such probabilities, but research on uncertainty quantification has been limited to Deep Learning. We compare the uncertainty quantification ability of established BCI classifiers using Common Spatial Patterns (CSP-LDA) and Riemannian Geometry (MDRM) to specialized methods in Deep Learning (Deep Ensembles and Direct Uncertainty Quantification) as well as standard Convolutional Neural Networks (CNNs). We found that the overconfidence typically seen in Deep Learning is not a problem in CSP-LDA and MDRM. We found that MDRM is underconfident, which we solved by adding Temperature Scaling (MDRM-T). CSP-LDA and MDRM-T give the best uncertainty estimates, but Deep Ensembles and standard CNNs give the best classifications. We show that all models are able to separate between easy and difficult estimates, so that we can increase the accuracy of a Motor Imagery BCI by rejecting samples that are ambiguous.
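
Temperature scaling, which the authors add to MDRM to form MDRM-T, is a standard one-parameter recalibration; a generic sketch (not the paper's code) that fits T on validation logits by minimizing negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(T, logits, labels):
    # Softmax with temperature, numerically stabilized.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    # One scalar T > 0, chosen to minimize validation NLL.
    res = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0),
                          args=(val_logits, val_labels), method="bounded")
    return res.x

val_logits = np.random.randn(200, 2)          # placeholder classifier outputs
val_labels = np.random.randint(0, 2, 200)
T = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {T:.3f}  (T < 1 sharpens an underconfident model)")
```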

[LG-33] General purpose models for the chemical sciences

链接: https://arxiv.org/abs/2507.07456
作者: Nawaf Alampara,Anagha Aneesh,Martiño Ríos-García,Adrian Mirza,Mara Schilling-Wilhelmi,Ali Asghar Aghajani,Meiling Sun,Gordan Prastalo,Kevin Maik Jablonka
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Data-driven techniques have a large potential to transform and accelerate the chemical sciences. However, chemical sciences also pose the unique challenge of very diverse, small, fuzzy datasets that are difficult to leverage in conventional machine learning approaches completely. A new class of models, general-purpose models (GPMs) such as large language models, have shown the ability to solve tasks they have not been directly trained on, and to flexibly operate with low amounts of data in different formats. In this review, we discuss fundamental building principles of GPMs and review recent applications of those models in the chemical sciences across the entire scientific process. While many of these applications are still in the prototype phase, we expect that the increasing interest in GPMs will make many of them mature in the coming years.

[LG-34] Neural networks leverage nominally quantum and post-quantum representations

链接: https://arxiv.org/abs/2507.07432
作者: Paul M. Riechers,Thomas J. Elliott,Adam S. Shai
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:We show that deep neural networks, including transformers and RNNs, pretrained as usual on next-token prediction, intrinsically discover and represent beliefs over ‘quantum’ and ‘post-quantum’ low-dimensional generative models of their training data – as if performing iterative Bayesian updates over the latent state of this world model during inference as they observe more context. Notably, neural nets easily find these representations, whereas there is no finite classical circuit that would do the job. The corresponding geometric relationships among neural activations induced by different input sequences are found to be largely independent of neural-network architecture. Each point in this geometry corresponds to a history-induced probability density over all possible futures, and the relative displacement of these points reflects the difference in mechanism and magnitude for how these distinct pasts affect the future.

[LG-35] IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing

链接: https://arxiv.org/abs/2507.07396
作者: Zeyang Song,Shimin Zhang,Yuhong Chou,Jibin Wu,Haizhou Li
类目: Multimedia (cs.MM); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Under review of TNNLS

点击查看摘要

Abstract:Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these issues, we introduce the Input-aware Multi-Level Spikeformer, i.e., IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Reparameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by 4.64x and 4.32x respectively. IML-Spikeformer marks an advance in scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency.

[LG-36] Learning Collective Variables from Time-lagged Generation

链接: https://arxiv.org/abs/2507.07390
作者: Seonghyun Park,Kiyoung Seong,Soojung Yang,Rafael Gómez-Bombarelli,Sungsoo Ahn
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rare events such as state transitions are difficult to observe directly with molecular dynamics simulations due to long timescales. Enhanced sampling techniques overcome this by introducing biases along carefully chosen low-dimensional features, known as collective variables (CVs), which capture the slow degrees of freedom. Machine learning approaches (MLCVs) have automated CV discovery, but existing methods typically focus on discriminating meta-stable states without fully encoding the detailed dynamics essential for accurate sampling. We propose TLC, a framework that learns CVs directly from time-lagged conditions of a generative model. Instead of modeling the static Boltzmann distribution, TLC models a time-lagged conditional distribution yielding CVs to capture the slow dynamic behavior. We validate TLC on the Alanine Dipeptide system using two CV-based enhanced sampling tasks: (i) steered molecular dynamics (SMD) and (ii) on-the-fly probability enhanced sampling (OPES), demonstrating equal or superior performance compared to existing MLCV methods in both transition path sampling and state discrimination.

[LG-37] GRIT: Graph Transformer For Internal Ice Layer Thickness Prediction

链接: https://arxiv.org/abs/2507.07388
作者: Zesheng Liu,Maryam Rahnemoonfar
类目: Machine Learning (cs.LG)
*备注: Accepted for 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)

点击查看摘要

Abstract:Gaining a deeper understanding of the thickness and variability of internal ice layers in Radar imagery is essential in monitoring the snow accumulation, better evaluating ice dynamics processes, and minimizing uncertainties in climate models. Radar sensors, capable of penetrating ice, capture detailed radargram images of internal ice layers. In this work, we introduce GRIT, a graph transformer for internal ice layer thickness prediction. GRIT integrates an inductive geometric graph learning framework with an attention mechanism, designed to map the relationships between shallow and deeper ice layers. Compared to baseline graph neural networks, GRIT demonstrates consistently lower prediction errors. These results highlight the attention mechanism’s effectiveness in capturing temporal changes across ice layers, while the graph transformer combines the strengths of transformers for learning long-range dependencies with graph neural networks for capturing spatial patterns, enabling robust modeling of complex spatiotemporal dynamics.

[LG-38] Data-driven Kinematic Modeling in Soft Robots: System Identification and Uncertainty Quantification

链接: https://arxiv.org/abs/2507.07370
作者: Zhanhong Jiang,Dylan Shah,Hsin-Jung Yang,Soumik Sarkar
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 pages; 6 figures; accepted at the 5th Modeling, Estimation and Control Conference (MECC 2025)

点击查看摘要

Abstract:Precise kinematic modeling is critical in calibration and controller design for soft robots, yet remains a challenging issue due to their highly nonlinear and complex behaviors. To tackle the issue, numerous data-driven machine learning approaches have been proposed for modeling nonlinear dynamics. However, these models suffer from prediction uncertainty that can negatively affect modeling accuracy, and uncertainty quantification for kinematic modeling in soft robots is underexplored. In this work, using limited simulation and real-world data, we first investigate multiple linear and nonlinear machine learning models commonly used for kinematic modeling of soft robots. The results reveal that nonlinear ensemble methods exhibit the most robust generalization performance. We then develop a conformal kinematic modeling framework for soft robots by utilizing split conformal prediction to quantify predictive position uncertainty, ensuring distribution-free prediction intervals with a theoretical guarantee.
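
Split conformal prediction, the tool used here for position uncertainty, follows a standard recipe that is independent of the underlying regressor; a minimal sketch with a placeholder model and synthetic data (not the authors' kinematic models):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def split_conformal(model, X_cal, y_cal, X_new, alpha=0.1):
    # Nonconformity score: absolute residual on a held-out calibration set.
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    # Finite-sample corrected quantile gives distribution-free (1-alpha) coverage.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = model.predict(X_new)
    return pred - q, pred + q

X = np.random.rand(600, 4)                   # placeholder actuation inputs
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * np.random.randn(600)
model = GradientBoostingRegressor().fit(X[:400], y[:400])
lo, hi = split_conformal(model, X[400:500], y[400:500], X[500:])
coverage = np.mean((y[500:] >= lo) & (y[500:] <= hi))
print(f"empirical coverage at alpha=0.1: {coverage:.2f}")
```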

[LG-39] Learning from positive and unlabeled examples - Finite size sample bounds

链接: https://arxiv.org/abs/2507.07354
作者: Farnam Mansouri,Shai Ben-David
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:PU (Positive Unlabeled) learning is a variant of supervised classification learning in which the only labels revealed to the learner are of positively labeled instances. PU learning arises in many real-world applications. Most existing work relies on the simplifying assumptions that the positively labeled training data is drawn from the restriction of the data generating distribution to positively labeled instances and/or that the proportion of positively labeled points (a.k.a. the class prior) is known a priori to the learner. This paper provides a theoretical analysis of the statistical complexity of PU learning under a wider range of setups. Unlike most prior work, our study does not assume that the class prior is known to the learner. We prove upper and lower bounds on the required sample sizes (of both the positively labeled and the unlabeled samples).
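
The paper itself is theoretical; as a concrete point of reference, the classical Elkan-Noto baseline (not the authors' contribution) handles the unknown-class-prior setting by estimating the labeling propensity from the positives:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder PU data: s=1 marks labeled positives, s=0 the unlabeled mix.
rng = np.random.default_rng(0)
X_pos, X_neg = rng.normal(2, 1, (300, 2)), rng.normal(-2, 1, (300, 2))
X = np.vstack([X_pos, X_neg])
y_true = np.array([1] * 300 + [0] * 300)          # hidden ground truth
s = (y_true == 1) & (rng.random(600) < 0.4)       # only 40% of positives labeled

# Step 1: train a "labeled vs unlabeled" classifier g(x) ~ P(s=1 | x).
g = LogisticRegression().fit(X, s.astype(int))

# Step 2: estimate c = P(s=1 | y=1) as the mean of g on known positives.
c = g.predict_proba(X[s])[:, 1].mean()

# Step 3: correct to the true posterior, P(y=1 | x) = g(x) / c.
posterior = np.clip(g.predict_proba(X)[:, 1] / c, 0, 1)
print("accuracy vs hidden labels:", ((posterior > 0.5) == y_true).mean())
```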

[LG-40] Machine Learning-driven Multiscale MD Workflows: The Mini-MuMMI Experience

链接: https://arxiv.org/abs/2507.07352
作者: Loïc Pottier,Konstantia Georgouli,Timothy S. Carpenter,Fikret Aydin,Jeremy O. B. Tempkin,Dwight V. Nissley,Frederick H. Streitz,Thomas R. W. Scogland,Peer-Timo Bremer,Felice C. Lightstone,Helgi I. Ingólfsson
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Computational models have become one of the prevalent methods to model complex phenomena. To accurately model complex interactions, such as detailed biomolecular interactions, scientists often rely on multiscale models comprised of several internal models operating at different scales, ranging from microscopic to macroscopic length and time scales. Bridging the gap between different time and length scales has historically been challenging but the advent of newer machine learning (ML) approaches has shown promise for tackling that task. Multiscale models require massive amounts of computational power and a powerful workflow management system. Orchestrating ML-driven multiscale studies on parallel systems with thousands of nodes is challenging: the workflow must schedule, allocate and control thousands of simulations operating at different scales. Here, we discuss the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a multiscale workflow management infrastructure, that can orchestrate thousands of molecular dynamics (MD) simulations operating at different timescales, spanning from millisecond to nanosecond. More specifically, we introduce a novel version of MuMMI called “mini-MuMMI”. Mini-MuMMI is a curated version of MuMMI designed to run on modest HPC systems or even laptops, whereas MuMMI requires larger HPC systems. We demonstrate mini-MuMMI’s utility by exploring RAS-RAF membrane interactions and discuss the different challenges behind the generalization of multiscale workflows and how mini-MuMMI can be leveraged to target a broader range of applications outside of MD and RAS-RAF interactions.

[LG-41] Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts NEURIPS2025

链接: https://arxiv.org/abs/2507.07348
作者: James Chapman,Kedar Karhadkar,Guido Montufar
类目: Machine Learning (cs.LG)
*备注: 10 pages, 8 figures, 3 tables, submitted to Neurips 2025

点击查看摘要

Abstract:Deep reinforcement learning (DRL) has achieved remarkable success across multiple domains, including competitive games, natural language processing, and robotics. Despite these advancements, policies trained via DRL often struggle to generalize to evaluation environments with different parameters. This challenge is typically addressed by training with multiple contexts and/or by leveraging additional structure in the problem. However, obtaining sufficient training data across diverse contexts can be impractical in real-world applications. In this work, we consider contextual Markov decision processes (CMDPs) with transition and reward functions that exhibit regularity in context parameters. We introduce the context-enhanced Bellman equation (CEBE) to improve generalization when training on a single context. We prove both analytically and empirically that the CEBE yields a first-order approximation to the Q-function trained across multiple contexts. We then derive context sample enhancement (CSE) as an efficient data augmentation method for approximating the CEBE in deterministic control environments. We numerically validate the performance of CSE in simulation environments, showcasing its potential to improve generalization in DRL.

[LG-42] Optimizing Model Splitting and Device Task Assignment for Deceptive Signal Assisted Private Multi-hop Split Learning

链接: https://arxiv.org/abs/2507.07323
作者: Dongyu Wei,Xiaoren Xu,Yuchen Liu,H. Vincent Poor,Mingzhe Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, deceptive signal-assisted private split learning is investigated. In our model, several edge devices jointly perform collaborative training, and some eavesdroppers aim to collect the model and data information from devices. To prevent the eavesdroppers from collecting model and data information, a subset of devices can transmit deceptive signals. Therefore, it is necessary to determine the subset of devices used for deceptive signal transmission, the subset of model training devices, and the models assigned to each model training device. This problem is formulated as an optimization problem whose goal is to minimize the information leaked to eavesdroppers while meeting the model training energy consumption and delay constraints. To solve this problem, we propose a soft actor-critic deep reinforcement learning framework with intrinsic curiosity module and cross-attention (ICM-CA) that enables a centralized agent to determine the model training devices, the deceptive signal transmission devices, the transmit power, and sub-models assigned to each model training device without knowing the position and monitoring probability of eavesdroppers. The proposed method uses an ICM module to encourage the server to explore novel actions and states and a CA module to determine the importance of each historical state-action pair thus improving training efficiency. Simulation results demonstrate that the proposed method improves the convergence rate by up to 3x and reduces the information leaked to eavesdroppers by up to 13% compared to the traditional SAC algorithm.

[LG-43] Optimizing Communication and Device Clustering for Clustered Federated Learning with Differential Privacy

链接: https://arxiv.org/abs/2507.07320
作者: Dongyu Wei,Xiaoren Xu,Shiwen Mao,Mingzhe Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, a secure and communication-efficient clustered federated learning (CFL) design is proposed. In our model, several base stations (BSs) with heterogeneous task-handling capabilities and multiple users with non-independent and identically distributed (non-IID) data jointly perform CFL training incorporating differential privacy (DP) techniques. Since each BS can process only a subset of the learning tasks and has limited wireless resource blocks (RBs) to allocate to users for federated learning (FL) model parameter transmission, it is necessary to jointly optimize RB allocation and user scheduling for CFL performance optimization. Meanwhile, our considered CFL method requires devices to use their limited data and FL model information to determine their task identities, which may introduce additional communication overhead. We formulate an optimization problem whose goal is to minimize the training loss of all learning tasks while considering device clustering, RB allocation, DP noise, and FL model transmission delay. To solve the problem, we propose a novel dynamic penalty function assisted value decomposed multi-agent reinforcement learning (DPVD-MARL) algorithm that enables distributed BSs to independently determine their connected users, RBs, and DP noise of the connected users but jointly minimize the training loss of all learning tasks across all BSs. Different from the existing MARL methods that assign a large penalty for invalid actions, we propose a novel penalty assignment scheme that assigns penalty depending on the number of devices that cannot meet communication constraints (e.g., delay), which can guide the MARL scheme to quickly find valid actions, thus improving the convergence speed. Simulation results show that the DPVD-MARL can improve the convergence rate by up to 20% and the ultimate accumulated rewards by 15% compared to independent Q-learning.

[LG-44] AdeptHEQ-FL: Adaptive Homomorphic Encryption for Federated Learning of Hybrid Classical-Quantum Models with Dynamic Layer Sparing ICCV’25

链接: https://arxiv.org/abs/2507.07316
作者: Md Abrar Jahin,Taufikur Rahman Fuad,M. F. Mridha,Nafiz Fahad,Md. Jakir Hossen
类目: Machine Learning (cs.LG)
*备注: Accepted in 1st International Workshop on ICCV’25 BISCUIT (Biomedical Image and Signal Computing for Unbiasedness, Interpretability, and Trustworthiness)

点击查看摘要

Abstract:Federated Learning (FL) faces inherent challenges in balancing model performance, privacy preservation, and communication efficiency, especially in non-IID decentralized environments. Recent approaches either sacrifice formal privacy guarantees, incur high overheads, or overlook quantum-enhanced expressivity. We introduce AdeptHEQ-FL, a unified hybrid classical-quantum FL framework that integrates (i) a hybrid CNN-PQC architecture for expressive decentralized learning, (ii) an adaptive accuracy-weighted aggregation scheme leveraging differentially private validation accuracies, (iii) selective homomorphic encryption (HE) for secure aggregation of sensitive model layers, and (iv) dynamic layer-wise adaptive freezing to minimize communication overhead while preserving quantum adaptability. We establish formal privacy guarantees, provide convergence analysis, and conduct extensive experiments on the CIFAR-10, SVHN, and Fashion-MNIST datasets. AdeptHEQ-FL achieves accuracy improvements of approximately 25.43% and 14.17% over Standard-FedQNN and FHE-FedQNN, respectively, on the CIFAR-10 dataset. Additionally, it reduces communication overhead by freezing less important layers, demonstrating the efficiency and practicality of our privacy-preserving, resource-aware design for FL.

[LG-45] Frontier LLMs Still Struggle with Simple Reasoning Tasks

链接: https://arxiv.org/abs/2507.07313
作者: Alan Malek,Jiawei Ge,Chi Jin,András György,Csaba Szepesvári
类目: Machine Learning (cs.LG)
*备注: 53 pages

点击查看摘要

Abstract:While state-of-the-art large language models (LLMs) demonstrate advanced reasoning capabilities, achieving remarkable performance on challenging competitive math and coding benchmarks, they also frequently fail on tasks that are easy for humans. This work studies the performance of frontier LLMs on a broad set of such “easy” reasoning problems. By extending previous work in the literature, we create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning, with changeable parameters (such as document length or the number of variables in a math problem) that can arbitrarily increase the amount of computation required to produce the answer while preserving the fundamental difficulty. While previous work showed that traditional, non-thinking models can be made to fail on such problems, we demonstrate that even state-of-the-art thinking models consistently fail on such problems and for similar reasons (e.g. statistical shortcuts, errors in intermediate steps, and difficulties in processing long contexts). To further understand the behavior of the models, we introduce the unpuzzles dataset, a different “easy” benchmark consisting of trivialized versions of well-known math and logic puzzles. Interestingly, while modern LLMs excel at solving the original puzzles, they tend to fail on the trivialized versions, exhibiting several systematic failure patterns related to memorizing the originals. We show that this happens even if the models are otherwise able to solve problems with different descriptions but requiring the same logic. Our results highlight that out-of-distribution generalization is still problematic for frontier language models and the new generation of thinking models, even for simple reasoning tasks, and making tasks easier does not necessarily imply improved performance.

[LG-46] Multilayer GNN for Predictive Maintenance and Clustering in Power Grids

链接: https://arxiv.org/abs/2507.07298
作者: Muhammad Kazim,Harun Pirim,Chau Le,Trung Le,Om Prakash Yadav
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unplanned power outages cost the US economy over $150 billion annually, partly due to predictive maintenance (PdM) models that overlook spatial, temporal, and causal dependencies in grid failures. This study introduces a multilayer Graph Neural Network (GNN) framework to enhance PdM and enable resilience-based substation clustering. Using seven years of incident data from Oklahoma Gas & Electric (292,830 records across 347 substations), the framework integrates Graph Attention Networks (spatial), Graph Convolutional Networks (temporal), and Graph Isomorphism Networks (causal), fused through attention-weighted embeddings. Our model achieves a 30-day F1-score of 0.8935 +/- 0.0258, outperforming XGBoost and Random Forest by 3.2% and 2.7%, and single-layer GNNs by 10 to 15 percent. Removing the causal layer drops performance to 0.7354 +/- 0.0418. For resilience analysis, HDBSCAN clustering on HierarchicalRiskGNN embeddings identifies eight operational risk groups. The highest-risk cluster (Cluster 5, 44 substations) shows 388.4 incidents/year and a 602.6-minute recovery time, while low-risk groups report fewer than 62 incidents/year. ANOVA (p < 0.0001) confirms significant inter-cluster separation. Our clustering outperforms K-Means and Spectral Clustering with a Silhouette Score of 0.626 and a Davies-Bouldin index of 0.527. This work supports proactive grid management through improved failure prediction and risk-aware substation clustering.
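
The clustering stage can be sketched independently of the GNN: run HDBSCAN (via the hdbscan package) on the learned substation embeddings and check cluster separation. The embeddings and min_cluster_size below are placeholders standing in for the paper's HierarchicalRiskGNN outputs:

```python
import hdbscan
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Placeholder for learned per-substation embeddings (347 substations x 32 dims),
# drawn as two synthetic blobs so the clustering has something to find.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0, 1, (200, 32)),   # low-risk group placeholder
    rng.normal(6, 1, (147, 32)),   # high-risk group placeholder
])

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)   # min_cluster_size is a guess
labels = clusterer.fit_predict(embeddings)

# Score only clustered points; HDBSCAN marks noise as -1.
mask = labels >= 0
if mask.sum() and len(set(labels[mask])) > 1:
    print("silhouette:", silhouette_score(embeddings[mask], labels[mask]))
    print("davies-bouldin:", davies_bouldin_score(embeddings[mask], labels[mask]))
```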

[LG-47] Discretization-independent multifidelity operator learning for partial differential equations

链接: https://arxiv.org/abs/2507.07292
作者: Jacob Hauck,Yanzhi Zhang
类目: Machine Learning (cs.LG)
*备注: 33 pages, 9 figures, submitted to the Journal of Machine Learning Research

点击查看摘要

Abstract:We develop a new and general encode-approximate-reconstruct operator learning model that leverages learned neural representations of bases for input and output function distributions. We introduce the concepts of numerical operator learning and discretization independence, which clarify the relationship between theoretical formulations and practical realizations of operator learning models. Our model is discretization-independent, making it particularly effective for multifidelity learning. We establish theoretical approximation guarantees, demonstrating uniform universal approximation under strong assumptions on the input functions and statistical approximation under weaker conditions. To our knowledge, this is the first comprehensive study that investigates how discretization independence enables robust and efficient multifidelity operator learning. We validate our method through extensive numerical experiments involving both local and nonlocal PDEs, including time-independent and time-dependent problems. The results show that multifidelity training significantly improves accuracy and computational efficiency. Moreover, multifidelity training further enhances empirical discretization independence.

[LG-48] Estimating Dataset Dimension via Singular Metrics under the Manifold Hypothesis: Application to Inverse Problems

链接: https://arxiv.org/abs/2507.07291
作者: Paola Causin,Alessio Marta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-dimensional datasets often exhibit low-dimensional geometric structures, as suggested by the manifold hypothesis, which implies that data lie on a smooth manifold embedded in a higher-dimensional ambient space. While this insight underpins many advances in machine learning and inverse problems, fully leveraging it requires dealing with three key tasks: estimating the intrinsic dimension (ID) of the manifold, constructing appropriate local coordinates, and learning mappings between ambient and manifold spaces. In this work, we propose a framework that addresses all these challenges using a Mixture of Variational Autoencoders (VAEs) and tools from Riemannian geometry. We specifically focus on estimating the ID of datasets by analyzing the numerical rank of the VAE decoder pullback metric. The estimated ID guides the construction of an atlas of local charts using a mixture of invertible VAEs, enabling accurate manifold parameterization and efficient inference. We show how this approach enhances solutions to ill-posed inverse problems, particularly in biomedical imaging, by enforcing that reconstructions lie on the learned manifold. Lastly, we explore the impact of network pruning on manifold geometry and reconstruction quality, showing that the intrinsic dimension serves as an effective proxy for monitoring model capacity.
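
The ID estimate reduces to a numerical-rank computation on the pullback metric G = J^T J, where J is the decoder Jacobian at a latent point; a minimal PyTorch sketch with a toy decoder standing in for the trained mixture of VAEs (the rank tolerance is an illustrative choice):

```python
import torch
from torch.autograd.functional import jacobian

# Toy stand-in for a trained VAE decoder: latent dim 8 -> ambient dim 100.
decoder = torch.nn.Sequential(
    torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 100)
)

def intrinsic_dim_at(z, rel_tol=1e-3):
    # Pullback metric G = J^T J, with J the decoder Jacobian at z.
    J = jacobian(lambda v: decoder(v), z)     # shape (100, 8)
    G = J.T @ J
    eigvals = torch.linalg.eigvalsh(G)
    # Numerical rank: count eigenvalues above a relative tolerance.
    return int((eigvals > rel_tol * eigvals.max()).sum())

z_samples = torch.randn(20, 8)
estimates = sorted(intrinsic_dim_at(z) for z in z_samples)
print("ID estimate (median over latent samples):", estimates[len(estimates) // 2])
```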

[LG-49] Natural Evolutionary Search meets Probabilistic Numerics

链接: https://arxiv.org/abs/2507.07288
作者: Pierre Osselin,Masaki Adachi,Xiaowen Dong,Michael A. Osborne
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 figures (24 pages, 11 figures including references and appendices)

点击查看摘要

Abstract:Zeroth-order local optimisation algorithms are essential for solving real-valued black-box optimisation problems. Among these, Natural Evolution Strategies (NES) represent a prominent class, particularly well-suited for scenarios where prior distributions are available. By optimising the objective function in the space of search distributions, NES algorithms naturally integrate prior knowledge during initialisation, making them effective in settings such as semi-supervised learning and user-prior belief frameworks. However, due to their reliance on random sampling and Monte Carlo estimates, NES algorithms can suffer from limited sample efficiency. In this paper, we introduce a novel class of algorithms, termed Probabilistic Natural Evolutionary Strategy Algorithms (ProbNES), which enhance the NES framework with Bayesian quadrature. We show that ProbNES algorithms consistently outperform their non-probabilistic counterparts as well as global sample-efficient methods such as Bayesian Optimisation (BO) or πBO across a wide range of tasks, including benchmark test functions, data-driven optimisation tasks, user-informed hyperparameter tuning tasks and locomotion tasks.

[LG-50] TRIP: A Nonparametric Test to Diagnose Biased Feature Importance Scores IJCAI2025

链接: https://arxiv.org/abs/2507.07276
作者: Aaron Foote,Danny Krizanc
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Accepted at the Workshop on Explainable Artificial Intelligence (XAI) at IJCAI 2025

点击查看摘要

Abstract:Along with accurate prediction, understanding the contribution of each feature to the making of the prediction, i.e., the importance of the feature, is a desirable and arguably necessary component of a machine learning model. For a complex model such as a random forest, such importances are not innate – as they are, e.g., with linear regression. Efficient methods have been created to provide such capabilities, with one of the most popular among them being permutation feature importance due to its efficiency, model-agnostic nature, and perceived intuitiveness. However, permutation feature importance has been shown to be misleading in the presence of dependent features as a result of the creation of unrealistic observations when permuting the dependent features. In this work, we develop TRIP (Test for Reliable Interpretation via Permutation), a test requiring minimal assumptions that is able to detect unreliable permutation feature importance scores that are the result of model extrapolation. To build on this, we demonstrate how the test can be complemented in order to allow its use in high dimensional settings. Through testing on simulated data and applications, our results show that the test can be used to reliably detect when permutation feature importance scores are unreliable.
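
For context, plain permutation feature importance (the quantity TRIP diagnoses) takes a few lines with scikit-learn; the statistical test itself is the paper's contribution and is not reproduced here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.05 * rng.normal(size=1000)   # strongly dependent feature
X = np.column_stack([x1, x2, rng.normal(size=1000)])
y = X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=1000)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=20, random_state=0)
# With x1 and x2 nearly collinear, permuting either one creates unrealistic
# observations, so the scores below can be misleading -- the extrapolation
# failure mode that TRIP is designed to detect.
print("importances:", result.importances_mean.round(3))
```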

[LG-51] Beyond the ATE: Interpretable Modelling of Treatment Effects over Dose and Time ICML2025

链接: https://arxiv.org/abs/2507.07271
作者: Julianna Piskorz,Krzysztof Kacprzyk,Mihaela van der Schaar
类目: Machine Learning (cs.LG)
*备注: Presented at the Actionable Interpretability Workshop at ICML 2025

点击查看摘要

Abstract:The Average Treatment Effect (ATE) is a foundational metric in causal inference, widely used to assess intervention efficacy in randomized controlled trials (RCTs). However, in many applications – particularly in healthcare – this static summary fails to capture the nuanced dynamics of treatment effects that vary with both dose and time. We propose a framework for modelling treatment effect trajectories as smooth surfaces over dose and time, enabling the extraction of clinically actionable insights such as onset time, peak effect, and duration of benefit. To ensure interpretability, robustness, and verifiability – key requirements in high-stakes domains – we adapt SemanticODE, a recent framework for interpretable trajectory modelling, to the causal setting where treatment effects are never directly observed. Our approach decouples the estimation of trajectory shape from the specification of clinically relevant properties (e.g., maxima, inflection points), supporting domain-informed priors, post-hoc editing, and transparent analysis. We show that our method yields accurate, interpretable, and editable models of treatment dynamics, facilitating both rigorous causal analysis and practical decision-making.

[LG-52] Robust Multimodal Learning Framework For Intake Gesture Detection Using Contactless Radar and Wearable IMU Sensors

链接: https://arxiv.org/abs/2507.07261
作者: Chunzhuo Wang,Hans Hallez,Bart Vanrumste
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This manuscript has been submitted to a peer-reviewed journal and is currently under review

点击查看摘要

Abstract:Automated food intake gesture detection plays a vital role in dietary monitoring, enabling objective and continuous tracking of eating behaviors to support better health outcomes. Wrist-worn inertial measurement units (IMUs) have been widely used for this task with promising results. More recently, contactless radar sensors have also shown potential. This study explores whether combining wearable and contactless sensing modalities through multimodal learning can further improve detection performance. We also address a major challenge in multimodal learning: reduced robustness when one modality is missing. To this end, we propose a robust multimodal temporal convolutional network with cross-modal attention (MM-TCN-CMA), designed to integrate IMU and radar data, enhance gesture detection, and maintain performance under missing modality conditions. A new dataset comprising 52 meal sessions (3,050 eating gestures and 797 drinking gestures) from 52 participants is developed and made publicly available. Experimental results show that the proposed framework improves the segmental F1-score by 4.3% and 5.2% over unimodal Radar and IMU models, respectively. Under missing modality scenarios, the framework still achieves gains of 1.3% and 2.4% for missing radar and missing IMU inputs. This is the first study to demonstrate a robust multimodal learning framework that effectively fuses IMU and radar data for food intake gesture detection.
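
The cross-modal attention at the heart of MM-TCN-CMA can be sketched with a single attention layer; the shapes and the residual fallback below are illustrative assumptions, not the paper's architecture:

```python
import torch

# Minimal cross-modal attention fusion: IMU features attend to radar features.
imu = torch.randn(8, 120, 64)     # (batch, time, channels) from the IMU branch
radar = torch.randn(8, 120, 64)   # time-aligned radar branch features

cross_attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
fused, _ = cross_attn(query=imu, key=radar, value=radar)
fused = fused + imu               # residual keeps the IMU stream intact

# If the radar modality is missing at test time, one graceful fallback is to
# attend to the IMU stream itself instead of failing outright:
fused_no_radar, _ = cross_attn(query=imu, key=imu, value=imu)
```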

[LG-53] Towards Robust Surrogate Models: Benchmarking Machine Learning Approaches to Expediting Phase Field Simulations of Brittle Fracture

链接: https://arxiv.org/abs/2507.07237
作者: Erfan Hamdi,Emma Lejeune
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 29 pages, 13 figures

点击查看摘要

Abstract:Data-driven approaches have the potential to make modeling complex, nonlinear physical phenomena significantly more computationally tractable. For example, computational modeling of fracture is a core challenge where machine learning techniques have the potential to provide a much-needed speedup that would enable progress in areas such as multi-scale modeling and uncertainty quantification. Currently, phase field modeling (PFM) of fracture is one such approach that offers a convenient variational formulation to model crack nucleation, branching and propagation. To date, machine learning techniques have shown promise in approximating PFM simulations. However, most studies rely on overly simple benchmarks that do not reflect the true complexity of the fracture processes where PFM excels as a method. To address this gap, we introduce a challenging dataset based on PFM simulations designed to benchmark and advance ML methods for fracture modeling. This dataset includes three energy decomposition methods, two boundary conditions, and 1,000 random initial crack configurations for a total of 6,000 simulations. Each sample contains 100 time steps capturing the temporal evolution of the crack field. Alongside this dataset, we also implement and evaluate Physics Informed Neural Networks (PINN), Fourier Neural Operators (FNO) and UNet models as baselines, and explore the impact of ensembling strategies on prediction accuracy. With this combination of our dataset and baseline models drawn from the literature, we aim to provide a standardized and challenging benchmark for evaluating machine learning approaches to solid mechanics. Our results highlight both the promise and limitations of popular current models, and demonstrate the utility of this dataset as a testbed for advancing machine learning in fracture mechanics research.

[LG-54] Efficient Parametric SVD of Koopman Operator for Stochastic Dynamical Systems NEURIPS2025

链接: https://arxiv.org/abs/2507.07222
作者: Minchan Jeong,J. Jon Ryu,Se-Young Yun,Gregory W. Wornell
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
*备注: 28 pages, 4 figures. Under review for NeurIPS 2025. The first two authors contributed equally

点击查看摘要

Abstract:The Koopman operator provides a principled framework for analyzing nonlinear dynamical systems through linear operator theory. Recent advances in dynamic mode decomposition (DMD) have shown that trajectory data can be used to identify dominant modes of a system in a data-driven manner. Building on this idea, deep learning methods such as VAMPnet and DPNet have been proposed to learn the leading singular subspaces of the Koopman operator. However, these methods require backpropagation through potentially numerically unstable operations on empirical second moment matrices, such as singular value decomposition and matrix inversion, during objective computation, which can introduce biased gradient estimates and hinder scalability to large systems. In this work, we propose a scalable and conceptually simple method for learning the top-k singular functions of the Koopman operator for stochastic dynamical systems based on the idea of low-rank approximation. Our approach eliminates the need for unstable linear algebraic operations and integrates easily into modern deep learning pipelines. Empirical results demonstrate that the learned singular subspaces are both reliable and effective for downstream tasks such as eigen-analysis and multi-step prediction.

[LG-55] Scale leads to compositional generalization

链接: https://arxiv.org/abs/2507.07207
作者: Florian Redhardt,Yassir Akram,Simon Schug
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Code available at this https URL

点击查看摘要

Abstract:Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even for the most capable models, there are still frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.
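
The linear-decodability claim is easy to operationalize as a probe: fit a linear classifier from hidden activations to the identity of each task constituent. The activations below are synthetic placeholders, not the paper's networks:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholders: hidden activations H (n_samples x width) and, for each sample,
# the index of the task module that generated it.
n, width, n_modules = 2000, 256, 8
rng = np.random.default_rng(0)
module_ids = rng.integers(0, n_modules, n)
module_dirs = rng.normal(size=(n_modules, width))        # one direction per module
H = rng.normal(size=(n, width)) + np.eye(n_modules)[module_ids] @ module_dirs

probe = LogisticRegression(max_iter=2000)
acc = cross_val_score(probe, H, module_ids, cv=5).mean()
print(f"linear decodability of task constituents: {acc:.2f}")
```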

[LG-56] DAF: An Efficient End-to-End Dynamic Activation Framework for on-Device DNN Training

链接: https://arxiv.org/abs/2507.07149
作者: Renyuan Liu,Yuyang Leng,Kaiyan Liu,Shaohan Hu,Chun-Fu(Richard)Chen,Peijun Zhao,Heechul Yun,Shuochao Yao
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Accepted to MobiSys 2025

点击查看摘要

Abstract:Recent advancements in on-device training for deep neural networks have underscored the critical need for efficient activation compression to overcome the memory constraints of mobile and edge devices. As activations dominate memory usage during training and are essential for gradient computation, compressing them without compromising accuracy remains a key research challenge. While existing methods for dynamic activation quantization promise theoretical memory savings, their practical deployment is impeded by system-level challenges such as computational overhead and memory fragmentation. To address these challenges, we introduce DAF, a Dynamic Activation Framework that enables scalable and efficient on-device training through system-level optimizations. DAF achieves both memory- and time-efficient dynamic quantization training by addressing key system bottlenecks. It develops hybrid reduction operations tailored to the memory hierarchies of mobile and edge SoCs, leverages collaborative CPU-GPU bit-packing for efficient dynamic quantization, and implements an importance-aware paging memory management scheme to reduce fragmentation and support dynamic memory adjustments. These optimizations collectively enable DAF to achieve substantial memory savings and speedup without compromising model training accuracy. Evaluations on various deep learning models across embedded and mobile platforms demonstrate up to a 22.9x reduction in memory usage and a 3.2x speedup, making DAF a scalable and practical solution for resource-constrained environments.

[LG-57] An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs

链接: https://arxiv.org/abs/2507.07146
作者: Zixuan Huang,Kecheng Huang,Lihao Yin,Bowei He,Huiling Zhen,Mingxuan Yuan,Zili Shao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have gained widespread popularity and are increasingly integrated into various applications. However, their capabilities can be exploited for both benign and harmful purposes. Despite rigorous training and fine-tuning for safety, LLMs remain vulnerable to jailbreak attacks. Recently, multi-turn attacks have emerged, exacerbating the issue. Unlike single-turn attacks, multi-turn attacks gradually escalate the dialogue, making them more difficult to detect and mitigate, even after they are identified. In this study, we propose G-Guard, an innovative attention-aware GNN-based input classifier designed to defend against multi-turn jailbreak attacks on LLMs. G-Guard constructs an entity graph for multi-turn queries, explicitly capturing relationships between harmful keywords and queries even when those keywords appear only in previous queries. Additionally, we introduce an attention-aware augmentation mechanism that retrieves the most similar single-turn query based on the multi-turn conversation. This retrieved query is treated as a labeled node in the graph, enhancing the ability of the GNN to classify whether the current query is harmful. Evaluation results demonstrate that G-Guard outperforms all baselines across all datasets and evaluation metrics.

[LG-58] CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs

链接: https://arxiv.org/abs/2507.07145
作者: Zhaojing Zhou,Xunchao Li,Minghao Li,Handi Zhang,Haoshuang Wang,Wenbin Chang,Yiqun Liu,Qingqing Dang,Dianhai Yu,Yanjun Ma,Haifeng Wang
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:The rapid scaling of Large Language Models (LLMs) elevates inference costs and compounds substantial deployment barriers. While quantization to 8 or 4 bits mitigates this, sub-3-bit methods face severe accuracy, scalability, and efficiency degradation. We propose Convolutional Code Quantization (CCQ), an inference-optimized quantization approach compressing LLMs to 2.0-2.75 bits with minimal accuracy loss. Departing from error-prone scalar quantization or slow vector quantization, CCQ integrates a hardware-aware bit-shift encoding and decoding solution with Convolutional Code, Hybrid Encoding, and Code Cluster, jointly overcoming accuracy-speed bottlenecks. We construct a lookup-free encoding space, enabling a linear mapping between the codebook and weight vectors, thereby optimizing inference performance. Meanwhile, by drawing on the concept of data mapping from vector quantization, we minimize the performance degradation of the model under extremely low-bit conditions. Experiments demonstrate that CCQ achieves outstanding performance on LLMs across various benchmarks. We compress DeepSeek-V3 (671B total parameters) to 184GB and ERNIE-4.5-300B-A47B to 89GB, enabling single-GPU deployment of ERNIE 4.5 and eliminating inter-card communication. The 2-bit ERNIE-4.5-300B-A47B model and inference engine have been open-sourced.

[LG-59] Understanding Malware Propagation Dynamics through Scientific Machine Learning

链接: https://arxiv.org/abs/2507.07143
作者: Karthik Pappu,Prathamesh Dinesh Joshi,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
类目: Machine Learning (cs.LG)
*备注: 17 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Accurately modeling malware propagation is essential for designing effective cybersecurity defenses, particularly against adaptive threats that evolve in real time. While traditional epidemiological models and recent neural approaches offer useful foundations, they often fail to fully capture the nonlinear feedback mechanisms present in real-world networks. In this work, we apply scientific machine learning to malware modeling by evaluating three approaches: classical Ordinary Differential Equations (ODEs), Universal Differential Equations (UDEs), and Neural ODEs. Using data from the Code Red worm outbreak, we show that the UDE approach reduces prediction error by 44% compared to both traditional and neural baselines, while preserving interpretability. We introduce a symbolic recovery method that transforms the learned neural feedback into explicit mathematical expressions, revealing suppression mechanisms such as network saturation, security response, and malware variant evolution. Our results demonstrate that hybrid physics-informed models can outperform both purely analytical and purely neural approaches, offering improved predictive accuracy and deeper insight into the dynamics of malware spread. These findings support the development of early warning systems, efficient outbreak response strategies, and targeted cyber defense interventions.
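
A UDE in this setting keeps a mechanistic epidemic core and lets a small network learn the residual suppression term. The sketch below uses the torchdiffeq package and a logistic SI core fit to synthetic data; the paper's actual equations and Code Red data are not reproduced here:

```python
import torch
from torchdiffeq import odeint

class MalwareUDE(torch.nn.Module):
    """di/dt = beta * i * (1 - i) - f_theta(i): a known logistic SI spread term
    for the infected fraction i, plus a learned suppression term f_theta
    standing in for saturation, patching, and variant effects."""
    def __init__(self):
        super().__init__()
        self.beta = torch.nn.Parameter(torch.tensor(2.0))
        self.f = torch.nn.Sequential(
            torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1)
        )

    def forward(self, t, i):
        return self.beta * i * (1 - i) - self.f(i) ** 2  # squared => suppression only

t = torch.linspace(0.0, 10.0, 60)
i0 = torch.tensor([1e-3])
observed = torch.sigmoid(1.5 * (t - 4.0)).unsqueeze(-1)  # placeholder outbreak curve

model = MalwareUDE()
opt = torch.optim.Adam(model.parameters(), lr=5e-3)
for _ in range(300):
    opt.zero_grad()
    pred = odeint(model, i0, t)      # differentiable ODE solve
    loss = ((pred - observed) ** 2).mean()
    loss.backward()
    opt.step()
print("final fit MSE:", float(loss))
```

After training, the learned f_theta is the candidate for symbolic recovery into an explicit suppression expression.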

[LG-60] Str-GCL: Structural Commonsense Driven Graph Contrastive Learning WWW2025

链接: https://arxiv.org/abs/2507.07141
作者: Dongxiao He,Yongqi Huang,Jitao Zhao,Xiaobao Wang,Zhen Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by WWW 2025

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) is a widely adopted approach in self-supervised graph representation learning, applying contrastive objectives to produce effective representations. However, current GCL methods primarily focus on capturing implicit semantic relationships, often overlooking the structural commonsense embedded within the graph’s structure and attributes, which contains underlying knowledge crucial for effective representation learning. Due to the lack of explicit information and clear guidance in general graphs, identifying and integrating such structural commonsense in GCL poses a significant challenge. To address this gap, we propose a novel framework called Structural Commonsense Unveiling in Graph Contrastive Learning (Str-GCL). Str-GCL leverages first-order logic rules to represent structural commonsense and explicitly integrates them into the GCL framework. It introduces topological and attribute-based rules without altering the original graph and employs a representation alignment mechanism to guide the encoder in effectively capturing this commonsense. To the best of our knowledge, this is the first attempt to directly incorporate structural commonsense into GCL. Extensive experiments demonstrate that Str-GCL outperforms existing GCL methods, providing a new perspective on leveraging structural commonsense in graph representation learning.

[LG-61] Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts

链接: https://arxiv.org/abs/2507.07140
作者: Samin Yeasar Arnob,Zhan Su,Minseon Kim,Oleksiy Ostapenko,Riyasat Ohib,Esra’a Saleh,Doina Precup,Lucas Caccia,Alessandro Sordoni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be rapidly adapted on the fly for specific downstream tasks, without requiring additional fine-tuning. Typically, LoRA serves as the foundational building block of such parameter-efficient modular architectures, leveraging low-rank weight structures to reduce the number of trainable parameters. In this paper, we study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature and surprisingly outperforms both LoRA and full fine-tuning in our setting. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks, thus scaling beyond what is usually studied in the literature. Our findings demonstrate that sparse adapters yield superior in-distribution performance post-merging compared to LoRA or full model merging. Achieving strong held-out performance remains a challenge for all methods considered.
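
A sparse adapter in the sense studied here can be emulated by masking gradients so that only a chosen subset of base weights trains; a minimal PyTorch sketch with a random mask (the authors' method for picking the subset is in the paper, not here):

```python
import torch

model = torch.nn.Sequential(          # stand-in for a pretrained base network
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)

density = 0.05  # train ~5% of the weights; selection here is random for brevity
for p in model.parameters():
    mask = (torch.rand_like(p) < density).float()
    # Zero the gradient outside the mask on every backward pass.
    p.register_hook(lambda g, m=mask: g * m)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()  # only the masked-in ~5% of weights move

# Because each adapter is a sparse delta over the same frozen base, adapters
# from different tasks can later be merged, e.g. by summing or averaging the
# deltas where their masks overlap.
```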

[LG-62] GNNs Meet Sequence Models Along the Shortest-Path: an Expressive Method for Link Prediction

链接: https://arxiv.org/abs/2507.07138
作者: Francesco Ferrini,Veronica Lachi,Antonio Longa,Bruno Lepri,Andrea Passerini
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) often struggle to capture the link-specific structural patterns crucial for accurate link prediction, as their node-centric message-passing schemes overlook the subgraph structures connecting a pair of nodes. Existing methods to inject such structural context either incur high computational cost or rely on simplistic heuristics (e.g., common neighbor counts) that fail to model multi-hop dependencies. We introduce SP4LP (Shortest Path for Link Prediction), a novel framework that combines GNN-based node encodings with sequence modeling over shortest paths. Specifically, SP4LP first applies a GNN to compute representations for all nodes, then extracts the shortest path between each candidate node pair and processes the resulting sequence of node embeddings using a sequence model. This design enables SP4LP to capture expressive multi-hop relational patterns with computational efficiency. Empirically, SP4LP achieves state-of-the-art performance across link prediction benchmarks. Theoretically, we prove that SP4LP is strictly more expressive than standard message-passing GNNs and several state-of-the-art structural features methods, establishing it as a general and principled approach for link prediction in graphs.
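
The SP4LP pipeline (node encodings, then a sequence model over the shortest path between the candidate endpoints) can be sketched as follows, with an embedding table standing in for the GNN encoder and an LSTM as one possible sequence model:

```python
import networkx as nx
import torch

G = nx.karate_club_graph()
emb = torch.nn.Embedding(G.number_of_nodes(), 32)   # stand-in for GNN encodings
lstm = torch.nn.LSTM(input_size=32, hidden_size=32, batch_first=True)
scorer = torch.nn.Linear(32, 1)

def link_score(u, v):
    # 1) extract the shortest path between the candidate endpoints,
    path = nx.shortest_path(G, u, v)
    # 2) look up the node representations along the path,
    seq = emb(torch.tensor(path)).unsqueeze(0)      # (1, path_len, 32)
    # 3) summarize the sequence and score the candidate link.
    _, (h, _) = lstm(seq)
    return scorer(h[-1]).squeeze()

print(float(link_score(0, 33)))
# Training would apply a BCE loss over positive edges and sampled negatives.
```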

[LG-63] Automating Evaluation of Diffusion Model Unlearning with (Vision-) Language Model World Knowledge

Link: https://arxiv.org/abs/2507.07137
Authors: Eric Yeats,Darryl Hannan,Henry Kvinge,Timothy Doster,Scott Mahan
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Machine unlearning (MU) is a promising cost-effective method to cleanse undesired information (generated concepts, biases, or patterns) from foundational diffusion models. While MU is orders of magnitude less costly than retraining a diffusion model without the undesired information, it can be challenging and labor-intensive to prove that the information has been fully removed from the model. Moreover, MU can damage diffusion model performance on surrounding concepts that one would like to retain, making it unclear if the diffusion model is still fit for deployment. We introduce autoeval-dmun, an automated tool which leverages (vision-) language models to thoroughly assess unlearning in diffusion models. Given a target concept, autoeval-dmun extracts structured, relevant world knowledge from the language model to identify nearby concepts which are likely damaged by unlearning and to circumvent unlearning with adversarial prompts. We use our automated tool to evaluate popular diffusion model unlearning methods, revealing that language models (1) impose semantic orderings of nearby concepts which correlate well with unlearning damage and (2) effectively circumvent unlearning with synthetic adversarial prompts.

[LG-64] FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval

Link: https://arxiv.org/abs/2507.07135
Authors: François Gardères,Shizhe Chen,Camille-Sovanneary Gauthier,Jean Ponce
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the expensive cost of manual annotation by specialists. To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. Then, we propose a new CIR model FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better account for fine-grained fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and the enhanced evaluation dataset enhFashionIQ, leveraging our pipeline to obtain higher-quality annotations. Experimental results show that the combination of FashionBLIP-2 and pretraining with FACap significantly improves the model’s performance in fashion CIR especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in a highly demanding environment such as e-commerce websites. Code is available at this https URL.

[LG-65] Ampere: Communication-Efficient and High-Accuracy Split Federated Learning

Link: https://arxiv.org/abs/2507.07130
Authors: Zihan Zhang,Leon Wong,Blesson Varghese
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:A Federated Learning (FL) system collaboratively trains neural networks across devices and a server but is limited by significant on-device computation costs. Split Federated Learning (SFL) systems mitigate this by offloading a block of layers of the network from the device to a server. However, in doing so, it introduces large communication overheads due to frequent exchanges of intermediate activations and gradients between devices and the server and reduces model accuracy for non-IID data. We propose Ampere, a novel collaborative training system that simultaneously minimizes on-device computation and device-server communication while improving model accuracy. Unlike SFL, which uses a global loss by iterative end-to-end training, Ampere develops unidirectional inter-block training to sequentially train the device and server block with a local loss, eliminating the transfer of gradients. A lightweight auxiliary network generation method decouples training between the device and server, reducing frequent intermediate exchanges to a single transfer, which significantly reduces the communication overhead. Ampere mitigates the impact of data heterogeneity by consolidating activations generated by the trained device block to train the server block, in contrast to SFL, which trains on device-specific, non-IID activations. Extensive experiments on multiple CNNs and transformers show that, compared to state-of-the-art SFL baseline systems, Ampere (i) improves model accuracy by up to 13.26% while reducing training time by up to 94.6%, (ii) reduces device-server communication overhead by up to 99.1% and on-device computation by up to 93.13%, and (iii) reduces the standard deviation of accuracy by 53.39% across various non-IID degrees, highlighting superior performance when faced with heterogeneous data.

[LG-66] Synergistic Localization and Sensing in MIMO-OFDM Systems via Mixed-Integer Bilevel Learning

Link: https://arxiv.org/abs/2507.07118
Authors: Zelin Zhu,Kai Yang,Rui Zhang
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Wireless localization and sensing technologies are essential in modern wireless networks, supporting applications in smart cities, the Internet of Things (IoT), and autonomous systems. High-performance localization and sensing systems are critical for both network efficiency and emerging intelligent applications. Integrating channel state information (CSI) with deep learning has recently emerged as a promising solution. Recent works have leveraged the spatial diversity of multiple input multiple output (MIMO) systems and the frequency granularity of orthogonal frequency division multiplexing (OFDM) waveforms to improve spatial resolution. Nevertheless, the joint modeling of localization and sensing under the high-dimensional CSI characteristics of MIMO-OFDM systems remains insufficiently investigated. This work aims to jointly model and optimize localization and sensing tasks to harness their potential synergy. We first formulate localization and sensing as a mixed-integer bilevel deep learning problem and then propose a novel stochastic proximal gradient-based mixed-integer bilevel optimization (SPG-MIBO) algorithm. SPG-MIBO is well-suited for high-dimensional and large-scale datasets, leveraging mini-batch training at each step for computational and memory efficiency. The algorithm is also supported by theoretical convergence guarantees. Extensive experiments on multiple datasets validate its effectiveness and highlight the performance gains from joint localization and sensing optimization.

[LG-67] Distributed Training under Packet Loss

Link: https://arxiv.org/abs/2507.07114
Authors: Erez Weintraub,Ron Banner,Ariel Orda
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:State-of-the-art language and vision models are routinely trained across thousands of GPUs, often spanning multiple data-centers, yet today’s distributed frameworks still assume reliable connections (e.g., InfiniBand or RoCE). The resulting acknowledgment traffic and retransmissions inflate tail latencies and limit scalability. Leveraging unreliable connections will reduce latency but may sacrifice model accuracy and convergence once packets are dropped. A principled, end-to-end solution that preserves accuracy and convergence guarantees under genuine packet loss has previously been missing. We address this critical gap by introducing a novel distributed training framework capable of operating over unreliable connections, offering unbiased gradient aggregation and bounded parameter drift without modifying model code or optimizers. The key insight is a two-stage defense against missing messages: (i) Unbiased gradient aggregation: each worker reconstructs a consistent gradient estimate from whatever packets arrive, guaranteeing expectation-level correctness; and (ii) Bounded-drift parameter broadcasts: we prove the inter-worker model discrepancy remains O(1) even after arbitrarily many iterations, preventing the unbounded divergence typical of asynchronous setups. Analytical bounds are matched by experiments on the LLAMA2 7B model with 64 GPUs: tolerating 10% random packet loss yields at most 0.8% perplexity change. This work bridges the gap between communication-efficient datacenter protocols and the accuracy and generalization guarantees demanded by modern large-model training, enabling robust, high-throughput learning on commodity or wide-area networks.
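
The expectation-level correctness of stage (i) can be illustrated with inverse-probability weighting: rescale whichever worker messages survive by their arrival probability. The paper's reconstruction is more elaborate, but the unbiasedness argument is the same; a NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, p_loss = 8, 4, 0.10
grads = rng.normal(size=(n_workers, dim))        # true per-worker gradients

# Each worker's message independently survives with probability 1 - p_loss.
received = rng.random(n_workers) > p_loss

# Rescale surviving messages by the arrival probability, so the estimator's
# expectation equals the full average despite random drops.
estimate = grads[received].sum(axis=0) / ((1 - p_loss) * n_workers)
print("true mean:        ", grads.mean(axis=0))
print("unbiased estimate:", estimate)
```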

[LG-68] A statistical physics framework for optimal learning

Link: https://arxiv.org/abs/2507.07907
Authors: Francesca Mignacco,Francesco Mori
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 35 pages, 13 figures

Click to view abstract

Abstract:Learning is a complex dynamical process shaped by a range of interconnected decisions. Careful design of hyperparameter schedules for artificial neural networks or efficient allocation of cognitive resources by biological learners can dramatically affect performance. Yet, theoretical understanding of optimal learning strategies remains sparse, especially due to the intricate interplay between evolving meta-parameters and nonlinear learning dynamics. The search for optimal protocols is further hindered by the high dimensionality of the learning space, often resulting in predominantly heuristic, difficult to interpret, and computationally demanding solutions. Here, we combine statistical physics with control theory in a unified theoretical framework to identify optimal protocols in prototypical neural network models. In the high-dimensional limit, we derive closed-form ordinary differential equations that track online stochastic gradient descent through low-dimensional order parameters. We formulate the design of learning protocols as an optimal control problem directly on the dynamics of the order parameters with the goal of minimizing the generalization error at the end of training. This framework encompasses a variety of learning scenarios, optimization constraints, and control budgets. We apply it to representative cases, including optimal curricula, adaptive dropout regularization and noise schedules in denoising autoencoders. We find nontrivial yet interpretable strategies highlighting how optimal protocols mediate crucial learning tradeoffs, such as maximizing alignment with informative input directions while minimizing noise fitting. Finally, we show how to apply our framework to real datasets. Our results establish a principled foundation for understanding and designing optimal learning protocols and suggest a path toward a theory of meta-learning grounded in statistical physics.

[LG-69] Approximation Depth of Convex Polytopes

Link: https://arxiv.org/abs/2507.07779
Authors: Egor Bakaev,Florestan Brunck,Amir Yehudayoff
Subjects: Metric Geometry (math.MG); Computational Geometry (cs.CG); Machine Learning (cs.LG); Combinatorics (math.CO)
Comments:

Click to view abstract

Abstract:We study approximations of polytopes in the standard model for computing polytopes using Minkowski sums and (convex hulls of) unions. Specifically, we study the ability to approximate a target polytope by polytopes of a given depth. Our main results imply that simplices can only be "trivially approximated". Along the way, we obtain a characterization of simplices as the only "outer additive" convex bodies.

[LG-70] A Unified Empirical Risk Minimization Framework for Flexible N-Tuples Weak Supervision

Link: https://arxiv.org/abs/2507.07771
Authors: Shuying Huang,Junpeng Li,Changchun Hua,Yana Yang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:To alleviate the annotation burden in supervised learning, N-tuples learning has recently emerged as a powerful weakly-supervised method. While existing N-tuples learning approaches extend pairwise learning to higher-order comparisons and accommodate various real-world scenarios, they often rely on task-specific designs and lack a unified theoretical foundation. In this paper, we propose a general N-tuples learning framework based on empirical risk minimization, which systematically integrates pointwise unlabeled data to enhance learning performance. This paper first unifies the data generation processes of N-tuples and pointwise unlabeled data under a shared probabilistic formulation. Based on this unified view, we derive an unbiased empirical risk estimator that generalizes a broad class of existing N-tuples models. We further establish a generalization error bound for theoretical support. To demonstrate the flexibility of the framework, we instantiate it in four representative weakly supervised scenarios, each recoverable as a special case of our general model. Additionally, to address overfitting issues arising from negative risk terms, we adopt correction functions to adjust the empirical risk. Extensive experiments on benchmark datasets validate the effectiveness of the proposed framework and demonstrate that leveraging pointwise unlabeled data consistently improves generalization across various N-tuples learning tasks.

[LG-71] Machine Learning-Assisted Surrogate Modeling with Multi-Objective Optimization and Decision-Making of a Steam Methane Reforming Reactor

Link: https://arxiv.org/abs/2507.07641
Authors: Seyed Reza Nabavi,Zonglin Guo,Zhiyuan Wang
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This study presents an integrated modeling and optimization framework for a steam methane reforming (SMR) reactor, combining a mathematical model, artificial neural network (ANN)-based hybrid modeling, advanced multi-objective optimization (MOO) and multi-criteria decision-making (MCDM) techniques. A one-dimensional fixed-bed reactor model accounting for internal mass transfer resistance was employed to simulate reactor performance. To reduce the high computational cost of the mathematical model, a hybrid ANN surrogate was constructed, achieving a 93.8% reduction in average simulation time while maintaining high predictive accuracy. The hybrid model was then embedded into three MOO scenarios using the non-dominated sorting genetic algorithm II (NSGA-II) solver: 1) maximizing methane conversion and hydrogen output; 2) maximizing hydrogen output while minimizing carbon dioxide emissions; and 3) a combined three-objective case. The optimal trade-off solutions were further ranked and selected using two MCDM methods: technique for order of preference by similarity to ideal solution (TOPSIS) and simplified preference ranking on the basis of ideal-average distance (sPROBID). Optimal results include a methane conversion of 0.863 with 4.556 mol/s hydrogen output in the first case, and 0.988 methane conversion with 3.335 mol/s hydrogen and 0.781 mol/s carbon dioxide in the third. This comprehensive methodology offers a scalable and effective strategy for optimizing complex catalytic reactor systems with multiple, often conflicting, objectives.
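
The TOPSIS ranking step is easy to make concrete; below is a small NumPy sketch with a hypothetical three-solution Pareto front (the first two rows echo numbers from the abstract, the third is invented for illustration, and the equal weights are an assumption):

```python
import numpy as np

def topsis(F, weights, maximize):
    """Rank Pareto solutions by relative closeness to the ideal point.
    F: (n_solutions, n_objectives) matrix of objective values."""
    Z = F / np.linalg.norm(F, axis=0)                  # vector-normalize columns
    V = Z * weights
    ideal = np.where(maximize, V.max(axis=0), V.min(axis=0))
    anti  = np.where(maximize, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)                     # higher = better

# Hypothetical front: (CH4 conversion, H2 output mol/s, CO2 output mol/s).
F = np.array([[0.863, 4.556, 0.900],
              [0.988, 3.335, 0.781],
              [0.920, 3.900, 0.850]])
score = topsis(F, weights=np.array([1/3, 1/3, 1/3]),
               maximize=np.array([True, True, False]))
print("ranked solution indices:", np.argsort(-score))
```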

[LG-72] Concentration of measure for non-linear random matrices with applications to neural networks and non-commutative polynomials

Link: https://arxiv.org/abs/2507.07625
Authors: Radosław Adamczak
Subjects: Probability (math.PR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We prove concentration inequalities for several models of non-linear random matrices. As corollaries we obtain estimates for linear spectral statistics of the conjugate kernel of neural networks and non-commutative polynomials in (possibly dependent) random matrices.

[LG-73] Galerkin-ARIMA: A Two-Stage Polynomial Regression Framework for Fast Rolling One-Step-Ahead Forecasting

Link: https://arxiv.org/abs/2507.07469
Authors: Haojie Liu,Zihan Lin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
Comments:

Click to view abstract

Abstract:Time-series models like ARIMA remain widely used for forecasting but are limited by their linear assumptions and high computational cost on large and complex datasets. We propose Galerkin-ARIMA, which generalizes the AR component of ARIMA by replacing it with a flexible spline-based function estimated via Galerkin projection. This enables the model to capture nonlinear dependencies in lagged values while retaining the MA component and the Gaussian noise assumption. We derive a closed-form OLS estimator for the Galerkin coefficients and show that the model is asymptotically unbiased and consistent under standard conditions. Our method bridges classical time-series modeling and nonparametric regression, offering improved forecasting performance and computational efficiency.
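
A rough sketch of the Galerkin-style AR step: expand the lagged values in a basis and solve for the coefficients in closed form by OLS. A plain polynomial basis stands in for the paper's spline basis, and the lag order and degree are illustrative:

```python
import numpy as np

def galerkin_ar_fit(y, p=2, degree=3):
    """Fit a nonlinear AR(p) map as a basis expansion of the lags,
    with closed-form OLS coefficients (sketch of the Galerkin-ARIMA AR step)."""
    T = len(y)
    lags = np.column_stack([y[p - k - 1:T - k - 1] for k in range(p)])
    Phi = np.hstack([lags ** d for d in range(1, degree + 1)])  # basis features
    Phi = np.hstack([np.ones((Phi.shape[0], 1)), Phi])          # intercept
    beta, *_ = np.linalg.lstsq(Phi, y[p:], rcond=None)
    return beta

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=500)) * 0.1       # toy series
beta = galerkin_ar_fit(y)
print(beta.shape)                               # (1 + p * degree,) coefficients
```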

[LG-74] Hess-MC2: Sequential Monte Carlo Squared using Hessian Information and Second Order Proposals

Link: https://arxiv.org/abs/2507.07461
Authors: Joshua Murphy,Conor Rosato,Andrew Millard,Lee Devlin,Paul Horridge,Simon Maskell
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted to IEEE Machine Learning Signal Processing conference 2025

Click to view abstract

Abstract:When performing Bayesian inference using Sequential Monte Carlo (SMC) methods, two considerations arise: the accuracy of the posterior approximation and computational efficiency. To address computational demands, Sequential Monte Carlo Squared (SMC^2) is well-suited for high-performance computing (HPC) environments. The design of the proposal distribution within SMC^2 can improve accuracy and exploration of the posterior as poor proposals may lead to high variance in importance weights and particle degeneracy. The Metropolis-Adjusted Langevin Algorithm (MALA) uses gradient information so that particles preferentially explore regions of higher probability. In this paper, we extend this idea by incorporating second-order information, specifically the Hessian of the log-target. While second-order proposals have been explored previously in particle Markov Chain Monte Carlo (p-MCMC) methods, we are the first to introduce them within the SMC^2 framework. Second-order proposals not only use the gradient (first-order derivative), but also the curvature (second-order derivative) of the target distribution. Experimental results on synthetic models highlight the benefits of our approach in terms of step-size selection and posterior approximation accuracy when compared to other proposals.
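
The core of a second-order proposal can be sketched as a Newton-Langevin step that preconditions both the drift and the noise with the inverse negative Hessian of the log-target. This is an illustration of the idea on a toy density, not the paper's exact kernel:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_grad(x):   # gradient of a toy anisotropic Gaussian log-density
    return np.array([-x[0] / 4.0, -x[1]])

def log_hess(x):   # Hessian of the same log-density (constant here)
    return np.diag([-1 / 4.0, -1.0])

def second_order_proposal(x, eps=0.5):
    """One Hessian-preconditioned Langevin proposal:
    x' ~ N(x + (eps^2/2) H^{-1} grad, eps^2 H^{-1}), H = -Hessian."""
    H_inv = np.linalg.inv(-log_hess(x))          # positive definite here
    mean = x + 0.5 * eps ** 2 * H_inv @ log_grad(x)
    cov = eps ** 2 * H_inv
    return rng.multivariate_normal(mean, cov)

print(second_order_proposal(np.array([3.0, -2.0])))
```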

[LG-75] Probabilistic Approximate Optimization: A New Variational Monte Carlo Algorithm

Link: https://arxiv.org/abs/2507.07420
Authors: Abdelrahman S. Abdelrahman,Shuvro Chowdhury,Flaviano Morone,Kerem Y. Camsari
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Click to view abstract

Abstract:We introduce a generalized Probabilistic Approximate Optimization Algorithm (PAOA), a classical variational Monte Carlo framework that extends and formalizes prior work by Weitz et al. [Combes 2023], enabling parameterized and fast sampling on present-day Ising machines and probabilistic computers. PAOA operates by iteratively modifying the couplings of a network of binary stochastic units, guided by cost evaluations from independent samples. We establish a direct correspondence between derivative-free updates and the gradient of the full 2^N × 2^N Markov flow, showing that PAOA admits a principled variational formulation. Simulated annealing emerges as a limiting case under constrained parameterizations, and we implement this regime on an FPGA-based probabilistic computer with on-chip annealing to solve large 3D spin-glass problems. Benchmarking PAOA against QAOA on the canonical 26-spin Sherrington-Kirkpatrick model with matched parameters reveals superior performance for PAOA. We show that PAOA naturally extends simulated annealing by optimizing multiple temperature profiles, leading to improved performance over SA on heavy-tailed problems such as SK-Lévy.

[LG-76] Platform for Representation and Integration of multimodal Molecular Embeddings

Link: https://arxiv.org/abs/2507.07367
Authors: Erika Yilin Zheng,Yu Yan,Baradwaj Simha Sankar,Ethan Ji,Steven Swee,Irsyad Adam,Ding Wang,Alexander Russell Pelletier,Alex Bui,Wei Wang,Peipei Ping
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Existing machine learning methods for molecular (e.g., gene) embeddings are restricted to specific tasks or data modalities, limiting their effectiveness within narrow domains. As a result, they fail to capture the full breadth of gene functions and interactions across diverse biological contexts. In this study, we have systematically evaluated knowledge representations of biomolecules across multiple dimensions in a task-agnostic manner, spanning three major data sources: omics experimental data, literature-derived text data, and knowledge graph-based representations. To distinguish meaningful biological signals from chance correlations, we devised an adjusted variant of Singular Vector Canonical Correlation Analysis (SVCCA) that quantifies signal redundancy and complementarity across different data modalities and sources. These analyses reveal that existing embeddings capture largely non-overlapping molecular signals, highlighting the value of embedding integration. Building on this insight, we propose the Platform for Representation and Integration of multimodal Molecular Embeddings (PRISME), a machine learning based workflow using an autoencoder to integrate these heterogeneous embeddings into a unified multimodal representation. We validated this approach across various benchmark tasks, where PRISME demonstrated consistent performance, and outperformed individual embedding methods in missing value imputation. This new framework supports comprehensive modeling of biomolecules, advancing the development of robust, broadly applicable multimodal embeddings optimized for downstream biomedical machine learning applications.
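
A minimal sketch of the integration step, assuming three modality embeddings of arbitrary dimensions and a plain autoencoder; PRISME's actual architecture and training details may differ:

```python
import torch, torch.nn as nn

class EmbeddingIntegrator(nn.Module):
    """Autoencoder fusing per-modality molecular embeddings into one vector."""
    def __init__(self, dims=(128, 256, 64), latent=96):
        super().__init__()
        total = sum(dims)
        self.encoder = nn.Sequential(nn.Linear(total, 256), nn.ReLU(),
                                     nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                     nn.Linear(256, total))

    def forward(self, embs):
        x = torch.cat(embs, dim=-1)            # concatenate modality embeddings
        z = self.encoder(x)                    # unified multimodal representation
        x_hat = self.decoder(z)
        return z, nn.functional.mse_loss(x_hat, x)

omics, text, kg = torch.randn(10, 128), torch.randn(10, 256), torch.randn(10, 64)
model = EmbeddingIntegrator()
z, recon_loss = model([omics, text, kg])       # minimize recon_loss to train
print(z.shape, recon_loss.item())
```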

[LG-77] Way More Than the Sum of Their Parts: From Statistical to Structural Mixtures

Link: https://arxiv.org/abs/2507.07343
Authors: James P. Crutchfield
Subjects: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Dynamical Systems (math.DS); Statistics Theory (math.ST); Chaotic Dynamics (nlin.CD)
Comments: 22 pages, 16 figures; this http URL

Click to view abstract

Abstract:We show that mixtures comprised of multicomponent systems typically are much more structurally complex than the sum of their parts; sometimes, infinitely more complex. We contrast this with the more familiar notion of statistical mixtures, demonstrating how statistical mixtures miss key aspects of emergent hierarchical organization. This leads us to identify a new kind of structural complexity inherent in multicomponent systems and to draw out broad consequences for system ergodicity.

[LG-78] Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset

Link: https://arxiv.org/abs/2507.07339
Authors: Yingtao Luo,Reza Skandari,Carlos Martinez,Arman Kilic,Rema Padman
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments: To appear in the Proceedings of AMIA Annual Symposium 2025

Click to view abstract

Abstract:Decisions about managing patients on the heart transplant waitlist are currently made by committees of doctors who consider multiple factors, but the process remains largely ad-hoc. With the growing volume of longitudinal patient, donor, and organ data collected by the United Network for Organ Sharing (UNOS) since 2018, there is increasing interest in analytical approaches to support clinical decision-making at the time of organ availability. In this study, we benchmark machine learning models that leverage longitudinal waitlist history data for time-dependent, time-to-event modeling of waitlist mortality. We train on 23,807 patient records with 77 variables and evaluate both survival prediction and discrimination at a 1-year horizon. Our best model achieves a C-Index of 0.94 and AUROC of 0.89, significantly outperforming previous models. Key predictors align with known risk factors while also revealing novel associations. Our findings can support urgency assessment and policy refinement in heart transplant decision making.

[LG-79] Bayesian Double Descent

Link: https://arxiv.org/abs/2507.07338
Authors: Nick Polson,Vadim Sokolov
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments:

Click to view abstract

Abstract:Double descent is a phenomenon of over-parameterized statistical models. Our goal is to view double descent from a Bayesian perspective. Over-parameterized models such as deep neural networks have an interesting re-descending property in their risk characteristics. This is a recent phenomenon in machine learning and has been the subject of many studies. As the complexity of the model increases, there is a U-shaped region corresponding to the traditional bias-variance trade-off, but then as the number of parameters equals the number of observations and the model becomes one of interpolation, the risk can become infinite and then, in the over-parameterized region, it re-descends – the double descent effect. We show that this has a natural Bayesian interpretation. Moreover, we show that it is not in conflict with the traditional Occam’s razor that Bayesian models possess, in that they tend to prefer simpler models when possible. We illustrate the approach with an example of Bayesian model selection in neural networks. Finally, we conclude with directions for future research.
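
The re-descending risk curve is easy to reproduce with min-norm least squares on random ReLU features, a standard toy model of double descent (not the paper's Bayesian construction); a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 100, 30, 2000
w_true = rng.normal(size=d)
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y  = X @ w_true + 0.5 * rng.normal(size=n)
yt = Xt @ w_true

for p in [20, 50, 90, 100, 110, 200, 1000]:          # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    F, Ft = np.maximum(X @ W, 0), np.maximum(Xt @ W, 0)   # ReLU features
    beta = np.linalg.pinv(F) @ y                     # min-norm least squares
    err = np.mean((Ft @ beta - yt) ** 2)
    print(f"p={p:5d}  test MSE={err:9.3f}")          # typically peaks near p = n,
                                                     # then re-descends
```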

[LG-80] Time Series Foundation Models for Multivariate Financial Time Series Forecasting

Link: https://arxiv.org/abs/2507.07296
Authors: Ben A. Marconi
Subjects: General Finance (q-fin.GN); Machine Learning (cs.LG)
Comments: 66 pages

Click to view abstract

Abstract:Financial time series forecasting presents significant challenges due to complex nonlinear relationships, temporal dependencies, variable interdependencies and limited data availability, particularly for tasks involving low-frequency data, newly listed instruments, or emerging market assets. Time Series Foundation Models (TSFMs) offer a promising solution through pretraining on diverse time series corpora followed by task-specific adaptation. This study evaluates two TSFMs (Tiny Time Mixers (TTM) and Chronos) across three financial forecasting tasks: US 10-year Treasury yield changes, EUR/USD volatility, and equity spread prediction. Results demonstrate that TTM exhibits strong transferability. When fine-tuning both the pretrained version of TTM and an untrained model with the same architecture, the pretrained version achieved 25-50% better performance when fine-tuned on limited data and 15-30% improvements even when fine-tuned on lengthier datasets. Notably, TTM’s zero-shot performance outperformed naive benchmarks in volatility forecasting and equity spread prediction, with the latter demonstrating that TSFMs can surpass traditional benchmark models without fine-tuning. The pretrained model consistently required 3-10 fewer years of data to achieve comparable performance levels compared to the untrained model, demonstrating significant sample-efficiency gains. However, while TTM outperformed naive baselines, traditional specialised models matched or exceeded its performance in two of three tasks, suggesting TSFMs prioritise breadth over task-specific optimisation. These findings indicate that TSFMs, though still nascent, offer substantial promise for financial forecasting-particularly in noisy, data-constrained tasks-but achieving competitive performance likely requires domain-specific pretraining and architectural refinements tailored to financial time series characteristics.

[LG-81] Thermodynamic Prediction Enabled by Automatic Dataset Building and Machine Learning

Link: https://arxiv.org/abs/2507.07293
Authors: Juejing Liu,Haydn Anderson,Noah I. Waxman,Vsevolod Kovalev,Byron Fisher,Elizabeth Li,Xiaofeng Guo
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:New discoveries in chemistry and materials science, with increasingly expanding volume of requisite knowledge and experimental workload, provide unique opportunities for machine learning (ML) to take critical roles in accelerating research efficiency. Here, we demonstrate (1) the use of large language models (LLMs) for automated literature reviews, and (2) the training of an ML model to predict chemical knowledge (thermodynamic parameters). Our LLM-based literature review tool (LMExt) successfully extracted chemical information and beyond into a machine-readable structure, including stability constants for metal cation-ligand interactions, thermodynamic properties, and other broader data types (medical research papers, and financial reports), effectively overcoming the challenges inherent in each domain. Using the autonomous acquisition of thermodynamic data, an ML model was trained using the CatBoost algorithm for accurately predicting thermodynamic parameters (e.g., enthalpy of formation) of minerals. This work highlights the transformative potential of integrated ML approaches to reshape chemistry and materials science research.
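
A minimal CatBoost sketch of the prediction step; the synthetic feature table below is a stand-in (the paper trains on its LLM-extracted thermodynamic dataset), and the hyperparameters are illustrative. Requires the catboost and scikit-learn packages:

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

# Stand-in feature table for minerals (composition descriptors, etc.)
# and a synthetic target playing the role of enthalpy of formation.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = X @ rng.normal(size=12) + 0.1 * rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = CatBoostRegressor(iterations=500, depth=6, learning_rate=0.05, verbose=0)
model.fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))
```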

[LG-82] Almost Sure Convergence for the Last Iterate of Stochastic Gradient Descent Schemes

Link: https://arxiv.org/abs/2507.07281
Authors: Marcel Hudiani
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study the almost sure convergence rate for the last iterate of stochastic gradient descent (SGD) and stochastic heavy ball (SHB) in the parametric setting when the objective function F is globally convex or non-convex with \gamma-Hölder gradient. Using only the discrete Gronwall inequality, without the Robbins-Siegmund theorem or martingale convergence theory, we recover results for both SGD and SHB: \min_{s \leq t} \|\nabla F(w_s)\|^2 = o(t^{p-1}) for non-convex objectives, F(w_t) - F_* = o(t^{2\gamma/(1+\gamma) \cdot \max(p-1,-2p+1) - \epsilon}) for \beta \in (0, 1), and \min_{s \leq t} F(w_s) - F_* = o(t^{p-1}) almost surely for convex objectives. In addition, we prove that SHB with constant momentum parameter \beta \in (0, 1) attains a convergence rate of F(w_t) - F_* = O(t^{\max(p-1,-2p+1)} \log^2(t/\delta)) with probability at least 1-\delta when F is convex, \gamma = 1, and the step size is \alpha_t = \Theta(t^{-p}) with p \in (1/2, 1).
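
For reference, a standard form of the stochastic heavy ball recursion analyzed here, with the polynomial step size from the statements above (the paper's exact normalization may differ); \hat{g}_t denotes a stochastic gradient of F at w_t:

```latex
\begin{aligned}
  w_{t+1} &= w_t - \alpha_t\, \hat{g}_t + \beta\,(w_t - w_{t-1}),
  \qquad \beta \in (0, 1),\\
  \alpha_t &= \Theta(t^{-p}), \qquad p \in \left(\tfrac{1}{2},\, 1\right).
\end{aligned}
```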

[LG-83] Large-scale portfolio optimization with variational neural annealing

Link: https://arxiv.org/abs/2507.07159
Authors: Nishan Ranabhat,Behnam Javanparast,David Goerz,Estelle Inack
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
Comments: 16 pages, 13 figures, 1 table

Click to view abstract

Abstract:Portfolio optimization is a routine asset management operation conducted in financial institutions around the world. However, under real-world constraints such as turnover limits and transaction costs, its formulation becomes a mixed-integer nonlinear program that current mixed-integer optimizers often struggle to solve. We propose mapping this problem onto a classical Ising-like Hamiltonian and solving it with Variational Neural Annealing (VNA), via its classical formulation implemented using autoregressive neural networks. We demonstrate that VNA can identify near-optimal solutions for portfolios comprising more than 2,000 assets and yields performance comparable to that of state-of-the-art optimizers, such as Mosek, while exhibiting faster convergence on hard instances. Finally, we present a dynamical finite-size scaling analysis applied to the S&P 500, Russell 1000, and Russell 3000 indices, revealing universal behavior and polynomial annealing time scaling of the VNA algorithm on portfolio optimization problems.
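
The mapping can be illustrated with a toy Ising-like portfolio Hamiltonian and a simulated-annealing baseline over spin flips; VNA replaces this sampler with an autoregressive neural network, and the Hamiltonian below omits the turnover and transaction-cost terms the paper encodes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20                                    # assets (toy size)
mu = rng.normal(0.05, 0.02, size=n)       # expected returns
A = rng.normal(size=(n, n))
Sigma = A @ A.T / n                       # toy covariance matrix
lam = 2.0                                 # risk aversion

def energy(s):
    """Ising-like Hamiltonian over binary holdings s in {0,1}^n:
    H(s) = lam * s' Sigma s - mu' s (risk minus return)."""
    return lam * s @ Sigma @ s - mu @ s

s = rng.integers(0, 2, size=n)            # random initial portfolio
T = 1.0
for step in range(5000):
    i = rng.integers(n)
    s_new = s.copy(); s_new[i] ^= 1       # flip one holding
    dE = energy(s_new) - energy(s)
    if dE < 0 or rng.random() < np.exp(-dE / T):
        s = s_new                         # Metropolis acceptance
    T *= 0.999                            # geometric annealing schedule
print("selected assets:", np.flatnonzero(s), " energy:", energy(s))
```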

[LG-84] Topological Machine Learning with Unreduced Persistence Diagrams

Link: https://arxiv.org/abs/2507.07156
Authors: Nicole Abreu,Parker B. Edwards,Francis Motta
Subjects: Machine Learning (stat.ML); Computational Geometry (cs.CG); Machine Learning (cs.LG); Algebraic Topology (math.AT)
Comments: 10 figures, 2 tables, 8 pages (without appendix and references)

Click to view abstract

Abstract:Supervised machine learning pipelines trained on features derived from persistent homology have been experimentally observed to ignore much of the information contained in a persistence diagram. Computing persistence diagrams is often the most computationally demanding step in such a pipeline, however. To explore this, we introduce several methods to generate topological feature vectors from unreduced boundary matrices. We compared the performance of pipelines trained on vectorizations of unreduced persistence diagrams (PDs) to vectorizations of fully-reduced PDs across several data and task types. Our results indicate that models trained on features built from unreduced diagrams can perform on par with, and even outperform, those trained on fully-reduced diagrams on some tasks. This observation suggests that machine learning pipelines which incorporate topology-based features may benefit in terms of computational cost and performance by utilizing the information contained in unreduced boundary matrices.

[LG-85] Class conditional conformal prediction for multiple inputs by p-value aggregation

Link: https://arxiv.org/abs/2507.07150
Authors: Jean-Baptiste Fermanian (IMAG, IROKO),Mohamed Hebiri (LAMA),Joseph Salmon (IMAG, IROKO)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Conformal prediction methods are statistical tools designed to quantify uncertainty and generate predictive sets with guaranteed coverage probabilities. This work introduces an innovative refinement to these methods for classification tasks, specifically tailored for scenarios where multiple observations (multi-inputs) of a single instance are available at prediction time. Our approach is particularly motivated by applications in citizen science, where multiple images of the same plant or animal are captured by individuals. Our method integrates the information from each observation into conformal prediction, enabling a reduction in the size of the predicted label set while preserving the required class-conditional coverage guarantee. The approach is based on the aggregation of conformal p-values computed from each observation of a multi-input. By exploiting the exact distribution of these p-values, we propose a general aggregation framework using an abstract scoring function, encompassing many classical statistical tools. Knowledge of this distribution also enables refined versions of standard strategies, such as majority voting. We evaluate our method on simulated and real data, with a particular focus on Pl@ntNet, a prominent citizen science platform that facilitates the collection and identification of plant species through user-submitted images.
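
A small sketch of the ingredients: class-conditional conformal p-values computed from calibration scores, then combined across the multiple observations of one instance. Twice-the-mean is shown as a simple merging rule that is valid under arbitrary dependence; the paper derives sharper rules from the exact joint distribution:

```python
import numpy as np

def conformal_p_value(cal_scores, test_score):
    """Standard conformal p-value from same-class calibration scores:
    valid in the sense P(p <= u) <= u under exchangeability."""
    n = len(cal_scores)
    return (1 + np.sum(cal_scores >= test_score)) / (n + 1)

def aggregate_twice_mean(p_values):
    """Combine p-values from multiple observations of one instance;
    2 * mean is a classic valid merging rule for dependent p-values."""
    return min(1.0, 2.0 * float(np.mean(p_values)))

cal = np.random.default_rng(4).uniform(size=200)   # class-conditional scores
obs_scores = [0.91, 0.85, 0.97]                    # e.g., 3 photos of one plant
p_combined = aggregate_twice_mean(
    [conformal_p_value(cal, s) for s in obs_scores])
print("keep class in prediction set at alpha=0.1:", p_combined > 0.1)
```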

Information Retrieval

[IR-0] Measuring Hypothesis Testing Errors in the Evaluation of Retrieval Systems

Link: https://arxiv.org/abs/2507.07924
Authors: Jack McKechnie,Graham McDonald,Craig Macdonald
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:The evaluation of Information Retrieval (IR) systems typically uses query-document pairs with corresponding human-labelled relevance assessments (qrels). These qrels are used to determine if one system is better than another based on average retrieval performance. Acquiring large volumes of human relevance assessments is expensive. Therefore, more efficient relevance assessment approaches have been proposed, necessitating comparisons between qrels to ascertain their efficacy. Discriminative power, i.e. the ability to correctly identify significant differences between systems, is important for drawing accurate conclusions on the robustness of qrels. Previous work has measured the proportion of pairs of systems that are identified as significantly different and has quantified Type I statistical errors. Type I errors lead to incorrect conclusions due to false positive significance tests. We argue that also identifying Type II errors (false negatives) is important as they lead science in the wrong direction. We quantify Type II errors and propose that balanced classification metrics, such as balanced accuracy, can be used to portray the discriminative power of qrels. We perform experiments using qrels generated using alternative relevance assessment methods to investigate measuring hypothesis testing errors in IR evaluation. We find that additional insights into the discriminative power of qrels can be gained by quantifying Type II errors, and that balanced classification metrics can be used to give an overall summary of discriminative power in one, easily comparable, number.
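
Quantifying both error types and summarizing them with a balanced metric is straightforward; a sketch with hypothetical significance verdicts (requires scikit-learn):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Hypothetical audit over pairs of IR systems: y_true says whether a
# significant difference exists under the reference qrels; y_pred is the
# verdict from the cheaper alternative qrels being evaluated.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0, 0, 0])

type_i  = np.sum((y_pred == 1) & (y_true == 0))   # false positives
type_ii = np.sum((y_pred == 0) & (y_true == 1))   # false negatives (missed diffs)
print("Type I errors:", type_i, " Type II errors:", type_ii)
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```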

[IR-1] Document Similarity Enhanced IPS Estimation for Unbiased Learning to Rank

Link: https://arxiv.org/abs/2507.07909
Authors: Zeyan Liang,Graham McDonald,Iadh Ounis
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Learning to Rank (LTR) models learn from historical user interactions, such as user clicks. However, there is an inherent bias in the clicks of users due to position bias, i.e., users are more likely to click highly-ranked documents than low-ranked documents. To address this bias when training LTR models, many approaches from the literature re-weight the users’ click data using Inverse Propensity Scoring (IPS). IPS re-weights a user’s clicks according to the position in the historical ranking at which a document was placed when it was clicked, since low-ranked documents are less likely to be seen by a user. In this paper, we argue that low-ranked documents that are similar to highly-ranked relevant documents are also likely to be relevant. Moreover, accounting for the similarity of low-ranked documents to highly-ranked relevant documents when calculating IPS can more effectively mitigate the effects of position bias. Therefore, we propose an extension to IPS, called IPSsim, that takes into consideration the similarity of documents when estimating IPS. We evaluate our IPSsim estimator using two large publicly available LTR datasets under a number of simulated user click settings and with different numbers of training clicks. Our experiments show that our IPSsim estimator is more effective than the existing IPS estimators for learning an unbiased LTR model, particularly in top-n settings when n ≥ 30. For example, when n = 50, our IPSsim estimator achieves a statistically significant ~3% improvement (p < 0.05) in terms of NDCG compared to the Doubly Robust estimator from the literature.
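
A heavily simplified sketch of the similarity-adjusted propensity idea: start from a position-based examination model and raise the estimated propensity of low-ranked clicked documents that are similar to highly-ranked relevant ones, which tempers their IPS weights. The blend and all parameters below are illustrative assumptions, not the paper's formula:

```python
import numpy as np

def ips_sim_weights(positions, sims, eta=1.0, alpha=0.5):
    """IPS weights from a similarity-adjusted propensity (illustrative).
    positions: rank of each document at click time (1 = top).
    sims: similarity of each document to highly-ranked relevant docs in [0,1]."""
    base = (1.0 / positions) ** eta                       # examination model
    boost = np.maximum(base, sims * base.max())           # similarity lift
    adjusted = (1 - alpha) * base + alpha * boost         # blended propensity
    return 1.0 / adjusted                                 # IPS weights

positions = np.array([1, 5, 40])
sims = np.array([0.2, 0.9, 0.95])
print(ips_sim_weights(positions, sims))   # the similar rank-40 doc no longer
                                          # receives an extreme 40x weight
```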

[IR-2] NLGCL: Naturally Existing Neighbor Layers Graph Contrastive Learning for Recommendation RECSYS2025

Link: https://arxiv.org/abs/2507.07522
Authors: Jinfeng Xu,Zheyu Chen,Shuo Yang,Jinze Li,Hewei Wang,Wei Wang,Xiping Hu,Edith Ngai
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by RecSys 2025 as Spotlight Oral

Click to view abstract

Abstract:Graph Neural Networks (GNNs) are widely used in collaborative filtering to capture high-order user-item relationships. To address the data sparsity problem in recommendation systems, Graph Contrastive Learning (GCL) has emerged as a promising paradigm that maximizes mutual information between contrastive views. However, existing GCL methods rely on augmentation techniques that introduce semantically irrelevant noise and incur significant computational and storage costs, limiting effectiveness and efficiency. To overcome these challenges, we propose NLGCL, a novel contrastive learning framework that leverages naturally contrastive views between neighbor layers within GNNs. By treating each node and its neighbors in the next layer as positive pairs, and other nodes as negatives, NLGCL avoids augmentation-based noise while preserving semantic relevance. This paradigm eliminates costly view construction and storage, making it computationally efficient and practical for real-world scenarios. Extensive experiments on four public datasets demonstrate that NLGCL outperforms state-of-the-art baselines in effectiveness and efficiency.
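
A sketch of the neighbor-layer contrastive loss in its simplest form, pairing each node with its own next-layer embedding as the positive and all other nodes as negatives (the paper's positives also include next-layer neighbors); no graph augmentation is needed:

```python
import torch
import torch.nn.functional as F

def neighbor_layer_infonce(h_l, h_lp1, tau=0.2):
    """InfoNCE between layer-l and layer-(l+1) node embeddings: positives on
    the diagonal, all other next-layer nodes act as negatives."""
    z1 = F.normalize(h_l, dim=1)
    z2 = F.normalize(h_lp1, dim=1)
    logits = z1 @ z2.t() / tau                   # (N, N) similarity matrix
    labels = torch.arange(z1.size(0))            # positive pairs on the diagonal
    return F.cross_entropy(logits, labels)

h_l   = torch.randn(64, 32)    # layer-l node embeddings from the GNN
h_lp1 = torch.randn(64, 32)    # layer-(l+1) embeddings of the same nodes
print(neighbor_layer_infonce(h_l, h_lp1))
```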

[IR-3] When Graph Contrastive Learning Backfires: Spectral Vulnerability and Defense in Recommendation

Link: https://arxiv.org/abs/2507.07436
Authors: Zongwei Wang,Min Gao,Junliang Yu,Shazia Sadiq,Hongzhi Yin,Ling Liu
Subjects: Information Retrieval (cs.IR)
Comments: 24 pages, 6 figures

Click to view abstract

Abstract:Graph Contrastive Learning (GCL) has demonstrated substantial promise in enhancing the robustness and generalization of recommender systems, particularly by enabling models to leverage large-scale unlabeled data for improved representation learning. However, in this paper, we reveal an unexpected vulnerability: the integration of GCL inadvertently increases the susceptibility of a recommender to targeted promotion attacks. Through both theoretical investigation and empirical validation, we identify the root cause as the spectral smoothing effect induced by contrastive optimization, which disperses item embeddings across the representation space and unintentionally enhances the exposure of target items. Building on this insight, we introduce CLeaR, a bi-level optimization attack method that deliberately amplifies spectral smoothness, enabling a systematic investigation of the susceptibility of GCL-based recommendation models to targeted promotion attacks. Our findings highlight the urgent need for robust countermeasures; in response, we further propose SIM, a spectral irregularity mitigation framework designed to accurately detect and suppress targeted items without compromising model performance. Extensive experiments on multiple benchmark datasets demonstrate that, compared to existing targeted promotion attacks, GCL-based recommendation models exhibit greater susceptibility when evaluated with CLeaR, while SIM effectively mitigates these vulnerabilities.

Attachment download

Click here to download today's full paper list