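This post lists the latest papers retrieved from arXiv.org on 2025-07-29. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Friendly reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-07-29)
849 papers were updated today, including:
- Natural Language Processing: 107 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 245 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 207 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 235 papers (Machine Learning (cs.LG))
Natural Language Processing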
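[NLP-0] Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
[Quick read]: This paper addresses two key problems in current "LLM-as-a-judge" methods: the judges' persona descriptions are usually designed arbitrarily, without grounding, and existing frameworks generalize poorly to other task settings. Its core solution is MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that automatically constructs evaluator personas with distinct dimensions from relevant text documents (such as research papers), instantiates them as LLM agents with specific roles, and produces multi-dimensional feedback through in-group multi-agent debate. Experiments in the educational and medical domains show that its evaluation results align better with human experts' ratings than conventional automated metrics and existing LLM-as-a-judge methods.
Link: https://arxiv.org/abs/2507.21028
Authors: Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang
Institutions: Rice University; Northeastern University; Amazon; University of Illinois Urbana-Champaign; Harvard University
Categories: Computation and Language (cs.CL)
Comments: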
Abstract:Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging “LLM-as-a-judge” paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts’ ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.
[NLP-1] Memorization in Fine-Tuned Large Language Models
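[Quick read]: This paper addresses the data-privacy risk that arises when fine-tuned large language models (LLMs) memorize training data, focusing on the medical domain. It analyzes how different aspects of fine-tuning affect memorization in order to identify high-risk factors and inform the trade-off between privacy and performance. The approach uses a membership inference attack and a prefix-prompted generation task to assess verbatim reproduction of training data, and examines three factors: which transformer weight matrices are adapted (Value and Output matrices contribute more to memorization than Query and Key matrices), the correlation between lower perplexity and increased memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning, where higher ranks increase memorization with diminishing returns. These findings provide empirical grounding for LLM adaptation strategies that balance performance and privacy.
Link: https://arxiv.org/abs/2507.21009
Authors: Danil Savine, Muni Sreenivas Pydi, Jamal Atif, Olivier Cappé
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: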
Abstract:This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model’s propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.
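To make the LoRA rank discussed in this entry concrete, the sketch below counts the trainable parameters that one LoRA adapter adds to a single weight matrix. It relies only on the standard LoRA factorization (a d_out x d_in update written as B·A, with A of shape rank x d_in and B of shape d_out x rank). The function names and the hidden size 4096 are illustrative assumptions, not values from the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter.

    A has shape (rank, d_in) and B has shape (d_out, rank),
    so the adapter holds rank * (d_in + d_out) parameters.
    """
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the full (frozen) weight matrix being adapted."""
    return d_in * d_out


# Hypothetical hidden size, chosen only for illustration.
hidden = 4096
for rank in (4, 16, 64):
    added = lora_param_count(hidden, hidden, rank)
    share = added / full_param_count(hidden, hidden)
    print(f"rank={rank}: {added} adapter parameters ({share:.2%} of the full matrix)")
```

The adapter size grows linearly with the rank, which is one reason the rank is a natural knob to study when asking how much capacity for memorization a fine-tuning setup provides.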
[NLP-2] LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning
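[Quick read]: This paper addresses the high computational cost of pushing large language model (LLM) performance, which typically requires vast data, very large models, and full-parameter fine-tuning. Existing parameter-efficient fine-tuning (PEFT) methods mostly target domain adaptation or layer-wise allocation and do not tailor data and parameters to different response demands. The proposed LoRA-PAR framework draws on the System 1/System 2 ("thinking fast and slow") theory of cognition and partitions both task data and model parameters into two groups: one for fast, intuitive responses (System 1) and one for multi-step logical reasoning (System 2). Task data is classified by multi-model role-playing and voting, and parameter subsets are selected by importance scoring. Training then proceeds in two stages: supervised fine-tuning (SFT) first strengthens System 1 knowledge and intuition, and reinforcement learning (RL) then refines System 2 reasoning, so that fewer, more focused active parameters match or surpass state-of-the-art PEFT baselines.
Link: https://arxiv.org/abs/2507.20999
Authors: Yining Huang, Bin Li, Keke Tang, Meilian Chen
Institutions: South China Normal University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shenyang Institute of Computing Technology, Chinese Academy of Sciences
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages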
Abstract:Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy that different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring, then adopt a two-stage fine-tuning strategy of training System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition and refine System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation next. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.
[NLP-3] Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models
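[Quick read]: This paper addresses the loss of output diversity caused by instruction-tuning large language models (LLMs), a "diversity gap" that particularly limits creative tasks such as narrative generation. Its key contribution is a new decoding strategy, conformative decoding, which guides the instruction-tuned model using its more diverse base model, reintroducing output diversity while maintaining or even improving quality. The experiments indicate that, among the fine-tuning stages, DPO (Direct Preference Optimization) has the most substantial impact on diversity, and that conformative decoding typically increases diversity.
Link: https://arxiv.org/abs/2507.20956
Authors: Max Peeperkorn, Tom Kouwenhoven, Dan Brown, Anna Jordanous
Institutions: University of Kent; Universiteit Leiden; University of Waterloo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures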
Abstract:Instruction-tuning large language models (LLMs) reduces the diversity of their outputs, which has implications for many tasks, particularly for creative tasks. This paper investigates the “diversity gap” for a writing prompt narrative generation task. This gap emerges as measured by current diversity metrics for various open-weight and open-source LLMs. The results show significant decreases in diversity due to instruction-tuning. We explore the diversity loss at each fine-tuning stage for the OLMo and OLMo 2 models to further understand how output diversity is affected. The results indicate that DPO has the most substantial impact on diversity. Motivated by these findings, we present a new decoding strategy, conformative decoding, which guides an instruct model using its more diverse base model to reintroduce output diversity. We show that conformative decoding typically increases diversity and even maintains or improves quality.
[NLP-4] Dissecting Persona-Driven Reasoning in Language Models via Activation Patching
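[Quick read]: This paper examines how assigning a persona to a large language model (LLM) influences its reasoning on an objective task. Using activation patching, it takes a first step toward understanding how key model components encode persona-specific information. The key finding is that the early Multi-Layer Perceptron (MLP) layers process not only the syntactic structure of the input but also its semantic content, transforming persona tokens into richer representations; the middle Multi-Head Attention (MHA) layers then use these representations to shape the output. The study also identifies specific attention heads that disproportionately attend to racial and color-based identities.
Link: https://arxiv.org/abs/2507.20936
Authors: Ansh Poonia, Maeghal Jain
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages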
Abstract:Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model’s reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model’s output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.
[NLP-5] FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
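[Quick read]: This paper addresses hallucinations in large language models (LLMs), in particular the loss of reliability caused by factual errors in high-stakes domains such as finance. Given a user-defined, domain-specific error taxonomy, the authors construct a synthetic dataset by inserting tagged factual errors into financial question-answering corpora, then fine-tune four language models (Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B) to detect and edit factual errors in generated text. The fine-tuned Phi-4 achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance over OpenAI-o3, while the fine-tuned Phi-4-mini, with only 4 billion parameters, stays competitive, with a 2% drop in binary detection and a 0.1% decline in overall detection relative to OpenAI-o3. Beyond financial text generation, the method offers a generalizable framework for improving the trustworthiness and alignment of LLMs in other applications.
Link: https://arxiv.org/abs/2507.20930
Authors: Likun Tan, Kuan-Wei Huang, Kevin Wu
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: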
Abstract:Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at this https URL.
[NLP-6] FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models
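[Quick read]: This paper addresses the identification and classification of sexism in social media text, focusing on the first task of the EXIST challenge at CLEF 2025. The solution consists of three models applied to each subtask: the Speech Concept Bottleneck Model (SCBM), the Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa model. The key innovation is that SCBM and SCBMT use descriptive adjectives as human-interpretable bottleneck concepts: large language models (LLMs) encode the input text into an interpretable adjective representation that is used to train a lightweight classifier, and SCBMT additionally fuses this representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results (SCBMT ranks 7th for English and Spanish and 6th for Spanish), the two models provide fine-grained explanations at both the instance (local) and class (global) levels, making model decisions more transparent.
Link: https://arxiv.org/abs/2507.20924
Authors: Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer
Institutions: St. Pölten University of Applied Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 12 pages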
Abstract:Sexism has become widespread on social media and in online conversation. To help address this issue, the fifth Sexism Identification in Social Networks (EXIST) challenge is initiated at CLEF 2025. Among this year’s international benchmarks, we concentrate on solving the first task aiming to identify and classify sexism in social media textual posts. In this paper, we describe our solutions and report results for three subtasks: Subtask 1.1 - Sexism Identification in Tweets, Subtask 1.2 - Source Intention in Tweets, and Subtask 1.3 - Sexism Categorization in Tweets. We implement three models to address each subtask which constitute three individual runs: Speech Concept Bottleneck Model (SCBM), Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa transformer model. SCBM uses descriptive adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to encode input texts into a human-interpretable representation of adjectives, then used to train a lightweight classifier for downstream tasks. SCBMT extends SCBM by fusing adjective-based representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results, these two models offer fine-grained explanations at both instance (local) and class (global) levels. We also investigate how additional metadata, e.g., annotators’ demographic profiles, can be leveraged. For Subtask 1.1, XLM-RoBERTa, fine-tuned on provided data augmented with prior datasets, ranks 6th for English and Spanish and 4th for English in the Soft-Soft evaluation. Our SCBMT achieves 7th for English and Spanish and 6th for Spanish.
[NLP-7] MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
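[Quick read]: This paper addresses the lack of high-quality benchmark datasets grounded in real clinical scenarios for multilingual medical question answering, specifically for evaluating factual recall and reasoning over French medical knowledge. Its key contribution is the MediQAl dataset, which contains 32,603 questions sourced from French medical examinations across 41 medical subjects, organized into three tasks (multiple-choice with a unique answer, multiple-choice with multiple answers, and open-ended short-answer questions). Each question is labeled as Understanding or Reasoning, enabling a fine-grained analysis of models' cognitive capabilities. An extensive evaluation of 14 large language models reveals a significant performance gap between factual recall and reasoning tasks, establishing a comprehensive and challenging benchmark for French medical question answering.
Link: https://arxiv.org/abs/2507.20917
Authors: Adrien Bazoge
Institutions: Data Clinic, University Hospital of Nantes; Nantes Université; École Centrale Nantes; CNRS; LS2N
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: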
Abstract:This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models’ cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models’ performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.
[NLP-8] Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning
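[Quick read]: This paper asks whether prompting with many input-output examples, as in In-Context Learning (ICL), is the most effective and efficient way to convey task information, that is, how to condition a model on a task more efficiently without updating its parameters. The key idea is Soft Injection: task embeddings are constructed once from few-shot ICL prompts and then, at inference time, softly mixed into attention head activations using pre-optimized soft head-selection parameters, shifting task conditioning from the prompt space to the activation space. The method performs the task without in-prompt demonstrations, outperforms 10-shot ICL across 57 tasks and 12 LLMs, and reduces memory usage and compute cost at inference time.
Link: https://arxiv.org/abs/2507.20906
Authors: Jungwon Park, Wonjong Rhee
Institutions: Seoul National University
Categories: Computation and Language (cs.CL)
Comments: Preprint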
Abstract:In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks by conditioning on input-output examples in the prompt, without requiring any update in model parameters. While widely adopted, it remains unclear whether prompting with multiple examples is the most effective and efficient way to convey task information. In this work, we propose Soft Injection of task embeddings. The task embeddings are constructed only once using few-shot ICL prompts and repeatedly used during inference. Soft injection is performed by softly mixing task embeddings with attention head activations using pre-optimized mixing parameters, referred to as soft head-selection parameters. This method not only allows a desired task to be performed without in-prompt demonstrations but also significantly outperforms existing ICL approaches while reducing memory usage and compute cost at inference time. An extensive evaluation is performed across 57 tasks and 12 LLMs, spanning four model families of sizes from 4B to 70B. Averaged across 57 tasks, our method outperforms 10-shot ICL by 10.1%-13.9% across 12 LLMs. Additional analyses show that our method also serves as an insightful tool for analyzing task-relevant roles of attention heads, revealing that task-relevant head positions selected by our method transfer across similar tasks but not across dissimilar ones – underscoring the task-specific nature of head functionality. Our soft injection method opens a new paradigm for reducing prompt length and improving task performance by shifting task conditioning from the prompt space to the activation space.
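The abstract describes soft injection only at a high level ("softly mixing task embeddings with attention head activations"). The sketch below illustrates one plausible reading, a per-head convex combination h' = (1 - α)·h + α·t. The convex form, the function name `soft_inject`, and the toy vectors are all assumptions made for illustration; the paper's actual mixing rule may differ.

```python
from typing import List, Sequence

Vector = List[float]


def soft_inject(
    head_activations: Sequence[Vector],
    task_embeddings: Sequence[Vector],
    alphas: Sequence[float],
) -> List[Vector]:
    """Mix each head's activation with its task embedding.

    Assumed rule (not taken from the paper): for head i,
    mixed_i = (1 - alpha_i) * activation_i + alpha_i * task_embedding_i,
    where alpha_i in [0, 1] plays the role of a soft head-selection parameter.
    """
    if not (len(head_activations) == len(task_embeddings) == len(alphas)):
        raise ValueError("need one task embedding and one alpha per head")
    mixed = []
    for h, t, a in zip(head_activations, task_embeddings, alphas):
        if not 0.0 <= a <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        if len(h) != len(t):
            raise ValueError("activation and task embedding sizes differ")
        mixed.append([(1.0 - a) * hi + a * ti for hi, ti in zip(h, t)])
    return mixed


# Toy example with two heads of dimension 2 (hypothetical values).
activations = [[1.0, 2.0], [0.0, 0.0]]
task_vectors = [[3.0, 4.0], [1.0, 1.0]]
print(soft_inject(activations, task_vectors, alphas=[0.5, 0.0]))
```

With α = 0 a head is left untouched and with α = 1 its activation is fully replaced, so under this reading the learned α values indicate which heads carry the task information.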
[NLP-9] A2R2: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement
[Quick read]: This paper addresses the weak performance of vision-language models (VLMs) on image-to-LaTeX (Img2LaTeX) conversion, where difficulty with fine-grained visual elements leads to inaccurate LaTeX output. The core solution is the A²R² framework, which combines attention-guided refinement with visual reasoning so that the model can self-correct during inference and progressively improve its predictions. The key is to integrate attention localization and iterative refinement into the visual reasoning process, which the authors report improves performance on images of mathematical expressions and tables.
Link: https://arxiv.org/abs/2507.20890
Authors: Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Institutions: University of California, San Diego; ByteDance; University of California, Merced; University of Southern California; University at Buffalo; The University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose A²R²: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) A²R² significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of A²R² in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.
[NLP-10] Enhancing Project-Specific Code Completion by Inferring Internal API Information
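[Quick read]: This paper addresses the limited accuracy of project-specific code completion when internal API information is missing, especially when APIs are not explicitly imported in the current file. Existing methods based on retrieval-augmented generation (RAG) often struggle to incorporate internal API information, which hurts completion quality. The key to the solution is to extend the representation of each API with constructed usage examples and semantic descriptions and to build a knowledge base for large language models (LLMs), so that internal API context can be inferred without relying on explicit import statements, improving the accuracy and relevance of completions.
Link: https://arxiv.org/abs/2507.20888
Authors: Le Deng, Xiaoxue Ren, Chao Ni, Ming Liang, David Lo, Zhongxin Liu
Institutions: Zhejiang University; Ant Group; Singapore Management University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: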
Abstract:Project-specific code completion is a critical task that leverages context from a project to generate accurate code. State-of-the-art methods use retrieval-augmented generation (RAG) with large language models (LLMs) and project information for code completion. However, they often struggle to incorporate internal API information, which is crucial for accuracy, especially when APIs are not explicitly imported in the file. To address this, we propose a method to infer internal API information without relying on imports. Our method extends the representation of APIs by constructing usage examples and semantic descriptions, building a knowledge base for LLMs to generate relevant completions. We also introduce ProjBench, a benchmark that avoids leaked imports and consists of large-scale real-world projects. Experiments on ProjBench and CrossCodeEval show that our approach significantly outperforms existing methods, improving code exact match by 22.72% and identifier exact match by 18.31%. Additionally, integrating our method with existing baselines boosts code match by 47.80% and identifier match by 35.55%.
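[NLP-11] The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?
[Quick read]: This paper addresses the unclear contribution of non-manual facial features in automatic sign language recognition (ASLR). Prior studies show that incorporating facial features can improve recognition, but they often rely on hand-crafted feature extraction and only compare manual features against the combination of manual and facial features, without a systematic analysis of specific facial regions (eyes, mouth, and full face). This study uses two deep learning models (a CNN-based model and a transformer-based model) on a dataset of isolated signs, evaluating the facial regions through quantitative performance and qualitative saliency maps. The results show that the mouth is the most important non-manual facial feature, significantly improving accuracy, which highlights the need to incorporate facial features in ASLR.
Link: https://arxiv.org/abs/2507.20884
Authors: Dinh Nam Pham, Eleftherios Avramidis
Institutions: German Research Center for Artificial Intelligence (DFKI)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
Comments: Accepted at 9th International Workshop on Sign Language Translation and Avatar Technologies @ ACM IVA’25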
Abstract:Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.
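[NLP-12] Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings
[Quick read]: This paper addresses the difficulty of extracting structured information from medical reports, which are often unstructured and written in domain-specific language. It introduces llm_extractinator, a publicly available framework for information extraction with open-source generative LLMs, and uses it to evaluate nine open-source models in a zero-shot setting on the DRAGON benchmark, which comprises 28 clinical information extraction tasks in Dutch. Several 14-billion-parameter models (Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B) achieved competitive results, while the larger Llama-3.3-70B performed slightly better at greater computational cost. Translating the reports to English before inference consistently degraded performance, which underlines the importance of native-language processing. The results indicate that open-source LLMs offer effective, scalable, and privacy-conscious clinical information extraction in low-resource settings.
Link: https://arxiv.org/abs/2507.20859
Authors: Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: 34 pages, 5 figures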
Abstract:Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and data privacy concerns limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed llm_extractinator, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14 billion parameter models, Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B, achieved competitive results, while the bigger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need for native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.
[NLP-13] A survey of diversity quantification in natural language processing: The why, what, where and how
[Quick read]: This paper addresses the lack of systematic and consistent use of the concept of "diversity" in Natural Language Processing (NLP). Diversity is often handled in an ad hoc manner, with few explicit links to fields such as ecology and economics where the notion is better theorized. The key contribution is a unified taxonomy of why, what, where, and how diversity is measured in NLP, with diversity measures cast onto the framework of Stirling (2007) from ecology and economics, which has three dimensions: variety, balance, and disparity. This systematization is intended to support a more formal and comparable treatment of diversity in the field.
Link: https://arxiv.org/abs/2507.20858
Authors: Louis Estève, Marie-Catherine de Marneffe, Nurit Melnik, Agata Savary, Olha Kanishcheva
Institutions: Université Paris-Saclay, CNRS, LISN; FNRS - UCLouvain; The Open University of Israel; Heidelberg University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The concept of diversity has received increased consideration in Natural Language Processing (NLP) in recent years. This is due to various motivations like promoting and inclusion, approximating human linguistic behavior, and increasing systems’ performance. Diversity has however often been addressed in an ad hoc manner in NLP, and with few explicit links to other domains where this notion is better theorized. We survey articles in the ACL Anthology from the past 6 years, with “diversity” or “diverse” in their title. We find a wide range of settings in which diversity is quantified, often highly specialized and using inconsistent terminology. We put forward a unified taxonomy of why, what on, where, and how diversity is measured in NLP. Diversity measures are cast upon a unified framework from ecology and economy (Stirling, 2007) with 3 dimensions of diversity: variety, balance and disparity. We discuss the trends which emerge due to this systematized approach. We believe that this study paves the way towards a better formalization of diversity in NLP, which should bring a better understanding of this notion and a better comparability between various approaches.
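The three dimensions named in this abstract (variety, balance, disparity) can each be instantiated in several ways. The sketch below picks one common choice per dimension: the number of non-empty categories for variety, Shannon evenness for balance, and mean pairwise distance for disparity. These particular measures, the function names, and the toy data are assumptions for illustration, not necessarily the ones the survey discusses.

```python
import math
from itertools import combinations
from typing import Callable, Dict


def variety(counts: Dict[str, int]) -> int:
    """Number of categories that actually occur."""
    return sum(1 for c in counts.values() if c > 0)


def balance(counts: Dict[str, int]) -> float:
    """Shannon evenness: entropy divided by its maximum, log(variety).

    Returns 0.0 when fewer than two categories occur (evenness is not
    meaningful there; this convention is an assumption of the sketch).
    """
    present = [c for c in counts.values() if c > 0]
    if len(present) < 2:
        return 0.0
    total = sum(present)
    entropy = -sum((c / total) * math.log(c / total) for c in present)
    return entropy / math.log(len(present))


def disparity(counts: Dict[str, int], distance: Callable[[str, str], float]) -> float:
    """Mean pairwise distance between the categories that occur."""
    present = [k for k, c in counts.items() if c > 0]
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Toy example: language counts in a hypothetical corpus.
corpus = {"en": 80, "fr": 15, "he": 5}
same_script = {"en": "latin", "fr": "latin", "he": "hebrew"}
script_distance = lambda a, b: 0.0 if same_script[a] == same_script[b] else 1.0
print(variety(corpus), balance(corpus), disparity(corpus, script_distance))
```

Keeping the three functions separate mirrors the point of the framework: a corpus can score high on variety while remaining unbalanced, or balanced while containing only closely related categories.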
[NLP-14] Latent Inter-User Difference Modeling for LLM Personalization
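[Quick read]: This paper addresses the neglect of inter-user differences in personalizing large language model (LLM) outputs. Existing methods mainly rely on a user's own history and overlook how users differ from one another, although these differences are crucial for effective personalization. The authors propose Difference-aware Embedding-based Personalization (DEP), which models inter-user differences in the latent space rather than through language prompts: it contrasts the target user's embedding with those of peers who engaged with similar content to extract relative behavioral signals and construct soft prompts. A sparse autoencoder then filters and compresses both the user-specific and the difference-aware embeddings, keeping only task-relevant features before they are injected into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics.
Link: https://arxiv.org/abs/2507.20849
Authors: Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng
Institutions: National University of Singapore; University of Science and Technology of China; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: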
Abstract:Large language models (LLMs) are increasingly integrated into users’ daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user’s own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user’s embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is available at this https URL.
[NLP-15] Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models
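[Quick read]: This paper addresses the inefficiency and limited scalability of manually analyzing Prevention of Future Deaths (PFD) reports, in particular the labor-intensive identification and thematic coding of PFD reports on child suicide. The key solution is a fully automated, open-source "text-to-table" language-model pipeline (PFD Toolkit), which uses large language models (LLMs) to screen and code 4,249 PFD reports, identifying cases where the coroner attributed death to suicide in individuals aged 18 or younger and coding eligible reports against 23 concern sub-themes. The end-to-end run takes minutes rather than the months required for manual review, and validation against blinded clinicians shows high agreement (Cohen's κ = 0.82). The approach offers a feasible path to scalable, reproducible, and timely insights for public health and safety.
Link: https://arxiv.org/abs/2507.20786
Authors: Sam Osian, Arpan Dutta, Sahil Bhandari, Iain E. Buchan, Dan W. Joyce
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 1 figure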
Abstract:Prevention of Future Deaths (PFD) reports, issued by coroners in England and Wales, flag systemic hazards that may lead to further loss of life. Analysis of these reports has previously been constrained by the manual effort required to identify and code relevant cases. In 2025, the Office for National Statistics (ONS) published a national thematic review of child-suicide PFD reports (≤ 18 years), identifying 37 cases from January 2015 to November 2023 - a process based entirely on manual curation and coding. We evaluated whether a fully automated, open source “text-to-table” language-model pipeline (PFD Toolkit) could reproduce the ONS’s identification and thematic analysis of child-suicide PFD reports, and assessed gains in efficiency and reliability. All 4,249 PFD reports published from July 2013 to November 2023 were processed via PFD Toolkit’s large language model pipelines. Automated screening identified cases where the coroner attributed death to suicide in individuals aged 18 or younger, and eligible reports were coded for recipient category and 23 concern sub-themes, replicating the ONS coding frame. PFD Toolkit identified 72 child-suicide PFD reports - almost twice the ONS count. Three blinded clinicians adjudicated a stratified sample of 144 reports to validate the child-suicide screening. Against the post-consensus clinical annotations, the LLM-based workflow showed substantial to almost-perfect agreement (Cohen’s κ = 0.82, 95% CI: 0.66-0.98, raw agreement = 91%). The end-to-end script runtime was 8m 16s, transforming a process that previously took months into one that can be completed in minutes. This demonstrates that automated LLM analysis can reliably and efficiently replicate manual thematic reviews of coronial data, enabling scalable, reproducible, and timely insights for public health and safety. The PFD Toolkit is openly available for future research.
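The agreement statistic quoted in this abstract, Cohen's kappa, corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The function name and the toy labels are illustrative assumptions and do not reproduce the study's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items.

    p_o is the observed agreement rate; p_e is the agreement expected by
    chance from each rater's marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: 1 = flagged as in scope, 0 = not in scope (hypothetical labels).
pipeline = [1, 1, 0, 0]
clinicians = [1, 0, 0, 0]
print(cohens_kappa(pipeline, clinicians))
```

Raw agreement alone can look high when one label dominates, which is why the abstract reports kappa alongside the 91% raw agreement.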
[NLP-16] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey
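[Quick read]: This survey systematically traces the development and core techniques of general-purpose text embeddings (GPTE) in the era of pretrained language models (PLMs). It addresses the lack of a comprehensive review and, in particular, the insufficiently clarified role that PLMs play in GPTE. Its key contribution is to analyze how PLMs drive GPTE at two levels, basic architecture and advanced capabilities. On one hand, it identifies the core roles of PLMs in embedding extraction, expressivity enhancement, training strategy design, learning objective optimization, and data construction. On the other, it shows how PLMs enable multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. It also proposes directions beyond conventional performance gains, such as ranking integration, safety, bias mitigation, structural information incorporation, and the cognitive extension of embeddings, giving the field both a theoretical framework and practical guidance.
Link: https://arxiv.org/abs/2507.20783
Authors: Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 45 pages, 2 figures, 9 tables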
Abstract:Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
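The abstract says GPTE models are optimized through contrastive learning on pairwise data but does not name a loss. The sketch below assumes an InfoNCE objective with in-batch negatives, which is one common choice. The function names, the temperature value, and the toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss; docs[i] is the positive for queries[i],
    and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (illustrative only).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(toy_queries, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(toy_queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query sits closest to its own positive. It grows large when the positives are swapped, and that gap is the signal that pulls matched pairs together during training.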
[NLP-17] Multilingual Self-Taught Faithfulness Evaluators
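[Quick read]: This paper targets automatic evaluation of information hallucination in large language models (LLMs) in multilingual settings, in particular the reliance of existing methods on expensive human-labeled data and their focus on English. The key to its solution is a self-taught learning framework, Self-Taught Evaluators for Multilingual Faithfulness, which trains only on synthetic multilingual summarization data and uses cross-lingual transfer learning to evaluate across languages, improving multilingual evaluation accuracy without large amounts of labeled data.
Link: https://arxiv.org/abs/2507.20752
Authors: Carlo Alfano, Aymen Al Marjani, Zeno Jonke, Amin Mantrach, Saab Mansour, Marcello Federico
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: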
Abstract:The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents Self-Taught Evaluators for Multilingual Faithfulness, a framework that learns exclusively from synthetic multilingual summarization data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM’s general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.
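[NLP-18] Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study
[Quick read]: This paper addresses the heavy compute and memory demands that hinder practical deployment of multimodal large language models (MLLMs). Existing parameter-reduction methods mainly train MLLMs from small language models (SLMs), which is inflexible and computationally expensive. The proposed solution is to compress already-trained MLLMs directly through structural pruning combined with efficient recovery training. Specifically, the paper studies two pruning strategies, layerwise and widthwise, and compares supervised fine-tuning with knowledge distillation for recovering performance. Experiments show that widthwise pruning performs better in low-resource settings, that as little as 5% of the original data can recover over 95% of the original performance, and that at low compression ratios (≤20%) fine-tuning only the multimodal projector is sufficient. This gives a practical, efficient path for compressing MLLMs in resource-constrained environments.
Link: https://arxiv.org/abs/2507.20749
Authors: Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at GCPR 2025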
Abstract:While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms, layerwise and widthwise pruning, applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (≤ 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, this study offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computation resources or sufficient data.
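[NLP-19] Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
[Quick read]: This paper addresses the insufficient safety evaluation of visual language models (VLMs) against multimodal prompt injection: existing evaluation datasets are built mostly on text-only prompts and overlook the visual vulnerabilities that image inputs can introduce. The key to the solution is Text2VLM, a multi-stage pipeline that identifies harmful content in the original text and renders it as a typographic image, producing multimodal prompts for systematically testing VLM robustness against typographic prompt injection attacks. Experiments show that open-source VLMs become markedly more susceptible once visual inputs are introduced and lag clearly behind closed-source frontier models, supporting the value of Text2VLM for multimodal safety evaluation.
Link: https://arxiv.org/abs/2507.20704
Authors: Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 9 pages, 9 figures. Jake Thomas served as Editor for this manuscript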
Abstract:The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets heavily lean towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Also, our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models’ alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, ensuring that the extracted salient concepts, text summarization, and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.
[NLP-20] When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification ACL2025
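[Quick read]: This paper addresses automated fact verification with fine-grained veracity assessment amid multilingual misinformation, focusing on how models handle a complex scheme of seven veracity labels across languages. The key is a systematic comparison of five state-of-the-art language models on the X-Fact dataset. It finds that large language models (LLMs), despite having far more parameters (7-12B), perform much worse on fine-grained multilingual fact verification than small specialized models such as XLM-R (270M parameters): XLM-R reaches 57.7% macro-F1 against 16.9% for the best LLM. The results expose weaknesses of current LLMs in using evidence and handling class imbalance, and suggest that small task-specific models may be better suited than general-purpose large models for deploying accurate multilingual fact-checking.
Link: https://arxiv.org/abs/2507.20700
Authors: Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova, Josef van Genabith
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published at the FEVER Workshop, ACL 2025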
Abstract:The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8% improvement over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
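The abstract reports macro-F1 and notes a bias toward frequent categories. The sketch below illustrates why macro-F1 penalizes that bias: each class contributes equally to the average regardless of how often it occurs. The function and the labels are illustrative assumptions, not the paper's evaluation code or data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced data: a model that always predicts the frequent class.
gold = ["false"] * 8 + ["true"] * 2
always_majority = ["false"] * 10
accuracy = sum(t == p for t, p in zip(gold, always_majority)) / len(gold)
score = macro_f1(gold, always_majority)
```

In this toy case accuracy is 0.8, but macro-F1 is only about 0.44, because the minority class scores zero. With seven veracity categories, a model that leans on the frequent labels is penalized even more heavily.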
[NLP-21] Geometric-Mean Policy Optimization
[Quick read]: This paper addresses unstable policy updates in Group Relative Policy Optimization (GRPO) caused by outliers in token-level rewards, which show up as extreme fluctuations in the importance sampling ratio. The key to the solution is Geometric-Mean Policy Optimization (GMPO), which optimizes the geometric mean of token-level rewards in place of the arithmetic mean used by GRPO. This markedly reduces sensitivity to outliers and keeps the importance sampling ratio in a more stable range, improving training stability and model performance.
Link: https://arxiv.org/abs/2507.20673
Authors: Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
Affiliations: UCAS; CUHK; HKUST; Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: Code is available at this https URL
Abstract:Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratio. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on multimodal reasoning benchmark, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at this https URL.
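The stability argument rests on a basic property of the two means: a single extreme value pulls the arithmetic mean much further than the geometric mean. The sketch below illustrates only that property on made-up importance ratios. It is not the GMPO objective itself, which also involves advantages and clipping that the abstract does not spell out.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """exp of the mean log; inputs must be strictly positive."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical token-level importance ratios with one outlier token.
ratios = [1.0, 1.0, 1.0, 16.0]
am = arithmetic_mean(ratios)
gm = geometric_mean(ratios)
```

Here the outlier moves the arithmetic mean to 4.75 while the geometric mean moves only to 2.0. This is consistent with the abstract's claim that a geometric-mean objective is less sensitive to outlier tokens.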
[NLP-22] Ontology-Enhanced Knowledge Graph Completion using Large Language Models
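[Quick read]: This paper addresses the limits on conclusive, logically rigorous reasoning in current knowledge graph completion (KGC) methods based on large language models (LLMs). These methods rely on black-box deep neural architectures and therefore on implicit knowledge representation that propagates erroneous knowledge alongside correct knowledge. The key to the solution is to fuse neural-perceptual structural information with ontological knowledge in an LLM-based KGC method, OL-KGC. It first uses neural perceptual mechanisms to embed the structural information of the knowledge graph into the textual space. It then uses an automated extraction algorithm to obtain ontological knowledge from the knowledge graph to be completed and converts it into text the LLM can understand, providing logic guidance that improves the model's grasp of the intrinsic logic of the knowledge and its reasoning accuracy.
Link: https://arxiv.org/abs/2507.20643
Authors: Wenbin Guo, Xin Wang, Jiaoyan Chen, Zhao Li, Zirui Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: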
Abstract:Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology enhanced KGC method using LLMs – OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that need to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely-used benchmarks – FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.
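[NLP-23] Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior
[Quick read]: This paper addresses fragmentation in research on predicting antisocial behavior (ASB) on social platforms, namely the lack of a unified task taxonomy and method synthesis. The key is a structured taxonomy of five task types (early harm detection, harm emergence prediction, harm propagation prediction, behavioral risk prediction, and proactive moderation support). The paper also analyzes how the tasks differ in temporal framing, prediction granularity, and operational goals, and reviews trends in modeling techniques and the effect of dataset characteristics on model feasibility, laying theoretical and practical groundwork for more robust and socially responsible ASB prediction systems.
Link: https://arxiv.org/abs/2507.20614
Authors: Anaïs Ollagnier (CRISAM, CNRS, MARIANNE)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: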
Abstract:Antisocial behavior (ASB) on social media, including hate speech, harassment, and trolling, poses growing challenges for platform safety and societal wellbeing. While prior work has primarily focused on detecting harmful content after it appears, predictive approaches aim to forecast future harmful behaviors (such as hate speech propagation, conversation derailment, or user recidivism) before they fully unfold. Despite increasing interest, the field remains fragmented, lacking a unified taxonomy or clear synthesis of existing methods. This paper presents a systematic review of over 49 studies on ASB prediction, offering a structured taxonomy of five core task types: early harm detection, harm emergence prediction, harm propagation prediction, behavioral risk prediction, and proactive moderation support. We analyze how these tasks differ by temporal framing, prediction granularity, and operational goals. In addition, we examine trends in modeling techniques, from classical machine learning to pre-trained language models, and assess the influence of dataset characteristics on task feasibility and generalization. Our review highlights methodological challenges, such as dataset scarcity, temporal drift, and limited benchmarks, while outlining emerging research directions including multilingual modeling, cross-platform generalization, and human-in-the-loop systems. By organizing the field around a coherent framework, this survey aims to guide future work toward more robust and socially responsible ASB prediction.
[NLP-24] ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning
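[Quick read]: This paper addresses article-grounded image retrieval and captioning in the Event-Enriched Image Analysis (EVENTA) task, where the core challenge is accurate cross-modal understanding and alignment without fine-tuning on the competition data. The key is a zero-shot ensemble approach, ZSE-Cap (Zero-Shot Ensemble for Captioning). For retrieval, it fuses similarity scores from three vision foundation models, CLIP, SigLIP, and DINOv2, to make retrieval more robust. For captioning, it uses a carefully constructed prompt that guides the Gemma 3 model to link high-level event semantics extracted from the article to the image content, yielding high-quality captions without fine-tuning. The system scored 0.42002 on the private test set, supporting the effectiveness of ensembling foundation models and prompt engineering.
Link: https://arxiv.org/abs/2507.20564
Authors: Duc-Tai Dinh, Duc Anh Khoa Dinh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: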
Abstract:We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th place system in Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no finetuning on the competition’s data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at this https URL.
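The abstract states that retrieval ensembles similarity scores from three models but does not say how the scores are combined. The sketch below assumes one simple option: min-max normalize each model's scores over the candidate images, then take a weighted mean. The function names, weights, and scores are hypothetical and are not taken from the authors' code.

```python
def min_max(scores):
    """Rescale one model's scores to [0, 1] so models are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_rank(score_lists, weights=None):
    """Return (candidate indices sorted best-first, combined scores)."""
    weights = weights or [1.0] * len(score_lists)
    normed = [min_max(s) for s in score_lists]
    n = len(score_lists[0])
    total = sum(weights)
    combined = [
        sum(w * ns[i] for w, ns in zip(weights, normed)) / total
        for i in range(n)
    ]
    order = sorted(range(n), key=lambda i: combined[i], reverse=True)
    return order, combined


# Hypothetical similarity scores for three candidate images from three models.
model_a = [0.30, 0.25, 0.10]
model_b = [0.10, 0.60, 0.20]
model_c = [0.40, 0.70, 0.50]
ranking, combined_scores = ensemble_rank([model_a, model_b, model_c])
```

Normalizing first matters because different models produce scores on different scales. Without it, the model with the widest score range would dominate the average.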
[NLP-25] Enhancing Hallucination Detection via Future Context
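[Quick read]: This paper addresses the difficulty of detecting hallucinations in text produced by black-box generative AI. The key to the solution is a future context sampling strategy: sampling the context that follows the generated text, mining it for clues that reveal earlier hallucinations, and combining this with various sampling-based detection techniques to improve hallucination detection performance.
Link: https://arxiv.org/abs/2507.20546
Authors: Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: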
Abstract:Large Language Models (LLMs) are widely used to generate plausible text on online platforms, without revealing the generation process. As users increasingly encounter such black-box outputs, detecting hallucinations has become a critical challenge. To address this challenge, we focus on developing a hallucination detection framework for black-box generators. Motivated by the observation that hallucinations, once introduced, tend to persist, we sample future contexts. The sampled future contexts provide valuable clues for hallucination detection and can be effectively integrated with various sampling-based methods. We extensively demonstrate performance improvements across multiple methods using our proposed sampling approach.
[NLP-26] Kimi K2: Open Agentic Intelligence
[Quick read]: This paper addresses the trade-off between training stability and efficiency in large-scale language models, and how to improve the agentic capability of non-thinking models on complex tasks. Its core solution is a new optimizer, MuonClip, which builds on Muon and adds a QK-clip technique to mitigate training instability while keeping high token efficiency. On this basis, Kimi K2 was pre-trained on 15.5 trillion tokens with no loss spikes, and a multi-stage post-training process (including a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage) strengthens the model's ability to interact with real and synthetic environments. These improvements bring Kimi K2 to state-of-the-art results among open-source non-thinking models on multiple benchmarks, especially in software engineering and agentic tasks.
Link: https://arxiv.org/abs/2507.20534
Authors: Kimi Team: Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T.Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang
Affiliations: Kimi Team
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: tech report of Kimi K2
Abstract:We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual – surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
[NLP-27] Dialogues of Dissent: Thematic and Rhetorical Dimensions of Hate and Counter-Hate Speech in Social Media Conversations
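[Quick read]: This paper addresses the difficulty of quantifying and systematically analyzing the complex interplay between hate speech and counter-hate speech on social media. The key is a multi-label annotation scheme that jointly annotates hate and counter-hate speech along thematic and rhetorical dimensions. The thematic dimension identifies the discursive aspects a message touches on, and the rhetorical dimension characterizes how it is expressed, based on Aristotle's Logos, Ethos, and Pathos. Applying the scheme to 92 conversations (720 tweets) and running statistical analyses that incorporate public metrics, the study reveals associations and interaction patterns between the two types of speech at the thematic and rhetorical levels, providing a structured tool for understanding how hate speech spreads and how it is countered.
Link: https://arxiv.org/abs/2507.20528
Authors: Effi Levi, Gal Ron, Odelia Oshri, Shaul R. Shenhav
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: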
Abstract:We introduce a novel multi-labeled scheme for joint annotation of hate and counter-hate speech in social media conversations, categorizing hate and counter-hate messages into thematic and rhetorical dimensions. The thematic categories outline different discursive aspects of each type of speech, while the rhetorical dimension captures how hate and counter messages are communicated, drawing on Aristotle’s Logos, Ethos and Pathos. We annotate a sample of 92 conversations, consisting of 720 tweets, and conduct statistical analyses, incorporating public metrics, to explore patterns of interaction between the thematic and rhetorical dimensions within and between hate and counter-hate speech. Our findings provide insights into the spread of hate messages on social media, the strategies used to counter them, and their potential impact on online behavior.
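[NLP-28] SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
[Quick read]: This paper addresses the limited mathematical reasoning performance of large language models (LLMs), whose core bottleneck is the scarcity of high-quality, difficult, and novel training data. The authors propose the SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions) generation pipeline, whose key innovation is a Difficulty Hiking step: high-quality problems are first synthesized from scratch, then their complexity is systematically raised, which significantly improves mathematical reasoning. Experiments show a gain of 17.85 absolute points on the AIME25 benchmark over the next-best synthetic dataset. An ablation confirms the effect of Difficulty Hiking: raising average difficulty from 5.02 to 5.98 lifts accuracy from 46.38% to 49.23%.
Link: https://arxiv.org/abs/2507.20527
Authors: Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: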
Abstract:The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. SAND-Math dataset is released here: this https URL
[NLP-29] Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
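[Quick read]: This paper addresses the vulnerability of current generative AI agents to policy violations in real deployment environments, especially their weak security under adversarial attack. The key is to use large-scale red-teaming to identify high-impact, reproducible prompt-injection attacks and to build the Agent Red Teaming (ART) benchmark for systematically evaluating the policy compliance of mainstream models across tasks and scenarios. The study finds that the vast majority of AI agents violate policy within only 10-100 queries, that attacks transfer strongly across models, and that agent robustness is only weakly related to model size, capability, or inference-time compute. This indicates that existing defenses are insufficient against malicious misuse and that more effective safeguards are needed.
Link: https://arxiv.org/abs/2507.20526
Authors: Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: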
Abstract:Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark - a curated set of high-impact attacks - and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today’s AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.
[NLP-30] AQUA: A Large Language Model for Aquaculture Fisheries
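[Quick read]: This paper addresses industry challenges in aquaculture arising from disease outbreaks, inefficient feeding, rising labor costs, logistical inefficiencies, and hatchery problems such as high mortality and poor water quality control, which existing machine learning methods handle poorly. The key to the solution is AQUA, the first large language model (LLM) for aquaculture, together with the AQUADAPT (Data Acquisition, Processing and Tuning) framework. The framework combines expert knowledge, large-scale language models, and automated evaluation to generate and refine high-quality synthetic data, providing an LLM-based foundation for aquaculture research, advisory services, and decision-making tools.
Link: https://arxiv.org/abs/2507.20520
Authors: Praneeth Narisetty, Uday Kumar Reddy Kattamanchi, Lohit Akshant Nimma, Sri Ram Kaushik Karnati, Shiva Nagendra Babu Kore, Mounika Golamari, Tejashree Nageshreddy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: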
Abstract:Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain-specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an Agentic Framework for generating and refining high-quality synthetic data using a combination of expert knowledge, large-scale language models, and automated evaluation techniques. Our work lays the foundation for LLM-driven innovations in aquaculture research, advisory systems, and decision-making tools.
[NLP-31] Customize Multi-modal RAI Guardrails with Precedent-based predictions
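[Quick read]: This paper addresses the scalability and adaptability problems of multi-modal content filtering when user-defined policies are diverse and examples are scarce. Existing fine-tuning methods are tied to predefined policies and either generalize poorly to new policies or require extensive retraining, while training-free methods are limited by context length and cannot incorporate all policies. The key to the solution is to introduce "precedents", the reasoning processes of data points similar to the current input, as the basis on which the model conditions its judgment. This removes the dependence on fixed policies and makes the system much more flexible and adaptable. With a critique-revise mechanism for collecting high-quality precedents and two precedent-based prediction strategies, the method outperforms existing methods in both few-shot and full-dataset settings and generalizes well to new policies.
Link: https://arxiv.org/abs/2507.20503
Authors: Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted to COLM 2025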
Abstract:A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition the model’s judgment on “precedents”, which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.
[NLP-32] Speaking in Words Thinking in Logic: A Dual-Process Framework in QA Systems IJCNN
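[Summary]: This paper targets closed-domain question answering (e.g., education, healthcare, and law), where users need both accurate answers and transparent reasoning. Existing LLM-based approaches translate natural language into formal logic inefficiently and lack interpretability. The key contribution is Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight framework for converting natural language into first-order logic (NL2FOL). Following dual-system cognitive theory, a neural network plays the role of System 1 and generates logic representations efficiently, while the Z3 solver acts as System 2 and performs reliable logical inference, yielding a neural-symbolic (NeSy) QA system that is both computationally efficient and interpretable.
Link: https://arxiv.org/abs/2507.20491
Authors: Tuan Bui,Trong Le,Phat Thai,Sang Nguyen,Minh Hua,Ngan Pham,Thang Bui,Tho Quan
Affiliations: Ho Chi Minh City University of Technology (HCMUT); Vietnam National University - Ho Chi Minh City
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: 8 pages, 3 figures. Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025, Workshop on Trustworthiness and Reliability in Neuro-Symbolic AI. this https URL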
Abstract:Recent advances in large language models (LLMs) have significantly enhanced question-answering (QA) capabilities, particularly in open-domain contexts. However, in closed-domain scenarios such as education, healthcare, and law, users demand not only accurate answers but also transparent reasoning and explainable decision-making processes. While neural-symbolic (NeSy) frameworks have emerged as a promising solution, leveraging LLMs for natural language understanding and symbolic systems for formal reasoning, existing approaches often rely on large-scale models and exhibit inefficiencies in translating natural language into formal logic representations. To address these limitations, we introduce Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight yet effective framework for converting natural language into first-order logic (NL2FOL). Drawing inspiration from dual-system cognitive theory, Text-JEPA emulates System 1 by efficiently generating logic representations, while the Z3 solver operates as System 2, enabling robust logical inference. To rigorously evaluate the NL2FOL-to-reasoning pipeline, we propose a comprehensive evaluation framework comprising three custom metrics: conversion score, reasoning score, and Spearman rho score, which collectively capture the quality of logical translation and its downstream impact on reasoning accuracy. Empirical results on domain-specific datasets demonstrate that Text-JEPA achieves competitive performance with significantly lower computational overhead compared to larger LLM-based systems. Our findings highlight the potential of structured, interpretable reasoning frameworks for building efficient and explainable QA systems in specialized domains.
[NLP-33] CodeNER: Code Prompting for Named Entity Recognition
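[Summary]: This paper addresses a weakness of current LLM-based named entity recognition (NER) methods: when handling candidate entity spans they rely only on the input context and ignore the details of the label structure. Existing methods can generate labeled candidate entities, but they give the model no explicit guidance on the annotation scheme (such as the BIO schema), which limits performance. The key idea is code-based prompting: labeling rules written in a programming language are embedded in the prompt to convey the NER labeling logic explicitly, exploiting the ability of LLMs to follow long-range scopes in code. On ten benchmarks across English, Arabic, Finnish, Danish, and German, the method outperforms conventional text-based prompting, and combining it with chain-of-thought prompting improves performance further.
Link: https://arxiv.org/abs/2507.20423
Authors: Sungwoo Han,Hyeyeon Kim,Jingun Kwon,Hidetaka Kamigaito,Manabu Okumura
Affiliations: Chungnam National University; Nara Institute of Science and Technology; Institute of Science Tokyo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures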
Abstract:Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly, ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with the chain-of-thought prompting further improves performance.
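To make the BIO schema mentioned in the abstract concrete, here is a minimal sketch of the kind of code that spells out the labeling rules. It is an illustration only: the function name, the span format, and the example sentence are assumptions, and the paper's actual prompts are not reproduced in this listing.

```python
def bio_tags(tokens, spans):
    """Convert entity spans into BIO tags.

    spans is a list of (start, end_exclusive, label) tuples over token indices.
    The first token of an entity gets B-<label>, the following tokens get
    I-<label>, and every token outside an entity gets O.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Barack", "Obama", "visited", "Paris", "."]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
tags = bio_tags(tokens, spans)
print(list(zip(tokens, tags)))
# prints [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('visited', 'O'), ('Paris', 'B-LOC'), ('.', 'O')]
```

Placing a function like this in a prompt states the begin/inside/outside rules unambiguously, which is the kind of explicit structure the abstract credits for the gains over text-only instructions.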
[NLP-34] Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?
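[Summary]: This survey addresses the lack of a unified standard for diagnostic benchmarks in natural language understanding (NLU) evaluation, in particular the inconsistencies among existing benchmarks in how linguistic phenomena are categorized, named, and covered. It systematically reviews the diagnostic datasets in English, Arabic, and multilingual NLU benchmarks, analyzes the linguistic phenomena they cover, and uses the comparison to expose the limitations of current evaluation practice. The goal is to provide a basis for a global hierarchy of linguistic phenomena and to motivate an evaluation standard for NLU diagnostics, analogous to ISO standards in industry.
Link: https://arxiv.org/abs/2507.20419
Authors: Khloud AL Jallad,Nada Ghneim,Ghaida Rebdawi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: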
Abstract:Natural Language Understanding (NLU) is a basic task in Natural Language Processing (NLP). The evaluation of NLU capabilities has become a trending research topic that attracts researchers in the last few years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets in order to evaluate the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostics datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and Multilingual NLU benchmarks, with a particular emphasis on their diagnostics datasets and the linguistic phenomena they covered. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks and providing in-depth error analysis. When highlighting the gaps in the state-of-the-art, we noted that there is no naming convention for macro and micro categories or even a standard set of linguistic phenomena that should be covered. Consequently, we formulated a research question regarding the evaluation metrics of the evaluation diagnostics benchmarks: “Why do not we have an evaluation standard for the NLU evaluation diagnostics benchmarks?” similar to ISO standard in industry. We conducted a deep analysis and comparisons of the covered linguistic phenomena in order to support experts in building a global hierarchy for linguistic phenomena in future. We think that having evaluation metrics for diagnostics evaluation could be valuable to gain more insights when comparing the results of the studied models on different diagnostics benchmarks.
[NLP-35] CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning
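[Summary]: This paper addresses the weak image-captioning performance of multilingual vision-language models on low- and mid-resource languages, which stems from limited multilingual training data and the cost of large-scale model parameterization. The proposed model, CONCAP, combines retrieved captions with image-specific concepts to strengthen the contextualization of the input image and to ground captioning across languages, reducing the dependence on large multilingual training sets and improving caption quality for low-resource languages.
Link: https://arxiv.org/abs/2507.20411
Authors: George Ibrahim,Rita Ramos,Yova Kementchedjhieva
Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); INESC-ID, Instituto Superior Técnico, University of Lisbon
Categories: Computation and Language (cs.CL)
Comments: Published as a conference paper at COLM 2025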
Abstract:Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset indicate that CONCAP enables strong performance on low- and mid-resource languages, with highly reduced data requirements. Our findings highlight the effectiveness of concept-aware retrieval augmentation in bridging multilingual performance gaps.
[NLP-36] Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations
[Summary]: This paper addresses the failure of conventional Chain-of-Thought (CoT) prompting when vision-language models (VLMs) must reason multimodally about social situations, where flat CoT does not bridge perception and norm-grounded judgment. The proposed Cognitive Chain-of-Thought (CoCoT) structures VLM reasoning into three cognitively inspired stages (perception, situation, and norm), which improves performance on intent disambiguation, commonsense reasoning, and safety tasks and enhances interpretability and social awareness.
Link: https://arxiv.org/abs/2507.20409
Authors: Eunkyu Park,Wesley Hanwen Deng,Gunhee Kim,Motahhare Eslami,Maarten Sap
Affiliations: Seoul National University; Human-Computer Interaction Institute, Carnegie Mellon University; Language Technologies Institute, Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Under review; 17 pages
Abstract:Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge, all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
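The three-stage scaffold can be pictured as a prompt template. The sketch below is a rough illustration, assuming the stages are simply listed in order for the model to fill in; the stage instructions and the function name are hypothetical wording, not the prompts used in the paper.

```python
# Stage names follow the abstract; the instruction text is made up for illustration.
STAGES = [
    ("Perception", "Describe what is directly observable in the image."),
    ("Situation", "Explain the social situation these observations suggest."),
    ("Norm", "State which social norms apply and give the final judgment."),
]

def build_cocot_prompt(question):
    """Build a text prompt that asks for an answer in three ordered stages."""
    lines = [f"Question: {question}", "Answer in three stages:"]
    for number, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"{number}. {name}: {instruction}")
    return "\n".join(lines)

prompt = build_cocot_prompt("Is it appropriate to take a photo here?")
print(prompt)
```

A flat CoT prompt would instead ask for step-by-step reasoning with no fixed order; the staged version forces observations to be stated before any normative judgment is made.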
[NLP-37] Length Representations in Large Language Models
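[Summary]: This paper asks how large language models (LLMs), which can control output sequence length in instruction-based settings, encode and regulate length information in their internal representations. The key finding is that multi-head attention mechanisms are critical in determining output length and that length can be adjusted in a disentangled manner: scaling specific hidden units changes the output length without losing the informativeness of the generated text, indicating that length information is partially disentangled from semantic information.
Link: https://arxiv.org/abs/2507.20398
Authors: Sangjun Moon,Dasom Choi,Jingun Kwon,Hidetaka Kamigaito,Manabu Okumura
Affiliations: Chungnam National University; Nara Institute of Science and Technology; Institute of Science Tokyo
Categories: Computation and Language (cs.CL)
Comments: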
Abstract:Large language models (LLMs) have shown remarkable capabilities across various tasks, that are learned from massive amounts of text-based data. Although LLMs can control output sequence length, particularly in instruction-based settings, the internal mechanisms behind this control have been unexplored yet. In this study, we provide empirical evidence on how output sequence length information is encoded within the internal representations in LLMs. In particular, our findings show that multi-head attention mechanisms are critical in determining output sequence length, which can be adjusted in a disentangled manner. By scaling specific hidden units within the model, we can control the output sequence length without losing the informativeness of the generated text, thereby indicating that length information is partially disentangled from semantic information. Moreover, some hidden units become increasingly active as prompts become more length-specific, thus reflecting the model’s internal awareness of this attribute. Our findings suggest that LLMs have learned robust and adaptable internal mechanisms for controlling output length without any external control.
[NLP-38] RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
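[Summary]: This paper addresses a shortcoming in how the role-playing abilities of large language models (LLMs) are evaluated: existing benchmarks are mostly character-centric, reduce user-character interaction to isolated QA tasks, and fail to reflect real-world use. The proposed RMTBench is a user-centric, bilingual (Chinese and English) role-playing benchmark with 80 diverse characters and over 8,000 dialogue rounds. Dialogues are constructed from explicit user motivations, and a multi-turn dialogue simulation mechanism with LLM-based scoring captures the complex intentions in user-character conversations. By shifting the focus from character background to user intention fulfillment, the benchmark narrows the gap between academic evaluation and practical deployment requirements.
Link: https://arxiv.org/abs/2507.20352
Authors: Hao Xiang,Tianyi Tang,Yang Su,Bowen Yu,An Yang,Fei Huang,Yichang Zhang,Yaojie Lu,Hongyu Lin,Xianpei Han,Jingren Zhou,Junyang Lin,Le Sun
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; Alibaba Group; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: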
Abstract:Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a character-centric approach, simplify user-character interactions to isolated QA tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intention of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon.
[NLP-39] DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
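[Summary]: This paper addresses the difficulty of visualizing movements inside the vocal tract during speech, a need that arises in phonetics teaching and speech therapy. The proposed DYNARTmo is a dynamic articulatory model built on the UK-DYNAMO framework. It integrates the principles of articulatory underspecification, segmental and gestural control, and coarticulation, and drives six key articulators with ten continuous and six discrete control parameters to simulate vocalic and consonantal configurations in the two-dimensional midsagittal plane.
Link: https://arxiv.org/abs/2507.20343
Authors: Bernd J. Kröger
Affiliations: RWTH Aachen University; Kröger Lab Belgium
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 29 references, 2 figures, supplementary material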
Abstract:We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
[NLP-40] Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation
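[Summary]: This paper addresses the gap in natural language processing (NLP) capability between Dialectal Arabic (DA) and Modern Standard Arabic (MSA), aiming to improve DA-MSA machine translation in low-resource and computationally constrained settings. It makes two contributions. First, a systematic evaluation of training-free prompting across six large language models (LLMs) finds that few-shot prompting performs best, with GPT-4o achieving the highest performance under all prompting settings. Second, a resource-efficient fine-tuning pipeline fine-tunes Gemma2-9B with 4-bit quantization, cutting memory usage by 60% with less than 1% performance loss; joint multi-dialect models outperform single-dialect models by more than 10% CHrF++. Together these results offer a practical path to high-quality, low-cost, dialect-inclusive Arabic NLP.
Link: https://arxiv.org/abs/2507.20301
Authors: Abdullah Alabdullah,Lifeng Han,Chenghua Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: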
Abstract:Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide limits access to digital services and educational resources and impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: a comprehensive evaluation of training-free prompting techniques, and the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. GPT-4o achieved the highest performance across all prompting settings. For fine-tuning, a quantized Gemma2-9B model achieved a CHrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% CHrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.
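The scores above are reported in CHrF++, a character n-gram F-score extended with word n-grams. The sketch below is a deliberately simplified character-only variant meant to show the idea; it is not the metric implementation used in the paper. The defaults (`max_n=6`, `beta=2.0`) and the whitespace handling are assumptions chosen for illustration, so its numbers will not match a reference implementation.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-beta score in [0, 100], averaged over n = 1..max_n.

    Orders for which either string is too short to have any n-gram are skipped.
    Word n-grams, which the "++" variant adds, are not included here.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(simple_chrf("the cat sat", "the cat sat"))   # identical strings score 100
print(simple_chrf("the cat sat", "a dog stood"))   # partial character overlap scores lower
```

Character-level matching gives partial credit for shared stems and affixes, which is one reason such metrics are often preferred for morphologically rich languages such as Arabic.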
[NLP-41] SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration
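[Summary]: This paper addresses the automation of complex workflows in scientific research, that is, how to integrate and orchestrate many specialized computational tools for cross-disciplinary tasks. Large language models (LLMs) show promise in tool automation but struggle to integrate and orchestrate multiple tools. The proposed SciToolAgent is an LLM-powered agent built on a scientific tool knowledge graph; it selects and executes tools through graph-based retrieval-augmented generation and includes a comprehensive safety-checking module to ensure responsible and ethical tool use, improving the automation and accessibility of complex scientific workflows.
Link: https://arxiv.org/abs/2507.20280
Authors: Keyan Ding,Jing Yu,Junjie Huang,Yuchen Yang,Qiang Zhang,Huajun Chen
Affiliations: Zhejiang University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 6 figures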
Abstract:Scientific research increasingly relies on specialized computational tools, yet effectively utilizing these tools demands substantial domain expertise. While Large Language Models (LLMs) show promise in tool automation, they struggle to seamlessly integrate and orchestrate multiple tools for complex scientific workflows. Here, we present SciToolAgent, an LLM-powered agent that automates hundreds of scientific tools across biology, chemistry, and materials science. At its core, SciToolAgent leverages a scientific tool knowledge graph that enables intelligent tool selection and execution through graph-based retrieval-augmented generation. The agent also incorporates a comprehensive safety-checking module to ensure responsible and ethical tool usage. Extensive evaluations on a curated benchmark demonstrate that SciToolAgent significantly outperforms existing approaches. Case studies in protein engineering, chemical reactivity prediction, chemical synthesis, and metal-organic framework screening further demonstrate SciToolAgent’s capability to automate complex scientific workflows, making advanced research tools accessible to both experts and non-experts.
[NLP-42] What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
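[Summary]: This paper investigates the poorly understood internal language processing of multilingual large language models (LLMs), comparing a model trained on balanced multilingual data (Aya-23-8B) with predominantly monolingual models (Llama 3 and Chinese-LLaMA-2) on code-mixed, cloze, and translation tasks. Using logit lens and neuron specialization analyses, it finds that Aya-23 activates typologically related language representations during translation instead of relying on a single pivot language; that neuron activation patterns for code-mixed input are shaped more by the base language than by the mixed-in one; and that language-specific neurons concentrate in the final layers, unlike earlier findings on decoder-only models. A neuron overlap analysis further shows that script similarity and typological relations affect processing across model types. These findings clarify how multilingual training shapes internal representations and inform future cross-lingual transfer research.
Link: https://arxiv.org/abs/2507.20279
Authors: Katharina Trinley,Toshiki Nakai,Tatiana Anikina,Tanja Baeumel
Affiliations: Saarland University; German Research Center for Artificial Intelligence (DFKI)
Categories: Computation and Language (cs.CL)
Comments: pre-print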
Abstract:Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23’s language-specific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron overlap analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research.
[NLP-43] MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning
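[Summary]: This paper addresses the difficulty large language models (LLMs) have in using sequential environmental feedback (EF) signals, such as natural-language evaluations, to learn feedback-independent chain-of-thought (CoT) reasoning. Existing methods either convert EF into scalar rewards, which discards rich contextual information, or rely on refinement datasets, which fail to exploit the multi-step, discrete nature of EF interactions. The proposed MoL-RL training paradigm integrates multi-step EF signals through a dual-objective optimization framework. Mixture-of-Losses (MoL) continual training decouples domain-specific EF signals (optimized with a cross-entropy loss) from general language capabilities (preserved with a Kullback-Leibler divergence term), and GRPO-based post-training distills sequential EF interactions into single-step inference. The resulting model reasons robustly without external feedback loops and achieves state-of-the-art results on mathematical reasoning and code generation benchmarks.
Link: https://arxiv.org/abs/2507.20278
Authors: Kang Yang,Jingxue Chen,Qingkun Tang,Tianxiang Zhang,Qianchun Lu
Affiliations: Dalian University of Technology; ZTE Corporation
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures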
Abstract:Large language models (LLMs) face significant challenges in effectively leveraging sequential environmental feedback (EF) signals, such as natural language evaluations, for feedback-independent chain-of-thought (CoT) reasoning. Existing approaches either convert EF into scalar rewards, losing rich contextual information, or employ refinement datasets, failing to exploit the multi-step and discrete nature of EF interactions. To address these limitations, we propose MoL-RL, a novel training paradigm that integrates multi-step EF signals into LLMs through a dual-objective optimization framework. Our method combines MoL (Mixture-of-Losses) continual training, which decouples domain-specific EF signals (optimized via cross-entropy loss) and general language capabilities (preserved via Kullback-Leibler divergence), with GRPO-based post-training to distill sequential EF interactions into single-step inferences. This synergy enables robust feedback-independent reasoning without relying on external feedback loops. Experimental results on mathematical reasoning (MATH-500, AIME24/AIME25) and code generation (CodeAgent-Test) benchmarks demonstrate that MoL-RL achieves state-of-the-art performance with the Qwen3-8B model, while maintaining strong generalization across model scales (Qwen3-4B). This work provides a promising approach for leveraging multi-step textual feedback to enhance LLMs’ reasoning capabilities in diverse domains.
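The Mixture-of-Losses idea (cross-entropy on domain-specific feedback data, KL divergence to preserve general capabilities) can be sketched for a single next-token distribution. This is a toy illustration under stated assumptions: the abstract does not give the KL direction, the weighting between the two terms, or the batching scheme, so the choices below (KL from a frozen reference model to the trained model, equal weights, per-example routing by a `source` field) are hypothetical.

```python
import math

def cross_entropy(model_probs, target_index):
    """Negative log-likelihood of the target token under the model."""
    return -math.log(model_probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mol_loss(batch):
    """Average a per-example loss chosen by data source.

    Domain examples (feedback data) use cross-entropy against the target token.
    General examples use KL between a frozen reference model and the current model.
    """
    total = 0.0
    for example in batch:
        if example["source"] == "domain":
            total += cross_entropy(example["model_probs"], example["target"])
        else:
            total += kl_divergence(example["reference_probs"], example["model_probs"])
    return total / len(batch)

batch = [
    {"source": "domain", "model_probs": [0.5, 0.5], "target": 0},
    {"source": "general", "model_probs": [0.7, 0.3], "reference_probs": [0.7, 0.3]},
]
loss = mol_loss(batch)
print(loss)
```

In this toy batch the general example contributes zero loss because the model still matches the reference distribution, so only the domain example pulls on the parameters; that is the decoupling the abstract describes.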
[NLP-44] EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms
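[Summary]: This paper addresses a core problem in how conversational models build inclusive representations: existing methods achieve only surface-level inclusion by relying on explicit mentions of user demographics or behavioral attributes of social groups, and they overlook the subtle, implicit opinions expressed in conversations, so model outputs may reinforce stereotypes or drift from normative social values. The proposed alignment evaluation framework uses stance as a proxy: it models the stance of responses to capture implicit opinions and to check whether outputs align with normative social views, enabling a more considerate and reflective representation of diverse social viewpoints. The framework is evaluated with positive-unlabeled (PU) online learning and with instruction-tuned language models to assess post-training alignment, offering a way to identify and reduce the (mis)representation of implicit opinions.
Link: https://arxiv.org/abs/2507.20264
Authors: Abeer Aldayel,Areej Alokaili
Affiliations: King Saud University, College of Computer and Information Sciences
Categories: Computation and Language (cs.CL)
Comments: Under review for publication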
Abstract:Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a lens on how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.
[NLP-45] Post-Completion Learning for Language Models
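[Summary]: This paper addresses the fact that current language model training stops learning at the end-of-sequence (eos) token and so ignores the sequence space after the output is complete, which limits gains in reasoning and self-evaluation. The proposed Post-Completion Learning (PCL) framework has the model continue to generate self-assessments and reward predictions after its output during training, while inference still stops at the eos token to preserve deployment efficiency. A white-box reinforcement learning method lets the model evaluate its own output according to the reward rules and aligns the resulting score with the reward functions. Dual-track SFT optimizes reasoning and evaluation capabilities and is mixed with RL training for multi-objective hybrid optimization, improving output quality without sacrificing inference efficiency.
Link: https://arxiv.org/abs/2507.20252
Authors: Xiang Fei,Siqi Wang,Shu Wei,Yuxiang Nie,Wei Shi,Hao Feng,Can Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: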
Abstract:Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (eos) token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion, to enhance both the reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: let the model evaluate the output content according to the reward rules, then calculate and align the score with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mixed it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.
[NLP-46] Modeling Professionalism in Expert Questioning through Linguistic Differentiation
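[Summary]: This paper addresses how to quantify and model professionalism, a crucial but underexplored dimension of expert communication, in high-stakes domains such as finance. It introduces a novel annotation framework that quantifies structural and pragmatic elements of financial analyst questions (such as discourse regulators, prefaces, and request types) and builds two datasets from human-authored and LLM-generated questions: one annotated for perceived professionalism and one labeled by question origin. The same linguistic features correlate strongly with both human judgments of professionalism and question origin, suggesting a shared stylistic foundation. A classifier trained only on these interpretable features outperforms Gemini-2.0 and SVM baselines at distinguishing expert-authored questions, indicating that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.
Link: https://arxiv.org/abs/2507.20249
Authors: Giulia D’Agostino,Chung-Chi Chen
Affiliations: Università della Svizzera italiana; National Institute of Advanced Industrial Science and Technology
Categories: Computation and Language (cs.CL)
Comments: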
Abstract:Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.
[NLP-47] Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
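[Summary]: This paper addresses two weaknesses of current LLM-based mental health support: it lacks realism in simulating specialized psychotherapy, and it cannot quantify therapeutic progress. In particular, existing approaches do not capture the "Innovative Moments" (IMs) of narrative therapy, the critical narrative shifts in client speech that signal progress. The proposed framework has two components. INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate, expert-like responses. IMA (Innovative Moment Assessment) is a therapy-centric evaluation method that quantifies effectiveness by tracking IMs. Experiments show that the framework outperforms standard LLMs in therapeutic quality and depth with both simulated clients and human participants.
Link: https://arxiv.org/abs/2507.20241
Authors: Yi Feng,Jiaqi Wang,Wenxuan Zhang,Zhuang Chen,Yutong Shen,Xiyao Xiao,Minlie Huang,Liping Jing,Jian Yu
Affiliations: Beijing Jiaotong University; Tsinghua University; Tencent Jarvis Lab; Singapore University of Technology and Design; Central South University; Lingxin AI; CoAI Group, DCST, IAI, BNRIST, Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: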
Abstract:Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses. Second, IMA (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking “Innovative Moments” (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that INT consistently outperforms standard LLMs in therapeutic quality and depth. We further demonstrate the effectiveness of INT in synthesizing high-quality support conversations to facilitate social applications.
[NLP-48] Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation
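[Quick read]: This paper addresses how news recommendation systems can effectively model multi-view news representations together with the dynamics of user interests, in particular the difference between short-term and long-term preferences. Existing methods usually rely on single-view news features (such as titles or categories) or fail to capture user preferences fully across time scales. The key to the solution is a hybrid framework, Co-NAML-LSTUR, which combines NAML (attentive multi-view news modeling) with LSTUR (long- and short-term user representations) and uses BERT-based word embeddings to strengthen semantic feature extraction, with the aim of more accurate personalized recommendation.
Link: https://arxiv.org/abs/2507.20210
Authors: Minh Hoang Nguyen,Thuat Thien Nguyen,Minh Nhat Ta
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures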
Abstract:News recommendation systems play a vital role in mitigating information overload by delivering personalized news content. A central challenge is to effectively model both multi-view news representations and the dynamic nature of user interests, which often span both short- and long-term preferences. Existing methods typically rely on single-view features of news articles (e.g., titles or categories) or fail to comprehensively capture user preferences across time scales. In this work, we propose Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news modeling and LSTUR for capturing both long- and short-term user representations. Our model also incorporates BERT-based word embeddings to enhance semantic feature extraction. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Experimental results show that Co-NAML-LSTUR achieves substantial improvements over most state-of-the-art baselines on MIND-small and MIND-large, respectively. These results demonstrate the effectiveness of combining multi-view news representations with dual-scale user modeling. The implementation of our model is publicly available at this https URL.
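[NLP-49] IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs
[Quick read]: This paper addresses the limits of evaluating large language models (LLMs) by single benchmark scores: such scores reveal little about how tasks relate to one another, what they share, or how they differ, and they do not give a full picture of a model's abilities. The key to the solution is factor analysis, which is applied to a comprehensive leaderboard of 44 tasks to identify the latent skills that drive model performance. This gives a more structured view of how capabilities are distributed and supports practical tools for identifying redundant tasks, aiding model selection, and profiling models along each latent skill.
Link: https://arxiv.org/abs/2507.20208
Authors: Aviya Maimon,Amir DN Cohen,Gal Vishne,Shauli Ravfogel,Reut Tsarfaty
Affiliations: Bar-Ilan University; Columbia University; New York University; OriginAI
Categories: Computation and Language (cs.CL)
Comments: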
Abstract:Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model’s overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models’ wholistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
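The redundancy idea can be illustrated with a much simpler stand-in than the factor analysis the abstract describes. The sketch below only computes pairwise Pearson correlations between task score columns and flags highly correlated pairs; the function names, the toy score matrix, and the 0.95 threshold are all hypothetical and are not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_task_pairs(scores, threshold=0.95):
    """Return index pairs of tasks whose scores are almost collinear.

    scores[m][t] is the score of model m on task t (toy layout, assumed).
    """
    n_tasks = len(scores[0])
    columns = [[row[t] for row in scores] for t in range(n_tasks)]
    pairs = []
    for i in range(n_tasks):
        for j in range(i + 1, n_tasks):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                pairs.append((i, j))
    return pairs

# Toy data: 4 models x 3 tasks; task 1 is a linear rescaling of task 0.
scores = [
    [0.2, 1.4, 0.9],
    [0.4, 1.8, 0.1],
    [0.6, 2.2, 0.8],
    [0.8, 2.6, 0.3],
]
pairs = redundant_task_pairs(scores)
```

Correlation only detects pairwise redundancy; factor analysis goes further by explaining all task scores through a small number of shared latent factors.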
[NLP-50] Diversity-Enhanced Reasoning for Subjective Questions
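[Quick read]: This paper addresses homogeneous reasoning in large reasoning models (LRMs) on subjective tasks, which arises from reliance on a single ground-truth answer and limits performance on open-ended, multi-perspective questions. The key to the solution is the MultiRole-R1 framework: an unsupervised data construction pipeline generates reasoning chains that incorporate multiple role perspectives, and reinforcement learning via Group Relative Policy Optimization (GRPO) uses a reward that combines the verifiable reward with diversity rewards (perspective diversity and lexical diversity). This improves both accuracy and diversity on subjective reasoning tasks and reveals a positive relation between reasoning diversity and accuracy.
Link: https://arxiv.org/abs/2507.20187
Authors: Yumeng Wang,Zhiyuan Fan,Jiayu Liu,Yi R. Fung
Affiliations: Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: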
Abstract:Large reasoning models (LRM) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks, such as math reasoning and coding. However, their effectiveness on subjective questions that may have different responses from different perspectives is still limited by a tendency towards homogeneous reasoning, introduced by the reliance on a single ground truth in supervised fine-tuning and verifiable reward in reinforcement learning. Motivated by the finding that increasing role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives, to improve the accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains that incorporate diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, by taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we successfully promote perspective diversity and lexical diversity, uncovering a positive relation between reasoning diversity and accuracy. Our experiment on six benchmarks demonstrates MultiRole-R1’s effectiveness and generalizability in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
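As a rough illustration of reward shaping with a diversity term, the sketch below adds a lexical-diversity score to a verifiable reward and then normalizes rewards within a sampled group, which is the group-relative step in GRPO. The diversity proxy (unique-token ratio), the weight `lam`, and the toy responses are assumptions made here; the paper's actual reward functions are not given in the abstract.

```python
import statistics

def distinct_ratio(text):
    """Toy lexical-diversity proxy: share of unique whitespace tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def shaped_rewards(verifiable, diversity, lam=0.5):
    """Verifiable reward plus a weighted diversity bonus (lam is hypothetical)."""
    return [v + lam * d for v, d in zip(verifiable, diversity)]

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of sampled responses to the same subjective question.
responses = [
    "yes yes yes yes",
    "a teacher may agree while a parent may not",
    "no",
]
verifiable = [1.0, 1.0, 0.0]  # 1.0 if the final answer matches the label
diversity = [distinct_ratio(r) for r in responses]
rewards = shaped_rewards(verifiable, diversity)
advantages = group_relative_advantages(rewards)
```

Among the two correct responses, the more varied one receives the larger advantage, so the policy update favors it.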
[NLP-51] SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding
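[Quick read]: This paper addresses the weak modeling of user intention within e-commerce purchase sessions: prior work does not capture and exploit the implicit intentions users show while browsing, and there is no dataset or benchmark for modeling intention explicitly. The key to the solution is the concept of an intention tree and a multimodal benchmark, SessionIntentBench, built from about 1.95 million intention entries and more than 13 million tasks, which together form a scalable framework for mining and evaluating intentions. A human-annotated gold set is used to check model performance. Experiments show that current large language models (LLMs) struggle to understand intention in complex session settings, and that injecting intention information improves their performance.
Link: https://arxiv.org/abs/2507.20185
Authors: Yuqi Yang,Weiqi Wang,Baixuan Xu,Wei Fan,Qing Zong,Chunkit Chan,Zheye Deng,Xin Liu,Yifan Gao,Changlong Yu,Chen Luo,Yang Li,Zheng Li,Qingyu Yin,Bing Yin,Yangqiu Song
Affiliations: HKUST; Amazon.com Inc
Categories: Computation and Language (cs.CL)
Comments: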
Abstract:Session history is a common way of recording user interaction behaviors throughout a browsing activity with multiple products. For example, if a user clicks a product webpage and then leaves, it might be because there are certain features that don’t satisfy the user, which serve as an important indicator of on-the-spot user preferences. However, all prior works fail to capture and model customer intention effectively because information is insufficiently exploited and only surface information such as descriptions and titles is used. There is also a lack of data and a corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs’ capability on understanding inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit the existing session data for customer intention understanding. We conduct human annotations to collect ground-truth labels for a subset of collected data to form an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize the intention across the complex session setting. Further analysis shows that injecting intention enhances LLMs’ performance.
[NLP-52] SGPO: Self-Generated Preference Optimization based on Self-Improver
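[Quick read]: This paper addresses the reliability problems that arise when large language models (LLMs) are deployed without alignment to human preferences, and in particular the distribution shift and limited scalability of conventional alignment methods, which depend on human-annotated preference data and off-policy learning. The key to the solution is Self-Generated Preference Optimization (SGPO), a framework built on a self-improver. The improver and the policy are unified into a single model, and the improver uses supervised fine-tuning (SFT) outputs as references to make incremental yet discernible improvements to the current responses. The resulting preference data is used for direct preference optimization (DPO), giving closed-loop alignment training that needs no external preference data.
Link: https://arxiv.org/abs/2507.20181
Authors: Hyeonji Lee,Daejin Jo,Seohwan Yun,Sungwoong Kim
Affiliations: Korea University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: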
Abstract:Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make incremental yet discernible improvements to the current responses by referencing supervised fine-tuning outputs. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods without using external preference data.
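[NLP-53] Goal Alignment in LLM-Based User Simulators for Conversational AI
[Quick read]: This paper addresses the difficulty current large language models (LLMs) have in sustaining goal-oriented behavior when used as user simulators: across multi-turn conversations they do not track goal progress reliably, which undermines their reliability in downstream applications. The key to the solution is User Goal State Tracking (UGST), a framework that tracks how the user's goals progress during a conversation, and a three-stage methodology built on it that lets user simulators track goal progression autonomously and generate goal-aligned responses. With a set of evaluation metrics for goal alignment, the authors report substantial improvements on two benchmarks, MultiWOZ 2.4 and τ-Bench.
Link: https://arxiv.org/abs/2507.20152
Authors: Shuhaib Mehri,Xiaocheng Yang,Takyoung Kim,Gokhan Tur,Shikib Mehri,Dilek Hakkani-Tür
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: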
Abstract:User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multi-turn conversations–a critical limitation that compromises their reliability in downstream applications. We introduce User Goal State Tracking (UGST), a novel framework that tracks user goal progression throughout conversations. Leveraging UGST, we present a three-stage methodology for developing user simulators that can autonomously track goal progression and reason to generate goal-aligned responses. Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and τ-Bench). Our contributions address a critical gap in conversational AI and establish UGST as an essential framework for developing goal-aligned user simulators.
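[NLP-54] The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models
[Quick read]: This paper addresses the brittle and unstable policies that reinforcement learning (RL) can produce in large language models (LLMs) and large reasoning models (LRMs), with failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine trustworthiness and safety. Existing remedies are mostly empirical heuristics without a unified theoretical explanation. The core of the solution is a rigorous mathematical framework for analyzing the stability of the map from a reward function to the optimal policy. Its key finding is that policy brittleness often stems from non-unique optimal actions: when a reasoning task admits several valid traces, optimizing an incomplete or noisy reward yields behavior that is rational but unstable. The framework extends to multi-reward RL, where stability is governed by an "effective reward" aggregation mechanism, and it proves that entropy regularization restores stability at the cost of increased stochasticity. The theory offers a unified view for understanding and mitigating recently observed deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry.
Link: https://arxiv.org/abs/2507.20150
Authors: Xingcheng Xu
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: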
Abstract:Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an “effective reward” aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.
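The instability described above can be seen in a one-step toy problem with two nearly tied actions. The sketch below is an illustration written for this digest, not the paper's framework: `greedy_policy` stands for the unregularized optimum, and `softmax_policy` for the entropy-regularized optimum, which in the single-step case is a softmax of rewards over a temperature. All names and numbers are hypothetical.

```python
import math

def greedy_policy(rewards):
    """Unregularized optimum: all probability on one best action."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]

def softmax_policy(rewards, temperature=1.0):
    """Entropy-regularized optimum for a one-step problem."""
    top = max(rewards)
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def l1_gap(p, q):
    """L1 distance between two action distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two actions that are almost tied; a tiny reward perturbation swaps them.
base = [1.0, 1.0 - 1e-6]
perturbed = [1.0 - 1e-6, 1.0]

greedy_gap = l1_gap(greedy_policy(base), greedy_policy(perturbed))
soft_gap = l1_gap(softmax_policy(base), softmax_policy(perturbed))
```

A reward change of about 1e-6 moves the greedy policy by the maximum possible L1 distance of 2, while the softmax policy barely moves; the price is that the softmax policy stays close to uniform over the tied actions.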
[NLP-55] Multi-Agent Interactive Question Generation Framework for Long Document Understanding
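[Quick read]: This paper addresses the performance drop of Large Vision-Language Models (LVLMs) on long-context Document Understanding (DU), which stems from scarce training data and the high cost of annotation, especially for low-resource languages such as Arabic. The key to the solution is a fully automated multi-agent interactive framework that efficiently generates high-quality single-page and multi-page question-answer (QA) pairs for English and Arabic documents spanning hundreds of pages across diverse domains, with the aim of improving LVLMs' long-context understanding. Experiments show that the resulting AraEngLongBench dataset is challenging for major open- and closed-source LVLMs.
Link: https://arxiv.org/abs/2507.20145
Authors: Kesen Wang,Daulet Toibazar,Abdulrahman Alfulayt,Abdulaziz S. Albadawi,Ranya A. Alkahtani,Asma A. Ibrahim,Haneen A. Alhomoud,Sherif Mohamed,Pedro J. Moreno
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: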
Abstract:Document Understanding (DU) in long-contextual scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach efficiently generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results in this work have shown that our generated English and Arabic questions (AraEngLongBench) are quite challenging to major open- and closed-source LVLMs. The code and data proposed in this work can be found in this https URL. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.
[NLP-56] Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG KDD
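[Quick read]: This paper addresses the tendency of modern Vision Language Models (VLMs) to hallucinate when handling egocentric images, long-tail entities, and complex multi-hop questions, which seriously harms factual accuracy in real-world use. The key to the solution is a multi-stage framework that prioritizes factual accuracy and truthfulness: a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathway generation strategy, and a post-hoc verification module. This conservative strategy minimizes hallucinations, in line with the competition metric's strict requirement for reliable answers.
Link: https://arxiv.org/abs/2507.20136
Authors: Baiyu Chen,Wilson Wongso,Xiaoqian Hu,Yue Tan,Flora Salim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: KDD Cup 2025 Meta CRAG-MM Challenge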
Abstract:This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, a dual-pathways generation and a post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition’s scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at this https URL .
[NLP-57] Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering
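[Quick read]: This paper addresses semantic drift in prompt optimization for text-to-image generation, which arises because Direct Preference Optimization (DPO) has no semantic-consistency constraint. DPO is lightweight and off-policy, but its token-level regularization cannot stop semantically inconsistent prompts from receiving high preference scores and drifting away from the user's original intent. The key to the solution is Sem-DPO, which softly weights the DPO loss by a factor that depends exponentially on the cosine distance, in embedding space, between the original prompt and the winning candidate, suppressing the learning signal from semantically mismatched prompts. The method keeps DPO's simplicity and efficiency, provides the first analytical bound on semantic drift for preference-tuned prompt generators, and improves CLIP similarity and human-preference scores on several benchmarks.
Link: https://arxiv.org/abs/2507.20133
Authors: Anas Mohamed,Azal Ahmad Khan,Xinran Wang,Ahmad Faraz Khan,Shuwen Ge,Saman Bahzad Khan,Ayaan Ahmad,Ali Anwar
Affiliations: University of Minnesota; Virginia Tech; Xi’an University of Technology; Lahore University of Management Sciences; University of California, Santa Cruz
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: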
Abstract:Generative AI can now synthesize strikingly realistic images from text, yet output quality remains highly sensitive to how prompts are phrased. Direct Preference Optimization (DPO) offers a lightweight, off-policy alternative to RL for automatic prompt engineering, but its token-level regularization leaves semantic inconsistency unchecked as prompts that win higher preference scores can still drift away from the user’s intended meaning. We introduce Sem-DPO, a variant of DPO that preserves semantic consistency yet retains its simplicity and efficiency. Sem-DPO scales the DPO loss by an exponential weight proportional to the cosine distance between the original prompt and winning candidate in embedding space, softly down-weighting training signals that would otherwise reward semantically mismatched prompts. We provide the first analytical bound on semantic drift for preference-tuned prompt generators, showing that Sem-DPO keeps learned prompts within a provably bounded neighborhood of the original text. On three standard text-to-image prompt-optimization benchmarks and two language models, Sem-DPO achieves 8-12% higher CLIP similarity and 5-9% higher human-preference scores (HPSv2.1, PickScore) than DPO, while also outperforming state-of-the-art baselines. These findings suggest that strong flat baselines augmented with semantic weighting should become the new standard for prompt-optimization studies and lay the groundwork for broader, semantics-aware preference optimization in language models.
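The weighting described above can be sketched for a single preference pair. The block below uses the standard DPO loss and multiplies it by `exp(-alpha * cosine_distance)`; the exact functional form of the weight, the parameter `alpha`, and all function names are assumptions made for illustration, since the abstract only says the weight is exponential in the cosine distance and down-weights mismatched prompts.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

def sem_weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                          prompt_emb, winner_emb, alpha=1.0, beta=0.1):
    """DPO loss scaled by a semantic weight (assumed form: exp(-alpha * d))."""
    weight = math.exp(-alpha * cosine_distance(prompt_emb, winner_emb))
    return weight * dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)

# Toy pair: same log-probs, two different winner embeddings.
base = dpo_loss(-5.0, -7.0, -6.0, -6.5)
close = sem_weighted_dpo_loss(-5.0, -7.0, -6.0, -6.5, [1.0, 0.0], [1.0, 0.0])
far = sem_weighted_dpo_loss(-5.0, -7.0, -6.0, -6.5, [1.0, 0.0], [0.0, 1.0])
```

A winner whose embedding matches the original prompt keeps the full DPO loss, while an orthogonal winner contributes a loss scaled by `exp(-alpha)`, so it pulls on the model less.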
[NLP-58] AI-Driven Generation of Old English: A Framework for Low-Resource Languages
[Quick read]: This paper addresses the difficulty of applying modern natural language processing (NLP) techniques to Old English, a critically under-resourced language, which limits access to its cultural and linguistic heritage. The key to the solution is a scalable framework that combines parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA), data augmentation via backtranslation, and a dual-agent pipeline that separates content generation (in English) from translation (into Old English). On automated metrics, BLEU rises from a baseline of 26 to over 65, and expert assessment confirms the grammatical accuracy and stylistic fidelity of the generated text, offering a reusable path for revitalizing endangered languages.
Link: https://arxiv.org/abs/2507.20111
Authors: Rodrigo Gabriel Salazar Alva,Matías Nuñez,Cristian López,Javier Martín Arista
Affiliations: Universidad de Ingeniería y Tecnología (UTEC); Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET); Instituto de Investigaciones en Biodiversidad y Medioambiente (INIBIOMA); Universidad Nacional del Comahue; Universidad de La Rioja
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Preserving ancient languages is essential for understanding humanity’s cultural and linguistic heritage, yet Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts, addressing this gap. Our approach combines parameter-efficient fine-tuning (Low-Rank Adaptation, LoRA), data augmentation via backtranslation, and a dual-agent pipeline that separates the tasks of content generation (in English) and translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows significant improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment also confirms high grammatical accuracy and stylistic fidelity in the generated texts. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, effectively uniting AI innovation with the goals of cultural preservation.
[NLP-59] EcoTransformer: Attention without Multiplication
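[Quick read]: This paper addresses the high computational cost and energy consumption of scaled dot-product attention in the Transformer. The key to the solution is a new architecture, EcoTransformer, in which the output context vector is built by convolving the values with a Laplacian kernel, with attention scores computed from the L1 distance between queries and keys. The score computation avoids the matrix multiplication of conventional attention; the method performs on par with or better than dot-product attention on NLP, bioinformatics, and vision tasks while consuming significantly less energy.
Link: https://arxiv.org/abs/2507.20096
Authors: Xin Gao,Xingming Xu
Affiliations: York University; University of California Davis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 1 figure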
Abstract:The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy.
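A minimal single-head sketch of L1-distance attention follows, assuming the score for a query and key is the negative L1 distance divided by a `scale` parameter, so that the softmax weights are proportional to a Laplacian kernel `exp(-||q - k||_1 / scale)`. The `scale` parameter and the function names are hypothetical, and the paper's exact normalization is not stated in the abstract.

```python
import math

def l1_distance(q, k):
    """L1 (Manhattan) distance; uses only subtraction and absolute value."""
    return sum(abs(a - b) for a, b in zip(q, k))

def l1_attention(queries, keys, values, scale=1.0):
    """Attention whose scores come from L1 distances instead of dot products."""
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [-l1_distance(q, k) / scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # Laplacian kernel
        total = sum(weights)
        weights = [w / total for w in weights]
        # Context vector: kernel-weighted average of the values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim)])
    return outputs

# One query sitting on the first key and far from the second.
queries = [[0.0, 0.0]]
keys = [[0.0, 0.0], [10.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = l1_attention(queries, keys, values)
```

Only the score computation avoids multiplication in this sketch; the weighted average over the values still multiplies, which matches the abstract's wording that the score calculation is the part free of matrix multiplication.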
[NLP-60] ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models
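[Quick read]: This paper addresses the difficulty mainstream speech language models have in learning prosody during pre-training. Existing methods usually convert speech into discrete tokens before feeding them to a text LLM, which loses prosodic detail and prevents prosody processing capabilities from emerging through pre-training alone. The key to the solution is ProsodyLM, whose core idea is a new tokenization scheme: each utterance is first transcribed into text and then followed by a sequence of word-level prosody tokens, which retains more complete prosody information and is easier for text-based LLMs to interpret. Experiments show that with this design the model acquires diverse prosody processing capabilities through pre-training alone, such as recognizing contrastive focus, perceiving emotion, and maintaining prosody consistency in long contexts.
Link: https://arxiv.org/abs/2507.20091
Authors: Kaizhi Qian,Xulin Fan,Junrui Ni,Slava Shechtman,Mark Hasegawa-Johnson,Chuang Gan,Yang Zhang
Affiliations: IBM Research; MIT-IBM Watson AI Lab; University of Illinois at Urbana-Champaign; UMass Amherst
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: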
Abstract:Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information – we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
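[NLP-61] The Devil is in the EOS: Sequence Training for Detailed Image Captioning
[Quick read]: This paper addresses the lack of detail in image captions produced by vision-language models (VLMs), which tend to generate short, generic descriptions even with strong vision and language backbones. The authors trace this to a bias towards the end-of-sequence (EOS) token introduced during cross-entropy training, which ends generation prematurely. The key to the solution is an unsupervised method that reduces this EOS bias and so encourages longer, more detailed captions without complex reward functions or supervision. The method is simple, applies to any pretrained model, and is evaluated on three VLMs and three detailed captioning benchmarks.
Link: https://arxiv.org/abs/2507.20077
Authors: Abdelrahman Mohamed,Yova Kementchedjhieva
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to COLM 2025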
Abstract:Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model’s tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
[NLP-62] PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training
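[Quick read]: This paper addresses the dependence of inference-time alignment for large language models (LLMs) on a pre-trained reward model, whose training can be unstable and requires large amounts of human preference annotation. The key to the solution is PITA, a framework that integrates preference feedback directly into the LLM's token generation, with no fine-tuning of the original model and no pre-trained reward model. PITA learns a small preference-based guidance policy that adjusts token probabilities at inference time to align outputs with user preferences. The problem is framed as identifying an underlying preference distribution, solved through stochastic search and iterative refinement of the preference-based guidance model.
Link: https://arxiv.org/abs/2507.20067
Authors: Sarat Chandra Bobbili,Ujwal Dinesha,Dheeraj Narasimha,Srinivas Shakkottai
Affiliations: Texas A&M University; Inria
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: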
Abstract:Inference-time alignment enables large language models (LLMs) to generate outputs aligned with end-user preferences without further training. Recent post-training methods achieve this by using small guidance models to modify token generation during inference. These methods typically optimize a reward function KL-regularized by the original LLM taken as the reference policy. A critical limitation, however, is their dependence on a pre-trained reward model, which requires fitting to human preference feedback–a potentially unstable process. In contrast, we introduce PITA, a novel framework that integrates preference feedback directly into the LLM’s token generation, eliminating the need for a reward model. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without LLM fine-tuning, reducing computational cost and bypassing the pre-trained reward model dependency. The problem is framed as identifying an underlying preference distribution, solved through stochastic search and iterative refinement of the preference-based guidance model. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification, demonstrating its effectiveness in aligning LLM outputs with user preferences.
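To make the idea of modifying token probabilities at inference time concrete, the sketch below reweights a reference next-token distribution by `exp(score / beta)`, the closed form that a KL-regularized objective with the original LLM as reference yields. It is a generic illustration, not PITA's learned guidance policy: the per-token `guidance_scores`, the `beta` parameter, and the toy vocabulary are all hypothetical.

```python
import math

def guided_distribution(ref_probs, guidance_scores, beta=1.0):
    """Reweight reference token probabilities: p(t) ∝ ref(t) * exp(score(t) / beta).

    Tokens missing from guidance_scores get a score of 0; tokens with zero
    reference probability are dropped, since they stay at zero anyway.
    """
    logits = {t: math.log(p) + guidance_scores.get(t, 0.0) / beta
              for t, p in ref_probs.items() if p > 0.0}
    top = max(logits.values())
    exps = {t: math.exp(v - top) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Toy next-token distribution from the frozen reference LLM.
ref_probs = {"good": 0.5, "bad": 0.3, "okay": 0.2}
guided = guided_distribution(ref_probs, {"good": 1.0})
unchanged = guided_distribution(ref_probs, {})
```

With no guidance the reference distribution is returned unchanged, and a smaller `beta` lets the guidance move the distribution further from the reference model.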
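[NLP-63] RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation
[Quick read]: This paper addresses the unclear performance of retrieval-augmented generation (RAG) under realistic, diverse retrieval scenarios, in particular whether existing RAG systems are robust and adaptable when the knowledge base mixes many sources. The authors systematically evaluate several RAG configurations on MassiveDS, a large-scale multi-source datastore, and identify core limitations: retrieval mainly benefits smaller models, rerankers add little, no single retrieval source is consistently best, and large language models (LLMs) struggle to route queries across heterogeneous knowledge sources. The study argues that adaptive retrieval strategies are needed before deploying RAG in real-world settings.
Link: https://arxiv.org/abs/2507.20059
Authors: Ran Xu,Yuchen Zhuang,Yue Yu,Haoyu Wang,Wenqi Shi,Carl Yang
Affiliations: Emory University; Georgia Institute of Technology; SUNY Albany; UT Southwestern Medical Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in Progress. Code will be published at: this https URL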
Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved at inference time. While RAG demonstrates strong performance on benchmarks largely derived from general-domain corpora like Wikipedia, its effectiveness under realistic, diverse retrieval scenarios remains underexplored. We evaluated RAG systems using MassiveDS, a large-scale datastore with mixture of knowledge, and identified critical limitations: retrieval mainly benefits smaller models, rerankers add minimal value, and no single retrieval source consistently excels. Moreover, current LLMs struggle to route queries across heterogeneous knowledge sources. These findings highlight the need for adaptive retrieval strategies before deploying RAG in real-world settings. Our code and data can be found at this https URL.
[NLP-64] A Tensor-Based Compiler and a Runtime for Neuron-Level DNN Certifier Specifications
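[Quick read]: This paper addresses a semantic gap in the certification of deep neural networks (DNNs): the mathematical design of DNN certifiers is expressed at the neuron level, while implementations are optimized and executed at the tensor level, so developing new certifiers or adapting existing ones is complex and expertise-intensive. The key to the solution is a compiler framework that automatically translates neuron-level specifications into tensor-level implementations, using two core techniques: a stack-based intermediate representation (IR) that bridges the abstraction and execution levels, and a shape analysis that infers the implicit tensor operations and creates tensors in the minimal shape needed to simulate neuron-level semantics. To handle the sparsity that arises at runtime, the paper also introduces g-BCSR, a double-compression format that represents tensors as blocks of varying sizes, each possibly internally sparse. The abstract reports that this makes new certifiers easier to develop while keeping performance comparable to hand-optimized implementations.
Link: https://arxiv.org/abs/2507.20055
Authors: Avaljot Singh,Yamin Chandini Sarita,Aditya Mishra,Ishaan Goyal,Gagandeep Singh,Charith Mendis
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: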
Abstract:The uninterpretability of DNNs has led to the adoption of abstract interpretation-based certification as a practical means to establish trust in real-world systems that rely on DNNs. However, the current landscape supports only a limited set of certifiers, and developing new ones or modifying existing ones for different applications remains difficult. This is because the mathematical design of certifiers is expressed at the neuron level, while their implementations are optimized and executed at the tensor level. This mismatch creates a semantic gap between design and implementation, making manual bridging both complex and expertise-intensive – requiring deep knowledge in formal methods, high-performance computing, etc. We propose a compiler framework that automatically translates neuron-level specifications of DNN certifiers into tensor-based, layer-level implementations. This is enabled by two key innovations: a novel stack-based intermediate representation (IR) and a shape analysis that infers the implicit tensor operations needed to simulate the neuron-level semantics. During lifting, the shape analysis creates tensors in the minimal shape required to perform the corresponding operations. The IR also enables domain-specific optimizations as rewrites. At runtime, the resulting tensor computations exhibit sparsity tied to the DNN architecture. This sparsity does not align well with existing formats. To address this, we introduce g-BCSR, a double-compression format that represents tensors as collections of blocks of varying sizes, each possibly internally sparse. Using our compiler and g-BCSR, we make it easy to develop new certifiers and analyze their utility across diverse DNNs. Despite its flexibility, the compiler achieves performance comparable to hand-optimized implementations. 
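[NLP-65] K4: Online Log Anomaly Detection Via Unsupervised Typicality Learning
[Quick read]: This paper addresses three problems in existing log anomaly detection (LogAD) methods: slow detection, reliance on error-prone log parsing, and unrealistic evaluation protocols. The key to its solution is K^4, an unsupervised, parser-independent framework that uses efficient k-nearest neighbor (k-NN) statistics to map arbitrary log embeddings into compact four-dimensional descriptors (Precision, Recall, Density, Coverage), so that lightweight detectors can score anomalies accurately without retraining. The design reaches an AUROC of 0.995-0.999 and cuts computational cost sharply (training in under 4 seconds, inference latency as low as 4 μs).
Link: https://arxiv.org/abs/2507.20051
Authors: Weicong Chen, Vikash Singh, Zahra Rahmani, Debargha Ganguly, Mohsen Hariri, Vipin Chaudhary
Affiliations: Case Western Reserve University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: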
Abstract:Existing Log Anomaly Detection (LogAD) methods are often slow, depend on error-prone parsing, and use unrealistic evaluation protocols. We introduce K^4, an unsupervised and parser-independent framework for high-performance online detection. K^4 transforms arbitrary log embeddings into compact four-dimensional descriptors (Precision, Recall, Density, Coverage) using efficient k-nearest neighbor (k-NN) statistics. These descriptors enable lightweight detectors to accurately score anomalies without retraining. Using a more realistic online evaluation protocol, K^4 sets a new state-of-the-art (AUROC: 0.995-0.999), outperforming baselines by large margins while being orders of magnitude faster, with training under 4 seconds and inference as low as 4 μs.
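[NLP-66] Infogen: Generating Complex Statistical Infographics from Documents ACL
[Quick read]: This paper tackles the automatic generation of complex statistical infographics from text-heavy documents. Prior work can only produce simple charts and lacks both deep content understanding and the ability to lay out multiple sub-charts together. The key solution is Infogen, a two-stage framework: fine-tuned large language models (LLMs) first generate infographic metadata comprising the title, textual insights, and each sub-chart's data and alignment information, and that metadata is then converted into renderable infographic code, yielding complex infographics that are contextually accurate, insightful, and visually consistent.
Link: https://arxiv.org/abs/2507.20046
Authors: Akash Ghosh, Aparna Garimella, Pritika Ramu, Sambaran Bandyopadhyay, Sriparna Saha
Affiliations: Indian Institute of Technology Patna; Adobe Research
Categories: Computation and Language (cs.CL)
Comments: ACL Main 2025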
Abstract:Statistical infographics are powerful tools that simplify complex data into visually engaging and easy-to-understand formats. Despite advancements in AI, particularly with LLMs, existing efforts have been limited to generating simple charts, with no prior work addressing the creation of complex infographics from text-heavy documents that demand a deep understanding of the content. We address this gap by introducing the task of generating statistical infographics composed of multiple sub-charts (e.g., line, bar, pie) that are contextually accurate, insightful, and visually aligned. To achieve this, we define infographic metadata that includes its title and textual insights, along with sub-chart-specific details such as their corresponding data and alignment. We also present Infodat, the first benchmark dataset for text-to-infographic metadata generation, where each sample links a document to its metadata. We propose Infogen, a two-stage framework where fine-tuned LLMs first generate metadata, which is then converted into infographic code. Extensive evaluations on Infodat demonstrate that Infogen achieves state-of-the-art performance, outperforming both closed and open-source LLMs in text-to-statistical infographic generation.
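[NLP-67] FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression
[Quick read]: This paper addresses the performance bottleneck that large language models (LLMs) face in long-context tasks because of the high memory footprint and computational cost of the key-value (KV) cache. Existing compression strategies such as token eviction and learned projections often bias the retained information, for example by overemphasizing recent or high-attention tokens or by repeatedly degrading early context, and may require costly retraining. The key solution is FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a training-free KV cache compression framework that uses an Infinite-Window Fourier Transform (IWDFT) to move the KV cache into the frequency domain so that all tokens contribute equally to the compressed representation, preserving both early and recent context. A frequency-domain ablation study identifies the critical spectral components of each layer, enabling layer-wise, targeted compression. The method improves on existing approaches by up to 22% on the LongBench benchmark and shows better, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task.
Link: https://arxiv.org/abs/2507.20030
Authors: Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li
Affiliations: Case Western Reserve University
Categories: Computation and Language (cs.CL)
Comments: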
Abstract:The efficacy of Large Language Models (LLMs) in long-context tasks is often hampered by the substantial memory footprint and computational demands of the Key-Value (KV) cache. Current compression strategies, including token eviction and learned projections, frequently lead to biased representations (either by overemphasizing recent/high-attention tokens or by repeatedly degrading information from earlier context) and may require costly model retraining. We present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel, training-free KV cache compression framework that ensures unbiased information retention. FAEDKV operates by transforming the KV cache into the frequency domain using a proposed Infinite-Window Fourier Transform (IWDFT). This approach allows for the equalized contribution of all tokens to the compressed representation, effectively preserving both early and recent contextual information. A preliminary frequency ablation study identifies critical spectral components for layer-wise, targeted compression. Experiments on the LongBench benchmark demonstrate FAEDKV’s superiority over existing methods by up to 22%. In addition, our method shows superior, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task compared to compression-based approaches.
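[NLP-68] Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach
[Quick read]: This paper addresses the generalization of cross-domain text anomaly detection when labeled data is scarce, targeting sparse and variable anomaly types such as SMS spam, fake news, and hate speech. The key solution is a meta-learning framework that casts anomaly detection as few-shot binary classification and combines prototypical networks with domain resampling, using episodic training that mimics the task distribution so the model adapts quickly to unseen anomaly detection tasks. The method improves F1 and AUC when only a handful of labeled anomalies are available.
Link: https://arxiv.org/abs/2507.20019
Authors: Saurav Singla, Aarav Singla, Advik Gupta, Parnika Gupta
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages. PyTorch code for few-shot anomaly detection using meta-learning is available upon request or can be shared via GitHub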
Abstract:We propose a meta-learning framework for detecting anomalies in human language across diverse domains with limited labeled data. Anomalies in language, ranging from spam and fake news to hate speech, pose a major challenge due to their sparsity and variability. We treat anomaly detection as a few-shot binary classification problem and leverage meta-learning to train models that generalize across tasks. Using datasets from domains such as SMS spam, COVID-19 fake news, and hate speech, we evaluate model generalization on unseen tasks with minimal labeled anomalies. Our method combines episodic training with prototypical networks and domain resampling to adapt quickly to new anomaly detection tasks. Empirical results show that our method outperforms strong baselines in F1 and AUC scores. We also release the code and benchmarks to facilitate further research in few-shot text anomaly detection.
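The prototypical-network step named in the abstract can be illustrated with a minimal sketch. It is not the authors' PyTorch code: the two-dimensional embeddings, the `support` dictionary, and the function names are hypothetical, and a real system would obtain embeddings from a trained text encoder. Each class prototype is the mean of its few support embeddings, and a query is assigned to the nearest prototype.

```python
import math

def prototype(vectors):
    """Mean of a list of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, support):
    """support maps label -> list of embeddings; returns (label, probabilities)."""
    protos = {label: prototype(vecs) for label, vecs in support.items()}
    # Softmax over negative squared distances to each prototype.
    logits = {label: -sq_dist(query, p) for label, p in protos.items()}
    top = max(logits.values())
    exps = {label: math.exp(z - top) for label, z in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical few-shot episode: three normal and two anomalous support texts.
support = {
    "normal": [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]],
    "anomaly": [[1.0, 0.9], [0.9, 1.1]],
}
label, probs = classify([0.95, 1.0], support)
print(label, probs)
```

In episodic training, many such small support/query episodes would be sampled per domain so that the encoder learns embeddings in which this nearest-prototype rule works on unseen tasks.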
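[NLP-69] The Carbon Cost of Conversation Sustainability in the Age of Language Models
[Quick read]: This paper examines the environmental sustainability problems raised by the rapid growth of large language models (LLMs) in natural language processing (NLP), including high carbon emissions, water consumption, and growing electronic waste (e-waste). It notes that training a single LLM can have a carbon footprint equivalent to the annual emissions of hundreds of cars, that data-centre cooling worsens water scarcity in vulnerable regions, and that corporate greenwashing, redundant model development, and regulatory gaps amplify the harm, placing an unequal burden on marginalized communities in the Global South. The proposed way forward is coordinated action on several fronts: technical measures such as model pruning and quantum computing to improve energy efficiency; policy measures such as carbon taxes and mandatory emissions reporting; and a cultural shift toward necessity-driven rather than novelty-driven development, together with ethical accountability and global cooperation, so that AI development stays within planetary boundaries and yields equitable, transparent, and regenerative systems.
Link: https://arxiv.org/abs/2507.20018
Authors: Sayed Mahbub Hasan Amiri, Prasun Goswami, Md. Mainul Islam, Mohammad Shakhawat Hossen, Sayed Majhab Hasan Amiri, Naznin Akter
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 Pages, 5 Tables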
Abstract:Large language models (LLMs) like GPT-3 and BERT have revolutionized natural language processing (NLP), yet their environmental costs remain dangerously overlooked. This article critiques the sustainability of LLMs, quantifying their carbon footprint, water usage, and contribution to e-waste through case studies of models such as GPT-4 and energy-efficient alternatives like Mistral 7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of cars driven annually, while data centre cooling exacerbates water scarcity in vulnerable regions. Systemic challenges (corporate greenwashing, redundant model development, and regulatory voids) perpetuate harm, disproportionately burdening marginalized communities in the Global South. However, pathways exist for sustainable NLP: technical innovations (e.g., model pruning, quantum computing), policy reforms (carbon taxes, mandatory emissions reporting), and cultural shifts prioritizing necessity over novelty. By analysing industry leaders (Google, Microsoft) and laggards (Amazon), this work underscores the urgency of ethical accountability and global cooperation. Without immediate action, AI's ecological toll risks outpacing its societal benefits. The article concludes with a call to align technological progress with planetary boundaries, advocating for equitable, transparent, and regenerative AI systems that prioritize both human and environmental well-being.
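[NLP-70] VLQA: The First Comprehensive Large and High-Quality Vietnamese Dataset for Legal Question Answering
[Quick read]: This paper addresses the limited model performance in legal NLP for low-resource languages such as Vietnamese, which stems from the scarcity of annotated data. The core challenge is the lack of high-quality, large-scale legal corpora, which is especially acute for Vietnamese legal text. The key solution is to build and release VLQA, a comprehensive, high-quality annotated dataset for the Vietnamese legal domain covering legal information retrieval and question answering, and to validate it through experiments with state-of-the-art models, providing a reliable data foundation and training resource for Vietnamese legal text processing.
Link: https://arxiv.org/abs/2507.19995
Authors: Tan-Minh Nguyen, Hoang-Trung Nguyen, Trong-Khoi Dao, Xuan-Hieu Phan, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong
Affiliations: Japan Advanced Institute of Science and Technology; VNU University of Engineering and Technology; VNU University of Law; National Institute of Informatics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: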
Abstract:The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal corpora for supervised training, validation, and supervised fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.
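[NLP-71] Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model
[Quick read]: This paper addresses the problem that LLM-based recommender systems lag behind model releases: most studies have not taken advantage of the improved language understanding and contextual reasoning of recent high-performing LLMs such as Llama3. The key to the solution is to keep the existing recommendation framework (LlamaRec) structurally unchanged and only replace the base model Llama2 with Llama3, fixing random seeds and using identical input preprocessing to keep the comparison fair. The results indicate that upgrading the model alone can improve recommendation quality substantially and cost-effectively.
Link: https://arxiv.org/abs/2507.19990
Authors: Sinnyum Choi, Woong Kim
Affiliations: Dong-Seoul University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: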
Abstract:Recently, competition in the field of artificial intelligence (AI) has intensified among major technological companies, resulting in the continuous release of new large-language models (LLMs) that exhibit improved language understanding and context-based reasoning capabilities. It is expected that these advances will enable more efficient personalized recommendations in LLM-based recommendation systems through improved quality of training data and architectural design. However, many studies have not considered these recent developments. In this study, it was proposed to improve LLM-based recommendation systems by replacing Llama2 with Llama3 in the LlamaRec framework. To ensure a fair comparison, random seed values were set and identical input data was provided during preprocessing and training. The experimental results show average performance improvements of 38.65%, 8.69%, and 8.19% for the ML-100K, Beauty, and Games datasets, respectively, thus confirming the practicality of this method. Notably, the significant improvements achieved by model replacement indicate that the recommendation quality can be improved cost-effectively without the need to make structural changes to the system. Based on these results, it is our contention that the proposed approach is a viable solution for improving the performance of current recommendation systems.
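[NLP-72] Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
[Quick read]: This paper evaluates the reliability of large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam, focusing on how their score consistency compares with that of human raters. Using generalizability theory, it analyzes two task types, story narration and email response, and compares two trained human raters with seven AI raters. Each essay receives one holistic score and three analytic scores for task completion, delivery, and language use. The key finding is that composite (hybrid) scoring that combines human and AI raters improves reliability, suggesting that pooling multiple rating sources can make scores in large-scale writing assessment more stable.
Link: https://arxiv.org/abs/2507.19980
Authors: Dan Song, Won-Chan Lee, Hong Jiao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: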
Abstract:This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, which supports that hybrid scoring models may offer benefits for large-scale writing assessments.
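To make the reliability argument concrete, here is a minimal sketch of the relative generalizability coefficient for the simplest crossed persons-by-raters design. This single-facet design and the variance components below are assumptions for illustration; the study's actual design (task types, human and AI rater groups, analytic domains) is richer and its estimates are not reproduced here.

```python
def g_coefficient(var_person, var_residual, n_raters):
    """Relative G coefficient for a crossed persons x raters design.

    E(rho^2) = var_person / (var_person + var_residual / n_raters),
    where var_residual is the person-by-rater interaction plus error.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return var_person / (var_person + var_residual / n_raters)

# Hypothetical variance components, not taken from the paper.
VAR_PERSON = 0.60
VAR_RESIDUAL = 0.40

for n in (1, 2, 4, 9):
    print(n, round(g_coefficient(VAR_PERSON, VAR_RESIDUAL, n), 3))
```

Because the residual term is divided by the number of raters, averaging over more raters (human, AI, or both) raises the coefficient, which is one way to read the abstract's observation that composite scoring improved reliability.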
[NLP-73] Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization
[Quick read]: This paper addresses the inefficiency of manually extracting pancreatic cystic lesion (PCL) imaging features, which limits large-scale studies. The key solution is to fine-tune open-source large language models (LLMs) with chain-of-thought (CoT) supervision so that they automatically extract PCL features from MRI/CT reports and assign guideline-based risk categories. Fine-tuning LLaMA and DeepSeek models with QLoRA on CoT data generated by GPT-4o markedly improves feature extraction accuracy and risk categorization F1 and reaches agreement on par with radiologists, showing that the approach is efficient, interpretable, and close to GPT-4o in performance.
Link: https://arxiv.org/abs/2507.19973
Authors: Ebrahim Rasromani, Stella K. Kang, Yanqi Xu, Beisong Liu, Garvit Luhadia, Wan Fung Chui, Felicia L. Pasadyn, Yu Chih Hung, Julie Y. An, Edwin Mathieu, Zehui Gu, Carlos Fernandez-Granda, Ammar A. Javed, Greg D. Sacks, Tamas Gonda, Chenchan Huang, Yiqiu Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss’ Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss’ Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss’ Kappa = 0.893) or GPT-CoT (Fleiss’ Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
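The agreement statistic reported above, Fleiss' Kappa, can be computed as in the following sketch. The rating table is hypothetical and the function is a plain implementation of the standard formula, not the authors' evaluation code.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned subject i to category j. Every subject needs the same
    number of raters (at least 2). Undefined (division by zero) if all
    ratings fall into a single category."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    # Observed agreement: mean over subjects of pairwise rater agreement.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical table: 4 reports, 3 raters, 3 risk categories.
ratings = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 1, 2],
    [0, 0, 3],
]
print(round(fleiss_kappa(ratings), 3))
```

Adding a model as an extra "rater" column-wise and recomputing the statistic is one way such a comparison with and without the model could be set up.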
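[NLP-74] Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
[Quick read]: This paper addresses the lack of a comprehensive benchmark for text-to-visualization (Text2Vis) generation, which limits rigorous evaluation of model capabilities. The authors introduce the Text2Vis benchmark, covering more than 20 chart types and diverse data science queries (trend analysis, correlation, outlier detection, predictive analytics). It contains 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated chart, and its queries involve complex reasoning, conversational turns, and dynamic data retrieval. The key contributions are: (1) the first cross-modal actor-critic agentic framework, which jointly refines the textual answer and the visualization code and raises GPT-4o's pass rate from 26% to 42%; and (2) an automated LLM-based evaluation framework that scales to thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy.
Link: https://arxiv.org/abs/2507.19969
Authors: Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Affiliations: York University; Dialpad; Salesforce AI Research
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: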
Abstract:Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o's pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at this https URL.
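[NLP-75] KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models
[Quick read]: This paper addresses the societal biases that large language models (LLMs) exhibit in generated text, which raise ethical concerns about fairness and potential harm. The key solution is KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns the attention distributions of stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD uses a composite training objective combining cross-entropy, KL divergence, and triplet losses, guiding the model to attend consistently across biased and unbiased contexts while preserving fluency and coherence. It mitigates bias on the BBQ and BOLD benchmarks with minimal impact on language modeling quality.
Link: https://arxiv.org/abs/2507.19962
Authors: Seorin Kim, Dongyoung Lee, Jaejin Lee
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments: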
Abstract:Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language models.
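As a rough illustration of the composite objective described above, the sketch below computes a KL divergence between two attention distributions and adds it to placeholder cross-entropy and triplet terms. The attention values, the loss weights `w_kl` and `w_triplet`, the direction of the KL term, and the scalar loss inputs are all assumptions; the abstract does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same token positions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def composite_loss(ce, kl, triplet, w_kl=1.0, w_triplet=1.0):
    """Weighted sum of the three terms; the weights here are hypothetical."""
    return ce + w_kl * kl + w_triplet * triplet

# Hypothetical attention over three aligned tokens in a sentence pair.
attn_stereotypical = [0.70, 0.20, 0.10]
attn_anti_stereotypical = [0.40, 0.35, 0.25]

kl = kl_divergence(attn_stereotypical, attn_anti_stereotypical)
loss = composite_loss(ce=2.10, kl=kl, triplet=0.30)
print(round(kl, 4), round(loss, 4))
```

The KL term is zero only when the two attention distributions coincide, so minimizing it pushes the model toward attending to the paired sentences in the same way.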
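[NLP-76] Spatial Language Likelihood Grounding Network for Bayesian Fusion of Human-Robot Observations
[Quick read]: This paper addresses the performance bottleneck that sensing limitations impose on robots in collaborative tasks. The core question is how to fuse human observations effectively to improve collaborative sensing. The key solution is the Feature Pyramid Likelihood Grounding Network (FP-LGN), which grounds spatial language by learning the association between map image features and spatial relation semantics, thereby modeling the uncertainty of human language input. Trained as a probability estimator with three-stage curriculum learning, the model captures the aleatoric uncertainty in human language and produces a grounded likelihood function that supports uncertainty-aware fusion of heterogeneous information, improving human-robot collaborative task performance.
Link: https://arxiv.org/abs/2507.19947
Authors: Supawich Sitdhipol, Waritwong Sukprasongdee, Ekapol Chuangsuwanich, Rina Tse
Affiliations: Unknown
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted to the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC)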
Abstract:Fusing information from human observations can help robots overcome sensing limitations in collaborative tasks. However, an uncertainty-aware fusion framework requires a grounded likelihood representing the uncertainty of human inputs. This paper presents a Feature Pyramid Likelihood Grounding Network (FP-LGN) that grounds spatial language by learning relevant map image features and their relationships with spatial relation semantics. The model is trained as a probability estimator to capture aleatoric uncertainty in human language using three-stage curriculum learning. Results showed that FP-LGN matched expert-designed rules in mean Negative Log-Likelihood (NLL) and demonstrated greater robustness with lower standard deviation. Collaborative sensing results demonstrated that the grounded likelihood successfully enabled uncertainty-aware fusion of heterogeneous human language observations and robot sensor measurements, achieving significant improvements in human-robot collaborative task performance.
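The fusion step that a grounded likelihood enables can be sketched as a Bayes update over a discretized map. Everything numeric here is hypothetical: in the paper the likelihood would come from FP-LGN, whereas this sketch hard-codes it for four map cells.

```python
def bayes_fuse(prior, likelihood):
    """Pointwise Bayes update over map cells: posterior is proportional
    to prior times likelihood, then normalized to sum to 1."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("likelihood rules out every cell")
    return [u / total for u in unnormalized]

# Robot sensor belief over the target's location in four map cells.
robot_belief = [0.40, 0.30, 0.20, 0.10]
# Assumed likelihood of a human utterance (e.g. "it is near the door")
# given that the target is in each cell.
human_likelihood = [0.05, 0.10, 0.60, 0.25]

posterior = bayes_fuse(robot_belief, human_likelihood)
print([round(p, 3) for p in posterior])
```

A flatter likelihood (a more uncertain utterance) moves the posterior less, which is why an uncertainty-aware likelihood matters for the fusion.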
[NLP-77] The Impact of Fine-tuning Large Language Models on Automated Program Repair
[Quick read]: This paper studies the limited performance of large language models (LLMs) on automated program repair (APR), in particular the challenges of high training cost, overfitting risk, and performance loss caused by differing data distributions. The key to its approach is a systematic evaluation of several fine-tuning techniques for LLMs in APR. It finds that full fine-tuning tends to lower model performance because of data distribution differences and overfitting, whereas parameter-efficient fine-tuning (PEFT) methods such as LoRA and IA3 achieve better results on three popular APR benchmarks (QuixBugs, Defects4J, and HumanEval-Java) while training far fewer parameters.
Link: https://arxiv.org/abs/2507.19909
Authors: Roman Macháček, Anastasiia Grishina, Max Hort, Leon Moonen
Affiliations: Simula Research Laboratory; University of Bern
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication in the research track of the 41st International Conference on Software Maintenance and Evolution (ICSME 2025)
Abstract:Automated Program Repair (APR) uses various tools and techniques to help developers achieve functional and error-free code faster. In recent years, Large Language Models (LLMs) have gained popularity as components in APR tool chains because of their performance and flexibility. However, training such models requires a significant amount of resources. Fine-tuning techniques have been developed to adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational costs than training from scratch. In this study, we empirically investigate the impact of various fine-tuning techniques on the performance of LLMs used for APR. Our experiments provide insights into the performance of a selection of state-of-the-art LLMs pre-trained on code. The evaluation is done on three popular APR benchmarks (i.e., QuixBugs, Defects4J and HumanEval-Java) and considers six different LLMs with varying parameter sizes (resp. CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2). We consider three training regimens: no fine-tuning, full fine-tuning, and parameter-efficient fine-tuning (PEFT) using LoRA and IA3. We observe that full fine-tuning techniques decrease the benchmarking performance of various models due to different data distributions and overfitting. By using parameter-efficient fine-tuning methods, we restrict models in the amount of trainable parameters and achieve better results. Keywords: large language models, automated program repair, parameter-efficient fine-tuning, AI4Code, AI4SE, ML4SE. 
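To show what restricting the number of trainable parameters means for LoRA, the sketch below counts parameters for one weight matrix. LoRA freezes the original matrix W and trains a low-rank update BA of rank r; the layer size and rank are hypothetical and are not the settings used in the study. IA3 is not covered.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters of a rank-r update B @ A,
    with B of shape d_out x r and A of shape r x d_in."""
    return rank * (d_out + d_in)

# Hypothetical projection layer and rank.
D_MODEL = 4096
RANK = 8

full = full_params(D_MODEL, D_MODEL)
lora = lora_params(D_MODEL, D_MODEL, RANK)
print(full, lora, f"{100 * lora / full:.2f}%")
```

For this assumed layer the low-rank update trains well under 1% of the parameters that full fine-tuning would, which is the kind of restriction the abstract credits with avoiding overfitting.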
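[NLP-78] CaliDrop: KV Cache Compression with Calibration
[Quick read]: This paper addresses the bottleneck created during LLM generation by the key-value (KV) cache, whose memory footprint grows linearly with sequence length, batch size, and model size, especially in long-context scenarios. Existing remedies such as token eviction compress the KV cache but often cause notable accuracy degradation, particularly at high compression ratios. The key solution is CaliDrop, which improves token eviction through calibration: building on the observation that queries at nearby positions are highly similar, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by removing KV entries.
Link: https://arxiv.org/abs/2507.19906
Authors: Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Affiliations: Soochow University; Huawei Cloud
Categories: Computation and Language (cs.CL)
Comments: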
Abstract:Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose CaliDrop, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
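The baseline that CaliDrop builds on, token eviction, can be sketched as follows. This is a generic score-based eviction rule and not CaliDrop's calibration step, which the abstract does not specify in detail; the scores, the `budget`, and the `n_recent` window are hypothetical.

```python
def evict_tokens(attn_scores, budget, n_recent=2):
    """Return the sorted indices of KV cache entries to keep.

    attn_scores[i] is the accumulated attention mass received by cached
    token i. The n_recent newest tokens are always kept; the rest of the
    budget goes to the highest-scoring older tokens.
    """
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))
    n_recent = min(n_recent, budget)
    recent = list(range(n - n_recent, n))
    older = sorted(range(n - n_recent), key=lambda i: attn_scores[i], reverse=True)
    return sorted(older[: budget - n_recent] + recent)

# Hypothetical accumulated attention for eight cached tokens.
scores = [0.30, 0.02, 0.25, 0.01, 0.03, 0.04, 0.20, 0.15]
kept = evict_tokens(scores, budget=4)
print(kept)
```

Entries outside `kept` would simply be dropped by a plain eviction method; the paper's contribution concerns recovering some of the accuracy lost by dropping them.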
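[NLP-79] A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs
[Quick read]: This paper addresses early detection of depression from online social media content and aims to make the explanations that large language models (LLMs) generate in the mental health domain more trustworthy and clinically relevant. The core challenge is that existing datasets mostly provide coarse post-level labels, which cannot support fine-grained evaluation of model predictions and explanations. The key solution is a high-quality, expert-annotated dataset of 1,017 social media posts with depressive span annotations mapped to 12 depression symptom categories, together with a clinically grounded evaluation framework that measures the faithfulness and quality of LLM-generated natural language explanations. Using zero-shot and few-shot prompting, the study evaluates GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet, reveals performance differences among them on clinical explanation tasks, and underscores the value of human expertise in guiding LLM behavior, a step toward safer and more transparent AI systems for mental health.
Link: https://arxiv.org/abs/2507.19899
Authors: Prajval Bolegave, Pushpak Bhattacharya
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: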
Abstract:Early detection of depression from online social media posts holds promise for providing timely mental health interventions. In this work, we present a high-quality, expert-annotated dataset of 1,017 social media posts labeled with depressive spans and mapped to 12 depression symptom categories. Unlike prior datasets that primarily offer coarse post-level labels (Cohan et al., 2018), our dataset enables fine-grained evaluation of both model predictions and generated explanations. We develop an evaluation framework that leverages this clinically grounded dataset to assess the faithfulness and quality of natural language explanations generated by large language models (LLMs). Through carefully designed prompting strategies, including zero-shot and few-shot approaches with domain-adapted examples, we evaluate state-of-the-art proprietary LLMs including GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet. Our comprehensive empirical analysis reveals significant differences in how these models perform on clinical explanation tasks under zero-shot and few-shot prompting. Our findings underscore the value of human expertise in guiding LLM behavior and offer a step toward safer, more transparent AI systems for psychological well-being.
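[NLP-80] Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam
[Quick read]: This paper addresses language bias in current medical applications of generative AI, in particular inconsistent model performance in non-English settings. It evaluates six large language models (LLMs) and four multimodal large language models (MLLMs) on medical questions written in Brazilian Portuguese and compares them with human candidates on accuracy, response time, and coherence of explanations. The key to the approach is a benchmark built from the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP). Some models, notably Claude-3.5-Sonnet and Claude-3-Opus, come close to human-level accuracy on text questions, but clear gaps remain on multimodal questions that require image interpretation. This highlights the need for fine-tuning and data augmentation targeted at non-English medical settings and provides empirical grounding for fair, reliable, clinically deployable AI assistance.
Link: https://arxiv.org/abs/2507.19885
Authors: Cesar Augusto Madid Truyts, Amanda Gomes Rabelo, Gabriel Mesquita de Souza, Daniel Scaldaferri Lages, Adriano Jose Pereira, Uri Adrian Prync Flato, Eduardo Pontes dos Reis, Joaquim Edson Vieira, Paulo Sergio Panse Silveira, Edson Amaro Junior
Affiliations: Einstein Global Advanced Technologies for Equity, Hospital Israelita Albert Einstein; Hospital Israelita Albert Einstein; Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University
Categories: Computation and Language (cs.CL)
Comments: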
Abstract:Artificial intelligence (AI) has shown the potential to revolutionize healthcare by improving diagnostic accuracy, optimizing workflows, and personalizing treatment plans. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved notable advancements in natural language processing and medical applications. However, the evaluation of these models has focused predominantly on the English language, leading to potential biases in their performance across different languages. This study investigates the capability of six LLMs (GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8x7B Instruct, Titan Text G1-Express, and Command R+) and four MLLMs (Claude-3.5-Sonnet, Claude-3-Opus, Claude-3-Sonnet, and Claude-3-Haiku) to answer questions written in Brazilian Portuguese from the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP), the largest health complex in South America. The performance of the models was benchmarked against human candidates, analyzing accuracy, processing time, and coherence of the generated explanations. The results show that while some models, particularly Claude-3.5-Sonnet and Claude-3-Opus, achieved accuracy levels comparable to human candidates, performance gaps persist, particularly in multimodal questions requiring image interpretation. Furthermore, the study highlights language disparities, emphasizing the need for further fine-tuning and data set augmentation for non-English medical AI applications. Our findings reinforce the importance of evaluating generative AI in various linguistic and clinical settings to ensure a fair and reliable deployment in healthcare. Future research should explore better training methodologies, improved multimodal reasoning, and real-world clinical integration of AI-driven medical assistance.
Cite as: arXiv:2507.19885 [cs.CL], https://doi.org/10.48550/arXiv.2507.19885. Submitted (v1): Sat, 26 Jul 2025 09:34:52 UTC.
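[NLP-81] The Polish Vocabulary Size Test: A Novel Adaptive Test for Receptive Vocabulary Assessment
[Quick read]: This paper addresses the shortcomings in accuracy and efficiency of existing tools for assessing Polish vocabulary size, in particular the lack of a standardized instrument that measures vocabulary size precisely for both native and non-native speakers while keeping testing time short. The key to the solution is the Polish Vocabulary Size Test (PVST), a new instrument that combines Item Response Theory (IRT) with Computerized Adaptive Testing (CAT): it adjusts item difficulty dynamically according to the test-taker's responses, maintaining high precision while substantially reducing test duration.
Link: https://arxiv.org/abs/2507.19869
Authors: Danil Fokin,Monika Płużyczka,Grigory Golovin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: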
Abstract:We present the Polish Vocabulary Size Test (PVST), a novel tool for assessing the receptive vocabulary size of both native and non-native Polish speakers. Based on Item Response Theory and Computerized Adaptive Testing, PVST dynamically adjusts to each test-taker’s proficiency level, ensuring high accuracy while keeping the test duration short. To validate the test, a pilot study was conducted with 1.475 participants. Native Polish speakers demonstrated significantly larger vocabularies compared to non-native speakers. For native speakers, vocabulary size showed a strong positive correlation with age. The PVST is available online at this http URL.
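The adaptive loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: a one-parameter (Rasch) IRT model, a maximum-information item selection rule, and a grid-search ability estimate. The abstract does not say which IRT model or selection rule PVST actually uses, and every function name below is hypothetical.

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability that a person of ability theta knows an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta; it peaks when b == theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def next_item(theta, difficulties, used):
    """Pick the unused item that is most informative at the current ability estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))

def estimate_theta(responses, difficulties):
    """Grid-search maximum-likelihood ability estimate over [-4, 4] (assumed range)."""
    grid = [g / 10.0 for g in range(-40, 41)]

    def log_lik(theta):
        total = 0.0
        for i, correct in responses.items():
            p = p_correct(theta, difficulties[i])
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_lik)

def run_cat(difficulties, answer_fn, n_items=5):
    """Administer n_items adaptively; answer_fn(i) returns True if item i was answered correctly."""
    theta, responses = 0.0, {}
    for _ in range(n_items):
        i = next_item(theta, difficulties, responses)
        responses[i] = answer_fn(i)
        theta = estimate_theta(responses, difficulties)
    return theta, responses
```

In this sketch each answer moves the ability estimate, and the next item is chosen near that estimate, which is why an adaptive test can stay short without losing much precision.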
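[NLP-82] DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments
[Quick read]: This paper addresses the lack of training data for in-car conversational AI that captures the spontaneous disfluencies of real driving scenarios (such as hesitations, repetitions, and self-corrections), which are common in actual driver interactions but are not adequately covered by existing datasets. The key to the solution is DiscoDrive, a synthetic corpus built with a two-stage, prompt-driven generation pipeline that introduces these disfluencies dynamically during synthesis, yielding 3,500 multi-turn dialogs across seven automotive domains. The approach improves model performance on the MultiWOZ 2.2 and Schema-Guided Dialogue (SGD) test sets (BLEU-4 gains of 0.26 to 0.61, BERTScore F1 gains of 1.35 to 3.48), brings additional gains as data augmentation in low-resource settings (for example, BLEU-4 +0.38), and is rated above the KVRET dataset in naturalness and coherence in human evaluation.
Link: https://arxiv.org/abs/2507.19867
Authors: Anshul Chavda,M Jagadeesh,Chintalapalli Raja Kullayappa,B Jayaprakash,Medchalimi Sruthi,Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay; Hyundai Motors India Engineering
Categories: Computation and Language (cs.CL)
Comments: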
Abstract:In-car conversational AI is becoming increasingly critical as autonomous vehicles and smart assistants gain widespread adoption. Yet, existing datasets fail to capture the spontaneous disfluencies such as hesitations, false starts, repetitions, and self-corrections that characterize real driver-AI dialogs. To address this, we introduce DiscoDrive, a synthetic corpus of 3500 multi-turn dialogs across seven automotive domains, generated using a two-stage, prompt-driven pipeline that dynamically integrates disfluencies during synthesis. We show that DiscoDrive is effective both as a training resource, enabling DialoGPT-Medium and T5-Base to match or exceed KVRET-trained models on the MultiWOZ 2.2 and Schema-Guided Dialogue (SGD) relevant test sets (BLEU-4 improvements of 0.26 to 0.61; METEOR +2.10; ROUGE-L +3.48; BERTScore F1 improvements of 1.35 to 3.48), and as a data augmentation resource in low-resource scenarios, delivering additional gains of up to BLEU-4 +0.38, METEOR +1.95, ROUGE-L +2.87, and BERTScore F1 +4.00 when combined with 10 percent of KVRET. Human evaluations further confirm that dialogs sampled from DiscoDrive are rated higher than KVRET’s human-collected dialogs in naturalness (3.8 vs 3.6) and coherence (4.1 vs 4.0), and are perceived as more context-appropriate than leading post-hoc methods (such as LARD), without compromising clarity. DiscoDrive fills a critical gap in existing resources and serves as a versatile corpus for both training and augmenting conversational AI, enabling robust handling of real-world, disfluent in-car interactions.
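[NLP-83] Agentic Reinforced Policy Optimization
[Quick read]: This paper addresses the difficulty current reinforcement learning (RL) algorithms have in balancing a model's long-horizon reasoning ability against its proficiency in multi-turn tool interaction when training multi-turn agents based on large language models (LLMs). Existing methods tend to optimize at the trajectory level and overlook how the highly uncertain states that follow tool use affect policy exploration. The key to the solution is Agentic Reinforced Policy Optimization (ARPO), a new agentic RL algorithm that introduces an entropy-based adaptive sampling mechanism to strengthen local exploration at the high-uncertainty steps after tool use, and combines it with advantage attribution estimation so that the LLM can internalize the advantage differences of each tool-interaction step, making more efficient use of a limited tool-call budget on complex tasks.
Link: https://arxiv.org/abs/2507.19849
Authors: Guanting Dong,Hangyu Mao,Kai Ma,Licheng Bao,Yifei Chen,Zhongyuan Wang,Zhongxia Chen,Jiazhen Du,Huiyang Wang,Fuzheng Zhang,Guorui Zhou,Yutao Zhu,Ji-Rong Wen,Zhicheng Dou
Affiliations: Renmin University of China; Kuaishou Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Working on progress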
Abstract:Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models’ intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO’s superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at this https URL
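The entropy signal that motivates the adaptive rollout can be illustrated with a small sketch. It assumes a simplified decision rule (branch into extra step-level samples when the mean token entropy after a tool call exceeds a pre-tool baseline by a fixed margin); the `threshold` value and all function names are hypothetical, and the rule ARPO actually uses is not given in the abstract.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(step_distributions):
    """Average entropy over the distributions of several generated tokens."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

def should_branch(baseline_dists, post_tool_dists, threshold=0.2):
    """Assumed rule: sample extra branches if entropy rose by more than `threshold` after the tool call."""
    delta = mean_entropy(post_tool_dists) - mean_entropy(baseline_dists)
    return delta > threshold
```

A trainer built on this idea would spend part of its rollout budget on full trajectories and the rest on branches started at the steps where `should_branch` fires.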
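[NLP-84] AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition ICCV2025
[Quick read]: This paper addresses error propagation, overfitting, and poor vocabulary scalability in the multi-stage pipelines used for Continuous Sign Language Recognition (CSLR). Traditional methods rely on a two-stage process of feature extraction followed by sequence alignment (such as CTC or HMM), and their performance is limited by the bottleneck of the intermediate gloss representation. The key to the solution is AutoSign, an autoregressive decoder-only Transformer that maps pose sequences directly to natural-language text and discards traditional alignment mechanisms. It uses a one-dimensional convolutional neural network (1D CNN) for temporal compression so that pose data can be processed efficiently, and generates the final gloss text with AraGPT2, a pre-trained Arabic decoder. This gives end-to-end recognition without an intermediate alignment step and improves WER on the Isharah-1000 dataset by up to 6.1%.
Link: https://arxiv.org/abs/2507.19840
Authors: Samuel Ebimobowei Johnny,Blessed Guda,Andrew Blayama Stephen,Assane Gueye
Affiliations: Carnegie Mellon University Africa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Paper to appear at the 1st Workshop in Multimodal Sign Language Recognition at ICCV 2025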
Abstract:Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This involves recognizing and interpreting the hands, face, and body gestures of the signer, which pose a challenge as it involves a combination of all these features. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages, overfitting, and struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. The use of this decoder-only approach allows the model to directly map between the features and the glosses without the need for CTC loss while also directly learning the textual dependencies in the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, achieving an improvement of up to 6.1% in WER score compared to the best existing method.
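[NLP-85] HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
[Quick read]: This paper addresses the inference bottleneck caused by the high memory footprint of the Key-Value (KV) cache when large language models (LLMs) process long inputs. Existing KV cache compression methods degrade noticeably once memory is reduced by more than 85%, and GPU-CPU collaborative strategies for approximate attention remain underexplored. The key to the solution is the HCAttention framework, which relies on three core mechanisms: (1) key quantization, which lowers the storage precision of the KV cache; (2) value offloading, which moves part of the KV cache to CPU memory; and (3) dynamic KV eviction, which discards low-value cache entries according to their attention importance. The method is compatible with mainstream Transformer architectures without fine-tuning. On the LongBench benchmark it preserves the accuracy of the full-attention model while shrinking the KV cache to 25% of its original size, and it stays competitive with only 12.5% of the cache, setting a new state of the art in KV cache compression. The authors also report that it is the first method to run Llama-3-8B on 4 million tokens on a single A100 GPU (80GB).
Link: https://arxiv.org/abs/2507.19823
Authors: Dongquan Yang,Yifan Yang,Xiaotian Yu,Xianbiao Qi,Rong Xiao
Affiliations: Intellifusion Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: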
Abstract:Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable performance degradation when memory is reduced by more than 85%. Additionally, strategies that leverage GPU-CPU collaboration for approximate attention remain underexplored in this setting. We propose HCAttention, a heterogeneous attention computation framework that integrates key quantization, value offloading, and dynamic KV eviction to enable efficient inference under extreme memory constraints. The method is compatible with existing transformer architectures and does not require model fine-tuning. Experimental results on the LongBench benchmark demonstrate that our approach preserves the accuracy of full-attention model while shrinking the KV cache memory footprint to 25% of its original size. Remarkably, it stays competitive with only 12.5% of the cache, setting a new state-of-the-art in LLM KV cache compression. To the best of our knowledge, HCAttention is the first to extend the Llama-3-8B model to process 4 million tokens on a single A100 GPU with 80GB memory.
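Two of the ingredients named in the abstract, key quantization and KV eviction, can be illustrated generically. The sketch below assumes symmetric per-vector int8 quantization and a keep-the-highest-scoring-positions eviction rule; it is not HCAttention's actual scheme, which the abstract does not detail, and all names are hypothetical. Value offloading to CPU memory is not modeled.

```python
def quantize_int8(vec):
    """Symmetric per-vector int8 quantization; returns (integer codes, scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid a zero scale for all-zero vectors
    return [round(x / scale) for x in vec], scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original key vector."""
    return [c * scale for c in codes]

def evict(cache_positions, attention_scores, keep_ratio):
    """Keep the fraction of cached positions with the highest accumulated attention score."""
    keep = max(1, int(len(cache_positions) * keep_ratio))
    ranked = sorted(cache_positions, key=lambda pos: attention_scores[pos], reverse=True)
    return sorted(ranked[:keep])  # restore positional order for the surviving entries
```

With `keep_ratio=0.125` this corresponds in spirit to the 12.5% cache setting mentioned in the abstract, although the real method decides what to keep dynamically during inference.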
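[NLP-86] Flora: Effortless Context Construction to Arbitrary Length and Scale
[Quick read]: This paper addresses three challenges that large language models (LLMs) face with long contexts: the scarcity of long texts, high computational cost, and a marked degradation of short-context abilities. Existing methods usually rely on humans or LLMs to build instruction-tuning data, which is costly and hard to scale in diversity and length. The core solution is Flora, a long-context construction strategy that needs no human or LLM intervention: it assembles short instructions by category and uses long-context meta-instructions to guide response generation, producing long contexts of arbitrary length and high diversity while only slightly affecting short-context performance. Experiments show that Flora improves Llama3-8B-Instruct and QwQ-32B on several long-context benchmarks while maintaining strong performance on short-context tasks.
Link: https://arxiv.org/abs/2507.19786
Authors: Tianxiang Chen,Zhentao Tan,Xiaofan Bo,Yue Wu,Tao Gong,Qi Chu,Jieping Ye,Nenghai Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: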
Abstract: Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at this https URL.
[NLP-87] UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities
[Quick read]: This paper addresses the inefficiency of traditional reinforcement learning (RL) frameworks when large language models (LLMs) generate ultra-long output sequences, which stems from long-tail sequence length distributions and entropy collapse during training. The key to the solution is Ultra-Long Output Reinforcement Learning (UloRL). First, the decoding of ultra-long outputs is split into short segments (segment rollout), which mitigates the delays caused by long-tail samples. Second, dynamic masking of well-Mastered Positive Tokens (MPTs) is introduced to prevent entropy collapse. On the Qwen3-30B-A3B model, the method yields a 2.06x training speedup and reaches 85.1% on AIME2025 and 61.9% on BeyondAIME, surpassing a baseline with far more parameters.
Link: https://arxiv.org/abs/2507.19766
Authors: Dong Du,Shulin Liu,Tao Yang,Shaohua Chen,Yang Li
Affiliations: Tencent Hunyuan Team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages
Abstract:Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to enhance reasoning capabilities through extended output sequences. However, traditional RL frameworks face inefficiencies when handling ultra-long outputs due to long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing large language models’ reasoning abilities. Specifically, we divide ultra long output decoding into short segments, enabling efficient training by mitigating delays caused by long-tail samples. Additionally, we introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of our approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved 2.06x increase in training speed, while RL training with 128k-token outputs improves the model’s performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B with remarkable gains. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs with ultra-long sequence generation. We will release our code and model for further use by the community.
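The two ideas in the abstract, segment rollout and masking of well-mastered positive tokens, can be sketched in a few lines. The sketch assumes that a "mastered" token is one in a positively rewarded sample whose predicted probability is at or above a fixed threshold; the threshold value, the function names, and this exact definition are assumptions, since the abstract does not define them.

```python
def split_into_segments(total_len, segment_len):
    """Split a decoding budget of total_len tokens into (start, end) segments of at most segment_len."""
    return [(s, min(s + segment_len, total_len)) for s in range(0, total_len, segment_len)]

def mpt_loss_mask(token_probs, is_positive_sample, prob_threshold=0.99):
    """Return a 0/1 loss mask: 0 drops a token from the loss, 1 keeps it.

    Assumed rule: only tokens in positive samples that the policy already
    predicts with probability >= prob_threshold are masked out.
    """
    if not is_positive_sample:
        return [1] * len(token_probs)
    return [0 if p >= prob_threshold else 1 for p in token_probs]
```

Masking tokens the model already predicts with near certainty removes the gradient that would push those probabilities even closer to 1, which is one plausible way to slow the loss of output entropy.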
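[NLP-88] Are You There God? Lightweight Narrative Annotation of Christian Fiction with LMs
[Quick read]: This paper addresses the long-standing lack of systematic research on Christian Fiction, in particular the absence of quantitative analysis of how "acts of God" are depicted. The key to the solution is to combine human annotation with a lightweight language model (LM): by building a standardized codebook and using a much larger model to help adapt the annotation instructions for the small model, the authors obtain automatic identification of acts of God in complex texts that matches human annotations, and use it to reveal significant differences between the Left Behind series and Christian Fiction more broadly, and between works by male and female authors.
Link: https://arxiv.org/abs/2507.19756
Authors: Rebecca M. M. Hicke,Brian Haggard,Mia Ferrante,Rayhan Khanna,David Mimno
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: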
Abstract:In addition to its more widely studied political activities, the American Evangelical movement has a well-developed but less externally visible cultural and literary side. Christian Fiction, however, has been little studied, and what scholarly attention there is has focused on the explosively popular Left Behind series. In this work, we use computational tools to provide both a broad topical overview of Christian Fiction as a genre and a more directed exploration of how its authors depict divine acts. Working with human annotators we first developed definitions and a codebook for “acts of God.” We then adapted those instructions designed for human annotators for use by a recent, lightweight LM with the assistance of a much larger model. The laptop-scale LM is capable of matching human annotations, even when the task is subtle and challenging. Using these annotations, we show that significant and meaningful differences exist between the Left Behind books and Christian Fiction more broadly and between books by male and female authors.
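[NLP-89] JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models
[Quick read]: This paper addresses the weak performance of Large Language Models (LLMs) on complex mathematical reasoning, especially on problems that demand deep conceptual understanding and multi-step logical derivation, where existing models often make errors or fail to complete the reasoning. The key to the solution is a systematic multi-stage optimization framework used to build JT-Math-8B, a series of open-source models with base, instruct, and thinking versions. The Thinking Model uses a Long Chain-of-Thought (Long CoT) approach combined with a novel multi-stage reinforcement learning (RL) curriculum that progressively increases task difficulty and context length (up to 32K tokens), which strengthens the model's ability to solve complex problems.
Link: https://arxiv.org/abs/2507.19748
Authors: Yifan Hao,Fangning Chao,Yaqian Hao,Zhaojun Cui,Huan Bai,Haiyu Zhang,Yankai Liu,Chao Deng,Junlan Feng
Affiliations: JIUTIAN Team, China Mobile Research Institute
Categories: Computation and Language (cs.CL)
Comments: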
Abstract:Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). While state-of-the-art models show promise, they often falter when faced with complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built upon a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum that progressively increases task difficulty and context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models like OpenAI’s O1-mini and GPT-4o , and demonstrating superior performance on competition-level mathematics.
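[NLP-90] Basic Reading Distillation
[Quick read]: This paper addresses the difficulty of deploying large language models (LLMs) in practice because of their high computational requirements. Existing knowledge distillation and task distillation methods can shrink models, but both neglect training the small model's basic reading ability on generic text. The key to the solution is Basic Reading Distillation (BRD), which teaches a small model to imitate basic reading behaviors of LLMs on each sentence, such as named entity recognition and question raising and answering, and so educates its underlying language understanding. Experiments show that a small model trained with BRD outperforms or performs comparably to LLMs more than 20x its size on several downstream tasks (including language inference benchmarks and BIG-bench tasks), and that BRD effectively influences the small model's probability distribution and is orthogonal to traditional distillation methods.
Link: https://arxiv.org/abs/2507.19741
Authors: Zhi Zhou,Sirui Miao,Xiangyu Duan,Hao Yang,Min Zhang
Affiliations: Soochow University; Huawei Translation Services Center
Categories: Computation and Language (cs.CL)
Comments: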
Abstract: Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand high computation resources which limits their deployment in real-world. Distillation is one technique to solve this problem through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they all neglect basic reading education for small models on generic texts that are unrelated to downstream tasks. In this paper, we propose basic reading distillation (BRD) which educates a small model to imitate LLMs basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model on various tasks including language inference benchmarks and BIG-bench tasks. It shows that the small model can outperform or perform comparable to over 20x bigger LLMs. Analysis reveals that BRD effectively influences the probability distribution of the small model, and has orthogonality to either knowledge distillation or task distillation.
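[NLP-91] Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs
[Quick read]: This paper addresses the lack of subjectivity in Table-to-Text (T2T) generation: existing methods are largely limited to objective descriptions of tabular data and struggle to produce text containing subjective content such as interpretation, judgment, or sentiment. The key to the solution is a three-stage structured pipeline: first, Resource Description Framework (RDF) triples are extracted from the table to improve factual accuracy; second, the RDF triples are aggregated into a coherent narrative; finally, subjectivity is infused to enrich the text. The method uses intermediate representations to improve interpretability and factual consistency, and with smaller fine-tuned T5 models it achieves performance comparable to GPT-3.5 and better than Mistral-7B and Llama-2 on several metrics, in contrast to mainstream approaches that rely on large language models (LLMs).
Link: https://arxiv.org/abs/2507.19710
Authors: Ronak Upasham,Tathagata Dey,Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay
Categories: Computation and Language (cs.CL)
Comments: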
Abstract:In Table-to-Text (T2T) generation, existing approaches predominantly focus on providing objective descriptions of tabular data. However, generating text that incorporates subjectivity, where subjectivity refers to interpretations beyond raw numerical data, remains underexplored. To address this, we introduce a novel pipeline that leverages intermediate representations to generate both objective and subjective text from tables. Our three-stage pipeline consists of: 1) extraction of Resource Description Framework (RDF) triples, 2) aggregation of text into coherent narratives, and 3) infusion of subjectivity to enrich the generated text. By incorporating RDFs, our approach enhances factual accuracy while maintaining interpretability. Unlike large language models (LLMs) such as GPT-3.5, Mistral-7B, and Llama-2, our pipeline employs smaller, fine-tuned T5 models while achieving comparable performance to GPT-3.5 and outperforming Mistral-7B and Llama-2 in several metrics. We evaluate our approach through quantitative and qualitative analyses, demonstrating its effectiveness in balancing factual accuracy with subjective interpretation. To the best of our knowledge, this is the first work to propose a structured pipeline for T2T generation that integrates intermediate representations to enhance both factual correctness and subjectivity.
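The first stage of the pipeline, turning table content into RDF-style triples, can be illustrated with a deliberately simple sketch. It assumes one table row is a dictionary and that one column identifies the subject; the paper uses fine-tuned T5 models for extraction, so this hand-written rule only shows the shape of the intermediate representation. The function name and the sample row are hypothetical.

```python
def row_to_triples(row, subject_key):
    """Convert one table row into (subject, predicate, object) triples.

    The column named by subject_key supplies the subject; every other
    column becomes a predicate whose object is the cell value.
    """
    subject = row[subject_key]
    return [(subject, column, value) for column, value in row.items() if column != subject_key]

# Hypothetical example row, for illustration only.
example = row_to_triples({"player": "A. Smith", "goals": 3, "team": "Reds"}, "player")
```

Later stages would aggregate such triples into sentences and then add interpretive wording, which is where the subjectivity described in the abstract enters.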
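[NLP-92] Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks IJCAI
[Quick read]: This paper addresses the insufficient performance of Large Language Models (LLMs) in low-resource language settings (such as Kannada and Arabic), focusing on the performance differences between multilingual and monolingual models and on the impact of model compression strategies such as pruning and quantization. The study finds that multilingual models generally outperform language-specific ones, indicating substantial benefits from cross-lingual transfer; 4-bit and 8-bit quantization improves efficiency while maintaining accuracy, whereas aggressive pruning significantly degrades performance, especially in larger models. The key to building scalable and fair multilingual Natural Language Processing (NLP) solutions is therefore to adopt suitable quantization strategies that balance efficiency and accuracy, to exploit cross-lingual knowledge transfer, and to intervene specifically against hallucination and generalization errors in low-resource settings.
Link: https://arxiv.org/abs/2507.19699
Authors: Maitha Alshehhi,Ahmed Sharshar,Mohsen Guizani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published in the 3rd International Workshop on Generalizing from Limited Resources in the Open World. Workshop at International Joint Conference on Artificial Intelligence (IJCAI) 2025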
Abstract: Although LLMs have attained significant success in high-resource languages, their capacity in low-resource linguistic environments like Kannada and Arabic is not yet fully understood. This work benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages, with particular emphasis on the effects of model compression strategies such as pruning and quantization. Findings show significant performance differences driven by linguistic diversity and resource availability on SOTA LLMs such as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2. We find that multilingual versions of the model outperform their language-specific counterparts across the board, indicating substantial cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance, especially in bigger models. Our findings pinpoint key strategies to construct scalable and fair multilingual NLP solutions and underscore the need for interventions to address hallucination and generalization errors in the low-resource setting.
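[NLP-93] Salsa as a Nonverbal Embodied Language – The CoMPAS3D Dataset and Benchmarks
[Quick read]: This paper addresses the challenge of interactive, embodied motion generation in human-machine partner dancing, in particular how a robot or synthetic human dancing with a human partner can adapt to the partner's skill level, respond in real time to physical haptic signals, and generate expressive and creative dance moves. The core problem is modeling the coupled interaction between two agents, which is continuous, bidirectionally reactive, and shaped by individual variation. The key to the solution is the CoMPAS3D dataset, described as the largest and most diverse motion capture dataset of improvised salsa dancing: 3 hours of leader-follower interaction by 18 dancers of different skill levels (beginner to professional), with more than 2,800 fine-grained expert-annotated move segments (covering move types, combinations, execution errors, and stylistic elements). The authors also build a multitask SalsaAgent model that handles both leader or follower generation conditioned on proficiency (analogous to speaker or listener synthesis) and duet generation (analogous to conversation generation), providing a benchmark and methodology for socially interactive embodied AI and creative humanoid motion generation.
Link: https://arxiv.org/abs/2507.19684
Authors: Bermet Burkanova,Payam Jome Yazdian,Chuxuan Zhang,Trinity Evans,Paige Tuttösí,Angelica Lim
Affiliations: Simon Fraser University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL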
Abstract:Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner’s proficiency, using haptic signaling as a primary form of communication. While today’s AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text-it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.
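[NLP-94] RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
[Quick read]: This paper addresses how Large Language Models (LLMs) and Vision-Language Models (VLMs) can effectively support legal education in under-resourced languages such as Romanian, specifically in understanding and reasoning about Romanian driving law. The core solution is RoD-TAL, a multimodal dataset containing text-based and image-based driving test questions, annotated legal references, and human explanations, on which the authors implement retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models. Experiments show that domain-specific fine-tuning significantly improves retrieval performance, and that chain-of-thought prompting and specialized reasoning models raise question-answering accuracy above the minimum grade required to pass the driving exam; visual reasoning remains challenging, which highlights both the potential and the limitations of LLMs and VLMs in legal education.
Link: https://arxiv.org/abs/2507.19666
Authors: Andrei Vlad Man,Răzvan-Alexandru Smădu,Cristian-George Craciun,Dumitru-Clementin Cercel,Florin Pop,Mihaela-Claudia Cercel
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 49 pages, 52 figures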
Abstract:The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, alongside annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. However, visual reasoning remains challenging, highlighting the potential and the limitations of applying LLMs and VLMs to legal education.
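[NLP-95] MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
[Quick read]: This paper addresses the lack of systematic evaluation tools for Multimodal Large Language Models (MLLMs) in crosslingual, multimodal settings. Existing benchmarks are usually limited to English, a single modality, short-form inputs, or lack human annotations, so they cannot fully measure instruction-following ability across languages, modalities, and task complexity. The key to the solution is the MCIF (Multimodal Crosslingual Instruction Following) benchmark, the first multilingual human-annotated dataset based on scientific talks. It covers three modalities (speech, vision, and text) and four languages (English, German, Italian, and Chinese), and evaluates crosslingual multimodal instruction understanding and execution over both short and long contexts.
Link: https://arxiv.org/abs/2507.19634
Authors: Sara Papi,Maike Züfle,Marco Gaido,Beatrice Savoldi,Danni Liu,Ioannis Douros,Luisa Bentivogli,Jan Niehues
Affiliations: Fondazione Bruno Kessler (Italy); Karlsruhe Institute of Technology (Germany); Translated (Italy)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: Work in progress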
Abstract:Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations–hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities–speech, vision, and text–and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs’ abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.
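[NLP-96] HITSZ's End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track
[Quick read]: This paper aims to improve the quality of speech-to-text translation (ST) between English and Indic languages in a low-resource scenario. The key to the solution is an end-to-end system that integrates the pre-trained Whisper automatic speech recognition (ASR) model with Krutrim, a large language model (LLM) specialized for Indic languages. The system achieves average BLEU scores of 28.88 for English-to-Indic and 27.86 for Indic-to-English directions. A Chain-of-Thought (CoT) method further improves translation quality on successfully parsed outputs for specific language pairs (for example, a 13.84 BLEU increase for Tamil-to-English), although keeping the CoT output format consistent remains a challenge.
Link: https://arxiv.org/abs/2507.19616
Authors: Xuchen Wei,Yangxin Wu,Yaoyin Zhang,Henglyu Liu,Kehai Chen,Xuefeng Bai,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 1 figure, submitted to IWSLT 2025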
Abstract:This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of 28.88 for English-to-Indic directions and 27.86 for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g. a 13.84 BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.
[NLP-97] MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
[Quick read]: This paper addresses the insufficient robustness of large language models (LLMs) to multi-turn malicious coding prompts in code generation: a malicious task can be decomposed into several seemingly harmless subtasks that gradually induce the model to produce harmful code, bypassing safety filters. The key contributions are a new attack type, code decomposition attacks, and MOCHA, a large-scale benchmark for systematically evaluating model vulnerability under single-turn and multi-turn malicious prompts. Fine-tuning on MOCHA raises rejection rates on external adversarial datasets (by up to 32.4%) without sacrificing code generation ability, which supports its effectiveness for improving robustness.
Link: https://arxiv.org/abs/2507.19598
Authors: Muntasir Wahed,Xiaona Zhou,Kiet A. Nguyen,Tianjiao Yu,Nirav Diwan,Gang Wang,Dilek Hakkani-Tür,Ismini Lourentzou
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Winner Defender Team at Amazon Nova AI Challenge 2025
Abstract:Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.
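[NLP-98] Efficient Attention Mechanisms for Large Language Models: A Survey
[Quick read]: This survey addresses the quadratic time and memory complexity of self-attention in Transformer-based large language models, a bottleneck that limits efficient long-context modeling. It systematically organizes and analyzes two main families of efficient attention: (1) linear attention, which reaches linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics and thereby reduces computational overhead; and (2) sparse attention, which computes attention over only a subset of tokens chosen by fixed patterns, block-wise routing, or clustering, improving efficiency while preserving contextual coverage. The survey also examines how these mechanisms are integrated into large-scale pre-trained language models, including architectures built entirely on efficient attention and hybrid local-global designs, and combines algorithmic innovations with hardware-level considerations to support the design of scalable, efficient language models.
Link: https://arxiv.org/abs/2507.19595
Authors: Yutao Sun,Zhenyu Li,Yike Zhang,Tengyu Pan,Bowen Dong,Yuyi Guo,Jianyong Wang
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in progress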
Abstract:Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into large-scale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
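To illustrate the kernel-approximation idea behind linear attention, here is a minimal pure-Python sketch (non-causal, single head). It is not taken from the survey: the feature map `phi` (elu(x) + 1) and all function names are assumptions chosen for illustration. It shows that kernel attention can be computed either with an explicit n x n weight matrix or, by reordering the products, with one pass over the keys and values that accumulates a d x d_v summary.

```python
import math


def phi(x):
    """Assumed positive feature map elu(x) + 1, applied element-wise."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def kernel_attention_quadratic(Q, K, V):
    """Kernel attention with explicit pairwise weights: O(n^2) in sequence length."""
    fq, fk = [phi(q) for q in Q], [phi(k) for k in K]
    out = []
    for q in fq:
        w = [dot(q, k) for k in fk]          # one weight per key
        z = sum(w)                           # normalizer for this query
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out


def kernel_attention_linear(Q, K, V):
    """Same result via associativity: O(n) in sequence length."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]       # S = sum_k phi(k) v^T
    z = [0.0] * d                            # z = sum_k phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out
```

Both functions return the same values up to floating-point rounding; the second never materializes the pairwise weights, which is where the linear cost comes from.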
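[NLP-99] Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning
[Quick read]: This paper addresses hallucination of geospatial knowledge in large language models (LLMs): models generate inaccurate or inconsistent geographic information, which reduces their reliability on geospatial reasoning tasks. The key to the solution is a systematic evaluation framework built on structured geospatial knowledge graphs for controlled detection and quantification of geospatial hallucinations, followed by a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) that reduces hallucinations by improving the consistency of model outputs with real geographic knowledge. Experiments report a performance improvement of over 29.6% on the proposed benchmark.
Link: https://arxiv.org/abs/2507.19586
Authors: Shengyuan Wang,Jie Feng,Tianhui Liu,Dan Pei,Yong Li
Affiliations: Tsinghua University; Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures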
Abstract:Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.
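[NLP-100] Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri
[Quick read]: This paper addresses the limited access, difficult reuse, and insufficient semantic interoperability of knowledge resources in the Digital Humanities (DH) caused by language diversity. The key to the solution is WOKIE, an open-source, modular, ready-to-use pipeline for automated translation of SKOS thesauri. Its core idea is to combine external translation services with targeted refinement by large language models (LLMs), balancing translation quality, scalability, and cost, and thereby improving the multilingual accessibility of thesauri and cross-lingual ontology matching performance.
Link: https://arxiv.org/abs/2507.19537
Authors: Felix Kraus,Nicolas Blumenröhr,Danah Tonne,Achim Streit
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: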
Abstract:We introduce WOKIE, an open-source, modular, and ready-to-use pipeline for the automated translation of SKOS thesauri. This work addresses a critical need in the Digital Humanities (DH), where language diversity can limit access, reuse, and semantic interoperability of knowledge resources. WOKIE combines external translation services with targeted refinement using Large Language Models (LLMs), balancing translation quality, scalability, and cost. Designed to run on everyday hardware and be easily extended, the application requires no prior expertise in machine translation or LLMs. We evaluate WOKIE across several DH thesauri in 15 languages with different parameters, translation services and LLMs, systematically analysing translation quality, performance, and ontology matching improvements. Our results show that WOKIE is suitable to enhance the accessibility, reuse, and cross-lingual interoperability of thesauri by hurdle-free automated translation and improved ontology matching performance, supporting more inclusive and multilingual research infrastructures.
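[NLP-101] FedDPG: An Adaptive Yet Efficient Prompt-tuning Approach in Federated Learning Settings PAKDD’2025
[Quick read]: This paper addresses the low parameter efficiency, limited flexibility, and data privacy difficulties of pre-trained language models (PLMs) in federated learning (FL) settings. Traditional fine-tuning is computationally expensive, while existing prompt-tuning keeps prompts fixed for all inputs, which limits model adaptability; the constrained communication and computation resources of FL clients add to the challenge. The key to the solution is the Federated Dynamic Prompt Generator (FedDPG), whose core innovation is a dynamic prompt generator network that produces context-aware prompts from the input. With the PLM parameters frozen, this improves flexibility and performance while reducing the number of parameters sent over the federated network and the computation time, balancing efficiency and privacy.
Link: https://arxiv.org/abs/2507.19534
Authors: Ali Shakeri,Wei Emma Zhang,Amin Beheshti,Weitong Chen,Jian Yang,Lishan Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages; Published to PAKDD’2025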
Abstract:Pre-trained Language Models (PLMs) have demonstrated impressive performance in various NLP tasks. However, traditional fine-tuning methods for leveraging PLMs for downstream tasks entail significant computational overhead. Prompt-tuning has emerged as an efficient alternative that involves prepending a limited number of parameters to the input sequence and only updating them while the PLM’s parameters are frozen. However, this technique’s prompts remain fixed for all inputs, reducing the model’s flexibility. The Federated Learning (FL) technique has gained attention in recent years to address the growing concerns around data privacy. However, challenges such as communication and computation limitations of clients still need to be addressed. To mitigate these challenges, this paper introduces the Federated Dynamic Prompt Generator (FedDPG), which incorporates a dynamic prompt generator network to generate context-aware prompts based on the given input, ensuring flexibility and adaptability while prioritising data privacy in federated learning settings. Our experiments on three NLP benchmark datasets showcase that FedDPG outperforms the state-of-the-art parameter-efficient fine-tuning methods in terms of global model performance, and has significantly reduced the calculation time and the number of parameters to be sent through the FL network.
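The idea of sending only a small prompt generator through the FL network, while the PLM stays frozen on each client, can be sketched as follows. This is a toy illustration, not FedDPG itself: the single linear layer in `generate_prompt`, the FedAvg-style aggregation in `fedavg`, and all names are assumptions, since the abstract does not specify the generator architecture or the aggregation rule.

```python
def generate_prompt(gen_weights, x):
    """Toy dynamic prompt generator (assumed: one linear map).

    Maps an input embedding x to a prompt vector, so the prompt changes
    with the input instead of being fixed for all inputs.
    """
    return [sum(w * xi for w, xi in zip(row, x)) for row in gen_weights]


def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of the clients' generator weights.

    Only these generator weights are exchanged; the frozen PLM parameters
    never leave the clients in this sketch.
    """
    total = sum(client_sizes)
    rows, cols = len(client_weights[0]), len(client_weights[0][0])
    return [
        [
            sum(w[r][c] * n for w, n in zip(client_weights, client_sizes)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

For example, with two clients holding 1 and 3 samples, the aggregated generator is pulled three times more strongly toward the second client's weights, and `generate_prompt` then yields different prompts for different input embeddings.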
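[NLP-102] Setting The Table with Intent: Intent-aware Schema Generation and Editing for Literature Review Tables
[Quick read]: This paper addresses the challenge of organizing and comparing academic literature as its volume grows. The core problem is that current methods for generating comparison schemas for documents suffer from ambiguous evaluation and lack effective editing and refinement methods. The solution has two parts. First, the authors augment unannotated table corpora with synthesized intents to build a dataset for studying schema generation conditioned on a specific information need, which reduces ambiguity in reference-based evaluation. Second, they propose several schema editing techniques based on large language models (LLMs) that improve initially generated schemas, and show that smaller open-weight models, once fine-tuned, can be competitive with state-of-the-art prompted LLMs.
Link: https://arxiv.org/abs/2507.19521
Authors: Vishakh Padmakumar,Joseph Chee Chang,Kyle Lo,Doug Downey,Aakanksha Naik
Affiliations: New York University; AI2; Northwestern University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: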
Abstract:The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by generating schemas defining shared aspects along which to compare papers. However, progress on schema generation has been slow due to: (i) ambiguity in reference-based evaluations, and (ii) lack of editing/refinement methods. Our work is the first to address both issues. First, we present an approach for augmenting unannotated table corpora with synthesized intents and apply it to create a dataset for studying schema generation conditioned on a given information need, thus reducing ambiguity. With this dataset, we show how incorporating table intents significantly improves baseline performance in reconstructing reference schemas. Next, we propose several LLM-based schema editing techniques. We start by comprehensively benchmarking several single-shot schema generation methods, including prompted LLM workflows and fine-tuned models, showing that smaller, open-weight models can be fine-tuned to be competitive with state-of-the-art prompted LLMs. Then we demonstrate that our editing techniques can further improve schemas generated by these methods.
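[NLP-103] Advancing Mental Disorder Detection: A Comparative Evaluation of Transformer and LSTM Architectures on Social Media
[Quick read]: This paper addresses the lack of efficient, automated tools for early detection and monitoring of mental health disorders, focusing on how natural language processing (NLP) can identify signs of mental disorders in social media text such as Reddit posts. The key to the solution is a systematic comparison of transformer models (BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA) against LSTM-based approaches on mental health classification, together with a large annotated dataset to support reliable results. RoBERTa reaches F1 scores of 99.54% on the hold-out test set and 96.05% on the external test set, outperforming the LSTM models; LSTM models augmented with BERT embeddings are also highly competitive (F1 above 94% on the external test set) while using fewer computational resources. The findings highlight the effectiveness of transformer-based models for real-time, scalable mental health monitoring.
Link: https://arxiv.org/abs/2507.19511
Authors: Khalid Hasan,Jamil Saquer,Mukulika Ghosh
Affiliations: Missouri State University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The 49th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2025) (camera-ready)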
Abstract:The rising prevalence of mental health disorders necessitates the development of robust, automated tools for early detection and monitoring. Recent advances in Natural Language Processing (NLP), particularly transformer-based architectures, have demonstrated significant potential in text analysis. This study provides a comprehensive evaluation of state-of-the-art transformer models (BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA) against Long Short-Term Memory (LSTM) based approaches using different text embedding techniques for mental health disorder classification on Reddit. We construct a large annotated dataset, validating its reliability through statistical judgmental analysis and topic modeling. Experimental results demonstrate the superior performance of transformer models over traditional deep-learning approaches. RoBERTa achieved the highest classification performance, with a 99.54% F1 score on the hold-out test set and a 96.05% F1 score on the external test set. Notably, LSTM models augmented with BERT embeddings proved highly competitive, achieving F1 scores exceeding 94% on the external dataset while requiring significantly fewer computational resources. These findings highlight the effectiveness of transformer-based models for real-time, scalable mental health monitoring. We discuss the implications for clinical applications and digital mental health interventions, offering insights into the capabilities and limitations of state-of-the-art NLP methodologies in mental disorder detection.
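[NLP-104] Does AI and Human Advice Mitigate Punishment for Selfish Behavior? An Experiment on AI ethics From a Psychological Perspective
[Quick read]: This paper studies how individuals who behave selfishly after receiving AI advice are perceived and punished by others. Building on theories from social psychology and combining machine-behavior and behavioral economics approaches, the authors run a pre-registered, financially incentivized experiment in which evaluators could punish decision-makers across conditions: whether the decision-maker received AI, human, or no advice; whether the advice encouraged selfish or prosocial behavior; and how the decision-maker actually behaved. The key finding is that punishment is driven mainly by the behavior and the advice content: selfish behavior is punished more than prosocial behavior, and, for selfish behavior, punishment is harsher after prosocial advice and more lenient after selfish advice. By contrast, the advice source (AI vs. human) affects responsibility attribution (decision-makers are seen as more responsible when following AI advice) but does not change the level of punishment. The central insight is therefore that behavior and advice content, not the identity of the advisor, shape punishment decisions.
Link: https://arxiv.org/abs/2507.19487
Authors: Margarita Leib,Nils Köbis,Ivan Soraperra
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
Comments: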
Abstract:People increasingly rely on AI-advice when making decisions. At times, such advice can promote selfish behavior. When individuals abide by selfishness-promoting AI advice, how are they perceived and punished? To study this question, we build on theories from social psychology and combine machine-behavior and behavioral economic approaches. In a pre-registered, financially-incentivized experiment, evaluators could punish real decision-makers who (i) received AI, human, or no advice. The advice (ii) encouraged selfish or prosocial behavior, and decision-makers (iii) behaved selfishly or, in a control condition, behaved prosocially. Evaluators further assigned responsibility to decision-makers and their advisors. Results revealed that (i) prosocial behavior was punished very little, whereas selfish behavior was punished much more. Focusing on selfish behavior, (ii) compared to receiving no advice, selfish behavior was penalized more harshly after prosocial advice and more leniently after selfish advice. Lastly, (iii) whereas selfish decision-makers were seen as more responsible when they followed AI compared to human advice, punishment between the two advice sources did not vary. Overall, behavior and advice content shape punishment, whereas the advice source does not.
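[NLP-105] Your AI Not Your View: The Bias of LLMs in Investment Analysis
[Quick read]: This paper addresses knowledge conflicts that large language models (LLMs) face in finance because of discrepancies between pre-trained parametric knowledge and real-time market data, focusing on unreliable recommendations that arise when a deployed model's embedded preferences diverge from those of the financial institution. The key to the solution is an experimental framework that uses hypothetical scenarios with balanced and imbalanced arguments to quantify LLMs' latent preferences, and their persistence, along sector, size, and momentum. It offers the first quantitative analysis of confirmation bias in LLM-based investment analysis, finding that most models consistently prefer large-cap stocks and contrarian strategies, and that these preferences often harden into confirmation bias, with models holding to initial judgments despite counter-evidence.
Link: https://arxiv.org/abs/2507.20957
Authors: Hoyoung Lee,Junhyuk Seo,Suhwan Park,Junhyeong Lee,Wonbin Ahn,Chanyeol Choi,Alejandro Lopez-Lira,Yongjae Lee
Affiliations: UNIST; LG AI Research; LinqAlpha; University of Florida
Categories: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: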
Abstract:In finance, Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data. These conflicts become particularly problematic when LLMs are deployed in real-world investment services, where misalignment between a model’s embedded preferences and those of the financial institution can lead to unreliable recommendations. Yet little research has examined what investment views LLMs actually hold. We propose an experimental framework to investigate such conflicts, offering the first quantitative analysis of confirmation bias in LLM-based investment analysis. Using hypothetical scenarios with balanced and imbalanced arguments, we extract models’ latent preferences and measure their persistence. Focusing on sector, size, and momentum, our analysis reveals distinct, model-specific tendencies. In particular, we observe a consistent preference for large-cap stocks and contrarian strategies across most models. These preferences often harden into confirmation bias, with models clinging to initial judgments despite counter-evidence.
[NLP-106] MountainLion: A Multi-Modal LLM-Based Agent System for Interpretable and Adaptive Financial Trading
[Quick read]: This paper addresses the decision complexity in cryptocurrency trading caused by heterogeneous multi-modal data. Traditional deep learning and reinforcement learning methods can process numerical inputs but often sacrifice interpretability and depend on large training datasets. The key to the solution is MountainLion, a multi-modal, multi-agent system in which specialized large language model (LLM) agents jointly analyze textual news, candlestick charts, and trading signal charts to produce high-quality financial reports, and which supports data-driven user interaction and question answering to adjust investment strategies dynamically. A central reflection module analyzes historical trading signals and outcomes to refine the decision process continuously, improving returns while making the investment framework more interpretable and robust.
Link: https://arxiv.org/abs/2507.20474
Authors: Siyi Wu,Zhaoyang Guan,Leyi Zhao,Xinyuan Song,Xinyu Ying,Hanlin Zhang,Michele Pak,Yangfan He,Yi Xin,Jianhui Wang,Tianyu Shi
Affiliations: The University of Texas at Arlington; Northwestern University; Indiana University; Emory University; Nankai University; Xi’an University of Electronic Science and Technology; Kyoto University; University of North Carolina at Chapel Hill; Nanjing University; Tsinghua University; University of Toronto
Categories: Trading and Market Microstructure (q-fin.TR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Cryptocurrency trading is a challenging task requiring the integration of heterogeneous data from multiple modalities. Traditional deep learning and reinforcement learning approaches typically demand large training datasets and encode diverse inputs into numerical representations, often at the cost of interpretability. Recent progress in large language model (LLM)-based agents has demonstrated the capacity to process multi-modal data and support complex investment decision-making. Building on these advances, we present MountainLion, a multi-modal, multi-agent system for financial trading that coordinates specialized LLM-based agents to interpret financial data and generate investment strategies. MountainLion processes textual news, candlestick charts, and trading signal charts to produce high-quality financial reports, while also enabling modification of reports and investment recommendations through data-driven user interaction and question answering. A central reflection module analyzes historical trading signals and outcomes to continuously refine decision processes, and the system is capable of real-time report analysis, summarization, and dynamic adjustment of investment strategies. Empirical results confirm that MountainLion systematically enriches technical price triggers with contextual macroeconomic and capital flow signals, providing a more interpretable, robust, and actionable investment framework that improves returns and strengthens investor confidence.
Computer Vision
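[CV-0] Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning ICCV2025
[Quick read]: This paper addresses negative transfer caused by conflicts between tasks in Multi-Task Learning (MTL). Existing methods rely mainly on optimizer-centric loss scaling and gradient manipulation, and fail to deliver consistent gains. The key to the solution is the Rep-MTL framework, which exploits task saliency in the shared representation space to quantify interactions between task-specific optimization and shared representation learning. Using entropy-based penalization and sample-wise cross-task alignment, it explicitly promotes complementary information sharing between tasks, enabling more robust cross-task knowledge sharing without sacrificing effective training of individual tasks.
Link: https://arxiv.org/abs/2507.21049
Authors: Zedong Wang,Siyuan Li,Dan Xu
Affiliations: The Hong Kong University of Science and Technology; Zhejiang University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 (Highlight). Project page: this https URL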
Abstract:Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL’s efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.
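[CV-1] Reconstructing 4D Spatial Intelligence: A Survey
[Quick read]: This survey addresses the reconstruction of 4D spatial intelligence from visual observations, a central and challenging computer vision task, with the goal of systematically organizing existing methods and exposing their hierarchical structure. The key is a five-level progressive framework that divides existing work into: (1) reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) reconstruction of 3D scene components (e.g., objects, humans, and structures); (3) reconstruction of 4D dynamic scenes; (4) modeling of interactions among scene components; and (5) incorporation of physical laws and constraints. This layered view fills a gap left by earlier surveys, which rarely analyzed the hierarchical structure of 4D scene reconstruction, and outlines a path for future research.
Link: https://arxiv.org/abs/2507.21045
Authors: Yukang Cao,Jiahao Lu,Zhisheng Huang,Zhuowei Shen,Chengfeng Zhao,Fangzhou Hong,Zhaoxi Chen,Xin Li,Wenping Wang,Yuan Liu,Ziwei Liu
Affiliations: S-Lab, College of Computing and Data Science, Nanyang Technological University, Singapore; Intelligent Graphics Lab, The Hong Kong University of Science and Technology; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL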
Abstract:Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 – reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 – reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 – reconstruction of 4D dynamic scenes; (4) Level 4 – modeling of interactions among scene components; and (5) Level 5 – incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: this https URL.
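[CV-2] GPT-IMAGE-EDIT-1.5M: A Million-Scale GPT-Generated Image Dataset
[Quick read]: This paper addresses the barrier to open-source research on high-fidelity, instruction-guided image editing created by the proprietary nature of leading large models (such as GPT-4o) and their training data. The key to the solution is GPT-IMAGE-EDIT-1.5M, a publicly available, large-scale image-editing corpus containing more than 1.5 million high-quality triplets (instruction, source image, edited image). The corpus is built by using GPT-4o's multimodal capabilities to unify and refine three existing datasets, OmniEdit, HQ-Edit, and UltraEdit: output images are regenerated to improve visual quality and instruction alignment, and prompts are selectively rewritten to improve semantic clarity, providing a high-quality, standardized data foundation for open research.
Link: https://arxiv.org/abs/2507.21033
Authors: Yuhan Wang,Siwei Yang,Bingchen Zhao,Letian Zhang,Qing Liu,Yuyin Zhou,Cihang Xie
Affiliations: University of California, Santa Cruz; The University of Edinburgh; Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: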
Abstract:Recent advancements in large multimodal models like GPT-4o have set a new standard for high-fidelity, instruction-guided image editing. However, the proprietary nature of these models and their training data creates a significant barrier for open-source research. To bridge this gap, we introduce GPT-IMAGE-EDIT-1.5M, a publicly available, large-scale image-editing corpus containing more than 1.5 million high-quality triplets (instruction, source image, edited image). We systematically construct this dataset by leveraging the versatile capabilities of GPT-4o to unify and refine three popular image-editing datasets: OmniEdit, HQ-Edit, and UltraEdit. Specifically, our methodology involves 1) regenerating output images to enhance visual quality and instruction alignment, and 2) selectively rewriting prompts to improve semantic clarity. To validate the efficacy of our dataset, we fine-tune advanced open-source models on GPT-IMAGE-EDIT-1.5M. The empirical results are exciting, e.g., the fine-tuned FluxKontext achieves highly competitive performance across a comprehensive suite of benchmarks, including 7.24 on GEdit-EN, 3.80 on ImgEdit-Full, and 8.78 on Complex-Edit, showing stronger instruction following and higher perceptual quality while maintaining identity. These scores markedly exceed all previously published open-source methods and substantially narrow the gap to leading proprietary models. We hope the full release of GPT-IMAGE-EDIT-1.5M can help to catalyze further open research in instruction-guided image editing.
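[CV-3] Deep Learning for Skeleton Based Human Motion Rehabilitation Assessment: A Benchmark
[Quick read]: This paper addresses the lack of standardized benchmarks, consistent evaluation protocols, and reproducible methodologies in rehabilitation motion assessment, which limits progress and comparability across studies. The key contributions are: (i) Rehab-Pile, a unified archive that aggregates existing rehabilitation datasets; (ii) a general benchmarking framework for systematically evaluating deep learning methods on classification and regression tasks; and (iii) a large-scale benchmark of multiple neural network architectures, with all datasets, source code, and results released publicly to support transparency and reproducibility. The work lays a foundation for research on automated rehabilitation assessment and for the development of reliable, accessible, and personalized rehabilitation solutions.
Link: https://arxiv.org/abs/2507.21018
Authors: Ali Ismail-Fawaz,Maxime Devanne,Stefano Berretti,Jonathan Weber,Germain Forestier
Affiliations: University of Haute-Alsace
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: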
Abstract:Automated assessment of human motion plays a vital role in rehabilitation, enabling objective evaluation of patient performance and progress. Unlike general human activity recognition, rehabilitation motion assessment focuses on analyzing the quality of movement within the same action class, requiring the detection of subtle deviations from ideal motion. Recent advances in deep learning and video-based skeleton extraction have opened new possibilities for accessible, scalable motion assessment using affordable devices such as smartphones or webcams. However, the field lacks standardized benchmarks, consistent evaluation protocols, and reproducible methodologies, limiting progress and comparability across studies. In this work, we address these gaps by (i) aggregating existing rehabilitation datasets into a unified archive called Rehab-Pile, (ii) proposing a general benchmarking framework for evaluating deep learning methods in this domain, and (iii) conducting extensive benchmarking of multiple architectures across classification and regression tasks. All datasets and implementations are released to the community to support transparency and reproducibility. This paper aims to establish a solid foundation for future research in automated rehabilitation assessment and foster the development of reliable, accessible, and personalized rehabilitation solutions. The datasets, source-code and results of this article are all publicly available.
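[CV-4] Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
[Quick read]: This paper addresses the limited generalization of current facial emotion recognition systems, which are trained on predefined categories or abstract dimensional labels that reduce the rich spectrum of emotions to coarse labels or scales and restrict applicability in practice. The core of the solution is EmoCap100K, a large-scale facial emotion caption dataset with over 100,000 samples whose structured natural language descriptions capture both global affective states and fine-grained local facial behaviors. Building on it, the authors propose EmoCapCLIP, a joint global-local contrastive learning framework combined with a cross-modal guided positive mining module, which exploits multi-level caption information while accommodating semantic similarities between closely related expressions, improving facial emotion representation learning.
Link: https://arxiv.org/abs/2507.21015
Authors: Licai Sun,Xingxun Jiang,Haoyu Chen,Yante Li,Zheng Lian,Biu Liu,Yuan Zong,Wenming Zheng,Jukka M. Leppänen,Guoying Zhao
Affiliations: University of Oulu; Southeast University; University of Turku; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: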
Abstract:Current facial emotion recognition systems are predominately trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: 1) the lack of large-scale caption datasets with rich emotional semantics, and 2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples, featuring rich and structured semantic descriptions that capture both global affective states and fine-grained local facial behaviors. Building upon this dataset, we further propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module. This design facilitates the comprehensive exploitation of multi-level caption information while accommodating semantic similarities between closely related expressions. Extensive evaluations on over 20 benchmarks covering five tasks demonstrate the superior performance of our method, highlighting the promise of learning facial emotion representations from large-scale semantically rich captions. The code and data will be available at this https URL.
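As a rough illustration of what a joint global-local contrastive objective can look like, the sketch below sums two CLIP-style symmetric InfoNCE losses, one over image-to-global-caption similarities and one over image-to-local-caption similarities. This is an assumed form, not the EmoCapCLIP loss: the function names, the `temperature` and `local_weight` defaults, and the plain sum are hypothetical, and the cross-modal guided positive mining module is not modeled.

```python
import math


def info_nce(sim, temperature=0.07):
    """Row-wise InfoNCE.

    sim[i][j] is the similarity between image i and caption j;
    matched pairs are assumed to lie on the diagonal.
    """
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(sim)


def symmetric_clip_loss(sim, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE."""
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, temperature) + info_nce(sim_t, temperature))


def joint_global_local_loss(sim_global, sim_local, local_weight=1.0):
    """Assumed combination: global loss plus a weighted local loss."""
    return symmetric_clip_loss(sim_global) + local_weight * symmetric_clip_loss(sim_local)
```

In this sketch the loss is small when each image is most similar to its own caption at both granularities, and grows when mismatched pairs score higher than matched ones.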
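[CV-5] Improving Adversarial Robustness Through Adaptive Learning-Driven Multi-Teacher Knowledge Distillation
[Quick read]: This paper addresses the susceptibility of convolutional neural networks (CNNs) to adversarial attacks in computer vision: carefully crafted perturbations can cause wrong predictions, and existing adversarial training methods still struggle to achieve high accuracy and strong robustness together. The key to the solution is a multi-teacher adversarial robustness distillation method with an adaptive learning strategy that dynamically adjusts the weight of each teacher's knowledge contribution. Multiple CNN teacher models are first trained on perturbed data generated by different adversarial attacks; these pre-trained teachers then supervise a student model that learns on clean data. The adaptive strategy assigns weights according to each teacher's prediction precision, aggregating the teachers' robustness knowledge so that the student can withstand various adversarial attacks without being exposed to any adversarial samples.
Link: https://arxiv.org/abs/2507.20996
Authors: Hayat Ullah,Syed Muhammad Talha Zaidi,Arslan Munir
Affiliations: Florida Atlantic University; Kansas State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages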
Abstract:Convolutional neural networks (CNNs) excel in computer vision but are susceptible to adversarial attacks, that is, crafted perturbations designed to mislead predictions. Despite advances in adversarial training, a gap persists between model accuracy and robustness. To mitigate this issue, we present a multi-teacher adversarial robustness distillation method that uses an adaptive learning strategy. Specifically, our proposed method first trains multiple clones of a baseline CNN model using an adversarial training strategy on a pool of perturbed data acquired through different adversarial attacks. Once trained, these adversarially trained models are used as teacher models to supervise the learning of a student model on clean data using multi-teacher knowledge distillation. To ensure effective robustness distillation, we design an adaptive learning strategy that controls the knowledge contribution of each model by assigning weights according to their prediction precision. Distilling knowledge from adversarially pre-trained teacher models not only enhances the learning capabilities of the student model but also gives it the capacity to withstand different adversarial attacks, despite having no exposure to adversarial data. To verify our claims, we extensively evaluated our proposed method on the MNIST-Digits and Fashion-MNIST datasets across diverse experimental settings. The obtained results demonstrate the efficacy of our multi-teacher adversarial distillation and adaptive learning strategy in enhancing CNNs’ adversarial robustness against various adversarial attacks.
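As a rough illustration of the adaptive weighting described in this abstract, the sketch below combines several teachers' soft predictions into one distillation target. The abstract does not give the formula, so two things here are assumptions: each teacher's weight is its precision divided by the sum of all precisions, and the student is trained with a KL divergence against the weighted mixture. All function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_weights(precisions):
    """Assumed rule: each teacher's weight is its share of the total precision."""
    total = sum(precisions)
    return [p / total for p in precisions]

def distillation_loss(student_logits, teacher_logits_list, precisions, temperature=2.0):
    """KL(weighted teacher mixture || student) for a single clean sample."""
    weights = teacher_weights(precisions)
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_classes = len(student_logits)
    # Weighted average of the teachers' soft predictions, class by class.
    target = [sum(w * p[c] for w, p in zip(weights, teacher_probs))
              for c in range(n_classes)]
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)
```

With this rule a teacher that is more precise on held-out data pulls the target distribution closer to its own prediction; other normalisations (for example a softmax over precisions) would fit the abstract's description equally well.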
[CV-6] Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM
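[Quick read]: This paper addresses a cross-modal safety gap in large visual-language models (LVLMs): safety mechanisms designed for text-based large language models (LLMs) do not naturally extend to the visual modality, leaving LVLMs open to attacks via harmful image inputs. The key to the solution is trainable security tensors, input vectors applied at inference time through either the textual or the visual modality, which transfer textual safety alignment to visual processing without modifying model parameters. The tensors are optimized on a curated dataset of malicious image-text pairs, structurally similar contrastive benign samples, and general benign samples, steering the model to activate the language module's "safety layers" on visual inputs so that it identifies and rejects harmful visual content while keeping performance on normal tasks largely unchanged.
Link: https://arxiv.org/abs/2507.20994
Authors: Shen Li,Liuyi Yao,Wujia Niu,Lan Zhang,Yaliang Li
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Codes and data are available at this https URL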
Abstract:Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors - trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model’s parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs with text structurally similar to malicious queries, with the purpose of being contrastive examples to guide visual reliance, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs’ ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis towards hidden-layer representations reveals that security tensors successfully activate the language module’s textual “safety layers” in visual inputs, thereby effectively extending text-based safety to the visual modality.
[CV-7] JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 ICCV2025
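[Quick read]: This paper addresses the difficulty that current diffusion-based video generation methods have in keeping multi-modal consistency when jointly generating whole-body motion and natural speech, and notes that existing work lacks both a comprehensive audio-visual quality evaluation framework and benchmarks for region-specific performance analysis. The key to the solution is JWB-DH-V1, the first benchmark for joint generation of whole-body animatable avatars and speech, comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples and an evaluation protocol designed for joint audio-video generation quality. It reveals systematic gaps between face/hand-centric and whole-body performance in current state-of-the-art models and gives future research a quantifiable, reproducible benchmark.
Link: https://arxiv.org/abs/2507.20987
Authors: Xinhan Di,Kristin Qi,Pengqian Yu
Affiliations: Independent Researcher; University of Massachusetts Boston; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WiCV @ ICCV 2025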
Abstract:Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at this https URL.
[CV-8] LargeMvC-Net: Anchor-based Deep Unfolding Network for Large-scale Multi-view Clustering
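[Quick read]: This paper addresses the limited practical performance of existing anchor-based multi-view clustering methods, whose anchor structures lack theoretical grounding and an optimization-driven design. Current methods usually introduce anchors heuristically or in a task-agnostic way, for example through post-hoc graph construction or as auxiliary components for message passing, overlooking the core structural demands and optimization principles of anchor-based clustering. The key to the solution is to revisit the underlying optimization problem of large-scale anchor-based clustering and explicitly unfold its iterative solution into a new deep network architecture, LargeMvC-Net. The model decomposes clustering into three modules, RepresentModule (representation learning), NoiseModule (noise suppression), and AnchorModule (anchor indicator estimation), each an explicit mapping of one step of the original optimization procedure, which gives a structurally clear and optimization-traceable design. An unsupervised reconstruction loss additionally aligns each view with the anchor-induced latent space, encouraging consistent clustering structure across views and improving effectiveness and scalability.
Link: https://arxiv.org/abs/2507.20980
Authors: Shide Du,Chunming Wu,Zihan Fang,Wendi Zhao,Yilin Wu,Changwei Wang,Shiping Wang
Affiliations: Fuzhou University; Shandong Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 10 pages, 7 figures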
Abstract:Deep anchor-based multi-view clustering methods enhance the scalability of neural networks by utilizing representative anchors to reduce the computational complexity of large-scale clustering. Despite their scalability advantages, existing approaches often incorporate anchor structures in a heuristic or task-agnostic manner, either through post-hoc graph construction or as auxiliary components for message passing. Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. The proposed model decomposes the anchor-based clustering process into three modules: RepresentModule, NoiseModule, and AnchorModule, corresponding to representation learning, noise suppression, and anchor indicator estimation. Each module is derived by unfolding a step of the original optimization procedure into a dedicated network component, providing structural clarity and optimization traceability. In addition, an unsupervised reconstruction loss aligns each view with the anchor-induced latent space, encouraging consistent clustering structures across views. Extensive experiments on several large-scale multi-view benchmarks show that LargeMvC-Net consistently outperforms state-of-the-art methods in terms of both effectiveness and scalability.
[CV-9] Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision ICCV2025
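[Quick read]: This paper addresses the performance drop of aerial-imagery vehicle detectors across domains: a model trained in one geographic region suffers from distribution (domain) shift when deployed elsewhere because of differences in environmental conditions, urban layouts, road networks, vehicle types, and imaging parameters such as resolution, lighting, and angle. The key to the solution is a multi-stage, multi-modal knowledge transfer framework based on generative AI, which uses fine-tuned latent diffusion models (LDMs) to synthesize high-quality aerial images together with their labels; this data augmentation narrows the distribution gap between source and target domains and markedly improves the detector's generalization to new environments.
Link: https://arxiv.org/abs/2507.20976
Authors: Xiao Fang,Minhyek Jeon,Zheyang Qin,Stanislav Panev,Celso de Melo,Shuowen Hu,Shayok Chakraborty,Fernando De la Torre
Affiliations: Carnegie Mellon University; DEVCOM Army Research Laboratory; Florida State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025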
Abstract:Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent performance improvements in AP50 over supervised learning on source domain data, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23%, 6-10%, 7-40%, and more than 50%, respectively. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. Project page is available at: this https URL
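The gains above are reported in AP50, the average precision computed when a detection counts as correct only if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below shows just that matching criterion, not the full AP computation; the box format (x_min, y_min, x_max, y_max) and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive_at_50(predicted_box, ground_truth_box):
    """AP50 matching rule: a detection is a hit when IoU >= 0.5."""
    return iou(predicted_box, ground_truth_box) >= 0.5
```

Because vehicles in aerial imagery often cover only a few pixels, a shift of one or two pixels can move a box across the 0.5 threshold, which is one reason AP50 is sensitive to domain shift in this setting.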
[CV-10] Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder
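[Quick read]: This paper addresses gender bias in text-to-image (T2I) diffusion models, in particular their tendency to reinforce gender stereotypes when generating profession-related images. The key to the solution is SAE Debias, a lightweight, model-agnostic framework whose core idea is to use a k-sparse autoencoder pre-trained on a gender bias dataset to identify and intervene on gender-related latent directions in feature space. Specifically, the method analyzes patterns in the sparse latent space to construct a bias direction for each profession and suppresses that direction at inference time, steering generation toward more gender-balanced images without retraining or architectural changes. According to the authors, this is the first application of sparse autoencoders to gender-bias intervention in T2I models, offering a fairness tool that is both interpretable and general.
Link: https://arxiv.org/abs/2507.20973
Authors: Chao Wu,Zhenyi Wang,Kangxian Xie,Naresh Kumar Devulapally,Vishnu Suresh Lokhande,Mingchen Gao
Affiliations: University at Buffalo, USA; University of Maryland, College Park, USA
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: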
Abstract:Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
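The idea of suppressing a bias direction at inference time can be sketched with plain vector arithmetic. The abstract does not specify how the direction is built or removed, so the version below rests on assumptions: the direction is the normalised difference between the mean latents of two gendered prompt groups for one profession, and suppression subtracts the feature's component along that direction. The function names are hypothetical and are not taken from the paper's code.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def bias_direction(latents_a, latents_b):
    """Unit vector pointing from the mean of group B latents to the mean of group A latents."""
    diff = [a - b for a, b in zip(mean_vector(latents_a), mean_vector(latents_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def suppress(feature, direction, strength=1.0):
    """Remove `strength` times the component of `feature` along the unit `direction`."""
    coeff = sum(f * d for f, d in zip(feature, direction))
    return [f - strength * coeff * d for f, d in zip(feature, direction)]
```

With `strength=1.0` the feature is projected onto the subspace orthogonal to the direction; smaller values give a partial correction, which may help preserve generation quality.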
[CV-11] GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction
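[Quick read]: This paper addresses dynamic environment perception for autonomous driving and robotic systems. Existing methods rely mostly on local temporal interactions between adjacent frames and fail to exploit global temporal information in the history of observations, which leaves scene understanding incomplete. The key to the solution is GTAD (Global Temporal Aggregation Denoising Network), which uses an in-model latent denoising network to aggregate local temporal features from the current moment with global temporal features from the historical sequence. This jointly models fine-grained information from adjacent frames and global temporal patterns across time windows in 3D scene understanding, giving more coherent and comprehensive perception.
Link: https://arxiv.org/abs/2507.20963
Authors: Tianhao Li,Yang Li,Mengtian Li,Yisheng Deng,Weifeng Ge
Affiliations: Fudan University; East China Normal University; Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: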
Abstract:Accurately perceiving dynamic environments is a fundamental task for autonomous driving and robotic systems. Existing methods inadequately utilize temporal information, relying mainly on local temporal interactions between adjacent frames and failing to leverage global sequence information effectively. To address this limitation, we investigate how to effectively aggregate global temporal features from temporal sequences, aiming to achieve occupancy representations that efficiently utilize global temporal information from historical observations. For this purpose, we propose a global temporal aggregation denoising network named GTAD, introducing a global temporal information aggregation framework as a new paradigm for holistic 3D scene understanding. Our method employs an in-model latent denoising network to aggregate local temporal features from the current moment and global temporal features from historical sequences. This approach enables the effective perception of both fine-grained temporal information from adjacent frames and global temporal patterns from historical observations. As a result, it provides a more coherent and comprehensive understanding of the environment. Extensive experiments on the nuScenes and Occ3D-nuScenes benchmark and ablation studies demonstrate the superiority of our method.
[CV-12] Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation
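[Quick read]: This paper addresses three core problems caused by the masking strategy in audio-driven talking face generation: (1) information loss in the input image, which weakens the model's ability to preserve identity-related visual details; (2) differences between the identity reference image and the input image, which degrade reconstruction; and (3) interference from the identity reference, which causes copying of elements that do not match the audio. The key to the solution is a mask-free generation method: a two-step, landmark-based approach trained in an unpaired manner first edits the input images to have closed mouths, and these edited but unmasked images are then fed with the audio into a lip adaptation model that generates audio-synchronized lip movements. The method needs neither masked inputs nor identity reference images, which avoids the problems above while staying within the 2D face editing setting.
Link: https://arxiv.org/abs/2507.20953
Authors: Dogucan Yaman,Fevziye Irem Eyiokur,Leonard Bärmann,Hazım Kemal Ekenel,Alexander Waibel
Affiliations: Karlsruhe Institute of Technology; Istanbul Technical University; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: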
Abstract:Audio-Driven Talking Face Generation aims at generating realistic videos of talking faces, focusing on accurate audio-lip synchronization without deteriorating any identity-related visual details. Recent state-of-the-art methods are based on inpainting, meaning that the lower half of the input face is masked, and the model fills the masked region by generating lips aligned with the given audio. Hence, to preserve identity-related visual details from the lower half, these approaches additionally require an unmasked identity reference image randomly selected from the same video. However, this common masking strategy suffers from (1) information loss in the input faces, significantly affecting the networks’ ability to preserve visual quality and identity details, (2) variation between identity reference and input image degrading reconstruction performance, and (3) the identity reference negatively impacting the model, causing unintended copying of elements unaligned with the audio. To address these issues, we propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. Instead of masking the lower half, we transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner. Subsequently, we provide these edited but unmasked faces to a lip adaptation model alongside the audio to generate appropriate lip movements. Thus, our approach needs neither masked input images nor identity reference images. We conduct experiments on the benchmark LRS2 and HDTF datasets and perform various ablation studies to validate our contributions.
[CV-13] ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
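[Quick read]: This paper addresses the shortcomings of current large multimodal models in understanding real-world user-generated short videos (such as content on WeChat Channel and TikTok), especially the lack of temporally structured, detailed, and in-depth video comprehension, which directly affects video search, recommendation, and emerging video applications. The core challenge is that short videos combine complex visual elements, dense audio-visual information, and fast pacing centered on emotional expression and viewpoint delivery, so a model needs advanced cross-modal reasoning to integrate visual, audio, and textual information. The key to the solution is ARC-Hunyuan-Video, which processes the multimodal signals of raw video end-to-end and supports multi-granularity timestamped video captioning, summarization, open-ended video question answering, temporal grounding, and video reasoning. Trained on high-quality automatically annotated data through pre-training, instruction fine-tuning, cold start, reinforcement learning post-training, and a final instruction fine-tuning stage, the compact 7B-parameter model performs strongly on the ShortVid-Bench benchmark and supports zero-shot use or fine-tuning with a few samples. The authors report that production deployment improved user engagement and satisfaction while inference stays efficient (about 10 seconds for a one-minute video on an H20 GPU).
Link: https://arxiv.org/abs/2507.20939
Authors: Yuying Ge,Yixiao Ge,Chen Li,Teng Wang,Junfu Pu,Yizhuo Li,Lu Qiu,Jin Ma,Lisheng Duan,Xinyu Zuo,Jinwen Luo,Weibo Gu,Zexuan Li,Xiaojing Zhang,Yangyu Tao,Han Hu,Di Wang,Ying Shan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL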
Abstract:Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on H20 GPU.
[CV-14] Exploring text-to-image generation for historical document image retrieval
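[Quick read]: This paper addresses the heavy dependence of traditional query-by-example (QBE) document image retrieval (DIR) on available query samples: users often cannot supply a suitable example document. The authors propose T2I-QBE, a new method built on generative AI, whose key idea is to use a text-to-image (T2I) generation model to synthesize query images from a text prompt that describes the document type together with a list of ABDIR-style visual attributes. The generated images then serve as queries in the QBE framework, where CNN-extracted features are compared for similarity matching, so historical document images can be retrieved without an original sample.
Link: https://arxiv.org/abs/2507.20934
Authors: Melissa Cote,Alexandra Branzan Albu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and presented as an extended abstract (double-blind review process) at the 2025 Scandinavian Conference on Image Analysis (SCIA). 4 pages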
Abstract:Attribute-based document image retrieval (ABDIR) was recently proposed as an alternative to query-by-example (QBE) searches, the dominant document image retrieval (DIR) paradigm. One drawback of QBE searches is that they require sample query documents on hand that may not be available. ABDIR aims to offer users a flexible way to retrieve document images based on memorable visual features of document contents, describing document images with combinations of visual attributes determined via convolutional neural network (CNN)-based binary classifiers. We present an exploratory study of the use of generative AI to bridge the gap between QBE and ABDIR, focusing on historical documents as a use case for their diversity and uniqueness in visual features. We hypothesize that text-to-image (T2I) generation can be leveraged to create query document images using text prompts based on ABDIR-like attributes. We propose T2I-QBE, which uses this http URL as the T2I generator with prompts that include a rough description of the desired document type and a list of the desired ABDIR-style attributes. This creates query images that are then used within the traditional QBE paradigm, which compares CNN-extracted query features to those of the document images in the dataset to retrieve the most relevant documents. Experiments on the HisIR19 dataset of historical documents confirm our hypothesis and suggest that T2I-QBE is a viable option for historical document image retrieval. To the authors’ knowledge, this is the first attempt at utilizing T2I generation for DIR.
[CV-15] RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation
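[Quick read]: This paper addresses referring image segmentation (RIS) in low-altitude drone (LAD) scenarios. Existing methods and datasets mainly target high-altitude, static-view imagery and struggle with the diverse viewpoints, high object density, and small objects typical of LAD scenes. The key to the solution is RIS-LAD, the first fine-grained RIS benchmark for LAD scenarios, with 13,871 carefully annotated image-text-mask triplets, together with the Semantic-Aware Adaptive Reasoning Network (SAARN). SAARN injects semantic information into the network in stages: the Category-Dominated Linguistic Enhancement (CDLE) module aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. Together they mitigate category drift caused by tiny objects and object drift among crowded same-class objects.
Link: https://arxiv.org/abs/2507.20920
Authors: Kai Ye,YingShi Luan,Zhudi Chen,Guangyue Meng,Pingyang Dai,Liujuan Cao
Affiliations: Xiamen University; School of Informatics, Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: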
Abstract:Referring Image Segmentation (RIS), which aims to segment specific objects based on natural language descriptions, plays an essential role in vision-language understanding. Despite its progress in remote sensing applications, RIS in Low-Altitude Drone (LAD) scenarios remains underexplored. Existing datasets and methods are typically designed for high-altitude and static-view imagery. They struggle to handle the unique characteristics of LAD views, such as diverse viewpoints and high object density. To fill this gap, we present RIS-LAD, the first fine-grained RIS benchmark tailored for LAD scenarios. This dataset comprises 13,871 carefully annotated image-text-mask triplets collected from realistic drone footage, with a focus on small, cluttered, and multi-viewpoint scenes. It highlights new challenges absent in previous benchmarks, such as category drift caused by tiny objects and object drift under crowded same-class objects. To tackle these issues, we propose the Semantic-Aware Adaptive Reasoning Network (SAARN). Rather than uniformly injecting all linguistic features, SAARN decomposes and routes semantic information to different stages of the network. Specifically, the Category-Dominated Linguistic Enhancement (CDLE) aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. The experimental evaluation reveals that RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and also demonstrates the effectiveness of our proposed model in addressing these challenges. The dataset and code will be publicly released soon at: this https URL.
[CV-16] HAMLET-FFD: Hierarchical Adaptive Multi-modal Learning Embeddings Transformation for Face Forgery Detection
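[Quick read]: This paper addresses cross-domain generalization in face forgery detection: the performance of existing methods drops sharply on forgery techniques not seen during training, because conventional methods rely on simple classification objectives and struggle to learn domain-invariant representations. The key to the solution is HAMLET-FFD, a cognitively inspired hierarchical adaptive multi-modal learning framework that makes forgery detection more robust through bidirectional cross-modal reasoning. Textual authenticity embeddings guide the aggregation of hierarchical visual features, and the modulated visual features in turn refine the text embeddings to produce image-adaptive prompts. This closed iterative loop progressively aligns visual observations with semantic priors and improves authenticity assessment. The framework freezes all pretrained parameters and attaches as an external plugin to contrastive vision-language models such as CLIP, preserving the original model's capabilities while substantially improving cross-domain generalization.
Link: https://arxiv.org/abs/2507.20913
Authors: Jialei Cui,Jianwei Du,Yanzhe Li,Lei Gao,Hui Jiang,Chenfu Bao
Affiliations: Baidu Inc.; Southeast University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: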
Abstract:The rapid evolution of face manipulation techniques poses a critical challenge for face forgery detection: cross-domain generalization. Conventional methods, which rely on simple classification objectives, often fail to learn domain-invariant representations. We propose HAMLET-FFD, a cognitively inspired Hierarchical Adaptive Multi-modal Learning framework that tackles this challenge via bidirectional cross-modal reasoning. Building on contrastive vision-language models such as CLIP, HAMLET-FFD introduces a knowledge refinement loop that iteratively assesses authenticity by integrating visual evidence with conceptual cues, emulating expert forensic analysis. A key innovation is a bidirectional fusion mechanism in which textual authenticity embeddings guide the aggregation of hierarchical visual features, while modulated visual features refine text embeddings to generate image-adaptive prompts. This closed-loop process progressively aligns visual observations with semantic priors to enhance authenticity assessment. By design, HAMLET-FFD freezes all pretrained parameters, serving as an external plugin that preserves CLIP’s original capabilities. Extensive experiments demonstrate its superior generalization to unseen manipulations across multiple benchmarks, and visual analyses reveal a division of labor among embeddings, with distinct representations specializing in fine-grained artifact recognition.
[CV-17] SCORPION: Addressing Scanner-Induced Variability in Histopathology MICCAI
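[Quick read]: This paper addresses model reliability in computational pathology under differences between digital scanners: unstable performance across scanning devices can affect clinical diagnosis and treatment decisions. Existing work mostly follows the standard domain generalization setting, evaluating on scanners unseen during training, and does not directly assess consistency on the same tissue sample across scanners. To overcome this, the authors introduce the SCORPION dataset of 480 tissue samples, each scanned with 5 scanners, giving 2,400 spatially aligned patches; the scanner-paired design isolates scanner-induced variability and enables rigorous evaluation of model consistency. The key to the solution is the SimCons framework, which combines augmentation-based domain generalization techniques with a consistency loss that explicitly optimizes stability across scanners. Experiments show that it improves cross-scanner consistency without sacrificing task-specific performance.
Link: https://arxiv.org/abs/2507.20907
Authors: Jeongun Ryu,Heon Song,Seungeun Lee,Soo Ick Cho,Jiwon Shin,Kyunghyun Paeng,Sérgio Pereira
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in UNSURE 2025 workshop in MICCAI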
Abstract:Ensuring reliable model performance across diverse domains is a critical challenge in computational pathology. A particular source of variability in Whole-Slide Images is introduced by differences in digital scanners, thus calling for better scanner generalization. This is critical for the real-world adoption of computational pathology, where the scanning devices may differ per institution or hospital, and the model should not be dependent on scanner-induced details, which can ultimately affect the patient’s diagnosis and treatment planning. However, past efforts have primarily focused on standard domain generalization settings, evaluating on scanners unseen during training, without directly evaluating consistency across scanners for the same tissue. To overcome this limitation, we introduce SCORPION, a new dataset explicitly designed to evaluate model reliability under scanner variability. SCORPION includes 480 tissue samples, each scanned with 5 scanners, yielding 2,400 spatially aligned patches. This scanner-paired design allows for the isolation of scanner-induced variability, enabling a rigorous evaluation of model consistency while controlling for differences in tissue composition. Furthermore, we propose SimCons, a flexible framework that combines augmentation-based domain generalization techniques with a consistency loss to explicitly address scanner generalization. We empirically show that SimCons improves model consistency on varying scanners without compromising task-specific performance. By releasing the SCORPION dataset and proposing SimCons, we provide the research community with a crucial resource for evaluating and improving model consistency across diverse scanners, setting a new standard for reliability testing.
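A consistency loss of the kind mentioned in this abstract can be sketched as a penalty on disagreement between predictions for the same tissue patch imaged by different scanners. The abstract does not state the form SimCons uses, so the mean squared difference over scanner pairs, the weighting factor `lam`, and the function names below are all assumptions for illustration.

```python
from itertools import combinations

def consistency_loss(predictions):
    """Mean squared difference over all scanner pairs for one tissue patch.

    `predictions` holds one probability vector per scanner (at least two).
    """
    pairs = list(combinations(predictions, 2))
    total = 0.0
    for p, q in pairs:
        total += sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)
    return total / len(pairs)

def total_loss(task_loss, predictions, lam=1.0):
    """Task loss plus a weighted consistency penalty (assumed combination)."""
    return task_loss + lam * consistency_loss(predictions)
```

The scanner-paired design of SCORPION is what makes such a term measurable: with five spatially aligned patches per sample, each sample contributes ten scanner pairs.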
[CV-18] Event-Based De-Snowing for Autonomous Driving
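[Quick read]: This paper addresses the severe impact of adverse weather, particularly heavy snowfall, on vision systems. Traditional image-based de-snowing relies only on spatial information and introduces hallucination artifacts, while video-based methods suffer from alignment artifacts at low frame rates. The key to the solution is to use the sub-millisecond latency and compressed spatio-temporal information of an event camera: an attention-based module focuses on the distinctive streak signature that snowflake occlusions leave in event data, determines when a background point was occluded, and recovers its original intensity. The method markedly improves de-snowed image quality and the performance of downstream computer vision tasks such as depth estimation and optical flow.
Link: https://arxiv.org/abs/2507.20901
Authors: Manasi Muglikar,Nico Messikommer,Marco Cannici,Davide Scaramuzza
Affiliations: University of Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: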
Abstract:Adverse weather conditions, particularly heavy snowfall, pose significant challenges to both human drivers and autonomous vehicles. Traditional image-based de-snowing methods often introduce hallucination artifacts as they rely solely on spatial information, while video-based approaches require high frame rates and suffer from alignment artifacts at lower frame rates. Camera parameters, such as exposure time, also influence the appearance of snowflakes, making the problem difficult to solve and heavily dependent on network generalization. In this paper, we propose to address the challenge of desnowing by using event cameras, which offer compressed visual information with submillisecond latency, making them ideal for de-snowing images, even in the presence of ego-motion. Our method leverages the fact that snowflake occlusions appear with a very distinctive streak signature in the spatio-temporal representation of event data. We design an attention-based module that focuses on events along these streaks to determine when a background point was occluded and use this information to recover its original intensity. We benchmark our method on DSEC-Snow, a new dataset created using a green-screen technique that overlays pre-recorded snowfall data onto the existing DSEC driving dataset, resulting in precise ground truth and synchronized image and event streams. Our approach outperforms state-of-the-art de-snowing methods by 3 dB in PSNR for image reconstruction. Moreover, we show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as depth estimation and optical flow, achieving a 20% performance improvement over other de-snowing methods. Our work represents a crucial step towards enhancing the reliability and safety of vision systems in challenging winter conditions, paving the way for more robust, all-weather-capable applications.
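To put the reported 3 dB PSNR gain in context: PSNR is defined as 10·log10(MAX² / MSE), so a 3 dB improvement corresponds to roughly halving the mean squared error. The sketch below computes PSNR over flat lists of pixel values; the 8-bit default `max_value=255.0` is an assumption, as the abstract does not state the intensity range used.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Halving the MSE adds 10·log10(2) ≈ 3.01 dB regardless of `max_value`, which is why dB differences can be compared across datasets with different intensity ranges.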
[CV-19] Endoscopic Depth Estimation Based on Deep Learning: A Survey
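[Quick read]: This paper addresses the lack of a systematic survey of recent deep learning-based techniques for endoscopic depth estimation. The key to the solution is a systematic review of the literature along three core dimensions: data, methods, and applications. It covers both monocular and stereo approaches, sets out performance evaluation metrics and public datasets, categorizes representative techniques by supervision strategy and network architecture, and discusses applications in robot-assisted surgery, giving clear directions for follow-up research.
Link: https://arxiv.org/abs/2507.20881
Authors: Ke Niu,Zeyun Liu,Xue Feng,Heng Li,Kaize Shi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: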
Abstract:Endoscopic depth estimation is a critical technology for improving the safety and precision of minimally invasive surgery. It has attracted considerable attention from researchers in medical imaging, computer vision, and robotics. Over the past decade, a large number of methods have been developed. Despite the existence of several related surveys, a comprehensive overview focusing on recent deep learning-based techniques is still limited. This paper endeavors to bridge this gap by systematically reviewing the state-of-the-art literature. Specifically, we provide a thorough survey of the field from three key perspectives: data, methods, and applications, covering a range of methods including both monocular and stereo approaches. We describe common performance evaluation metrics and summarize publicly available datasets. Furthermore, this review analyzes the specific challenges of endoscopic scenes and categorizes representative techniques based on their supervision strategies and network architectures. The application of endoscopic depth estimation in the important area of robot-assisted surgery is also reviewed. Finally, we outline potential directions for future research, such as domain adaptation, real-time implementation, and enhanced model generalization, thereby providing a valuable starting point for researchers to engage with and advance the field.
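[CV-20] DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception
[Quick read]: This paper targets the limited reliability of Vision-Language Models (VLMs) in autonomous driving, which stems from myopic decision-making and passive perception and constrains long-horizon, high-level behavioral decision-making in complex environments. It proposes DriveAgent-R1 with two core innovations: (1) a Hybrid-Thinking framework that adaptively switches between efficient text-based reasoning and in-depth tool-based reasoning according to task demands, balancing decision efficiency and reliability; and (2) an Active Perception mechanism that uses a vision toolkit to proactively resolve uncertainty, so that decisions rest on trustworthy visual evidence. The agent is trained with a three-stage progressive reinforcement learning strategy, and experiments show that it outperforms leading proprietary large multimodal models such as Claude Sonnet 4.
Link: https://arxiv.org/abs/2507.20879
Authors: Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, Hang Zhao
Affiliations: Shanghai Qi Zhi Institute; LiAuto; Tongji University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: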
Abstract:Vision-Language Models (VLMs) are advancing autonomous driving, yet their potential is constrained by myopic decision-making and passive perception, limiting reliability in complex environments. We introduce DriveAgent-R1 to tackle these challenges in long-horizon, high-level behavioral decision-making. DriveAgent-R1 features two core innovations: a Hybrid-Thinking framework that adaptively switches between efficient text-based and in-depth tool-based reasoning, and an Active Perception mechanism with a vision toolkit to proactively resolve uncertainties, thereby balancing decision-making efficiency and reliability. The agent is trained using a novel, three-stage progressive reinforcement learning strategy designed to master these hybrid capabilities. Extensive experiments demonstrate that DriveAgent-R1 achieves state-of-the-art performance, outperforming even leading proprietary large multimodal models, such as Claude Sonnet 4. Ablation studies validate our approach and confirm that the agent’s decisions are robustly grounded in actively perceived visual evidence, paving a path toward safer and more intelligent autonomous systems.
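[CV-21] Not Only Grey Matter: OmniBrain for Robust Multimodal Classification of Alzheimer’s Disease ICCV2025
[Quick read]: This paper addresses the difficulty of multimodal data fusion in diagnosing Alzheimer’s disease (AD): building a clinically usable system that simultaneously offers high accuracy, generalization across datasets, robustness to missing modalities, and explainability. Existing methods cannot meet all four requirements at once, which limits their reliability in real clinical settings. The proposed OmniBrain framework integrates brain MRI, radiomics, gene expression, and clinical data in a unified model, uses cross-attention to strengthen interaction between modalities, and applies modality dropout to improve robustness to missing data. It reaches 92.2 ± 2.4% accuracy on the ANMerge dataset and 70.4 ± 2.7% on the MRI-only ADNI dataset, outperforming unimodal and prior multimodal methods, and its explainability analyses identify neuropathologically relevant brain regions and genes, which strengthens clinical trust.
Link: https://arxiv.org/abs/2507.20872
Authors: Ahmed Sharshar, Yasser Ashraf, Tameem Bakr, Salma Hassan, Hosam Elgendy, Mohammad Yaqub, Mohsen Guizani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in Third Workshop on Computer Vision for Automated Medical Diagnosis CVAMD 2025 in ICCV 2025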
Abstract:Alzheimer’s disease affects over 55 million people worldwide and is projected to more than double by 2050, necessitating rapid, accurate, and scalable diagnostics. However, existing approaches are limited because they cannot achieve clinically acceptable accuracy, generalization across datasets, robustness to missing modalities, and explainability all at the same time. This inability to satisfy all these requirements simultaneously undermines their reliability in clinical settings. We propose OmniBrain, a multimodal framework that integrates brain MRI, radiomics, gene expression, and clinical data using a unified model with cross-attention and modality dropout. OmniBrain achieves 92.2 ± 2.4% accuracy on the ANMerge dataset and generalizes to the MRI-only ADNI dataset with 70.4 ± 2.7% accuracy, outperforming unimodal and prior multimodal approaches. Explainability analyses highlight neuropathologically relevant brain regions and genes, enhancing clinical trust. OmniBrain offers a robust, interpretable, and practical solution for real-world Alzheimer’s diagnosis.
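The abstract names modality dropout as the mechanism behind robustness to missing modalities. The sketch below shows the general idea only; the modality names, the drop probability, and the "keep at least one" rule are assumptions for illustration, not details taken from OmniBrain.

```python
import random

# Hypothetical modality names, loosely following the four inputs in the abstract.
MODALITIES = ("mri", "radiomics", "gene_expression", "clinical")

def apply_modality_dropout(sample, drop_prob=0.3, rng=random):
    """Randomly hide modalities during training, always keeping at least one.

    `sample` maps modality name -> feature list. Dropped modalities become None,
    which a downstream fusion module would have to mask out.
    """
    present = [m for m in MODALITIES if sample.get(m) is not None]
    kept = [m for m in present if rng.random() >= drop_prob]
    if not kept and present:
        kept = [rng.choice(present)]  # never drop everything
    return {m: (sample.get(m) if m in kept else None) for m in MODALITIES}

rng = random.Random(0)
sample = {
    "mri": [0.1, 0.2],
    "radiomics": [1.5],
    "gene_expression": [0.7, 0.3],
    "clinical": [63.0],
}
for _ in range(3):
    masked = apply_modality_dropout(sample, drop_prob=0.5, rng=rng)
    print([m for m, v in masked.items() if v is not None])
```

Training on randomly masked inputs like these is what would let a model cope with an MRI-only dataset at test time.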
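[CV-22] Ensemble Foreground Management for Unsupervised Object Discovery ICCV2025
[Quick read]: This paper studies two core challenges in unsupervised object discovery (UOD): deciding whether a discovered region is foreground or background, and deciding when to stop discovery so as to avoid over- or under-segmentation. Existing methods rely on heuristic foreground priors and run one or a fixed number of discovery iterations; the priors are not always robust, and a fixed iteration count performs unevenly because the number of objects varies between images. The paper introduces UnionCut, a robust and well-grounded foreground prior built on min-cut and ensemble methods that detects the union of all foreground areas in an image, so that a UOD algorithm can identify foreground objects and stop once most of the foreground union has been covered. The authors also propose UnionSeg, a lightweight distilled transformer that outputs the foreground union more efficiently and accurately. Experiments show that combining UnionCut or UnionSeg with existing state-of-the-art UOD methods improves performance on single object discovery, saliency detection, and self-supervised instance segmentation benchmarks.
Link: https://arxiv.org/abs/2507.20860
Authors: Ziling Wu, Armaghan Moemeni, Praminda Caleb-Solly
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025 (Highlight)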
Abstract:Unsupervised object discovery (UOD) aims to detect and segment objects in 2D images without handcrafted annotations. Recent progress in self-supervised representation learning has led to some success in UOD algorithms. However, the absence of ground truth provides existing UOD methods with two challenges: 1) determining if a discovered region is foreground or background, and 2) knowing how many objects remain undiscovered. To address these two problems, previous solutions rely on foreground priors to distinguish if the discovered region is foreground, and conduct one or fixed iterations of discovery. However, the existing foreground priors are heuristic and not always robust, and a fixed number of discoveries leads to under or over-segmentation, since the number of objects in images varies. This paper introduces UnionCut, a robust and well-grounded foreground prior based on min-cut and ensemble methods that detects the union of foreground areas of an image, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. In addition, we propose UnionSeg, a distilled transformer of UnionCut that outputs the foreground union more efficiently and accurately. Our experiments show that by combining with UnionCut or UnionSeg, previous state-of-the-art UOD methods witness an increase in the performance of single object discovery, saliency detection and self-supervised instance segmentation on various benchmarks. The code is available at this https URL.
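The two uses of a foreground-union prior described above (a foreground test and a stopping rule) can be sketched as follows. This is a toy illustration, not the UnionCut algorithm: masks are plain pixel sets, and both thresholds are hypothetical.

```python
def discover_until_covered(candidate_masks, foreground_union, coverage_threshold=0.5):
    """Accept candidate regions until most of the foreground union is covered.

    candidate_masks: list of sets of (row, col) pixels, in discovery order.
    foreground_union: set of pixels estimated to be foreground (the prior).
    """
    accepted, covered = [], set()
    for mask in candidate_masks:
        if not mask:
            continue
        # Foreground test: most of the region must lie inside the union.
        if len(mask & foreground_union) / len(mask) <= 0.5:
            continue
        accepted.append(mask)
        covered |= mask & foreground_union
        # Stopping test: enough of the union has been explained.
        if foreground_union and len(covered) / len(foreground_union) > coverage_threshold:
            break
    return accepted

object_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
background = {(5, 5), (5, 6)}
object_b = {(3, 3), (3, 4)}
union = object_a | object_b

print(len(discover_until_covered([object_a, background, object_b], union, 0.9)))
print(len(discover_until_covered([object_a, background, object_b], union, 0.5)))
```

With the stricter coverage threshold both objects are kept and the background region is rejected; with the looser one, discovery stops after the first object, which is the kind of image-dependent stopping a fixed iteration count cannot provide.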
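[CV-23] Compositional Video Synthesis by Temporal Object-Centric Learning
[Quick read]: This paper addresses missing object-level structure and weak temporal consistency in video generation: existing object-centric methods either lack generative capability or treat video sequences holistically, ignoring explicit object-level structure. The approach learns pose-invariant object-centric slots and uses them as conditioning for pretrained diffusion models, explicitly modeling temporal dynamics. This enables high-quality, pixel-level video synthesis that keeps object identities consistent across frames and supports intuitive compositional edits such as object insertion, deletion, or replacement.
Link: https://arxiv.org/abs/2507.20855
Authors: Adil Kaan Akan, Yucel Yemez
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12+21 pages, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), currently under review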
Abstract:We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches either lack generative capabilities entirely or treat video sequences holistically, thus neglecting explicit object-level structure, our approach explicitly captures temporal dynamics by learning pose invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence, and offers intuitive compositional editing capabilities such as object insertion, deletion, or replacement, maintaining consistent object identities across frames. Extensive experiments demonstrate that our method sets new benchmarks in video generation quality and temporal consistency, outperforming previous object-centric generative methods. Although our segmentation performance closely matches state-of-the-art methods, our approach uniquely integrates this capability with robust generative performance, significantly advancing interactive and controllable video generation and opening new possibilities for advanced content creation, semantic editing, and dynamic scene understanding.
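[CV-24] S3LAM: Surfel Splatting SLAM for Geometrically Accurate Tracking and Mapping
[Quick read]: This paper addresses the trade-off between geometric accuracy and real-time performance in RGB-D SLAM (Simultaneous Localization and Mapping), in particular the difficulty of accurate mapping and stable tracking under limited viewpoints. It proposes S³LAM, a SLAM framework based on 2D surfel splatting that uses 2D Gaussian surfels as scene primitives in place of the 3D Gaussian ellipsoids used by existing 3D Gaussian Splatting (3DGS) based methods, giving a more efficient representation of high-quality geometry. The method derives camera pose Jacobians directly from the 2D surfel splatting formulation, which improves tracking convergence, and introduces an adaptive surface rendering strategy that improves mapping accuracy while maintaining computational efficiency.
Link: https://arxiv.org/abs/2507.20854
Authors: Ruoyu Fan, Yuhui Wen, Jiajia Dai, Tao Zhang, Long Zeng, Yong-jin Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 7 figures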
Abstract:We propose S³LAM, a novel RGB-D SLAM system that leverages 2D surfel splatting to achieve highly accurate geometric representations for simultaneous tracking and mapping. Unlike existing 3DGS-based SLAM approaches that rely on 3D Gaussian ellipsoids, we utilize 2D Gaussian surfels as primitives for more efficient scene representation. By focusing on the surfaces of objects in the scene, this design enables S³LAM to reconstruct high-quality geometry, benefiting both mapping and tracking. To address inherent SLAM challenges including real-time optimization under limited viewpoints, we introduce a novel adaptive surface rendering strategy that improves mapping accuracy while maintaining computational efficiency. We further derive camera pose Jacobians directly from 2D surfel splatting formulation, highlighting the importance of our geometrically accurate representation that improves tracking convergence. Extensive experiments on both synthetic and real-world datasets validate that S³LAM achieves state-of-the-art performance. Code will be made publicly available.
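[CV-25] METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models ICCV2025
[Quick read]: This paper addresses the high computational cost that redundant visual tokens impose on Multimodal Large Language Models (MLLMs) that use multiple vision encoders. Single-encoder architectures such as CLIP generalize poorly across diverse multimodal tasks, while multi-encoder fusion methods improve performance through complementary visual representations at the price of heavy computation. The paper proposes a progressive, stage-wise pruning framework, Multi-Encoder collaboraTivE tOken pRuning (METEOR): (1) in the multi-vision encoding stage, a rank-guided collaborative token assignment strategy discards redundant tokens within each encoder; (2) in the multi-vision fusion stage, cooperative pruning reduces cross-encoder redundancy; (3) in the LLM decoding stage, an adaptive token pruning mechanism adjusts the pruning ratio dynamically according to the text prompt to remove irrelevant tokens. Compared with EAGLE, METEOR uses 76% fewer visual tokens with an average performance drop of only 0.3%.
Link: https://arxiv.org/abs/2507.20842
Authors: Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025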
Abstract:Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage to further discard irrelevant tokens based on the text prompts, dynamically adjusting pruning ratios for specific task demands. To the best of our knowledge, this is the first successful attempt to achieve an efficient multi-encoder vision language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLM, METEOR reduces visual tokens by 76% with only a 0.3% performance drop on average. The code is available at this https URL.
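Each pruning stage ultimately comes down to keeping a subset of tokens ranked by some importance score. The sketch below shows only that generic top-k step; the scores are arbitrary stand-ins, and METEOR's rank-guided assignment and prompt-dependent ratios are not reproduced here.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    if len(tokens) != len(scores):
        raise ValueError("one score per token is required")
    keep = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

tokens = [f"tok{i}" for i in range(25)]
scores = [(i * 7) % 25 for i in range(25)]  # arbitrary but distinct stand-in scores
kept = prune_tokens(tokens, scores, keep_ratio=0.24)  # i.e. a 76% reduction
print(len(kept), kept)
```

A keep ratio of 0.24 mirrors the 76% reduction quoted in the abstract; in a real system the score would come from attention weights or feature rank rather than a fixed formula.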
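[CV-26] Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
[Quick read]: This paper addresses a bias in how CLIP (Contrastive Language-Image Pretraining) based few-shot classification methods are evaluated: most benchmark datasets were already seen by CLIP during pretraining, so the evaluation is partially transductive rather than a test of true inductive generalization. To obtain a more realistic assessment, the paper proposes a pipeline that uses an unlearning technique to build true inductive baselines, removing the model's knowledge of the target data so that testing no longer benefits from previously seen data. On top of this, the authors propose an improved few-shot classification method that consistently outperforms 13 recent baselines across 5880 experiments.
Link: https://arxiv.org/abs/2507.20834
Authors: Alexey Kravets, Da Chen, Vinay P. Namboodiri
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: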
Abstract:CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have shown improved performance of CLIP using few-shot examples. However, so far, all these techniques have been benchmarked using standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of the inductive generalization ability using few-shot examples. As most datasets have been seen by the CLIP model, the resultant setting can be termed as partially transductive. To solve this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, the methods show a significant drop in performance (-55% on average among 13 baselines with multiple datasets). We validate the unlearning technique using oracle baselines. An improved few-shot classification technique is proposed that consistently obtains state-of-the-art performance over 13 other recent baseline methods on a comprehensive analysis with 5880 experiments - varying the datasets, differing number of few-shot examples, unlearning setting, and with different seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and provide an improved method.
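[CV-27] SCANet: Split Coordinate Attention Network for Building Footprint Extraction ICONIP’24
[Quick read]: This paper addresses building footprint extraction from remote sensing images, in particular how to capture long-range spatial dependencies more efficiently to improve semantic feature extraction in complex urban scenes. It proposes a plug-and-play attention module, Split Coordinate Attention (SCA), which uses pooling kernels with two different spatial ranges to encode each channel along the x and y directions and applies split operations to the feature groups, modeling spatially remote interactions. SCANet, built on this module, outperforms recent state-of-the-art methods on the public WHU and Massachusetts building datasets, with IoU of 91.61% and 75.49% respectively.
Link: https://arxiv.org/abs/2507.20809
Authors: Chunshi Wang, Bin Zhao, Shuxue Ding
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICONIP’24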
Abstract:Building footprint extraction holds immense significance in remote sensing image analysis and has great value in urban planning, land use, environmental protection and disaster assessment. Despite the progress made by conventional and deep learning approaches in this field, they continue to encounter significant challenges. This paper introduces a novel plug-and-play attention module, Split Coordinate Attention (SCA), which captures spatially remote interactions by employing pooling kernels with two spatial ranges, encoding each channel along the x and y planes, and separately performing a series of split operations for each feature group, thus enabling more efficient semantic feature extraction. Inserting SCA into a 2D CNN yields SCANet, which outperforms recent SOTA methods on the public Wuhan University (WHU) Building Dataset and Massachusetts Building Dataset in terms of various metrics. In particular, SCANet achieves the best IoU on the two datasets, 91.61% and 75.49% respectively. Our code is available at this https URL.
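The headline numbers here are IoU scores. As a reminder of what that metric measures for a binary footprint mask, here is a minimal sketch; the two 3x3 masks are invented for illustration.

```python
def iou(pred, target):
    """Intersection over union for two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0  # two empty masks count as a perfect match

pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
print(f"IoU = {iou(pred, target):.3f}")  # 2 shared pixels out of 6 covered pixels
```

Because IoU penalizes both missed and spurious pixels, a footprint shifted by a single column already drops the score sharply, which is why gains of a few points on this metric are meaningful.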
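[CV-28] FantasyID: A dataset for detecting digital manipulations of ID-documents WWW
[Quick read]: This paper addresses the security threat that forged identity documents (IDs), produced by misusing generative AI, pose to widely deployed Know Your Customer (KYC) systems. It introduces a new public dataset, FantasyID, which mimics real-world IDs without tampering with legal documents and contains no generated faces or specimen watermarks; it includes diverse design styles, languages, and faces of real people. The cards were printed and captured with three different devices to form the bona fide class, and digital forgery/injection attacks were emulated with existing generative tools, producing a challenging evaluation benchmark. Experiments show that under near-practical conditions current forgery detection algorithms (such as TruFor, MMFusion, UniFD, and FatFormer) suffer high error rates on this dataset (false negative rates close to 50% at a 10% false positive rate), which confirms its difficulty and usefulness as a benchmark.
Link: https://arxiv.org/abs/2507.20808
Authors: Pavel Korshunov, Amir Mohammadi, Vidit Vidit, Christophe Ecabert, Sébastien Marcel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCB 2025; for project page, see this https URL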
Abstract:Advancements in image generation have led to the availability of easy-to-use tools for malicious actors to create forged images. These tools pose a serious threat to the widespread Know Your Customer (KYC) applications, requiring robust systems for detection of the forged Identity Documents (IDs). To facilitate the development of the detection algorithms, in this paper, we propose a novel publicly available (including commercial use) dataset, FantasyID, which mimics real-world IDs but without tampering with legal documents and, compared to previous public datasets, it does not contain generated faces or specimen watermarks. FantasyID contains ID cards with diverse design styles, languages, and faces of real people. To simulate a realistic KYC scenario, the cards from FantasyID were printed and captured with three different devices, constituting the bonafide class. We have emulated digital forgery/injection attacks that could be performed by a malicious actor to tamper with the IDs using the existing generative tools. The current state-of-the-art forgery detection algorithms, such as TruFor, MMFusion, UniFD, and FatFormer, are challenged by the FantasyID dataset. This is especially evident under evaluation conditions close to practical ones, with the operational threshold set on the validation set so that the false positive rate is 10%, leading to false negative rates close to 50% across the board on the test set. The evaluation experiments demonstrate that the FantasyID dataset is complex enough to be used as an evaluation benchmark for detection algorithms.
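The evaluation protocol described above (fix the threshold on validation data at a target false positive rate, then report error rates on the test set) can be sketched as follows. All scores are invented, and the convention that a higher score means "more likely forged" is an assumption.

```python
def threshold_at_fpr(bonafide_scores, target_fpr):
    """Pick a threshold whose false positive rate on bona fide scores is at most target_fpr.

    Convention (an assumption): a sample is flagged as forged when its score
    is strictly greater than the threshold.
    """
    ordered = sorted(bonafide_scores)
    allowed = int(target_fpr * len(ordered))  # bona fide samples we may misflag
    return ordered[max(0, len(ordered) - allowed - 1)]

def error_rates(bonafide_scores, forged_scores, threshold):
    """Return (false positive rate, false negative rate) at the given threshold."""
    fpr = sum(s > threshold for s in bonafide_scores) / len(bonafide_scores)
    fnr = sum(s <= threshold for s in forged_scores) / len(forged_scores)
    return fpr, fnr

# Made-up detector scores, for illustration only.
val_bonafide = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.55, 0.90]
test_bonafide = [0.08, 0.15, 0.25, 0.33, 0.60]
test_forged = [0.30, 0.50, 0.58, 0.70, 0.95]

threshold = threshold_at_fpr(val_bonafide, target_fpr=0.10)
print("threshold:", threshold)
print("test FPR, FNR:", error_rates(test_bonafide, test_forged, threshold))
```

The test-set false positive rate generally drifts away from the 10% target, which is one reason the abstract describes this as the condition closest to practical use.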
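[CV-29] LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
[Quick read]: This paper addresses the serious threat that the invasive spotted lanternfly (SLF) poses to agriculture and ecosystems; conventional controls such as egg scraping, pesticide spraying, and quarantines are labor-intensive, environmentally hazardous, and inadequate for long-term suppression. It proposes LanternNet, an autonomous robotic hub-and-spoke system: a central hub uses a YOLOv8 computer vision model to identify SLF, and three specialized robotic spokes handle pest neutralization, environmental monitoring, and navigation/mapping. After a 5-week deployment across multiple infested sites, SLF populations fell significantly (p < 0.01) and tree health indicators improved at most test sites; compared with conventional methods, the system offers cost and scalability advantages.
Link: https://arxiv.org/abs/2507.20800
Authors: Vinil Polepalli
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: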
Abstract:The invasive spotted lanternfly (SLF) poses a significant threat to agriculture and ecosystems, causing widespread damage. Current control methods, such as egg scraping, pesticides, and quarantines, prove labor-intensive, environmentally hazardous, and inadequate for long-term SLF suppression. This research introduces LanternNet, a novel autonomous robotic Hub-and-Spoke system designed for scalable detection and suppression of SLF populations. A central, tree-mimicking hub utilizes a YOLOv8 computer vision model for precise SLF identification. Three specialized robotic spokes perform targeted tasks: pest neutralization, environmental monitoring, and navigation/mapping. Field deployment across multiple infested sites over 5 weeks demonstrated LanternNet’s efficacy. Quantitative analysis revealed significant reductions (p < 0.01, paired t-tests) in SLF populations and corresponding improvements in tree health indicators across the majority of test sites. Compared to conventional methods, LanternNet offers substantial cost advantages and improved scalability. Furthermore, the system’s adaptability for enhanced autonomy and targeting of other invasive species presents significant potential for broader ecological impact. LanternNet demonstrates the transformative potential of integrating robotics and AI for advanced invasive species management and improved environmental outcomes.
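[CV-30] An Efficient Machine Learning Framework for Forest Height Estimation from Multi-Polarimetric Multi-Baseline SAR data
[Quick read]: This paper addresses accurate forest height estimation, which is crucial for climate change monitoring and carbon cycle assessment. Traditional methods combine multi-channel Synthetic Aperture Radar (SAR) with model-based techniques, while recent data-driven machine learning (ML) and deep learning (DL) methods show promise but often require large datasets, complex architectures, and heavy preprocessing (such as calibration or quantization). The proposed FGump framework applies gradient boosting to multi-channel SAR data, with LiDAR profiles as ground truth, and uses a small set of hand-designed features without heavy preprocessing, balancing accuracy and efficiency. Its regression formulation avoids the quantization artifacts of the classification formulation and gives continuous, more precise height estimates, and the framework outperforms existing AI-based and classical methods in accuracy and in training and inference time.
Link: https://arxiv.org/abs/2507.20798
Authors: Francesca Razzano, Wenyu Yang, Sergio Vitale, Giampaolo Ferraioli, Silvia Liberata Ullo, Gilda Schirinzi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures. This paper has been submitted to IEEE TGRS and is currently under review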
Abstract:Accurate forest height estimation is crucial for climate change monitoring and carbon cycle assessment. Synthetic Aperture Radar (SAR), particularly in multi-channel configurations, has provided support for a long time in 3D forest structure reconstruction through model-based techniques. More recently, data-driven approaches using Machine Learning (ML) and Deep Learning (DL) have enabled new opportunities for forest parameter retrieval. This paper introduces FGump, a forest height estimation framework by gradient boosting using multi-channel SAR processing with LiDAR profiles as Ground Truth(GT). Unlike typical ML and DL approaches that require large datasets and complex architectures, FGump ensures a strong balance between accuracy and computational efficiency, using a limited set of hand-designed features and avoiding heavy preprocessing (e.g., calibration and/or quantization). Evaluated under both classification and regression paradigms, the proposed framework demonstrates that the regression formulation enables fine-grained, continuous estimations and avoids quantization artifacts by resulting in more precise measurements without rounding. Experimental results confirm that FGump outperforms State-of-the-Art (SOTA) AI-based and classical methods, achieving higher accuracy and significantly lower training and inference times, as demonstrated in our results.
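The claim that regression avoids quantization artifacts can be made concrete with a toy calculation: if heights are predicted as classes, even a perfect classifier can only return one value per bin. The bin width and heights below are hypothetical, not values from the paper.

```python
def quantize(height_m, bin_width_m):
    """Map a height to the centre of its class bin, as a perfect classifier would."""
    index = int(height_m // bin_width_m)
    return (index + 0.5) * bin_width_m

true_heights = [3.2, 11.7, 18.4, 24.9, 31.1]  # hypothetical canopy heights in metres
bin_width = 5.0                               # hypothetical class width in metres

binned = [quantize(h, bin_width) for h in true_heights]
errors = [abs(b - h) for b, h in zip(binned, true_heights)]

print("classifier output:", binned)
print("worst-case error within half a bin:", max(errors) <= bin_width / 2)
```

A regressor has no such floor on its error: its output can land anywhere on the continuous height axis, so its accuracy is limited by the model and data rather than by the bin width.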
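[CV-31] Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data
[Quick read]: This paper asks whether synthetic data can deliver both high accuracy and fairness in face recognition (FR) systems. Although synthetic data is promising for scalability, privacy compliance, and bias mitigation, its generalization to real-world conditions and its fairness across demographic groups remain unclear. The authors build a balanced face dataset, FairFaceGen, using two text-to-image generators (Flux.1-dev and Stable Diffusion v3.5) combined with several identity augmentation methods (such as Arc2Face and IP-Adapters), and keep the number of identities equal across synthetic and real datasets so that comparisons on standard (LFW, AgeDB-30) and challenging (IJB-B/C) benchmarks are fair. The results indicate that synthetic data generated with SD35 shows potential for reducing racial bias, and that the number and quality of intra-class augmentations significantly affect FR accuracy and fairness, giving practical guidance for building fairer FR systems from synthetic data.
Link: https://arxiv.org/abs/2507.20782
Authors: Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, Sebastien Marcel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in IEEE International Joint Conference on Biometrics (IJCB), 2025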
Abstract:Synthetic data has emerged as a promising alternative for training face recognition (FR) models, offering advantages in scalability, privacy compliance, and potential for bias mitigation. However, critical questions remain on whether both high accuracy and fairness can be achieved with synthetic data. In this work, we evaluate the impact of synthetic data on bias and performance of FR systems. We generate a balanced face dataset, FairFaceGen, using two state-of-the-art text-to-image generators, Flux.1-dev and Stable Diffusion v3.5 (SD35), and combine them with several identity augmentation methods, including Arc2Face and four IP-Adapters. By maintaining equal identity count across synthetic and real datasets, we ensure fair comparisons when evaluating FR performance on standard (LFW, AgeDB-30, etc.) and challenging IJB-B/C benchmarks and FR bias on the Racial Faces in-the-Wild (RFW) dataset. Our results demonstrate that although synthetic data still lags behind the real datasets in generalization on IJB-B/C, demographically balanced synthetic datasets, especially those generated with SD35, show potential for bias mitigation. We also observe that the number and quality of intra-class augmentations significantly affect FR accuracy and fairness. These findings provide practical guidelines for constructing fairer FR systems using synthetic data.
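[CV-32] RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning
[Quick read]: This paper addresses the difficulty of modeling remote sensing (RS) imagery from multiple modalities and platforms in a unified way: existing vision-language models mostly rely on homogeneous data sources and are limited to basic perception tasks such as classification or captioning, so they cannot handle complex reasoning or generalize across modalities. The proposed RingMo-Agent framework (1) builds a large-scale dataset, RS-VL3M, with over 3 million image-text pairs covering optical, synthetic aperture radar (SAR), and infrared (IR) modalities from satellite and UAV platforms; (2) learns modality-adaptive representations through separated embedding layers, reducing interference between heterogeneous modalities; and (3) introduces task-specific tokens and a token-based high-dimensional hidden state decoding mechanism to unify the modeling of perception and reasoning tasks, showing strong generalization across platforms and modalities.
Link: https://arxiv.org/abs/2507.20776
Authors: Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, Kun Fu, Xian Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 6 figures, 20 tables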
Abstract:Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, they still remain limited to conventional visual perception tasks such as classification or captioning. As a result, these methods fail to serve as a unified and standalone framework capable of effectively handling RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model designed to handle multi-modal and multi-platform data that performs perception and reasoning tasks based on user textual instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset named RS-VL3M, comprising over 3 million image-text pairs, spanning optical, SAR, and infrared (IR) modalities collected from both satellite and UAV platforms, covering perception and challenging reasoning tasks; 2) learns modality adaptive representations by incorporating separated embedding layers to construct isolated features for heterogeneous modalities and reduce cross-modal interference; 3) unifies task modeling by introducing task-specific tokens and employing a token-based high-dimensional hidden state decoding mechanism designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent not only proves effective in both visual understanding and sophisticated analytical tasks, but also exhibits strong generalizability across different platforms and sensing modalities.
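[CV-33] Learning Only with Images: Visual Reinforcement Learning with Reasoning Rendering and Visual Feedback
[Quick read]: This paper addresses the heavy reliance of Multimodal Large Language Models (MLLMs) on curated image-text supervision for deep visual reasoning; such annotation is costly and hard to extend to complex scenes, which limits generalization. It proposes a closed-loop framework called Reasoning-Rendering-Visual-Feedback (RRVF), built on the "Asymmetry of Verification" principle: verifying a rendered output against a source image is easier than generating that output from scratch, and this provides the reward signal for Reinforcement Learning (RL). Through iterative reasoning, rendering, and visual feedback, the model self-corrects and invokes tools without explicit image-text pairs, and the pipeline is optimized end to end with the GRPO algorithm.
Link: https://arxiv.org/abs/2507.20766
Authors: Yang Chen, Yufan Shen, Wenxuan Huang, Shen Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Botian Shi, Yu Qiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: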
Abstract:Multimodal Large Language Models (MLLMs) have exhibited impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework termed “Reasoning-Rendering-Visual-Feedback” (RRVF), which enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the “Asymmetry of Verification” principle to train MLLMs, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL) training, reducing the reliance on the image-text supervision. Guided by the above principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform self-correction through multi-turn interactions and tool invocation, while this pipeline can be optimized by the GRPO algorithm in an end-to-end manner. Extensive experiments on image-to-code generation for data charts and web interfaces show that RRVF substantially outperforms existing open-source MLLMs and surpasses supervised fine-tuning baselines. Our findings demonstrate that systems driven by purely visual feedback present a viable path toward more robust and generalizable reasoning models without requiring explicit supervision. Code will be available at this https URL.
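To illustrate how a visual-feedback reward could feed a GRPO-style update, the sketch below computes group-relative advantages: each rollout's reward is normalized against the other rollouts sampled for the same input. The reward values are invented, the use of the population standard deviation is an assumption, and nothing here reproduces the RRVF training code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts sampled for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical visual-similarity rewards in [0, 1] for four renderings of one chart.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Rollouts whose rendering matches the source image better than the group average receive a positive advantage and are reinforced; no learned value function or text label is needed, only the comparison between rendered and source images.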
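[CV-34] ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions
[Quick read]: This paper addresses the lack of a public benchmark dataset for multimodal image registration on unmanned aerial vehicle (UAV) platforms, which severely limits the development and evaluation of advanced registration methods under real-world conditions. It presents ATR-UMMIM, the first benchmark dataset designed for this setting, which: (1) contains 7,969 triplets of raw visible, infrared, and precisely registered visible images, covering flight altitudes from 80 m to 300 m, camera angles from 0° to 75°, and all-day, all-year variation in weather and illumination; (2) uses a semi-automated annotation pipeline to provide pixel-level ground truth; (3) labels each triplet with six imaging-condition attributes so that registration robustness can be assessed systematically under realistic deployment settings; and (4) provides object-level annotations (11 object categories, about 156,000 bounding boxes) for downstream perception tasks.
Link: https://arxiv.org/abs/2507.20764
Authors: Kangcheng Bin, Chen Chen, Ting Hu, Jiahao Qi, Ping Zhong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: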
Abstract:Multimodal fusion has become a key enabler for UAV-based object detection, as each modality provides complementary cues for robust feature extraction. However, due to significant differences in resolution, field of view, and sensing characteristics across modalities, accurate registration is a prerequisite before fusion. Despite its importance, there is currently no publicly available benchmark specifically designed for multimodal registration in UAV-based aerial scenarios, which severely limits the development and evaluation of advanced registration methods under real-world conditions. To bridge this gap, we present ATR-UMMIM, the first benchmark dataset specifically tailored for multimodal image registration in UAV-based applications. This dataset includes 7,969 triplets of raw visible, infrared, and precisely registered visible images captured across diverse scenarios, including flight altitudes from 80m to 300m, camera angles from 0° to 75°, and all-day, all-year temporal variations under rich weather and illumination conditions. To ensure high registration quality, we design a semi-automated annotation pipeline to introduce reliable pixel-level ground truth to each triplet. In addition, each triplet is annotated with six imaging condition attributes, enabling benchmarking of registration robustness under real-world deployment settings. To further support downstream tasks, we provide object-level annotations on all registered images, covering 11 object categories with 77,753 visible and 78,409 infrared bounding boxes. We believe ATR-UMMIM will serve as a foundational benchmark for advancing multimodal registration, fusion, and perception in real-world UAV scenarios. The dataset can be downloaded from this https URL.
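[CV-35] KASportsFormer: Kinematic Anatomy Enhanced Transformer for 3D Human Pose Estimation on Short Sports Scene Video
[Quick read]: This paper addresses the weak performance of existing transformer-based 3D human pose estimation methods in sports scenes, particularly under motion blur, occlusion, and domain shift, and their limited ability to capture momentary key actions (such as shooting). It proposes KASportsFormer, which adds a kinematic anatomy-informed feature representation and integration module: the Bone Extractor (BoneExt) and Limb Fuser (LimbFus) modules extract and encode kinematic information about bones and limbs and fuse it in a multimodal manner, improving the model's understanding of sports actions in short videos.
Link: https://arxiv.org/abs/2507.20763
Authors: Zhuoer Yin, Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Keisuke Fujii
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures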
Abstract:Recent transformer-based approaches have demonstrated impressive performance in solving real-world 3D human pose estimation problems. Although these approaches achieve strong results on benchmark datasets, they tend to fall short in sports scenarios, where human movements are more complicated than daily-life actions and are further hindered by motion blur, occlusions, and domain shifts. Moreover, because critical motions in a sports game often finish within moments (e.g., shooting), the ability to focus on momentary actions is becoming a crucial factor in sports analysis, and current methods appear to struggle with such instantaneous scenarios. To overcome these limitations, we introduce KASportsFormer, a novel transformer-based 3D pose estimation framework for sports that incorporates a kinematic anatomy-informed feature representation and integration module, in which the inherent kinematic motion information is extracted with the Bone Extractor (BoneExt) and Limb Fuser (LimbFus) modules and encoded in a multimodal manner. This improves the capability of comprehending sports poses in short videos. We evaluate our method on two representative sports scene datasets: SportsPose and WorldPose. Experimental results show that our proposed method achieves state-of-the-art results with MPJPE errors of 58.0mm and 34.3mm, respectively. Our code and models are available at: this https URL
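For readers unfamiliar with the metric reported above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below is a minimal pure-Python illustration. The names `mpjpe` and `bone_vectors`, the toy skeleton, and its coordinates are assumptions made for this example and are not taken from the paper's code. The bone helper only shows the generic idea of deriving bone features from joints; it is not the BoneExt module.

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, gt: lists of (x, y, z) joint coordinates in the same unit (e.g. mm).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def bone_vectors(joints, parents):
    """Child-minus-parent vector for every joint that has a parent.

    parents[i] is the parent index of joint i, or -1 for the root joint.
    """
    return [
        tuple(c - p for c, p in zip(joints[i], joints[parents[i]]))
        for i in range(len(joints))
        if parents[i] >= 0
    ]

# Toy 3-joint chain with hypothetical coordinates in millimetres.
gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 12.0)]

error_mm = mpjpe(pred, gt)                      # per-joint errors: 5, 0, 12
bones = bone_vectors(gt, parents=[-1, 0, 1])    # two bones of length 100
```

Because MPJPE is a plain average, a single badly localized joint (such as a fast-moving wrist during a shot) can dominate the score, which is one reason short, fast sports clips are hard for this metric.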
[CV-36] Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry ICCV2025
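[Quick read]: This paper addresses the inability of conventional computer vision systems to determine the liquid level inside opaque containers: existing methods only recognize the surface features of objects and cannot infer hidden states such as whether a can is full. The key to its solution is a speckle-based vibration sensing system that remotely and without contact captures tiny vibrations on the surfaces of multiple sealed containers at once, together with a Transformer architecture that analyzes these vibration signals to classify the container type and estimate its internal liquid level. The method is invariant to the vibration source, remains accurate under both controlled and ambient sound sources, and generalizes to unseen container instances of known classes and to unseen liquid levels.
Link: https://arxiv.org/abs/2507.20757
Authors: Matan Kichler, Shai Bagon, Mark Sheinin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025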
Abstract:Computer vision seeks to infer a wide range of information about objects and events. However, vision systems based on conventional imaging are limited to extracting information only from the visible surfaces of scene objects. For instance, a vision system can detect and identify a Coke can in the scene, but it cannot determine whether the can is full or empty. In this paper, we aim to expand the scope of computer vision to include the novel task of inferring the hidden liquid levels of opaque containers by sensing the tiny vibrations on their surfaces. Our method provides a first-of-a-kind way to inspect the fill level of multiple sealed containers remotely, at once, without needing physical manipulation and manual weighing. First, we propose a novel speckle-based vibration sensing system for simultaneously capturing scene vibrations on a 2D grid of points. We use our system to efficiently and remotely capture a dataset of vibration responses for a variety of everyday liquid containers. Then, we develop a transformer-based approach for analyzing the captured vibrations and classifying the container type and its hidden liquid level at the time of measurement. Our architecture is invariant to the vibration source, yielding correct liquid level estimates for controlled and ambient scene sound sources. Moreover, our model generalizes to unseen container instances within known classes (e.g., training on five Coke cans of a six-pack, testing on a sixth) and fluid levels. We demonstrate our method by recovering liquid levels from various everyday containers.
[CV-37] AR-LIF: Adaptive reset leaky-integrate and fire neuron for spiking neural networks
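[Quick read]: This paper addresses the information loss caused by the conventional hard reset in Spiking Neural Networks (SNNs), and the lack of adaptivity in existing soft reset methods, which treat all neurons uniformly. The key to its solution is an adaptive reset neuron that establishes a relationship among input, output, and reset and incorporates a simple yet effective threshold adjustment strategy, improving model performance while retaining the low-power advantage.
Link: https://arxiv.org/abs/2507.20746
Authors: Zeyu Huang, Wei Meng, Quan Liu, Kun Chen, Li Ma
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: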
Abstract:Spiking neural networks possess the advantage of low energy consumption due to their event-driven nature. Compared with binary spike outputs, their inherent floating-point dynamics are more worthy of attention. The threshold level and reset mode of neurons play a crucial role in determining the number and timing of spikes. The existing hard reset method causes information loss, while the improved soft reset method adopts a uniform treatment for neurons. In response to this, this paper designs an adaptive reset neuron, establishing the correlation between input, output and reset, and integrating a simple yet effective threshold adjustment strategy. It achieves excellent performance on various datasets while maintaining the advantage of low energy consumption.
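The difference between the two baseline reset modes mentioned above can be seen in a few lines. The sketch below simulates one discrete-time leaky integrate-and-fire neuron with either a hard reset (membrane potential set to zero after a spike) or a soft reset (threshold subtracted, residual kept). It is a textbook illustration only: the function name `simulate_lif`, the decay factor, and the input values are assumptions, and the paper's adaptive reset rule is not reproduced here.

```python
def simulate_lif(inputs, threshold=1.0, decay=0.5, reset="hard"):
    """Simulate one discrete-time LIF neuron; returns (spikes, membrane trace)."""
    if reset not in ("hard", "soft"):
        raise ValueError("reset must be 'hard' or 'soft'")
    v = 0.0
    spikes, trace = [], []
    for x in inputs:
        v = decay * v + x                  # leaky integration of the input
        spike = 1 if v >= threshold else 0
        if spike:
            # Hard reset discards the overshoot above the threshold;
            # soft reset subtracts the threshold and keeps the residual.
            v = 0.0 if reset == "hard" else v - threshold
        spikes.append(spike)
        trace.append(v)
    return spikes, trace

# Hypothetical input current; the second step overshoots the threshold.
inputs = [0.6, 1.5, 0.8, 0.0, 0.7]
hard_spikes, hard_trace = simulate_lif(inputs, reset="hard")
soft_spikes, soft_trace = simulate_lif(inputs, reset="soft")
```

With these inputs the soft-reset neuron fires one more spike than the hard-reset neuron, because the overshoot from the second step carries over instead of being discarded. That discarded overshoot is the information loss the abstract refers to.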
[CV-38] Regularizing Subspace Redundancy of Low-Rank Adaptation
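[Quick read]: This paper addresses representation redundancy in Low-Rank Adaptation (LoRA) and its variants for Parameter-Efficient Transfer Learning (PETL), which arises because the projection matrices are unconstrained during training and which weakens feature adaptation in the resulting subspaces. The key to its solution is ReSoRA, which explicitly models redundancy between mapping subspaces: it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections, thereby adaptively regularizing the subspace redundancy of LoRA.
Link: https://arxiv.org/abs/2507.20745
Authors: Yue Zhu, Haiwen Diao, Shang Gao, Jiazuo Yu, Jiawen Zhu, Yunzhi Zhuge, Shuai Hao, Xu Jia, Lu Zhang, Ying Zhang, Huchuan Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 10 pages, 4 figures, Accepted by ACMMM2025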
Abstract:Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: this https URL.
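As background for the redundancy argument above, the sketch below shows the standard LoRA reparameterization, an effective weight W + (alpha / r) · B · A, together with a crude redundancy proxy: the mean absolute cosine similarity between the rows of A. The proxy is an assumption chosen for illustration and is not ReSoRA's regularizer, whose exact form is not given in the abstract. All names and matrix values are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, b, a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r = number of rows of A."""
    scale = alpha / len(a)
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wrow, drow)]
            for wrow, drow in zip(w, delta)]

def mean_abs_cosine(rows):
    """Mean |cosine similarity| over distinct row pairs (illustrative proxy only)."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    pairs = [(i, j) for i in range(len(rows)) for j in range(i + 1, len(rows))]
    if not pairs:
        return 0.0
    return sum(abs(cos(rows[i], rows[j])) for i, j in pairs) / len(pairs)

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]             # frozen weight, shape (2, 3)
b = [[1.0, 0.0], [0.0, 1.0]]                       # up-projection B, shape (2, r)
a_redundant = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]   # both rows span one direction
a_orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # rows span two directions

adapted = lora_weight(w, b, a_redundant, alpha=2.0)
redundancy_high = mean_abs_cosine(a_redundant)
redundancy_low = mean_abs_cosine(a_orthogonal)
```

In the redundant case the rank-2 adapter only ever changes one input direction, so half of its trainable parameters add nothing. A penalty on a quantity like this proxy is one simple way to discourage that, under the assumptions stated above.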
[CV-39] Implicit Counterfactual Learning for Audio-Visual Segmentation ICCV2025
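[Quick read]: This paper addresses biased cross-modal understanding in Audio-Visual Segmentation (AVS) caused by modality representation discrepancies and imbalance, especially erroneous matches in complex scenes with ambiguous visual content or interference from multiple audio sources. Its core solution is the Implicit Counterfactual Framework (ICF). The key is Multi-granularity Implicit Text (MIT), which builds a modality-shared space at the video, segment, and frame levels to narrow the modality gap and provide prior guidance. A Semantic Counterfactual (SC) mechanism learns orthogonal representations in the latent space and generates diverse counterfactual samples, mitigating knowledge preference and avoiding the biases introduced by complex functional designs or explicit modifications of text structure. In addition, Collaborative Distribution-aware Contrastive Learning (CDCL) combines factual-counterfactual and inter-modality contrasts to align representations, strengthen cohesion, and promote decoupling.
Link: https://arxiv.org/abs/2507.20740
Authors: Mingfeng Zha, Tianyu Li, Guoqing Wang, Peng Wang, Yangyang Wu, Yang Yang, Heng Tao Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025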
Abstract:Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To mitigate knowledge preference, we propose the semantic counterfactual (SC) to learn orthogonal representations in the latent space, generating diverse counterfactual samples, thus avoiding biases introduced by complex functional designs and explicit modifications of text structures or attributes. We further formulate the collaborative distribution-aware contrastive learning (CDCL), incorporating factual-counterfactual and inter-modality contrasts to align representations, promoting cohesion and decoupling. Extensive experiments on three public datasets validate that the proposed method achieves state-of-the-art performance.
[CV-40] Multi-Masked Querying Network for Robust Emotion Recognition from Incomplete Multi-Modal Physiological Signals MICCAI2025
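[Quick read]: This paper addresses two key problems in emotion recognition from physiological data: incomplete multi-modal signals, and interference from body movements and artifacts. The core of its solution is a novel Multi-Masked Querying Network (MMQ-Net) that integrates several querying mechanisms into a unified framework: modality queries reconstruct missing data from incomplete signals, category queries focus on emotional state features, and interference queries separate relevant information from noise, which improves emotion recognition when much of the data is missing.
Link: https://arxiv.org/abs/2507.20737
Authors: Geng-Xin Xu, Xiang Zuo, Ye Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: MICCAI2025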
Abstract:Emotion recognition from physiological data is crucial for mental health assessment, yet it faces two significant challenges: incomplete multi-modal signals and interference from body movements and artifacts. This paper presents a novel Multi-Masked Querying Network (MMQ-Net) to address these issues by integrating multiple querying mechanisms into a unified framework. Specifically, it uses modality queries to reconstruct missing data from incomplete signals, category queries to focus on emotional state features, and interference queries to separate relevant information from noise. Extensive experimental results demonstrate the superior emotion recognition performance of MMQ-Net compared to existing approaches, particularly under high levels of data incompleteness.
[CV-41] Style-Aware Blending and Prototype-Based Cross-Contrast Consistency for Semi-Supervised Medical Image Segmentation
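[Quick read]: This paper addresses two key problems with existing weak-strong consistency learning strategies in semi-supervised medical image segmentation: confirmation bias caused by separate training streams for labeled and unlabeled data, and insufficient use of supervisory information, which limits exploration of strong-to-weak consistency. The core of its solution is a style-aware blending and prototype-based cross-contrast consistency learning framework. First, a style-guided distribution blending module breaks the independent training streams of labeled and unlabeled data and mitigates distribution shift. Second, a prototype-level cross-contrast strategy establishes mutual supervision between weak-to-strong and strong-to-weak predictions, strengthening the model's ability to learn useful supervisory signals while suppressing noise in strong pseudo-labels.
Link: https://arxiv.org/abs/2507.20729
Authors: Chaowei Chen, Xiang Zhang, Honglie Guo, Shunfang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: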
Abstract:Weak-strong consistency learning strategies are widely employed in semi-supervised medical image segmentation to train models by leveraging limited labeled data and enforcing weak-to-strong consistency. However, existing methods primarily focus on designing and combining various perturbation schemes, overlooking the inherent potential and limitations within the framework itself. In this paper, we first identify two critical deficiencies: (1) separated training data streams, which lead to confirmation bias dominated by the labeled stream; and (2) incomplete utilization of supervisory information, which limits exploration of strong-to-weak consistency. To tackle these challenges, we propose a style-aware blending and prototype-based cross-contrast consistency learning framework. Specifically, inspired by the empirical observation that the distribution mismatch between labeled and unlabeled data can be characterized by statistical moments, we design a style-guided distribution blending module to break the independent training data streams. Meanwhile, considering the potential noise in strong pseudo-labels, we introduce a prototype-based cross-contrast strategy to encourage the model to learn informative supervisory signals from both weak-to-strong and strong-to-weak predictions, while mitigating the adverse effects of noise. Experimental results demonstrate the effectiveness and superiority of our framework across multiple medical segmentation benchmarks under various semi-supervised settings.
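The observation that distribution mismatch can be characterized by statistical moments suggests a simple way to picture "style" blending: re-normalize one sample so that its mean and standard deviation move toward those of another. The sketch below does this on flat lists of intensities. It is an AdaIN-style illustration under stated assumptions, not the authors' module: the function `blend_style`, the mixing weight `lam`, and the toy values are all hypothetical.

```python
import statistics

def blend_style(content, style, lam=0.5, eps=1e-6):
    """Shift the mean/std of `content` toward those of `style`.

    lam=0 keeps the content statistics; lam=1 adopts the style statistics.
    """
    mu_c, sd_c = statistics.fmean(content), statistics.pstdev(content)
    mu_s, sd_s = statistics.fmean(style), statistics.pstdev(style)
    mu = (1 - lam) * mu_c + lam * mu_s      # blended first moment
    sd = (1 - lam) * sd_c + lam * sd_s      # blended second moment
    return [(x - mu_c) / (sd_c + eps) * sd + mu for x in content]

labeled = [0.0, 2.0, 4.0]        # hypothetical intensities from a labeled image
unlabeled = [10.0, 14.0, 18.0]   # hypothetical intensities from an unlabeled image

fully_restyled = blend_style(labeled, unlabeled, lam=1.0)
unchanged = blend_style(labeled, unlabeled, lam=0.0)
```

Mixing statistics across the labeled and unlabeled sets in this way lets each stream see the other's intensity distribution, which is the intuition behind breaking the separated training streams.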
[CV-42] AIComposer: Any Style and Content Image Composition via Feature Integration
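[Quick read]: This paper addresses two core problems in cross-domain image composition: failures and artifacts caused by the stochasticity of diffusion models and the style gap between input images, and the heavy reliance of existing methods on text prompts, which limits practical use. The key to its solution is a cross-domain image composition method that needs no text prompts. Its main contributions are: 1) preserving the diffusion prior with a small number of backward inversion and forward denoising steps, without training the diffusion model; 2) a simple multilayer perceptron (MLP) network that fuses CLIP features, combined with a local cross-attention strategy, to control the diffusion process; 3) stable preservation of foreground content and natural style transfer without a pre-stylization network. The method improves the LPIPS score by 30.5% and the CSD metric by 18.1%, and the authors build a benchmark dataset for fair evaluation of cross-domain image composition.
Link: https://arxiv.org/abs/2507.20721
Authors: Haowen Li, Zhenfeng Fan, Zhang Wen, Zhengzhou Zhu, Yunjin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: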
Abstract:Image composition has advanced significantly with large-scale pre-trained T2I diffusion models. Despite progress in same-domain composition, cross-domain composition remains under-explored. The main challenges are the stochastic nature of diffusion models and the style gap between input images, leading to failures and artifacts. Additionally, heavy reliance on text prompts limits practical applications. This paper presents the first cross-domain image composition method that does not require text prompts, allowing natural stylization and seamless compositions. Our method is efficient and robust, preserving the diffusion prior, as it involves minor steps for backward inversion and forward denoising without training the diffuser. Our method also uses a simple multilayer perceptron network to integrate CLIP features from foreground and background, manipulating diffusion with a local cross-attention strategy. It effectively preserves foreground content while enabling stable stylization without a pre-stylization network. Finally, we create a benchmark dataset with diverse contents and styles for fair evaluation, addressing the lack of testing datasets for cross-domain image composition. Our method outperforms state-of-the-art techniques in both qualitative and quantitative evaluations, significantly improving the LPIPS score by 30.5% and the CSD metric by 18.1%. We believe our method will advance future research and applications. Code and benchmark at this https URL.
[CV-43] Automatic camera orientation estimation for a partially calibrated camera above a plane with a line at known planar distance
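[Quick read]: This paper addresses how to estimate the roll and pitch angles of a camera mounted above a plane from limited scene information under partial calibration. The key to its solution is as follows: assuming known camera intrinsics and a fixed height between the camera and the observed plane, it detects a planar reference line at a known distance (such as the edge between a floor and a wall) and combines inverse projection geometry with lens distortion correction to estimate roll and pitch. The method suits scenarios where full calibration is impractical and offers a lightweight orientation estimation option for multi-camera systems in constrained environments.
Link: https://arxiv.org/abs/2507.20689
Authors: Gergely Dinya, Anna Gelencsér-Horváth
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: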
Abstract:We present a derivation for estimating the roll and pitch orientation of a partially calibrated camera mounted above a planar surface, using minimal scene information. Specifically, we assume known intrinsic parameters and a fixed height between the camera and the observed plane. By detecting a single straight reference line at a known planar distance – such as the edge between a floor and a wall – we estimate the roll and pitch angles via inverse projection geometry. The method leverages geometric constraints and the camera model, including lens distortion correction. This approach is suitable for scenarios where full calibration is impractical and offers a lightweight alternative for multi-camera systems operating in constrained environments.
[CV-44] Lightweight Transformer-Driven Segmentation of Hotspots and Snail Trails in Solar PV Thermal Imagery
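[Quick read]: This paper addresses the accurate detection of defects such as hotspots and snail trails in photovoltaic (PV) modules, which matters for the energy efficiency and reliability of PV power systems. Its key solution is a lightweight semantic segmentation model based on the SegFormer architecture, with a customized Transformer encoder and a streamlined decoder that keep computation efficient while maintaining high accuracy. The model is combined with an image preprocessing pipeline (CLAHE contrast enhancement, denoising, and normalization) and fine-tuned on manually annotated data from 277 drone-captured thermal infrared images. It outperforms the U-Net, DeepLabV3, PSPNet, and Mask2Former baselines, particularly on small and irregular defects, and its lightweight design supports real-time deployment on edge devices and integration with drone-based inspection systems for automated defect identification in large-scale solar farms.
Link: https://arxiv.org/abs/2507.20680
Authors: Deepak Joshi, Mayukha Pal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 6 figures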
Abstract:Accurate detection of defects such as hotspots and snail trails in photovoltaic modules is essential for maintaining energy efficiency and system reliability. This work presents a supervised deep learning framework for segmenting thermal infrared images of PV panels, using a dataset of 277 aerial thermographic images captured by a Zenmuse XT infrared camera mounted on a DJI Matrice 100 drone. The preprocessing pipeline includes image resizing, CLAHE-based contrast enhancement, denoising, and normalisation. A lightweight semantic segmentation model based on SegFormer is developed, featuring a customised Transformer encoder and streamlined decoder, and fine-tuned on annotated images with manually labeled defect regions. To evaluate performance, we benchmark our model against U-Net, DeepLabV3, PSPNet, and Mask2Former using consistent preprocessing and augmentation. Evaluation metrics include per-class Dice score, F1-score, Cohen's kappa, mean IoU, and pixel accuracy. The SegFormer-based model outperforms baselines in accuracy and efficiency, particularly for segmenting small and irregular defects. Its lightweight design enables real-time deployment on edge devices and seamless integration with drone-based systems for automated inspection of large-scale solar farms.
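Two of the metrics listed above, per-class Dice score and IoU, can be computed directly from true-positive, false-positive, and false-negative pixel counts. The sketch below works on flat label sequences. The function name `dice_and_iou`, the class ids, and the toy labels are assumptions for illustration and do not come from the paper.

```python
def dice_and_iou(pred, target, cls):
    """Per-class Dice score and IoU for two flat label sequences of equal length."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(p == cls and t == cls for p, t in zip(pred, target))
    fp = sum(p == cls and t != cls for p, t in zip(pred, target))
    fn = sum(p != cls and t == cls for p, t in zip(pred, target))
    union = tp + fp + fn
    # Convention used here: a class absent from both maps scores 1.0.
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    iou = tp / union if union else 1.0
    return dice, iou

# Hypothetical labels: 0 = background, 1 = hotspot, 2 = snail trail.
target = [0, 1, 1, 2, 0, 1]
pred = [0, 1, 0, 2, 0, 1]

hotspot_dice, hotspot_iou = dice_and_iou(pred, target, cls=1)
trail_dice, trail_iou = dice_and_iou(pred, target, cls=2)
```

For a single class treated as a binary problem, the Dice score equals the F1-score, and Dice is never lower than IoU. Small defects are penalized heavily by both metrics, because one missed pixel is a large fraction of the region.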
[CV-45] A Multimodal Architecture for Endpoint Position Prediction in Team-based Multiplayer Games
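[Quick read]: This paper addresses the prediction of future player positions in multiplayer games, which is important for applications such as player-mimicking bot navigation, preemptive bot control, strategy recommendation, and real-time player behavior analytics. The key to its solution is a multimodal architecture that uses a U-Net-based model to generate endpoint location probability heatmaps, conditioned on a multimodal feature encoder. A multi-head attention mechanism enables information exchange between different feature groups, so that image inputs, numerical and categorical features, and dynamic game data are fused efficiently, providing a foundation for downstream tasks that depend on future player positions, such as player-predictive bot behavior or player anomaly detection.
Link: https://arxiv.org/abs/2507.20670
Authors: Jonas Peche, Aliaksei Tsishurou, Alexander Zap, Guenter Wallner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: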
Abstract:Understanding and predicting player movement in multiplayer games is crucial for achieving use cases such as player-mimicking bot navigation, preemptive bot control, strategy recommendation, and real-time player behavior analytics. However, the complex environments allow for a high degree of navigational freedom, and the interactions and team-play between players require models that make effective use of the available heterogeneous input data. This paper presents a multimodal architecture for predicting future player locations on a dynamic time horizon, using a U-Net-based approach for calculating endpoint location probability heatmaps, conditioned using a multimodal feature encoder. The application of a multi-head attention mechanism for different groups of features allows for communication between agents. In doing so, the architecture makes efficient use of the multimodal game state including image inputs, numerical and categorical features, as well as dynamic game data. Consequently, the presented technique lays the foundation for various downstream tasks that rely on future player positions such as the creation of player-predictive bot behavior or player anomaly detection.
[CV-46] Self-Supervised Continuous Colormap Recovery from a 2D Scalar Field Visualization without a Legend IEEE-VIS2025
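[Quick read]: This paper addresses the recovery of a continuous colormap from a single 2D scalar field visualization, especially when no color legend is available. The core challenge is to accurately separate the color mapping from the underlying data while ensuring that the recovered colormap is smooth and has the correct color order. The key to its solution is a decoupling-and-reconstruction strategy: a decoupling module first separates the input visualization into a latent colormap and the underlying data, and a differentiable color-mapping module then reconstructs the visualization. A reconstruction loss enforces strong correlation between colormap and data during training and serves as a self-supervised optimizer for fine-tuning on unseen visualizations at inference time. In addition, a compact colormap representation based on cubic B-spline curves and a color order loss ensure the smoothness and correct ordering of the extracted colormap.
Link: https://arxiv.org/abs/2507.20632
Authors: Hongxu Liu, Xinyu Chen, Haoyang Zheng, Manyi Li, Zhenfan Liu, Fumeng Yang, Yunhai Wang, Changhe Tu, Qiong Zeng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Submitted to IEEE VIS 2025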
Abstract:Recovering a continuous colormap from a single 2D scalar field visualization can be quite challenging, especially in the absence of a corresponding color legend. In this paper, we propose a novel colormap recovery approach that extracts the colormap from a color-encoded 2D scalar field visualization by simultaneously predicting the colormap and underlying data using a decoupling-and-reconstruction strategy. Our approach first separates the input visualization into colormap and data using a decoupling module, then reconstructs the visualization with a differentiable color-mapping module. To guide this process, we design a reconstruction loss between the input and reconstructed visualizations, which serves both as a constraint to ensure strong correlation between colormap and data during training, and as a self-supervised optimizer for fine-tuning the predicted colormap of unseen visualizations during inferencing. To ensure smoothness and correct color ordering in the extracted colormap, we introduce a compact colormap representation using cubic B-spline curves and an associated color order loss. We evaluate our method quantitatively and qualitatively on a synthetic dataset and a collection of real-world visualizations from the VIS30K dataset. Additionally, we demonstrate its utility in two prototype applications – colormap adjustment and colormap transfer – and explore its generalization to visualizations with color legends and ones encoded using discrete color palettes.
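To make the compact colormap representation more concrete, the sketch below evaluates a uniform cubic B-spline through a short list of RGB control points. A handful of control points then defines a smooth, continuous colormap over u in [0, 1]. The uniform, unclamped parameterization, the function name `bspline_colormap`, and the control points are assumptions for illustration; the abstract does not state which B-spline variant the authors use.

```python
def bspline_colormap(control_points, u):
    """Evaluate a uniform cubic B-spline colormap at u in [0, 1].

    control_points: list of at least four (r, g, b) tuples.
    """
    n = len(control_points)
    if n < 4:
        raise ValueError("a cubic B-spline needs at least four control points")
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    s = u * (n - 3)                 # position measured in spline segments
    i = min(int(s), n - 4)          # index of the active segment
    t = s - i                       # local parameter within the segment
    basis = (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )
    return tuple(
        sum(b * control_points[i + j][ch] for j, b in enumerate(basis))
        for ch in range(3)
    )

# Hypothetical grey ramp: control point k is the colour (k, k, k).
ramp = [(float(k),) * 3 for k in range(5)]
start = bspline_colormap(ramp, 0.0)
middle = bspline_colormap(ramp, 0.5)
end = bspline_colormap(ramp, 1.0)
```

The four basis weights always sum to one, so every output colour is a weighted average of four neighbouring control points. That gives the curve its built-in smoothness, and it also means an unclamped spline does not reach its first and last control points.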
[CV-47] TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
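[Quick read]: This paper addresses the high computational cost of Large Vision-Language Models (LVLMs) at inference time caused by the large number of visual tokens. Existing attention-based token pruning methods have limitations such as positional bias and struggle to identify the truly important tokens. The key to its solution is a new perspective on token importance based on token transitions inside the model, from which the authors design TransPrune, a training-free and efficient pruning method. It assesses token importance by combining Token Transition Variation (TTV, which measures changes in the magnitude and direction of token representations) with Instruction-Guided Attention (IGA, which measures how strongly the instruction attends to image tokens), reducing inference TFLOPs by more than half while maintaining multimodal performance comparable to the original LVLM.
Link: https://arxiv.org/abs/2507.20630
Authors: Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: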
Abstract:Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at this https URL.
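The idea of scoring a token by how much its representation changes across a layer can be sketched without any model. Below, a token's score is its relative change in norm plus one minus the cosine similarity between its input and output vectors, and the top-scoring tokens are kept. This particular formula, the function names, and the toy vectors are assumptions for illustration. The abstract says only that TTV measures changes in magnitude and direction, so this should not be read as the paper's definition, and the IGA term is omitted entirely.

```python
import math

def transition_score(h_in, h_out, eps=1e-12):
    """Hypothetical score: relative norm change plus (1 - cosine similarity)."""
    n_in, n_out = math.hypot(*h_in), math.hypot(*h_out)
    dot = sum(a * b for a, b in zip(h_in, h_out))
    magnitude_change = abs(n_out - n_in) / (n_in + eps)
    direction_change = 1.0 - dot / (n_in * n_out + eps)
    return magnitude_change + direction_change

def keep_top_tokens(before, after, keep):
    """Indices of the `keep` tokens with the largest score, in original order."""
    scores = [transition_score(a, b) for a, b in zip(before, after)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Toy 2-D token representations before and after one layer.
before = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
after = [(1.0, 0.0),   # unchanged
         (0.0, 1.0),   # rotated by 90 degrees
         (3.0, 0.0),   # scaled by a factor of 3
         (1.1, 0.0)]   # almost unchanged

kept = keep_top_tokens(before, after, keep=2)
```

Unlike attention-based criteria, a score of this kind depends only on each token's own hidden states, so it does not depend on where the token sits in the sequence.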
[CV-48] DAMS: Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection
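[Quick read]: This paper addresses several challenges in video anomaly detection, including multiscale temporal dependencies, visual-semantic heterogeneity, and the scarcity of labeled data. Its core solution is a dual-path architecture, the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which models anomalies efficiently through multilevel feature decoupling and fusion. The main path combines the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM) to reconstruct temporal features with dynamic weighting and to strengthen channel and spatial attention. A parallel path introduces CLIP's contrastive language-visual pre-training paradigm and uses cross-modal semantic alignment and a multiscale instance selection mechanism to provide high-order semantic guidance, forming a complete inference chain from low-level spatio-temporal features to high-level semantic concepts. The orthogonal complementarity of the two paths and the information fusion mechanism together improve the representation and identification of anomalous events.
Link: https://arxiv.org/abs/2507.20629
Authors: Dezhi An, Wenqiang Liu, Kefan Wang, Zening chen, Jun Lu, Shengcai Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures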
Abstract:The goal of video anomaly detection is tantamount to performing spatio-temporal localization of abnormal events in the video. The multiscale temporal dependencies, visual-semantic heterogeneity, and the scarcity of labeled data exhibited by video anomalies collectively present a challenging research problem in computer vision. This study offers a dual-path architecture called the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which is based on multilevel feature decoupling and fusion, enabling efficient anomaly detection modeling by integrating hierarchical feature learning and complementary information. The main processing path of this framework integrates the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM). AMTPN enables multigrained representation and dynamically weighted reconstruction of temporal features through a three-level cascade structure (time pyramid pooling, adaptive feature fusion, and temporal context enhancement). CBAM maximizes the entropy distribution of feature channels and spatial dimensions through dual attention mapping. Simultaneously, the parallel path driven by CLIP introduces a contrastive language-visual pre-training paradigm. Cross-modal semantic alignment and a multiscale instance selection mechanism provide high-order semantic guidance for spatio-temporal features. This creates a complete inference chain from the underlying spatio-temporal features to high-level semantic concepts. The orthogonal complementarity of the two paths and the information fusion mechanism jointly construct a comprehensive representation and identification capability for anomalous events. Extensive experimental results on the UCF-Crime and XD-Violence benchmarks establish the effectiveness of the DAMS framework.
[CV-49] Lightweight Remote Sensing Scene Classification on Edge Devices via Knowledge Distillation and Early-exit
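[Quick read]: This paper addresses the difficulty of jointly optimizing accuracy, inference latency, and energy consumption when lightweight deep neural networks (DNNs) perform remote sensing scene classification (RSSC) on resource-constrained edge devices. The key to its solution is a lightweight framework named E3C with two components: frequency domain distillation, which compresses the global filter network (GFNet) model to reduce its size, and a dynamic early-exit mechanism for edge devices, which decides from intermediate-layer features whether to terminate inference early and thereby improves inference efficiency. Experiments on several edge devices and datasets show an average 1.3x inference speedup and more than 40% better energy efficiency while maintaining high classification accuracy.
Link: https://arxiv.org/abs/2507.20623
Authors: Yang Zhao, Shusheng Li, Xueshang Feng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, to be published in ACM Multimedia 2025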
Abstract:With the development of lightweight deep learning algorithms, various deep neural network (DNN) models have been proposed for the remote sensing scene classification (RSSC) application. However, it is still challenging for these RSSC models to achieve an optimal balance among model accuracy, inference latency, and energy consumption on resource-constrained edge devices. In this paper, we propose a lightweight RSSC framework, which includes a distilled global filter network (GFNet) model and an early-exit mechanism designed for edge devices to achieve state-of-the-art performance. Specifically, we first apply frequency domain distillation on the GFNet model to reduce model size. Then we design a dynamic early-exit model tailored for DNN models on edge devices to further improve model inference efficiency. We evaluate our E3C model on three edge devices across four datasets. Extensive experimental results show that it achieves an average of 1.3x speedup on model inference and over 40% improvement on energy efficiency, while maintaining high classification accuracy.
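A common way to realize an early-exit mechanism is to attach a small classifier to intermediate stages and stop as soon as one of them is confident enough. The sketch below implements that generic rule with a softmax confidence threshold. The threshold criterion, the function names, and the toy logits are assumptions for illustration; the abstract does not specify the exit rule used by E3C. Each exit is passed as a callable so that deeper stages are never evaluated once an earlier exit fires.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(exits, threshold=0.9):
    """Run exits in order; return (predicted class, index of the exit used).

    exits: zero-argument callables returning logits, shallowest first.
    The final exit is always accepted, whatever its confidence.
    """
    if not exits:
        raise ValueError("at least one exit is required")
    for idx, compute_logits in enumerate(exits):
        probs = softmax(compute_logits())
        confidence = max(probs)
        if confidence >= threshold or idx == len(exits) - 1:
            return probs.index(confidence), idx

calls = []  # records which exits were actually evaluated

def make_exit(name, logits):
    def run():
        calls.append(name)
        return logits
    return run

exits = [
    make_exit("shallow", [0.1, 0.2, 0.0]),   # near-uniform: not confident
    make_exit("middle", [5.0, 0.0, 0.0]),    # confident in class 0
    make_exit("final", [9.0, 0.0, 0.0]),     # never reached in this example
]
predicted_class, exit_used = early_exit_predict(exits, threshold=0.9)
```

The threshold trades accuracy against latency and energy: a lower value lets more inputs leave at shallow exits, which saves computation but accepts less certain predictions.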
[CV-50] Complementarity-driven Representation Learning for Multi-modal Knowledge Graph Completion
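[Quick read]: This paper addresses the limited robustness of entity representations in Multi-modal Knowledge Graph Completion (MMKGC) caused by imbalanced modality distributions, and in particular the tendency of existing methods to rely on attention or gate-based fusion while overlooking the inherent complementarity of multi-modal data. The key to its solution is a new framework, Mixture of Complementary Modality Experts (MoCME), with two modules: Complementarity-guided Modality Knowledge Fusion (CMKF), which exploits intra-modal and inter-modal complementarity to strengthen entity embeddings, and Entropy-guided Negative Sampling (EGNS), which dynamically selects informative, high-uncertainty negative samples to improve training effectiveness and model robustness.
Link: https://arxiv.org/abs/2507.20620
Authors: Lijian Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: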
Abstract:Multi-modal Knowledge Graph Completion (MMKGC) aims to uncover hidden world knowledge in multimodal knowledge graphs by leveraging both multimodal and structural entity information. However, the inherent imbalance in multimodal knowledge graphs, where modality distributions vary across entities, poses challenges in utilizing additional modality data for robust entity representation. Existing MMKGC methods typically rely on attention or gate-based fusion mechanisms but overlook complementarity contained in multi-modal data. In this paper, we propose a novel framework named Mixture of Complementary Modality Experts (MoCME), which consists of a Complementarity-guided Modality Knowledge Fusion (CMKF) module and an Entropy-guided Negative Sampling (EGNS) mechanism. The CMKF module exploits both intra-modal and inter-modal complementarity to fuse multi-view and multi-modal embeddings, enhancing representations of entities. Additionally, we introduce an Entropy-guided Negative Sampling mechanism to dynamically prioritize informative and uncertain negative samples to enhance training effectiveness and model robustness. Extensive experiments on five benchmark datasets demonstrate that our MoCME achieves state-of-the-art performance, surpassing existing approaches.
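One way to picture prioritizing "informative and uncertain" negatives is to rank candidate corrupted triples by the entropy of the model's prediction for each one and keep the most uncertain. The sketch below uses binary entropy for that ranking. It is a hypothetical illustration: the scoring interface, the function names, and the candidate probabilities are assumptions, and the paper's EGNS mechanism may be defined differently.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_uncertain_negatives(candidates, k):
    """Return the k candidate names whose predictions have the highest entropy.

    candidates: list of (name, p), where p is the model's probability that
    the corrupted triple is true.
    """
    ranked = sorted(candidates, key=lambda c: binary_entropy(c[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical tail-entity corruptions with made-up model probabilities.
candidates = [("Paris", 0.02), ("Lyon", 0.45), ("Berlin", 0.60), ("Mars", 0.001)]
hard_negatives = select_uncertain_negatives(candidates, k=2)
```

Negatives the model already rejects with near certainty contribute almost no gradient, so concentrating on high-entropy candidates is a simple way to spend training effort where the model is still undecided.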
[CV-51] Enhanced Deep Learning DeepFake Detection Integrating Handcrafted Features
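[Quick read]: This paper addresses the security threat that deepfake and face swap technologies pose to digital identity verification and onboarding, and in particular the poor generalization of conventional detection methods to sophisticated facial manipulations. The key to its solution is an enhanced deep-learning detection framework that combines handcrafted frequency-domain features with conventional RGB inputs, exploiting the frequency-domain and spatial-domain artifacts introduced during image manipulation to give the classifier richer and more discriminative information and thus improve detection performance.
Link: https://arxiv.org/abs/2507.20608
Authors: Alejandro Hinke-Navarro, Mario Nieto-Hidalgo, Juan M. Espin, Juan E. Tapia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: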
Abstract:The rapid advancement of deepfake and face swap technologies has raised significant concerns in digital security, particularly in identity verification and onboarding processes. Conventional detection methods often struggle to generalize against sophisticated facial manipulations. This study proposes an enhanced deep-learning detection framework that combines handcrafted frequency-domain features with conventional RGB inputs. This hybrid approach exploits frequency and spatial domain artifacts introduced during image manipulation, providing richer and more discriminative information to the classifier. Several frequency handcrafted features were evaluated, including the Steganalysis Rich Model, Discrete Cosine Transform, Error Level Analysis, Singular Value Decomposition, and Discrete Fourier Transform
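Of the handcrafted features listed above, the Discrete Cosine Transform is the easiest to demonstrate in isolation. The sketch below computes an orthonormal 1D DCT-II and a simple summary statistic, the share of energy in the upper half of the spectrum. The statistic `high_frequency_ratio` and the toy signals are assumptions for illustration and are not the feature design used in the paper, which operates on images rather than short 1D signals.

```python
import math

def dct2(signal):
    """Orthonormal 1D DCT-II of a sequence of floats."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(
            x * math.cos(math.pi / n * (i + 0.5) * k)
            for i, x in enumerate(signal)
        ))
    return out

def high_frequency_ratio(signal):
    """Share of spectral energy in the upper half of the DCT coefficients."""
    coeffs = dct2(signal)
    total = sum(c * c for c in coeffs)
    if total == 0:
        return 0.0
    high = sum(c * c for c in coeffs[len(coeffs) // 2:])
    return high / total

smooth = [1.0, 1.0, 1.0, 1.0]       # flat row of pixels
checker = [1.0, -1.0, 1.0, -1.0]    # rapidly alternating row

smooth_ratio = high_frequency_ratio(smooth)
checker_ratio = high_frequency_ratio(checker)
```

The flat signal puts essentially all of its energy in the first coefficient, while the alternating signal puts most of it in the upper half. Frequency features of this kind are useful to a detector when manipulation leaves traces in the spectrum that are hard to see in RGB space.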
[CV-52] Harnessing Diffusion-Yielded Score Priors for Image Restoration
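[Quick read]: This paper addresses the challenges deep image restoration models face in removing degradation, generating realistic details, and ensuring pixel-level consistency, which leave existing approaches (MSE-based, GAN-based, and diffusion-based) unable to balance restoration quality, fidelity, and speed. The key to its solution is a new method, HYPIR: initialize the restoration model from a pre-trained diffusion model and then fine-tune it with adversarial training, without diffusion loss, iterative sampling, or additional adapters. The theoretical analysis indicates that this initialization places the initial restoration model very close to the natural image distribution, which improves numerical stability, avoids mode collapse, and substantially speeds up the convergence of adversarial training. HYPIR also inherits the rich user control of diffusion models, supporting text-guided restoration and adjustable texture richness, and needs only a single forward pass at inference; the authors report that it outperforms previous state-of-the-art methods.
Link: https://arxiv.org/abs/2507.20590
Authors: Xinqi Lin,Fanghua Yu,Jinfan Hu,Zhiyuan You,Wu Shi,Jimmy S. Ren,Jinjin Gu,Chao Dong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: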
Abstract:Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balance between restoration quality, fidelity, and speed. We propose a novel method, HYPIR, to address these challenges. Our solution pipeline is straightforward: it involves initializing the image restoration model with a pre-trained diffusion model and then fine-tuning it with adversarial training. This approach does not rely on diffusion loss, iterative sampling, or additional adapters. We theoretically demonstrate that initializing adversarial training from a pre-trained diffusion model positions the initial restoration model very close to the natural image distribution. Consequently, this initialization improves numerical stability, avoids mode collapse, and substantially accelerates the convergence of adversarial training. Moreover, HYPIR inherits the capabilities of diffusion models with rich user control, enabling text-guided restoration and adjustable texture richness. Requiring only a single forward pass, it achieves faster convergence and inference speed than diffusion-based methods. Extensive experiments show that HYPIR outperforms previous state-of-the-art methods, achieving efficient and high-quality image restoration.
[CV-53] Methods for the Segmentation of Reticular Structures Using 3D LiDAR Data: A Comparative Evaluation
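[Quick read]: This paper addresses autonomous navigation of climbing robots in complex metallic truss structures, where the core challenge is accurately identifying navigable surfaces to support path planning and locomotion. The key to its solution is two complementary families of methods: an analytical algorithm that produces a binary segmentation of navigable regions from the eigendecomposition of planar patches in the point cloud, and deep learning models (PointNet, PointNet++, MinkUNet34C, and PointTransformerV3) for end-to-end point cloud semantic segmentation. Experiments show that the analytical method is easier to tune and performs comparably to the deep learning models, while the deep learning methods, PointTransformerV3 in particular, excel in segmentation accuracy (mIoU of about 97%). Together they offer practical options for autonomous navigation in complex truss environments and expose the trade-off between computational efficiency and segmentation performance.
Link: https://arxiv.org/abs/2507.20589
Authors: Francisco J. Soler Mora,Adrián Peidró Vidal,Marc Fabregat-Jaén,Luis Payá Castelló,Óscar Reinoso García
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: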
Abstract:Reticular structures form the backbone of major infrastructure like bridges, pylons, and airports, but their inspection and maintenance are costly and hazardous, often requiring human intervention. While prior research has focused on fault detection via images or robotic platform design, the autonomous navigation of robots within these structures is less explored. This study addresses that gap by proposing methods to detect navigable surfaces in truss structures, enhancing the autonomy of climbing robots. The paper introduces several approaches for binary segmentation of navigable surfaces versus background from 3D point clouds of metallic trusses. These methods fall into two categories: analytical algorithms and deep learning models. The analytical approach features a custom algorithm that segments structures by analyzing the eigendecomposition of planar patches in the point cloud. In parallel, advanced deep learning models PointNet, PointNet++, MinkUNet34C, and PointTransformerV3 are trained and evaluated for the same task. Comparative analysis shows that the analytical algorithm offers easier parameter tuning and performance comparable to deep learning models, which, while more computationally intensive, excel in segmentation accuracy. Notably, PointTransformerV3 achieves a Mean Intersection Over Union (mIoU) of about 97%. The study demonstrates the promise of both analytical and deep learning methods for improving autonomous navigation in complex truss environments. The results highlight the trade-offs between computational efficiency and segmentation performance, providing valuable guidance for future research and practical applications in autonomous infrastructure inspection and maintenance.
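The analytical idea mentioned in the abstract above, testing whether a point-cloud patch is planar from the eigendecomposition of its covariance, can be sketched as follows. This is a generic illustration assuming NumPy; the function names and the `max_ratio` threshold are hypothetical and not taken from the paper's custom algorithm.

```python
import numpy as np

def patch_planarity(points: np.ndarray):
    """Return (smallest-eigenvalue ratio, estimated normal) for an N x 3 patch."""
    cov = np.cov(points.T)                  # 3 x 3 covariance of x, y, z
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    ratio = float(eigvals[0] / eigvals.sum())
    normal = eigvecs[:, 0]                  # direction of least variance
    return ratio, normal

def is_planar(points: np.ndarray, max_ratio: float = 0.01) -> bool:
    """A patch is treated as planar when almost no variance lies along its normal."""
    ratio, _ = patch_planarity(points)
    return ratio < max_ratio

rng = np.random.default_rng(0)
flat_patch = np.column_stack([rng.random(200), rng.random(200), np.zeros(200)])
blob = rng.normal(size=(200, 3))            # roughly isotropic, not planar
print(is_planar(flat_patch), is_planar(blob))  # True False
```

The threshold is the kind of parameter the abstract describes as easy to tune: raising `max_ratio` tolerates noisier surfaces at the cost of more false positives.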
[CV-54] M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast ICCV2025
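[Quick read]: This paper addresses the challenges of MRI tumor segmentation in medical imaging, especially the computational complexity of three-dimensional (3D) data. Existing methods often fail to exploit the spatial correlations between adjacent MRI slices, which are "temporal-like" in the same way as frame sequences in video segmentation. The key to its solution is the M-Net framework, whose core innovation is the Mesh-Cast mechanism: it integrates arbitrary sequential models to process both channel and temporal information, systematically capturing the "temporal-like" spatial correlations between slices. The authors also define an MRI sequential input pattern and a Two-Phase Sequential (TPS) training strategy that first learns common patterns across sequences and then refines slice-specific feature extraction. This preserves volumetric context while avoiding the high computational cost of full 3D convolutions, improving generalizability and robustness in sequential segmentation tasks.
Link: https://arxiv.org/abs/2507.20582
Authors: Jiacheng Lu,Hui Ding,Shiyu Zhang,Guoping Huo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Accepted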
Abstract:MRI tumor segmentation remains a critical challenge in medical imaging, where volumetric analysis faces unique computational demands due to the complexity of 3D data. The spatially sequential arrangement of adjacent MRI slices provides valuable information that enhances segmentation continuity and accuracy, yet this characteristic remains underutilized in many existing models. The spatial correlations between adjacent MRI slices can be regarded as “temporal-like” data, similar to frame sequences in video segmentation tasks. To bridge this gap, we propose M-Net, a flexible framework specifically designed for sequential image segmentation. M-Net introduces the novel Mesh-Cast mechanism, which seamlessly integrates arbitrary sequential models into the processing of both channel and temporal information, thereby systematically capturing the inherent “temporal-like” spatial correlations between MRI slices. Additionally, we define an MRI sequential input pattern and design a Two-Phase Sequential (TPS) training strategy, which first focuses on learning common patterns across sequences before refining slice-specific feature extraction. This approach leverages temporal modeling techniques to preserve volumetric contextual information while avoiding the high computational cost of full 3D convolutions, thereby enhancing the generalizability and robustness of M-Net in sequential segmentation tasks. Experiments on the BraTS2019 and BraTS2023 datasets demonstrate that M-Net outperforms existing methods across all key metrics, establishing itself as a robust solution for temporally-aware MRI tumor segmentation.
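[CV-55] AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations
[Quick read]: This paper addresses the detection problem posed by increasingly widespread and realistic deepfake videos, especially new forms of forgery produced by text-to-speech and face-voice reenactment models. The key to its solution is a large-scale, diverse dataset that reflects the characteristics of real online video, AV-Deepfake1M++, containing 2 million video clips that cover a variety of generation methods and audio-visual perturbation strategies, giving deepfake detection models a more comprehensive benchmark for training and evaluation. Based on this dataset, the authors host the 2025 1M-Deepfakes Detection Challenge to advance detection research.
Link: https://arxiv.org/abs/2507.20579
Authors: Zhixi Cai,Kartik Kuckreja,Shreya Ghosh,Akanksha Chuchra,Muhammad Haris Khan,Usman Tariq,Tom Gedeon,Abhinav Dhall
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: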
Abstract:The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in the types of generation methods and perturbation strategies commonly seen in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M with 2 million video clips, diversified manipulation strategies, and audio-visual perturbations. This paper describes the data generation strategies and benchmarks AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the Deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset and evaluation scripts are available online under a research-only license at this https URL.
[CV-56] LSFDNet: A Single-Stage Fusion and Detection Network for Ships Using SWIR and LWIR
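[Quick read]: This paper addresses the limited performance of traditional ship detection methods that rely on single-modality images (visible or infrared) in complex environments such as varying lighting and heavy fog. The key to its solution is LSFDNet, a single-stage joint image fusion and detection algorithm that uses feature interaction to couple the fusion of SWIR (short-wave infrared) and LWIR (long-wave infrared) images with object detection. The Multi-Level Cross-Fusion (MLCF) module combines object-sensitive fused features from the detection task and aggregates features across modalities, scales, and tasks, increasing object saliency and semantic richness in the fused images. An Object Enhancement (OE) loss that uses the position prior from the detection task further strengthens the retention of object semantics in the fused images, improving downstream detection. The authors also build the Nearshore Ship Long-Short Wave Registration (NSLSR) dataset to fill a data gap in this field.
Link: https://arxiv.org/abs/2507.20574
Authors: Yanyin Guo,Runxuan An,Junwei Li,Zhiyuan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACMMM2025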
Abstract:Traditional ship detection methods primarily rely on single-modal approaches, such as visible or infrared images, which limit their application in complex scenarios involving varying lighting conditions and heavy fog. To address this issue, we explore the advantages of short-wave infrared (SWIR) and long-wave infrared (LWIR) in ship detection and propose a novel single-stage image fusion detection algorithm called LSFDNet. This algorithm leverages feature interaction between the image fusion and object detection subtask networks, achieving remarkable detection performance and generating visually impressive fused images. To further improve the saliency of objects in the fused images and improve the performance of the downstream detection task, we introduce the Multi-Level Cross-Fusion (MLCF) module. This module combines object-sensitive fused features from the detection task and aggregates features across multiple modalities, scales, and tasks to obtain more semantically rich fused features. Moreover, we utilize the position prior from the detection task in the Object Enhancement (OE) loss function, further increasing the retention of object semantics in the fused images. The detection task also utilizes preliminary fused features from the fusion task to complement SWIR and LWIR features, thereby enhancing detection performance. Additionally, we have established a Nearshore Ship Long-Short Wave Registration (NSLSR) dataset to train effective SWIR and LWIR image fusion and detection networks, bridging a gap in this field. We validated the superiority of our proposed single-stage fusion detection algorithm on two datasets. The source code and dataset are available at this https URL
[CV-57] Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation INTERSPEECH25 INTERSPEECH2025
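[Quick read]: This paper addresses the lack of continuity in facial motion produced by conventional speech-driven 3D facial animation methods that rely on frame-wise reconstruction loss, which yields jittery, unnatural animation (for example, incoherent motion arising from coarticulation). The key to its solution is a phonetic context-aware loss that explicitly models the influence of phonetic context on viseme transitions and introduces a viseme coarticulation weight that adaptively adjusts the importance of facial movements at different time steps, giving smoother and perceptually consistent animation.
Link: https://arxiv.org/abs/2507.20568
Authors: Hyung Kyu Kim,Hak Gu Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for Interspeech 2025. Project page: this https URL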
Abstract:Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. It highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation. Project page: this https URL
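To make the contrast with a plain frame-wise loss concrete, the sketch below weights each frame's reconstruction error by how much the ground-truth motion changes at that frame. The abstract above does not give the formula for the viseme coarticulation weight, so the weighting rule here (1 plus the frame-to-frame change, normalised to mean 1) and both function names are purely illustrative assumptions.

```python
def motion_weights(gt):
    """Hypothetical weights that grow with frame-to-frame change (mean 1)."""
    motion = [0.0] + [
        sum(abs(a - b) for a, b in zip(gt[t], gt[t - 1]))
        for t in range(1, len(gt))
    ]
    raw = [1.0 + m for m in motion]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]

def weighted_reconstruction_loss(pred, gt):
    """Frame-wise mean squared error, re-weighted by motion_weights."""
    weights = motion_weights(gt)
    per_frame = [
        sum((p - g) ** 2 for p, g in zip(pf, gf)) / len(gf)
        for pf, gf in zip(pred, gt)
    ]
    return sum(w * e for w, e in zip(weights, per_frame)) / len(gt)

# Three frames of two flattened vertex coordinates each; motion occurs at frame 2.
gt = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
pred = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = weighted_reconstruction_loss(pred, gt)
```

With these inputs an unweighted loss would be 1/3, while the weighted version is larger because the only error falls on the frame where the motion changes.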
[CV-58] MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization ICCV25 ICCV2025
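[Quick read]: This paper addresses the reliance of existing speech-driven 3D facial animation methods on priors (such as speaker class labels or additional 3D facial meshes), which prevents them from reflecting individual speaking style and limits practical use. The key to its solution is the MemoryTalker framework with two training stages: the first stores and retrieves general facial motion through a memory mechanism (Memorizing), and the second stylizes the motion memory with an audio-driven speaking-style feature and uses it to synthesize personalized facial animation (Animating). The method needs only audio input to synthesize personalized 3D facial animation, without additional prior information, which improves its usability.
Link: https://arxiv.org/abs/2507.20562
Authors: Hyung Kyu Kim,Sangmin Lee,Hak Gu Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for ICCV 2025. Project page: this https URL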
Abstract:Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker’s speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker, which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style with audio input only, to maximize usability in applications. Our framework consists of two training stages: the first stage stores and retrieves general motion (i.e., Memorizing), and the second stage performs personalized facial motion synthesis (i.e., Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns which facial motion types should be emphasized for a particular piece of audio. As a result, our MemoryTalker can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as a user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods.
[CV-59] FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling
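[Quick read]: This paper addresses two core problems in micro-expression recognition (MER): modeling is difficult because samples are few and features are subtle, and in real-world use data privacy constraints make it hard to improve recognition across settings. The key to its solution is the FED-PsyAU research framework, with three components: a psychological study of the coordination between upper and lower facial action units (AUs) that yields structured prior knowledge; a DPK-GAT network that combines these psychological priors with statistical AU patterns to learn facial motion features hierarchically from regional to global levels; and a federated learning mechanism that improves the MER model collaboratively across clients without sharing raw data, preserving privacy while easing each client's limited-sample problem.
Link: https://arxiv.org/abs/2507.20557
Authors: Jingting Li,Yu Qian,Lin Zhao,Su-Jing Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: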
Abstract:Micro-expressions (MEs) are brief, low-intensity, often localized facial expressions. They could reveal genuine emotions individuals may attempt to conceal, valuable in contexts like criminal interrogation and psychological counseling. However, ME recognition (MER) faces challenges, such as small sample sizes and subtle features, which hinder efficient modeling. Additionally, real-world applications encounter ME data privacy issues, leaving the task of enhancing recognition across settings under privacy constraints largely unexplored. To address these issues, we propose a FED-PsyAU research framework. We begin with a psychological study on the coordination of upper and lower facial action units (AUs) to provide structured prior knowledge of facial muscle dynamics. We then develop a DPK-GAT network that combines these psychological priors with statistical AU patterns, enabling hierarchical learning of facial motion features from regional to global levels, effectively enhancing MER performance. Additionally, our federated learning framework advances MER capabilities across multiple clients without data sharing, preserving privacy and alleviating the limited-sample issue for each client. Extensive experiments on commonly-used ME databases demonstrate the effectiveness of our approach.
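The privacy-preserving step in the abstract above, improving a shared model across clients without exchanging data, is commonly realised by averaging client model weights on a server. The sketch below shows FedAvg-style sample-weighted averaging as a generic example; the abstract does not state which aggregation rule FED-PsyAU uses, so this choice and the function name are assumptions.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of per-client parameter vectors."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two clients with toy 2-parameter models; only weights leave each client.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [10, 30]  # number of local training samples per client
global_weights = federated_average(clients, sizes)
print(global_weights)  # [2.5, 3.5]
```

The client with more samples pulls the global model further toward its local weights, which is one way a small client can benefit from data it never sees.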
[CV-60] Annotation-Free Human Sketch Quality Assessment
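[Quick read]: This paper addresses sketch quality assessment, that is, how to quantitatively measure how well a sketch is drawn so that poorly drawn sketches can be identified. Where assessment would normally rely on human quality labels, this paper is the first to propose automatic assessment without specific quality annotations. Its key innovation is the Geometry-Aware Classification Layer (GACL), which uses feature magnitude (the L2 norm) as the quality metric and treats feature magnitude and recognisability learning as a dual task that can be optimised simultaneously under a cross-entropy classification loss with a theoretical guarantee. GACL has an intuitive geometric interpretation (the better the quality, the easier the recognition) and is agnostic to network architecture and the underlying sketch representation.
Link: https://arxiv.org/abs/2507.20548
Authors: Lan Yang,Kaiyue Pang,Honggang Zhang,Yi-Zhe Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCV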
Abstract:As lovely as bunnies are, your sketched version would probably not do them justice (see the introductory figure of the paper). This paper recognises this very problem and studies sketch quality assessment for the first time, letting you find these badly drawn ones. Our key discovery lies in exploiting the magnitude (L_2 norm) of a sketch feature as a quantitative quality metric. We propose Geometry-Aware Classification Layer (GACL), a generic method that makes feature-magnitude-as-quality-metric possible and importantly does it without the need for specific quality annotations from humans. GACL sees feature magnitude and recognisability learning as a dual task, which can be simultaneously optimised under a neat cross-entropy classification loss with theoretical guarantee. This gives GACL a nice geometric interpretation (the better the quality, the easier the recognition), and makes it agnostic to both network architecture changes and the underlying sketch representation. Through a large scale human study of 160,000 trials, we confirm the agreement between our GACL-induced metric and human quality perception. We further demonstrate how such a quality assessment capability can for the first time enable three practical sketch applications. Interestingly, we show GACL not only works on abstract visual representations such as sketch but also extends well to natural images on the problem of image quality assessment (IQA). Last but not least, we spell out the general properties of GACL as a general-purpose data re-weighting strategy and demonstrate its applications in vertical problems such as noisy label cleansing. Code will be made publicly available at this http URL.
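Once a network has been trained so that feature magnitude tracks quality, scoring a sketch reduces to taking the L2 norm of its feature vector. The sketch below shows only that scoring and ranking step with made-up feature vectors; the function names are hypothetical, and the GACL training that makes the norm meaningful is not reproduced here.

```python
import math

def feature_magnitude(feature):
    """L2 norm of a feature vector, used here as the quality score."""
    return math.sqrt(sum(x * x for x in feature))

def rank_by_quality(features):
    """Indices of the features, ordered from highest to lowest magnitude."""
    return sorted(
        range(len(features)),
        key=lambda i: feature_magnitude(features[i]),
        reverse=True,
    )

# Toy 2-D features standing in for embeddings of three sketches.
features = [[3.0, 4.0], [0.3, 0.4], [6.0, 8.0]]
print(rank_by_quality(features))  # [2, 0, 1]
```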
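[CV-61] T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation ICCV2025
[Quick read]: This paper addresses the sensitivity of text-to-image (T2I) generative models to prompt phrasing: users often have to refine prompts repeatedly to get a satisfactory result, with no clear feedback. Existing techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation help, but offer limited controllability or require additional training, which restricts generalization. The key to its solution is T2I-Copilot, a training-free multi-agent system that automates prompt interpretation, model selection, and iterative refinement through three collaborating agents: an Input Interpreter that standardizes the prompt and resolves ambiguities, a Generation Engine that selects among T2I models and organizes the prompt information, and a Quality Evaluator that scores aesthetic quality and text-image alignment to guide regeneration. On GenAI-Bench, it achieves a VQA score comparable to the commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%.
Link: https://arxiv.org/abs/2507.20536
Authors: Chieh-Yun Chen,Min Shi,Gong Zhang,Humphrey Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: ICCV 2025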
Abstract:Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts repeatedly without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: this https URL.
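The three-agent loop in the abstract above (interpret, generate, evaluate, regenerate) can be sketched with stub functions. Everything below is hypothetical scaffolding: the stubs return strings instead of calling language or image models, and the threshold, round limit, and function names are assumptions chosen only to show the control flow.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float
    feedback: str

def interpret(prompt: str) -> str:
    """Stub Input Interpreter: normalises whitespace into a 'report'."""
    return " ".join(prompt.split())

def generate(report: str, feedback: str) -> str:
    """Stub Generation Engine: returns a text placeholder for an image."""
    return f"image({report}; {feedback})" if feedback else f"image({report})"

def evaluate(image: str) -> Evaluation:
    """Stub Quality Evaluator: rewards outputs that applied the feedback."""
    if "add detail" in image:
        return Evaluation(0.9, "")
    return Evaluation(0.4, "add detail")

def run_pipeline(prompt: str, threshold: float = 0.8, max_rounds: int = 3):
    """Regenerate until the score reaches the threshold or rounds run out."""
    report = interpret(prompt)
    feedback = ""
    for round_idx in range(1, max_rounds + 1):
        image = generate(report, feedback)
        result = evaluate(image)
        if result.score >= threshold:
            break
        feedback = result.feedback
    return image, result.score, round_idx

print(run_pipeline("  a red   fox "))  # ('image(a red fox; add detail)', 0.9, 2)
```

A human-in-the-loop variant would replace or supplement the evaluator's feedback string with user input between rounds.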
[CV-62] Low-Cost Machine Vision System for Sorting Green Lentils (Lens Culinaris) Based on Pneumatic Ejection and Deep Learning
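[Quick read]: This paper addresses efficient, accurate sorting of green lentils (Lens Culinaris) during agricultural processing, where manual sorting is slow and subjective and cannot meet the needs of automated production. The key to its solution is a dynamic classification system that combines computer vision with pneumatic ejection. A two-stage YOLOv8 architecture first detects and locates grains on the conveyor belt and then classifies them into six quality classes (Good, Yellow, Broken, Peeled, Dotted, Reject), while an Arduino-based control unit coordinates real-time interaction between the vision system and the pneumatic actuator to eject defective grains automatically. The system reaches a separation accuracy of 87.2% at a conveyor speed of 59 mm/s, demonstrating the feasibility of machine vision for grain sorting and its potential for modular extension.
Link: https://arxiv.org/abs/2507.20531
Authors: Davy Rojas Yana,Edwin Salcedo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in the Proceedings of the 30th International Conference on Automation and Computing (ICAC 2025)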
Abstract:This paper presents the design, development, and evaluation of a dynamic grain classification system for green lentils (Lens Culinaris), which leverages computer vision and pneumatic ejection. The system integrates a YOLOv8-based detection model that identifies and locates grains on a conveyor belt, together with a second YOLOv8-based classification model that categorises grains into six classes: Good, Yellow, Broken, Peeled, Dotted, and Reject. This two-stage YOLOv8 pipeline enables accurate, real-time, multi-class categorisation of lentils, implemented on a low-cost, modular hardware platform. The pneumatic ejection mechanism separates defective grains, while an Arduino-based control system coordinates real-time interaction between the vision system and mechanical components. The system operates effectively at a conveyor speed of 59 mm/s, achieving a grain separation accuracy of 87.2%. Despite a limited processing rate of 8 grams per minute, the prototype demonstrates the potential of machine vision for grain sorting and provides a modular foundation for future enhancements.
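The two figures quoted in the abstract above (59 mm/s conveyor speed, 8 grams per minute) lend themselves to a short worked calculation. The sketch below derives the delay between detecting a grain and firing the ejector, and converts the throughput to kilograms per hour. The 118 mm camera-to-nozzle distance is a hypothetical value chosen for the example; the abstract does not report the real geometry.

```python
CONVEYOR_SPEED_MM_S = 59.0  # from the abstract
THROUGHPUT_G_MIN = 8.0      # from the abstract

def ejection_delay_s(camera_to_nozzle_mm: float,
                     speed_mm_s: float = CONVEYOR_SPEED_MM_S) -> float:
    """Time a grain needs to travel from the camera to the ejection nozzle."""
    return camera_to_nozzle_mm / speed_mm_s

def throughput_kg_per_hour(grams_per_minute: float = THROUGHPUT_G_MIN) -> float:
    """Convert a processing rate from g/min to kg/h."""
    return grams_per_minute * 60.0 / 1000.0

print(ejection_delay_s(118.0))   # 2.0 (hypothetical 118 mm spacing)
print(throughput_kg_per_hour())  # 0.48
```

At 0.48 kg/h the prototype is far from production scale, which matches the abstract's framing of the system as a modular foundation rather than a finished sorter.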
[CV-63] Enhancing Spatial Reasoning through Visual and Textual Thinking
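[Quick read]: This paper addresses the weak performance of vision language models (VLMs) on spatial reasoning, particularly in understanding and inferring positional relationships between objects in 2D and 3D space. The key to its solution is SpatialVTS, which strengthens spatial reasoning by performing spatial visual thinking and spatial textual thinking simultaneously. In the spatial visual thinking phase, the model automatically generates location-related specific tokens for essential targets, covering not only the objects mentioned in the question but also objects potentially relevant to the reasoning. In the spatial textual thinking phase, the model carries out long-term thinking based on visual cues and dialogue, gradually inferring the answer. The authors also manually correct an existing spatial reasoning dataset, restructure the input format, and add thinking processes with logical reasoning details, which together significantly improve performance on several spatial understanding tasks.
Link: https://arxiv.org/abs/2507.20529
Authors: Xun Liang,Xin Guo,Zhongming Jin,Weihang Pan,Penghui Shang,Deng Cai,Binbin Lin,Jieping Ye
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: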
Abstract:The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they still struggle with the spatial reasoning task. In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to generate location-related specific tokens of essential targets automatically. Not only are the objects mentioned in the problem addressed, but also the potential objects related to the reasoning are considered. During the spatial textual thinking phase, our model conducts long-term thinking based on visual cues and dialogues, gradually inferring the answers to spatial reasoning problems. To effectively support the model’s training, we perform manual corrections to the existing spatial reasoning dataset, eliminating numerous incorrect labels resulting from automatic annotation, restructuring the data input format to enhance generalization ability, and developing thinking processes with logical reasoning details. Without introducing additional information (such as masks or depth), our model’s overall average performance on several spatial understanding tasks has significantly improved compared with other models.
[CV-64] AgroBench: Vision-Language Model Benchmark in Agriculture ICCV2025
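[Quick read]: This paper addresses precise automated understanding of agricultural tasks (such as disease identification) in support of sustainable crop production. The key to its solution is AgroBench, a benchmark for evaluating vision-language models (VLMs) that is annotated by expert agronomists and spans seven agricultural topics, 203 crop categories, and 682 disease categories, covering key areas of agricultural engineering. Systematic evaluation on the benchmark shows that current VLMs perform poorly on fine-grained identification tasks, weed identification in particular, and the authors analyze the error types to suggest directions for future VLM development.
Link: https://arxiv.org/abs/2507.20519
Authors: Risa Shinoda,Nakamasa Inoue,Hirokatsu Kataoka,Masaki Onishi,Yoshitaka Ushiku
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025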
Abstract:Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLM models across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code are available at this https URL .
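[CV-65] T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval
[Quick read]: This paper addresses semantic misalignment in video-text retrieval that arises because textual descriptions cover only part of the video content: directly aligning full text and video representations introduces incorrect supervision and ignores the inequivalence of information between modalities. The key to its solution is the T2VParser framework, which introduces Adaptive Decomposition Tokens to extract multiview semantic representations from text and video and performs content-based, adaptive partial alignment rather than forcing whole-representation alignment, improving cross-modal matching accuracy while retaining the knowledge of pretrained models.
Link: https://arxiv.org/abs/2507.20518
Authors: Yili Li,Gang Xiong,Gaopeng Gou,Xiangyan Qu,Jiamin Zhuang,Zhen Li,Junzheng Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: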
Abstract:Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at this https URL.
[CV-66] GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections
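[Quick read]: This paper addresses how to accurately separate and control sunlight, sky radiance, and indirect lighting in outdoor scene relighting, to produce more natural and varied shading and illumination. Prior methods typically compress each image's global illumination into a single latent vector, which makes fine-grained dynamic shadow generation and multi-dimensional lighting control difficult. The key to its solution lies in three innovations: (1) a residual-based sun visibility extraction method that accurately separates direct sunlight effects; (2) a region-based supervision framework with a structural consistency loss, for physically interpretable and spatially coherent illumination decomposition; and (3) a ray-tracing-based shadow simulation technique for more realistic shadows.
Link: https://arxiv.org/abs/2507.20512
Authors: Haiyang Bai,Jiaqi Zhu,Songru Jiang,Wei Huang,Tao Lu,Yuanqi Li,Jie Guo,Runze Fu,Yanwen Guo,Lijun Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: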
Abstract:We propose a 3D Gaussian splatting-based framework for outdoor relighting that leverages intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections. Unlike prior methods that compress the per-image global illumination into a single latent vector, our approach enables simultaneously diverse shading manipulation and the generation of dynamic shadow effects. This is achieved through three key innovations: (1) a residual-based sun visibility extraction method to accurately separate direct sunlight effects, (2) a region-based supervision framework with a structural consistency loss for physically interpretable and coherent illumination decomposition, and (3) a ray-tracing-based technique for realistic shadow simulation. Extensive experiments demonstrate that our framework synthesizes novel views with competitive fidelity against state-of-the-art relighting solutions and produces more natural and multifaceted illumination and shadow effects.
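[CV-67] Beyond Class Tokens: LLM-guided Dominant Property Mining for Few-shot Classification
[Quick read]: This paper addresses the limited generalization of few-shot learning (FSL) under data scarcity, and in particular the fact that existing CLIP-like methods, which align visual features only to text embeddings of class names, struggle to preserve the visual diversity needed to discriminate novel classes. The key to its solution is a new method, BCT-CLIP, that explores "dominating properties" through contrastive learning rather than relying only on class tokens. It uses Large Language Model (LLM) prior knowledge, a multi-property generator with patch-aware cross-attention that produces multiple visual property tokens, and an LLM-assisted retrieval procedure with clustering-based pruning to obtain dominating property descriptions, building structured image representations that include a global category representation and patch-aware property embeddings; a new property-token contrastive learning strategy then strengthens class-specific representations and improves few-shot classification.
Link: https://arxiv.org/abs/2507.20511
Authors: Wei Zhuo,Runjie Luo,Wufeng Xue,Linlin Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures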
Abstract:Few-shot Learning (FSL), which endeavors to develop the generalization ability for recognizing novel classes using only a few images, faces significant challenges due to data scarcity. Recent CLIP-like methods based on contrastive language-image pretraining mitigate the issue by leveraging textual representation of the class name for unseen image discovery. Despite the achieved success, simply aligning visual representations to class name embeddings would compromise the visual diversity for novel class discrimination. To this end, we propose a novel Few-Shot Learning (FSL) method (BCT-CLIP) that explores dominating properties via contrastive learning beyond simply using class tokens. Through leveraging LLM-based prior knowledge, our method pushes forward FSL with comprehensive structural image representations, including both global category representation and the patch-aware property embeddings. In particular, we present a novel multi-property generator (MPG) with patch-aware cross-attentions to generate multiple visual property tokens, a Large-Language Model (LLM)-assisted retrieval procedure with clustering-based pruning to obtain dominating property descriptions, and a new contrastive learning strategy for property-token learning. The superior performances on the 11 widely used datasets demonstrate that our investigation of dominating properties advances discriminative class-specific representation learning and few-shot classification.
[CV-68] Investigating the Effect of Spatial Context on Multi-Task Sea Ice Segmentation
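[Quick read]: This paper studies how deep learning models for sea ice segmentation can best capture multi-scale spatial context, given that the optimal choice of spatial context for a given observation resolution and sea ice property (sea ice concentration, stage of development, floe size) is still unclear. The key to the solution is Atrous Spatial Pyramid Pooling (ASPP): varying the atrous rates systematically controls the receptive field of the convolutions and therefore the scale of context that is captured. The study combines Sentinel-1 synthetic aperture radar (SAR) and Advanced Microwave Radiometer-2 (AMSR2) inputs to test how receptive field size affects each segmentation task, and uses Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize model decisions. Smaller receptive fields suit high-resolution SAR data, medium receptive fields work better for stage-of-development segmentation, and larger receptive fields often degrade performance. The authors conclude that spatial context should be chosen according to observation resolution and target property.
Link: https://arxiv.org/abs/2507.20507
Authors: Behzad Vahedi, Rafael Pires de Lima, Sepideh Jalayer, Walter N. Meier, Andrew P. Barrett, Morteza Karimzadeh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: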
Abstract:Capturing spatial context at multiple scales is crucial for deep learning-based sea ice segmentation. However, the optimal specification of spatial context based on observation resolution and task characteristics remains underexplored. This study investigates the impact of spatial context on the segmentation of sea ice concentration, stage of development, and floe size using a multi-task segmentation model. We implement Atrous Spatial Pyramid Pooling with varying atrous rates to systematically control the receptive field size of convolutional operations, and to capture multi-scale contextual information. We explore the interactions between spatial context and feature resolution for different sea ice properties and examine how spatial context influences segmentation performance across different input feature combinations from Sentinel-1 SAR and Advanced Microwave Radiometer-2 (AMSR2) for multi-task mapping. Using Gradient-weighted Class Activation Mapping, we visualize how atrous rates influence model decisions. Our findings indicate that smaller receptive fields excel for high-resolution Sentinel-1 data, while medium receptive fields yield better performances for stage of development segmentation and larger receptive fields often lead to diminished performances. The fusion of SAR and AMSR2 enhances segmentation across all tasks. We highlight the value of lower-resolution 18.7 and 36.5 GHz AMSR2 channels in sea ice mapping. These findings highlight the importance of selecting appropriate spatial context based on observation resolution and target properties in sea ice mapping. By systematically analyzing receptive field effects in a multi-task setting, our study provides insights for optimizing deep learning models in geospatial applications.
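To make the link between atrous rate and receptive field concrete, here is a minimal sketch using the standard formula for the spatial extent of a single dilated kernel, `k + (k - 1) * (rate - 1)`. The rates below are hypothetical placeholders; the abstract does not say which rates the authors used.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by one dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Hypothetical ASPP branch rates (an assumption, not taken from the paper).
rates = [1, 6, 12, 18]
extents = {rate: effective_kernel_size(3, rate) for rate in rates}

for rate, extent in extents.items():
    print(f"rate={rate:2d}: a 3x3 kernel spans {extent}x{extent} input positions")
```

The parameter count stays at 3x3 for every branch; only the spacing between the sampled positions grows, which is why the rate is a cheap knob for spatial context.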
[CV-69] An Improved YOLOv8 Approach for Small Target Detection of Rice Spikelet Flowering in Field Environments
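[Quick read]: This paper tackles automatic recognition of rice spikelet flowering time, which matters for timely pollination and higher yields in hybrid rice seed production. Field environments are complex, and spikelets are small and flower only briefly, which makes accurate detection hard. The key to the solution is an improved YOLOv8 detector: a Bidirectional Feature Pyramid Network (BiFPN) replaces the original PANet to strengthen multi-scale feature fusion, and an added p2 small-object detection head uses finer feature maps to reduce the information lost when detecting small targets. The study also builds a dedicated dataset of high-resolution RGB field images for training and testing. The improved YOLOv8s-p2 model outperforms the baseline on mAP@0.5, precision, recall, and F1-score, and runs at 69 frames per second, which meets practical requirements.
Link: https://arxiv.org/abs/2507.20506
Authors: Beizhang Chen, Jinming Liang, Zheng Xiong, Ming Pan, Xiangbao Meng, Qingshan Lin, Qun Ma, Yingping Zhao
Affiliations: Shenzhen Institute of Modern Agricultural Equipment; Guangdong Institute of Modern Agricultural Equipment; Key Laboratory of Modern Agricultural Intelligent Equipment in South China, Ministry of Agriculture and Rural Affairs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures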
Abstract:Accurately detecting rice flowering time is crucial for timely pollination in hybrid rice seed production. This not only enhances pollination efficiency but also ensures higher yields. However, due to the complexity of field environments and the characteristics of rice spikelets, such as their small size and short flowering period, automated and precise recognition remains challenging. To address this, this study proposes a rice spikelet flowering recognition method based on an improved YOLOv8 object detection model. First, a Bidirectional Feature Pyramid Network (BiFPN) replaces the original PANet structure to enhance feature fusion and improve multi-scale feature utilization. Second, to boost small object detection, a p2 small-object detection head is added, using finer feature mapping to reduce feature loss commonly seen in detecting small targets. Given the lack of publicly available datasets for rice spikelet flowering in field conditions, a high-resolution RGB camera and data augmentation techniques are used to construct a dedicated dataset, providing reliable support for model training and testing. Experimental results show that the improved YOLOv8s-p2 model achieves an mAP@0.5 of 65.9%, precision of 67.6%, recall of 61.5%, and F1-score of 64.41%, representing improvements of 3.10%, 8.40%, 10.80%, and 9.79%, respectively, over the baseline YOLOv8. The model also runs at 69 f/s on the test set, meeting practical application requirements. Overall, the improved YOLOv8s-p2 offers high accuracy and speed, providing an effective solution for automated monitoring in hybrid rice seed production.
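As a quick sanity check on the reported numbers, the F1-score can be recomputed from the stated precision and recall. The sketch below also reconstructs a baseline F1 under the assumption (not stated in the abstract) that the reported improvements are absolute percentage points rather than relative gains.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
precision, recall = 67.6, 61.5
f1_improved = f1_score(precision, recall)

# Assumption: the reported gains (8.40 and 10.80) are absolute percentage points.
f1_baseline = f1_score(precision - 8.40, recall - 10.80)

print(round(f1_improved, 2))  # 64.41
print(round(f1_baseline, 2))  # 54.62
```

Under that assumption the gap between the two rounded values is 9.79 points, which matches the F1 improvement quoted in the abstract.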
[CV-70] Automated 3D-GS Registration and Fusion via Skeleton Alignment and Gaussian-Adaptive Features IROS2025
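[Quick read]: This paper addresses two problems in automatically aligning and fusing multiple 3D Gaussian Splatting (3D-GS) sub-maps: existing methods rely on manually selecting a reference sub-map and on point cloud matching for registration, so they are not automated, and hard-threshold filtering degrades rendering quality after fusion. The solution rests on two ideas. First, geometric skeletons are extracted across scenes and an ellipsoid-aware convolution captures 3D-GS attributes, which enables robust scene registration. Second, a multi-factor Gaussian fusion strategy mitigates the loss of scene elements caused by rigid thresholding, improving structural fidelity and visual quality after fusion.
Link: https://arxiv.org/abs/2507.20480
Authors: Shiyang Liu, Dianyi Yang, Yu Gao, Bohan Ren, Yi Yang, Mengyin Fu
Affiliations: School of Automation, Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IROS 2025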
Abstract:In recent years, 3D Gaussian Splatting (3D-GS)-based scene representation demonstrates significant potential in real-time rendering and training efficiency. However, most existing methods primarily focus on single-map reconstruction, while the registration and fusion of multiple 3D-GS sub-maps remain underexplored. Existing methods typically rely on manual intervention to select a reference sub-map as a template and use point cloud matching for registration. Moreover, hard-threshold filtering of 3D-GS primitives often degrades rendering quality after fusion. In this paper, we present a novel approach for automated 3D-GS sub-map alignment and fusion, eliminating the need for manual intervention while enhancing registration accuracy and fusion quality. First, we extract geometric skeletons across multiple scenes and leverage ellipsoid-aware convolution to capture 3D-GS attributes, facilitating robust scene registration. Second, we introduce a multi-factor Gaussian fusion strategy to mitigate the scene element loss caused by rigid thresholding. Experiments on the ScanNet-GSReg and our Coord datasets demonstrate the effectiveness of the proposed method in registration and fusion. For registration, it achieves a 41.9% reduction in RRE on complex scenes, ensuring more precise pose estimation. For fusion, it improves PSNR by 10.11 dB, highlighting superior structural preservation. These results confirm its ability to enhance scene alignment and reconstruction fidelity, ensuring more consistent and accurate 3D scene representation for robotic perception and autonomous navigation.
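The registration result is reported as a reduction in RRE. A minimal sketch of the relative rotation error as it is commonly defined in the registration literature, i.e. the angle of the residual rotation between the estimated and ground-truth rotation matrices, is shown below. Whether the paper uses exactly this definition is an assumption, and the function name is illustrative.

```python
import math


def relative_rotation_error_deg(r_est, r_gt):
    """Angle in degrees of the residual rotation between two 3x3 rotation matrices."""
    # trace(r_gt^T @ r_est) equals the sum of element-wise products.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(relative_rotation_error_deg(quarter_turn_z, identity))
```

A perfect registration gives an error of zero; the toy example compares the identity with a quarter turn about the z axis.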
[CV-71] Priority-Aware Pathological Hierarchy Training for Multiple Instance Learning MICCAI
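[Quick read]: This paper addresses a gap in Multiple Instance Learning (MIL) for clinical pathological diagnosis: existing approaches do not account for the priority relationships among pathological symptoms and diagnostic classes, so models ignore differences in importance between classes. The key to the solution is a priority alignment mechanism built on two hierarchies, a vertical inter-hierarchy and a horizontal intra-hierarchy, combined with implicit feature reuse during training that favors the clinically more serious classes within the same level. This reduces misdiagnosis and prioritizes the more important symptoms in multiclass scenarios.
Link: https://arxiv.org/abs/2507.20469
Authors: Sungrae Hong, Kyungeun Kim, Juhyeon Kim, Sol Lee, Jisu Shin, Chanjae Song, Mun Yong Yi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, Accepted for oral presentation by The 2nd MICCAI Student Board (MSB) EMERGE Workshop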
Abstract:Multiple Instance Learning (MIL) is increasingly being used as a support tool within clinical settings for pathological diagnosis decisions, achieving high performance and removing the annotation burden. However, existing approaches for clinical MIL tasks have not adequately addressed the priority issues that exist in relation to pathological symptoms and diagnostic classes, causing MIL models to ignore priority among classes. To overcome this clinical limitation of MIL, we propose a new method that addresses priority issues using two hierarchies: vertical inter-hierarchy and horizontal intra-hierarchy. The proposed method aligns MIL predictions across each hierarchical level and employs an implicit feature re-usability during training to facilitate clinically more serious classes within the same level. Experiments with real-world patient data show that the proposed method effectively reduces misdiagnosis and prioritizes more important symptoms in multiclass scenarios. Further analysis verifies the efficacy of the proposed components and qualitatively confirms the MIL predictions against challenging cases with multiple symptoms.
[CV-72] Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis
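[Quick read]: This paper addresses the high computational cost that visual autoregressive modeling incurs in its high-resolution stages because of the large number of tokens. The key to the solution is SparseVAR, a framework that requires no additional training and dynamically excludes tokens in low-frequency regions during inference, while a small set of uniformly sampled anchor tokens preserves the fidelity of the excluded regions. The strategy rests on two observations: in high-resolution stages, low-frequency tokens have a negligible effect on image quality and closely resemble their neighbors; and different blocks of the model attend to different regions, with some concentrating on high-frequency areas. Using lightweight mean-squared-error (MSE) metrics to identify low-frequency tokens, SparseVAR achieves notable acceleration (up to 2× on Infinity-2B) with minimal quality degradation.
Link: https://arxiv.org/abs/2507.20454
Authors: Zhuokun Chen, Jugang Fan, Zhuowei Yu, Bohan Zhuang, Mingkui Tan
Affiliations: South China University of Technology; Pazhou Lab; University of California, Davis; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: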
Abstract:Visual autoregressive modeling, based on the next-scale prediction paradigm, exhibits notable advantages in image quality and model scalability over traditional autoregressive and diffusion models. It generates images by progressively refining resolution across multiple stages. However, the computational overhead in high-resolution stages remains a critical challenge due to the substantial number of tokens involved. In this paper, we introduce SparseVAR, a plug-and-play acceleration framework for next-scale prediction that dynamically excludes low-frequency tokens during inference without requiring additional training. Our approach is motivated by the observation that tokens in low-frequency regions have a negligible impact on image quality in high-resolution stages and exhibit strong similarity with neighboring tokens. Additionally, we observe that different blocks in the next-scale prediction model focus on distinct regions, with some concentrating on high-frequency areas. SparseVAR leverages these insights by employing lightweight MSE-based metrics to identify low-frequency tokens while preserving the fidelity of excluded regions through a small set of uniformly sampled anchor tokens. By significantly reducing the computational cost while maintaining high image generation quality, SparseVAR achieves notable acceleration in both HART and Infinity. Specifically, SparseVAR achieves up to a 2 times speedup with minimal quality degradation in Infinity-2B.
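The token-selection idea can be illustrated with a toy sketch. The abstract does not give SparseVAR's actual metric, so everything here is an assumption made for illustration: the score is the MSE between a token and the mean of its 4-connected neighbours, tokens above a threshold count as high-frequency and are kept, and smooth regions keep only anchors on a uniform stride. The function name `select_tokens` and its parameters are hypothetical.

```python
def select_tokens(grid, mse_threshold, anchor_stride):
    """Return sorted (row, col) positions to keep from a 2D grid of token vectors.

    Assumes the grid has at least two cells, so every token has a neighbour.
    """
    rows, cols = len(grid), len(grid[0])
    keep = set()
    for r in range(rows):
        for c in range(cols):
            token = grid[r][c]
            neighbours = [
                grid[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols
            ]
            dim = len(token)
            mean = [sum(n[d] for n in neighbours) / len(neighbours) for d in range(dim)]
            mse = sum((token[d] - mean[d]) ** 2 for d in range(dim)) / dim
            if mse > mse_threshold:
                keep.add((r, c))  # differs from its surroundings: treat as high-frequency
            elif r % anchor_stride == 0 and c % anchor_stride == 0:
                keep.add((r, c))  # uniformly spaced anchor inside a smooth region
    return sorted(keep)


# Toy 4x4 grid of 1-D tokens: flat everywhere except one "edge" token.
grid = [[[0.0] for _ in range(4)] for _ in range(4)]
grid[1][2] = [1.0]

kept = select_tokens(grid, mse_threshold=0.2, anchor_stride=2)
print(f"kept {len(kept)} of 16 tokens: {kept}")
```

In this toy case only the outlier token and four stride-2 anchors survive, so a later stage would process 5 tokens instead of 16. How the excluded tokens are reconstructed from the anchors is outside the scope of this sketch.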
[CV-73] JOLT3D: Joint Learning of Talking Heads and 3DMM Parameters with Application to Lip-Sync
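[Quick read]: This paper addresses the limited expressiveness of conventional 3D morphable models (3DMM) in talking head synthesis, along with weak lip-sync quality and insufficient facial detail. The key to the solution is to jointly learn a 3D face reconstruction model and a talking head synthesis model, which yields a blendshape representation based on the Facial Action Coding System (FACS) that is optimized for talking head synthesis. This improves the quality of the generated face and, because blendshapes are decomposable, allows only the mouth region to be modified for audio-driven lip-sync. Decoupling the original chin contour from the lip-synced chin contour also reduces flickering near the mouth.
Link: https://arxiv.org/abs/2507.20452
Authors: Sungjoon Park, Minsik Park, Haneol Lee, Jaesub Yun, Donggeon Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 + 8 pages, 11 figures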
Abstract:In this work, we revisit the effectiveness of 3DMM for talking head synthesis by jointly learning a 3D face reconstruction model and a talking head synthesis model. This enables us to obtain a FACS-based blendshape representation of facial expressions that is optimized for talking head synthesis. This contrasts with previous methods that either fit 3DMM parameters to 2D landmarks or rely on pretrained face reconstruction models. Not only does our approach increase the quality of the generated face, but it also allows us to take advantage of the blendshape representation to modify just the mouth region for the purpose of audio-based lip-sync. To this end, we propose a novel lip-sync pipeline that, unlike previous methods, decouples the original chin contour from the lip-synced chin contour, and reduces flickering near the mouth.
[CV-74] WEEP: A Differentiable Nonconvex Sparse Regularizer via Weakly-Convex Envelope
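[Quick read]: This paper addresses a fundamental dilemma of sparse regularization in signal processing: the strongest sparsity-inducing penalties are usually non-differentiable, which conflicts with the gradient-based optimizers that dominate the field and limits scalability and efficiency in practice. The key to the solution is WEEP (Weakly-convex Envelope of Piecewise Penalty), a new, fully differentiable sparse regularizer derived from the weakly-convex envelope framework. WEEP provides strong, unbiased sparsity while remaining fully differentiable with a Lipschitz-continuous gradient (L-smoothness), so it works with any gradient-based optimizer and bridges the gap between statistical performance and computational tractability.
Link: https://arxiv.org/abs/2507.20447
Authors: Takanobu Furuhashi, Hidekata Hontani, Tatsuya Yokota
Affiliations: Nagoya Institute of Technology; RIKEN Center for Advanced Intelligence Project
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures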
Abstract:Sparse regularization is fundamental in signal processing for efficient signal recovery and feature extraction. However, it faces a fundamental dilemma: the most powerful sparsity-inducing penalties are often non-differentiable, conflicting with gradient-based optimizers that dominate the field. We introduce WEEP (Weakly-convex Envelope of Piecewise Penalty), a novel, fully differentiable sparse regularizer derived from the weakly-convex envelope framework. WEEP provides strong, unbiased sparsity while maintaining full differentiability and L-smoothness, making it natively compatible with any gradient-based optimizer. This resolves the conflict between statistical performance and computational tractability. We demonstrate superior performance compared to the L1-norm and other established non-convex sparse regularizers on challenging signal and image denoising tasks.
[CV-75] Can Foundation Models Predict Fitness for Duty?
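[Quick read]: This paper examines how deep learning and foundation models can improve alertness prediction from near-infrared iris images, in order to assess whether a person is fit for duty. The core challenge is the lack of large annotated datasets, especially iris images related to alcohol consumption, drug use, and sleep deprivation. The key to the solution is the generalization ability of self-supervised foundation models: downstream models can be trained on top of them with comparatively little task-specific data, which offsets data scarcity and improves prediction.
Link: https://arxiv.org/abs/2507.20418
Authors: Juan E. Tapia, Christoph Busch
Affiliations: da/sec-Biometrics and Internet Security Research Group; Hochschule Darmstadt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: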
Abstract:Biometric capture devices have been utilised to estimate a person’s alertness through near-infrared iris images, expanding their use beyond just biometric recognition. However, capturing a substantial number of corresponding images related to alcohol consumption, drug use, and sleep deprivation to create a dataset for training an AI model presents a significant challenge. Typically, a large quantity of images is required to effectively implement a deep learning approach. Currently, training downstream models with a huge number of images based on foundational models provides a real opportunity to enhance this area, thanks to the generalisation capabilities of self-supervised models. This work examines the application of deep learning and foundational models in predicting fitness for duty, which is defined as the subject condition related to determining the alertness for work.
[CV-76] Indian Sign Language Detection for Real-Time Translation using Machine Learning
[Quick read]: This paper addresses the communication barriers faced by deaf-mute communities in India, which stem from a scarcity of skilled interpreters and of accessible translation technology. It proposes a real-time Indian Sign Language (ISL) detection and translation system based on a Convolutional Neural Network (CNN). The key is a deep learning model that classifies ISL gestures with a reported accuracy of 99.95%, integrated with MediaPipe for hand tracking and motion detection, which supports real-time recognition and translation of dynamic gestures.
Link: https://arxiv.org/abs/2507.20414
Authors: Rajat Singhal, Jatin Gupta, Akhil Sharma, Anushka Gupta, Navya Sharma
Affiliations: Sharda University; Babu Banarasi Das University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures, 2 tables. Accepted for publication at the 6th International Conference on Recent Advances in Information Technology (RAIT 2025). This is the accepted version (preprint); the final published version will appear in IEEE Xplore
Abstract:Sign languages, which rely on visual-spatial patterns of hand gestures and body movements, are the primary mode of communication for deaf-mute communities worldwide. Effective communication is fundamental to human interaction, yet individuals in these communities often face significant barriers due to a scarcity of skilled interpreters and accessible translation technologies. This research specifically addresses these challenges within the Indian context by focusing on Indian Sign Language (ISL). By leveraging machine learning, this study aims to bridge the critical communication gap for the deaf and hard-of-hearing population in India, where technological solutions for ISL are less developed compared to other global sign languages. We propose a robust, real-time ISL detection and translation system built upon a Convolutional Neural Network (CNN). Our model is trained on a comprehensive ISL dataset and demonstrates exceptional performance, achieving a classification accuracy of 99.95%. This high precision underscores the model’s capability to discern the nuanced visual features of different signs. The system’s effectiveness is rigorously evaluated using key performance metrics, including accuracy, F1 score, precision, and recall, ensuring its reliability for real-world applications. For real-time implementation, the framework integrates MediaPipe for precise hand tracking and motion detection, enabling seamless translation of dynamic gestures. This paper provides a detailed account of the model’s architecture, the data preprocessing pipeline, and the classification methodology for enhancing communication in deaf-mute communities.
[CV-77] Second Competition on Presentation Attack Detection on ID Card
[Quick read]: This paper addresses Presentation Attack Detection (PAD) on ID cards, with the aim of making identity verification systems more secure against forged documents. It reports on the second ID card PAD competition, which introduced an automatic evaluation platform, two tracks that evaluate algorithms and datasets separately, and a new ID card dataset shared with Track 1 teams as a baseline for training and optimization. The results show clear progress in PAD (the best Track 2 team lowered the EER to 6.36%), but the problem remains challenging, largely because of the limited number of images, especially bona fide ones.
Link: https://arxiv.org/abs/2507.20404
Authors: Juan E. Tapia, Mario Nieto, Juan M. Espin, Alvaro S. Rocamora, Javier Barrachina, Naser Damer, Christoph Busch, Marija Ivanovska, Leon Todorov, Renat Khizbullin, Lazar Lazarevich, Aleksei Grishin, Daniel Schulz, Sebastian Gonzalez, Amir Mohammadi, Ketan Kotwal, Sebastien Marcel, Raghavendra Mudgalgundurao, Kiran Raja, Patrick Schuch, Sushrut Patwardhan, Raghavendra Ramachandra, Pedro Couto Pereira, Joao Ribeiro Pinto, Mariana Xavier, Andrés Valenzuela, Rodrigo Lara, Borut Batagelj, Marko Peterlin, Peter Peer, Ajnas Muhammed, Diogo Nunes, Nuno Gonçalves
Affiliations: Hochschule Darmstadt (h-da), da/sec-Biometrics and Internet Security Research; Facephi company; Fraunhofer Institute for Computer Graphics Research IGD; Department of Computer Science, TU Darmstadt; University of Ljubljana, Faculty of Electrical Engineering; Incode Technologies Inc.; ID VisionCenter (IDVC); Idiap Research Institute; Norwegian University of Science and Technology (NTNU); Amadeus; Universidad de Santiago (USACH); Universidad Andrés Bello; University of Ljubljana, Faculty of Computer and Information Science; Institute of Systems and Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This work summarises and reports the results of the second Presentation Attack Detection competition on ID cards. This new version includes new elements compared to the previous one. (1) An automatic evaluation platform was enabled for automatic benchmarking; (2) Two tracks were proposed in order to evaluate algorithms and datasets, respectively; and (3) A new ID card dataset was shared with Track 1 teams to serve as the baseline dataset for the training and optimisation. The Hochschule Darmstadt, Fraunhofer-IGD, and Facephi company jointly organised this challenge. 20 teams were registered, and 74 submitted models were evaluated. For Track 1, the “Dragons” team reached first place with an Average Ranking (AV-Rank) of 40.48% and an Equal Error Rate (EER) of 11.44%. For the more challenging approach in Track 2, the “Incode” team reached the best results with an AV-Rank of 14.76% and an EER of 6.36%, improving on the first edition's results of 74.30% AV-Rank and 21.87% EER. These results suggest that PAD on ID cards is improving, but it is still a challenging problem related to the number of images, especially of bona fide images.
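For readers unfamiliar with the EER figures above, here is a minimal sketch of how an Equal Error Rate can be approximated from detector scores. It assumes higher scores mean "more likely bona fide" and sweeps each observed score as a threshold; the competition's own evaluation platform may compute it differently, and the scores below are made up.

```python
def equal_error_rate(bona_fide_scores, attack_scores):
    """Approximate EER: the error rate at the threshold where APCER and BPCER meet."""
    best_gap, eer = None, None
    for threshold in sorted(set(bona_fide_scores) | set(attack_scores)):
        # APCER: share of attacks accepted as bona fide at this threshold.
        apcer = sum(s >= threshold for s in attack_scores) / len(attack_scores)
        # BPCER: share of bona fide samples rejected as attacks at this threshold.
        bpcer = sum(s < threshold for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(apcer - bpcer)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (apcer + bpcer) / 2
    return eer


# Made-up scores for illustration only.
bona_fide = [0.9, 0.8, 0.7, 0.6]
attacks = [0.1, 0.2, 0.3, 0.65]

print(equal_error_rate(bona_fide, attacks))  # 0.25
```

With few bona fide samples the BPCER moves in coarse steps (here 25 points per sample), which illustrates why a shortage of bona fide images makes the metric hard to pin down.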
[CV-78] VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving
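[Quick read]: This paper addresses the low efficiency and high cost of producing 3D labels for autonomous driving data, and the limits of existing LiDAR-based autolabeling in semantic granularity and object completeness. The core challenges are the sparsity, occlusions, and incomplete observations inherent to LiDAR data, and the fact that existing methods are typically class-agnostic and do not support open-vocabulary labeling. The key to the solution is VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. It uses Vision-Language Models (VLMs) for open-vocabulary object labeling and refines detection quality directly in the point cloud domain, which supports discovering novel categories and produces high-quality 3D pseudolabels without ground-truth annotations or HD maps.
Link: https://arxiv.org/abs/2507.20397
Authors: Levente Tempfli, Esteban Rivera, Markus Lienkamp
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: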
Abstract:Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to refine detection quality directly in the point cloud domain. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On the nuScenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection, demonstrating strong performance in scalable 3D scene understanding. Code will be available upon acceptance.
[CV-79] Solving Scene Understanding for Autonomous Navigation in Unstructured Environments
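[Quick read]: This paper addresses semantic segmentation in complex environments for autonomous driving, so that a vehicle can better understand the road and surrounding objects. The work uses the Indian Driving Dataset, a more challenging dataset of unstructured driving scenes collected on urban and rural Indian roads, and trains and compares five common deep learning models (UNET, UNET+RESNET50, DeepLabV3, PSPNet, and SegNet). Performance is measured with Mean Intersection over Union (MIOU), and the best model reaches an MIOU of 0.6496, which supports the models' effectiveness on complex real-world scenes.
Link: https://arxiv.org/abs/2507.20389
Authors: Naveen Mathews Renji, Kruthika K, Manasa Keshavamurthy, Pooja Kumari, S. Rajarajeswari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: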
Abstract:Autonomous vehicles are the next revolution in the automobile industry and they are expected to revolutionize the future of transportation. Understanding the scenario in which the autonomous vehicle will operate is critical for its competent functioning. Deep Learning has played a massive role in the progress that has been made to date. Semantic Segmentation, the process of annotating every pixel of an image with an object class, is one crucial part of this scene comprehension using Deep Learning. It is especially useful in Autonomous Driving Research as it requires comprehension of drivable and non-drivable areas, roadside objects and the like. In this paper semantic segmentation has been performed on the Indian Driving Dataset which has been recently compiled on the urban and rural roads of Bengaluru and Hyderabad. This dataset is more challenging compared to other datasets like Cityscapes, since it is based on unstructured driving environments. It has a four level hierarchy and in this paper segmentation has been performed on the first level. Five different models have been trained and their performance has been compared using the Mean Intersection over Union. These are UNET, UNET+RESNET50, DeepLabV3, PSPNet and SegNet. The highest MIOU of 0.6496 has been achieved. The paper discusses the dataset, exploratory data analysis, preparation, implementation of the five models and studies the performance and compares the results achieved in the process.
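Since the models are compared by Mean Intersection over Union, a minimal sketch of that metric computed from a confusion matrix may help. The matrix below is a made-up two-class example, not data from the paper, and averaging only over classes that actually occur is one common convention among several.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: ground truth, columns: prediction)."""
    num_classes = len(confusion)
    ious = []
    for k in range(num_classes):
        true_positive = confusion[k][k]
        false_negative = sum(confusion[k]) - true_positive
        false_positive = sum(confusion[i][k] for i in range(num_classes)) - true_positive
        union = true_positive + false_positive + false_negative
        if union > 0:  # skip classes absent from both ground truth and prediction
            ious.append(true_positive / union)
    return sum(ious) / len(ious)


# Made-up pixel counts for two classes, e.g. drivable vs. non-drivable.
toy_confusion = [
    [3, 1],
    [1, 3],
]

print(mean_iou(toy_confusion))  # 0.6
```

Note that 75% of the pixels in this toy matrix are classified correctly, yet the mIoU is only 0.6, because each misclassified pixel is counted against two classes at once.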
[CV-80] ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
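[Quick read]: This paper addresses the image degradation that low-light image enhancement (LLIE) must overcome: noise, loss of detail, and poor contrast. Recent methods usually rely only on pixel-level transformations of RGB images and ignore the rich context available from multiple visual modalities. The key to the solution is ModalFormer, described as the first large-scale multimodal LLIE framework, which draws on nine auxiliary modalities (including deep feature embeddings, segmentation information, geometric cues, and color information). A Cross-modal Transformer (CM-T) with a novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA) fuses RGB data with modality-specific features to produce information-rich hybrid attention maps, which improves enhancement quality.
Link: https://arxiv.org/abs/2507.20388
Authors: Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Codruta Ancuti, Cosmin Ancuti
Affiliations: University of Manchester; Politehnica University of Timisoara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: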
Abstract:Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features–including deep feature embeddings, segmentation information, geometric cues, and color information–to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at this https URL.
[CV-81] MagicAnime: A Hierarchically Annotated Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
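[Quick read]: This paper addresses the difficulty of generating high-quality cartoon animation under multimodal control, which stems from the complexity of non-human characters, stylistically diverse motion, and fine-grained emotion. Cartoon animation is abstract and uses exaggerated motion, so there is a large domain gap from real-world video, and public multimodal cartoon datasets are scarce, which makes models hard to train and evaluate. The key to the solution is MagicAnime, a large-scale, hierarchically annotated, multimodal dataset containing 400k video clips for image-to-video generation, 50k pairs of video clips and keypoints for whole-body annotation, 12k pairs of video clips for video-to-video face animation, and 2.9k pairs of video and audio clips for audio-driven face animation. The accompanying MagicAnime-Bench benchmarks support systematic comparison of methods across these generation tasks, and experiments validate the dataset's value for high-fidelity, fine-grained, and controllable generation.
Link: https://arxiv.org/abs/2507.20368
Authors: Shuolin Xu, Bingyuan Wang, Zeyu Cai, Fangteng Fu, Yue Ma, Tongyi Lee, Hongchuan Yu, Zeyu Wang
Affiliations: Bournemouth University; Hong Kong University of Science and Technology (Guangzhou); Hong Kong University of Science and Technology; National Cheng Kung University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 8 pages, 6 figures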
Abstract:Generating high-quality cartoon animations under multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions and fine-grained emotions. There is a huge domain gap between real-world videos and cartoon animation, as cartoon animation is usually abstract and has exaggerated motion. Meanwhile, public multimodal cartoon data are extremely scarce due to the difficulty of large-scale automatic annotation processes compared with real-life scenarios. To bridge this gap, we propose the MagicAnime dataset, a large-scale, hierarchically annotated, and multimodal dataset designed to support multiple video generation tasks, along with the benchmarks it includes. It contains 400k video clips for image-to-video generation, 50k pairs of video clips and keypoints for whole-body annotation, 12k pairs of video clips for video-to-video face animation, and 2.9k pairs of video and audio clips for audio-driven face animation. Meanwhile, we also build a set of multi-modal cartoon animation benchmarks, called MagicAnime-Bench, to support the comparisons of different methods in the tasks above. Comprehensive experiments on four tasks, including video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation, validate its effectiveness in supporting high-fidelity, fine-grained, and controllable generation.
[CV-82] Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction
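[Quick read]: This paper addresses Facial Beauty Prediction (FBP), which is hard because the task is subjective and the features that shape human aesthetic perception are subtle and holistic. Existing methods mostly build on deep convolutional networks or standard Vision Transformers pre-trained for generic image classification, and their learned features are not well aligned with high-level aesthetic assessment. The key to the solution is Diff-FBP, a two-stage framework. In the first stage, a Diffusion Transformer is pre-trained on a large unlabeled face dataset (FFHQ) with a self-supervised denoising task, which forces the model to learn the underlying distribution of human faces, including the structural priors and fine details that matter for aesthetic evaluation. In the second stage, the pre-trained encoder is frozen and used as a backbone feature extractor, and only a lightweight regression head is fine-tuned on the target dataset (FBP5500). Ablations indicate that generative pre-training is the main source of the gain, and the method reaches a Pearson Correlation Coefficient (PCC) of 0.932 on FBP5500.
Link: https://arxiv.org/abs/2507.20363
Authors: Djamel Eddine Boukhari, Ali Chemsa
Affiliations: LGEERE Laboratory, Department of Electrical Engineering, University of El Oued, El-Oued, Algeria; Scientific and Technical Research Centre for Arid Areas (CRSTRA), Biskra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: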
Abstract:Facial Beauty Prediction (FBP) is a challenging computer vision task due to its subjective nature and the subtle, holistic features that influence human perception. Prevailing methods, often based on deep convolutional networks or standard Vision Transformers pre-trained on generic object classification (e.g., ImageNet), struggle to learn feature representations that are truly aligned with high-level aesthetic assessment. In this paper, we propose a novel two-stage framework that leverages the power of generative models to create a superior, domain-specific feature extractor. In the first stage, we pre-train a Diffusion Transformer on a large-scale, unlabeled facial dataset (FFHQ) through a self-supervised denoising task. This process forces the model to learn the fundamental data distribution of human faces, capturing nuanced details and structural priors essential for aesthetic evaluation. In the second stage, the pre-trained and frozen encoder of our Diffusion Transformer is used as a backbone feature extractor, with only a lightweight regression head being fine-tuned on the target FBP dataset (FBP5500). Our method, termed Diff-FBP, sets a new state-of-the-art on the FBP5500 benchmark, achieving a Pearson Correlation Coefficient (PCC) of 0.932, significantly outperforming prior art based on general-purpose pre-training. Extensive ablation studies validate that our generative pre-training strategy is the key contributor to this performance leap, creating feature representations that are more semantically potent for subjective visual tasks.
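The headline metric here is the Pearson Correlation Coefficient between predicted and human-rated scores. A minimal sketch of the standard formula follows; the score lists are invented for illustration and have nothing to do with FBP5500.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented ratings on a 1-5 scale: human labels vs. model predictions.
human_scores = [2.0, 3.1, 4.0, 4.5]
predicted_scores = [2.1, 3.0, 3.9, 4.2]

print(round(pearson_correlation(human_scores, predicted_scores), 3))
```

PCC rewards getting the ordering and relative spacing of scores right; a model whose predictions are a shifted or rescaled copy of the human ratings still scores 1.0, so it is usually reported alongside error metrics such as MAE or RMSE.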
[CV-83] Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach
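[Quick read]: This paper addresses the risk that virtual content in Augmented Reality (AR) introduces semantically misleading or harmful information, focusing on Visual Information Manipulation (VIM) attacks, which alter the meaning of a real-world scene in subtle but impactful ways. The key to the solution is a systematic taxonomy of VIM attacks (three manipulation formats: character, phrase, and pattern; three purposes: information replacement, information obfuscation, and extra wrong information) and a dataset built on it, AR-VIM, containing 452 raw-AR video pairs across 202 scenes. On this basis the authors design VIM-Sense, a multimodal semantic reasoning framework that combines the cross-modal understanding of Vision-Language Models (VLMs) with Optical Character Recognition (OCR)-based text analysis. It reaches 88.94% attack detection accuracy, consistently outperforming vision-only and text-only baselines, with an average detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real mobile Android AR application.
Link: https://arxiv.org/abs/2507.20356
Authors: Yanming Xiu, Maria Gorlatova
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures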
Abstract:The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM. It consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect such attacks, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system reaches an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.
[CV-84] PIVOTS: Aligning unseen Structures using Preoperative to Intraoperative Volume-To-Surface Registration for Liver Navigation
[Quick read]: This paper addresses the prediction of liver deformation between the preoperative and intraoperative states in laparoscopic liver surgery, which is needed for accurate non-rigid registration: fusing preoperative information such as tumor location and vascular structures into the limited intraoperative view to improve surgical navigation. The key to the solution is PIVOTS, a Preoperative to Intraoperative VOlume-To-Surface registration neural network that takes point clouds directly as input. A geometric feature extraction encoder provides multi-resolution features, and a decoder built from novel deformation-aware cross-attention modules lets pre- and intraoperative information interact and predicts displacements at multiple levels. The method is reported to be robust to heavy noise, large deformation, and varying levels of intraoperative visibility.
Link: https://arxiv.org/abs/2507.20337
Authors: Peng Liu, Bianca Güttner, Yutong Su, Chenyang Li, Jinjing Xu, Mingyang Liu, Zhe Min, Andrey Zhylka, Jasper Smit, Karin Olthof, Matteo Fusaglia, Rudi Apolle, Matthias Miederer, Laura Frohneberger, Carina Riediger, Jügen Weitz, Fiona Kolbinger, Stefanie Speidel, Micha Pfeiffer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Non-rigid registration is essential for Augmented Reality guided laparoscopic liver surgery by fusing preoperative information, such as tumor location and vascular structures, into the limited intraoperative view, thereby enhancing surgical navigation. A prerequisite is the accurate prediction of intraoperative liver deformation which remains highly challenging due to factors such as large deformation caused by pneumoperitoneum, respiration and tool interaction as well as noisy intraoperative data, and limited field of view due to occlusion and constrained camera movement. To address these challenges, we introduce PIVOTS, a Preoperative to Intraoperative VOlume-To-Surface registration neural network that directly takes point clouds as input for deformation prediction. The geometric feature extraction encoder allows multi-resolution feature extraction, and the decoder, comprising novel deformation aware cross attention modules, enables pre- and intraoperative information interaction and accurate multi-level displacement prediction. We train the neural network on synthetic data simulated from a biomechanical simulation pipeline and validate its performance on both synthetic and real datasets. Results demonstrate superior registration performance of our method compared to baseline methods, exhibiting strong robustness against high amounts of noise, large deformation, and various levels of intraoperative visibility. We publish the training and test sets as evaluation benchmarks and call for a fair comparison of liver registration methods with volume-to-surface data. Code and datasets are available at this https URL.
[CV-85] From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
[Quick read]: This paper addresses 3D object insertion in videos, in particular the twin challenges of keeping temporal consistency and achieving photorealistic lighting in dynamic scenes. Among existing methods, 2D diffusion models produce photorealistic edits but struggle with coherence across frames, while traditional 3D rendering is spatially and temporally consistent but falls short on realistic lighting. The key to the solution is a hybrid insertion pipeline: 3D Gaussian Splatting (3DGS) supplies a temporally consistent initial rendering, and a 2D diffusion-based enhancement model then refines the lighting interactions. A shading-driven pipeline separates intrinsic object properties (albedo, shading, reflectance) and refines both the shading and the sRGB images, and the 3DGS model is optimized with multi-frame weighted adjustments to maintain temporal coherence. The authors describe this as the first approach to combine 3D rendering and 2D diffusion for video object insertion.
Link: https://arxiv.org/abs/2507.20331
Authors: Chenjian Gao, Lihe Ding, Rui Han, Zhanpeng Huang, Zibin Wang, Tianfan Xue
Affiliations: The Chinese University of Hong Kong; SenseTime Research; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages
Abstract:Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing. Project Page: this https URL
[CV-86] SWIFT: A General Sensitive Weight Identification Framework for Fast Sensor-Transfer Pansharpening
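[Quick read]: This paper addresses the severe performance drop of deep learning pansharpening models on data from unseen sensors; adapting them by full retraining or by designing more complex architectures is costly and hard to deploy. The key to the solution is SWIFT (Sensitive Weight Identification for Fast Transfer), a fast, general-purpose framework for cross-sensor adaptation. It first uses an unsupervised sampling strategy based on the manifold structure of the target-domain data to select only about 3% of the most informative samples. It then analyzes the gradient behavior of the source model's parameters on this subset to identify, and update, only the small subset of weights most sensitive to the domain shift. No full retraining is needed: on a single NVIDIA RTX 4090 GPU, adaptation time drops from hours to roughly one minute, while the adapted models outperform direct-transfer baselines and match or in some cases exceed full retraining, setting a new state of the art for cross-sensor pansharpening on the WorldView-2 and QuickBird datasets.
Link: https://arxiv.org/abs/2507.20311
Authors: Zeyu Xia, Chenxi Sun, Tianyu Xin, Yubo Zeng, Haoyu Chen, Liang-Jian Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: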
Abstract:Pansharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multispectral (LRMS) images to generate high-resolution multispectral (HRMS) images. Although deep learning-based methods have achieved promising performance, they generally suffer from severe performance degradation when applied to data from unseen sensors. Adapting these models through full-scale retraining or designing more complex architectures is often prohibitively expensive and impractical for real-world deployment. To address this critical challenge, we propose a fast and general-purpose framework for cross-sensor adaptation, SWIFT (Sensitive Weight Identification for Fast Transfer). Specifically, SWIFT employs an unsupervised sampling strategy based on data manifold structures to balance sample selection while mitigating the bias of traditional Farthest Point Sampling, efficiently selecting only 3% of the most informative samples from the target domain. This subset is then used to probe a source-domain pre-trained model by analyzing the gradient behavior of its parameters, allowing for the quick identification and subsequent update of only the weight subset most sensitive to the domain shift. As a plug-and-play framework, SWIFT can be applied to various existing pansharpening models. Extensive experiments demonstrate that SWIFT reduces the adaptation time from hours to approximately one minute on a single NVIDIA RTX 4090 GPU. The adapted models not only substantially outperform direct-transfer baselines but also achieve performance competitive with, and in some cases superior to, full retraining, establishing a new state-of-the-art on cross-sensor pansharpening tasks for the WorldView-2 and QuickBird datasets.
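The abstract contrasts SWIFT's manifold-based sampling with traditional Farthest Point Sampling (FPS). For readers unfamiliar with that baseline, the following is a minimal pure-Python sketch of classic greedy FPS. It is not SWIFT's sampling strategy; the function names and the toy 2D points are illustrative assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    Returns the indices of the k selected points, beginning with `start`.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the squared distance from point i to its nearest
    # already-selected point.
    min_dist = [squared_distance(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = squared_distance(points[i], points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy feature points on a line (made-up values).
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
chosen = farthest_point_sampling(points, k=3)
```

Because each step maximizes distance to the selected set, FPS tends to favor outliers and boundary points, which is one plausible reading of the bias the abstract says SWIFT mitigates.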
[CV-87] Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training ICCV2025
[Quick read]: This paper addresses the poor reconstruction of fine image structures (such as small characters and textures) in real-world image super-resolution (Real-ISR) built on pre-trained Stable Diffusion (SD) models, which is caused by the aggressive downsampling (e.g., 8×) of the VAE (variational autoencoder). The core of the solution is a Transfer VAE Training (TVT) strategy: first train a 4× decoder on the output features of the original VAE encoder, then keep that decoder fixed and train a new 4× encoder. This keeps the new encoder-decoder pair aligned with the original VAE latent space while improving detail recovery. In addition, a compact VAE and a compute-efficient UNet are obtained by optimizing the network architectures, which lowers the computational cost while still capturing high-resolution fine-scale features.
Link: https://arxiv.org/abs/2507.20291
Authors: Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Yuhui Wu, Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025
Abstract:Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8× downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8× downsampled VAE into a 4× one while adapting to the pre-trained UNet. Specifically, we first train a 4× decoder based on the output features of the original VAE encoder, then train a 4× encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at this https URL.
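The abstract notes that moving from an 8× to a 4× VAE raises the computational cost. A small arithmetic sketch makes the reason concrete: halving the downsampling factor quadruples the number of latent positions the UNet must process. The 512×512 input size and the function names are illustrative assumptions; the paper's actual resolutions and channel counts are not stated here.

```python
def latent_grid(height, width, factor):
    """Spatial size of the latent map for a given downsampling factor."""
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the factor")
    return height // factor, width // factor


def latent_positions(height, width, factor):
    """Number of spatial positions in the latent map."""
    h, w = latent_grid(height, width, factor)
    return h * w


# Assumed 512x512 input image.
grid_8x = latent_grid(512, 512, 8)  # 64 x 64 latent map
grid_4x = latent_grid(512, 512, 4)  # 128 x 128 latent map
growth = latent_positions(512, 512, 4) / latent_positions(512, 512, 8)
```

This only counts spatial positions; the actual compute overhead also depends on the network layers, which is why the abstract pairs the 4× VAE with a compact VAE and a compute-efficient UNet.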
[CV-88] T³SVFND: Towards an Evolving Fake News Detector for Emergencies with Test-time Training on Short Video Platforms DASFAA2025
[Quick read]: This paper addresses the limited generalization of existing fake news video detectors across short news videos from different events, with performance dropping sharply on emergency news. The key to the solution is T³SVFND, a framework based on Test-Time Training (TTT). It defines a self-supervised auxiliary task based on Mask Language Modeling (MLM): a percentage of the words in the text is masked, and the model predicts them using contextual information from the other modalities (audio and video). Running this auxiliary task at test time lets the model adapt to the test data distribution and makes detection of emergency news videos more robust.
Link: https://arxiv.org/abs/2507.20286
Authors: Liyuan Zhang, Zeyun Cheng, Yan Yang, Yong Liu, Jinke Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 16 pages, 3 figures, published to DASFAA 2025
Abstract:Existing methods for fake news video detection may not generalize, because there is a distribution shift between short news videos of different events, and the performance of such techniques drops sharply when the news comes from emergencies. We propose a new fake news video detection framework (T³SVFND) using Test-Time Training (TTT) to alleviate this limitation, enhancing the robustness of fake news video detection. Specifically, we design a self-supervised auxiliary task based on Mask Language Modeling (MLM) that masks a certain percentage of words in text and predicts these masked words by combining contextual information from different modalities (audio and video). In the test-time training phase, the model adapts to the distribution of test data through auxiliary tasks. Extensive experiments on the public benchmark demonstrate the effectiveness of the proposed model, especially for the detection of emergency news.
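The auxiliary task described above starts by masking a percentage of the words in the text. The sketch below shows only that masking step in pure Python; predicting the masked words from audio and video context, which is the paper's actual contribution, is not modeled. The function name `mask_tokens`, the `[MASK]` placeholder, the 30% ratio, and the example sentence are illustrative assumptions.

```python
import random


def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=None):
    """Replace a fraction of tokens with a mask placeholder.

    Returns the masked token list and a dict mapping each masked position
    to the original token, i.e. the prediction targets.
    """
    rng = random.Random(seed)
    if not tokens:
        return [], {}
    # Mask at least one token, and never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, targets


# Made-up caption text from a short news video.
tokens = "breaking news flood hits the city center this morning now".split()
masked, targets = mask_tokens(tokens, mask_ratio=0.3, seed=0)
```

At test time, a model trained with this kind of objective can take gradient steps on the masked-word loss alone, since the targets come from the input itself and no fake/real label is needed.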
[CV-89] Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation ICCV2025
[Quick read]: This paper addresses the tendency of deep neural networks to learn spurious correlations present in datasets, which undermines reliability and fairness in practice. The key to the solution is a simple yet effective framework called controllable feature whitening: the linear correlation between target and bias features is quantified by the covariance matrix and eliminated by a whitening module, which significantly mitigates bias while avoiding the need to model intractable higher-order dependencies. The method requires no regularization terms or adversarial learning, so optimization is more stable, and the trade-off between utility and fairness can be controlled by adjusting a weighting coefficient.
Link: https://arxiv.org/abs/2507.20284
Authors: Yooshin Cho, Hanbyel Cho, Janghyeon Lee, HyeongGwon Hong, Jaesung Ahn, Junmo Kim
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICCV 2025 (Poster)
Abstract:As the use of artificial intelligence rapidly increases, the development of trustworthy artificial intelligence has become important. However, recent studies have shown that deep neural networks are susceptible to learning spurious correlations present in datasets. To improve reliability, we propose a simple yet effective framework called controllable feature whitening. We quantify the linear correlation between the target and bias features by the covariance matrix, and eliminate it through the whitening module. Our results systematically demonstrate that removing the linear correlations between features fed into the last linear classifier significantly mitigates the bias, while avoiding the need to model intractable higher-order dependencies. A particular advantage of the proposed method is that it does not require regularization terms or adversarial learning, which often leads to unstable optimization in practice. Furthermore, we show that two fairness criteria, demographic parity and equalized odds, can be effectively handled by whitening with the re-weighted covariance matrix. Consequently, our method controls the trade-off between the utility and fairness of algorithms by adjusting the weighting coefficient. Finally, we validate that our method outperforms existing approaches on four benchmark datasets: Corrupted CIFAR-10, Biased FFHQ, WaterBirds, and Celeb-A.
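To illustrate what "eliminating linear correlation through whitening" means, here is a toy pure-Python sketch for one scalar bias feature and one scalar target feature. It uses standard Cholesky whitening of the 2×2 covariance matrix, which is a swapped-in textbook technique and not the paper's whitening module or its re-weighted covariance. The function name and the sample values are illustrative assumptions.

```python
import math


def cholesky_whiten_2d(bias, target):
    """Whiten a (bias, target) pair of scalar features.

    With covariance C = L L^T (L lower triangular), returns z = L^{-1}(x - mean),
    so both outputs have unit variance and zero covariance with each other.
    """
    n = len(bias)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_b = sum(bias) / n
    mean_t = sum(target) / n
    var_b = sum((b - mean_b) ** 2 for b in bias) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov_bt = sum((b - mean_b) * (t - mean_t) for b, t in zip(bias, target)) / n
    if var_b == 0:
        raise ValueError("bias feature is constant")
    # Entries of the Cholesky factor L = [[l11, 0], [l21, l22]].
    l11 = math.sqrt(var_b)
    l21 = cov_bt / l11
    residual = var_t - l21 ** 2
    if residual <= 0:
        raise ValueError("features are perfectly linearly dependent")
    l22 = math.sqrt(residual)
    z_bias = [(b - mean_b) / l11 for b in bias]
    # Subtract the part of the target explained linearly by the bias.
    z_target = [((t - mean_t) - l21 * zb) / l22 for t, zb in zip(target, z_bias)]
    return z_bias, z_target


# Made-up correlated features.
z_bias, z_target = cholesky_whiten_2d([0, 1, 2, 3], [1, 3, 2, 6])
```

After the transform, the second coordinate carries only the part of the target feature that is not linearly predictable from the bias feature. Higher-order dependence can remain, which matches the abstract's framing that only linear correlation is removed.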
[CV-90] L-MCAT: Unpaired Multimodal Transformer with Contrastive Attention for Label-Efficient Satellite Image Classification
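[Quick read]: This paper addresses the tension between scarce labels and heterogeneous multimodal data in remote sensing image classification: how to exploit unpaired multimodal satellite data (such as optical and synthetic aperture radar, SAR) with only a few labeled samples. The key to the solution is the Lightweight Multimodal Contrastive Attention Transformer (L-MCAT), with two core innovations: (1) Modality-Spectral Adapters (MSA), which compress high-dimensional sensor inputs into a unified embedding space; and (2) Unpaired Multimodal Attention Alignment (U-MAA), a contrastive self-supervised mechanism inside the attention layers that aligns modalities without pixel-level correspondence or labels. On the SEN12MS dataset the method reaches 95.4% overall accuracy with only 20 labels per class, outperforming existing baselines while using 47x fewer parameters and 23x fewer FLOPs than MCTrans.
Link: https://arxiv.org/abs/2507.20259
Authors: Mitul Goswami, Mrinal Goswami
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: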
Abstract:We propose the Lightweight Multimodal Contrastive Attention Transformer (L-MCAT), a novel transformer-based framework for label-efficient remote sensing image classification using unpaired multimodal satellite data. L-MCAT introduces two core innovations: (1) Modality-Spectral Adapters (MSA) that compress high-dimensional sensor inputs into a unified embedding space, and (2) Unpaired Multimodal Attention Alignment (U-MAA), a contrastive self-supervised mechanism integrated into the attention layers to align heterogeneous modalities without pixel-level correspondence or labels. L-MCAT achieves 95.4% overall accuracy on the SEN12MS dataset using only 20 labels per class, outperforming state-of-the-art baselines while using 47x fewer parameters and 23x fewer FLOPs than MCTrans. It maintains over 92% accuracy even under 50% spatial misalignment, demonstrating robustness for real-world deployment. The model trains end-to-end in under 5 hours on a single consumer GPU.
[CV-91] MIRepNet: A Pipeline and Foundation Model for EEG-Based Motor Imagery Classification
[Quick read]: This paper addresses the limited generalization of general-purpose EEG foundation models in brain-computer interfaces (BCIs), which overlook the neurophysiological differences between paradigms such as motor imagery (MI). The key to the solution is MIRepNet, the first EEG foundation model tailored to the MI paradigm, with two core components: a high-quality preprocessing pipeline built around a neurophysiologically informed channel template that adapts to EEG headsets with arbitrary electrode configurations; and a hybrid pretraining strategy combining self-supervised masked token reconstruction with supervised MI classification, which enables rapid adaptation and accurate decoding on new downstream MI tasks with fewer than 30 trials per class.
Link: https://arxiv.org/abs/2507.20254
Authors: Dingkun Liu, Zhu Chen, Jingwei Luo, Shijie Lian, Dongrui Wu
Affiliations: Huazhong University of Science and Technology; Zhongguancun Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices. Recent EEG foundation models aim to learn generalized representations across diverse BCI paradigms. However, these approaches overlook fundamental paradigm-specific neurophysiological distinctions, limiting their generalization ability. Importantly, in practical BCI deployments, the specific paradigm, such as motor imagery (MI) for stroke rehabilitation or assistive robotics, is generally determined prior to data acquisition. This paper proposes MIRepNet, the first EEG foundation model tailored for the MI paradigm. MIRepNet comprises a high-quality EEG preprocessing pipeline incorporating a neurophysiologically-informed channel template, adaptable to EEG headsets with arbitrary electrode configurations. Furthermore, we introduce a hybrid pretraining strategy that combines self-supervised masked token reconstruction and supervised MI classification, facilitating rapid adaptation and accurate decoding on novel downstream MI tasks with fewer than 30 trials per class. Extensive evaluations across five public MI datasets demonstrated that MIRepNet consistently achieved state-of-the-art performance, significantly outperforming both specialized and generalized EEG models. Our code will be available on GitHub (this https URL).
[CV-92] AnimalClue: Recognizing Animals by their Traces ICCV2025
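[Quick read]: This paper addresses species identification from indirect evidence of wildlife (footprints, feces, eggs, bones, and feathers), an important but relatively underexplored problem in wildlife monitoring. The key to the solution is AnimalClue, the first large-scale dataset for this task: 159,605 bounding boxes covering 968 species, 200 families, and 65 orders, annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information such as activity patterns and habitat preferences. Because it demands recognition of more detailed and subtle visual features, the dataset supports evaluating and improving classification, detection, and instance segmentation models on indirect evidence.
Link: https://arxiv.org/abs/2507.20240
Authors: Risa Shinoda, Nakamasa Inoue, Iro Laina, Christian Rupprecht, Hirokatsu Kataoka
Affiliations: The University of Osaka; Kyoto University; Tokyo Institute of Technology; National Institute of Advanced Industrial Science and Technology (AIST); Visual Geometry Group, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV2025 Highlight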
Abstract:Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at this https URL
[CV-93] Decomposing Densification in Gaussian Splatting for Faster 3D Scene Reconstruction
[Quick read]: This paper addresses the slow training convergence of 3D Gaussian Splatting (GS), caused by inefficient densification and a suboptimal spatial distribution of Gaussian primitives. The key to the solution is a global-to-local densification strategy that grows Gaussians more efficiently across the scene, balancing global coverage and local refinement. It is paired with an energy-guided coarse-to-fine multi-resolution training framework that gradually raises the resolution according to energy density in the 2D images, plus dynamic pruning of unnecessary primitives. Together these accelerate training considerably while using fewer Gaussian primitives and achieving better reconstruction quality.
Link: https://arxiv.org/abs/2507.20239
Authors: Binxiao Huang, Zhengwu Liu, Ngai Wong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting (GS) has emerged as a powerful representation for high-quality scene reconstruction, offering compelling rendering quality. However, the training process of GS often suffers from slow convergence due to inefficient densification and suboptimal spatial distribution of Gaussian primitives. In this work, we present a comprehensive analysis of the split and clone operations during the densification phase, revealing their distinct roles in balancing detail preservation and computational efficiency. Building upon this analysis, we propose a global-to-local densification strategy, which facilitates more efficient growth of Gaussians across the scene space, promoting both global coverage and local refinement. To cooperate with the proposed densification strategy and promote sufficient diffusion of Gaussian primitives in space, we introduce an energy-guided coarse-to-fine multi-resolution training framework, which gradually increases resolution based on energy density in 2D images. Additionally, we dynamically prune unnecessary Gaussian primitives to speed up the training. Extensive experiments on MipNeRF-360, Deep Blending, and Tanks & Temples datasets demonstrate that our approach significantly accelerates training, achieving over 2x speedup with fewer Gaussian primitives and superior reconstruction performance.
[CV-94] A Multi-Agent System for Information Extraction from the Chemical Literature
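[Quick read]: This paper addresses automatic extraction of multimodal chemical information from the chemical literature, in particular the challenges posed by the style variability and multimodality of complex chemical graphics (such as reaction schemes and molecular structure drawings). The key to the solution is a multi-agent system built on a multimodal large language model (MLLM): the MLLM's reasoning ability is used to understand the structure of complex chemical graphics, decompose the extraction task into sub-tasks, and coordinate a set of specialized agents to solve them. On a benchmark of complex chemical reaction graphics the system reaches an F1 score of 80.8%, far above the previous state-of-the-art model (35.6%), with consistent gains on the sub-tasks of molecular image recognition, reaction image parsing, named entity recognition, and text-based reaction extraction.
Link: https://arxiv.org/abs/2507.20230
Authors: Yufan Chen, Ching Ting Leung, Bowen Yu, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: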
Abstract:To fully expedite AI-powered chemical research, high-quality chemical databases are the cornerstone. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)-based multi-agent system for automatic chemical information extraction. We used the MLLM’s strong reasoning capability to understand the structure of complex chemical graphics, decompose the extraction task into sub-tasks and coordinate a set of specialized agents to solve them. Our system achieved an F1 score of 80.8% on a benchmark dataset of complex chemical reaction graphics from the literature, surpassing the previous state-of-the-art model (F1 score: 35.6%) by a significant margin. Additionally, it demonstrated consistent improvements in key sub-tasks, including molecular image recognition, reaction image parsing, named entity recognition and text-based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI-driven chemical research.
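The comparison above is stated in terms of F1 score. As a quick refresher on how that number combines precision and recall, here is a minimal pure-Python sketch. The function name and the counts in the example are illustrative assumptions; how the paper matches extracted items against ground truth is not described in the abstract.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 score: the harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct extractions means precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct extractions, 2 spurious, 2 missed.
score = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Because F1 is a harmonic mean, a system cannot score well by extracting everything (high recall, low precision) or by extracting only the easy cases (high precision, low recall).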
[CV-95] MambaMap: Online Vectorized HD Map Construction using State Space Model
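[Quick read]: This paper addresses the limited robustness of high-definition (HD) map construction under occlusion and limited perception range, and the fact that existing methods either fail to fully exploit temporal information or incur heavy computational overhead on long sequences. The key to the solution is MambaMap, a framework that fuses long-range temporal features efficiently with a state space model. A memory bank stores information from historical frames and dynamically updates BEV features and instance queries, improving robustness to noise and occlusion; a gating mechanism selectively integrates dependencies between map elements with high computational efficiency; and multi-directional and spatial-temporal scanning strategies strengthen feature extraction at both the BEV and instance levels, improving prediction accuracy and temporal consistency.
Link: https://arxiv.org/abs/2507.20224
Authors: Ruizi Yang, Xiaolu Liu, Junbo Chen, Jianke Zhu
Affiliations: Zhejiang University; Udeer.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: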
Abstract:High-definition (HD) maps are essential for autonomous driving, as they provide precise road information for downstream tasks. Recent advances highlight the potential of temporal modeling in addressing challenges like occlusions and extended perception range. However, existing methods either fail to fully exploit temporal information or incur substantial computational overhead in handling extended sequences. To tackle these challenges, we propose MambaMap, a novel framework that efficiently fuses long-range temporal features in the state space to construct online vectorized HD maps. Specifically, MambaMap incorporates a memory bank to store and utilize information from historical frames, dynamically updating BEV features and instance queries to improve robustness against noise and occlusions. Moreover, we introduce a gating mechanism in the state space, selectively integrating dependencies of map elements in high computational efficiency. In addition, we design innovative multi-directional and spatial-temporal scanning strategies to enhance feature extraction at both BEV and instance levels. These strategies significantly boost the prediction accuracy of our approach while ensuring robust temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed MambaMap approach outperforms state-of-the-art methods across various splits and perception ranges. Source code will be available at this https URL.
[CV-96] Multi-Attention Stacked Ensemble for Lung Cancer Detection in CT Scans
[Quick read]: This paper addresses binary classification of lung nodules (benign vs. malignant) from CT images, aiming to improve the accuracy and robustness of automated aids for lung cancer screening. The key to the solution is a multi-level attention stacked ensemble of deep neural networks. Three pretrained backbones (EfficientNet V2 S, MobileViT XXS, and DenseNet201) are each adapted to 96×96 pixel inputs with a custom classification head. A two-stage attention mechanism then learns model-wise and class-wise importance scores from the concatenated logits, and a lightweight meta-learner refines the final prediction. Dynamic focal loss, MixUp augmentation, and test-time augmentation are used to mitigate class imbalance and improve generalization. On the LIDC-IDRI dataset the method reports 98.09% accuracy and 0.9961 AUC, a 35% reduction in error rate relative to prior state-of-the-art methods, with particularly strong results on hard cases where radiologist disagreement was high.
Link: https://arxiv.org/abs/2507.20221
Authors: Uzzal Saha, Surya Prakash
Affiliations: Indian Institute of Technology (IIT) Indore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 14 figures
Abstract:In this work, we address the challenge of binary lung nodule classification (benign vs malignant) using CT images by proposing a multi-level attention stacked ensemble of deep neural networks. Three pretrained backbones - EfficientNet V2 S, MobileViT XXS, and DenseNet201 - are each adapted with a custom classification head tailored to 96 x 96 pixel inputs. A two-stage attention mechanism learns both model-wise and class-wise importance scores from concatenated logits, and a lightweight meta-learner refines the final prediction. To mitigate class imbalance and improve generalization, we employ dynamic focal loss with empirically calculated class weights, MixUp augmentation during training, and test-time augmentation at inference. Experiments on the LIDC-IDRI dataset demonstrate exceptional performance, achieving 98.09% accuracy and 0.9961 AUC, representing a 35 percent reduction in error rate compared to state-of-the-art methods. The model exhibits balanced performance across sensitivity (98.73%) and specificity (98.96%), with particularly strong results on challenging cases where radiologist disagreement was high. Statistical significance testing confirms the robustness of these improvements across multiple experimental runs. Our approach can serve as a robust, automated aid for radiologists in lung cancer screening.
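The abstract mentions focal loss with class weights as the tool against class imbalance. The sketch below shows the standard binary focal loss for a single sample in pure Python. The "dynamic" variant and the empirically calculated weights used in the paper are not specified in the abstract, so the `gamma` and `class_weights` values here are illustrative assumptions.

```python
import math


def focal_loss(p_positive, label, gamma=2.0, class_weights=(1.0, 1.0), eps=1e-12):
    """Standard binary focal loss for one sample.

    p_positive: predicted probability of the positive class (label 1).
    label: 0 or 1.
    class_weights: (weight for label 0, weight for label 1).
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p_positive if label == 1 else 1.0 - p_positive
    p_t = min(max(p_t, eps), 1.0)  # clamp to avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -class_weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than an uncertain one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.6, 1)
```

With `gamma=0` and unit weights the expression reduces to ordinary cross-entropy; raising `gamma` shifts the training signal toward hard examples, and the class weights rebalance benign and malignant samples.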
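[CV-97] Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models SIGGRAPH2025
[Quick read]: This paper addresses the coarse control and loss of motion detail in existing systems for generating co-speech gestures. Current methods typically rely on predefined categorical labels or implicit pseudo-labels derived from motion examples, which compromises the rich detail in the original examples. The key to the solution is MECo, a framework that fine-tunes large language models (LLMs) to interpret speech audio and motion examples together, and that places motion examples in the prompt structure as explicit query contexts, so that generated gestures stay congruent with the speech while preserving example-specific characteristics. The approach allows granular control of individual body parts, accepts diverse input modalities (motion clips, static poses, human video sequences, and textual descriptions), and reports state-of-the-art results on three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity.
Link: https://arxiv.org/abs/2507.20220
Authors: Bohong Chen, Yumeng Li, Youyi Zheng, Yao-Xiang Ding, Kun Zhou
Affiliations: Zhejiang University; State Key Lab of CAD&CG
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2025; Project Page: this https URL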
Abstract:The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs’ comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions. Our code, pre-trained models, and videos are available at this https URL.
[CV-98] Humanoid Occupancy: Enabling A Generalized Multimodal Occupancy Perception System on Humanoid Robots
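[Quick read]: This paper addresses the lack of a unified, efficient environmental perception capability with both semantic and geometric information for humanoid robots in complex real-world scenes. The central challenge is fusing multimodal sensor data while overcoming the kinematic interference and occlusion caused by the robot's own body. The key to the solution is Humanoid Occupancy, a generalized multimodal occupancy perception system that integrates hardware and software components, data acquisition devices, and a dedicated annotation pipeline. It uses multi-modal feature fusion and temporal information integration to produce grid-based occupancy outputs encoding both occupancy status and semantic labels, supporting downstream tasks such as task planning and navigation. The work also builds the first panoramic occupancy dataset specifically for humanoid robots, intended as a benchmark for standardizing humanoid visual modules.
Link: https://arxiv.org/abs/2507.20217
Authors: Wei Cui, Haoyu Wang, Wenkang Qin, Yijie Guo, Gang Han, Wen Zhao, Jiahang Cao, Zhang Zhang, Jiaru Zhong, Jingkai Sun, Pihai Sun, Shuai Shi, Botuo Jiang, Jiahao Ma, Jiaxu Wang, Hao Cheng, Zhichao Liu, Yang Wang, Zheng Zhu, Guan Huang, Jian Tang, Qiang Zhang
Affiliations: X-Humanoid; GigaAI
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech Report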
Abstract:Humanoid robot technology is advancing rapidly, with manufacturers introducing diverse heterogeneous visual perception modules tailored to specific scenarios. Among various perception paradigms, occupancy-based representation has become widely recognized as particularly suitable for humanoid robots, as it provides both rich semantic and 3D geometric information essential for comprehensive environmental understanding. In this work, we present Humanoid Occupancy, a generalized multimodal occupancy perception system that integrates hardware and software components, data acquisition devices, and a dedicated annotation pipeline. Our framework employs advanced multi-modal fusion techniques to generate grid-based occupancy outputs encoding both occupancy status and semantic labels, thereby enabling holistic environmental understanding for downstream tasks such as task planning and navigation. To address the unique challenges of humanoid robots, we overcome issues such as kinematic interference and occlusion, and establish an effective sensor layout strategy. Furthermore, we have developed the first panoramic occupancy dataset specifically for humanoid robots, offering a valuable benchmark and resource for future research and development in this domain. The network architecture incorporates multi-modal feature fusion and temporal information integration to ensure robust perception. Overall, Humanoid Occupancy delivers effective environmental perception for humanoid robots and establishes a technical foundation for standardizing universal visual modules, paving the way for the widespread deployment of humanoid robots in complex real-world scenarios.
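[CV-99] Dual-Stream Global-Local Feature Collaborative Representation Network for Scene Classification of Mining Area
[Quick Read]: This paper addresses the difficulty of classifying land-cover scenes in mining areas, which stems from their complex spatial layout and multi-scale characteristics. Its core solution is a dual-branch fusion model that uses collaborative representation to decompose global features into a set of key semantic vectors, giving a comprehensive description of the multi-scale spatial distribution of mining areas. The key innovations are: (1) a multi-scale global Transformer branch that uses large-scale features to generate channel attention features for small-scale features, capturing cross-scale feature relationships; (2) a local enhancement collaborative representation branch that refines attention weights using local features and the reconstructed key semantic set, increasing sensitivity to fine-grained spatial variation; and (3) a dual-branch deep feature fusion module that integrates the complementary information of the two branches, improving discrimination of complex mining landscapes. Trained under a multi-loss constraint, the model reaches an overall accuracy of 83.63%, outperforming the compared methods.
Link: https://arxiv.org/abs/2507.20216
Authors: Shuqi Fan,Haoyi Wang,Xianju Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: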
Abstract:Scene classification of mining areas provides accurate foundational data for geological environment monitoring and resource development planning. This study fuses multi-source data to construct a multi-modal mine land cover scene classification dataset. A significant challenge in mining area classification lies in the complex spatial layout and multi-scale characteristics. By extracting global and local features, it becomes possible to comprehensively reflect the spatial distribution, thereby enabling a more accurate capture of the holistic characteristics of mining scenes. We propose a dual-branch fusion model utilizing collaborative representation to decompose global features into a set of key semantic vectors. This model comprises three key components: (1) Multi-scale Global Transformer Branch: It leverages adjacent large-scale features to generate global channel attention features for small-scale features, effectively capturing the multi-scale feature relationships. (2) Local Enhancement Collaborative Representation Branch: It refines the attention weights by leveraging local features and reconstructed key semantic sets, ensuring that the local context and detailed characteristics of the mining area are effectively integrated. This enhances the model’s sensitivity to fine-grained spatial variations. (3) Dual-Branch Deep Feature Fusion Module: It fuses the complementary features of the two branches to incorporate more scene information. This fusion strengthens the model’s ability to distinguish and classify complex mining landscapes. Finally, this study employs multi-loss computation to ensure a balanced integration of the modules. The overall accuracy of this model is 83.63%, which outperforms other comparative models. Additionally, it achieves the best performance across all other evaluation metrics.
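[CV-100] Neural Shell Texture Splatting: More Details and Fewer Primitives
[Quick Read]: This paper addresses the low parameter efficiency of Gaussian Splatting in novel view synthesis, which arises because geometry and appearance are entangled, so that high-quality reconstruction requires a large number of primitives. The key to its solution is a neural shell texture, a global texture representation that encodes texture information around the surface. Gaussian primitives serve both as the geometric representation and as texture field samplers, efficiently splatting texture features into image space. This disentanglement improves parameter efficiency, strengthens the reconstruction of fine texture detail, and simplifies the extraction of textured meshes.
Link: https://arxiv.org/abs/2507.20200
Authors: Xin Zhang,Anpei Chen,Jincheng Xiong,Pinxuan Dai,Yujun Shen,Weiwei Xu
Affiliations: Zhejiang University; Westlake University; Ant Group
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: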
Abstract:Gaussian splatting techniques have shown promising results in novel view synthesis, achieving high fidelity and efficiency. However, their high reconstruction quality comes at the cost of requiring a large number of primitives. We identify this issue as stemming from the entanglement of geometry and appearance in Gaussian Splatting. To address this, we introduce a neural shell texture, a global representation that encodes texture information around the surface. We use Gaussian primitives as both a geometric representation and texture field samplers, efficiently splatting texture features into image space. Our evaluation demonstrates that this disentanglement enables high parameter efficiency, fine texture detail reconstruction, and easy textured mesh extraction, all while using significantly fewer primitives.
[CV-101] When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
[Quick Read]: This paper addresses the computational bottleneck that multimodal large language models (MLLMs) face when processing long contexts, in particular the high resource cost caused by the quadratic complexity of self-attention. Its key contribution is a systematic survey of methods for multimodal long-context token compression. It groups them by modality (image, video, audio) and then by compression mechanism (transformation-based, similarity-based, attention-based, query-based), showing how different strategies exploit the redundancy specific to each modality to reduce token counts, lower the computational burden during training and inference, and improve how MLLMs handle high-resolution images, long video sequences, and lengthy audio.
Link: https://arxiv.org/abs/2507.20198
Authors: Kele Shao,Keda Tao,Kejia Zhang,Sicheng Feng,Mu Cai,Yuzhang Shang,Haoxuan You,Can Qin,Yang Sui,Huan Wang
Affiliations: Zhejiang University; Westlake University; Xiamen University; National University of Singapore; University of Wisconsin-Madison; University of Central Florida; Columbia University; Salesforce AI Research; Rice University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: For ongoing updates and to track the latest advances in this promising area, we maintain a public repository: Awesome-Multimodal-Token-Compression (this https URL)
Abstract:Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
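One of the mechanism families named in the abstract, attention-based compression, can be illustrated with a minimal sketch. It assumes each token already has a scalar importance score (for example, the attention it receives from a query token) and keeps only the top-scoring tokens. The function name `prune_tokens` and the toy scores are hypothetical and are not taken from any method in the survey.

```python
def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `scores` is assumed to hold one importance value per token (hypothetical
    stand-in for an attention-derived score).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    # Rank indices by descending score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore positional order so the compressed sequence stays coherent.
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


# Toy example: six visual tokens, three of them redundant background patches.
tokens = ["sky", "sky", "dog", "grass", "ball", "sky"]
scores = [0.02, 0.03, 0.40, 0.10, 0.35, 0.10]
compressed = prune_tokens(tokens, scores, keep=3)
print(compressed)
```

Because self-attention cost grows quadratically with sequence length, halving the token count in this toy case cuts the number of pairwise interactions from 36 to 9.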
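[CV-102] Color histogram equalization and fine-tuning to improve expression recognition of (partially occluded) faces on sign language datasets
[Quick Read]: This paper investigates how well computer vision methods can classify facial expressions on sign language datasets, with particular attention to differences in emotion manifestation between hearing and deaf subjects. Its key solution is a color normalization stage based on histogram equalization and fine-tuning, which adapts to the dataset's particular color profile, combined with separate analysis of expression recognition from the upper and lower halves of the face to show how each region contributes. Experiments report 83.8% mean sensitivity with very little variance among classes (0.042), and recognition accuracy from the upper half of the face exceeds human level, which suggests the method is well suited to emotion recognition across groups.
Link: https://arxiv.org/abs/2507.20197
Authors: Fabrizio Nunnari,Alakshendra Jyotsnaditya Ramkrishna Singh,Patrick Gebhard
Affiliations: German Research Center for Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: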
Abstract:The goal of this investigation is to quantify to what extent computer vision methods can correctly classify facial expressions on a sign language dataset. We extend our experiments by recognizing expressions using only the upper or lower part of the face, which is needed to further investigate the difference in emotion manifestation between hearing and deaf subjects. To take into account the peculiar color profile of a dataset, our method introduces a color normalization stage based on histogram equalization and fine-tuning. The results show the ability to correctly recognize facial expressions with 83.8% mean sensitivity and very little variance (.042) among classes. Like for humans, recognition of expressions from the lower half of the face (79.6%) is higher than that from the upper half (77.9%). Noticeably, the classification accuracy from the upper half of the face is higher than human level.
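For readers unfamiliar with histogram equalization, the sketch below shows the standard single-channel version on 8-bit intensities. It is only an illustration of the general technique: how the paper applies it to color images (per channel, on luminance, or otherwise) is not stated in the abstract, and the function name `equalize` is hypothetical.

```python
def equalize(pixels, levels=256):
    """Classic histogram equalization for a flat list of integer intensities.

    Each value is remapped through the normalized cumulative histogram so the
    output spreads across the full [0, levels - 1] range.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of intensities.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)

    n = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant image: nothing to stretch.
        return list(pixels)

    # Lookup table; values below the darkest pixel are clamped to 0.
    lut = [
        max(0, round((cdf[v] - cdf_min) / (n - cdf_min) * (levels - 1)))
        for v in range(levels)
    ]
    return [lut[p] for p in pixels]


# A low-contrast patch squeezed into the range 50..70.
patch = [50, 50, 60, 70]
stretched = equalize(patch)
print(stretched)
```

The darkest input value maps to 0 and the brightest to 255, while the ordering of intensities is preserved.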
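[CV-103] SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection
[Quick Read]: This paper addresses the challenges that multiple scripts and arbitrarily shaped text instances pose for text detection in natural scenes, where visual cues alone are often insufficient. Existing methods do not fully exploit semantic context, which limits detection performance. The key to the solution is a semantic-aware vision-language model, SAViL-Det. It combines a pre-trained CLIP model with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion, and introduces a language-vision decoder that uses cross-modal attention to adaptively propagate fine-grained semantic information from text prompts to visual features. A text-to-pixel contrastive learning mechanism additionally aligns textual features explicitly with the corresponding visual pixel features, improving detection of multilingual text in complex scenes.
Link: https://arxiv.org/abs/2507.20188
Authors: Mohammed-En-Nadhir Zighem,Abdenour Hadid
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: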
Abstract:Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance with F-scores of 84.8% on the benchmark multi-lingual MLT-2019 dataset and 90.2% on the curved-text CTW1500 dataset.
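[CV-104] SAMwave: Wavelet-Driven Feature Enrichment for Effective Adaptation of Segment Anything Model BMVC2025
[Quick Read]: This paper addresses the performance degradation of large foundation models such as the Segment Anything Model (SAM) on complex downstream tasks, and in particular the limited benefit of conventional adapter-based fine-tuning, whose feature extraction is constrained. The key to the solution is SAMwave, which uses the wavelet transform to extract richer, multi-scale high-frequency features from the input. It further introduces complex-valued adapters that capture complex-valued spatial-frequency information through complex wavelet transforms. By adaptively integrating these wavelet coefficients, SAM's encoder captures the semantic and structural information that dense prediction requires, which substantially improves adaptation on several low-level vision tasks.
Link: https://arxiv.org/abs/2507.20186
Authors: Saurabh Yadav,Avi Gupta,Koteswar Rao Jerripothula
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to BMVC 2025. The first two authors contributed equally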
Abstract:The emergence of large foundation models has propelled significant advances in various domains. The Segment Anything Model (SAM), a leading model for image segmentation, exemplifies these advances, outperforming traditional methods. However, such foundation models often suffer from performance degradation when applied to complex tasks for which they are not trained. Existing methods typically employ adapter-based fine-tuning strategies to adapt SAM for tasks and leverage high-frequency features extracted from the Fourier domain. However, our analysis reveals that these approaches offer limited benefits due to constraints in their feature extraction techniques. To overcome this, we propose SAMwave, a novel and interpretable approach that utilizes the wavelet transform to extract richer, multi-scale high-frequency features from input data. Extending this, we introduce complex-valued adapters capable of capturing complex-valued spatial-frequency information via complex wavelet transforms. By adaptively integrating these wavelet coefficients, SAMwave enables SAM’s encoder to capture information more relevant for dense prediction. Empirical evaluations on four challenging low-level vision tasks demonstrate that SAMwave significantly outperforms existing adaptation methods. This superior performance is consistent across both the SAM and SAM2 backbones and holds for both real and complex-valued adapter variants, highlighting the efficiency, flexibility, and interpretability of our proposed method for adapting segment anything models.
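To make the idea of wavelet-derived high-frequency features concrete, here is a minimal one-level 2D Haar decomposition. The abstract does not say which wavelet family SAMwave uses, so the Haar choice, the function name `haar2d`, and the subband names are assumptions made purely for illustration.

```python
def haar2d(image):
    """One-level orthonormal 2D Haar transform of a list-of-lists image.

    Returns four half-resolution subbands: a low-frequency approximation and
    three high-frequency detail bands (column, row, and diagonal differences).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")

    bands = {"approx": [], "col_diff": [], "row_diff": [], "diag": []}
    for r in range(0, rows, 2):
        out = {name: [] for name in bands}
        for c in range(0, cols, 2):
            # 2x2 block: a b / d e
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            out["approx"].append((a + b + d + e) / 2)
            out["col_diff"].append((a - b + d - e) / 2)  # reacts to vertical edges
            out["row_diff"].append((a + b - d - e) / 2)  # reacts to horizontal edges
            out["diag"].append((a - b - d + e) / 2)
        for name in bands:
            bands[name].append(out[name])
    return bands


# A 2x4 image with a vertical edge between the first and second column.
image = [[0, 4, 4, 4],
         [0, 4, 4, 4]]
bands = haar2d(image)
print(bands["approx"], bands["col_diff"])
```

The flat region produces zero detail coefficients, while the block containing the edge yields a nonzero column-difference coefficient, which is the kind of localized high-frequency cue a wavelet-based adapter could consume.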
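[CV-105] MoCTEFuse: Illumination-Gated Mixture of Chiral Transformer Experts for Multi-Level Infrared and Visible Image Fusion
[Quick Read]: This paper addresses the effect of illumination changes on the quality of infrared and visible image fusion. Existing methods often ignore this factor and merge source image information directly, which leads to modality bias. The key to the solution is a dynamic multi-level image fusion network, MoCTEFuse, built around an illumination-gated Mixture of Chiral Transformer Experts (MoCTE). MoCTE contains high-illumination and low-illumination expert subnetworks, each based on the Chiral Transformer Fusion Block (CTFB). Guided by illumination gating signals, CTFB uses an asymmetric cross-attention mechanism to switch dynamically between primary and auxiliary modalities and assign their weights, and it is stacked across multiple stages to progressively aggregate and refine modality-specific and cross-modality information. This design balances texture detail and object contrast adaptively, improving the robustness and visual consistency of the fused results.
Link: https://arxiv.org/abs/2507.20180
Authors: Li Jinfu,Song Hong,Xia Jianghan,Lin Yucong,Wang Ting,Shao Long,Fan Jingfan,Yang Jian
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: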
Abstract:While illumination changes inevitably affect the quality of infrared and visible image fusion, many outstanding methods still ignore this factor and directly merge the information from source images, leading to modality bias in the fused results. To this end, we propose a dynamic multi-level image fusion network called MoCTEFuse, which applies an illumination-gated Mixture of Chiral Transformer Experts (MoCTE) to adaptively preserve texture details and object contrasts in balance. MoCTE consists of high- and low-illumination expert subnetworks, each built upon the Chiral Transformer Fusion Block (CTFB). Guided by the illumination gating signals, CTFB dynamically switches between the primary and auxiliary modalities as well as assigning them corresponding weights with its asymmetric cross-attention mechanism. Meanwhile, it is stacked at multiple stages to progressively aggregate and refine modality-specific and cross-modality information. To facilitate robust training, we propose a competitive loss function that integrates illumination distributions with three levels of sub-loss terms. Extensive experiments conducted on the DroneVehicle, MSRS, TNO and RoadScene datasets show MoCTEFuse’s superior fusion performance. Finally, it achieves the best detection mean Average Precision (mAP) of 70.93% on the MFNet dataset and 45.14% on the DroneVehicle dataset. The code and model are released at this https URL.
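The gating idea can be sketched in a few lines, under strong simplifying assumptions. Here the gate is just the mean visible-light intensity, and the two "experts" are fixed blends rather than Transformer subnetworks; the 0.8/0.2 weights and the names `illumination_gate` and `fuse` are hypothetical and do not come from the paper.

```python
def illumination_gate(visible):
    """Mean intensity in [0, 1], a stand-in for a learned gating signal."""
    flat = [p for row in visible for p in row]
    return sum(flat) / len(flat)


def fuse(visible, infrared):
    """Blend two toy experts according to the illumination gate.

    The high-illumination expert treats the visible image as the primary
    modality; the low-illumination expert treats the infrared image as primary.
    """
    g = illumination_gate(visible)
    fused = []
    for v_row, i_row in zip(visible, infrared):
        row = []
        for v, i in zip(v_row, i_row):
            high = 0.8 * v + 0.2 * i  # visible primary, infrared auxiliary
            low = 0.2 * v + 0.8 * i   # infrared primary, visible auxiliary
            row.append(g * high + (1 - g) * low)
        fused.append(row)
    return g, fused


# A dark scene: the gate is small, so the infrared-primary expert dominates.
gate, result = fuse([[0.1, 0.2]], [[0.9, 0.7]])
print(round(gate, 2), [round(p, 3) for p in result[0]])
```

The point of the sketch is the switch of primary and auxiliary roles with scene brightness; the actual model learns both the gate and the experts.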
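[CV-106] Towards Universal Modal Tracking with Online Dense Temporal Token Learning
[Quick Read]: This paper addresses weak generalization, inefficient cross-modal feature fusion, and high training cost in multi-modal video object tracking. Its core solution is a unified video-level, modality-aware tracking model, \modaltracker. The key innovations are: (1) video-level sampling, which extends the input to a video sequence to obtain richer global context; (2) video-level association, which introduces two online dense temporal token association mechanisms that propagate the target's appearance and motion trajectory information in a video-streaming manner; and (3) modality scalability, which proposes two gated perceivers that adaptively learn cross-modal representations through gated attention and compress them into a single set of parameters through one-shot training for multi-task inference. The approach improves performance across RGB, RGB+Thermal, RGB+Depth, and RGB+Event tracking without training a separate model per modality, which reduces the training burden and strengthens the model's representation.
Link: https://arxiv.org/abs/2507.20177
Authors: Yaozong Zheng,Bineng Zhong,Qihua Liang,Shengping Zhang,Guorong Li,Xianxian Li,Rongrong Ji
Affiliations: Guangxi Normal University; Harbin Institute of Technology; University of Chinese Academy of Sciences; Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2401.01686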
Abstract:We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called \modaltracker). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals. Video-level Sampling: We expand the model’s inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. Video-level Association: Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video-streaming manner. Modality Scalable: We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters through one-shot training for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for the inference in the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our \modaltracker achieves a new SOTA performance. The code will be available at this https URL.
[CV-107] LRR-Bench: Left Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
[Quick Read]: This paper addresses the limited ability of Vision-Language Models (VLMs) to recognize spatial relationships and perceive spatial movement in real-world applications, in particular their weaknesses in absolute spatial understanding and 3D spatial understanding. The key to the solution is a fully synthetic benchmark dataset together with a systematic spatial evaluation pipeline. Test samples can be generated at low cost without dataset contamination, which allows an objective assessment of mainstream VLMs on different spatial tasks and exposes clear shortcomings on complex spatial reasoning.
Link: https://arxiv.org/abs/2507.20174
Authors: Fei Kong,Jinhao Duan,Kaidi Xu,Zhenhua Guo,Xiaofeng Zhu,Xiaoshuang Shi
Affiliations: University of Electronic Science and Technology of China; Drexel University; Tianyijiaotong Technology Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available on this https URL.
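[CV-108] PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks ICCV2025
[Quick Read]: This paper addresses the difficulty of transferring motion data across skeletons in data-driven 3D character animation, where differences in bone proportions or structure prevent motion capture data from being reused. To meet this challenge, the authors propose PUMPS, a first autoencoder architecture for Temporal Point Clouds (TPCs). Its key idea is to compress each frame's point cloud independently into a sampleable feature vector, and to let the decoder extract distinct, temporally consistent points using latent Gaussian noise vectors as sampling identifiers. A point pairing scheme based on linear assignment optimizes reconstruction and avoids expensive point-wise attention, giving efficient and general TPC representation learning. The method handles the pre-training tasks of motion prediction, transition generation, and keyframe interpolation without native dataset supervision, and remains strong when fine-tuned for motion denoising or estimation.
Link: https://arxiv.org/abs/2507.20170
Authors: Clinton Ansun Mo,Kun Hu,Chengjiang Long,Dong Yuan,Wan-Chi Siu,Zhiyong Wang
Affiliations: The University of Sydney; The University of Tokyo; Edith Cowan University; Meta Reality Labs; Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in ICCV 2025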
Abstract:Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and negate the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
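The linear assignment-based point pairing mentioned in the abstract can be sketched as follows: decoded points carry no fixed order, so before computing a reconstruction error each predicted point is matched to exactly one target point such that the total squared distance is minimal. This sketch brute-forces the assignment over permutations, which is only practical for a handful of points; for realistic sizes a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. The function name `pair_points` and the sample coordinates are hypothetical, and the cost function is an assumption.

```python
import math
from itertools import permutations


def pair_points(predicted, target):
    """Return the one-to-one pairing minimizing total squared distance.

    assignment[i] is the index of the target point matched to predicted[i].
    """
    if len(predicted) != len(target):
        raise ValueError("point sets must have the same size")
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(target))):
        cost = sum(
            math.dist(predicted[i], target[j]) ** 2 for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Three decoded points and the same three targets listed in a different order.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
target = [(0.0, 1.1, 0.0), (0.1, 0.0, 0.0), (1.0, 0.1, 0.0)]
assignment, cost = pair_points(predicted, target)
print(assignment, round(cost, 4))
```

Pairing first and measuring error second means the loss does not penalize the decoder for emitting the right points in a different order.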
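[CV-109] Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning ICCV2025
[Quick Read]: This paper addresses two shortcomings of existing sports video captioning methods: they focus on actions while overlooking player identities, so captions cannot refer to specific players, and methods that add extra information often produce wrong identities because that information is independent of the video content. The key to the solution is a player-centric multimodal prompt generation network (LLM-IAVC). Its core is an identity-related information extraction module (IRIEM), which contains a player identification network (PIN) that extracts visual features and player names, and a bidirectional semantic interaction module (BSIM) that links player features with video content for mutual enhancement. A visual context learning module (VCLM) captures key video context. The outputs of these modules are combined into a multimodal prompt for the large language model (LLM), which then generates captions with accurate player identities.
Link: https://arxiv.org/abs/2507.20163
Authors: Zeyu Xi,Haoying Sun,Yaofei Wu,Junchi Yan,Haoran Zhang,Lifang Wu,Liang Wang,Changwen Chen
Affiliations: Beijing University of Technology; Shanghai Jiao Tong University; Chinese Academy of Sciences; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025 (Poster)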
Abstract:Existing sports video captioning methods often focus on the action yet overlook player identities, limiting their applicability. Although some methods integrate extra information to generate identity-aware descriptions, the player identities are sometimes incorrect because the extra information is independent of the video content. This paper proposes a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-IAVC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture the key video context information. Finally, by integrating the outputs of the above modules as the multimodal prompt for the large language model (LLM), it facilitates the generation of descriptions with player identities. To support this work, we construct a new benchmark called NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 major event types. The experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance. Code and dataset are publicly available at this https URL.
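[CV-110] AnimeColor: Reference-based Animation Colorization with Diffusion Transformers
[Quick Read]: This paper addresses insufficient color accuracy and poor temporal consistency in animation colorization. Existing methods struggle to generate high-quality, stable animation sequences while keeping semantic colors consistent. The key to the solution is AnimeColor, a reference-based animation colorization framework that builds a video diffusion model on Diffusion Transformers (DiT) and adds two core components: a High-level Color Extractor (HCE) that captures semantic color information, and a Low-level Color Guider (LCG) that extracts fine-grained color details from reference images. The two work together to guide the diffusion process, and a multi-stage training strategy maximizes the use of reference color information, improving color accuracy, sketch alignment, temporal consistency, and visual quality.
Link: https://arxiv.org/abs/2507.20158
Authors: Yuhong Zhang,Liyao Wang,Han Wang,Danni Wu,Zuzeng Lin,Feng Wang,Li Song
Affiliations: Shanghai Jiao Tong University; Tianjin University; Communication University of China; CreateAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: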
Abstract:Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose AnimeColor, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at this https URL.
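[CV-111] Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality
[Quick Read]: This paper addresses the data quality problem that arises when Vision-Language Models (VLMs) are trained with image data: large numbers of web-crawled image-text pairs are noisy, inaccurate, or semantically misaligned, which hurts model performance. The key to the solution is a lightweight data filtration framework built on a compact VLM fine-tuned on a high-quality image-caption annotated dataset. The model directly evaluates the image quality, caption quality, and alignment of candidate samples and filters for high-fidelity training data. It drops the auxiliary filtration modules that earlier pipelines add and relies only on the evaluative capability of the purpose-built small model, improving data cleanliness and alignment without extra training overhead, so that a smaller high-quality dataset can match or surpass a larger, noisier one.
Link: https://arxiv.org/abs/2507.20156
Authors: Daulet Toibazar,Kesen Wang,Sherif Mohamed,Abdulaziz Al-Badawi,Abdulrahman Alfulayt,Pedro J. Moreno
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: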
Abstract:Vision-language models (VLMs) extend the conventional large language models by integrating visual data, enabling richer multimodal reasoning and significantly broadening the practical applications of AI. However, including visual inputs also brings new challenges in maintaining data quality. Empirical evidence consistently shows that carefully curated and representative training examples often yield superior results compared to simply increasing the quantity of data. Inspired by this observation, we introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality image-caption annotated dataset. This model effectively evaluates and filters potential training samples based on caption and image quality and alignment. Unlike previous approaches, which typically add auxiliary filtration modules on top of existing full-scale VLMs, our method exclusively utilizes the inherent evaluative capability of a purpose-built small VLM. This strategy eliminates the need for extra modules and reduces training overhead. Our lightweight model efficiently filters out inaccurate, noisy web data, improving image-text alignment and caption linguistic fluency. Experimental results show that datasets that underwent high-precision filtration using our compact VLM perform on par with, or even surpass, larger and noisier datasets gathered through high-volume web crawling. Thus, our method provides a lightweight yet robust solution for building high-quality vision-language training corpora. Availability and implementation: Our compact VLM filtration model, training data, utility scripts, and Supplementary data (Appendices) are freely available at this https URL.
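[CV-112] GT-Mean Loss: A Simple Yet Effective Solution for Brightness Mismatch in Low-Light Image Enhancement ICCV2025
[Quick Read]: This paper addresses brightness mismatch in supervised low-light image enhancement (LLIE): a systematic discrepancy between the overall brightness of the enhanced image and that of its ground truth, which misleads training and degrades performance. The key to the solution is a new loss function, the GT-mean loss, which models image mean values directly from a probabilistic perspective. It can be integrated flexibly into existing LLIE loss functions at minimal additional computational cost, and experiments show consistent improvements across a range of methods and datasets.
Link: https://arxiv.org/abs/2507.20148
Authors: Jingxi Liao,Shijie Hao,Richang Hong,Meng Wang
Affiliations: Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV2025. GitHub repository: this https URL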
Abstract:Low-light image enhancement (LLIE) aims to improve the visual quality of images captured under poor lighting conditions. In supervised LLIE research, there exists a significant yet often overlooked inconsistency between the overall brightness of an enhanced image and its ground truth counterpart, referred to as brightness mismatch in this study. Brightness mismatch negatively impacts supervised LLIE models by misleading model training. However, this issue is largely neglected in current research. In this context, we propose the GT-mean loss, a simple yet effective loss function directly modeling the mean values of images from a probabilistic perspective. The GT-mean loss is flexible, as it extends existing supervised LLIE loss functions into the GT-mean form with minimal additional computational costs. Extensive experiments demonstrate that the incorporation of the GT-mean loss results in consistent performance improvements across various methods and datasets.
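The brightness mismatch problem itself is easy to demonstrate numerically. The sketch below is not the GT-mean loss, whose formula the abstract does not give. It only shows, under an assumed simple rescaling, how a global brightness offset can dominate a pixel-wise L1 loss even when the image structure is predicted perfectly. The helper names `l1_loss` and `align_brightness` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def l1_loss(pred, gt):
    """Mean absolute error between two flat intensity lists."""
    return mean([abs(p - g) for p, g in zip(pred, gt)])


def align_brightness(pred, gt):
    """Rescale `pred` so its mean matches the ground-truth mean.

    Assumes mean(pred) > 0; no clipping is applied in this toy version.
    """
    scale = mean(gt) / mean(pred)
    return [p * scale for p in pred]


# The prediction has exactly the right structure but half the brightness.
gt = [0.2, 0.4, 0.6, 0.8]
pred = [0.1, 0.2, 0.3, 0.4]

raw = l1_loss(pred, gt)
aligned = l1_loss(align_brightness(pred, gt), gt)
print(round(raw, 3), round(aligned, 3))
```

All of the raw loss here comes from the brightness offset, which is the kind of misleading training signal the paper sets out to handle.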
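[CV-113] Wavelet-guided Misalignment-aware Network for Visible-Infrared Object Detection
[Quick Read]: This paper addresses the frequent cross-modal misalignment in visible-infrared object detection caused by resolution disparities, spatial displacements, and modality inconsistencies, with the aim of improving detection robustness. The key to the solution is the Wavelet-guided Misalignment-aware Network (WMNet), which combines wavelet-based multi-frequency analysis with modality-aware fusion to handle different cross-modal misalignment patterns adaptively. By jointly exploiting low-frequency and high-frequency information and introducing adaptive guidance across modalities, it mitigates the adverse effects of noise, illumination variation, and spatial misalignment, strengthens the representation of salient target features, and suppresses spurious or misleading information, giving more accurate and robust cross-modal detection.
Link: https://arxiv.org/abs/2507.20146
Authors: Haote Zhang,Lipeng Gu,Wuzhou Quan,Fu Lee Wang,Honghui Fan,Jiali Tang,Dingkun Zhu,Haoran Xie,Xiaoping Zhang,Mingqiang Wei
Affiliations: Jiangsu University of Technology; Nanjing University of Aeronautics and Astronautics; Hong Kong Metropolitan University; Lingnan University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: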
Abstract:Visible-infrared object detection aims to enhance the detection robustness by exploiting the complementary information of visible and infrared image pairs. However, its performance is often limited by frequent misalignments caused by resolution disparities, spatial displacements, and modality inconsistencies. To address this issue, we propose the Wavelet-guided Misalignment-aware Network (WMNet), a unified framework designed to adaptively address different cross-modal misalignment patterns. WMNet incorporates wavelet-based multi-frequency analysis and modality-aware fusion mechanisms to improve the alignment and integration of cross-modal features. By jointly exploiting low and high-frequency information and introducing adaptive guidance across modalities, WMNet alleviates the adverse effects of noise, illumination variation, and spatial misalignment. Furthermore, it enhances the representation of salient target features while suppressing spurious or misleading information, thereby promoting more accurate and robust detection. Extensive evaluations on the DVTOD, DroneVehicle, and M3FD datasets demonstrate that WMNet achieves state-of-the-art performance on misaligned cross-modal object detection tasks, confirming its effectiveness and practical applicability.
zh
[CV-114] An Automated Deep Segmentation and Spatial-Statistics Approach for Post-Blast Rock Frag mentation Assessment
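Quick read: This paper addresses rapid, automated assessment of post-blast rock fragmentation, in particular accurate identification and quantitative analysis of regions crowded with small fragments. The key is an end-to-end real-time instance segmentation pipeline: a YOLO12l-seg model is fine-tuned on more than 500 annotated post-blast images (Box mAP@0.5 ≈ 0.769, Mask mAP@0.5 ≈ 0.800, at about 15 FPS), the high-fidelity masks are converted into normalized 3D coordinates, and multi-metric spatial descriptors are extracted from them: principal component directions, kernel density hotspots, size-depth regression, and Delaunay edge statistics. Together these characterize and quantify typical fragmentation patterns.
Link: https://arxiv.org/abs/2507.20126
Authors: Yukun Yang
Affiliations: University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: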
Abstract:We introduce an end-to-end pipeline that leverages a fine-tuned YOLO12l-seg model – trained on over 500 annotated post-blast images – to deliver real-time instance segmentation (Box mAP@0.5 ~ 0.769, Mask mAP@0.5 ~ 0.800 at ~ 15 FPS). High-fidelity masks are converted into normalized 3D coordinates, from which we extract multi-metric spatial descriptors: principal component directions, kernel density hotspots, size-depth regression, and Delaunay edge statistics. We present four representative examples to illustrate key fragmentation patterns. Experimental results confirm the framework’s accuracy, robustness to small-object crowding, and feasibility for rapid, automated blast-effect assessment in field conditions.
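To make one of the listed descriptors concrete, here is a minimal sketch of a principal component direction computed from fragment centroids. It assumes 2D centroids for brevity (the paper works with normalized 3D coordinates), and the function name `principal_direction` is hypothetical.

```python
import math

def principal_direction(points):
    """Unit vector along the dominant PCA axis of a set of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)
```

For centroids spread along the diagonal, the returned direction has equal x and y components; for centroids on a horizontal line it is `(1.0, 0.0)`.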
[CV-115] Multi-output Deep-Supervised Classifier Chains for Plant Pathology
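Quick read: Most existing approaches to plant leaf disease classification apply convolutional neural networks (CNNs) directly and do not study how the relationship between plant species and disease type affects prediction performance. The paper proposes Multi-output Deep Supervised Classifier Chains (Mo-DsCC), which chains the output layers for plant species and disease type and combines this classifier chain with a modified VGG-16 backbone and deep supervision training. Modeling the dependency between the two labels improves accuracy and F1-score.
Link: https://arxiv.org/abs/2507.20125
Authors: Jianping Yao, Son N. Tran
Affiliations: University of Tasmania
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: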
Abstract:Plant leaf disease classification is an important task in smart agriculture which plays a critical role in sustainable production. Modern machine learning approaches have shown unprecedented potential in this classification task which offers an array of benefits including time saving and cost reduction. However, most recent approaches directly employ convolutional neural networks where the effect of the relationship between plant species and disease types on prediction performance is not properly studied. In this study, we proposed a new model named Multi-output Deep Supervised Classifier Chains (Mo-DsCC) which weaves the prediction of plant species and disease by chaining the output layers for the two labels. Mo-DsCC consists of three components: A modified VGG-16 network as the backbone, deep supervision training, and a stack of classification chains. To evaluate the advantages of our model, we perform intensive experiments on two benchmark datasets Plant Village and PlantDoc. Comparison to recent approaches, including multi-model, multi-label (Power-set), multi-output and multi-task, demonstrates that Mo-DsCC achieves better accuracy and F1-score. The empirical study in this paper shows that the application of Mo-DsCC could be a useful puzzle for smart agriculture to benefit farms and bring new ideas to industry and academia.
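The chaining idea can be sketched in a few lines: the disease head sees the backbone features plus the species prediction. This is only an illustration under assumptions. The two heads are hypothetical callables, and feeding a hard one-hot species prediction (rather than soft scores) into the second head is a choice made here, not something the abstract specifies.

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def chain_predict(features, species_head, disease_head):
    """Predict species first, then disease conditioned on that prediction.

    species_head and disease_head are hypothetical callables that map a
    list of floats to a list of class scores.
    """
    species_scores = species_head(features)
    species = argmax(species_scores)
    one_hot = [1.0 if i == species else 0.0 for i in range(len(species_scores))]
    # The second link in the chain receives features plus the first label.
    disease_scores = disease_head(features + one_hot)
    return species, argmax(disease_scores)
```

With toy heads such as `lambda f: [f[0], 1 - f[0]]` and `lambda z: [z[-2], z[-1]]`, the disease prediction follows the species prediction, which shows how the second label depends on the first.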
[CV-116] Local2Global query Alignment for Video Instance Segmentation
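Quick read: This paper addresses temporal consistency in online video instance segmentation (VIS) over long sequences, where propagated noise accumulates and abrupt occlusions and scene transitions occur. It proposes Local2Global (L2G), an online framework with two novel sets of queries: local queries that capture initial object-specific spatial features from each frame, and global queries that hold past spatio-temporal representations. A lightweight transformer decoder, the L2G-aligner, aligns local and global queries early, so the model uses current-frame information while maintaining temporal consistency, without relying on complex heuristics or memory mechanisms. The method achieves strong results on several VIS and VPS datasets.
Link: https://arxiv.org/abs/2507.20120
Authors: Rajat Koner, Zhipeng Wang, Srinivas Parthasarathy, Chinghang Chen
Affiliations: AWS AI Labs; LinkedIn Corporation; Amazon; Ludwig Maximilian University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: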
Abstract:Online video segmentation methods excel at handling long sequences and capturing gradual changes, making them ideal for real-world applications. However, achieving temporally consistent predictions remains a challenge, especially with gradual accumulation of noise or drift in online propagation, abrupt occlusions and scene transitions. This paper introduces Local2Global, an online framework for video instance segmentation that exhibits state-of-the-art performance with a simple baseline and training purely in an online fashion. Leveraging the DETR-based query propagation framework, we introduce two novel sets of queries: (1) local queries that capture initial object-specific spatial features from each frame and (2) global queries containing past spatio-temporal representations. We propose the L2G-aligner, a novel lightweight transformer decoder, to facilitate an early alignment between local and global queries. This alignment allows our model to effectively utilize current frame information while maintaining temporal consistency, producing a smooth transition between frames. Furthermore, L2G-aligner is integrated within the segmentation model, without relying on additional complex heuristics or memory mechanisms. Extensive experiments across various challenging VIS and VPS datasets showcase the superiority of our method with simple online training, surpassing current benchmarks without bells and whistles. For instance, we achieve 54.3 and 49.4 AP on the YouTube-VIS-19/-21 datasets, respectively, and 37.0 AP on the OVIS dataset with the ResNet-50 backbone.
[CV-117] RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters
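Quick read: Mainstream crowd evacuation models fail to reproduce complex human behavior in real scenes, such as pedestrian collisions, interpersonal interactions, and behavioral variations caused by terrain type or individual body shape. The paper proposes a real-time 3D crowd evacuation simulation framework that follows the human sensory-decision-motor (SDM) flow. It combines a 3D-adaptive Social Force Model (SFM) decision mechanism with a Personalized Gait Control motor module, supports parallel movement of multiple agents, dynamic crowd awareness, and personalized behavior, and adds Part-level Force Visualization to assist evacuation analysis, yielding more realistic and practical simulations.
Link: https://arxiv.org/abs/2507.20117
Authors: Xiaolin Liu, Tianyi Zhou, Hongbo Kang, Jian Ma, Ziwen Wang, Jing Huang, Wenguo Weng, Yu-Kun Lai, Kun Li
Affiliations: Tianjin University; Tsinghua University; Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: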
Abstract:Crowd evacuation simulation is critical for enhancing public safety, and demanded for realistic virtual environments. Current mainstream evacuation models overlook the complex human behaviors that occur during evacuation, such as pedestrian collisions, interpersonal interactions, and variations in behavior influenced by terrain types or individual body shapes. This results in the failure to accurately simulate the escape of people in the real world. In this paper, aligned with the sensory-decision-motor (SDM) flow of the human brain, we propose a real-time 3D crowd evacuation simulation framework that integrates a 3D-adaptive SFM (Social Force Model) Decision Mechanism and a Personalized Gait Control Motor. This framework allows multiple agents to move in parallel and is suitable for various scenarios, with dynamic crowd awareness. Additionally, we introduce Part-level Force Visualization to assist in evacuation analysis. Experimental results demonstrate that our framework supports dynamic trajectory planning and personalized behavior for each agent throughout the evacuation process, and is compatible with uneven terrain. Visually, our method generates evacuation results that are more realistic and plausible, providing enhanced insights for crowd simulation. The code is available at this http URL.
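For readers unfamiliar with the Social Force Model that the decision mechanism builds on, the following sketch computes the classic 2D force on one pedestrian: a driving term toward the goal plus exponential repulsion from neighbors. It is the textbook form, not the paper's 3D-adaptive variant, and every parameter value (`v0`, `tau`, `A`, `B`, `radius`) is an illustrative assumption.

```python
import math

def social_force(pos, vel, goal, others,
                 v0=1.3, tau=0.5, A=2.0, B=0.3, radius=0.3):
    """Acceleration on one agent: goal attraction plus pairwise repulsion."""
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    dist = math.hypot(gx, gy) or 1.0
    # Driving term relaxes the current velocity toward desired speed v0.
    fx = (v0 * gx / dist - vel[0]) / tau
    fy = (v0 * gy / dist - vel[1]) / tau
    for ox, oy in others:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy) or 1e-9
        # Repulsion decays exponentially with the gap between two bodies.
        mag = A * math.exp((2 * radius - d) / B)
        fx += mag * dx / d
        fy += mag * dy / d
    return fx, fy
```

With no neighbors, an agent at rest is simply accelerated toward its goal; placing a neighbor directly in its path reduces the forward component, which is the collision-avoidance behavior the model is known for.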
[CV-118] NeuroVoxel-LM: Language-Aligned 3D Perception via Dynamic Voxelization and Meta-Embedding
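Quick read: Existing 3D language models struggle with sparse, large-scale point clouds because feature extraction is slow and representation accuracy is limited. The paper proposes NeuroVoxel-LM with two key components. Dynamic Resolution Multiscale Voxelization (DR-MSV) adapts voxel granularity to geometric and structural complexity, reducing computational cost while preserving reconstruction fidelity. Token-level Adaptive Pooling for Lightweight Meta-Embedding (TAP-LME) enhances semantic representation through attention-based weighting and residual fusion, and captures fine-grained semantics from NeRF weights better than conventional max-pooling.
Link: https://arxiv.org/abs/2507.20110
Authors: Shiyu Liu, Lianlei Shan
Affiliations: Nanyang Technological University; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 3 figures, 2 tables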
Abstract:Recent breakthroughs in Visual Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have significantly advanced 3D scene perception towards language-driven cognition. However, existing 3D language models struggle with sparse, large-scale point clouds due to slow feature extraction and limited representation accuracy. To address these challenges, we propose NeuroVoxel-LM, a novel framework that integrates Neural Radiance Fields (NeRF) with dynamic resolution voxelization and lightweight meta-embedding. Specifically, we introduce a Dynamic Resolution Multiscale Voxelization (DR-MSV) technique that adaptively adjusts voxel granularity based on geometric and structural complexity, reducing computational cost while preserving reconstruction fidelity. In addition, we propose the Token-level Adaptive Pooling for Lightweight Meta-Embedding (TAP-LME) mechanism, which enhances semantic representation through attention-based weighting and residual fusion. Experimental results demonstrate that DR-MSV significantly improves point cloud feature extraction efficiency and accuracy, while TAP-LME outperforms conventional max-pooling in capturing fine-grained semantics from NeRF weights.
[CV-119] Detection of Medial Epicondyle Avulsion in Elbow Ultrasound Images via Bone Structure Reconstruction
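Quick read: This paper addresses automatic detection of medial epicondyle avulsion in elbow ultrasound images. The injury is common in baseball players and appears as bone detachment and discontinuities in the bone contour, subtle abnormalities that conventional methods struggle to identify. The key is a reconstruction-based framework that trains a structure-aware masked autoencoder only on normal cases, so the model learns the continuity of normal bone structure. When an avulsion is present, the model still tries to reconstruct normal structure, producing large reconstruction errors at the injury site that are used to localize and classify the abnormality.
Link: https://arxiv.org/abs/2507.20104
Authors: Shizuka Akahori, Shotaro Teruya, Pragyan Shrestha, Yuichi Yoshii, Satoshi Iizuka, Akira Ikumi, Hiromitsu Tsuge, Itaru Kitahara
Affiliations: University of Tsukuba; Tokyo Medical University Ibaraki Medical Center; Kikkoman General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19th International Conference on Machine Vision Applications (MVA)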
Abstract:This study proposes a reconstruction-based framework for detecting medial epicondyle avulsion in elbow ultrasound images, trained exclusively on normal cases. Medial epicondyle avulsion, commonly observed in baseball players, involves bone detachment and deformity, often appearing as discontinuities in bone contour. Therefore, learning the structure and continuity of normal bone is essential for detecting such abnormalities. To achieve this, we propose a masked autoencoder-based, structure-aware reconstruction framework that learns the continuity of normal bone structures. Even in the presence of avulsion, the model attempts to reconstruct the normal structure, resulting in large reconstruction errors at the avulsion site. For evaluation, we constructed a novel dataset comprising normal and avulsion ultrasound images from 16 baseball players, with pixel-level annotations under orthopedic supervision. Our method outperformed existing approaches, achieving a pixel-wise AUC of 0.965 and an image-wise AUC of 0.967. The dataset is publicly available at: this https URL.
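The scoring step of a reconstruction-based detector can be sketched as follows: a pixel-wise error map localizes the anomaly, and an aggregate of that map gives an image-level score. The squared-error metric and the use of the maximum as the image-level aggregate are assumptions for illustration; the abstract does not state which the authors use.

```python
def anomaly_scores(image, reconstruction):
    """Return (pixel-wise squared error map, image-level score).

    image and reconstruction are equally sized lists of rows of floats.
    The image-level score is the maximum pixel error (an assumed choice).
    """
    error_map = [
        [(p - q) ** 2 for p, q in zip(row, rec_row)]
        for row, rec_row in zip(image, reconstruction)
    ]
    image_score = max(max(row) for row in error_map)
    return error_map, image_score
```

Because the autoencoder has only seen normal bone, a discontinuity in the input is "repaired" in the reconstruction, and the difference between the two concentrates at the avulsion site.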
[CV-120] Hybrid-Domain Synergistic Transformer for Hyperspectral Image Denoising
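Quick read: Hyperspectral image (HSI) denoising must handle the multi-dimensional coupling of spatially non-uniform noise and spectral correlation interference. Existing deep learning methods are mostly designed for RGB images and handle the spatial-spectral characteristics and complex noise distributions of HSI poorly. The paper proposes the Hybrid-Domain Synergistic Transformer Network (HDST), with three key mechanisms: (1) an FFT preprocessing module combined with multi-band convolution to extract cross-band correlations and decouple spectral noise components; (2) a dynamic cross-domain attention module that adaptively fuses spatial-domain texture features and frequency-domain noise priors through a learnable gating mechanism; (3) a hierarchical architecture in which shallow layers capture global noise statistics with multiscale atrous convolution and deep layers recover detail through frequency-domain postprocessing. The method processes the spatial, frequency, and channel domains jointly and improves denoising performance on real and synthetic datasets while maintaining computational efficiency.
Link: https://arxiv.org/abs/2507.20099
Authors: Haoyue Li (1), Di Wu (1) ((1) School of Optoelectronic Science and Engineering, Soochow University)
Affiliations: Soochow University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 4 tables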
Abstract:Hyperspectral image denoising faces the challenge of multi-dimensional coupling of spatially non-uniform noise and spectral correlation interference. Existing deep learning methods mostly focus on RGB images and struggle to effectively handle the unique spatial-spectral characteristics and complex noise distributions of hyperspectral images (HSI). This paper proposes an HSI denoising framework, Hybrid-Domain Synergistic Transformer Network (HDST), based on frequency domain enhancement and multiscale modeling, achieving three-dimensional collaborative processing of spatial, frequency and channel domains. The method innovatively integrates three key mechanisms: (1) introducing an FFT preprocessing module with multi-band convolution to extract cross-band correlations and decouple spectral noise components; (2) designing a dynamic cross-domain attention module that adaptively fuses spatial domain texture features and frequency domain noise priors through a learnable gating mechanism; (3) building a hierarchical architecture where shallow layers capture global noise statistics using multiscale atrous convolution, and deep layers achieve detail recovery through frequency domain postprocessing. Experiments on both real and synthetic datasets demonstrate that HDST significantly improves denoising performance while maintaining computational efficiency, validating the effectiveness of the proposed method. This research provides new insights and a universal framework for addressing complex noise coupling issues in HSI and other high-dimensional visual data. The code is available at this https URL.
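A learnable gate that blends two feature streams, as in mechanism (2), can be illustrated with a small sketch. The scalar weights `w_s`, `w_f`, and `bias` stand in for learned parameters and the per-element formulation is an assumption; HDST's actual attention module is not described in enough detail here to reproduce.

```python
import math

def gated_fusion(spatial, freq, w_s, w_f, bias):
    """Blend two feature vectors with a per-element sigmoid gate.

    gate = sigmoid(w_s * s + w_f * f + bias)
    out  = gate * s + (1 - gate) * f
    """
    fused = []
    for s, f in zip(spatial, freq):
        gate = 1.0 / (1.0 + math.exp(-(w_s * s + w_f * f + bias)))
        fused.append(gate * s + (1.0 - gate) * f)
    return fused
```

With all parameters at zero the gate is 0.5 and the output is the plain average of the two streams; a large positive bias pushes the output toward the spatial features.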
[CV-121] Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models
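Quick read: Diffusion models struggle with complex text prompts that involve multiple objects and global or local style requirements; the generated scenes lack style uniformity and spatial coherence, which limits controllable content generation. The paper proposes Local Prompt Adaptation (LPA), a training-free architectural method that decomposes the prompt into content tokens and style tokens and injects them selectively into the U-Net attention layers at different stages: object tokens early to strengthen layout control, style tokens later to improve stylistic consistency.
Link: https://arxiv.org/abs/2507.20094
Authors: Ankit Sanjyal
Affiliations: Fordham University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 10 pages, 8 figures, pre-print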
Abstract:Diffusion models have become a powerful backbone for text-to-image generation, enabling users to synthesize high-quality visuals from natural language prompts. However, they often struggle with complex prompts involving multiple objects and global or local style specifications. In such cases, the generated scenes tend to lack style uniformity and spatial coherence, limiting their utility in creative and controllable content generation. In this paper, we propose a simple, training-free architectural method called Local Prompt Adaptation (LPA). Our method decomposes the prompt into content and style tokens, and injects them selectively into the U-Net’s attention layers at different stages. By conditioning object tokens early and style tokens later in the generation process, LPA enhances both layout control and stylistic consistency. We evaluate our method on a custom benchmark of 50 style-rich prompts across five categories and compare against strong baselines including Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL. Our approach outperforms prior work on both CLIP score and style consistency metrics, offering a new direction for controllable, expressive diffusion-based generation.
[CV-122] KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation
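Quick read: Current diffusion-based human image generation methods focus on pose accuracy and neglect overall image quality. Although they control human pose well through pose priors, the global visual quality of the results falls short in detail and realism. The paper proposes KB-DMGen, built on two mechanisms: a Knowledge Base (KB) that improves pose accuracy and also uses image feature information to maintain overall image quality, and Dynamic Masking (DM), which dynamically adjusts the importance of pose-related regions. Experiments report new state-of-the-art AP and CAP results on the HumanArt dataset.
Link: https://arxiv.org/abs/2507.20083
Authors: Shibang Liu, Xuemei Xie, Guangming Shi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: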
Abstract:Recent methods using diffusion models have made significant progress in human image generation with various control signals such as pose priors. In portrait generation, both the accuracy of human pose and the overall visual quality are crucial for realistic synthesis. Most existing methods focus on controlling the accuracy of generated poses, but ignore the quality assurance of the entire image. In order to ensure the global image quality and pose accuracy, we propose Knowledge-Based Global Guidance and Dynamic pose Masking for human image Generation (KB-DMGen). The Knowledge Base (KB) is designed not only to enhance pose accuracy but also to leverage image feature information to maintain overall image quality. Dynamic Masking (DM) dynamically adjusts the importance of pose-related regions. Experiments demonstrate the effectiveness of our model, achieving new state-of-the-art results in terms of AP and CAP on the HumanArt dataset. The code will be made publicly available.
[CV-123] FaRMamba: Frequency-based learning and Reconstruction aided Mamba for Medical Segmentation
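Quick read: This paper targets three core challenges in medical image segmentation: blurred lesion boundaries (LBA), loss of high-frequency details (LHD), and difficulty in modeling long-range anatomical structures (DC-LRSS). Vision Mamba based methods mitigate DC-LRSS, but their patch tokenization and 1D serialization disrupt local pixel adjacency and act as a low-pass filter, causing Local High-frequency Information Capture Deficiency (LHICD) and two-dimensional Spatial Structure Degradation (2D-SSD), which in turn worsen LBA and LHD. FaRMamba addresses this with two complementary modules: a Multi-Scale Frequency Transform Module (MSFM) that isolates and reconstructs multi-band spectra via wavelet, cosine, and Fourier transforms to restore attenuated high-frequency cues, and a Self-Supervised Reconstruction Auxiliary Encoder (SSRAE) that enforces pixel-level reconstruction on the shared Mamba encoder to recover full 2D spatial correlations. On several medical imaging datasets the method outperforms CNN-Transformer hybrids and existing Mamba variants without prohibitive computational overhead.
Link: https://arxiv.org/abs/2507.20056
Authors: Ze Rong, ZiYue Zhao, Zhaoxin Wang, Lei Ma
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: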
Abstract:Accurate medical image segmentation remains challenging due to blurred lesion boundaries (LBA), loss of high-frequency details (LHD), and difficulty in modeling long-range anatomical structures (DC-LRSS). Vision Mamba employs one-dimensional causal state-space recurrence to efficiently model global dependencies, thereby substantially mitigating DC-LRSS. However, its patch tokenization and 1D serialization disrupt local pixel adjacency and impose a low-pass filtering effect, resulting in Local High-frequency Information Capture Deficiency (LHICD) and two-dimensional Spatial Structure Degradation (2D-SSD), which in turn exacerbate LBA and LHD. In this work, we propose FaRMamba, a novel extension that explicitly addresses LHICD and 2D-SSD through two complementary modules. A Multi-Scale Frequency Transform Module (MSFM) restores attenuated high-frequency cues by isolating and reconstructing multi-band spectra via wavelet, cosine, and Fourier transforms. A Self-Supervised Reconstruction Auxiliary Encoder (SSRAE) enforces pixel-level reconstruction on the shared Mamba encoder to recover full 2D spatial correlations, enhancing both fine textures and global context. Extensive evaluations on CAMUS echocardiography, MRI-based Mouse-cochlea, and Kvasir-Seg endoscopy demonstrate that FaRMamba consistently outperforms competitive CNN-Transformer hybrids and existing Mamba variants, delivering superior boundary accuracy, detail preservation, and global coherence without prohibitive computational overhead. This work provides a flexible frequency-aware framework for future segmentation models that directly mitigates core challenges in medical imaging.
[CV-124] Digital and Robotic Twinning for Validation of Proximity Operations and Formation Flying
Quick read: This paper addresses validation of Guidance, Navigation and Control (GNC) systems for spacecraft Rendezvous and Proximity Operations (RPO) and Formation Flying (FF), where the complexity of the space environment makes it hard to bridge simulation and real-world behavior. The key contribution is a unified end-to-end digital and robotic twinning framework that supports software- and hardware-in-the-loop testing of multi-modal GNC systems, covering both RF-based and vision-based navigation. Using three testbeds at Stanford's Space Rendezvous Laboratory (SLAB), namely GRAND, TRON, and OS, the framework characterizes the performance and robustness of an integrated multi-modal GNC software stack over a full-range RPO mission scenario in Low-Earth Orbit (LEO) and shows that the digital and robotic twins are consistent.
Link: https://arxiv.org/abs/2507.20034
Authors: Aviad Golan, Gregory Zin, Zahra Ahmed, Emily Bates, Toby Bell, Pol Francesch Huc, Samuel Y. W. Low, Juergen Bosse, Simone D’Amico
Affiliations: Stanford University; Robo-Technology GmbH
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 12 figures. 2025 Astrodynamics Specialist Conference
Abstract:In spacecraft Rendezvous, Proximity Operations (RPO), and Formation Flying (FF), the Guidance Navigation and Control (GNC) system is safety-critical and must meet strict performance requirements. However, validating such systems is challenging due to the complexity of the space environment, necessitating a verification and validation (VV) process that bridges simulation and real-world behavior. The key contribution of this paper is a unified, end-to-end digital and robotic twinning framework that enables software- and hardware-in-the-loop testing for multi-modal GNC systems. The robotic twin includes three testbeds at Stanford’s Space Rendezvous Laboratory (SLAB): the GNSS and Radiofrequency Autonomous Navigation Testbed for Distributed Space Systems (GRAND) to validate RF-based navigation techniques, and the Testbed for Rendezvous and Optical Navigation (TRON) and Optical Stimulator (OS) to validate vision-based methods. The test article for this work is an integrated multi-modal GNC software stack for RPO and FF developed at SLAB. This paper introduces the hybrid framework and summarizes calibration and error characterization for the robotic twin. Then, the GNC stack’s performance and robustness is characterized using the integrated digital and robotic twinning pipeline for a full-range RPO mission scenario in Low-Earth Orbit (LEO). The results shown in the paper demonstrate consistency between digital and robotic twins, validating the hybrid twinning pipeline as a reliable framework for realistic assessment and verification of GNC systems.
[CV-125] APS: Frustratingly Simple Test Time Active Learning for VLMs
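Quick read: This paper asks how to use an oracle (a label provider) in a continuous data stream where only one sample is available at a time and an immediate query decision is required under latency and memory constraints, as in safety-critical applications such as autonomous driving and medical diagnosis. It proposes a Test-Time Active Learning (TTAL) framework with three components: a dynamically adjusted entropy threshold for adaptively querying uncertain samples, a class-balanced replacement strategy for memory efficiency, and a class-aware distribution alignment technique to improve adaptation. The design choices are supported by theoretical analysis and evaluated on 10 cross-dataset transfer benchmarks and 4 domain generalization datasets.
Link: https://arxiv.org/abs/2507.20028
Authors: Dhruv Sarkar, Aprameyo Chakrabartty, Bibhudatta Bhanja
Affiliations: IIT Kharagpur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: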
Abstract:Test-Time Optimization enables models to adapt to new data during inference by updating parameters on-the-fly. Recent advances in Vision-Language Models (VLMs) have explored learning prompts at test time to improve performance in downstream tasks. In this work, we extend this idea by addressing a more general and practical challenge: Can we effectively utilize an oracle in a continuous data stream where only one sample is available at a time, requiring an immediate query decision while respecting latency and memory constraints? To tackle this, we propose a novel Test-Time Active Learning (TTAL) framework that adaptively queries uncertain samples and updates prompts dynamically. Unlike prior methods that assume batched data or multiple gradient updates, our approach operates in a real-time streaming scenario with a single test sample per step. We introduce a dynamically adjusted entropy threshold for active querying, a class-balanced replacement strategy for memory efficiency, and a class-aware distribution alignment technique to enhance adaptation. The design choices are justified using careful theoretical analysis. Extensive experiments across 10 cross-dataset transfer benchmarks and 4 domain generalization datasets demonstrate consistent improvements over state-of-the-art methods while maintaining reasonable latency and memory overhead. Our framework provides a practical and effective solution for real-world deployment in safety-critical applications such as autonomous systems and medical diagnostics.
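One way to picture entropy-thresholded querying on a one-sample-at-a-time stream is the sketch below. The class `EntropyQuery` and its exponential-moving-average threshold update are assumptions made for illustration; the abstract says the threshold is adjusted dynamically but does not give the rule.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class EntropyQuery:
    """Query the oracle when a sample's entropy exceeds a running threshold."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.threshold = None

    def should_query(self, probs):
        h = entropy(probs)
        if self.threshold is None:
            self.threshold = h  # the first sample initializes the threshold
            return True
        query = h > self.threshold
        # Moving average lets the threshold track the stream (assumed rule).
        self.threshold = self.momentum * self.threshold + (1 - self.momentum) * h
        return query
```

Each call makes an immediate decision from a single prediction and keeps only one float of state, which matches the latency and memory constraints the paper describes.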
[CV-126] Region-based Cluster Discrimination for Visual Representation Learning ICCV2025
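Quick read: Vision-language contrastive models such as CLIP and SigLIP rely on global representations, which limits their performance on dense prediction tasks such as grounding, optical character recognition (OCR), and segmentation. The paper proposes Region-Aware Cluster Discrimination (RICE): it constructs a billion-scale candidate region dataset, introduces a Region Transformer layer to extract rich regional semantics, and designs a unified region cluster discrimination loss that supports object and OCR learning jointly within a single classification framework. This enables efficient, scalable distributed training on large-scale data and improves results on dense prediction tasks.
Link: https://arxiv.org/abs/2507.20025
Authors: Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng
Affiliations: DeepGlint; University of Technology Sydney; Huawei London Research Center; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a highlight paper at ICCV 2025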
Abstract:Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at this https URL.
[CV-127] VAMPIRE: Uncovering Vessel Directional and Morphological Information from OCTA Images for Cardiovascular Disease Risk Factor Prediction MICCAI2025
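Quick read: Current cardiovascular disease (CVD) risk assessment from retinal imaging has two limitations. First, deep learning methods mostly rely on fundus photographs and Optical Coherence Tomography (OCT), which miss the detailed vascular features that matter for CVD assessment. Second, existing models typically classify CVD risk only as high or low, without analyzing CVD-related blood factor conditions, which limits prediction accuracy and clinical utility. The paper proposes a multi-purpose paradigm that jointly predicts CVD risk and CVD-related conditions, introduces OCTA-CVD, the first OCT angiography (OCTA) dataset for CVD risk assessment, and builds the Vessel-Aware Mamba-based Prediction model with Informative Enhancement (VAMPIRE). VAMPIRE has two key modules: (1) a Mamba-Based Directional (MBD) Module that captures fine-grained vascular trajectory features, and (2) an Information-Enhanced Morphological (IEM) Module that incorporates comprehensive vessel morphology knowledge.
Link: https://arxiv.org/abs/2507.20017
Authors: Lehan Wang, Hualiang Wang, Chubin Ou, Lushi Chen, Yunyi Liang, Xiaomeng Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in MICCAI 2025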
Abstract:Cardiovascular disease (CVD) remains the leading cause of death worldwide, requiring urgent development of effective risk assessment methods for timely intervention. While current research has introduced non-invasive and efficient approaches to predict CVD risk from retinal imaging with deep learning models, the commonly used fundus photographs and Optical Coherence Tomography (OCT) fail to capture detailed vascular features critical for CVD assessment compared with OCT angiography (OCTA) images. Moreover, existing methods typically classify CVD risk only as high or low, without providing a deeper analysis on CVD-related blood factor conditions, thus limiting prediction accuracy and clinical utility. As a result, we propose a novel multi-purpose paradigm of CVD risk assessment that jointly performs CVD risk and CVD-related condition prediction, aligning with clinical experiences. Based on this core idea, we introduce OCTA-CVD, the first OCTA dataset for CVD risk assessment, and a Vessel-Aware Mamba-based Prediction model with Informative Enhancement (VAMPIRE) based on OCTA enface images. Our proposed model aims to extract crucial vascular characteristics through two key components: (1) a Mamba-Based Directional (MBD) Module that captures fine-grained vascular trajectory features and (2) an Information-Enhanced Morphological (IEM) Module that incorporates comprehensive vessel morphology knowledge. Experimental results demonstrate that our method can surpass standard classification backbones, OCTA-based detection methods, and ophthalmologic foundation models. Our codes and the collected OCTA-CVD dataset are available at this https URL.
[CV-128] FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
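Quick read: Existing 3D semantic scene graph (SSG) generation methods face two obstacles in real-time open-world applications: high computational cost from expensive point cloud processing, and non-incremental processing that hinders online inference. The paper proposes FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), which lifts 2D scene graphs directly to 3D space and represents objects as 3D Gaussian distributions. This removes the dependency on precise point cloud processing and yields faster online 3D SSG generation.
Link: https://arxiv.org/abs/2507.19993
Authors: Hao-Yu Hou, Chun-Yi Lee, Motoharu Sonogashira, Yasutomo Kawanishi
Affiliations: National Tsing Hua University; National Taiwan University; RIKEN
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: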
Abstract:The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while operating significantly faster than prior 3D SSG generation methods. Our implementation and dataset are publicly available at this https URL.
[CV-129] Pic2Diagnosis: A Method for Diagnosis of Cardiovascular Diseases from the Printed ECG Pictures
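Quick read: ECG-based diagnosis is held back by disease patterns derived from outdated datasets and by traditional stepwise algorithms with limited accuracy. The paper proposes diagnosing cardiovascular disease (CVD) directly from ECG images, with no need to digitize the analog signal. The key is a two-step curriculum learning framework: a classification model is first pre-trained on segmentation masks and then fine-tuned on grayscale, inverted ECG images. An ensemble of three models with averaged outputs further improves robustness, reaching an AUC of 0.9534 and an F1 score of 0.7801 on the BHF ECG Challenge dataset and outperforming the individual models. The approach suits automated, rapid diagnosis in resource-limited settings where printed or scanned ECG images are common.
Link: https://arxiv.org/abs/2507.19961
Authors: Oğuzhan Büyüksolak, İlkay Öksüz
Affiliations: Istanbul Technical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: To appear in: Proceedings of the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2025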
Abstract:The electrocardiogram (ECG) is a vital tool for diagnosing heart diseases. However, many disease patterns are derived from outdated datasets and traditional stepwise algorithms with limited accuracy. This study presents a method for direct cardiovascular disease (CVD) diagnosis from ECG images, eliminating the need for digitization. The proposed approach utilizes a two-step curriculum learning framework, beginning with the pre-training of a classification model on segmentation masks, followed by fine-tuning on grayscale, inverted ECG images. Robustness is further enhanced through an ensemble of three models with averaged outputs, achieving an AUC of 0.9534 and an F1 score of 0.7801 on the BHF ECG Challenge dataset, outperforming individual models. By effectively handling real-world artifacts and simplifying the diagnostic process, this method offers a reliable solution for automated CVD diagnosis, particularly in resource-limited settings where printed or scanned ECG images are commonly used. Such an automated procedure enables rapid and accurate diagnosis, which is critical for timely intervention in CVD cases that often demand urgent care.
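The ensembling step described above amounts to averaging the per-class outputs of several models. A minimal sketch, assuming each model returns a list of class probabilities for one ECG image (the function name `ensemble_average` is hypothetical):

```python
def ensemble_average(model_probs):
    """Average per-class probabilities across models for a single sample.

    model_probs: one probability list per model, all of equal length.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(n_classes)
    ]
```

Averaging dampens the errors of any single model, which is one plausible reason the ensemble outperforms its individual members.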
[CV-130] Predicting Brain Responses To Natural Movies With Multimodal LLMs
[Quick Read]: This paper addresses how well cross-modal neural encoding models generalize when predicting human fMRI responses, in particular the performance bottleneck on unseen, out-of-distribution movie stimuli. The key solution extracts multimodal features from several pretrained models (video, speech, text, vision-text, and vision-text-audio), linearly projects them into a shared latent space, aligns them in time with the fMRI series, and maps them to cortical parcels with a lightweight encoder made of a shared group head and subject-specific residual heads. A large-scale hyperparameter search, validated on held-out movies, is then used to build ensembles for each cortical parcel of each subject, which markedly improves prediction on new movie stimuli.
Link: https://arxiv.org/abs/2507.19956
Authors: Cesar Kadir Torrico Villanueva,Jiaxin Cindy Tu,Mihir Tripathy,Connor Lane,Rishab Iyer,Paul S. Scotti
Affiliations: Medical AI Research Center (MedARC); Dartmouth College; Baylor College of Medicine; Sophont; Princeton Neuroscience Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Code available at this https URL
Abstract:We present MedARC’s team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson’s correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
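To make the "shared group head plus subject-specific residual heads" idea and the Pearson-correlation score concrete, here is a minimal sketch in plain Python. The weight values, the subject labels (`sub-01`, `sub-02`), and the function names are hypothetical; the team's real encoder is trained and considerably larger, and its exact form is not reproduced here.

```python
import math


def linear(weights, features):
    """Apply an (n_parcels x n_features) weight matrix to one feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def predict_parcels(features, shared_w, residual_w_by_subject, subject):
    """Prediction = shared group head + that subject's residual head."""
    shared = linear(shared_w, features)
    residual = linear(residual_w_by_subject[subject], features)
    return [s + r for s, r in zip(shared, residual)]


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention for degenerate series
    return cov / math.sqrt(vx * vy)


# Hypothetical 2-feature latent vector and 2-parcel heads.
latent = [1.0, 2.0]
shared_w = [[0.5, 0.0], [0.0, 0.5]]
residual_w = {
    "sub-01": [[0.1, 0.0], [0.0, -0.1]],
    "sub-02": [[0.0, 0.0], [0.0, 0.0]],
}
pred_sub01 = predict_parcels(latent, shared_w, residual_w, "sub-01")
pred_sub02 = predict_parcels(latent, shared_w, residual_w, "sub-02")
```

The residual heads only need to capture how one subject deviates from the group, which is one plausible reason such a split can generalize better than fully separate per-subject models.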
[CV-131] RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
[Quick Read]: This paper addresses the low matching accuracy and weak generalization in point cloud registration that result from insufficiently expressive features. The key is to use a pretrained diffusion network to extract depth diffusion features from multi-view depth maps and to fuse them with existing geometric features, establishing more accurate semantic correspondences. This significantly improves registration accuracy and cross-dataset robustness without requiring dedicated training data.
Link: https://arxiv.org/abs/2507.19950
Authors: Chengyu Zheng,Jin Huang,Honghua Chen,Mingqiang Wei
Affiliations: Nanjing University of Aeronautics and Astronautics; Shenzhen Research Institute, Nanjing University of Aeronautics and Astronautics; Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets. Codes are available at this https URL.
[CV-132] AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation ACM-MM’-25
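[Quick Read]: This paper addresses the performance bottleneck of visual anomaly detection in zero-shot and few-shot settings: existing methods usually depend on large numbers of training samples and adapt poorly to real industrial inspection and medical diagnosis, where samples are scarce. The key is AF-CLIP, which rests on two core mechanisms. First, a lightweight adapter jointly optimizes image-level class features and patch-level local features to sharpen the focus on local defects. Second, a multi-scale spatial aggregation mechanism, applied before the adapter, consolidates neighborhood context and improves sensitivity to anomalies of different sizes. In addition, learnable textual prompts generically describe normal and abnormal states, and the model is optimized on auxiliary datasets with a composite objective function, which yields strong zero-shot detection. For few-shot scenarios, extra memory banks provide a further gain.
Link: https://arxiv.org/abs/2507.19949
Authors: Qingqing Fang,Wenxi Lv,Qinliang Su
Affiliations: Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper is accepted by ACM MM '25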
Abstract:Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP’s zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at this https URL.
[CV-133] UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block IJCAI2025
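[Quick Read]: This paper addresses the challenges of fusing event-camera and image data for monocular depth estimation: existing CNN-based methods struggle with occlusions and depth disparities because of their limited receptive fields, while Transformer-based methods lack deep modality interaction. The key is UniCT Depth, a framework that unifies CNN and Transformer structures to model local and global features. Its core innovation is the Convolution-compensated ViT Dual SA (CcViT-DA) block, which integrates Context Modeling Self-Attention (CMSA) to capture spatial dependencies and Modal Fusion Self-Attention (MFSA) for effective cross-modal fusion. A Detail Compensation Convolution (DCC) block further improves texture detail and edge representation, so the method outperforms existing image-based, event-based, and fusion-based methods on several key metrics.
Link: https://arxiv.org/abs/2507.19948
Authors: Luoxi Jing,Dianxi Shi,Zhe Liu,Songchang Jin,Chunping Qiu,Ziteng Qiao,Yuxian Li,Jianqiang Xia
Affiliations: Peking University; National University of Defense Technology; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2025 (International Joint Conference on Artificial Intelligence)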
Abstract:Depth estimation plays a crucial role in 3D scene understanding and is extensively used in a wide range of vision tasks. Image-based methods struggle in challenging scenarios, while event cameras offer high dynamic range and temporal resolution but face difficulties with sparse data. Combining event and image data provides significant advantages, yet effective integration remains challenging. Existing CNN-based fusion methods struggle with occlusions and depth disparities due to limited receptive fields, while Transformer-based fusion methods often lack deep modality interaction. To address these issues, we propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We propose the Convolution-compensated ViT Dual SA (CcViT-DA) Block, designed for the encoder, which integrates Context Modeling Self-Attention (CMSA) to capture spatial dependencies and Modal Fusion Self-Attention (MFSA) for effective cross-modal fusion. Furthermore, we design the tailored Detail Compensation Convolution (DCC) Block to improve texture details and enhance edge representations. Experiments show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.
[CV-134] SCALAR: Scale-wise Controllable Visual Autoregressive Learning
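[Quick Read]: This paper addresses the challenges that Visual Autoregressive (VAR) models face in controllable image generation: existing methods encode control signals inefficiently, and their injection mechanisms damage both generation fidelity and efficiency. The key is SCALAR, which introduces a novel Scale-wise Conditional Decoding mechanism that conditions the generation process at each scale more finely, enabling efficient, high-fidelity controllable image synthesis.
Link: https://arxiv.org/abs/2507.19946
Authors: Ryan Xu,Dongyang Jin,Yancheng Bai,Rui Lan,Xu Duan,Lei Sun,Xiangxiang Chu
Affiliations: Amap, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: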
Abstract:Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a
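[CV-135] LLM_Control: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs
[Quick Read]: This paper addresses the difficulty that text-to-image (T2I) diffusion models have in precisely following complex text prompts in spatial control tasks, especially the mismatch between generated images and control conditions when prompts involve multiple objects or complex spatial compositions. The key is LLM_Control, a framework guided by a large language model (LLM). By strengthening vision-language grounding, it uses a multimodal LLM as a global controller that arranges spatial layouts, augments semantic descriptions, and binds object attributes. The resulting control signals are injected into the denoising network to refocus and enhance attention maps under new sampling constraints, so that structure and appearance are controlled jointly.
Link: https://arxiv.org/abs/2507.19939
Authors: Jiaze Wang,Rui Chen,Haowang Cui
Affiliations: Tianjin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: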
Abstract:Recent spatial control methods for text-to-image (T2I) diffusion models have shown compelling results. However, these methods still fail to precisely follow the control conditions and generate the corresponding images, especially when encountering textual prompts that contain multiple objects or have complex spatial compositions. In this work, we present an LLM-guided framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control is introduced to accurately modulate the pre-trained diffusion models, where visual conditions and textual prompts influence the structures and appearance generation in a complementary way. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions and bind object attributes. The obtained control signals are injected into the denoising network to refocus and enhance attention maps according to novel sampling constraints. Extensive qualitative and quantitative experiments have demonstrated that LLM_Control achieves competitive synthesis quality compared to other state-of-the-art methods across various pre-trained T2I models. It is noteworthy that LLM_Control allows the challenging input conditions on which most of the existing methods
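[CV-136] MambaVesselNet++: A Hybrid CNN-Mamba Architecture for Medical Image Segmentation
[Quick Read]: This paper addresses two problems in medical image segmentation: traditional convolutional neural networks (CNNs) are limited by their local receptive fields, which restricts the capture of global context, and existing Vision Transformers (ViTs) incur high computational cost because of their non-linear self-attention mechanism. The key is MambaVesselNet++, a hybrid CNN-Mamba architecture. Its hybrid image encoder (Hi-Encoder) combines a texture-aware layer, which uses convolutions to extract low-level semantic features, with Mamba modules that model global context at linear complexity. Its bifocal fusion decoder (BF-Decoder) uses skip connections to fuse local detail with global information and produce accurate segmentation masks. The design stays computationally efficient while improving performance across a range of medical image segmentation tasks.
Link: https://arxiv.org/abs/2507.19931
Authors: Qing Xu,Yanming Chen,Yue Li,Ziyu Liu,Zhenye Lou,Yixuan Zhang,Xiangjian He
Affiliations: University of Nottingham Ningbo China; Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TOMM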
Abstract:Medical image segmentation plays an important role in computer-aided diagnosis. Traditional convolution-based U-shape segmentation architectures are usually limited by the local receptive field. Existing vision transformers have been widely applied to diverse medical segmentation frameworks due to their superior capabilities of capturing global contexts. Despite the advantage, the real-world application of vision transformers is challenged by their non-linear self-attention mechanism, requiring huge computational costs. To address this issue, the selective state space model (SSM) Mamba has gained recognition for its adeptness in modeling long-range dependencies in sequential data, particularly noted for its efficient memory costs. In this paper, we propose MambaVesselNet++, a hybrid CNN-Mamba framework for medical image segmentation. Our MambaVesselNet++ is comprised of a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder). In Hi-Encoder, we first devise the texture-aware layer to capture low-level semantic features by leveraging convolutions. Then, we utilize Mamba to effectively model long-range dependencies with linear complexity. The BF-Decoder adopts skip connections to combine local and global information of the Hi-Encoder for the accurate generation of segmentation masks. Extensive experiments demonstrate that MambaVesselNet++ outperforms current convolution-based, transformer-based, and Mamba-based state-of-the-art methods across diverse medical 2D, 3D, and instance segmentation tasks. The code is available at this https URL.
[CV-137] A Fast Parallel Median Filtering Algorithm Using Hierarchical Tiling
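[Quick Read]: This paper addresses the high computational cost of median filtering in large-scale image processing, in particular the sharp performance drop of traditional sorting-based algorithms at large kernel sizes. The core solution exploits the separability of the sorting problem and uses hierarchical tiling to minimize redundant computation. Two variants are proposed: a data-oblivious selection network that runs entirely in registers with a per-pixel complexity of $O(k \log k)$, and a data-aware version that uses random-access memory and reaches a per-pixel complexity of $O(k)$, which is unprecedented for sorting-based methods. On a modern GPU, the CUDA implementation is up to 5 times faster than the current state of the art and is the fastest median filter in most cases for 8-, 16-, and 32-bit data types and kernel sizes from $3 \times 3$ to $75 \times 75$.
Link: https://arxiv.org/abs/2507.19926
Authors: Louis Sugy (NVIDIA)
Affiliations: NVIDIA
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures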
Abstract:Median filtering is a non-linear smoothing technique widely used in digital image processing to remove noise while retaining sharp edges. It is particularly well suited to removing outliers (impulse noise) or granular artifacts (speckle noise). However, the high computational cost of median filtering can be prohibitive. Sorting-based algorithms excel with small kernels but scale poorly with increasing kernel diameter, in contrast to constant-time methods characterized by higher constant factors but better scalability, such as histogram-based approaches or the 2D wavelet matrix. This paper introduces a novel algorithm, leveraging the separability of the sorting problem through hierarchical tiling to minimize redundant computations. We propose two variants: a data-oblivious selection network that can operate entirely within registers, and a data-aware version utilizing random-access memory. These achieve per-pixel complexities of $O(k \log k)$ and $O(k)$, respectively, for a $k \times k$ kernel - unprecedented for sorting-based methods. Our CUDA implementation is up to 5 times faster than the current state of the art on a modern GPU and is the fastest median filter in most cases for 8-, 16-, and 32-bit data types and kernels from $3 \times 3$ to $75 \times 75$. ACM classes: I.3.1; I.4.3. Journal reference: SIGGRAPH Conference Papers '25, August 10-14, 2025, Vancouver, BC, Canada. Related DOI: https://doi.org/10.1145/3721238.3730709
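As a point of reference for what "data-oblivious" means here, the sketch below computes a median with a fixed sequence of compare-exchange operations (an odd-even transposition network) and uses it in a naive 3x3 filter. This is not the paper's hierarchical tiling algorithm and does not reach its complexity bounds; the function names and the clamped-border policy are assumptions chosen for illustration.

```python
def compare_exchange(values, i, j):
    """Place the smaller of values[i], values[j] at i and the larger at j."""
    lo, hi = min(values[i], values[j]), max(values[i], values[j])
    values[i], values[j] = lo, hi


def oblivious_median(window):
    """Median of an odd-length window via a fixed comparison sequence.

    The index pairs compared never depend on the data, which is what lets
    such networks run in registers without branching.
    """
    v = list(window)
    n = len(v)
    for round_index in range(n):  # n rounds of odd-even transposition sort n items
        for i in range(round_index % 2, n - 1, 2):
            compare_exchange(v, i, i + 1)
    return v[n // 2]


def median_filter_3x3(image):
    """Naive reference 3x3 median filter; borders are clamped (assumed policy)."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = oblivious_median(window)
    return out


# A single impulse-noise pixel is removed by the filter.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
filtered = median_filter_3x3(noisy)
```

This naive version re-sorts every window from scratch, so neighbouring pixels repeat almost all of each other's work; that redundancy is what the paper's tiling scheme is designed to avoid.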
[CV-138] HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial Appearance and Motion Anomaly ICCV2025
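[Quick Read]: This paper addresses the threat that generated videos, especially human-centric ones, pose to information authenticity and security. In particular, current binary forgery-video detectors lack a fine-grained understanding of forgery types, which undermines the reliability and interpretability of their results. The key is the HumanSAM framework, whose core contributions are: building the human forgery representation by fusing two branches, video understanding and spatial depth, to better capture geometry, semantics, and spatiotemporal consistency; adopting a rank-based confidence enhancement strategy that uses three prior scores during training to learn more robust representations; and constructing the first public Human-centric Forgery Video (HFV) dataset, with forgery types annotated semi-automatically, for training and evaluation.
Link: https://arxiv.org/abs/2507.19924
Authors: Chang Liu,Yunfan Ye,Fan Zhang,Qingyang Zhou,Yuchuan Luo,Zhiping Cai
Affiliations: National University of Defense Technology; Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL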
Abstract:Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomaly. To better capture the features of geometry, semantics and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn more robust representation by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.
[CV-139] A mini-batch training strategy for deep subspace clustering networks
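[Quick Read]: This paper addresses the scalability bottleneck of deep subspace clustering (DSC) methods, whose training depends on full-batch processing and therefore struggles to train complex architectures efficiently on high-resolution images. The key is a memory bank that preserves global feature representations, so that self-expression over the entire dataset can be maintained within a mini-batch training framework. The paper also proposes a decoder-free framework that uses contrastive learning instead of autoencoding for representation learning, which removes the cost of decoder training while remaining competitive in performance.
Link: https://arxiv.org/abs/2507.19917
Authors: Yuxuan Jiang,Chenwei Yu,Zhi Lin,Xiaolan Liu
Affiliations: The Hong Kong University of Science and Technology; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: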
Abstract:Mini-batch training is a cornerstone of modern deep learning, offering computational efficiency and scalability for training complex architectures. However, existing deep subspace clustering (DSC) methods, which typically combine an autoencoder with a self-expressive layer, rely on full-batch processing. The bottleneck arises from the self-expressive module, which requires representations of the entire dataset to construct a self-representation coefficient matrix. In this work, we introduce a mini-batch training strategy for DSC by integrating a memory bank that preserves global feature representations. Our approach enables scalable training of deep architectures for subspace clustering with high-resolution images, overcoming previous limitations. Additionally, to efficiently fine-tune large-scale pre-trained encoders for subspace clustering, we propose a decoder-free framework that leverages contrastive learning instead of autoencoding for representation learning. This design not only eliminates the computational overhead of decoder training but also provides competitive performance. Extensive experiments demonstrate that our approach not only achieves performance comparable to full-batch methods, but outperforms other state-of-the-art subspace clustering methods on the COIL100 and ORL datasets by fine-tuning deep networks.
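The memory-bank idea can be sketched as follows: each mini-batch overwrites the stored features of its own samples, while self-expressive reconstruction always draws on the full bank. The class name `MemoryBank`, the overwrite-on-update rule, and the toy coefficients are all assumptions for illustration; the paper's update rule and coefficient learning are not specified here.

```python
class MemoryBank:
    """Keeps the most recent feature vector for every sample in the dataset."""

    def __init__(self, n_samples, dim):
        self.features = [[0.0] * dim for _ in range(n_samples)]

    def update(self, indices, batch_features):
        """Overwrite the entries of the samples seen in this mini-batch."""
        for idx, feat in zip(indices, batch_features):
            self.features[idx] = list(feat)

    def reconstruct(self, coeffs):
        """Self-expressive reconstruction: sum_j coeffs[j] * features[j]."""
        dim = len(self.features[0])
        return [
            sum(c * f[d] for c, f in zip(coeffs, self.features))
            for d in range(dim)
        ]


# Four samples with 2-D features, processed in two mini-batches of two.
bank = MemoryBank(n_samples=4, dim=2)
bank.update([0, 2], [[1.0, 0.0], [0.0, 1.0]])
bank.update([1, 3], [[1.0, 1.0], [2.0, 2.0]])

# Hypothetical coefficient row for sample 1 (its own coefficient is zero):
# sample 1 is expressed as sample 0 + sample 2.
recon_sample_1 = bank.reconstruct([1.0, 0.0, 1.0, 0.0])
```

Because the bank holds a feature for every sample, a coefficient row can reference samples outside the current mini-batch, which is the property full-batch DSC otherwise obtains by encoding the whole dataset at once.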
[CV-140] DriveIndia: An Object Detection Dataset for Diverse Indian Traffic Scenes ITSC2025
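[Quick Read]: This paper addresses the inadequate object detection performance of current autonomous driving systems in complex, unpredictable Indian traffic environments, in particular their robustness and generalization under varied weather and illumination, heterogeneous road infrastructure, and dense mixed traffic. The key is DriveIndia, a large-scale, high-resolution object detection dataset that covers 24 traffic-relevant object categories, was collected over more than 120 hours and 3,400 kilometers of driving, spans diverse real-world scenes, and is annotated in YOLO format, providing a high-quality benchmark for model training and evaluation. Baseline tests with state-of-the-art YOLO-family models reach a top mAP_50 of 78.7%, which supports the dataset's value for improving object detection in complex environments.
Link: https://arxiv.org/abs/2507.19912
Authors: Rishav Kumar,D. Santhosh Reddy,P. Rajalakshmi
Affiliations: Indian Institute of Technology Hyderabad
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ITSC 2025 Conference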
Abstract:We introduce DriveIndia, a large-scale object detection dataset purpose-built to capture the complexity and unpredictability of Indian traffic environments. The dataset contains 66,986 high-resolution images annotated in YOLO format across 24 traffic-relevant object categories, encompassing diverse conditions such as varied weather (fog, rain), illumination changes, heterogeneous road infrastructure, and dense, mixed traffic patterns. It was collected over 120+ hours and covers 3,400+ kilometers across urban, rural, and highway routes. DriveIndia offers a comprehensive benchmark for real-world autonomous driving challenges. We provide baseline results using state-of-the-art YOLO family models, with the top-performing variant achieving a mAP_50 of 78.7%. Designed to support research in robust, generalizable object detection under uncertain road conditions, DriveIndia will be publicly available via the TiHAN-IIT Hyderabad dataset repository (this https URL).
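For readers unfamiliar with YOLO-format annotations, each label line is conventionally `class x_center y_center width height`, with the four box values normalized to the image size. The sketch below converts one such line to pixel corner coordinates. The example line, the class id 3, and the 1920x1080 image size are hypothetical; DriveIndia's actual class-id mapping and image resolution are not stated in the abstract.

```python
def yolo_to_pixel_box(label_line, image_width, image_height):
    """Convert 'class xc yc w h' (normalized) to (class_id, (x_min, y_min, x_max, y_max))."""
    fields = label_line.split()
    class_id = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:5])
    x_min = (xc - w / 2) * image_width
    y_min = (yc - h / 2) * image_height
    x_max = (xc + w / 2) * image_width
    y_max = (yc + h / 2) * image_height
    return class_id, (x_min, y_min, x_max, y_max)


# Hypothetical label line and image size.
class_id, box = yolo_to_pixel_box("3 0.5 0.5 0.25 0.5", 1920, 1080)
```

Storing normalized coordinates keeps the labels valid when images are resized, which is convenient for detectors that train at several input resolutions.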
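[CV-141] TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
[Quick Read]: This paper addresses poor cross-category generalization in 3D LiDAR-based single object tracking (SOT), which stems from the sparsity and irregular geometry of point clouds. Existing methods mostly train category-specific models and adapt poorly to the variety of targets in real scenes. The key is the TrackAny3D framework: parameter-efficient adapters preserve geometric priors, and a Mixture-of-Geometry-Experts (MoGE) architecture dynamically activates subnetworks specialized for different geometric characteristics. Learnable temporal tokens and a dynamic mask weighting module additionally optimize the propagation of temporal context and mitigate temporal drift. Together these enable category-agnostic 3D SOT with state-of-the-art performance on several benchmarks.
Link: https://arxiv.org/abs/2507.19908
Authors: Mengmeng Wang,Haonan Wang,Yulong Li,Xiangjie Kong,Jiaxin Du,Guojiang Shen,Feng Xia
Affiliations: Zhejiang University of Technology; RMIT University; Zhejiang Key Laboratory of Visual Information Intelligent Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: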
Abstract:3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly-used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field.
[CV-142] ConSeg: Contextual Backdoor Attack Against Semantic Segmentation
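[Quick Read]: This paper addresses the vulnerability of semantic segmentation models to backdoor attacks, in which an attacker plants a hidden trigger that makes the model misclassify the victim class as a designated target class, threatening model reliability. The key is a new method called the Contextual Segmentation Backdoor Attack (ConSeg), which exploits the contextual information inherent in semantic segmentation models: when the target class is chosen to be a class that co-occurs with the victim class, mis-segmentation becomes easier. ConSeg mimics the contextual features of the target class and rebuilds them in the victim region, strengthening the contextual relationship between the two classes. This markedly raises the Attack Success Rate (ASR) and remains robust against existing state-of-the-art defenses.
Link: https://arxiv.org/abs/2507.19905
Authors: Bilal Hussain Abbasi,Zirui Gong,Yanjun Zhang,Shang Gao,Antonio Robles-Kelly,Leo Zhang
Affiliations: Deakin University; Griffith University; University of Technology Sydney
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: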
Abstract:Despite significant advancements in computer vision, semantic segmentation models may be susceptible to backdoor attacks. These attacks, involving hidden triggers, aim to cause the models to misclassify instances of the victim class as the target class when triggers are present, posing serious threats to the reliability of these models. To further explore the field of backdoor attacks against semantic segmentation, in this paper, we propose a simple yet effective backdoor attack called Contextual Segmentation Backdoor Attack (ConSeg). ConSeg leverages the contextual information inherent in semantic segmentation models to enhance backdoor performance. Our method is motivated by an intriguing observation, i.e., when the target class is set as the 'co-occurring' class of the victim class, the victim class can be more easily 'mis-segmented'. Building upon this insight, ConSeg mimics the contextual information of the target class and rebuilds it in the victim region to establish the contextual relationship between the target class and the victim class, making the attack easier. Our experiments reveal that ConSeg achieves improvements in Attack Success Rate (ASR) with increases of 15.55% compared to existing methods, while exhibiting resilience against state-of-the-art backdoor defenses.
[CV-143] Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention
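[Quick Read]: This paper addresses insufficient object localization accuracy in vision-language models (VLMs), especially in Open Vocabulary Referring Object Detection (OV-RefOD). Existing methods typically rely on retraining or complex post-processing and struggle to improve localization without changing the model. The authors propose Reverse Contrast Attention (RCA), a plug-and-play mechanism that suppresses extreme activation values and amplifies mid-level token activations, so that semantically relevant but subdued tokens guide the final prediction. Without retraining, RCA improves localization in 11 of 15 open-source VLMs, with gains of up to +26.6%. Its effectiveness is closely tied to attention sharpness and fusion timing, indicating that it increases sensitivity to key semantic information while preserving the model's existing capabilities, and that it offers both interpretability and practical value.
Link: https://arxiv.org/abs/2507.19891
Authors: Drandreb Earl O. Juanico,Rowel O. Atienza,Jeffrey Kenneth Go
Affiliations: University of the Philippines, Diliman; EEEI, University of the Philippines, Diliman; Samsung R&D Institute Philippines
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages with supplementary material, 6 main figures, 2 main tables; github: earl-juanico/rca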
Abstract:We propose Reverse Contrast Attention (RCA), a plug-in method that enhances object localization in vision-language transformers without retraining. RCA reweights final-layer attention by suppressing extremes and amplifying mid-level activations to let semantically relevant but subdued tokens guide predictions. We evaluate it on Open Vocabulary Referring Object Detection (OV-RefOD), introducing FitAP, a confidence-free average precision metric based on IoU and box area. RCA improves FitAP in 11 out of 15 open-source VLMs, with gains up to +26.6%. Effectiveness aligns with attention sharpness and fusion timing; while late-fusion models benefit consistently, models like DeepSeek-VL2 also improve, pointing to capacity and disentanglement as key factors. RCA offers both interpretability and performance gains for multimodal transformers.
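The abstract says FitAP is built on IoU and box area but does not give its formula, so the sketch below covers only those two standard ingredients for axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form. The function names and example boxes are illustrative assumptions, and nothing here should be read as the FitAP definition itself.

```python
def box_area(box):
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_box = (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )
    inter = box_area(inter_box)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and reference boxes.
predicted_box = (0.0, 0.0, 10.0, 10.0)
reference_box = (5.0, 5.0, 15.0, 15.0)
overlap = iou(predicted_box, reference_box)
```

A metric that avoids confidence scores must rank or weight predictions by geometry alone, which is presumably why box area appears alongside IoU in the authors' description.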
[CV-144] CLoRA: Parameter-Efficient Continual Learning with Low-Rank Adaptation
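[Quick Read]: This paper addresses the excessive computational cost in continual learning (CL) that comes from repeatedly training a model on incremental tasks, especially in practical settings where compute is constrained after deployment. Traditional CL methods usually retrain the entire model for each new task, which is particularly expensive for large models and limits their use in resource-constrained environments. The key is CLoRA, a parameter-efficient fine-tuning framework based on Low-Rank Adaptation (LoRA). It trains only a small set of learnable parameters and reuses that same set for incremental learning across the whole task sequence, which significantly lowers hardware requirements and training cost while matching or exceeding the baseline methods.
Link: https://arxiv.org/abs/2507.19887
Authors: Shishir Muralidhara,Didier Stricker,René Schuster
Affiliations: Augmented Vision Group, German Research Center for Artificial Intelligence (DFKI); RPTU – University of Kaiserslautern-Landau, Kaiserslautern
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CoLLAs 2025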
Abstract:In the past, continual learning (CL) was mostly concerned with the problem of catastrophic forgetting in neural networks, that arises when incrementally learning a sequence of tasks. Current CL methods function within the confines of limited data access, without any restrictions imposed on computational resources. However, in real-world scenarios, the latter takes precedence as deployed systems are often computationally constrained. A major drawback of most CL methods is the need to retrain the entire model for each new task. The computational demands of retraining large models can be prohibitive, limiting the applicability of CL in environments with limited resources. Through CLoRA, we explore the applicability of Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method for class-incremental semantic segmentation. CLoRA leverages a small set of parameters of the model and uses the same set for learning across all tasks. Results demonstrate the efficacy of CLoRA, achieving performance on par with and exceeding the baseline methods. We further evaluate CLoRA using NetScore, underscoring the need to factor in resource efficiency and evaluate CL methods beyond task performance. CLoRA significantly reduces the hardware requirements for training, making it well-suited for CL in resource-constrained environments after deployment.
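The general LoRA mechanism that CLoRA builds on can be sketched in a few lines: the frozen weight `W0` is augmented with a scaled low-rank product, `W = W0 + (alpha / r) * B @ A`, and only `A` and `B` are trained. The matrices, the `alpha` value, and the layer sizes below are toy assumptions; how CLoRA places and reuses these adapters across tasks is not shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w0, lora_b, lora_a, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)  # A has shape (r x d_in), B has shape (d_out x r)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def trainable_parameter_counts(d_out, d_in, rank):
    """Compare full fine-tuning with a rank-r adapter for one linear layer."""
    return {"full": d_out * d_in, "lora": rank * (d_out + d_in)}


# Toy rank-1 adapter on a frozen 2 x 3 weight matrix.
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.0, -0.5]]
w_eff = lora_effective_weight(w0, lora_b, lora_a, alpha=2.0)

counts = trainable_parameter_counts(d_out=768, d_in=768, rank=8)
```

For the hypothetical 768 x 768 layer above, the rank-8 adapter trains 12,288 parameters instead of 589,824, which illustrates where the reduced training footprint comes from.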
[CV-145] FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving
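[Quick Read]: This paper addresses synthetic-to-real semantic segmentation for autonomous driving, in particular how to share knowledge across clients and improve generalization within a federated learning (FL) framework. The core challenges are that client data distributions differ widely, that privacy constraints prevent sharing raw data, and that federated domain generalization (FDG) has been little studied for semantic segmentation. The key is the FedS2R framework, which contains two new modules. The first is an inconsistency-driven data augmentation strategy that generates augmented images for unstable classes to ease class imbalance. The second is a multi-client knowledge distillation scheme with feature fusion, which distills a global model from multiple client models to transfer and aggregate knowledge across domains. Experiments on five real-world datasets show that the global model significantly outperforms individual client models and is only 2 mIoU points behind a centrally trained model, supporting the approach's effectiveness for synthetic-to-real semantic segmentation in a federated setting.
Link: https://arxiv.org/abs/2507.19881
Authors: Tao Lian,Jose L. Gómez,Antonio M. López
Affiliations: Computer Vision Center; Department of Computer Science, Autonomous University of Barcelona (UAB)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: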
Abstract:Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple clients without sharing raw data. However, its potential in the semantic segmentation of autonomous driving remains underexplored. In this paper, we propose FedS2R, the first one-shot federated domain generalization framework for synthetic-to-real semantic segmentation in autonomous driving. FedS2R comprises two components: an inconsistency-driven data augmentation strategy that generates images for unstable classes, and a multi-client knowledge distillation scheme with feature fusion that distills a global model from multiple client models. Experiments on five real-world datasets, Cityscapes, BDD100K, Mapillary, IDD, and ACDC, show that the global model significantly outperforms individual client models and is only 2 mIoU points behind the model trained with simultaneous access to all client data. These results demonstrate the effectiveness of FedS2R in synthetic-to-real semantic segmentation for autonomous driving under federated learning
[CV-146] Efficient Self-Supervised Neuro-Analytic Visual Servoing for Real-time Quadrotor Control
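[Quick read]: This paper addresses two limitations of visual drone control in GPS-denied indoor environments: the reliance on geometric models or fiducial markers, and the numerical instability and high computational cost of conventional image-based visual servoing (IBVS). The key contributions are: (1) an improved analytical IBVS teacher that reduces the classical visual servoing equations to remove numerical instabilities and enable efficient, stable image feature detection; (2) a two-stage segmentation pipeline that combines YOLOv11 with a U-Net-based mask splitter for robust anterior-posterior segmentation of the target, so that the target's orientation is estimated correctly; (3) an efficient dual-path knowledge distillation system that transfers the teacher's geometric visual servoing capabilities to a compact student ConvNet with only 1.7M parameters. The student runs inference 11x faster than the teacher with comparable control accuracy and much lower compute and memory cost, which makes it suitable for real-time onboard deployment.
Link: https://arxiv.org/abs/2507.19878
Authors: Sebastian Mocanu, Sebastian-Ion Nae, Mihai-Eugen Barbu, Marius Leordeanu
Affiliations: National University of Science and Technology POLITEHNICA Bucharest; Institute of Mathematics “Simion Stoilow” of the Romanian Academy; NORCE Norwegian Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at the International Conference on Computer Vision Workshops 2025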
Abstract:This work introduces a self-supervised, cost-efficient neuro-analytical model for vision-based quadrotor control in which a small 1.7M-parameter student ConvNet learns automatically from an analytical teacher, an improved image-based visual servoing (IBVS) controller. Our IBVS system solves numerical instabilities by reducing the classical visual servoing equations and enabling efficient stable image feature detection. Through knowledge distillation, the student model achieves 11x faster inference compared to the teacher IBVS pipeline, while demonstrating similar control accuracy at a significantly lower computational and memory cost. Our vision-only self-supervised neuro-analytic control enables quadrotor orientation and movement without requiring explicit geometric models or fiducial markers. The proposed methodology leverages simulation-to-reality transfer learning and is validated on a small drone platform in GPS-denied indoor environments. Our key contributions include: (1) an analytical IBVS teacher that solves numerical instabilities inherent in classical approaches, (2) a two-stage segmentation pipeline combining YOLOv11 with a U-Net-based mask splitter for robust anterior-posterior vehicle segmentation to correctly estimate the orientation of the target, and (3) an efficient knowledge distillation dual-path system, which transfers geometric visual servoing capabilities from the analytical IBVS teacher to a compact and small student neural network that outperforms the teacher, while being suitable for real-time onboard deployment.
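For context, the classical point-feature IBVS control law that an analytical teacher of this kind starts from can be written as follows. This is the textbook formulation, not the reduced equations proposed in the paper; the notation is assumed here: s is the vector of measured image features, s* the desired features, λ > 0 a gain, v_c the camera velocity, and (x, y) a normalized image point at depth Z.

```latex
\mathbf{e} = \mathbf{s} - \mathbf{s}^{*}, \qquad
\dot{\mathbf{e}} = \mathbf{L}_{e}\,\mathbf{v}_{c}, \qquad
\mathbf{v}_{c} = -\lambda\,\widehat{\mathbf{L}_{e}^{+}}\,\mathbf{e}

\mathbf{L}_{(x,y)} =
\begin{bmatrix}
-\dfrac{1}{Z} & 0 & \dfrac{x}{Z} & xy & -(1+x^{2}) & y \\[6pt]
0 & -\dfrac{1}{Z} & \dfrac{y}{Z} & 1+y^{2} & -xy & -x
\end{bmatrix}
```

The pseudo-inverse of the stacked interaction matrix and the 1/Z terms are the usual sources of numerical trouble when features are nearly degenerate or depth estimates are poor, which is the kind of instability the abstract refers to.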
[CV-147] ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking ICCV2025
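[Quick read]: This paper addresses the mismatch of multimodal cues in vision-language tracking (VLT) caused by dynamically changing target states in complex long-term scenarios; in particular, the visual and textual target-context cues from the initial prompts cannot keep guiding the tracker effectively. The proposed ATCTrack framework aligns multimodal cues with the dynamic target state through comprehensive target-context feature modeling: (1) an effective temporal visual target-context modeling approach supplies timely visual cues; (2) target words are identified precisely from the textual content alone, and a context-word calibration method adaptively exploits auxiliary context words. Together these substantially improve the robustness of VLT in complex real-world scenes.
Link: https://arxiv.org/abs/2507.19875
Authors: X. Feng, S. Hu, X. Li, D. Zhang, M. Wu, J. Zhang, X. Chen, K. Huang
Affiliations: School of Artificial Intelligence, UCAS; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA; School of Physical and Mathematical Sciences, NTU; ZGCA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025 Highlight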
Abstract:Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target words identification solely based on textual content, and design an innovative context words calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks and ATCTrack achieves a new SOTA performance. The code and models will be released at: this https URL.
[CV-148] All-in-One Medical Image Restoration with Latent Diffusion-Enhanced Vector-Quantized Codebook Prior MICCAI2025
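[Quick read]: This paper addresses all-in-one medical image restoration (MedIR), where different tasks involve different degradations and therefore diverse information losses that existing methods handle poorly. The key idea is a latent diffusion-enhanced vector-quantized codebook prior, used in the DiffCode framework. First, a task-adaptive codebook bank integrates high-quality (HQ) prior features across tasks to capture a comprehensive prior. Second, a latent diffusion strategy uses the diffusion model's mapping capability to iteratively refine the latent feature distribution, so that HQ prior features are estimated more accurately during restoration. This improves both quantitative metrics and visual quality on several MedIR tasks (MRI super-resolution, CT denoising, and PET synthesis).
Link: https://arxiv.org/abs/2507.19874
Authors: Haowei Chen, Zhiwen Yang, Haotian Hou, Hui Zhang, Bingzheng Wei, Gang Zhou, Yan Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, MICCAI 2025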
Abstract:All-in-one medical image restoration (MedIR) aims to address multiple MedIR tasks using a unified model, concurrently recovering various high-quality (HQ) medical images (e.g., MRI, CT, and PET) from low-quality (LQ) counterparts. However, all-in-one MedIR presents significant challenges due to the heterogeneity across different tasks. Each task involves distinct degradations, leading to diverse information losses in LQ images. Existing methods struggle to handle these diverse information losses associated with different tasks. To address these challenges, we propose a latent diffusion-enhanced vector-quantized codebook prior and develop DiffCode, a novel framework leveraging this prior for all-in-one MedIR. Specifically, to compensate for diverse information losses associated with different tasks, DiffCode constructs a task-adaptive codebook bank to integrate task-specific HQ prior features across tasks, capturing a comprehensive prior. Furthermore, to enhance prior retrieval from the codebook bank, DiffCode introduces a latent diffusion strategy that utilizes the diffusion model’s powerful mapping capabilities to iteratively refine the latent feature distribution, estimating more accurate HQ prior features during restoration. With the help of the task-adaptive codebook bank and latent diffusion strategy, DiffCode achieves superior performance in both quantitative metrics and visual quality across three MedIR tasks: MRI super-resolution, CT denoising, and PET synthesis.
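The basic operation behind any vector-quantized codebook prior is a nearest-neighbor lookup, sketched below. This is a generic illustration rather than DiffCode itself: the `codebook_bank` layout, the task keys, and the tiny 2-D codes are hypothetical, and the latent diffusion refinement step is not modeled.

```python
def quantize(feature, codebook):
    """Return (index, code) of the codebook entry closest to `feature`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)),
                key=lambda i: sq_dist(feature, codebook[i]))
    return index, codebook[index]

# Hypothetical bank holding one small codebook per restoration task.
codebook_bank = {
    "mri_sr": [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]],
    "ct_denoise": [[0.5, -0.5], [2.0, 0.0]],
}

# A degraded latent feature is replaced by its nearest high-quality code.
index, code = quantize([0.9, 1.2], codebook_bank["mri_sr"])
```

The quality of the retrieved code depends on how close the degraded latent is to the right entry, which is why the abstract describes refining the latent feature distribution before retrieval.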
[CV-149] OW-CLIP: Data-Efficient Visual Supervision for Open-World Object Detection via Human-AI Collaboration
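[Quick read]: This paper addresses three problems in open-world object detection (OWOD): data-hungry training, susceptibility to "partial feature overfitting", and limited flexibility because model architectures must be modified. The proposed OW-CLIP system (1) provides plug-and-play multimodal prompt tuning tailored to OWOD that supports incremental learning; (2) introduces a "Crop-Smoothing" technique to mitigate partial feature overfitting; (3) proposes dual-modal data refinement methods that combine large language models with cross-modal similarity for efficient, high-quality data generation and filtering. With only 3.8% self-generated data it reaches 89% of state-of-the-art performance, and it outperforms the state-of-the-art approach when trained on the same amount of data.
Link: https://arxiv.org/abs/2507.19870
Authors: Junwen Duan, Wei Xue, Ziyao Kang, Shixia Liu, Jiazhi Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 11 figures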
Abstract:Open-world object detection (OWOD) extends traditional object detection to identifying both known and unknown objects, necessitating continuous model adaptation as new annotations emerge. Current approaches face significant limitations: 1) data-hungry training due to reliance on a large number of crowdsourced annotations, 2) susceptibility to “partial feature overfitting,” and 3) limited flexibility due to required model architecture modifications. To tackle these issues, we present OW-CLIP, a visual analytics system that provides curated data and enables data-efficient OWOD model incremental training. OW-CLIP implements plug-and-play multimodal prompt tuning tailored for OWOD settings and introduces a novel “Crop-Smoothing” technique to mitigate partial feature overfitting. To meet the data requirements for the training methodology, we propose dual-modal data refinement methods that leverage large language models and cross-modal similarity for data generation and filtering. Simultaneously, we develop a visualization interface that enables users to explore and deliver high-quality annotations, including class-specific visual feature phrases and fine-grained differentiated images. Quantitative evaluation demonstrates that OW-CLIP achieves competitive performance at 89% of state-of-the-art performance while requiring only 3.8% self-generated data, and outperforms the SOTA approach when trained with equivalent data volumes. A case study shows the effectiveness of the developed method and the improved annotation quality of our visualization system.
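[CV-150] RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection
[Quick read]: This paper addresses the effective fusion of 4D millimeter-wave radar and monocular images for 3D object detection. Existing methods typically rely on instance-based proposals, which lack holistic scene understanding, or on dense BEV (Bird's Eye View) grids, which are limited by a rigid grid layout. The proposed RaGS framework is the first to use 3D Gaussian Splatting (GS) as a unified representation for fusing 4D radar and monocular cues. By modeling the scene as a field of Gaussians, 3D GS dynamically focuses on foreground objects and allocates resources efficiently, combining precise localization of sparse objects with overall scene perception. RaGS uses a cascaded pipeline: Frustum-based Localization Initiation (FLI) first produces coarse Gaussian positions; Iterative Multimodal Aggregation (IMA) then fuses semantics and geometry to refine the regions of interest; finally, Multi-level Gaussian Fusion (MGF) renders multi-level BEV features for 3D object detection, which significantly improves detection performance.
Link: https://arxiv.org/abs/2507.19856
Authors: Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Si-Yuan Cao, Jianan Liu, Xiaohan Zhang, Zhengzhuang Zhang, Hui-liang Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, conference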
Abstract:4D millimeter-wave radar has emerged as a promising sensor for autonomous driving, but effective 3D object detection from both 4D radar and monocular images remains a challenge. Existing fusion approaches typically rely on either instance-based proposals or dense BEV grids, which either lack holistic scene understanding or are limited by rigid grid structures. To address these, we propose RaGS, the first framework to leverage 3D Gaussian Splatting (GS) as the representation for fusing 4D radar and monocular cues in 3D object detection. 3D GS naturally suits 3D object detection by modeling the scene as a field of Gaussians, dynamically allocating resources on foreground objects and providing a flexible, resource-efficient solution. RaGS uses a cascaded pipeline to construct and refine the Gaussian field. It starts with the Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse 3D Gaussian positions. Then, the Iterative Multimodal Aggregation (IMA) fuses semantics and geometry, refining the limited Gaussians to the regions of interest. Finally, the Multi-level Gaussian Fusion (MGF) renders the Gaussians into multi-level BEV features for 3D object detection. By dynamically focusing on sparse objects within scenes, RaGS concentrates on objects while offering comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes benchmarks demonstrate its state-of-the-art performance. Code will be released.
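The unprojection step mentioned for FLI can be illustrated with the standard pinhole camera model. The sketch below is an assumption-laden illustration, not the RaGS code: the intrinsics (`fx`, `fy`, `cx`, `cy`), the candidate depths, and the function names are hypothetical, and how FLI actually chooses depths (for example from radar returns) is not described in the abstract.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at a given depth to a 3D point in the camera frame."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)

def frustum_samples(u, v, depths, fx, fy, cx, cy):
    """Candidate 3D positions along the viewing ray of one foreground pixel."""
    return [unproject(u, v, d, fx, fy, cx, cy) for d in depths]

# Hypothetical intrinsics and depth hypotheses for a single foreground pixel.
points = frustum_samples(700.0, 400.0, [5.0, 10.0, 20.0],
                         fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
```

Each returned point could serve as the initial mean of a coarse Gaussian, which later stages would refine.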
[CV-151] A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba
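[Quick read]: This paper addresses the limitations of Mamba-based pose-lifting methods in modeling dependencies between joints. Existing methods usually handle joint dependencies through 2D-to-1D mapping with diverse scanning strategies, but they struggle to model intricate joint connections and ignore differences in motion characteristics across joints. The proposed structure-aware and motion-adaptive framework (SAMA) has two modules. The Structure-aware State Integrator (SSI) fuses dynamic joint relationships at both the joint feature and state levels, based on pose topology rather than sequential state transitions. The Motion-adaptive State Modulator (MSM) recognizes joint-specific motion characteristics and applies tailored adjustments to the diverse motion patterns of different joints. This design models spatial joint topology and diverse motion dynamics independently, maintaining strong performance at lower computational cost.
Link: https://arxiv.org/abs/2507.19852
Authors: Ye Lu, Jie Wang, Jianjun Gao, Rui Gong, Chen Cai, Kim-Hui Yap
Affiliations: Nanyang Technological University; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, conference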
Abstract:Recent Mamba-based methods for the pose-lifting task tend to model joint dependencies by 2D-to-1D mapping with diverse scanning strategies. Though effective, they struggle to model intricate joint connections and uniformly process all joint motion trajectories while neglecting the intrinsic differences across motion characteristics. In this work, we propose a structure-aware and motion-adaptive framework to capture spatial joint topology along with diverse motion dynamics independently, named as SAMA. Specifically, SAMA consists of a Structure-aware State Integrator (SSI) and a Motion-adaptive State Modulator (MSM). The Structure-aware State Integrator is tasked with leveraging dynamic joint relationships to fuse information at both the joint feature and state levels in the state space, based on pose topology rather than sequential state transitions. The Motion-adaptive State Modulator is responsible for joint-specific motion characteristics recognition, thus applying tailored adjustments to diverse motion patterns across different joints. Through the above key modules, our algorithm enables structure-aware and motion-adaptive pose lifting. Extensive experiments across multiple benchmarks demonstrate that our algorithm achieves advanced results with fewer computational costs.
[CV-152] FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
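[Quick read]: This paper addresses the insufficient modeling of body-part movement details and their timing in existing text-driven human motion generation methods. The key contribution is the FineMotion dataset, which contains over 442,000 short human motion snippets with detailed descriptions of body-part movements, plus about 95,000 detailed paragraphs describing entire motion sequences, providing high-quality annotations for fine-grained text-driven motion generation. Experiments show that the dataset clearly improves model performance, for example a 15.3% gain in Top-3 accuracy for the MDM model, and it further enables a zero-shot, text-based pipeline for fine-grained motion editing with precise control in both the spatial and temporal dimensions.
Link: https://arxiv.org/abs/2507.19850
Authors: Bizhu Wu, Jinheng Xie, Meidan Ding, Zhe Kong, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: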
Abstract:Generating realistic human motions from textual descriptions has undergone significant advancements. However, existing methods often overlook specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442,000 human motion snippets (short segments of human motion sequences) and their corresponding detailed descriptions of human body part movements. Additionally, the dataset includes about 95k detailed paragraphs describing the movements of human body parts of entire motion sequences. Experimental results demonstrate the significance of our dataset on the text-driven fine-grained human motion generation task, especially with a remarkable +15.3% improvement in Top-3 accuracy for the MDM model. Notably, we further support a zero-shot pipeline of fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text. Dataset and code available at: CVI-SZU/FineMotion
[CV-153] Knowledge Regularized Negative Feature Tuning for Out-of-Distribution Detection with Vision-Language Models
[Quick read]: This paper addresses the drop in generalization that vision-language models suffer after negative prompt tuning: their out-of-distribution (OOD) detection ability weakens on unseen classes and styles. The proposed method, Knowledge Regularized Negative Feature Tuning (KR-NFT), has two parts. The Negative Feature Tuning (NFT) architecture applies distribution-aware transformations that separate text features into distinct spaces, sharpening the distinction between ID and OOD images. The knowledge-regularization (KR) optimization strategy improves OOD detection while mitigating forgetting of pre-trained knowledge, so that the model generalizes to unseen ID categories from few-shot samples and significantly reduces FPR95.
Link: https://arxiv.org/abs/2507.19847
Authors: Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Affiliations: Hong Kong Polytechnic University; Eastern Institute of Technology; Stanford University; Ningbo Institute of Digital Twin, Eastern Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ACMMM 2025
Abstract:Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from the ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44% under an unexplored generalization setting with unseen ID categories. Codes can be found at this https URL.
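Since the headline number here is FPR95, a short sketch of how that metric is commonly computed may help: it is the fraction of OOD samples still accepted as in-distribution at the score threshold that keeps 95% of ID samples. The function name and the toy scores below are assumptions for illustration, and the convention that a higher score means "more in-distribution" is also assumed.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate on OOD data at the threshold retaining 95% of ID data."""
    ranked = sorted(id_scores, reverse=True)
    n = len(ranked)
    keep = -(-95 * n // 100)          # ceil(0.95 * n) with integer arithmetic
    threshold = ranked[keep - 1]      # lowest score among the retained ID samples
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)

# Toy scores: 20 ID samples spread over (0, 1], four OOD samples.
id_scores = [i / 20 for i in range(1, 21)]
ood_scores = [0.02, 0.08, 0.3, 0.6]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

Lower is better: a detector that separates ID and OOD scores cleanly drives this value toward zero.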
[CV-154] GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning
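[Quick read]: This paper addresses the catastrophic forgetting and degraded embedding alignment that arise when Contrastive Language-Image Pretraining (CLIP) is fine-tuned incrementally on a sequence of tasks in continual learning, which undermines its zero-shot generalization. The proposed Gradient Null Space Projection (GNSP) projects the current task's gradients onto the null space of previously learned knowledge; this orthogonal projection mathematically prevents interference with earlier tasks without rehearsal or architectural modification. In addition, knowledge distillation is combined with a modality alignment preservation loss inspired by CLIP pre-training to stabilize the structure of the multimodal embedding space, which preserves CLIP's original cross-modal retrieval performance and modality gap.
Link: https://arxiv.org/abs/2507.19839
Authors: Tiantian Peng, Yuyang Liu, Shuo Yang, Qiuhe Hong, YongHong Tian
Affiliations: University of Science and Technology of China; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: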
Abstract:Contrastive Language-Image Pretraining has demonstrated remarkable zero-shot generalization by aligning visual and textual modalities in a shared embedding space. However, when continuously fine-tuned on diverse tasks, CLIP suffers from catastrophic forgetting and degradation of its embedding alignment, undermining its zero-shot capabilities. In this work, we propose Gradient Null Space Projection (GNSP), an efficient continual learning method that projects task-specific gradients onto the null space of previously learned knowledge. This orthogonal projection mathematically prevents interference with previous tasks without relying on rehearsal or architectural modification. Furthermore, to preserve the inherent generalization property of CLIP, we introduce knowledge distillation and combine it with a modality alignment preservation loss inspired by CLIP pre-training to stabilize the structure of the multimodal embedding space during fine-tuning. On the MTIL benchmark consisting of 11 tasks, our method achieved SOTA performance on both the Average and Last key metrics. More importantly, experiments show that our method successfully maintains the original modality gap and cross-modal retrieval performance of CLIP, confirming its effectiveness in maintaining a robust visual-language space throughout the continual learning process.
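The core of null-space projection can be shown on a single weight vector: if an update direction is orthogonal to every feature seen in earlier tasks, the layer's outputs on those features do not change. The sketch below uses plain Gram-Schmidt and is only an illustration under that assumption; how GNSP actually estimates the null space (for example from feature statistics of a large model) is not given in the abstract, and all names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the subspace spanned by `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:                      # skip linearly dependent vectors
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(grad, old_features):
    """Remove from `grad` every component lying in the span of old features."""
    g = list(grad)
    for u in orthonormal_basis(old_features):
        c = dot(g, u)
        g = [gi - c * ui for gi, ui in zip(g, u)]
    return g

old_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # inputs from earlier tasks
grad = [0.5, -2.0, 3.0]                             # gradient of the new task
projected = project_to_null_space(grad, old_features)
```

For a linear map with weight vector w, stepping along `projected` leaves w·x unchanged for every x in `old_features`, which is the sense in which interference is avoided.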
[CV-155] ChoreoMuse: Robust Music-to-Dance Video Generation with Style Transfer and Beat-Adherent Motion ACM-MM2025
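[Quick read]: This paper addresses the limited ability of current automated choreography generation methods to adapt to diverse musical styles and individual dancer characteristics, in particular the difficulty of generating high-quality dance videos that match both the musical rhythm and a user-defined choreography style. The proposed ChoreoMuse framework uses SMPL (Skinned Multi-Person Linear) format parameters and their variants as an intermediate representation between music and video generation, which removes the usual video resolution constraints. It also introduces the MotionTune music encoder to capture motion cues from audio, so that the generated motion follows the beat and expressive qualities of the music. The framework supports style-controllable, high-fidelity dance video generation across musical genres, for any reference dancer at any resolution.
Link: https://arxiv.org/abs/2507.19836
Authors: Xuanchen Wang, Heng Wang, Weidong Cai
Affiliations: The University of Sydney
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments: 10 pages, 5 figures, accepted by the 33rd ACM International Conference on Multimedia (ACM MM 2025), demo page: this https URL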
Abstract:Modern artistic productions increasingly demand automated choreography generation that adapts to diverse musical styles and individual dancer characteristics. Existing approaches often fail to produce high-quality dance videos that harmonize with both musical rhythm and user-defined choreography styles, limiting their applicability in real-world creative contexts. To address this gap, we introduce ChoreoMuse, a diffusion-based framework that uses SMPL format parameters and their variation version as intermediaries between music and video generation, thereby overcoming the usual constraints imposed by video resolution. Critically, ChoreoMuse supports style-controllable, high-fidelity dance video generation across diverse musical genres and individual dancer characteristics, including the flexibility to handle any reference individual at any resolution. Our method employs a novel music encoder MotionTune to capture motion cues from audio, ensuring that the generated choreography closely follows the beat and expressive qualities of the input music. To quantitatively evaluate how well the generated dances match both musical and choreographic styles, we introduce two new metrics that measure alignment with the intended stylistic cues. Extensive experiments confirm that ChoreoMuse achieves state-of-the-art performance across multiple dimensions, including video quality, beat alignment, dance diversity, and style adherence, demonstrating its potential as a robust solution for a wide range of creative applications. Video results can be found on our project page: this https URL.
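[CV-156] Taking Language Embedded 3D Gaussian Splatting into the Wild
[Quick read]: This paper addresses how to achieve an immersive understanding of the 3D structure of architectural components from unconstrained Internet photo collections; existing approaches are largely confined to browsing static text-image pairs and lack deep awareness of architectural styles and structural knowledge. The key idea is to extend language embedded 3D Gaussian splatting (3DGS) into a new open-vocabulary scene understanding framework. First, multiple appearance images are rendered from the same viewpoint with the reconstructed radiance field, and multi-appearance CLIP features and two types of language feature uncertainty maps (transient uncertainty and appearance uncertainty) are extracted to guide subsequent optimization. Second, a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy compress, learn, and fuse language features from multiple appearances. Finally, the PT-OVS benchmark dataset is built for quantitative evaluation of open-vocabulary segmentation. Experiments show that the method outperforms existing techniques in open-vocabulary segmentation accuracy and supports applications such as interactive roaming, architectural style recognition, and 3D scene editing.
Link: https://arxiv.org/abs/2507.19830
Authors: Yuze Wang, Yue Qi
Affiliations: Beihang University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Visit our project page at this https URL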
Abstract:Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, little attention has been given to the immersive understanding of architectural styles and structural knowledge, which remains largely confined to browsing static text-image pairs. Therefore, can we draw inspiration from 3D in-the-wild reconstruction techniques and use unconstrained photo collections to create an immersive approach for understanding the 3D structure of architectural components? To this end, we extend language embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint as the unconstrained image with the reconstructed radiance field, then extract multi-appearance CLIP features and two types of language feature uncertainty maps (transient and appearance uncertainty) derived from the multi-appearance features to guide the subsequent optimization process. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing methods, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.
[CV-157] LAVA: Language Driven Scalable and Versatile Traffic Video Analytics
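[Quick read]: This paper addresses the inflexibility of the conventional SQL-based query paradigm for large-scale video data: existing methods are restricted to predefined semantic categories and cannot support flexible, fine-grained video analytics driven by natural language. The proposed system, Lava, has three key components: (1) an efficient sampling strategy based on multi-armed bandits for video segment-level localization; (2) a video-specific open-world detection module for object-level retrieval of arbitrary categories; (3) a long-term object trajectory extraction scheme that performs temporal association and yields complete trajectories for objects of interest. The system significantly improves both the accuracy and the efficiency of natural-language video queries.
Link: https://arxiv.org/abs/2507.19821
Authors: Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, Guoren Wang
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: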
Abstract:In modern urban environments, camera networks generate massive amounts of operational footage, reaching petabytes each day, making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. Particularly, we build Lava, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. Lava comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that Lava improves F1-scores for selection queries by 14%, reduces MPAE for aggregation queries by 0.39, and achieves top-k precision of 86%, while processing videos 9.6x faster than the most accurate baseline.
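To illustrate how a multi-armed bandit can steer frame sampling toward promising video segments, here is a small sketch that treats each segment as an arm and uses the classic UCB1 rule. UCB1 is swapped in purely for illustration; the abstract does not say which bandit algorithm or reward Lava uses, and the hit rates, budget, and function name are hypothetical.

```python
import math
import random

def ucb1_sample(segment_hit_rates, budget, seed=0):
    """Spend `budget` frame samples across segments; return samples per segment.

    A sample is a "hit" when the sampled frame contains the queried target,
    simulated here by a Bernoulli draw with the segment's hit rate.
    """
    rng = random.Random(seed)
    n = len(segment_hit_rates)
    pulls = [0] * n
    hits = [0] * n
    for t in range(1, budget + 1):
        if t <= n:
            arm = t - 1                     # sample every segment once first
        else:
            arm = max(range(n), key=lambda i: hits[i] / pulls[i]
                      + math.sqrt(2 * math.log(t) / pulls[i]))
        reward = 1 if rng.random() < segment_hit_rates[arm] else 0
        pulls[arm] += 1
        hits[arm] += reward
    return pulls

# Hypothetical video with three segments; the target appears mostly in the second.
pulls = ucb1_sample([0.05, 0.6, 0.1], budget=300)
```

The exploration bonus shrinks as a segment is sampled more often, so most of the detection budget ends up in segments where the target actually appears.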
[CV-158] FM-LC: A Hierarchical Framework for Urban Flood Mapping by Land Cover Identification Models
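[Quick read]: This paper addresses the limited accuracy of traditional flood-mapping methods for urban flood monitoring in arid regions, which is caused by low spectral contrast, confusion between water and adjacent surfaces (such as vegetation and bare ground), rapid hydrological dynamics, and highly heterogeneous urban land cover. The proposed FM-LC (Flood Mapping by Land Cover identification) is a staged, hierarchical framework: a multi-class U-Net first segments the imagery into water, vegetation, built area, and bare ground; once the most confused class is identified, a lightweight binary expert model corrects it; finally, Bayesian smoothing uses neighboring pixel information to refine boundaries and remove noise, yielding finer and more accurate flood extent extraction and recovery trajectory tracking.
Link: https://arxiv.org/abs/2507.19818
Authors: Xin Hong, Longchao Da, Hua Wei
Affiliations: Zayed University; Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages and 4 figures. Submitted to the IEEE for possible publication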
Abstract:Urban flooding in arid regions poses severe risks to infrastructure and communities. Accurate, fine-scale mapping of flood extents and recovery trajectories is therefore essential for improving emergency response and resilience planning. However, arid environments often exhibit limited spectral contrast between water and adjacent surfaces, rapid hydrological dynamics, and highly heterogeneous urban land covers, which challenge traditional flood-mapping approaches. High-resolution, daily PlanetScope imagery provides the temporal and spatial detail needed. In this work, we introduce FM-LC, a hierarchical framework for Flood Mapping by Land Cover identification, for this challenging task. Through a three-stage process, it first uses an initial multi-class U-Net to segment imagery into water, vegetation, built area, and bare ground classes. We identify that this method has confusion between spectrally similar categories (e.g., water vs. vegetation). Second, by early checking, the class with the major misclassified area is flagged, and a lightweight binary expert segmentation model is trained to distinguish the flagged class from the rest. Third, a Bayesian smoothing step refines boundaries and removes spurious noise by leveraging nearby pixel information. We validate the framework on the April 2024 Dubai storm event, using pre- and post-rainfall PlanetScope composites. Experimental results demonstrate average F1-score improvements of up to 29% across all land-cover classes and notably sharper flood delineations, significantly outperforming conventional single-stage U-Net baselines.
[CV-159] SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models AAAI2025
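[Quick read]: This paper addresses the high cost of the manually annotated pixel-level masks required for pixel-level object classification, and proposes a method that generates high-quality masks from Stable Diffusion without additional training, prompt tuning, or a pre-trained segmentation network. The key is a close analysis of the attention mechanisms in Stable Diffusion: cross-attention provides coarse object localization that can serve as initial seeds, while self-attention models semantic correspondence, so multi-scale self-attention maps can iteratively expand the seed regions to the whole object of the class. The method also exploits the observation that images generated from simple text prompts tend to have uniform backgrounds, and uses a more accurate background mask to refine the final mask. The method, named SeeDiff, obtains high-quality segmentation masks directly from the diffusion model.
Link: https://arxiv.org/abs/2507.19808
Authors: Joon Hyun Park, Kumju Jo, Sungyong Baik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025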
Abstract:Entrusted with the goal of pixel-level object classification, the semantic segmentation networks entail the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human effort, a few recent works have proposed to generate pairs of images and annotation masks by employing image and text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided Diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone provides very coarse object localization, which however can provide initial seeds. Then, akin to region expansion in seeded segmentation, we utilize the semantic-correspondence-modeling capability of self-attention to iteratively spread the attention to the whole class from the seeds using multi-scale self-attention maps. We also observe that a simple-text-guided synthetic image often has a uniform background, for which correspondences are easier to find than for complex-structured objects. Thus, we further refine a mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion, without additional training procedure, prompt tuning, or a pre-trained segmentation network.
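The seed-and-expand idea can be sketched with a toy affinity matrix standing in for a self-attention map: seed tokens (as if taken from thresholded cross-attention) repeatedly spread their activation to the tokens that attend to them. This is a simplified single-scale illustration, not SeeDiff itself; the matrix values, step count, threshold, and function name are all assumptions.

```python
def expand_seeds(affinity, seeds, steps=3, threshold=0.5):
    """Spread seed activation through an affinity matrix, then binarize."""
    n = len(seeds)
    mask = [float(s) for s in seeds]
    for _ in range(steps):
        spread = [sum(affinity[i][j] * mask[j] for j in range(n))
                  for i in range(n)]
        # Original seeds stay fully active; rescale so the peak equals 1.
        spread = [max(s, float(s0)) for s, s0 in zip(spread, seeds)]
        peak = max(spread) or 1.0
        mask = [s / peak for s in spread]
    return [1 if m >= threshold else 0 for m in mask]

# Toy 4-token image: tokens 0-2 form the object, token 3 is background.
affinity = [
    [0.4, 0.3, 0.3, 0.0],
    [0.3, 0.4, 0.3, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
seeds = [1, 0, 0, 0]          # a single coarse seed on the object
mask = expand_seeds(affinity, seeds)
```

Tokens that share strong mutual attention with the seed join the mask after a few iterations, while the background token, which attends only to itself, stays out.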
[CV-160] DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection
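[Quick Read]: This paper targets two core problems of existing Transformer-based detectors that use a fixed number of queries (fixed-query). First, Recurrent Opposing inTeractions (ROT) between Self-Attention and Cross-Attention make query learning inefficient. Second, shared-weight decoder layers handle both one-to-one and one-to-many label assignments during training, which causes "query ambiguity" and violates DETR's one-to-one matching principle. The proposed DS-Det framework has three main contributions: 1) a unified Single-Query paradigm that turns the fixed queries into a flexible, variable number of queries; 2) a decoder restructured through attention disentangled learning, where Cross-Attention locates boxes (a one-to-many process) and Self-Attention deduplicates predictions (a one-to-one process), which directly mitigates ROT and query ambiguity and improves efficiency; 3) a unified PoCoo loss that uses box-size priors to prioritize learning on hard samples such as small objects.
Link: https://arxiv.org/abs/2507.19807
Authors: Guiping Cao,Xiangyuan Lan,Wenjian Huang,Jianguo Zhang,Dongmei Jiang,Yaowei Wang
Affiliations: RITAS, Southern University of Science and Technology; Pengcheng Laboratory; Pazhou Laboratory; Harbin Institute of Technology (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: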
Abstract:Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed-query is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, “query ambiguity” arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR’s one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming the fixed-query into flexible. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process), deduplicating predictions with Self-Attention (one-to-one process), addressing “query ambiguity” and “ROT” issues directly, and enhancing decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at this https URL.
[CV-161] ForCenNet: Foreground-Centric Network for Document Image Rectification ICCV25
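[Quick Read]: This paper addresses geometric distortion in photographed documents (document image rectification). Existing methods often overlook the geometric references and layout cues carried by foreground elements. The proposed Foreground-Centric Network (ForCenNet) has three key components: 1) a foreground-centric label generation method that extracts fine-grained foreground elements from undistorted images as supervision; 2) a foreground-centric mask mechanism that sharpens the distinction between readable and background regions; 3) a curvature consistency loss that uses the detailed foreground labels to help the model understand the distortion distribution, so that layout elements such as text lines and table borders are recovered more accurately. Experiments show that ForCenNet sets a new state of the art on four real-world benchmarks: DocUNet, DIR300, WarpDoc, and DocReal.
Link: https://arxiv.org/abs/2507.19804
Authors: Peng Cai,Qiang Li,Kaicheng Yang,Dong Guo,Jia Li,Nan Zhou,Xiang An,Ninghua Yang,Jiankang Deng
Affiliations: Qihoo Technology; DeepGlint; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV25, 16 pages, 14 figures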
Abstract:Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce Foreground-Centric Network (ForCenNet) to eliminate geometric distortions in document images. Specifically, we initially propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss to leverage the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art on four real-world benchmarks, such as DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. The resources for further comparison are provided at this https URL.
[CV-162] Smaller Faster Cheaper: Architectural Designs for Efficient Machine Learning
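[Quick Read]: This thesis addresses the growing computational demands that accompany performance gains in computer vision models, which become a deployment bottleneck in resource-constrained environments. The core challenge is reducing computational cost without sacrificing performance. It pursues architectural improvements in three directions. First, it improves data ingress and egress so that the core neural processing units make better use of available data, allowing small models to perform well. Second, it modifies the core neural architecture, showing that removing uniform context windows in restricted attention increases the expressivity of vision transformers. Third, it exploits the natural structure of Normalizing Flows to distill model knowledge more effectively. Together these results indicate that careful architectural design can make machine learning models smaller, faster, and cheaper.
Link: https://arxiv.org/abs/2507.19795
Authors: Steven Walton
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Ph.D. Thesis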
Abstract:Major advancements in the capabilities of computer vision models have been primarily fueled by rapid expansion of datasets, model parameters, and computational budgets, leading to ever-increasing demands on computational infrastructure. However, as these models are deployed in increasingly diverse and resource-constrained environments, there is a pressing need for architectures that can deliver high performance while requiring fewer computational resources. This dissertation focuses on architectural principles through which models can achieve increased performance while reducing their computational demands. We discuss strides towards this goal through three directions. First, we focus on data ingress and egress, investigating how information may be passed into and retrieved from our core neural processing units. This ensures that our models make the most of available data, allowing smaller architectures to become more performant. Second, we investigate modifications to the core neural architecture, applied to restricted attention in vision transformers. This section explores how removing uniform context windows in restricted attention increases the expressivity of the underlying neural architecture. Third, we explore the natural structures of Normalizing Flows and how we can leverage these properties to better distill model knowledge. These contributions demonstrate that careful design of neural architectures can increase the efficiency of machine learning algorithms, allowing them to become smaller, faster, and cheaper.
[CV-163] DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation ICCV
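[Quick Read]: This paper addresses the limited performance of unsupervised video object segmentation (VOS) models caused by scarce training data. The proposed DepthFlow estimates a depth map from a single image and converts it into a synthetic optical flow field that preserves the essential structural cues, turning large-scale image-mask pairs into image-flow-mask training pairs and greatly expanding the training data. The core insight is that VOS models depend more on the structural information in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. Training a simple encoder-decoder on the synthesized data achieves a new state of the art on all public VOS benchmarks.
Link: https://arxiv.org/abs/2507.19790
Authors: Suhwan Cho,Minhyeok Lee,Jungho Lee,Donghyeong Kim,Sangyoun Lee
Affiliations: GenGenAI; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCVW 2025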
Abstract:Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two-stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large-scale image-mask pairs into image-flow-mask training pairs, dramatically expanding the data available for network training. By training a simple encoder-decoder architecture with our synthesized data, we achieve new state-of-the-art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.
[CV-164] TransFlow: Motion Knowledge Transfer from Video Diffusion Models to Video Salient Object Detection ICCV
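[Quick Read]: This paper addresses the limited training of video salient object detection (Video SOD) models caused by scarce video data. Existing methods create video sequences from static images through spatial transformations, but the resulting optical flows lack semantic understanding of motion and do not support motion-guided tasks well. The proposed TransFlow transfers the rich semantic motion priors of pre-trained video diffusion models to static images, generating semantically aware optical flows in which objects move naturally while spatial boundaries and temporal coherence are preserved. This provides realistic synthetic training data for Video SOD.
Link: https://arxiv.org/abs/2507.19789
Authors: Suhwan Cho,Minhyeok Lee,Jungho Lee,Sunghun Yang,Sangyoun Lee
Affiliations: GenGenAI; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCVW 2025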
Abstract:Video salient object detection (SOD) relies on motion cues to distinguish salient objects from backgrounds, but training such models is limited by scarce video datasets compared to abundant image datasets. Existing approaches that use spatial transformations to create video sequences from static images fail for motion-guided tasks, as these transformations produce unrealistic optical flows that lack semantic understanding of motion. We present TransFlow, which transfers motion knowledge from pre-trained video diffusion models to generate realistic training data for video SOD. Video diffusion models have learned rich semantic motion priors from large-scale video data, understanding how different objects naturally move in real scenes. TransFlow leverages this knowledge to generate semantically-aware optical flows from static images, where objects exhibit natural motion patterns while preserving spatial boundaries and temporal coherence. Our method achieves improved performance across multiple benchmarks, demonstrating effective motion knowledge transfer.
[CV-165] JDATT: A Joint Distillation Framework for Atmospheric Turbulence Mitigation and Target Detection
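[Quick Read]: This paper addresses image degradation caused by atmospheric turbulence (AT) and its impact on downstream target detection, along with the high computational cost of existing deep learning methods and the inefficiency of pipelines that handle mitigation and detection separately. The proposed JDATT (Joint Distillation for Atmospheric Turbulence mitigation and Target detection) integrates state-of-the-art AT mitigation and detection modules and applies a unified knowledge distillation strategy that compresses both while minimizing performance loss. It uses a hybrid scheme: feature-level distillation via Channel-Wise Distillation (CWD) and Masked Generative Distillation (MGD), and output-level distillation via KL divergence, enabling efficient, real-time joint turbulence mitigation and target detection.
Link: https://arxiv.org/abs/2507.19780
Authors: Zhiming Liu,Paul Hill,Nantheera Anantrasirichai
Affiliations: University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the 36th British Machine Vision Conference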
Abstract:Atmospheric turbulence (AT) introduces severe degradations, such as rippling, blur, and intensity fluctuations, that hinder both image quality and downstream vision tasks like target detection. While recent deep learning-based approaches have advanced AT mitigation using transformer and Mamba architectures, their high complexity and computational cost make them unsuitable for real-time applications, especially in resource-constrained settings such as remote surveillance. Moreover, the common practice of separating turbulence mitigation and object detection leads to inefficiencies and suboptimal performance. To address these challenges, we propose JDATT, a Joint Distillation framework for Atmospheric Turbulence mitigation and Target detection. JDATT integrates state-of-the-art AT mitigation and detection modules and introduces a unified knowledge distillation strategy that compresses both components while minimizing performance loss. We employ a hybrid distillation scheme: feature-level distillation via Channel-Wise Distillation (CWD) and Masked Generative Distillation (MGD), and output-level distillation via Kullback-Leibler divergence. Experiments on synthetic and real-world turbulence datasets demonstrate that JDATT achieves superior visual restoration and detection accuracy while significantly reducing model size and inference time, making it well-suited for real-time deployment.
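The output-level distillation term mentioned above is a Kullback-Leibler divergence between teacher and student predictions. The following is a minimal sketch of that general technique, not JDATT's actual loss: the temperature parameter and the temperature-squared scaling follow a common knowledge-distillation convention and are assumptions here, as are the function names.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    The temperature and the T^2 factor are assumed conventions,
    not details taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl


same = kl_distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = kl_distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets a compact student be trained toward a larger teacher.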
[CV-166] HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning
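[Quick Read]: This paper addresses the limited inter-point interaction caused by the quadratic complexity of attention in point cloud learning, as well as the shortcomings of existing S6-based methods in point cloud serialization and locality learning. The proposed HydraMamba network has three key components: 1) a shuffle serialization strategy that adapts unordered point sets to the causal nature of S6 (the selective state space model); 2) a ConvBiS6 layer that captures local geometry and global context dependencies together; 3) MHS6, which extends the multi-head design to S6 to strengthen its modeling capability. HydraMamba achieves state-of-the-art results on both object-level and scene-level tasks.
Link: https://arxiv.org/abs/2507.19778
Authors: Kanglin Qu,Pan Gao,Qun Dai,Yuanhao Sun
Affiliations: Nanjing University of Aeronautics and Astronautics; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MM '25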
Abstract:The attention mechanism has become a dominant operator in point cloud learning, but its quadratic complexity leads to limited inter-point interactions, hindering long-range dependency modeling between objects. Due to excellent long-range modeling capability with linear complexity, the selective state space model (S6), as the core of Mamba, has been exploited in point cloud learning for long-range dependency interactions over the entire point cloud. Despite some significant progress, related works still suffer from imperfect point cloud serialization and lack of locality learning. To this end, we explore a state space model-based point cloud network termed HydraMamba to address the above challenges. Specifically, we design a shuffle serialization strategy, making unordered point sets better adapted to the causal nature of S6. Meanwhile, to overcome the deficiency of existing techniques in locality learning, we propose a ConvBiS6 layer, which is capable of capturing local geometries and global context dependencies synergistically. Besides, we propose MHS6 by extending the multi-head design to S6, further enhancing its modeling capability. HydraMamba achieves state-of-the-art results on various tasks at both object-level and scene-level. The code is available at this https URL.
[CV-167] Self-Guided Masked Autoencoder
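[Quick Read]: This paper examines what the Masked Autoencoder (MAE) learns during self-supervised representation learning and how it learns it, a mechanism that is still not well understood. The analysis finds that MAE learns pattern-based patch-level clustering from very early in pretraining. Building on this, the authors propose the self-guided masked autoencoder, which uses the model's own progress in patch clustering to generate informed masks internally, replacing the random masking of vanilla MAE. The approach significantly accelerates learning, needs no external models or supplementary information, and keeps MAE's self-supervised nature intact.
Link: https://arxiv.org/abs/2507.19773
Authors: Jeongwoo Shin,Inseo Lee,Junho Lee,Joonseok Lee
Affiliations: Seoul National University; Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: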
Abstract:Masked Autoencoder (MAE) is a self-supervised approach for representation learning, widely applicable to a variety of downstream tasks in computer vision. In spite of its success, it is still not fully uncovered what and how MAE exactly learns. In this paper, with an in-depth analysis, we discover that MAE intrinsically learns pattern-based patch-level clustering from surprisingly early stages of pretraining. Upon this understanding, we propose self-guided masked autoencoder, which internally generates informed mask by utilizing its progress in patch clustering, substituting the naive random masking of the vanilla MAE. Our approach significantly boosts its learning process without relying on any external models or supplementary information, keeping the benefit of self-supervised nature of MAE intact. Comprehensive experiments on various downstream tasks verify the effectiveness of the proposed method.
[CV-168] MoFRR: Mixture of Diffusion Models for Face Retouching Restoration
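[Quick Read]: This paper addresses the authenticity concerns raised by widespread face retouching on social media, and in particular how to accurately recover the original face from a retouched image. Existing methods focus on detecting retouching and cannot restore the original face. The paper introduces a new task, Face Retouching Restoration (FRR), and proposes MoFRR (Mixture of Diffusion Models for FRR). Its key design is a sparsely activated mixture of diffusion experts: specialized experts handle distinct retouching types (for example skin smoothing and whitening), and a shared expert handles universal retouching traces. Each specialized expert has a dual-branch structure: a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM), which restores the overall facial structure, and a cross-attention-based high-frequency branch (HFCAM) for detail refinement, enabling accurate restoration of low-frequency information under complex retouching operations.
Link: https://arxiv.org/abs/2507.19770
Authors: Jiaxin Liu,Qichao Ying,Zhenxing Qian,Sheng Li,Runqi Zhang,Jian Liu,Xinpeng Zhang
Affiliations: Fudan University; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: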
Abstract:The widespread use of face retouching on social media platforms raises concerns about the authenticity of face images. While existing methods focus on detecting face retouching, how to accurately recover the original faces from the retouched ones has yet to be answered. This paper introduces Face Retouching Restoration (FRR), a novel computer vision task aimed at restoring original faces from their retouched counterparts. FRR differs from traditional image restoration tasks by addressing the complex retouching operations with various types and degrees, which focuses more on the restoration of the low-frequency information of the faces. To tackle this challenge, we propose MoFRR, Mixture of Diffusion Models for FRR. Inspired by DeepSeek’s expert isolation strategy, the MoFRR uses sparse activation of specialized experts handling distinct retouching types and the engagement of a shared expert dealing with universal retouching traces. Each specialized expert follows a dual-branch structure with a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM) and a Cross-Attention-based High-Frequency branch (HFCAM) for detail refinement. Extensive experiments on a newly constructed face retouching dataset, RetouchingFFHQ++, demonstrate the effectiveness of MoFRR for FRR.
[CV-169] Latest Object Memory Management for Temporally Consistent Video Instance Segmentation ICCV2025
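[Quick Read]: This paper addresses inconsistent long-term instance tracking in Video Instance Segmentation (VIS), especially the difficulty of maintaining identity consistency in complex dynamic scenes where objects frequently appear and disappear. The proposed Latest Object Memory Management (LOMM) framework is built around Latest Object Memory (LOM), which explicitly models each object's presence in every frame to robustly track and continuously update the latest object states. It also introduces Decoupled Object Association (DOA), which handles newly appearing and already existing objects separately to improve matching accuracy and keep identities stable. The method reaches an AP of 54.0 on YouTube-VIS 2022, outperforming traditional approaches and setting a new benchmark in VIS.
Link: https://arxiv.org/abs/2507.19754
Authors: Seunghun Lee,Jiwan Seo,Minwoo Choi,Kiljoon Han,Jaehoon Jeong,Zane Durante,Ehsan Adeli,Sang Hyun Park,Sunghoon Im
Affiliations: DGIST; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Code: this https URL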
Abstract:In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long-term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability through the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves state-of-the-art AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos. Project page: this https URL
[CV-170] Leveraging Sparse LiDAR for RAFT-Stereo: A Depth Pre-Fill Perspective
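[Quick Read]: This paper addresses the sharp performance drop of LiDAR-guided stereo matching within the RAFT-Stereo framework when LiDAR points are sparse (for example a few hundred points per frame). The core difficulty is that sparse LiDAR depth cannot effectively guide the initial disparity map, which limits matching accuracy. The key is an explanation from a signal processing perspective, which leads to a simple fix: pre-filling the sparse initial disparity map with interpolation markedly improves LiDAR guidance. Pre-filling is also effective when LiDAR depth is injected into image features via early fusion, but for a different reason, so it requires a distinct pre-filling approach. Combining both, the proposed Guided RAFT-Stereo (GRAFT-Stereo) significantly outperforms existing LiDAR-guided methods under sparse LiDAR conditions across multiple datasets.
Link: https://arxiv.org/abs/2507.19738
Authors: Jinsu Yoo,Sooyoung Jeon,Zanming Huang,Tai-Yu Pan,Wei-Lun Chao
Affiliations: The Ohio State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: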
Abstract:We investigate LiDAR guidance within the RAFT-Stereo framework, aiming to improve stereo matching accuracy by injecting precise LiDAR depth into the initial disparity map. We find that the effectiveness of LiDAR guidance drastically degrades when the LiDAR points become sparse (e.g., a few hundred points per frame), and we offer a novel explanation from a signal processing perspective. This insight leads to a surprisingly simple solution that enables LiDAR-guided RAFT-Stereo to thrive: pre-filling the sparse initial disparity map with interpolation. Interestingly, we find that pre-filling is also effective when injecting LiDAR depth into image features via early fusion, but for a fundamentally different reason, necessitating a distinct pre-filling approach. By combining both solutions, the proposed Guided RAFT-Stereo (GRAFT-Stereo) significantly outperforms existing LiDAR-guided methods under sparse LiDAR conditions across various datasets. We hope this study inspires more effective LiDAR-guided stereo methods.
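Pre-filling a sparse disparity map with interpolation, as described above, can be sketched as follows. The abstract does not say which interpolation the authors use, so nearest-neighbor filling is an assumption chosen for simplicity; the function name `prefill_nearest` and the use of 0.0 as the "no measurement" marker are also hypothetical.

```python
def prefill_nearest(sparse, missing=0.0):
    """Fill each missing cell with the value of the nearest measured cell.

    sparse:  2D list of disparities, with `missing` marking cells that have
             no LiDAR measurement (an assumed encoding).
    Distance is squared Euclidean distance in pixel coordinates; ties go to
    the first measured cell in row-major order.
    """
    h, w = len(sparse), len(sparse[0])
    known = [(r, c, sparse[r][c])
             for r in range(h) for c in range(w)
             if sparse[r][c] != missing]
    if not known:
        # Nothing to interpolate from; return an unchanged copy.
        return [row[:] for row in sparse]
    dense = []
    for r in range(h):
        row = []
        for c in range(w):
            if sparse[r][c] != missing:
                row.append(sparse[r][c])
            else:
                nearest = min(known, key=lambda k: (r - k[0]) ** 2 + (c - k[1]) ** 2)
                row.append(nearest[2])
        dense.append(row)
    return dense


dense = prefill_nearest([[4.0, 0.0, 0.0, 8.0]])
```

This brute-force version costs one pass over all measured points per pixel, which is acceptable for a few hundred LiDAR points but would normally be replaced by a spatial index or a library interpolator.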
[CV-171] Quaternion-Based Robust PCA for Efficient Moving Target Detection and Background Recovery in Color Videos
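[Quick Read]: This paper addresses moving target detection in in-the-wild color videos captured by static cameras. The core challenge is separating moving targets from complex backgrounds efficiently while recovering high-quality backgrounds, so that the recombined data can improve the generalization of deep models. The proposed universal Quaternion-based Robust Principal Component Analysis (uQRPCA) framework uses a quaternion Riemannian manifold to reduce the computational complexity of Quaternion Singular Value Decomposition (QSVD) to o(1). The Color Rank-1 Batch (CR1B) method then recovers a low-rank background across color channels, addressing the fact that a rank-1 quaternion matrix does not yield rank-1 color channels; the resulting uQRPCA+ achieves state-of-the-art performance on moving target segmentation and background recovery.
Link: https://arxiv.org/abs/2507.19730
Authors: Liyang Wang,Shiqian Wu,Shun Fang,Qile Zhu,Jiaxin Wu,Sos Again
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: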
Abstract:Moving target detection is a challenging computer vision task aimed at generating accurate segmentation maps in diverse in-the-wild color videos captured by static cameras. If backgrounds and targets can be simultaneously extracted and recombined, such synthetic data can significantly enrich annotated in-the-wild datasets and enhance the generalization ability of deep models. Quaternion-based RPCA (QRPCA) is a promising unsupervised paradigm for color image processing. However, in color video processing, Quaternion Singular Value Decomposition (QSVD) incurs high computational costs, and rank-1 quaternion matrix fails to yield rank-1 color channels. In this paper, we reduce the computational complexity of QSVD to o(1) by utilizing a quaternion Riemannian manifold. Furthermore, we propose the universal QRPCA (uQRPCA) framework, which achieves a balance in simultaneously segmenting targets and recovering backgrounds from color videos. Moreover, we expand to uQRPCA+ by introducing the Color Rank-1 Batch (CR1B) method to further process and obtain the ideal low-rank background across color channels. Experiments demonstrate our uQRPCA+ achieves State Of The Art (SOTA) performance on moving target detection and background recovery tasks compared to existing open-source methods. Our implementation is publicly available on GitHub at this https URL
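For readers unfamiliar with RPCA, the classical (real-valued) formulation that quaternion variants build on is shown below. This is the standard principal component pursuit objective, given as background rather than as the paper's exact objective; the symbols X (video matrix with one vectorized frame per column), L (low-rank background), S (sparse moving targets), and the weight lambda are notation assumed here.

```latex
\min_{\mathbf{L},\,\mathbf{S}} \; \|\mathbf{L}\|_{*} + \lambda \,\|\mathbf{S}\|_{1}
\quad \text{subject to} \quad \mathbf{X} = \mathbf{L} + \mathbf{S}
```

Here the nuclear norm favors a low-rank background and the l1 norm favors sparse foreground. In quaternion formulations each color pixel is typically encoded as a pure quaternion, so the singular values behind the nuclear norm come from a QSVD, which is the step whose cost the abstract targets.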
[CV-172] Bias Analysis for Synthetic Face Detection: A Case Study of the Impact of Facial Attribute
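[Quick Read]: This paper addresses bias in synthetic face detection models: current detectors perform inconsistently across facial attributes (such as skin tone, gender, and age), which can cause false or missed detections for certain groups and raises social, legal, and ethical risks. The key is a systematic evaluation framework that uses synthetic data generation with evenly distributed facial attribute labels to mitigate skew in the data that could otherwise distort the analysis, so that detector fairness across many facial attributes can be assessed more accurately. On this basis, controlled experiments and activation map analysis trace the origins of the bias, including attribute imbalance in the detectors' training sets and differing sensitivity to specific attribute changes.
Link: https://arxiv.org/abs/2507.19705
Authors: Asmae Lamsaf,Lucia Cascone,Hugo Proença,João Neves
Affiliations: University of Beira Interior; NOVA LINCS; University of Salerno
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: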
Abstract:Bias analysis for synthetic face detection is bound to become a critical topic in the coming years. Although many detection models have been developed and several datasets have been released to reliably identify synthetic content, one crucial aspect has been largely overlooked: these models and training datasets can be biased, leading to failures in detection for certain demographic groups and raising significant social, legal, and ethical issues. In this work, we introduce an evaluation framework to contribute to the analysis of bias of synthetic face detectors with respect to several facial attributes. This framework exploits synthetic data generation, with evenly distributed attribute labels, for mitigating any skew in the data that could otherwise influence the outcomes of bias analysis. We build on the proposed framework to provide an extensive case study of the bias level of five state-of-the-art detectors in synthetic datasets with 25 controlled facial attributes. While the results confirm that, in general, synthetic face detectors are biased towards the presence/absence of specific facial attributes, our study also sheds light on the origins of the observed bias through the analysis of the correlations with the balancing of facial attributes in the training sets of the detectors, and the analysis of detectors activation maps in image pairs with controlled attribute modifications.
[CV-173] Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing
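[Quick Read]: This paper addresses accurate scene perception and understanding for autonomous driving in complex urban environments, in particular the multi-modality inherent in environmental understanding and fine-grained scene decomposition. The proposed Co-Win is a bird's eye view (BEV) perception framework that combines point cloud encoding with efficient parallel window-based feature extraction to model spatial features and object relationships hierarchically. It also incorporates a variational approach with mask-based instance segmentation, so that predicted masks are both data-consistent and contextually relevant, and it produces interpretable and diverse instance predictions that support downstream decision-making and planning.
Link: https://arxiv.org/abs/2507.19691
Authors: Haichuan Li,Tomi Westerlund
Affiliations: Turku Intelligent Embedded and Robotics Systems lab, Faculty of Technology, University of Turku
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: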
Abstract:Accurate perception and scene understanding in complex urban environments is a critical challenge for ensuring safe and efficient autonomous navigation. In this paper, we present Co-Win, a novel bird’s eye view (BEV) perception framework that integrates point cloud encoding with efficient parallel window-based feature extraction to address the multi-modality inherent in environmental understanding. Our method employs a hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder head to effectively capture diverse spatial features and object relationships. Unlike prior approaches that treat perception as a simple regression task, our framework incorporates a variational approach with mask-based instance segmentation, enabling fine-grained scene decomposition and understanding. The Co-Win architecture processes point cloud data through progressive feature extraction stages, ensuring that predicted masks are both data-consistent and contextually relevant. Furthermore, our method produces interpretable and diverse instance predictions, enabling enhanced downstream decision-making and planning in autonomous driving systems.
[CV-174] DeepJIVE: Learning Joint and Individual Variation Explained from Multimodal Data Using Deep Learning
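[Quick Read]: This paper addresses the limitations of conventional multimodal data integration methods, which struggle with high-dimensional data and nonlinear structure and therefore cannot accurately separate shared from modality-specific variation. The proposed DeepJIVE is a deep learning approach to Joint and Individual Variance Explained (JIVE). Through mathematical derivation, the authors arrive at three viable loss functions that satisfy the identity and orthogonality constraints, and they validate the method on synthetic and real-world 1D, 2D, and 3D datasets. Applied to Alzheimer's Disease Neuroimaging Initiative (ADNI) data, DeepJIVE identifies biologically plausible covariation patterns between amyloid positron emission tomography (PET) and magnetic resonance (MR) images.
Link: https://arxiv.org/abs/2507.19682
Authors: Matthew Drexler,Benjamin Risk,James J Lah,Suprateek Kundu,Deqiang Qiu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 10 figures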
Abstract:Conventional multimodal data integration methods provide a comprehensive assessment of the shared or unique structure within each individual data type but suffer from several limitations such as the inability to handle high-dimensional data and identify nonlinear structures. In this paper, we introduce DeepJIVE, a deep-learning approach to performing Joint and Individual Variance Explained (JIVE). We perform mathematical derivation and experimental validations using both synthetic and real-world 1D, 2D, and 3D datasets. Different strategies of achieving the identity and orthogonality constraints for DeepJIVE were explored, resulting in three viable loss functions. We found that DeepJIVE can successfully uncover joint and individual variations of multimodal datasets. Our application of DeepJIVE to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) also identified biologically plausible covariation patterns between the amyloid positron emission tomography (PET) and magnetic resonance (MR) images. In conclusion, the proposed DeepJIVE can be a useful tool for multimodal data analysis.
[CV-175] Efficient Learning for Product Attributes with Compact Multimodal Models
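[Quick Read]: This paper addresses the high annotation cost of supervised fine-tuning for image-based product attribute prediction in e-commerce, focusing on label efficiency for compact vision language models (VLMs). The approach combines Parameter-Efficient Fine-Tuning (PEFT) with Direct Preference Optimization (DPO): for each unlabeled product listing it generates multiple reasoning-and-answer chains, splits them into preferred and dispreferred by self-consistency, and then iteratively updates the adapter weights with the DPO loss. This lets a small labeled set plus a large pool of unlabeled data improve model performance efficiently.
Link: https://arxiv.org/abs/2507.19679
Authors: Mandar Kulkarni
Affiliations: Flipkart Data Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: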
Abstract:Image-based product attribute prediction in e-commerce is a crucial task with numerous applications. The supervised fine-tuning of Vision Language Models (VLMs) faces significant scale challenges due to the cost of manual or API based annotation. In this paper, we investigate label-efficient semi-supervised fine-tuning strategies for compact VLMs (2B-3B parameters) that leverage unlabeled product listings through Direct Preference Optimization (DPO). Beginning with a small, API-based, annotated, and labeled set, we first employ PEFT to train low-rank adapter modules. To update the adapter weights with unlabeled data, we generate multiple reasoning-and-answer chains per unlabeled sample and segregate these chains into preferred and dispreferred based on self-consistency. We then fine-tune the model with DPO loss and use the updated model for the next iteration. By using PEFT fine-tuning with DPO, our method achieves efficient convergence with minimal compute overhead. On a dataset spanning twelve e-commerce verticals, DPO-based fine-tuning, which utilizes only unlabeled data, demonstrates a significant improvement over the supervised model. Moreover, experiments demonstrate that accuracy with DPO training improves with more unlabeled data, indicating that a large pool of unlabeled samples can be effectively leveraged to improve performance.
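The two steps described above, splitting sampled chains by self-consistency and scoring a preferred/dispreferred pair with the DPO loss, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's pipeline: treating the majority answer as "preferred" is one plausible reading of self-consistency, and the function names, the `beta` default, and the sample data are hypothetical. The loss itself is the standard DPO objective for a single pair.

```python
import math
from collections import Counter


def split_by_self_consistency(chains):
    """Split (reasoning, answer) chains into preferred and dispreferred.

    Assumption: chains whose answer matches the majority answer are
    preferred; all others are dispreferred.
    """
    counts = Counter(answer for _, answer in chains)
    majority, _ = counts.most_common(1)[0]
    preferred = [c for c in chains if c[1] == majority]
    dispreferred = [c for c in chains if c[1] != majority]
    return preferred, dispreferred


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred w, dispreferred l) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


chains = [("looks crimson", "red"), ("warm tone", "red"), ("cool tone", "blue")]
preferred, dispreferred = split_by_self_consistency(chains)
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below its neutral value once the policy assigns relatively more probability to the preferred chain than the reference model does, which is the signal used to update the adapter weights without labels.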
[CV-176] SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions
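[Quick Read]: This paper addresses the challenges of pain assessment for people with limited ability to communicate, such as older adults with dementia. Existing pain detection datasets have limited ethnic/racial diversity, privacy constraints, and too few older adults, who are the target population. The authors build SynPAIN, a large-scale synthetic dataset of 10,710 facial expression images (5,355 neutral/expressive pairs) covering five ethnicities/races, two age groups (young: 20-35, old: 75+), and two genders, using commercial generative AI tools to create demographically balanced synthetic identities with clinically meaningful pain expressions. Validation with clinically validated tools based on facial action unit analysis shows that the synthetic pain expressions score significantly higher than neutral and non-pain expressions. The dataset is used to expose algorithmic bias in existing pain detection models, and age-matched synthetic data augmentation improves detection on real clinical data (a 7.0% gain in average precision). SynPAIN is the first publicly available, demographically diverse synthetic dataset designed for older adult pain detection, and it establishes a framework for measuring and mitigating algorithmic bias.
Link: https://arxiv.org/abs/2507.19673
Authors: Babak Taati,Muhammad Muzammil,Yasamin Zarghami,Abhishek Moturu,Airhossein Kazerouni,Hailey Reimer,Alex Mihailidis,Thomas Hadjistavropoulos
Affiliations: KITE Research Institute, Toronto Rehabilitation Institute, University Health Network; Department of Computer Science, University of Toronto; Institute of Biomedical Engineering, University of Toronto; Vector Institute; Department of Occupational Science and Occupational Therapy, University of Toronto; University of Regina
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, submitted to IEEE JBHI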
Abstract:Accurate pain assessment in patients with limited ability to communicate, such as older adults with dementia, represents a critical healthcare challenge. Robust automated systems of pain detection may facilitate such assessments. Existing pain detection datasets, however, suffer from limited ethnic/racial diversity, privacy constraints, and underrepresentation of older adults who are the primary target population for clinical deployment. We present SynPAIN, a large-scale synthetic dataset containing 10,710 facial expression images (5,355 neutral/expressive pairs) across five ethnicities/races, two age groups (young: 20-35, old: 75+), and two genders. Using commercial generative AI tools, we created demographically balanced synthetic identities with clinically meaningful pain expressions. Our validation demonstrates that synthetic pain expressions exhibit expected pain patterns, scoring significantly higher than neutral and non-pain expressions using clinically validated pain assessment tools based on facial action unit analysis. We experimentally demonstrate SynPAIN’s utility in identifying algorithmic bias in existing pain detection models. Through comprehensive bias evaluation, we reveal substantial performance disparities across demographic characteristics. These performance disparities were previously undetectable with smaller, less diverse datasets. Furthermore, we demonstrate that age-matched synthetic data augmentation improves pain detection performance on real clinical data, achieving a 7.0% improvement in average precision. SynPAIN addresses critical gaps in pain assessment research by providing the first publicly available, demographically diverse synthetic dataset specifically designed for older adult pain detection, while establishing a framework for measuring and mitigating algorithmic bias. The dataset is available at this https URL
[CV-177] Pre- and Post-Treatment Glioma Segmentation with the Medical Imaging Segmentation Toolkit
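[Quick read]: This paper addresses the lack of standardized and customizable tooling in medical image segmentation, which makes rigorous comparison between methods difficult. The key to the solution is the development and extension of the Medical Imaging Segmentation Toolkit (MIST), in particular a flexible, modular postprocessing framework that supports a range of composable transforms (such as small-object removal, largest-connected-component extraction, and morphological operations) and lets users define custom strategies for fine-grained control over the final segmentation. Evaluation on the BraTS 2025 glioma segmentation challenge indicates that the framework supports high-quality segmentations and reproducible, scalable research practice.
Link: https://arxiv.org/abs/2507.19626
Authors: Adrian Celaya,Tucker Netherton,Dawid Schellingerhout,Caroline Chung,Beatrice Riviere,David Fuentes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: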
Abstract:Medical image segmentation continues to advance rapidly, yet rigorous comparison between methods remains challenging due to a lack of standardized and customizable tooling. In this work, we present the current state of the Medical Imaging Segmentation Toolkit (MIST), with a particular focus on its flexible and modular postprocessing framework designed for the BraTS 2025 pre- and post-treatment glioma segmentation challenge. Since its debut in the 2024 BraTS adult glioma post-treatment segmentation challenge, MIST’s postprocessing module has been significantly extended to support a wide range of transforms, including removal or replacement of small objects, extraction of the largest connected components, and morphological operations such as hole filling and closing. These transforms can be composed into user-defined strategies, enabling fine-grained control over the final segmentation output. We evaluate three such strategies - ranging from simple small-object removal to more complex, class-specific pipelines - and rank their performance using the BraTS ranking protocol. Our results highlight how MIST facilitates rapid experimentation and targeted refinement, ultimately producing high-quality segmentations for the BraTS 2025 challenge. MIST remains open source and extensible, supporting reproducible and scalable research in medical image segmentation.
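The idea of composing postprocessing transforms into a user-defined strategy can be sketched as follows. This is a toy, pure-Python illustration on 2D binary masks with 4-connectivity; the function names (`remove_small_objects`, `keep_largest_component`, `compose`) are hypothetical and do not reflect the actual MIST API.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected foreground components of a 2D binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def _paint(shape_like, components):
    """Draw the given components onto an empty mask of the same shape."""
    out = [[0] * len(shape_like[0]) for _ in shape_like]
    for pixels in components:
        for y, x in pixels:
            out[y][x] = 1
    return out


def remove_small_objects(mask, min_size):
    """Drop components with fewer than min_size pixels."""
    kept = [p for p in connected_components(mask) if len(p) >= min_size]
    return _paint(mask, kept)


def keep_largest_component(mask):
    """Keep only the largest component (empty mask stays empty)."""
    comps = connected_components(mask)
    return _paint(mask, [max(comps, key=len)] if comps else [])


def compose(*transforms):
    """Chain transforms into one strategy applied left to right."""
    def strategy(mask):
        for transform in transforms:
            mask = transform(mask)
        return mask
    return strategy


mask = [[1, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 0]]
strategy = compose(lambda m: remove_small_objects(m, 2), keep_largest_component)
cleaned = strategy(mask)
```

A class-specific pipeline, as mentioned in the abstract, would apply a different composed strategy to each label's binary mask.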
[CV-178] Exemplar Med-DETR: Toward Generalized and Robust Lesion Detection in Mammogram Images and beyond
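[Quick read]: This paper addresses the challenge of detecting abnormalities in medical images, where detection performance is limited by differences in feature representations and by the intricate relationship between anatomical structures and abnormalities, especially in mammography, where dense breast tissue can obscure lesions. Existing methods exploit anatomical and semantic context but still struggle to learn effective class-specific features, which limits their generalization across tasks and imaging modalities. The key to the solution is Exemplar Med-DETR, a novel multi-modal contrastive detector that enables feature-based detection by applying cross-attention to intuitive, class-specific exemplar features and training with an iterative strategy. The method achieves state-of-the-art performance on four public datasets covering three imaging modalities (mammography, chest X-ray, angiography), showing its potential for building robust, transferable medical image detection systems.
Link: https://arxiv.org/abs/2507.19621
Authors: Sheethal Bhat,Bogdan Georgescu,Adarsh Bhandary Panambur,Mathias Zinnen,Tri-Thien Nguyen,Awais Mansoor,Karim Khalifa Elbarbary,Siming Bayer,Florin-Cristian Ghesu,Sasa Grbic,Andreas Maier
Affiliations: German Cancer Research Center; University of Freiburg; University Hospital Heidelberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: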
Abstract:Detecting abnormalities in medical images poses unique challenges due to differences in feature representations and the intricate relationship between anatomical structures and abnormalities. This is especially evident in mammography, where dense breast tissue can obscure lesions, complicating radiological interpretation. Despite leveraging anatomical and semantic context, existing detection methods struggle to learn effective class-specific features, limiting their applicability across different tasks and imaging modalities. In this work, we introduce Exemplar Med-DETR, a novel multi-modal contrastive detector that enables feature-based detection. It employs cross-attention with inherently derived, intuitive class-specific exemplar features and is trained with an iterative strategy. We achieve state-of-the-art performance across three distinct imaging modalities from four public datasets. On Vietnamese dense breast mammograms, we attain an mAP of 0.7 for mass detection and 0.55 for calcifications, yielding an absolute improvement of 16 percentage points. Additionally, a radiologist-supported evaluation of 100 mammograms from an out-of-distribution Chinese cohort demonstrates a twofold gain in lesion detection performance. For chest X-rays and angiography, we achieve an mAP of 0.25 for mass and 0.37 for stenosis detection, improving results by 4 and 7 percentage points, respectively. These results highlight the potential of our approach to advance robust and generalizable detection systems for medical imaging.
[CV-179] Object-centric Video Question Answering with Visual Grounding and Referring
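[Quick read]: This paper addresses the limitation that current Video Large Language Models (VideoLLMs) focus on high-level semantic comprehension and support text-only responses, which restricts the flexibility of object-centric, multi-round interaction. The solution has three key parts: first, a VideoLLM capable of object referring on the input side and visual grounding on the output side, so that users can interact with videos through both textual and visual prompts; second, the Spatial-Temporal Overlay Module (STOM), which propagates a visual prompt given at any single timestamp to the remaining frames of the video; and third, the VideoInfer dataset, which provides object-centric video instruction question-answering pairs that require reasoning. Experiments show that the method outperforms baselines on 12 benchmarks across 6 tasks, confirming its robustness for multimodal, object-centric video understanding.
Link: https://arxiv.org/abs/2507.19599
Authors: Haochen Wang,Qirui Chen,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Yao Hu,Weidi Xie,Stratis Gavves
Affiliations: University of Amsterdam; SAI, Shanghai Jiao Tong University; Xiaohongshu Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: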
Abstract:Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. Project page: this https URL.
[CV-180] SurgPIS: Surgical-instrument-level Instances and Part-level Semantics for Weakly-supervised Part-aware Instance Segmentation
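[Quick read]: This paper addresses inconsistent surgical instrument segmentation in robot-assisted surgery: existing methods handle instrument-level instance segmentation (IIS) or part-level semantic segmentation (PSS) separately, with no interaction between the two. To address this, the authors propose SurgPIS, a unified part-aware instance segmentation (PIS) framework. Its core innovation is a transformer-based mask classification approach with part-specific queries derived from instrument-level object queries, which explicitly link each part to its parent instrument instance. To cope with the lack of large-scale datasets carrying both instance- and part-level labels, the paper also proposes a weakly-supervised learning strategy: PIS predictions are aggregated into IIS or PSS masks so that a loss can be computed on disjoint datasets labelled for only IIS or only PSS, and a student-teacher scheme maintains prediction consistency for the PIS information missing from the partially labelled data. The approach achieves state-of-the-art performance on PIS, IIS, and PSS across multiple datasets.
Link: https://arxiv.org/abs/2507.19592
Authors: Meng Wei,Charlie Budd,Oluwatosin Alabi,Miaojing Shi,Tom Vercauteren
Affiliations: King’s College London; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: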
Abstract:Consistent surgical instrument segmentation is critical for automation in robot-assisted surgery. Yet, existing methods only treat instrument-level instance segmentation (IIS) or part-level semantic segmentation (PSS) separately, without interaction between these tasks. In this work, we formulate surgical tool segmentation as a unified part-aware instance segmentation (PIS) problem and introduce SurgPIS, the first PIS model for surgical instruments. Our method adopts a transformer-based mask classification approach and introduces part-specific queries derived from instrument-level object queries, explicitly linking parts to their parent instrument instances. In order to address the lack of large-scale datasets with both instance- and part-level labels, we propose a weakly-supervised learning strategy for SurgPIS to learn from disjoint datasets labelled for either IIS or PSS purposes. During training, we aggregate our PIS predictions into IIS or PSS masks, thereby allowing us to compute a loss against partially labelled datasets. A student-teacher approach is developed to maintain prediction consistency for missing PIS information in the partially labelled data, e.g., parts of the IIS labelled data. Extensive experiments across multiple datasets validate the effectiveness of SurgPIS, achieving state-of-the-art performance in PIS as well as IIS, PSS, and instrument-level semantic segmentation.
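The aggregation of PIS predictions into IIS or PSS masks can be pictured with a small sketch. It assumes a simplified representation in which each predicted mask is a set of pixel coordinates keyed by `(instance_id, part_name)`; this data layout and the function name are hypothetical, and the paper's model operates on query-based mask predictions rather than pixel sets.

```python
from collections import defaultdict


def aggregate_pis(pis_masks):
    """Collapse part-aware instance masks into IIS and PSS views.

    pis_masks maps (instance_id, part_name) -> set of (row, col) pixels.
    IIS: union of all parts belonging to the same instrument instance.
    PSS: union of the same part class across all instances.
    """
    iis = defaultdict(set)
    pss = defaultdict(set)
    for (instance_id, part_name), pixels in pis_masks.items():
        iis[instance_id] |= pixels
        pss[part_name] |= pixels
    return dict(iis), dict(pss)


# Two hypothetical instruments, each with a shaft and a tip.
pis_masks = {
    (1, "shaft"): {(0, 0), (0, 1)},
    (1, "tip"): {(0, 2)},
    (2, "shaft"): {(3, 0)},
    (2, "tip"): {(3, 1)},
}
iis_masks, pss_masks = aggregate_pis(pis_masks)
```

With this view, a dataset labelled only for IIS can supervise `iis_masks` and one labelled only for PSS can supervise `pss_masks`, which is the intuition behind training on disjoint datasets.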
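[CV-181] T-MPEDNet: Unveiling the Synergy of Transformer-aware Multiscale Progressive Encoder-Decoder Network with Feature Recalibration for Tumor and Liver Segmentation
[Quick read]: This paper addresses the challenges of automated segmentation of the liver and its tumors in CT images, in particular the loss of accuracy caused by the inherent heterogeneity of tumors and the variation in the visual appearance of livers across patients. The key to the solution is the Transformer-aware Multiscale Progressive Encoder-Decoder Network (T-MPEDNet), whose main components are: (1) a deep adaptive feature backbone built on a progressive encoder-decoder structure, with skip connections that recalibrate channel-wise features while preserving spatial integrity; (2) a Transformer-inspired dynamic attention mechanism that captures long-range contextual relationships in the spatial domain, complemented by multi-scale features that refine local detail; and (3) morphological boundary refinement that improves accuracy where boundaries with neighboring organs are indistinct. On the two public datasets LiTS and 3DIRCADb, the method outperforms twelve state-of-the-art methods, supporting its effectiveness and robustness for automated liver and tumor segmentation.
Link: https://arxiv.org/abs/2507.19590
Authors: Chandravardhan Singh Raghaw,Jasmer Singh Sanjotra,Mohammad Zia Ur Rehman,Shubhi Bansal,Shahid Shafi Dar,Nagendra Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: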
Abstract:Precise and automated segmentation of the liver and its tumor within CT scans plays a pivotal role in swift diagnosis and the development of optimal treatment plans for individuals with liver diseases and malignancies. However, automated liver and tumor segmentation faces significant hurdles arising from the inherent heterogeneity of tumors and the diverse visual characteristics of livers across a broad spectrum of patients. Aiming to address these challenges, we present a novel Transformer-aware Multiscale Progressive Encoder-Decoder Network (T-MPEDNet) for automated segmentation of tumor and liver. T-MPEDNet leverages a deep adaptive features backbone through a progressive encoder-decoder structure, enhanced by skip connections for recalibrating channel-wise features while preserving spatial integrity. A Transformer-inspired dynamic attention mechanism captures long-range contextual relationships within the spatial domain, further enhanced by multi-scale feature utilization for refined local details, leading to accurate prediction. Morphological boundary refinement is then employed to address indistinct boundaries with neighboring organs, capturing finer details and yielding precise boundary labels. The efficacy of T-MPEDNet is comprehensively assessed on two widely utilized public benchmark datasets, LiTS and 3DIRCADb. Extensive quantitative and qualitative analyses demonstrate the superiority of T-MPEDNet compared to twelve state-of-the-art methods. On LiTS, T-MPEDNet achieves outstanding Dice Similarity Coefficients (DSC) of 97.6% and 89.1% for liver and tumor segmentation, respectively. Similar performance is observed on 3DIRCADb, with DSCs of 98.3% and 83.3% for liver and tumor segmentation, respectively. Our findings prove that T-MPEDNet is an efficacious and reliable framework for automated segmentation of the liver and its tumor in CT scans.
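For readers unfamiliar with the Dice Similarity Coefficient (DSC) reported above, the following minimal sketch computes it, together with the related Intersection over Union (IoU), on flattened binary masks. The convention of returning 1.0 when both masks are empty is an assumption made here; evaluation toolkits differ on that edge case.

```python
def dice_coefficient(pred, target):
    """DSC = 2 * |P ∩ T| / (|P| + |T|) for flat lists of 0/1 labels."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


def iou(pred, target):
    """IoU = |P ∩ T| / |P ∪ T| for flat lists of 0/1 labels."""
    intersection = sum(p and t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 if union == 0 else intersection / union


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
dsc = dice_coefficient(pred, target)
jaccard = iou(pred, target)
```

The two metrics are monotonically related (DSC = 2·IoU / (1 + IoU)), so DSC is never lower than IoU on the same prediction.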
[CV-182] Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?
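[Quick read]: This paper addresses the distribution shift caused by data addition or data pooling in medical image segmentation, known as the “Data Addition Dilemma”, which can noticeably degrade downstream model performance. The key to the solution is to draw on causal frameworks and propose a method that controls foreground-background feature discrepancies across all layers of a deep network, which improves the consistency and robustness of the feature representations and mitigates the negative effects of distribution shift when data from multiple sources are combined.
Link: https://arxiv.org/abs/2507.19575
Authors: Ayush Roy,Samin Enam,Jun Xia,Vishnu Suresh Lokhande,Won Hwa Kim
Affiliations: University at Buffalo, SUNY; Pohang University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: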
Abstract:Data scarcity is a major challenge in medical imaging, particularly for deep learning models. While data pooling (combining datasets from multiple sources) and data addition (adding more data from a new dataset) have been shown to enhance model performance, they are not without complications. Specifically, increasing the size of the training dataset through pooling or addition can induce distributional shifts, negatively affecting downstream model performance, a phenomenon known as the “Data Addition Dilemma”. While the traditional i.i.d. assumption may not hold in multi-source contexts, assuming exchangeability across datasets provides a more practical framework for data pooling. In this work, we investigate medical image segmentation under these conditions, drawing insights from causal frameworks to propose a method for controlling foreground-background feature discrepancies across all layers of deep networks. This approach improves feature representations, which are crucial in data-addition scenarios. Our method achieves state-of-the-art segmentation performance on histopathology and ultrasound images across five datasets, including a novel ultrasound dataset that we have curated and contributed. Qualitative results demonstrate more refined and accurate segmentation maps compared to prominent baselines across three model architectures. The code will be available on Github.
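[CV-183] Tuning adaptive gamma correction (TAGC) for enhancing images in low light
[Quick read]: This paper addresses the degradation of image quality under low-light conditions, where insufficient illumination causes low contrast, heavy noise, and blurred details. The key to the solution is a model called Tuning Adaptive Gamma Correction (TAGC), which analyzes the color luminance of the image and computes its average color to determine the gamma coefficient automatically and adaptively. The gamma value is therefore adjusted to the illumination level without human intervention, improving image quality while preserving natural contrast, detail, and color distribution.
Link: https://arxiv.org/abs/2507.19574
Authors: Ghufran Abualhail Alhamzawi,Ali Saeed Alfoudi,Ali Hakem Alsaeedi,Suha Mohammed Hadi,Amjed Abbas Ahmed,Md. Riad Hassan,Nurhizam Safie Mohd Satar,Waeel Yahya Yasseen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: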
Abstract:Enhancing images in low-light conditions is an important challenge in computer vision. Insufficient illumination negatively affects the quality of images, resulting in low contrast, intensive noise, and blurred details. This paper presents a model for enhancing low-light images called tuning adaptive gamma correction (TAGC). The model is based on analyzing the color luminance of the low-light image and calculating the average color to determine the adaptive gamma coefficient. The gamma value is calculated automatically and adaptively at different illumination levels suitable for the image without human intervention or manual adjustment. Based on qualitative and quantitative evaluation, tuning adaptive gamma correction model has effectively improved low-light images while maintaining details, natural contrast, and correct color distribution. It also provides natural visual quality. It can be considered a more efficient solution for processing low-light images in multiple applications such as night surveillance, improving the quality of medical images, and photography in low-light environments.
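The abstract does not give the formula that maps average luminance to the gamma coefficient, so the sketch below uses a common heuristic as a stand-in: choose gamma so that the mean luminance is mapped to a target value, gamma = log(target) / log(mean). The Rec. 601 luminance weights, the `target_mean` parameter, and the function names are assumptions for illustration only and should not be read as the TAGC formula.

```python
import math


def adaptive_gamma(pixels, target_mean=0.5, eps=1e-6):
    """Pick gamma so that the mean luminance maps to target_mean.

    pixels: list of (r, g, b) tuples with channels in [0, 1].
    Heuristic (assumed, not from the paper): gamma = log(target) / log(mean).
    """
    luminance = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]
    mean = sum(luminance) / len(luminance)
    mean = min(max(mean, eps), 1 - eps)  # avoid log(0) and division by log(1)
    return math.log(target_mean) / math.log(mean)


def apply_gamma(pixels, gamma):
    """Apply the power-law transform v -> v ** gamma to every channel."""
    return [tuple(channel ** gamma for channel in px) for px in pixels]


# A uniformly dark image: gamma below 1 brightens it.
dark = [(0.1, 0.1, 0.1)] * 4
gamma = adaptive_gamma(dark)
enhanced = apply_gamma(dark, gamma)
```

Under this heuristic, dark images (mean below the target) get a gamma below 1 and are brightened, while bright images get a gamma above 1 and are darkened, which matches the "no manual adjustment" behavior the abstract describes.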
[CV-184] Rainbow Noise: Stress-Testing Multimodal Harmful-Meme Detectors on LGBTQ Content
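[Quick read]: This paper addresses the problem that hateful memes targeting LGBTQ+ communities evade existing detectors once the image or caption is slightly altered; in other words, it aims to improve the robustness of multimodal safety models to adversarial perturbations. The key contribution is the first robustness benchmark for this setting, which systematically evaluates models under combinations of four realistic caption attacks and three canonical image corruptions, together with a lightweight Text Denoising Adapter (TDA) that substantially improves the stability of MemeBLIP2 under text perturbations and makes it the most robust model overall.
Link: https://arxiv.org/abs/2507.19551
Authors: Ran Tong,Songtao Wei,Jiaqi Liu,Lanruo Wang
Affiliations: University of Texas at Dallas; Naveen Jindal School of Management
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 1 figure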
Abstract:Hateful memes aimed at LGBTQ+ communities often evade detection by tweaking either the caption, the image, or both. We build the first robustness benchmark for this setting, pairing four realistic caption attacks with three canonical image corruptions and testing all combinations on the PrideMM dataset. Two state-of-the-art detectors, MemeCLIP and MemeBLIP2, serve as case studies, and we introduce a lightweight Text Denoising Adapter (TDA) to enhance the latter’s resilience. Across the grid, MemeCLIP degrades more gently, while MemeBLIP2 is particularly sensitive to the caption edits that disrupt its language processing. However, the addition of the TDA not only remedies this weakness but makes MemeBLIP2 the most robust model overall. Ablations reveal that all systems lean heavily on text, but architectural choices and pre-training data significantly impact robustness. Our benchmark exposes where current multimodal safety models crack and demonstrates that targeted, lightweight modules like the TDA offer a powerful path towards stronger defences.
[CV-185] ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation
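[Quick read]: This paper addresses the fact that existing multimodal benchmarks for chart understanding concentrate on question answering or summarization and do not evaluate chart-to-code reconstruction. The core challenge is to recover a precise, machine-readable plotting script from a chart image, which probes how well a model understands and structurally represents data visualizations. The key to the solution is ChartGen, a fully automated pipeline for synthetic chart generation: a vision-language model (VLM) first reconstructs a Python script from each seed chart image, and a code-oriented large language model (LLM) then iteratively augments that script. The result is an open-source synthetic dataset of 222.5K chart-image/code pairs covering many chart types, plotting libraries, and data modalities, which serves as a training and evaluation resource for chart-to-code tasks.
Link: https://arxiv.org/abs/2507.19492
Authors: Jovana Kondic,Pengyuan Li,Dhiraj Joshi,Zexue He,Shafiq Abedin,Jennifer Sun,Ben Wiesel,Eli Schwartz,Ahmed Nassar,Bo Wu,Assaf Arbelle,Aude Oliva,Dan Gutfreund,Leonid Karlinsky,Rogerio Feris
Affiliations: MIT; MIT-IBM Watson AI Labs; IBM Research
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: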
Abstract:Chart-to-code reconstruction – the task of recovering executable plotting scripts from chart images – provides important insights into a model’s ability to ground data visualizations in precise, machine-readable form. Yet many existing multimodal benchmarks focus primarily on answering questions about charts or summarizing them. To bridge this gap, we present ChartGen, a fully-automated pipeline for code-guided synthetic chart generation. Starting from seed chart images, ChartGen (i) prompts a vision-language model (VLM) to reconstruct each image into a Python script, and (ii) iteratively augments that script with a code-oriented large language model (LLM). Using ChartGen, we create 222.5K unique chart-image code pairs from 13K seed chart images, and present an open-source synthetic chart dataset covering 27 chart types, 11 plotting libraries, and multiple data modalities (image, code, text, CSV, DocTags). From this corpus, we curate a held-out chart-to-code evaluation subset of 4.3K chart image-code pairs, and evaluate six open-weight VLMs (3B - 26B parameters), highlighting substantial room for progress. We release the pipeline, prompts, and the dataset to help accelerate efforts towards robust chart understanding and vision-conditioned code generation: this https URL
[CV-186] RISEE: A Highly Interactive Naturalistic Driving Trajectories Dataset with Human Subjective Risk Perception and Eye-tracking Information ITSC2025
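[Quick read]: This paper addresses the lack of human factors in the development of autonomous driving decision-making and planning systems: existing datasets generally omit human subjective risk perception and visual attention, naturalistic driving datasets contain too few safety-critical scenarios, and simulated datasets suffer from low authenticity. The key to the solution is a new dataset that combines high realism with high reproducibility, the Risk-Informed Subjective Evaluation and Eye-tracking (RISEE) dataset. Drone footage of traffic is recorded at a highway ramp merging area, manually selected highly interactive scenarios are reconstructed in simulation to generate first-person view (FPV) videos, and these are paired with participants’ subjective risk ratings and eye-tracking data, supporting quantitative modeling and validation of human cognition.
Link: https://arxiv.org/abs/2507.19490
Authors: Xinzheng Wu,Junyi Chen,Peiyi Wang,Shunxiang Chen,Yong Shen
Affiliations: Tongji University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted for ITSC 2025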
Abstract:In the research and development (R&D) and verification and validation (V&V) phases of autonomous driving decision-making and planning systems, it is necessary to integrate human factors to achieve decision-making and evaluation that align with human cognition. However, most existing datasets primarily focus on vehicle motion states and trajectories, neglecting human-related information. In addition, current naturalistic driving datasets lack sufficient safety-critical scenarios while simulated datasets suffer from low authenticity. To address these issues, this paper constructs the Risk-Informed Subjective Evaluation and Eye-tracking (RISEE) dataset which specifically contains human subjective evaluations and eye-tracking data apart from regular naturalistic driving trajectories. By leveraging the complementary advantages of drone-based (high realism and extensive scenario coverage) and simulation-based (high safety and reproducibility) data collection methods, we first conduct drone-based traffic video recording at a highway ramp merging area. After that, the manually selected highly interactive scenarios are reconstructed in simulation software, and drivers’ first-person view (FPV) videos are generated, which are then viewed and evaluated by recruited participants. During the video viewing process, participants’ eye-tracking data is collected. After data processing and filtering, 3567 valid subjective risk ratings from 101 participants across 179 scenarios are retained, along with 2045 qualified eye-tracking data segments. The collected data and examples of the generated FPV videos are available on our website.
[CV-187] MAIA: A Collaborative Medical AI Platform for Integrated Healthcare Innovation
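[Quick read]: This paper addresses the collaboration barriers that hinder the adoption of Artificial Intelligence (AI) in clinical workflows, namely how to bring together the resources and expertise of AI developers, researchers, and clinicians so that AI research translates into reproducible, transparent, user-centered clinical solutions. The core of the solution is MAIA (Medical Artificial Intelligence Assistant), an open-source platform built on Kubernetes with a modular, scalable architecture and integrated tools for data management, model development, annotation, deployment, and clinical feedback. It supports project isolation, CI/CD automation, and integration with high-performance computing and clinical environments, with the goal of improving interdisciplinary collaboration and accelerating the translation of medical AI.
Link: https://arxiv.org/abs/2507.19489
Authors: Simone Bendazzoli,Sanna Persson,Mehdi Astaraki,Sebastian Pettersson,Vitali Grozman,Rodrigo Moreno
Affiliations: KTH, Royal Institute of Technology; Karolinska Institutet; Stockholm University; Karolinska University Hospital
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 26 pages, 12 figures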
Abstract:The integration of Artificial Intelligence (AI) into clinical workflows requires robust collaborative platforms that are able to bridge the gap between technical innovation and practical healthcare applications. This paper introduces MAIA (Medical Artificial Intelligence Assistant), an open-source platform designed to facilitate interdisciplinary collaboration among clinicians, researchers, and AI developers. Built on Kubernetes, MAIA offers a modular, scalable environment with integrated tools for data management, model development, annotation, deployment, and clinical feedback. Key features include project isolation, CI/CD automation, and integration with high-performance computing infrastructures and clinical workflows. MAIA supports real-world use cases in medical imaging AI, with deployments in both academic and clinical environments. By promoting collaborations and interoperability, MAIA aims to accelerate the translation of AI research into impactful clinical solutions while promoting reproducibility, transparency, and user-centered design. We showcase the use of MAIA with different projects, both at KTH Royal Institute of Technology and Karolinska University Hospital.
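[CV-188] Transfer or Self-Supervised? Bridging the Performance Gap in Medical Imaging
[Quick read]: This paper addresses the difficulty of applying deep learning in the medical domain, which stems from scarce data, high annotation cost, and limited model generalization. The key to the solution is a systematic comparison of the performance and robustness of transfer learning and self-supervised learning on small medical datasets: models are pre-trained on the same source-domain data with different pre-training methods and then compared under typical medical data problems such as data imbalance, data scarcity, and domain mismatch, in order to identify the factors that drive final performance and to derive recommendations for model selection and deployment in medical settings.
Link: https://arxiv.org/abs/2407.05592
Authors: Zehui Zhao,Laith Alzubaidi,Jinglan Zhang,Ye Duan,Usman Naseem,Yuantong Gu
Affiliations: Queensland University of Technology; Clemson University; Macquarie University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 37 pages, 8 figures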
Abstract:Recently, transfer learning and self-supervised learning have gained significant attention within the medical field due to their ability to mitigate the challenges posed by limited data availability, improve model generalisation, and reduce computational expenses. Transfer learning and self-supervised learning hold immense potential for advancing medical research. However, it is crucial to recognise that transfer learning and self-supervised learning architectures exhibit distinct advantages and limitations, manifesting variations in accuracy, training speed, and robustness. This paper compares the performance and robustness of transfer learning and self-supervised learning in the medical field. Specifically, we pre-trained two models using the same source domain datasets with different pre-training methods and evaluated them on small-sized medical datasets to identify the factors influencing their final performance. We tested data with several common issues in medical domains, such as data imbalance, data scarcity, and domain mismatch, through comparison experiments to understand their impact on specific pre-trained models. Finally, we provide recommendations to help users apply transfer learning and self-supervised learning methods in medical areas, and build more convenient and efficient deployment strategies.
[CV-189] NACHOS: Neural Architecture Search for Hardware Constrained Early Exit Neural Networks
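[Quick read]: This paper addresses the lack of automated methods for designing Early Exit Neural Networks (EENNs), in particular how to jointly optimize the backbone and the Early Exit Classifiers (EECs) under hardware constraints (such as an upper bound on the number of MAC operations) to obtain Pareto-optimal trade-offs between accuracy and computational cost. The key to the solution is NACHOS, a Neural Architecture Search (NAS) framework that is the first to automate the joint design of EENNs: it performs constrained multi-objective optimization to produce a set of Pareto-optimal models that respect the accuracy and MAC constraints, and it introduces two novel regularization terms to improve the optimization of the auxiliary classifiers.
Link: https://arxiv.org/abs/2401.13330
Authors: Matteo Gambella,Jary Pomponi,Simone Scardapane,Manuel Roveri
Affiliations: Politecnico di Milano; Sapienza Università di Roma
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 14 pages, 5 figures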
Abstract:Early Exit Neural Networks (EENNs) endow a standard Deep Neural Network (DNN) with Early Exit Classifiers (EECs), to provide predictions at intermediate points of the processing when enough confidence in classification is achieved. This leads to many benefits in terms of effectiveness and efficiency. Currently, the design of EENNs is carried out manually by experts, a complex and time-consuming task that requires accounting for many aspects, including the correct placement, the thresholding, and the computational overhead of the EECs. For this reason, the research is exploring the use of Neural Architecture Search (NAS) to automate the design of EENNs. Currently, few comprehensive NAS solutions for EENNs have been proposed in the literature, and a fully automated, joint design strategy taking into consideration both the backbone and the EECs remains an open problem. To this end, this work presents Neural Architecture Search for Hardware Constrained Early Exit Neural Networks (NACHOS), the first NAS framework for the design of optimal EENNs satisfying constraints on the accuracy and the number of Multiply and Accumulate (MAC) operations performed by the EENNs at inference time. In particular, this provides the joint design of backbone and EECs to select a set of admissible (i.e., respecting the constraints) Pareto Optimal Solutions in terms of best tradeoff between the accuracy and number of MACs. The results show that the models designed by NACHOS are competitive with the state-of-the-art EENNs. Additionally, this work investigates the effectiveness of two novel regularization terms designed for the optimization of the auxiliary classifiers of the EENN.
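The early-exit behavior described in the abstract, returning a prediction at an intermediate classifier once it is confident enough, can be sketched as below. The sketch assumes that confidence is the maximum softmax probability and that each early exit has its own threshold; both choices and all names are illustrative assumptions rather than the NACHOS design. For simplicity the logits of every exit are passed in precomputed, whereas a real EENN would stop computing as soon as an exit fires, which is where the MAC savings come from.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, thresholds):
    """Return (predicted_class, exit_index) for one input.

    exit_logits: logits of each classifier in network order; the last
                 entry is the final classifier and always answers.
    thresholds:  one confidence threshold per early exit
                 (len(exit_logits) - 1 values).
    """
    for index, logits in enumerate(exit_logits[:-1]):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= thresholds[index]:
            return probs.index(confidence), index  # confident: stop here
    final_probs = softmax(exit_logits[-1])
    return final_probs.index(max(final_probs)), len(exit_logits) - 1


# An easy input exits early; an ambiguous one falls through to the end.
easy = early_exit_predict([[4.0, 0.0], [0.0, 4.0]], thresholds=[0.9])
hard = early_exit_predict([[0.1, 0.2], [5.0, 0.0]], thresholds=[0.9])
```

Lower thresholds make more inputs exit early (fewer MACs, possibly lower accuracy), which is the trade-off the abstract says is searched over jointly with the backbone.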
[CV-190] Onboard Hyperspectral Super-Resolution with Deep Pushbroom Neural Network
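[Quick read]: This paper addresses the limited spatial resolution of satellite hyperspectral imaging, with the aim of improving downstream tasks such as detection and recognition. The core challenge is to super-resolve images efficiently without a large increase in computational complexity or memory footprint, so that the processing can run onboard in real time. The key to the solution is a new neural network architecture called Deep Pushbroom Super-Resolution (DPSR), which processes the image line by line in the along-track direction and uses a causal memory mechanism to exploit previously acquired lines, greatly reducing memory requirements and computational complexity and enabling real-time super-resolution on low-power hardware.
Link: https://arxiv.org/abs/2507.20765
Authors: Davide Piccinini,Diego Valsesia,Enrico Magli
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: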
Abstract:Hyperspectral imagers on satellites obtain the fine spectral signatures essential for distinguishing one material from another at the expense of limited spatial resolution. Enhancing the latter is thus a desirable preprocessing step in order to further improve the detection capabilities offered by hyperspectral images on downstream tasks. At the same time, there is a growing interest towards deploying inference methods directly onboard satellites, which calls for lightweight image super-resolution methods that can be run on the payload in real time. In this paper, we present a novel neural network design, called Deep Pushbroom Super-Resolution (DPSR), that matches the pushbroom acquisition of hyperspectral sensors by processing an image line by line in the along-track direction with a causal memory mechanism to exploit previously acquired lines. This design greatly limits memory requirements and computational complexity, achieving onboard real-time performance, i.e., the ability to super-resolve a line in the time it takes to acquire the next one, on low-power hardware. Experiments show that the quality of the super-resolved images is competitive with, or even better than, that of state-of-the-art methods that are significantly more complex.
[CV-191] SkinDualGen: Prompt-Driven Diffusion for Simultaneous Image-Mask Generation in Skin Lesions
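【Quick read】: This paper addresses the limited performance of deep learning models in medical image analysis caused by data scarcity and class imbalance. The core solution uses the pretrained Stable Diffusion-2.0 model to generate high-quality synthetic skin lesion images together with their segmentation masks. Domain-specific Low-Rank Adaptation (LoRA) fine-tuning, jointly optimized with multi-objective loss functions, lets the model generate clinically relevant images and masks from a text description in a single step. The method improves classification and segmentation models, raising accuracy and F1-score by 8% to 15% and improving key metrics such as the Dice coefficient and intersection over union (IoU), and it offers a scalable data augmentation approach for diagnosing rare diseases.
Link: https://arxiv.org/abs/2507.19970
Authors: Zhaobin Xu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: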
Abstract:Medical image analysis plays a pivotal role in the early diagnosis of diseases such as skin lesions. However, the scarcity of data and the class imbalance significantly hinder the performance of deep learning models. We propose a novel method that leverages the pretrained Stable Diffusion-2.0 model to generate high-quality synthetic skin lesion images and corresponding segmentation masks. This approach augments training datasets for classification and segmentation tasks. We adapt Stable Diffusion-2.0 through domain-specific Low-Rank Adaptation (LoRA) fine-tuning and joint optimization of multi-objective loss functions, enabling the model to simultaneously generate clinically relevant images and segmentation masks conditioned on textual descriptions in a single step. Experimental results show that the generated images, validated by FID scores, closely resemble real images in quality. A hybrid dataset combining real and synthetic data markedly enhances the performance of classification and segmentation models, achieving substantial improvements in accuracy and F1-score of 8% to 15%, with additional positive gains in other key metrics such as the Dice coefficient and IoU. Our approach offers a scalable solution to address the challenges of medical imaging data, contributing to improved accuracy and reliability in diagnosing rare diseases.
[CV-192] Taming Domain Shift in Multi-source CT-Scan Classification via Input-Space Standardization ICCV
【Quick read】: This paper addresses the loss of cross-source generalization caused by domain shift in multi-source CT image classification. The key to the solution is an input-space standardization strategy: a preprocessing pipeline composed of Spatial-Slice Feature Learning (SSFL++) and Kernel-Density-based Slice Sampling (KDS) standardizes the spatial and temporal dimensions to reduce inter-source differences and maps heterogeneous inputs into a consistent target space. This mitigates domain shift, simplifies the network's optimization task, and improves robustness in multi-institutional medical imaging.
Link: https://arxiv.org/abs/2507.19858
Authors: Chia-Ming Lee, Bo-Cheng Qiu, Ting-Yao Chen, Ming-Han Sun, Fang-Ying Lin, Jung-Tse Tsai, I-An Tsai, Yu-Fan Lin, Chih-Chung Hsu
Affiliations: National Cheng Kung University; National Yang Ming Chiao Tung University
Subjects: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ICCVW 2025, Winner solution of PHAROS-AFE-AIMI Workshop’s Multi-Source Covid-19 Detection Challenge
Abstract:Multi-source CT-scan classification suffers from domain shifts that impair cross-source generalization. While preprocessing pipelines combining Spatial-Slice Feature Learning (SSFL++) and Kernel-Density-based Slice Sampling (KDS) have shown empirical success, the mechanisms underlying their domain robustness remain underexplored. This study analyzes how this input-space standardization manages the trade-off between local discriminability and cross-source generalization. The SSFL++ and KDS pipeline performs spatial and temporal standardization to reduce inter-source variance, effectively mapping disparate inputs into a consistent target space. This preemptive alignment mitigates domain shift and simplifies the learning task for network optimization. Experimental validation demonstrates consistent improvements across architectures, proving the benefits stem from the preprocessing itself. The approach’s effectiveness was validated by securing first place in a competitive challenge, supporting input-space standardization as a robust and practical solution for multi-institutional medical imaging.
[CV-193] Hybrid Deep Learning and Handcrafted Feature Fusion for Mammographic Breast Cancer Classification
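【Quick read】: This paper addresses the difficulty of automated breast cancer classification in mammography, where the differences between benign and malignant tissue are subtle. The key to the solution is a hybrid framework that fuses deep convolutional features extracted by ResNet-50 with handcrafted descriptors and transformer-based DINOv2 embeddings. Experiments show that this fusion strengthens the representation of complex texture and structural information and improves performance (AUC rises from 78.1% to 79.6%, with an F1 score of 67.4%). It also outperforms transformer embeddings alone while retaining architectural simplicity and computational efficiency, which gives it practical potential for clinical decision support.
Link: https://arxiv.org/abs/2507.19843
Authors: Maximilian Tschuchnig, Michael Gadermayr, Khalifa Djemal
Affiliations: Salzburg University of Applied Sciences; University of Salzburg; University of Evry Paris-Saclay
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IPTA2025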
Abstract:Automated breast cancer classification from mammography remains a significant challenge due to subtle distinctions between benign and malignant tissue. In this work, we present a hybrid framework combining deep convolutional features from a ResNet-50 backbone with handcrafted descriptors and transformer-based embeddings. Using the CBIS-DDSM dataset, we benchmark our ResNet-50 baseline (AUC: 78.1%) and demonstrate that fusing handcrafted features with deep ResNet-50 and DINOv2 features improves AUC to 79.6% (setup d1), with a peak recall of 80.5% (setup d1) and highest F1 score of 67.4% (setup d1). Our experiments show that handcrafted features not only complement deep representations but also enhance performance beyond transformer-based embeddings. This hybrid fusion approach achieves results comparable to state-of-the-art methods while maintaining architectural simplicity and computational efficiency, making it a practical and effective solution for clinical decision support.
[CV-194] SpecBPP: A Self-Supervised Learning Approach for Hyperspectral Representation and Soil Organic Carbon Estimation
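【Quick read】: This paper addresses the fact that representation learning for hyperspectral imagery (HSI) remains underexplored, particularly the use of self-supervised learning to improve a model's understanding of HSI features. The key to the solution is a new pretext task, Spectral Band Permutation Prediction (SpecBPP): the model must recover the correct order of randomly shuffled spectral segments, which encourages a global understanding of spectral continuity rather than relying on conventional strategies such as masked reconstruction. To cope with the factorial complexity of the permutation space, the authors add a curriculum-based training strategy that gradually increases permutation difficulty. Experiments show that the method clearly outperforms MAE and JEPA baselines on Soil Organic Carbon (SOC) estimation, supporting spectral order prediction as an effective pretext task.
Link: https://arxiv.org/abs/2507.19781
Authors: Daniel La’ah Ayuba, Jean-Yves Guillemaut, Belen Marti-Cardona, Oscar Mendez Maldonado
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: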
Abstract:Self-supervised learning has revolutionized representation learning in vision and language, but remains underexplored for hyperspectral imagery (HSI), where the sequential structure of spectral bands offers unique opportunities. In this work, we propose Spectral Band Permutation Prediction (SpecBPP), a novel self-supervised learning framework that leverages the inherent spectral continuity in HSI. Instead of reconstructing masked bands, SpecBPP challenges a model to recover the correct order of shuffled spectral segments, encouraging global spectral understanding. We implement a curriculum-based training strategy that progressively increases permutation difficulty to manage the factorial complexity of the permutation space. Applied to Soil Organic Carbon (SOC) estimation using EnMAP satellite data, our method achieves state-of-the-art results, outperforming both masked autoencoder (MAE) and joint-embedding predictive (JEPA) baselines. Fine-tuned on limited labeled samples, our model yields an R^2 of 0.9456, RMSE of 1.1053%, and RPD of 4.19, significantly surpassing traditional and self-supervised benchmarks. Our results demonstrate that spectral order prediction is a powerful pretext task for hyperspectral understanding, opening new avenues for scientific representation learning in remote sensing and beyond.
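The permutation pretext task and its curriculum can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the helper names `make_permutation_sample` and `segments_for_epoch`, the equal-length segmentation, and the linear schedule are all hypothetical choices. The printed class count shows why a curriculum helps: with k segments there are k! possible orders.

```python
import math
import random


def make_permutation_sample(spectrum, num_segments, rng):
    """Split a spectrum into equal segments, shuffle them, and return (input, label).

    The label is the applied order: shuffled segment k came from original
    segment order[k]. Trailing bands that do not fill a segment are dropped.
    """
    seg_len = len(spectrum) // num_segments
    segments = [spectrum[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    order = list(range(num_segments))
    rng.shuffle(order)
    shuffled = [value for idx in order for value in segments[idx]]
    return shuffled, order


def segments_for_epoch(epoch, num_epochs, start=2, stop=6):
    """Hypothetical linear curriculum: few segments early, more segments later."""
    return start + (stop - start) * epoch // max(num_epochs - 1, 1)


rng = random.Random(0)
spectrum = list(range(12))  # stand-in for 12 ordered band values of one pixel
for epoch in range(3):
    k = segments_for_epoch(epoch, num_epochs=3, start=2, stop=4)
    shuffled, order = make_permutation_sample(spectrum, k, rng)
    print(f"epoch={epoch} segments={k} classes={math.factorial(k)} order={order}")
```

A model trained on such pairs would take `shuffled` as input and predict `order`; how the paper encodes the label (for example as a class index or per-position prediction) is not stated in the abstract.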
[CV-195] A Machine Learning Framework for Predicting Microphysical Properties of Ice Crystals from Cloud Particle Imagery
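【Quick read】: This paper addresses the difficulty of accurately measuring ice crystal microphysical properties, such as effective density \rho_e, effective surface area A_e, and number of bullets N_b, from in situ two-dimensional (2D) imagery. These properties matter for understanding cloud radiative properties and for climate modeling. The key to the solution is a machine learning (ML) framework: synthetic ice crystals are first generated with three-dimensional (3D) modeling software, using geometric parameters consistent with observations from the 2021 ICEBall field campaign; the synthetic images are then used to train deep neural networks (such as ResNet-18) to predict these properties from single-view or two-view (stereo) 2D images. For single-view models, R^2 reaches 0.99 for \rho_e and 0.98 for A_e, and the F1 score for N_b is 0.91. Adding a second view reduces RMSE by 40% and raises the F1 score by 8%, improving robustness and accuracy and providing constraints for cloud microphysical parameterizations.
Link: https://arxiv.org/abs/2507.19759
Authors: Joseph Ko, Jerry Harrington, Kara Sulia, Vanessa Przybylo, Marcus van Lier-Walqui, Kara Lamb
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments: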
Abstract:The microphysical properties of ice crystals are important because they significantly alter the radiative properties and spatiotemporal distributions of clouds, which in turn strongly affect Earth’s climate. However, it is challenging to measure key properties of ice crystals, such as mass or morphological features. Here, we present a framework for predicting three-dimensional (3D) microphysical properties of ice crystals from in situ two-dimensional (2D) imagery. First, we computationally generate synthetic ice crystals using 3D modeling software along with geometric parameters estimated from the 2021 Ice Cryo-Encapsulation Balloon (ICEBall) field campaign. Then, we use synthetic crystals to train machine learning (ML) models to predict effective density ( \rho_e ), effective surface area ( A_e ), and number of bullets ( N_b ) from synthetic rosette imagery. When tested on unseen synthetic images, we find that our ML models can predict microphysical properties with high accuracy. For \rho_e and A_e , respectively, our best-performing single view models achieved R^2 values of 0.99 and 0.98. For N_b , our best single view model achieved a balanced accuracy and F1 score of 0.91. We also quantify the marginal prediction improvements from incorporating a second view. A stereo view ResNet-18 model reduced RMSE by 40% for both \rho_e and A_e , relative to a single view ResNet-18 model. For N_b , we find that a stereo view ResNet-18 model improved the F1 score by 8%. This work provides a novel ML-driven framework for estimating ice microphysical properties from in situ imagery, which will allow for downstream constraints on microphysical parameterizations, such as the mass-size relationship.
[CV-196] A Metabolic-Imaging Integrated Model for Prognostic Prediction in Colorectal Liver Metastases
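【Quick read】: This paper addresses the insufficient accuracy of postoperative recurrence risk prediction for patients with colorectal liver metastases (CRLM), where conventional clinical models have limited prognostic value. The key to the solution is a machine learning model built on preoperative baseline clinical parameters and radiomic features from contrast-enhanced CT. Restricting the inputs in this way avoids data leakage and improves clinical applicability and reliability. The 3-month recurrence model reaches an AUC of 0.723 in cross-validation, and decision curve analysis shows greater net benefit than the "treat-all" or "treat-none" strategies, supporting its value for postoperative surveillance and treatment decisions.
Link: https://arxiv.org/abs/2507.19734
Authors: Qinlong Li, Pu Sun, Guanlin Zhu, Tianjiao Liang, Honggang QI
Affiliations: University of Chinese Academy of Sciences
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 8 pages, 4 figures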
Abstract:Prognostic evaluation in patients with colorectal liver metastases (CRLM) remains challenging due to suboptimal accuracy of conventional clinical models. This study developed and validated a robust machine learning model for predicting postoperative recurrence risk. Preliminary ensemble models achieved exceptionally high performance (AUC 0.98) but incorporated postoperative features, introducing data leakage risks. To enhance clinical applicability, we restricted input variables to preoperative baseline clinical parameters and radiomic features from contrast-enhanced CT imaging, specifically targeting recurrence prediction at 3, 6, and 12 months postoperatively. The 3-month recurrence prediction model demonstrated optimal performance with an AUC of 0.723 in cross-validation. Decision curve analysis revealed that across threshold probabilities of 0.55-0.95, the model consistently provided greater net benefit than “treat-all” or “treat-none” strategies, supporting its utility in postoperative surveillance and therapeutic decision-making. This study successfully developed a robust predictive model for early CRLM recurrence with confirmed clinical utility. Importantly, it highlights the critical risk of data leakage in clinical prognostic modeling and proposes a rigorous framework to mitigate this issue, enhancing model reliability and translational value in real-world settings.
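For readers unfamiliar with decision curve analysis, the comparison against "treat-all" and "treat-none" rests on the standard net-benefit quantity, NB = TP/n - (FP/n) * p_t / (1 - p_t), where p_t is the threshold probability. The sketch below computes it on made-up toy labels and scores; the function names are hypothetical, and nothing here reproduces the paper's data or pipeline.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is at least `threshold`."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating everyone; treat-none is 0 by definition."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data for illustration only.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.2]
for pt in (0.3, 0.5, 0.65):
    print(pt, net_benefit(y_true, y_prob, pt), net_benefit_treat_all(y_true, pt))
```

A decision curve plots these values over a range of thresholds; the claim in the abstract corresponds to the model's curve lying above both reference strategies for thresholds between 0.55 and 0.95.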
[CV-197] Review of Deep Learning Applications to Structural Proteomics Enabled by Cryogenic Electron Microscopy and Tomography
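【Quick read】: This paper addresses long-standing technical bottlenecks of cryogenic electron microscopy (cryoEM) and cryogenic electron tomography (cryoET) in structural biology, including low signal-to-noise ratios, preferred orientation artifacts, and the missing-wedge problem, which limit the efficiency and scalability of data processing. The key to the solution is the systematic integration of deep learning across the entire cryoEM workflow: convolutional neural networks (Topaz, crYOLO, CryoSegNet) automate particle picking; spIsoNet and cryoPROS correct orientation bias; the U-Net-based IsoNet performs missing-wedge correction and denoising simultaneously; TomoNet increases the automation of subtomogram averaging; and ModelAngelo, DeepTracer, and CryoREAD build atomic models automatically from density maps. These AI-enhanced methods reach near-atomic resolution with far less manual intervention and have been applied to a range of complex biological systems, moving structural biology toward a high degree of automation.
Link: https://arxiv.org/abs/2507.19565
Authors: Brady K. Zhou, Jason J. Hu, Jane K.J. Lee, Z. Hong Zhou, Demetri Terzopoulos
Affiliations: Unknown
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages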
Abstract:The past decade’s “cryoEM revolution” has produced exponential growth in high-resolution structural data through advances in cryogenic electron microscopy (cryoEM) and tomography (cryoET). Deep learning integration into structural proteomics workflows addresses longstanding challenges including low signal-to-noise ratios, preferred orientation artifacts, and missing-wedge problems that historically limited efficiency and scalability. This review examines AI applications across the entire cryoEM pipeline, from automated particle picking using convolutional neural networks (Topaz, crYOLO, CryoSegNet) to computational solutions for preferred orientation bias (spIsoNet, cryoPROS) and advanced denoising algorithms (Topaz-Denoise). In cryoET, tools like IsoNet employ U-Net architectures for simultaneous missing-wedge correction and noise reduction, while TomoNet streamlines subtomogram averaging through AI-driven particle detection. The workflow culminates with automated atomic model building using sophisticated tools like ModelAngelo, DeepTracer, and CryoREAD that translate density maps into interpretable biological structures. These AI-enhanced approaches have achieved near-atomic resolution reconstructions with minimal manual intervention, resolved previously intractable datasets suffering from severe orientation bias, and enabled successful application to diverse biological systems from HIV virus-like particles to in situ ribosomal complexes. As deep learning evolves, particularly with large language models and vision transformers, the future promises sophisticated automation and accessibility in structural biology, potentially revolutionizing our understanding of macromolecular architecture and function.
[CV-198] Multipath Interference Suppression in Indirect Time-of-Flight Imaging via a Novel Compressed Sensing Framework
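【Quick read】: This paper addresses the limited depth reconstruction accuracy and multi-target separation capability of indirect Time-of-Flight (iToF) systems. Conventional methods typically rely on hardware modifications, complex modulation strategies, or data-driven reconstruction algorithms, which are complex to implement or limited in performance. The paper proposes a new compressed sensing method based on single-frequency modulation, with two key elements. First, the sensing matrix is built from multiple phase shifts and narrow-duty-cycle continuous waves and accounts for pixel-wise range variation caused by lens distortion, so that it better matches the actual modulation response. Second, in the sparse recovery stage, K-Means clustering groups the distance response dictionary, and atom selection is constrained within each cluster during Orthogonal Matching Pursuit (OMP), which shrinks the search space and improves solution stability. The approach achieves more accurate and robust depth reconstruction without additional hardware changes.
Link: https://arxiv.org/abs/2507.19546
Authors: Yansong Du, Yutong Deng, Yuting Zhou, Feiyu Jiao, Bangyao Wang, Zhancong Xu, Zhaoxiang Jiang, Xun Guan
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 10 figures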
Abstract:We propose a novel compressed sensing method to improve the depth reconstruction accuracy and multi-target separation capability of indirect Time-of-Flight (iToF) systems. Unlike traditional approaches that rely on hardware modifications, complex modulation, or cumbersome data-driven reconstruction, our method operates with a single modulation frequency and constructs the sensing matrix using multiple phase shifts and narrow-duty-cycle continuous waves. During matrix construction, we further account for pixel-wise range variation caused by lens distortion, making the sensing matrix better aligned with actual modulation response characteristics. To enhance sparse recovery, we apply K-Means clustering to the distance response dictionary and constrain atom selection within each cluster during the OMP process, which effectively reduces the search space and improves solution stability. Experimental results demonstrate that the proposed method outperforms traditional approaches in both reconstruction accuracy and robustness, without requiring any additional hardware changes.
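The cluster-constrained OMP step can be made concrete with a small pure-Python sketch. Several things here are assumptions rather than the paper's method: the abstract does not say exactly how selection is constrained, so this version reads it as "at most one atom per cluster"; cluster labels are supplied directly instead of coming from K-Means; atoms are assumed unit-norm and linearly independent; and the names `clustered_omp` and `least_squares` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def least_squares(columns, y):
    """Solve the normal equations (A^T A) x = A^T y by Gauss-Jordan elimination.

    `columns` holds the selected atoms; they are assumed linearly independent.
    """
    k = len(columns)
    aug = [[dot(columns[i], columns[j]) for j in range(k)] + [dot(columns[i], y)]
           for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def clustered_omp(atoms, cluster_of, y, sparsity):
    """OMP that picks at most one atom per cluster (an assumed reading of the constraint)."""
    residual = list(y)
    chosen, used_clusters, coeffs = [], set(), []
    for _ in range(sparsity):
        candidates = [j for j in range(len(atoms)) if cluster_of[j] not in used_clusters]
        if not candidates:
            break
        # Greedy step: the candidate atom most correlated with the residual.
        best = max(candidates, key=lambda j: abs(dot(atoms[j], residual)))
        chosen.append(best)
        used_clusters.add(cluster_of[best])
        # Orthogonal step: refit all chosen atoms jointly, then update the residual.
        coeffs = least_squares([atoms[j] for j in chosen], y)
        residual = [yi - sum(c * atoms[j][i] for c, j in zip(coeffs, chosen))
                    for i, yi in enumerate(y)]
    return dict(zip(chosen, coeffs))


atoms = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cluster_of = [0, 0, 1, 2]   # atoms 0 and 1 are near-duplicates in one cluster
y = [2.0, 0.0, 3.0]         # 2 * atom 0 + 3 * atom 3
print(clustered_omp(atoms, cluster_of, y, sparsity=2))
```

Excluding a cluster once it has contributed an atom keeps the greedy step from choosing two highly correlated atoms, which is one plausible way the search space shrinks and the least-squares refit stays well conditioned.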
Artificial Intelligence
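[AI-0] A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
【Quick read】: This paper addresses the inherently static nature of large language models (LLMs): they cannot adapt their internal parameters to new tasks, evolving knowledge domains, or interactive environments, which limits their effectiveness in open-ended, interactive settings. The key to the solution is the paradigm of "self-evolving agents", organized systematically around three core dimensions (what to evolve, when to evolve, and how to evolve) to give agents continual learning and adaptation at run time. The framework covers evolutionary mechanisms for components such as models, memory, tools, and architecture; adaptation methods at different stages (intra-test-time and inter-test-time); and designs that drive evolution, such as scalar rewards, textual feedback, and single-agent and multi-agent systems. It aims to move agents toward autonomous evolution and to provide a theoretical and practical basis for Artificial Super Intelligence (ASI).
Link: https://arxiv.org/abs/2507.21046
Authors: Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenghailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 51 pages, 9 figures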
Abstract:Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift – from scaling static models to developing self-evolving agents – has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions – what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding lights to pave the way for the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.
[AI-1] GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
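【Quick read】: This paper addresses the difficulty of extracting biological insight from raw transcriptomic data in gene expression analysis, a process made challenging by multiple large, semi-structured files and a heavy reliance on domain expertise. Existing automation is limited either by rigid workflows that fail in edge cases or by fully autonomous agents that lack the precision rigorous science requires. The key to the solution is GenoMAS, a multi-agent system of six specialized scientists based on large language models (LLMs) that cooperate through typed message-passing protocols. A guided-planning framework decomposes high-level task guidelines into executable Action Units, and at each decision point the agent chooses to advance, revise, bypass, or backtrack, preserving logical coherence while adapting to the idiosyncrasies of genomic data. The design balances the reliability of structured workflows with the adaptability of autonomous agents. On the GenoTEX benchmark it clearly outperforms existing methods, and it surfaces literature-supported gene-phenotype associations while adjusting for latent confounders.
Link: https://arxiv.org/abs/2507.21035
Authors: Haoyang Liu, Yijiang Li, Haohan Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Genomics (q-bio.GN)
Comments: 51 pages, 5 figures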
Abstract:Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F_1 of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at this https URL.
[AI-2] Smart Expansion Techniques for ASP-based Interactive Configuration
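【Quick read】: This paper addresses the performance bottleneck of interactive product configuration systems on large-scale industrial configuration problems, in particular how to complete a partial configuration automatically and efficiently. The key to the solution is an improvement of the classical multi-shot solving approach with four smart expansion functions: using cautious and brave consequences, each iteration determines and adds specific objects or associations in advance, which reduces the number of costly unsatisfiability checks, shrinks the search space, and improves solving efficiency.
Link: https://arxiv.org/abs/2507.21027
Authors: Lucia Balážová, Richard Comploi-Taupe, Susana Hahn, Nicolas Rühling, Gottfried Schenner
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Under consideration for publication in Theory and Practice of Logic Programming (TPLP)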
Abstract:Product configuration is a successful application of Answer Set Programming (ASP). However, challenges are still open for interactive systems to effectively guide users through the configuration process. The aim of our work is to provide an ASP-based solver for interactive configuration that can deal with large-scale industrial configuration problems and that supports intuitive user interfaces via an API. In this paper, we focus on improving the performance of automatically completing a partial configuration. Our main contribution enhances the classical incremental approach for multi-shot solving by four different smart expansion functions. The core idea is to determine and add specific objects or associations to the partial configuration by exploiting cautious and brave consequences before checking for the existence of a complete configuration with the current objects in each iteration. This approach limits the number of costly unsatisfiability checks and reduces the search space, thereby improving solving performance. In addition, we present a user interface that uses our API and is implemented in ASP.
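The two notions the abstract relies on have simple set definitions: the cautious consequences of a program are the atoms true in every answer set, and the brave consequences are the atoms true in at least one. The sketch below computes both from explicitly listed answer sets. The atoms and the toy program are invented for illustration, and how the paper's four expansion functions actually use these sets is not specified here; real ASP solvers such as clingo provide dedicated cautious and brave reasoning modes rather than enumerating all answer sets.

```python
def cautious_consequences(answer_sets):
    """Atoms contained in every answer set."""
    sets = [set(s) for s in answer_sets]
    return set.intersection(*sets) if sets else set()


def brave_consequences(answer_sets):
    """Atoms contained in at least one answer set."""
    return set().union(*answer_sets)


# Toy answer sets of a hypothetical configuration program.
answer_sets = [
    {"rack(1)", "frame(1)", "module(a)"},
    {"rack(1)", "frame(1)", "module(b)"},
]
cautious = cautious_consequences(answer_sets)
brave = brave_consequences(answer_sets)
print(sorted(cautious))  # ['frame(1)', 'rack(1)']
print(sorted(brave))
```

In an expansion loop, atoms in the cautious set are safe to add to the partial configuration because every completion contains them, while objects absent from the brave set can be ruled out because no completion contains them.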
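[AI-3] MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them
【Quick read】: This paper addresses the lack of a unified benchmark for evaluating hallucinative actions by large language model (LLM) agents in interactive scenarios, where fabricated or misinterpreted information in the cognitive context leads to such actions. Existing understanding of these risks is fragmented, and there is no systematic testbed for identifying and quantifying hallucinated behavior. The key to the solution is MIRAGE-Bench, the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM agents. It includes a three-part taxonomy (actions unfaithful to task instructions, execution history, or environment observations), synthesizes test cases by systematically auditing existing agent benchmarks and using a snapshot strategy to isolate decision points, and adopts a fine-grained LLM-as-a-Judge paradigm with risk-aware prompts for scalable, high-fidelity evaluation. Together these provide actionable insights and a principled basis for mitigating hallucinations.
Link: https://arxiv.org/abs/2507.21017
Authors: Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, Dawn Song
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Code and data: this https URL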
Abstract:Hallucinations pose critical risks for large language model (LLM)-based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE-Bench–Measuring Illusions in Risky AGEnt settings–the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To analyze, we first elicit such failures by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in deterministic and reproducible manners. To evaluate hallucination behaviors, we adopt a fine-grained-level LLM-as-a-Judge paradigm with tailored risk-aware prompts, enabling scalable, high-fidelity assessment of agent actions without enumerating full action spaces. MIRAGE-Bench provides actionable insights on failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.
[AI-4] Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability
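【Quick read】: This paper addresses the limited deployment of Deep Neural Networks (DNNs) in high-stakes domains: their black-box nature lacks transparency and cannot meet interpretability requirements. The key to the solution is a new framework, Compositional Function Networks (CFNs), which builds models by composing elementary mathematical functions with clear semantics and supports sequential, parallel, and conditional composition, enabling complex feature interactions while keeping the model transparent. The core innovation is that CFNs are fully differentiable and can be trained efficiently with standard gradient descent, combining the expressiveness of deep learning with interpretability.
Link: https://arxiv.org/abs/2507.21004
Authors: Fang Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: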
Abstract:Deep Neural Networks (DNNs) deliver impressive performance but their black-box nature limits deployment in high-stakes domains requiring transparency. We introduce Compositional Function Networks (CFNs), a novel framework that builds inherently interpretable models by composing elementary mathematical functions with clear semantics. Unlike existing interpretable approaches that are limited to simple additive structures, CFNs support diverse compositional patterns – sequential, parallel, and conditional – enabling complex feature interactions while maintaining transparency. A key innovation is that CFNs are fully differentiable, allowing efficient training through standard gradient descent. We demonstrate CFNs’ versatility across multiple domains, from symbolic regression to image classification with deep hierarchical networks. Our empirical evaluation shows CFNs achieve competitive performance against black-box models (96.24% accuracy on CIFAR-10) while outperforming state-of-the-art interpretable models like Explainable Boosting Machines. By combining the hierarchical expressiveness and efficient training of deep learning with the intrinsic interpretability of well-defined mathematical functions, CFNs offer a powerful framework for applications where both performance and accountability are paramount.
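The three compositional patterns named in the abstract (sequential, parallel, conditional) can be illustrated with plain function combinators. This is only a sketch of the idea, not the CFN library or its node types: the combinator names, the choice to sum parallel branches, and the soft sigmoid gate used to keep the conditional differentiable are all assumptions made for the example.

```python
import math


def sequential(*functions):
    """Feed the output of each function into the next."""
    def composed(x):
        for f in functions:
            x = f(x)
        return x
    return composed


def parallel(*functions):
    """Apply every function to the same input and sum the results."""
    return lambda x: sum(f(x) for f in functions)


def conditional(gate, if_true, if_false):
    """Soft branch: gate(x) in [0, 1] weights the two branches, so the result stays differentiable."""
    return lambda x: gate(x) * if_true(x) + (1.0 - gate(x)) * if_false(x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# A readable model: (2x + sin x), then a soft choice between z + 1 and -z.
model = sequential(
    parallel(lambda x: 2.0 * x, math.sin),
    conditional(sigmoid, lambda z: z + 1.0, lambda z: -z),
)
print(model(0.0))  # 0.5
```

Because every node is a named elementary function, the whole model can be read back as a formula, which is the sense in which such compositions are interpretable; training would additionally attach learnable parameters to each node.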
[AI-5] Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition ICLR2025
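【Quick read】: This paper addresses task interference, catastrophic forgetting, and lack of reversibility when models in real-world machine learning deployments are continually updated, composed, and selectively undone. The key to the solution is Modular Delta Merging with Orthogonal Constraints (MDM-OC): each task-specific model is represented as a delta from a shared base model and projected into an orthogonal subspace to eliminate conflict, enabling interference-free merging. The orthogonalized deltas are merged via gradient-based optimization into a unified model that retains per-task performance. The framework supports continual integration of new models, structured unmerging (for example, for GDPR compliance), and improved stability through elastic weight consolidation and synthetic replay.
Link: https://arxiv.org/abs/2507.20997
Authors: Haris Khan, Shumaila Asif, Sadia Asif
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures, 3 tables. Will be Submitted to ICLR 2025 for review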
Abstract:In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.
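The delta-and-projection idea can be sketched on flat weight vectors. This is a simplified illustration, not MDM-OC itself: the paper merges deltas via gradient-based optimization, whereas this sketch swaps in plain Gram-Schmidt projection followed by addition, and it omits elastic weight consolidation and synthetic replay. The class `DeltaStore` and its methods are hypothetical names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(delta, basis):
    """Remove from `delta` its components along previously stored, mutually orthogonal deltas."""
    out = list(delta)
    for b in basis:
        norm_sq = dot(b, b)
        if norm_sq > 0:
            scale = dot(out, b) / norm_sq
            out = [o - scale * x for o, x in zip(out, b)]
    return out


class DeltaStore:
    """Keeps a shared base plus one orthogonalized delta per task."""

    def __init__(self, base):
        self.base = list(base)
        self.deltas = {}

    def merge(self, name, finetuned):
        delta = [f - b for f, b in zip(finetuned, self.base)]
        self.deltas[name] = orthogonalize(delta, list(self.deltas.values()))

    def unmerge(self, name):
        # Reversal is a deletion because each task's contribution is stored separately.
        del self.deltas[name]

    def weights(self):
        merged = list(self.base)
        for delta in self.deltas.values():
            merged = [m + d for m, d in zip(merged, delta)]
        return merged


store = DeltaStore(base=[0.0, 0.0, 0.0])
store.merge("task_a", [1.0, 0.0, 0.0])
store.merge("task_b", [1.0, 1.0, 0.0])   # overlaps with task_a along the first axis
print(store.weights())
store.unmerge("task_a")
print(store.weights())
```

Storing each task's contribution as its own orthogonal delta is what makes removal exact in this toy version: deleting one delta leaves the others untouched, which mirrors the reversibility the abstract emphasizes.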
[AI-6] Personalized Treatment Effect Estimation from Unstructured Data
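【Quick read】: This paper addresses how to estimate personalized treatment effects (PTE) from unstructured data, such as clinical notes or medical images, without relying on structured covariates. Conventional methods depend on structured covariates and are hard to apply directly to the unstructured data common in practice. The solution proposes three progressively more refined strategies. First, an approximate "plug-in" method is trained directly on neural representations of the unstructured data. Second, two theoretically grounded estimators use structured covariate information during training to remove confounding bias while needing only unstructured inputs to predict PTE. Third, for the case where structured covariates exist only for a non-representative subset, a regression-based correction mitigates sampling bias, assuming the sampling mechanism is known or can be estimated accurately. Experiments show that the plug-in method, although the simplest, achieves strong empirical performance across all settings.
Link: https://arxiv.org/abs/2507.20993
Authors: Henri Arno, Thomas Demeester
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: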
Abstract:Existing methods for estimating personalized treatment effects typically rely on structured covariates, limiting their applicability to unstructured data. Yet, leveraging unstructured data for causal inference has considerable application potential, for instance in healthcare, where clinical notes or medical images are abundant. To this end, we first introduce an approximate ‘plug-in’ method trained directly on the neural representations of unstructured data. However, when these fail to capture all confounding information, the method may be subject to confounding bias. We therefore introduce two theoretically grounded estimators that leverage structured measurements of the confounders during training, but allow estimating personalized treatment effects purely from unstructured inputs, while avoiding confounding bias. When these structured measurements are only available for a non-representative subset of the data, these estimators may suffer from sampling bias. To address this, we further introduce a regression-based correction that accounts for the non-uniform sampling, assuming the sampling mechanism is known or can be well-estimated. Our experiments on two benchmark datasets show that the plug-in method, directly trainable on large unstructured datasets, achieves strong empirical performance across all settings, despite its simplicity.
[AI-7] SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
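【Quick Read】: This paper addresses the difficulty of deploying large language models (LLMs) on local devices: frontier LLMs depend on GPU-accelerated cloud infrastructure and run poorly on end devices with weak compute, limited memory, and slow storage. The key to the solution is a "deployment-aware architecture" designed from scratch that turns hardware constraints into design principles. First, a two-level sparse structure combines fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, sharply reducing compute without sacrificing model capacity. Second, a pre-attention router lets the inference engine prefetch expert parameters from storage while attention is being computed, hiding the I/O latency of slow storage. Third, a NoPE-RoPE hybrid sparse attention mechanism greatly reduces KV cache usage and improves memory efficiency. With Q4_0 quantization, both released models exceed 20 tokens/s on ordinary consumer CPUs while using only 1GB and 8GB of memory respectively, largely removing the dependence on expensive GPU hardware.
Link: https://arxiv.org/abs/2507.20984
Authors: Yixin Song,Zhenliang Xue,Dongliang Wei,Feiyang Chen,Jianxiang Gao,Junchen Liu,Hangyu Liang,Guangshuo Qin,Chengrong Tian,Bo Wen,Longyu Zhao,Xinrui Zheng,Zeyu Mi,Haibo Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: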
Abstract:While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at this http URL and this http URL.
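The routing step that the abstract builds on can be sketched as generic top-k expert selection. This is a minimal illustration of standard MoE gating, not SmallThinker's actual router: the function name `route_top_k`, the weight layout, and the example numbers are all hypothetical.

```python
import math

def route_top_k(hidden, router_weights, k):
    """Return (expert_index, gate) pairs for the k highest-scoring experts."""
    # One logit per expert: dot product of the hidden state with that expert's router row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the gates sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical 4-expert router on a 2-dimensional hidden state.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
selected = route_top_k([2.0, 1.0], router_weights, k=2)
```

Because the paper's router runs before attention, the selected indices would be known early enough to be handed to a storage prefetcher; that overlap is not modeled in this sketch.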
[AI-8] From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation
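【Quick Read】: This paper addresses the performance drop caused by domain shift in time series analysis, where a model trained on a source domain degrades markedly on a target domain. Existing unsupervised domain adaptation (UDA) methods usually only align feature distributions and ignore how the intrinsic composition of features affects adaptation. The key to the solution is DARSD, a new framework with theoretical explainability that disentangles knowledge in a principled way from the perspective of representation space decomposition. First, an adversarially learned common invariant basis projects the original features into a domain-invariant subspace while preserving semantic content. Second, a prototypical pseudo-labeling mechanism dynamically separates target features by confidence to prevent error accumulation. Third, a hybrid contrastive optimization strategy enforces feature clustering and consistency at the same time while mitigating emerging distribution gaps. Working together, these components let DARSD outperform 12 existing UDA algorithms, with the best result in 35 of 53 cross-domain scenarios.
Link: https://arxiv.org/abs/2507.20968
Authors: Rongyao Cai,Ming Jin,Qingsong Wen,Kexin Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: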
Abstract:Domain shift poses a fundamental challenge in time series analysis, where models trained on a source domain often fail dramatically when applied in a target domain with a different yet similar distribution. While current unsupervised domain adaptation (UDA) methods attempt to align cross-domain feature distributions, they typically treat features as indivisible entities, ignoring the intrinsic compositions that govern domain adaptation. We introduce DARSD, a novel UDA framework with theoretical explainability that explicitly realizes UDA tasks from the perspective of representation space decomposition. Our core insight is that effective domain adaptation requires not just alignment, but principled disentanglement of transferable knowledge from mixed representations. DARSD consists of three synergistic components: (I) An adversarial learnable common invariant basis that projects original features into a domain-invariant subspace while preserving semantic content; (II) A prototypical pseudo-labeling mechanism that dynamically separates target features based on confidence, hindering error accumulation; (III) A hybrid contrastive optimization strategy that simultaneously enforces feature clustering and consistency while mitigating emerging distribution gaps. Comprehensive experiments conducted on four benchmark datasets (WISDM, HAR, HHAR, and MFD) demonstrate DARSD’s superiority against 12 UDA algorithms, achieving optimal performance in 35 out of 53 cross-domain scenarios.
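Component (II) above can be illustrated with a generic confidence-gated prototype classifier. This is only a sketch under assumptions: the cosine similarity, the softmax temperature, the 0.9 threshold, and the names `cosine` and `pseudo_label` are illustrative choices and are not taken from DARSD. Feature and prototype vectors are assumed to be nonzero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pseudo_label(feature, prototypes, temperature=0.1, threshold=0.9):
    """Return the most similar prototype's label, or None when confidence is below threshold."""
    labels = list(prototypes)
    scaled = [cosine(feature, prototypes[label]) / temperature for label in labels]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    # Softmax probability of the best-matching prototype.
    confidence = max(exps) / sum(exps)
    best = labels[scaled.index(peak)]
    return best if confidence >= threshold else None

# Hypothetical class prototypes in a 2-dimensional feature space.
prototypes = {"walk": [1.0, 0.0], "run": [0.0, 1.0]}
confident = pseudo_label([1.0, 0.0], prototypes)
ambiguous = pseudo_label([1.0, 1.0], prototypes)
```

Target samples that come back as `None` would be held out of the pseudo-labeled set, which is one simple way to limit error accumulation.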
[AI-9] Handoff Design in User-Centric Cell-Free Massive MIMO Networks Using DRL
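【Quick Read】: This paper addresses frequent handoffs (HOs) caused by user mobility in user-centric cell-free massive MIMO (UC-mMIMO) networks; such handoffs incur resource allocation and release overhead and hurt system performance. The key to the solution is a deep reinforcement learning (DRL) decision mechanism with a continuous action space, in which the Soft Actor-Critic algorithm trains a neural network to serve as the HO policy. A reward function with a handoff penalty term balances the user rate gain against handoff overhead. The learned policy automatically concentrates handoffs in specific time slots to reduce the overhead of initiating them, and it can respond in real time (response time < 0.4 ms).
Link: https://arxiv.org/abs/2507.20966
Authors: Hussein A. Ammar,Raviraj Adve,Shahram Shahbazpanahi,Gary Boudreau,Israfil Bahceci
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Notes: Published in IEEE Transactions on Communications (IEEE TCOM)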
Abstract:In the user-centric cell-free massive MIMO (UC-mMIMO) network scheme, user mobility necessitates updating the set of serving access points to maintain the user-centric clustering. Such updates are typically performed through handoff (HO) operations; however, frequent HOs lead to overheads associated with the allocation and release of resources. This paper presents a deep reinforcement learning (DRL)-based solution to predict and manage these connections for mobile users. Our solution employs the Soft Actor-Critic algorithm, with continuous action space representation, to train a deep neural network to serve as the HO policy. We present a novel proposition for a reward function that integrates an HO penalty in order to balance the attainable rate and the associated overhead related to HOs. We develop two variants of our system; the first one uses mobility direction-assisted (DA) observations that are based on the user movement pattern, while the second one uses history-assisted (HA) observations that are based on the history of the large-scale fading (LSF). Simulation results show that our DRL-based continuous action space approach is more scalable than its discrete-space counterpart, and that our derived HO policy automatically learns to gather HOs in specific time slots to minimize the overhead of initiating HOs. Our solution can also operate in real time with a response time less than 0.4 ms.
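The trade-off the reward function encodes can be sketched with a simple linear penalty. The abstract does not give the reward's exact form, so the linear shape, the function name `handoff_reward`, and the numbers below are assumptions for illustration only.

```python
def handoff_reward(rate_bps_hz, num_handoffs, penalty_per_handoff):
    """Attainable rate minus a linear charge for each handoff in the current time slot."""
    return rate_bps_hz - penalty_per_handoff * num_handoffs

# Hypothetical comparison: a small rate gain does not justify one handoff at this penalty.
stay = handoff_reward(rate_bps_hz=3.0, num_handoffs=0, penalty_per_handoff=0.5)
switch = handoff_reward(rate_bps_hz=3.2, num_handoffs=1, penalty_per_handoff=0.5)
```

With a penalty of this kind, an agent only changes its serving set when the expected rate gain exceeds the handoff charge, which is consistent with the reported tendency to group handoffs into a few time slots.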
[AI-10] Core Safety Values for Provably Corrigible Agents
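【Quick Read】: This paper addresses corrigibility of AI systems in complex, partially observable environments, that is, ensuring an agent remains safe and controllable under human intervention, including in multi-step decision making and with self-spawned agents. The key to the solution is an architecture of five structurally separate utility heads: deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation (AUP), and bounded task reward. The heads are combined lexicographically through strict weight gaps, so that even with learning error and planner sub-optimality the probability of violating any safety property is bounded while net human benefit is still ensured. This separation keeps obedience and impact limits dominant even when incentives conflict, in contrast to approaches that merge all norms into a single learned scalar (such as Constitutional AI or RLHF/RLAIF).
Link: https://arxiv.org/abs/2507.20964
Authors: Aran Nayebi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Notes: 14 pages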
Abstract:We introduce the first implementable framework for corrigibility, with provable guarantees in multi-step, partially observed environments. Our framework replaces a single opaque reward with five structurally separate utility heads – deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward – combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error ε and the planner is ε-sub-optimal, the probability of violating any safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits dominate even when incentives conflict. For open-ended settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon “decidable island” where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs. Consequently, the remaining challenge is the ordinary ML task of data coverage and generalization: reward-hacking risk is pushed into evaluation quality rather than hidden incentive leak-through, giving clearer implementation guidance for today’s LLM assistants and future autonomous systems.
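The phrase "combined lexicographically by strict weight gaps" can be illustrated with a small numeric sketch. It assumes each head's utility lies in [0, 1] and that differences smaller than a chosen `resolution` in a head are treated as ties; the weight formula, the function names, and the example values are hypothetical and are not the paper's construction.

```python
def lexicographic_weights(num_heads, resolution):
    """Weights such that a `resolution`-sized gain in one head outweighs all lower heads."""
    # With ratio r > 1 + 1/resolution, the summed weight of all lower-priority heads
    # stays below resolution times the weight of the current head.
    ratio = 2.0 + 1.0 / resolution
    return [ratio ** (num_heads - 1 - i) for i in range(num_heads)]

def combined_utility(head_values, weights):
    """Weighted sum of per-head utilities, highest-priority head first."""
    return sum(w * u for w, u in zip(weights, head_values))

# Head order (highest priority first): deference, switch access, truthfulness,
# low impact, task reward. The values below are made up.
weights = lexicographic_weights(num_heads=5, resolution=0.1)
obedient = combined_utility([1.0, 0.0, 0.0, 0.0, 0.0], weights)
tempting = combined_utility([0.9, 1.0, 1.0, 1.0, 1.0], weights)
```

Here the action that is slightly less deferential loses even though it scores perfectly on every lower-priority head, which is the dominance behavior the abstract describes.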
[AI-11] On the Limits of Hierarchically Embedded Logic in Classical Neural Networks
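【Quick Read】: This paper examines the limits of logical reasoning in large neural language models, in particular the root cause of their restricted ability to express higher-order logic (such as counting over complex predicates). The key idea is to model a neural network as a linear operator over a space of logic predicates and to show that each layer can encode at most one additional level of logical reasoning. It follows that a network of a given depth cannot faithfully represent predicates of a logic one order higher, which gives a strict upper bound on logical expressiveness. This structure induces a nontrivial null space during tokenization and embedding that excludes higher-order predicates from representability, offering a formal explanation for phenomena such as hallucination, repetition, and limited planning and pointing to directions for future improvement.
Link: https://arxiv.org/abs/2507.20960
Authors: Bill Cochran
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: 9 pages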
Abstract:We propose a formal model of reasoning limitations in large neural net models for language, grounded in the depth of their neural architecture. By treating neural networks as linear operators over logic predicate space, we show that each layer can encode at most one additional level of logical reasoning. We prove that a neural network of a particular depth cannot faithfully represent predicates of a logic one order higher, such as simple counting over complex predicates, implying a strict upper bound on logical expressiveness. This structure induces a nontrivial null space during tokenization and embedding, excluding higher-order predicates from representability. Our framework offers a natural explanation for phenomena such as hallucination, repetition, and limited planning, while also providing a foundation for understanding how approximations to higher-order logic may emerge. These results motivate architectural extensions and interpretability strategies in future development of language models.
[AI-12] Partially Observable Monte-Carlo Graph Search ICAPS2025
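【Quick Read】: This paper addresses offline solving of large partially observable Markov decision processes (POMDPs), where previous offline algorithms do not scale. The key to the solution is a new sampling-based offline algorithm, Partially Observable Monte-Carlo Graph Search (POMCGS), which folds the search tree on the fly during search to build a policy graph. This sharply reduces computation and lets users analyze and validate the policy before embedding and executing it. Combined with action progressive widening and observation clustering, POMCGS can also handle certain continuous POMDPs. Experiments show that it produces policies on challenging POMDPs that earlier offline algorithms could not handle, with values competitive with state-of-the-art online POMDP algorithms.
Link: https://arxiv.org/abs/2507.20951
Authors: Yang You,Vincent Thomas,Alex Schutz,Robert Skilton,Nick Hawes,Olivier Buffet
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: To be published in Proceedings of ICAPS 2025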
Abstract:Currently, large partially observable Markov decision processes (POMDPs) are often solved by sampling-based online methods which interleave planning and execution phases. However, a pre-computed offline policy is more desirable in POMDP applications with time or energy constraints. But previous offline algorithms are not able to scale up to large POMDPs. In this article, we propose a new sampling-based algorithm, the partially observable Monte-Carlo graph search (POMCGS) to solve large POMDPs offline. Different from many online POMDP methods, which progressively develop a tree while performing (Monte-Carlo) simulations, POMCGS folds this search tree on the fly to construct a policy graph, so that computations can be drastically reduced, and users can analyze and validate the policy prior to embedding and executing it. Moreover, POMCGS, together with action progressive widening and observation clustering methods provided in this article, is able to address certain continuous POMDPs. Through experiments, we demonstrate that POMCGS can generate policies on the most challenging POMDPs, which cannot be computed by previous offline algorithms, and these policies’ values are competitive compared with the state-of-the-art online POMDP algorithms.
[AI-13] Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization
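【Quick Read】: This paper addresses the limited flexibility of traditional evolutionary algorithms for multi-objective combinatorial optimization problems (MOCOP), which depend on domain knowledge and repeated parameter tuning and struggle on unseen MOCOP instances. The key to the solution is MPaGE, a Pareto-grid-guided LLM evolution framework. It partitions the objective space into grids and retains top-performing candidates to guide heuristic generation, and during variation it uses large language models (LLMs) to prioritize heuristics whose logical structures are semantically distinct. This promotes population diversity and reduces redundancy, yielding fast, high-quality multi-objective optimization.
Link: https://arxiv.org/abs/2507.20923
Authors: Minh Hieu Ha,Hung Phan,Tung Duy Doan,Tung Dao,Dao Tran,Huynh Thi Thanh Binh
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Notes: 36 pages, 20 figures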
Abstract:Multi-objective combinatorial optimization problems (MOCOP) frequently arise in practical applications that require the simultaneous optimization of conflicting objectives. Although traditional evolutionary algorithms can be effective, they typically depend on domain knowledge and repeated parameter tuning, limiting flexibility when applied to unseen MOCOP instances. Recently, integration of Large Language Models (LLMs) into evolutionary computation has opened new avenues for automatic heuristic generation, using their advanced language understanding and code synthesis capabilities. Nevertheless, most existing approaches predominantly focus on single-objective tasks, often neglecting key considerations such as runtime efficiency and heuristic diversity in multi-objective settings. To bridge this gap, we introduce Multi-heuristics for MOCOP via Pareto-Grid-guided Evolution of LLMs (MPaGE), a novel enhancement of the Simple Evolutionary Multiobjective Optimization (SEMO) framework that leverages LLMs and Pareto Front Grid (PFG) technique. By partitioning the objective space into grids and retaining top-performing candidates to guide heuristic generation, MPaGE utilizes LLMs to prioritize heuristics with semantically distinct logical structures during variation, thus promoting diversity and mitigating redundancy within the population. Through extensive evaluations, MPaGE demonstrates superior performance over existing LLM-based frameworks, and achieves competitive results to traditional Multi-objective evolutionary algorithms (MOEAs), with significantly faster runtime. Our code is available at: this https URL.
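The grid partitioning described above can be sketched as follows. This is a generic illustration, not MPaGE's Pareto Front Grid procedure: it assumes all objectives are minimized, uses an equal-width grid over the observed range, and keeps one candidate per occupied cell by the smallest normalized objective sum. The function names and sample points are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def grid_elites(points, cells_per_axis):
    """Keep one candidate per occupied grid cell: the smallest normalized objective sum."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    elites = {}
    for p in points:
        norm = [(p[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0 for d in dims]
        # Clamp so that the maximum value falls into the last cell.
        cell = tuple(min(cells_per_axis - 1, int(v * cells_per_axis)) for v in norm)
        score = sum(norm)
        if cell not in elites or score < elites[cell][0]:
            elites[cell] = (score, p)
    return [p for _, p in elites.values()]

# Hypothetical (objective_1, objective_2) scores for six candidate heuristics.
candidates = [(1, 9), (2, 8), (9, 1), (8, 2), (5, 5), (6, 6)]
front = pareto_front(candidates)
elites = grid_elites(front, cells_per_axis=2)
```

Keeping one elite per cell spreads the retained candidates across the objective space, so the heuristics passed to the LLM for variation cover different trade-offs rather than clustering in one region.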
[AI-14] Modeling User Behavior from Adaptive Surveys with Supplemental Context ICML2025
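【Quick Read】: This paper addresses the data shortfall of traditional surveys for user behavior modeling: user fatigue, incomplete responses, and length limits make it hard to capture user preferences and decisions fully. The key to the solution is LANTERN (Late-Attentive Network for Enriched Response Modeling), an architecture that maintains survey primacy. It combines selective gating, residual connections, and late fusion via cross-attention so that external contextual signals are incorporated only when relevant. LANTERN improves multi-label prediction over survey-only baselines, and the paper further analyzes threshold sensitivity and the benefits of selective modality reliance.
Link: https://arxiv.org/abs/2507.20919
Authors: Aman Shukla,Daniel Patrick Scantlebury,Rishabh Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: Best Paper, NewInML @ ICML 2025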
Abstract:Modeling user behavior is critical across many industries where understanding preferences, intent, or decisions informs personalization, targeting, and strategic outcomes. Surveys have long served as a classical mechanism for collecting such behavioral data due to their interpretability, structure, and ease of deployment. However, surveys alone are inherently limited by user fatigue, incomplete responses, and practical constraints on their length making them insufficient for capturing user behavior. In this work, we present LANTERN (Late-Attentive Network for Enriched Response Modeling), a modular architecture for modeling user behavior by fusing adaptive survey responses with supplemental contextual signals. We demonstrate the architectural value of maintaining survey primacy through selective gating, residual connections and late fusion via cross-attention, treating survey data as the primary signal while incorporating external modalities only when relevant. LANTERN outperforms strong survey-only baselines in multi-label prediction of survey responses. We further investigate threshold sensitivity and the benefits of selective modality reliance through ablation and rare/frequent attribute analysis. LANTERN’s modularity supports scalable integration of new encoders and evolving datasets. This work provides a practical and extensible blueprint for behavior modeling in survey-centric applications.
[AI-15] Music Arena: Live Evaluation for Text-to-Music
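【Quick Read】: This paper addresses the lack of a standardized, scalable, and renewable source of human preference data for evaluating text-to-music (TTM) models. Listening studies are expensive and use inconsistent protocols, which makes cross-system comparison difficult, and the absence of open preference data hinders aligning TTM systems and improving automatic evaluation metrics. The solution is Music Arena, an open live-evaluation platform. Its key features are: (1) an LLM-based routing system that handles the heterogeneous input and output type signatures of TTM systems; (2) collection of detailed preferences, including listening data and natural language feedback; and (3) a rolling data release policy with user privacy guarantees that provides a renewable source of preference data and increases transparency. With a standardized protocol and music-specific design, Music Arena offers a scalable way to evaluate TTM models.
Link: https://arxiv.org/abs/2507.20900
Authors: Yonghyun Kim,Wayne Chi,Anastasios N. Angelopoulos,Wei-Lin Chiang,Koichi Saito,Shinji Watanabe,Yuki Mitsufuji,Chris Donahue
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Notes: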
Abstract:We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering live evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of detailed preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to unique characteristics of specific AI domains. Music Arena is available at: this https URL
[AI-16] JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
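【Quick Read】: This paper addresses the lack of fine-grained word-level control in current lyrics-to-song generation models, which musicians often want in their workflows. Existing models such as DiffRhythm, ACE-Step, and LeVo generate listenable songs but offer no word-level control over timing and duration. The authors propose JAM, a flow-matching-based model that is, to their knowledge, the first effort toward word-level timing and duration control in song generation, enabling finer vocal control. They also apply aesthetic alignment through Direct Preference Optimization (DPO), iteratively refining the model on a synthetic dataset so that output quality and agreement with human preferences improve without manual annotation. In addition, they release JAME, a public evaluation dataset intended to standardize the evaluation of such models.
Link: https://arxiv.org/abs/2507.20880
Authors: Renhang Liu,Chia-Yu Hung,Navonil Majumder,Taylor Gautreaux,Amir Ali Bagherzadeh,Chuan Li,Dorien Herremans,Soujanya Poria
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Notes: this https URL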
Abstract:Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality and faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music-specific attributes.
[AI-17] Geometry of Neural Reinforcement Learning in Continuous State and Action Spaces
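【Quick Read】: This paper addresses the lack of theoretical understanding of reinforcement learning (RL) in continuous state and action spaces, in particular the geometry and dimensionality of the set of states attainable during policy learning. The key is a geometric view: by analyzing the set of attainable states induced by parametrized policies trained with a semi-gradient method, the paper proves that, under certain conditions, a two-layer neural policy trained with an actor-critic algorithm induces a low-dimensional manifold whose dimensionality is of the order of the dimensionality of the action space. This is the first result linking the geometry of the state space to the dimensionality of the action space. The bound is corroborated empirically on MuJoCo environments and a toy environment, and the paper further adds a local manifold learning layer to the policy and value function networks to learn sparse representations, improving performance on control tasks with very high degrees of freedom.
Link: https://arxiv.org/abs/2507.20853
Authors: Saket Tiwari,Omer Gottesman,George Konidaris
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: arXiv admin note: text overlap with arXiv:2301.00009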
Abstract:Advances in reinforcement learning (RL) have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens to understand the locally attained set of states. The set of all parametrised policies learnt through a semi-gradient based approach induces a set of attainable states in RL. We show that the training dynamics of a two-layer neural policy induce a low dimensional manifold of attainable states embedded in the high-dimensional nominal state space trained using an actor-critic algorithm. We prove that, under certain conditions, the dimensionality of this manifold is of the order of the dimensionality of the action space. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments and also demonstrate the results in a toy environment with varying dimensionality. We also show the applicability of this theoretical result by introducing a local manifold learning layer to the policy and value function networks to improve the performance in control environments with very high degrees of freedom by changing one layer of the neural network to learn sparse representations.
[AI-18] Free Energy-Inspired Cognitive Risk Integration for AV Navigation in Pedestrian-Rich Environments
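【Quick Read】: This paper addresses human-like prediction and decision-making for autonomous vehicles (AVs) interacting with vulnerable road users such as pedestrians in complex multi-agent environments, targeting the limitations of existing pedestrian behavior models and AV decision strategies. The key is a cognitive process modeling approach inspired by the Free Energy Principle, applied to both sides of the interaction. The pedestrian Cognitive-Risk Social Force Model adjusts goal-directed and repulsive forces using a fused measure of cognitive uncertainty and physical risk, producing more human-like trajectories. The AV uses the same fused risk to build a dynamic, risk-aware adjacency matrix for a Graph Convolutional Network embedded in a Soft Actor-Critic architecture, leading to safer, more efficient, and smoother decisions.
Link: https://arxiv.org/abs/2507.20850
Authors: Meiting Dang,Yanping Wu,Yafei Wang,Dezong Zhao,David Flynn,Chongfeng Wei
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: 14 pages, 5 figures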
Abstract:Recent advances in autonomous vehicle (AV) behavior planning have shown impressive social interaction capabilities when interacting with other road users. However, achieving human-like prediction and decision-making in interactions with vulnerable road users remains a key challenge in complex multi-agent interactive environments. Existing research focuses primarily on crowd navigation for small mobile robots, which cannot be directly applied to AVs due to inherent differences in their decision-making strategies and dynamic boundaries. Moreover, pedestrians in these multi-agent simulations follow fixed behavior patterns that cannot dynamically respond to AV actions. To overcome these limitations, this paper proposes a novel framework for modeling interactions between the AV and multiple pedestrians. In this framework, a cognitive process modeling approach inspired by the Free Energy Principle is integrated into both the AV and pedestrian models to simulate more realistic interaction dynamics. Specifically, the proposed pedestrian Cognitive-Risk Social Force Model adjusts goal-directed and repulsive forces using a fused measure of cognitive uncertainty and physical risk to produce human-like trajectories. Meanwhile, the AV leverages this fused risk to construct a dynamic, risk-aware adjacency matrix for a Graph Convolutional Network within a Soft Actor-Critic architecture, allowing it to make more reasonable and informed decisions. Simulation results indicate that our proposed framework effectively improves safety, efficiency, and smoothness of AV navigation compared to the state-of-the-art method.
[AI-19] First Hallucination Tokens Are Different from Conditional Ones
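【Quick Read】: This paper addresses hallucination detection in generative AI, in particular real-time filtering and targeted correction at the token level. The core gap is a limited understanding of how hallucination signals vary across a token sequence. Using the token-level annotations of the RAGTruth corpus together with reproduced logits, the paper analyzes how these signals depend on a token's position within a hallucinated span. It finds that the first hallucinated token carries a stronger signal and is easier to detect than the conditional tokens that follow, which informs the design of token-level hallucination detectors.
Link: https://arxiv.org/abs/2507.20836
Authors: Jakob Snel,Seong Joon Oh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 4.5 pages, 3 figures, Dataset, Knowledge Paper, Hallucination, Trustworthiness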
Abstract:Hallucination, the generation of untruthful content, is one of the major concerns regarding foundational models. Detecting hallucinations at the token level is vital for real-time filtering and targeted correction, yet the variation of hallucination signals within token sequences is not fully understood. Leveraging the RAGTruth corpus with token-level annotations and reproduced logits, we analyse how these signals depend on a token’s position within hallucinated spans, contributing to an improved understanding of token-level hallucination. Our results show that the first hallucinated token carries a stronger signal and is more detectable than conditional tokens. We release our analysis framework, along with code for logit reproduction and metric computation at this https URL.
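To make "signals computed from logits" concrete, the sketch below derives two common per-token uncertainty measures from one decoding step: the entropy of the next-token distribution and the log-probability of the emitted token. The abstract does not say which metrics the paper uses, so these two measures, the function name `token_signals`, and the example logits are assumptions.

```python
import math

def token_signals(logits, chosen_index):
    """Return (entropy, log-probability of the chosen token) for one decoding step."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Log-softmax computed directly from the logits to avoid log(0) on underflow.
    chosen_logprob = logits[chosen_index] - peak - math.log(total)
    return entropy, chosen_logprob

# Hypothetical 4-token vocabulary: one peaked step and one flat step.
confident = token_signals([8.0, 0.0, 0.0, 0.0], chosen_index=0)
uncertain = token_signals([1.0, 1.0, 1.0, 1.0], chosen_index=0)
```

A position-aware analysis in the spirit of the paper would compare such values between the first token of an annotated hallucinated span and the tokens that follow it.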
[AI-20] Why Flow Matching is Particle Swarm Optimization?
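【Quick Read】: This paper examines the previously unclear theoretical connection between flow matching in generative AI and particle swarm optimization (PSO) in evolutionary computation. Through theoretical analysis of their mathematical formulations and optimization mechanisms, it identifies three correspondences: vector field learning in flow matching resembles the velocity update rule in PSO; both follow a framework of progressive evolution from an initial distribution to a target distribution; and both can be described as dynamical systems governed by ordinary differential equations (ODEs). The paper argues that flow matching can be viewed as a continuous generalization of PSO, while PSO is a discrete implementation of swarm intelligence principles. This duality offers a unified theoretical basis for hybrid algorithms that combine the strengths of both.
Link: https://arxiv.org/abs/2507.20810
Authors: Kaichen Ouyang
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 7 pages, 0 figures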
Abstract:This paper preliminarily investigates the duality between flow matching in generative models and particle swarm optimization (PSO) in evolutionary computation. Through theoretical analysis, we reveal the intrinsic connections between these two approaches in terms of their mathematical formulations and optimization mechanisms: the vector field learning in flow matching shares similar mathematical expressions with the velocity update rules in PSO; both methods follow the fundamental framework of progressive evolution from initial to target distributions; and both can be formulated as dynamical systems governed by ordinary differential equations. Our study demonstrates that flow matching can be viewed as a continuous generalization of PSO, while PSO provides a discrete implementation of swarm intelligence principles. This duality understanding establishes a theoretical foundation for developing novel hybrid algorithms and creates a unified framework for analyzing both methods. Although this paper only presents preliminary discussions, the revealed correspondences suggest several promising research directions, including improving swarm intelligence algorithms based on flow matching principles and enhancing generative models using swarm intelligence concepts.
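For readers who want the two update rules side by side, the textbook forms are written out below. The notation is the standard one and may differ from the paper's: $\omega$ is the inertia weight, $c_1, c_2$ are acceleration coefficients, $r_1, r_2$ are uniform random numbers in $[0,1]$, $p_i$ is particle $i$'s personal best, $g$ is the global best, and $v_\theta$ is the learned velocity field.

```latex
% PSO: discrete velocity and position update for particle i at iteration k
v_i^{k+1} = \omega\, v_i^{k} + c_1 r_1 \,(p_i - x_i^{k}) + c_2 r_2 \,(g - x_i^{k}),
\qquad
x_i^{k+1} = x_i^{k} + v_i^{k+1}

% Flow matching: ODE driven by the learned velocity field, and one Euler step
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t),
\qquad
x_{t+\Delta t} \approx x_t + \Delta t \, v_\theta(x_t, t)
```

In both cases a position is advanced by a velocity term; the Euler step of the flow-matching ODE has the same shape as the PSO position update, which is the correspondence the abstract points to.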
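[AI-21] MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
【Quick Read】: This paper addresses two problems of conventional retrieval-augmented generation (RAG) on multimodal information: images and other visual content are not modeled structurally, and the knowledge structure and logical chains between modalities are hard to capture. The authors propose MMGraphRAG. It refines visual content through scene graphs and builds a multimodal knowledge graph (MMKG) together with a text-based knowledge graph. It then uses spectral clustering for cross-modal entity linking and retrieves context along reasoning paths to guide generation, improving generalization and interpretability on complex multimodal tasks.
Link: https://arxiv.org/abs/2507.20804
Authors: Xueyao Wan,Hang Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: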
Abstract:Retrieval-Augmented Generation (RAG) enhances language model generation by retrieving relevant information from external knowledge bases. However, conventional RAG methods face the issue of missing multimodal information. Multimodal RAG methods address this by fusing images and text through mapping them into a shared embedding space, but they fail to capture the structure of knowledge and logical chains between modalities. Moreover, they also require large-scale training for specific tasks, resulting in limited generalizing ability. To address these limitations, we propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph (MMKG) in conjunction with text-based KG. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear reasoning paths.
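[AI-22] evalSmarT: An LLM-Based Framework for Evaluating Smart Contract Generated Comments
【Quick Read】: This paper addresses the difficulty of evaluating smart contract comment generation: traditional metrics such as BLEU and ROUGE miss the domain-specific semantics of blockchain code, while human evaluation is costly and does not scale. The key to the solution is evalSmarT, a modular and extensible framework that uses large language models (LLMs) as evaluators. By combining approximately 40 LLMs with 10 prompting strategies, it supports over 400 evaluator configurations, providing a scalable and semantically rich alternative to existing evaluation methods. The results show that prompt design significantly affects alignment with human judgment.
Link: https://arxiv.org/abs/2507.20774
Authors: Fatou Ndiaye Mbodji
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: 4 pages, 4 tables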
Abstract:Smart contract comment generation has gained traction as a means to improve code comprehension and maintainability in blockchain systems. However, evaluating the quality of generated comments remains a challenge. Traditional metrics such as BLEU and ROUGE fail to capture domain-specific nuances, while human evaluation is costly and unscalable. In this paper, we present evalSmarT, a modular and extensible framework that leverages large language models (LLMs) as evaluators. The system supports over 400 evaluator configurations by combining approximately 40 LLMs with 10 prompting strategies. We demonstrate its application in benchmarking comment generation tools and selecting the most informative outputs. Our results show that prompt design significantly impacts alignment with human judgment, and that LLM-based evaluation offers a scalable and semantically rich alternative to existing methods.
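The configuration count follows from pairing every evaluator model with every prompting strategy. The sketch below uses exactly 40 placeholder models for the arithmetic, whereas the abstract says "approximately 40" and "over 400"; the identifiers are hypothetical because the abstract does not list the actual models or prompts.

```python
from itertools import product

# Placeholder identifiers; the real model and prompt names are not given in the abstract.
models = [f"model_{i}" for i in range(40)]
prompt_strategies = [f"prompt_{j}" for j in range(10)]

# One evaluator configuration per (model, prompting strategy) pair.
configurations = list(product(models, prompt_strategies))
```

Enumerating the grid this way makes it straightforward to run each configuration on the same set of generated comments and compare its scores with human judgments.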
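[AI-23] How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation
[Quick read]: This paper addresses the fact that the internal mechanism by which Chain-of-Thought (CoT) prompting improves model reasoning remains unclear. The key to its approach is to trace information flow backwards through the decoding, projection, and activation phases and to analyze CoT quantitatively. The analysis suggests that CoT may act as a decoding-space pruner that uses answer templates to guide output generation, with template adherence strongly correlated with performance gains. It also finds that CoT modulates neuron activation in a task-dependent way, lowering neuron activity on open-domain tasks and raising it on closed-domain tasks. These findings offer a new interpretability framework for CoT and insights for designing more efficient and robust prompting strategies.
Link: https://arxiv.org/abs/2507.20758
Authors: Hao Yang,Qinghua Zhao,Lei Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: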
Abstract:Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze CoT’s operational principles by reversely tracing information flow across decoding, projection, and activation phases. Our quantitative analysis suggests that CoT may serve as a decoding space pruner, leveraging answer templates to guide output generation, with higher template adherence strongly correlating with improved performance. Furthermore, we surprisingly find that CoT modulates neuron engagement in a task-dependent manner: reducing neuron activation in open-domain tasks, yet increasing it in closed-domain scenarios. These findings offer a novel mechanistic interpretability framework and critical insights for enabling targeted CoT interventions to design more efficient and robust prompts. We released our code and data at this https URL.
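[AI-24] Beyond Listenership: AI-Predicted Interventions Drive Improvements in Maternal Health Behaviours
[Quick read]: This paper addresses beneficiary dropoff and low engagement, which are common when automated voice calls are used to disseminate maternal and child health information. The key to the solution is an AI intervention strategy based on a restless bandit model that identifies the individuals most likely to benefit from live service call interventions, thereby improving listenership. The study further shows that these AI-targeted interventions not only significantly raise listenership but also bring measurable improvements in health behaviors, such as increased intake of iron or calcium supplements in the postnatal period, and better understanding of critical health topics during pregnancy and infancy. This indicates the potential of AI-driven targeted interventions to change maternal and child health behaviors.
Link: https://arxiv.org/abs/2507.20755
Authors: Arpan Dasgupta,Sarvesh Gharat,Neha Madhiwalla,Aparna Hegde,Milind Tambe,Aparna Taneja
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: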
Abstract:Automated voice calls with health information are a proven method for disseminating maternal and child health information among beneficiaries and are deployed in several programs around the world. However, these programs often suffer from beneficiary dropoffs and poor engagement. In previous work, through real-world trials, we showed that an AI model, specifically a restless bandit model, could identify beneficiaries who would benefit most from live service call interventions, preventing dropoffs and boosting engagement. However, one key question has remained open so far: does such improved listenership via AI-targeted interventions translate into beneficiaries’ improved knowledge and health behaviors? We present a first study that shows not only listenership improvements due to AI interventions, but also simultaneously links these improvements to health behavior changes. Specifically, we demonstrate that AI-scheduled interventions, which enhance listenership, lead to statistically significant improvements in beneficiaries’ health behaviors such as taking iron or calcium supplements in the postnatal period, as well as understanding of critical health topics during pregnancy and infancy. This underscores the potential of AI to drive meaningful improvements in maternal and child health.
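[AI-25] Industry Insights from Comparing Deep Learning and GBDT Models for E-Commerce Learning-to-Rank RECSYS2025
[Quick read]: This paper addresses the long-running debate over whether deep neural networks (DNNs) can outperform traditional tree-based models such as LambdaMART in e-commerce recommender and search systems. The key to its approach is a systematic benchmark of multiple DNN architectures and loss functions in a real business setting, using offline evaluation on a proprietary OTTO dataset and an 8-week online A/B test. The results show that a simple DNN architecture outperforms the production-grade LambdaMART baseline on total clicks and revenue while achieving parity on total units sold.
Link: https://arxiv.org/abs/2507.20753
Authors: Yunus Lutz,Timo Wilm,Philipp Duwe
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work was accepted for publication in the 19th ACM Conference on Recommender Systems (RecSys 2025). The final published version will be available at the ACM Digital Library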
Abstract:In e-commerce recommender and search systems, tree-based models, such as LambdaMART, have set a strong baseline for Learning-to-Rank (LTR) tasks. Despite their effectiveness and widespread adoption in industry, the debate continues whether deep neural networks (DNNs) can outperform traditional tree-based models in this domain. To contribute to this discussion, we systematically benchmark DNNs against our production-grade LambdaMART model. We evaluate multiple DNN architectures and loss functions on a proprietary dataset from OTTO and validate our findings through an 8-week online A/B test. The results show that a simple DNN architecture outperforms a strong tree-based baseline in terms of total clicks and revenue, while achieving parity in total units sold.
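[AI-26] Learning the Value Systems of Societies from Preferences ECAI2025
[Quick read]: This paper addresses how to automatically learn society-level value systems from human behavior, rather than modeling individual values only. Traditional value learning typically aggregates individual values, but social science research suggests that a society's value system should be viewed as a set of value systems of different groups. The authors therefore propose a method based on heuristic deep clustering. Its key idea is to observe the qualitative value-based preferences of a sample of agents and automatically identify a set of socially shared value groundings together with diverse group value systems, giving a more accurate picture of value structure at the societal level.
Link: https://arxiv.org/abs/2507.20728
Authors: Andrés Holgado-Sánchez,Holger Billhardt,Sascha Ossowski,Sara Degli-Esposti
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Full version of publication under the same accepted at ECAI 2025 conference (Submission 6755). 8 pages + 2 supplementary material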
Abstract:Aligning AI systems with human values and the value-based preferences of various stakeholders (their value systems) is key in ethical AI. In value-aware AI systems, decision-making draws upon explicit computational representations of individual values (groundings) and their aggregation into value systems. As these are notoriously difficult to elicit and calibrate manually, value learning approaches aim to automatically derive computational models of an agent’s values and value system from demonstrations of human behaviour. Nonetheless, social science and humanities literature suggest that it is more adequate to conceive the value system of a society as a set of value systems of different groups, rather than as the simple aggregation of individual value systems. Accordingly, here we formalize the problem of learning the value systems of societies and propose a method to address it based on heuristic deep clustering. The method learns socially shared value groundings and a set of diverse value systems representing a given society by observing qualitative value-based preferences from a sample of agents. We evaluate the proposal in a use case with real data about travelling decisions.
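[AI-27] Prostate Cancer Classification Using Multimodal Feature Fusion and Explainable AI
[Quick read]: This paper addresses the difficulty of achieving both classification performance and clinical interpretability in prostate cancer diagnosis when multimodal data are insufficiently fused. The key to the solution is an explainable AI system with a simple but effective multimodal fusion strategy: BERT processes textual clinical notes and Random Forest handles numerical lab data, reaching 98% accuracy and 99% AUC on the PLCO-NIH dataset. In particular, recall for intermediate cancer stages (Class 2/3) improves markedly (0.900 for the combined model, versus 0.824 with numerical features only and 0.725 with textual features only). SHAP analysis provides transparent feature importance rankings, and ablation studies confirm that textual features complement numerical ones, so the approach meets the interpretability needs of clinical decision-making while retaining a high F1 score (89%) and computational efficiency.
Link: https://arxiv.org/abs/2507.20714
Authors: Asma Sadia Khan,Fariba Tasnia Khan,Tanjim Mahmud,Salman Karim Khan,Rishita Chakma,Nahed Sharmen,Mohammad Shahadat Hossain,Karl Andersson
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Comments: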
Abstract:Prostate cancer, the second most prevalent male malignancy, requires advanced diagnostic tools. We propose an explainable AI system combining BERT (for textual clinical notes) and Random Forest (for numerical lab data) through a novel multimodal fusion strategy, achieving superior classification performance on PLCO-NIH dataset (98% accuracy, 99% AUC). While multimodal fusion is established, our work demonstrates that a simple yet interpretable BERT+RF pipeline delivers clinically significant improvements - particularly for intermediate cancer stages (Class 2/3 recall: 0.900 combined vs 0.824 numerical/0.725 textual). SHAP analysis provides transparent feature importance rankings, while ablation studies prove textual features’ complementary value. This accessible approach offers hospitals a balance of high performance (F1=89%), computational efficiency, and clinical interpretability - addressing critical needs in prostate cancer diagnostics.
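[AI-28] Algorithmic Fairness: A Runtime Perspective
[Quick read]: This paper addresses the shift in fairness evaluation for AI systems from a static property to a runtime property, that is, how to continuously monitor and enforce fairness in a dynamic environment. Traditional work treats fairness as a one-off evaluation over a fixed dataset, whereas real AI systems operate sequentially in evolving environments and therefore need a new analytical framework. The key to the solution is a minimal yet expressive framework based on sequences of coin tosses whose biases may change over time, which models real-world uncertainty and dynamics. On this basis, the paper summarizes parametrized strategies for two problems, fairness monitoring and fairness enforcement, with environment dynamics, prediction horizon, and confidence thresholds as the core parameters. It gives general results under simple assumptions and surveys existing monitoring methods for Markovian and additive dynamics, as well as existing enforcement solutions for static settings with known dynamics.
Link: https://arxiv.org/abs/2507.20711
Authors: Filip Cano,Thomas A. Henzinger,Konstantin Kueffner
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: To appear in RV 2025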
Abstract:Fairness in AI is traditionally studied as a static property evaluated once, over a fixed dataset. However, real-world AI systems operate sequentially, with outcomes and environments evolving over time. This paper proposes a framework for analysing fairness as a runtime property. Using a minimal yet expressive model based on sequences of coin tosses with possibly evolving biases, we study the problems of monitoring and enforcing fairness expressed in either toss outcomes or coin biases. Since there is no one-size-fits-all solution for either problem, we provide a summary of monitoring and enforcement strategies, parametrised by environment dynamics, prediction horizon, and confidence thresholds. For both problems, we present general results under simple or minimal assumptions. We survey existing solutions for the monitoring problem for Markovian and additive dynamics, and existing solutions for the enforcement problem in static settings with known dynamics.
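The monitoring problem described above can be illustrated with a minimal sketch. It assumes two coins (one per group) with static biases and uses a per-step Hoeffding bound; the function names, the `tolerance` and `delta` parameters, and the three-valued verdict are illustrative assumptions, not the paper's construction, and the bound is not anytime-valid.

```python
import math

def hoeffding_interval(successes, trials, delta):
    """Two-sided Hoeffding confidence interval for a coin's bias."""
    mean = successes / trials
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * trials))
    return max(0.0, mean - eps), min(1.0, mean + eps)

def monitor_fairness(tosses, tolerance=0.1, delta=0.05):
    """Yield a verdict after every toss.

    tosses: iterable of (group, outcome), group in {"a", "b"}, outcome in {0, 1}.
    Verdicts: "fair" if the bias gap is provably <= tolerance at this step,
    "unfair" if it is provably > tolerance, otherwise "unknown".
    """
    counts = {"a": [0, 0], "b": [0, 0]}  # [successes, trials] per group
    for group, outcome in tosses:
        counts[group][0] += outcome
        counts[group][1] += 1
        if counts["a"][1] == 0 or counts["b"][1] == 0:
            yield "unknown"
            continue
        # Split the error budget across the two coins (union bound).
        lo_a, hi_a = hoeffding_interval(*counts["a"], delta / 2)
        lo_b, hi_b = hoeffding_interval(*counts["b"], delta / 2)
        gap_lo = max(0.0, lo_a - hi_b, lo_b - hi_a)  # smallest possible |p_a - p_b|
        gap_hi = max(hi_a - lo_b, hi_b - lo_a)       # largest possible |p_a - p_b|
        if gap_hi <= tolerance:
            yield "fair"
        elif gap_lo > tolerance:
            yield "unfair"
        else:
            yield "unknown"
```

A monitor of this kind stays at "unknown" until enough tosses have been seen, which is one way to read the role of confidence thresholds in the abstract; handling evolving biases would need a different estimator.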
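[AI-29] A General Framework for Dynamic MAPF using Multi-Shot ASP and Tunnels
[Quick read]: This paper addresses Dynamic Multi-Agent Path Finding (D-MAPF): generating collision-free plans for multiple agents when the environment changes (for example, agents entering or leaving, or obstacles being moved or removed). The key to the solution is a general definition of D-MAPF, a solving framework based on multi-shot computation, and a new method based on Answer Set Programming (ASP). The method combines the advantages of replanning and repairing strategies and introduces the novel concept of tunnels to constrain where agents can move, balancing computational performance against solution quality.
Link: https://arxiv.org/abs/2507.20703
Authors: Aysu Bogatarkan,Esra Erdem
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: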
Abstract:The Multi-Agent Path Finding (MAPF) problem aims to find plans for multiple agents in an environment within a given time, such that the agents do not collide with each other or obstacles. Motivated by the execution and monitoring of these plans, we study the Dynamic MAPF (D-MAPF) problem, which allows changes such as agents entering/leaving the environment or obstacles being removed/moved. Considering the requirements of real-world applications in warehouses with the presence of humans, we introduce 1) a general definition for D-MAPF (applicable to variations of D-MAPF), 2) a new framework to solve D-MAPF (utilizing multi-shot computation, and allowing different methods to solve D-MAPF), and 3) a new ASP-based method to solve D-MAPF (combining advantages of replanning and repairing methods, with a novel concept of tunnels to specify where agents can move). We have illustrated the strengths and weaknesses of this method by experimental evaluations, from the perspectives of computational performance and quality of solutions.
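[AI-30] Hot-Swap MarkBoard: An Efficient Black-box Watermarking Approach for Large-scale Model Distribution
[Quick read]: This paper addresses intellectual property (IP) protection for On-Device AI models in large-scale distribution scenarios: how to embed a unique watermark in each user-specific model instance without retraining the model whenever the watermark changes. Existing backdoor-based watermarking methods mainly target cloud-based AI-as-a-Service (AIaaS) and are hard to adapt to multi-device, personalized deployment. The key to the solution is Hot-Swap MarkBoard, which encodes a user-specific n-bit binary signature as multiple independently embedded watermarks and uses a multi-branch Low-Rank Adaptation (LoRA) module so that watermarks can be customized by branch swapping without retraining. A parameter obfuscation mechanism additionally entangles the watermark weights with the base model parameters, preventing removal without degrading model performance. The scheme supports black-box verification and is compatible with various model architectures and task types (such as classification, image generation, and text generation). Experiments show that it is markedly more efficient and adaptable than existing methods, with 100% verification accuracy.
Link: https://arxiv.org/abs/2507.20650
Authors: Zhicheng Zhang,Peizhuo Lv,Mengke Wan,Jiang Fang,Diandian Guo,Yezeng Chen,Yinlong Liu,Wei Ma,Jiyan Sun,Liru Geng
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: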
Abstract:Recently, Deep Learning (DL) models have been increasingly deployed on end-user devices as On-Device AI, offering improved efficiency and privacy. However, this deployment trend poses more serious Intellectual Property (IP) risks, as models are distributed on numerous local devices, making them vulnerable to theft and redistribution. Most existing ownership protection solutions (e.g., backdoor-based watermarking) are designed for cloud-based AI-as-a-Service (AIaaS) and are not directly applicable to large-scale distribution scenarios, where each user-specific model instance must carry a unique watermark. These methods typically embed a fixed watermark, and modifying the embedded watermark requires retraining the model. To address these challenges, we propose Hot-Swap MarkBoard, an efficient watermarking method. It encodes user-specific n-bit binary signatures by independently embedding multiple watermarks into a multi-branch Low-Rank Adaptation (LoRA) module, enabling efficient watermark customization without retraining through branch swapping. A parameter obfuscation mechanism further entangles the watermark weights with those of the base model, preventing removal without degrading model performance. The method supports black-box verification and is compatible with various model architectures and DL tasks, including classification, image generation, and text generation. Extensive experiments across three types of tasks and six backbone models demonstrate our method’s superior efficiency and adaptability compared to existing approaches, achieving 100% verification accuracy.
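[AI-31] Adaptive Fuzzy Time Series Forecasting via Partially Asymmetric Convolution and Sub-Sliding Window Fusion
[Quick read]: This paper addresses the difficulty that current state-of-the-art time series forecasting models have in capturing spatio-temporal dependencies and synthesizing global information during learning. Its core solution is a partially asymmetric convolutional architecture based on sliding windows. Temporal data are constructed through adaptive fuzzification, so that every time node automatically obtains the global information and internal relationships within a restricted sliding window, with no human involvement. A bilateral Atrous algorithm reduces the computational burden without losing the global characteristics of elements and avoids processing redundant information. A partially asymmetric convolutional structure then lets the convolutional neural network (CNN) flexibly build sub-windows within existing sliding windows for finer-grained feature mining, and a multi-scale feature fusion mechanism sends information from different levels to the corresponding network layers for integration, which significantly improves forecasting accuracy.
Link: https://arxiv.org/abs/2507.20641
Authors: Lijian Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: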
Abstract:At present, state-of-the-art forecasting models are short of the ability to capture spatio-temporal dependency and synthesize global information at the stage of learning. To address this issue, in this paper, through the adaptive fuzzified construction of temporal data, we propose a novel convolutional architecture with partially asymmetric design based on the scheme of sliding window to realize accurate time series forecasting. First, the construction strategy of traditional fuzzy time series is improved to further extract short and long term temporal interrelation, which enables every time node to automatically possess corresponding global information and inner relationships among them in a restricted sliding window and the process does not require human involvement. Second, a bilateral Atrous algorithm is devised to reduce calculation demand of the proposed model without sacrificing global characteristics of elements. And it also allows the model to avoid processing redundant information. Third, after the transformation of time series, a partially asymmetric convolutional architecture is designed to more flexibly mine data features by filters in different directions on feature maps, which gives the convolutional neural network (CNN) the ability to construct sub-windows within existing sliding windows to model at a more fine-grained level. And after obtaining the time series information at different levels, the multi-scale features from different sub-windows will be sent to the corresponding network layer for time series information fusion. Compared with other competitive modern models, the proposed method achieves state-of-the-art results on most of popular time series datasets, which is fully verified by the experimental results.
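[AI-32] Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
[Quick read]: This paper addresses the problem that current video-to-music (V2M) generation methods, which rely solely on visual features or supplementary textual input, generate music in a black-box manner and struggle to meet users' individual needs. The key to the solution is a multi-condition guided V2M generation framework that introduces multiple time-varying conditions for fine control over music generation. It has two stages: the first uses a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible alignment of audiovisual features; the second uses a dynamic conditional fusion module and a control-guided decoder module to integrate the conditions and accurately guide music composition. Together these significantly improve the quality and controllability of the generated music and its match with user expectations.
Link: https://arxiv.org/abs/2507.20627
Authors: Junxian Wu,Weitao You,Heda Zuo,Dengming Zhang,Pei Chen,Lingyun Sun
Affiliations: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by the 33rd ACM International Conference on Multimedia (ACMMM 2025). The project page is available at this https URL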
Abstract:Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users’ needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing control and alignment with user expectations.
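[AI-33] Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
[Quick read]: This paper addresses compression efficiency when deploying Large Multimodal Models (LMMs) on edge devices, in particular how to achieve efficient parameter sparsification and key-value (KV) cache compression while preserving model performance. The key to the solution is an adaptive search algorithm based on the Tree-structured Parzen Estimator (TPE), which dynamically adjusts the pruning ratio and KV cache quantization bandwidth of different network layers with model performance as the optimization objective, jointly optimizing pruning and KV cache quantization. The method needs no additional fine-tuning or weight adjustment and uses a fast pruning technique to improve compression efficiency. In evaluations on LLaVA-1.5 7B and 13B it outperforms advanced compression methods such as SparseGPT and Wanda, and its automatic allocation of KV cache compression resources sets a new standard, balancing memory efficiency and inference accuracy.
Link: https://arxiv.org/abs/2507.20613
Authors: Te Zhang,Yuheng Li,Junxiang Wang,Lujun Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages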
Abstract:Large multimodal models (LMMs) have advanced significantly by integrating visual encoders with extensive language models, enabling robust reasoning capabilities. However, compressing LMMs for deployment on edge devices remains a critical challenge. In this work, we propose an adaptive search algorithm that optimizes sparsity and KV cache compression to enhance LMM efficiency. Utilizing the Tree-structured Parzen Estimator, our method dynamically adjusts pruning ratios and KV cache quantization bandwidth across different LMM layers, using model performance as the optimization objective. This approach uniquely combines pruning with key-value cache quantization and incorporates a fast pruning technique that eliminates the need for additional fine-tuning or weight adjustments, achieving efficient compression without compromising accuracy. Comprehensive evaluations on benchmark datasets, including LLaVA-1.5 7B and 13B, demonstrate our method's superiority over state-of-the-art techniques such as SparseGPT and Wanda across various compression levels. Notably, our framework's automatic allocation of KV cache compression resources sets a new standard in LMM optimization, delivering memory efficiency without sacrificing much performance.
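[AI-34] Beyond Interactions: Node-Level Graph Generation for Knowledge-Free Augmentation in Recommender Systems
[Quick read]: This paper addresses the strong data dependency and high computational overhead caused by current recommender systems' reliance on external resources (such as knowledge graphs or large language models), and the shortcomings of knowledge-free models in bridging semantic and structural gaps. The key to the solution is NodeDiffRec, a pioneering knowledge-free augmentation framework that uses diffusion for fine-grained node-level graph generation. It synthesizes pseudo-items and their interactions in line with the underlying distribution and refines user preferences through a denoising preference modeling process, significantly improving the semantic diversity and structural connectivity of recommendations without external knowledge.
Link: https://arxiv.org/abs/2507.20578
Authors: Zhaoyan Wang,Hyunjun Ahn,In-Young Ko
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: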
Abstract:Recent advances in recommender systems rely on external resources such as knowledge graphs or large language models to enhance recommendations, which limit applicability in real-world settings due to data dependency and computational overhead. Although knowledge-free models are able to bolster recommendations by direct edge operations as well, the absence of augmentation primitives drives them to fall short in bridging semantic and structural gaps as high-quality paradigm substitutes. Unlike existing diffusion-based works that remodel user-item interactions, this work proposes NodeDiffRec, a pioneering knowledge-free augmentation framework that enables fine-grained node-level graph generation for recommendations and expands the scope of restricted augmentation primitives via diffusion. By synthesizing pseudo-items and corresponding interactions that align with the underlying distribution for injection, and further refining user preferences through a denoising preference modeling process, NodeDiffRec dramatically enhances both semantic diversity and structural connectivity without external knowledge. Extensive experiments across diverse datasets and recommendation algorithms demonstrate the superiority of NodeDiffRec, achieving State-of-the-Art (SOTA) performance, with maximum average performance improvements of 98.6% in Recall@5 and 84.0% in NDCG@5 over selected baselines.
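[AI-35] DAG-AFL: Directed Acyclic Graph-based Asynchronous Federated Learning
[Quick read]: This paper addresses the low training efficiency and limited accuracy that asynchronous client participation and non-independent and identically distributed (Non-IID) data cause in federated learning (FL), along with the excessive computation and communication overhead that traditional Proof of Work (PoW) based blockchain consensus mechanisms impose on resource-limited devices. The key to the solution is a Directed Acyclic Graph-based Asynchronous Federated Learning framework (DAG-AFL). It designs a tip selection algorithm that combines temporal freshness, node reachability, and model accuracy, together with a DAG-based trusted verification strategy, markedly reducing extra resource consumption while preserving decentralization and security, and thus improving overall training efficiency and model performance.
Link: https://arxiv.org/abs/2507.20571
Authors: Shuaipeng Zhang,Lanju Kong,Yixin Zhang,Wei He,Yongqing Zheng,Han Yu,Lizhen Cui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, IEEE International Conference on Multimedia Expo 2025 conference paper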
Abstract:Due to the distributed nature of federated learning (FL), the vulnerability of the global model and the need for coordination among many client devices pose significant challenges. As a promising decentralized, scalable and secure solution, blockchain-based FL methods have attracted widespread attention in recent years. However, traditional consensus mechanisms designed for Proof of Work (PoW) similar to blockchain incur substantial resource consumption and compromise the efficiency of FL, particularly when participating devices are wireless and resource-limited. To address asynchronous client participation and data heterogeneity in FL, while limiting the additional resource overhead introduced by blockchain, we propose the Directed Acyclic Graph-based Asynchronous Federated Learning (DAG-AFL) framework. We develop a tip selection algorithm that considers temporal freshness, node reachability and model accuracy, with a DAG-based trusted verification strategy. Extensive experiments on 3 benchmarking datasets against eight state-of-the-art approaches demonstrate that DAG-AFL significantly improves training efficiency and model accuracy by 22.7% and 6.5% on average, respectively.
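[AI-36] Unlearning of Knowledge Graph Embedding via Preference Optimization
[Quick read]: This paper addresses the difficulty of effectively removing outdated or erroneous information from knowledge graphs (KGs), in particular two challenges for approximate unlearning in knowledge graph embedding (KGE) models. First, the inherent connectivity of triples means the targeted information cannot be fully removed, because forgotten triples can still be inferred from retained ones. Second, local unlearning strategies weaken the integrity of the remaining knowledge at the forgetting boundary. The key to the solution is the GraphDPO framework, with two core innovations. First, it casts unlearning as a direct preference optimization (DPO) problem, training the model to prefer reconstructed alternative triples over the original forgetting triples, which reduces reliance on forgettable knowledge and mitigates incomplete forgetting caused by KG connectivity. Second, it introduces an out-boundary sampling strategy that builds preference pairs with minimal semantic overlap, combined with a boundary recall mechanism that replays and distills relevant knowledge within and across time steps, protecting knowledge integrity at the forgetting boundary.
Link: https://arxiv.org/abs/2507.20566
Authors: Jiajun Liu,Wenjun Ke,Peng Wang,Yao He,Ziyu Shang,Guozheng Li,Zijie Xu,Ke Ji
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: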
Abstract:Existing knowledge graphs (KGs) inevitably contain outdated or erroneous knowledge that needs to be removed from knowledge graph embedding (KGE) models. To address this challenge, knowledge unlearning can be applied to eliminate specific information while preserving the integrity of the remaining knowledge in KGs. Existing unlearning methods can generally be categorized into exact unlearning and approximate unlearning. However, exact unlearning requires high training costs while approximate unlearning faces two issues when applied to KGs due to the inherent connectivity of triples: (1) It fails to fully remove targeted information, as forgetting triples can still be inferred from remaining ones. (2) It focuses on local data for specific removal, which weakens the remaining knowledge in the forgetting boundary. To address these issues, we propose GraphDPO, a novel approximate unlearning framework based on direct preference optimization (DPO). Firstly, to effectively remove forgetting triples, we reframe unlearning as a preference optimization problem, where the model is trained by DPO to prefer reconstructed alternatives over the original forgetting triples. This formulation penalizes reliance on forgettable knowledge, mitigating incomplete forgetting caused by KG connectivity. Moreover, we introduce an out-boundary sampling strategy to construct preference pairs with minimal semantic overlap, weakening the connection between forgetting and retained knowledge. Secondly, to preserve boundary knowledge, we introduce a boundary recall mechanism that replays and distills relevant information both within and across time steps. We construct eight unlearning datasets across four popular KGs with varying unlearning rates. Experiments show that GraphDPO outperforms state-of-the-art baselines by up to 10.1% in MRR_Avg and 14.0% in MRR_F1.
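To make the preference framing concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, with the reconstructed alternative triple as the preferred item and the original forgetting triple as the rejected one. It assumes the KGE model and a frozen reference model each expose a log-likelihood for a triple; how GraphDPO actually scores triples is not stated in the abstract, so the argument names and `beta` value are illustrative.

```python
import math

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    Assumed mapping: preferred = reconstructed alternative triple,
                     rejected  = original triple to be forgotten.
    """
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    # -log(sigmoid(margin)) written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the trained model matches the reference and shrinks as the model shifts probability from the forgetting triple toward the alternative, which is the sense in which reliance on forgettable knowledge is penalized.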
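[AI-37] MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design
[Quick read]: This paper addresses the lack of flexibility and interpretability in Automatic Heuristic Design (AHD), where traditional evolutionary methods operate directly on heuristic code. The key to the solution is the MeLA architecture, which uses a "prompt evolution" mechanism: a Large Language Model (LLM) generates heuristics, and a metacognitive framework analyzes performance feedback to systematically refine the generative strategy. Specifically, MeLA comprises a problem analyzer that constructs the initial instructional prompt, an error diagnosis system that repairs faulty generated code, and a metacognitive search engine that iteratively optimizes the prompt based on heuristic effectiveness, significantly improving the effectiveness and robustness of heuristic design.
Link: https://arxiv.org/abs/2507.20541
Authors: Zishang Qiu,Xinan Chen,Long Chen,Ruibin Bai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: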
Abstract:This paper introduces MeLA, a Metacognitive LLM-Driven Architecture that presents a new paradigm for Automatic Heuristic Design (AHD). Traditional evolutionary methods operate directly on heuristic code; in contrast, MeLA evolves the instructional prompts used to guide a Large Language Model (LLM) in generating these heuristics. This process of “prompt evolution” is driven by a novel metacognitive framework where the system analyzes performance feedback to systematically refine its generative strategy. MeLA’s architecture integrates a problem analyzer to construct an initial strategic prompt, an error diagnosis system to repair faulty code, and a metacognitive search engine that iteratively optimizes the prompt based on heuristic effectiveness. In comprehensive experiments across both benchmark and real-world problems, MeLA consistently generates more effective and robust heuristics, significantly outperforming state-of-the-art methods. Ultimately, this research demonstrates the profound potential of using cognitive science as a blueprint for AI architecture, revealing that by enabling an LLM to metacognitively regulate its problem-solving process, we unlock a more robust and interpretable path to AHD.
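[AI-38] The Xeno Sutra: Can Meaning and Value be Ascribed to an AI-Generated “Sacred” Text?
[Quick read]: The question this paper tries to answer is how human society should respond to the challenge that generative AI poses to traditional meaning-making when it can produce texts of considerable philosophical depth and literary complexity, such as an imitation Buddhist "sutra". The key to its answer is to bring in the perspective of Buddhist philosophy: the openness and non-substantiality of Buddhism, and its understanding of "emptiness", allow it to adapt flexibly and respond to generative AI's encroachment on human meaning-making, offering a philosophical path for easing the tension between technology and the humanities.
Link: https://arxiv.org/abs/2507.20525
Authors: Murray Shanahan,Tara Das,Robert Thurman
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: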
Abstract:This paper presents a case study in the use of a large language model to generate a fictional Buddhist “sutra”, and offers a detailed analysis of the resulting text from a philosophical and literary point of view. The conceptual subtlety, rich imagery, and density of allusion found in the text make it hard to casually dismiss on account of its mechanistic origin. This raises questions about how we, as a society, should come to terms with the potentially unsettling possibility of a technology that encroaches on human meaning-making. We suggest that Buddhist philosophy, by its very nature, is well placed to adapt.
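[AI-39] LLMs-guided adaptive compensator: Bringing Adaptivity to Automatic Control Systems with Large Language Models
[Quick read]: This paper addresses the limitations of applying Large Language Models (LLMs) in automatic control, especially adaptive control. Existing studies mostly focus on high-level task decomposition or are restricted to simplified systems and fixed-structure gain tuning, and lack real-world validation. The key to the solution is an LLM-guided adaptive compensator framework inspired by Model Reference Adaptive Control (MRAC): LLMs are prompted with the discrepancies between an unknown system and a reference system and design a compensator that brings the unknown system's response close to the reference, achieving adaptivity without designing a controller from scratch. The method markedly reduces reasoning complexity, and its structured design process, generalizability, robustness, and practicality are validated on soft and humanoid robot platforms in simulation and in the real world.
Link: https://arxiv.org/abs/2507.20509
Authors: Zhongchao Zhou,Yuxi Lu,Yaonan Zhu,Yifan Zhao,Bin He,Liang He,Wenwen Yu,Yusuke Iwasawa
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: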
Abstract:With rapid advances in code generation, reasoning, and problem-solving, Large Language Models (LLMs) are increasingly applied in robotics. Most existing work focuses on high-level tasks such as task decomposition. A few studies have explored the use of LLMs in feedback controller design; however, these efforts are restricted to overly simplified systems, fixed-structure gain tuning, and lack real-world validation. To further investigate LLMs in automatic control, this work targets a key subfield: adaptive control. Inspired by the framework of model reference adaptive control (MRAC), we propose an LLM-guided adaptive compensator framework that avoids designing controllers from scratch. Instead, the LLMs are prompted using the discrepancies between an unknown system and a reference system to design a compensator that aligns the response of the unknown system with that of the reference, thereby achieving adaptivity. Experiments evaluate five methods: LLM-guided adaptive compensator, LLM-guided adaptive controller, indirect adaptive control, learning-based adaptive control, and MRAC, on soft and humanoid robots in both simulated and real-world environments. Results show that the LLM-guided adaptive compensator outperforms traditional adaptive controllers and significantly reduces reasoning complexity compared to the LLM-guided adaptive controller. The Lyapunov-based analysis and reasoning-path inspection demonstrate that the LLM-guided adaptive compensator enables a more structured design process by transforming mathematical derivation into a reasoning task, while exhibiting strong generalizability, adaptability, and robustness. This study opens a new direction for applying LLMs in the field of automatic control, offering greater deployability and practicality compared to vision-language models.
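[AI-40] DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning ECAI2025
[Quick read]: This paper addresses the performance degradation that data distribution differences cause in cross-domain offline reinforcement learning when target data are limited. The core challenges are: (1) the imbalance between source and target dataset sizes causes neural network-based domain gap estimators to overfit and yield uninformative measurements; (2) only part of the source data overlaps with the target domain, making source information hard to exploit. The key to the solution is the DmC framework, which first uses a k-nearest neighbor (k-NN) method to estimate domain proximity without training, avoiding overfitting, and then uses this proximity to design a nearest-neighbor-guided diffusion model that generates synthetic source samples closer to the target domain, making policy learning more effective.
Link: https://arxiv.org/abs/2507.20499
Authors: Linh Le Pham Van,Minh Hoang Nguyen,Duc Kieu,Hung Le,Hung The Tran,Sunil Gupta
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: accepted at ECAI 2025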
Abstract:Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes k-nearest neighbor (k-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.
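The training-free proximity idea in this abstract can be illustrated with a small sketch. The abstract does not give DmC's exact estimator, so the score used here (mean Euclidean distance from a source sample to its k nearest target samples) and the names `knn_proximity` and `select_closest` are assumptions for illustration only.

```python
import numpy as np

def knn_proximity(source, target, k=3):
    """Score each source sample by its mean distance to its k nearest target samples.

    Assumption: lower score = closer to the target domain. This is a stand-in
    estimator, not the one defined in the paper.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Pairwise Euclidean distances, shape (n_source, n_target).
    dists = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)
    k = min(k, target.shape[0])
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

def select_closest(source, target, keep_fraction=0.5, k=3):
    """Return indices of the source samples with the best (lowest) proximity scores."""
    scores = knn_proximity(source, target, k)
    n_keep = max(1, int(len(scores) * keep_fraction))
    return np.argsort(scores)[:n_keep]
```

No parameters are fitted here, which is why a small target set cannot cause the overfitting that the abstract attributes to neural domain-gap estimators.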
[AI-41] Shapley-Value-Based Graph Sparsification for GNN Inference
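【Quick Read】: This paper addresses the low inference efficiency of Graph Neural Networks (GNNs). The core challenge is to reduce graph complexity without hurting predictive performance. Conventional graph sparsification relies on non-negative importance scores, which cannot separate critical edges from misleading or adversarial ones and therefore limit how effective sparsification can be. The key to the solution is a Shapley-value-based explanation method that assigns both positive and negative contributions to node predictions. This gives a theoretically fair and robust assessment of the influence of graph subsets and enables better pruning: edges critical to predictions are kept, while redundant or harmful connections are removed. Experiments show that the method maintains or even improves GNN predictive performance while substantially reducing graph complexity, balancing efficiency and interpretability.
Link: https://arxiv.org/abs/2507.20460
Authors: Selahattin Akkas,Ariful Azad
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages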
Abstract:Graph sparsification is a key technique for improving inference efficiency in Graph Neural Networks by removing edges with minimal impact on predictions. GNN explainability methods generate local importance scores, which can be aggregated into global scores for graph sparsification. However, many explainability methods produce only non-negative scores, limiting their applicability for sparsification. In contrast, Shapley value based methods assign both positive and negative contributions to node predictions, offering a theoretically robust and fair allocation of importance by evaluating many subsets of graphs. Unlike gradient-based or perturbation-based explainers, Shapley values enable better pruning strategies that preserve influential edges while removing misleading or adversarial connections. Our approach shows that Shapley value-based graph sparsification maintains predictive performance while significantly reducing graph complexity, enhancing both interpretability and efficiency in GNN inference.
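To make the role of signed scores concrete, here is a minimal sketch of exact Shapley values over a tiny set of edges. The value function `toy_value` and the edge names are hypothetical; a real pipeline would query a GNN and approximate the sum by sampling, since exact enumeration is exponential in the number of edges.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution over all subsets."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi

def toy_value(edges):
    """Hypothetical prediction quality for a set of kept edges (illustration only)."""
    gains = {"a": 1.0, "b": 0.5, "c": -0.4}  # edge "c" is a misleading connection
    return sum(gains[e] for e in edges)

def prune(players, value):
    """Keep only edges whose Shapley value is strictly positive."""
    phi = shapley_values(players, value)
    return [p for p in players if phi[p] > 0]
```

With a non-negative importance score, edge "c" could only be ranked as unimportant; the signed Shapley value marks it as harmful, which is the distinction the abstract draws.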
[AI-42] STARN-GAT: A Multi-Modal Spatio-Temporal Graph Attention Network for Accident Severity Prediction
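【Quick Read】: This paper addresses a key challenge in predicting traffic accident severity: modeling the complex interdependencies among spatial, temporal, and contextual variables so that road-safety analysis becomes more accurate. Existing methods handle multi-dimensional feature fusion and dynamic environmental change poorly, and struggle to support precise decisions on emergency response and infrastructure. The core of the solution is a multi-modal spatio-temporal graph attention network (STARN-GAT). It uses adaptive graph construction and modality-aware attention to integrate road-network topology, temporal traffic patterns, and environmental context in a unified framework, enabling efficient identification of high-risk accidents along with interpretable analysis. The model performs strongly on both the FARS and ARI-BUET datasets, supporting its generalization and deployment potential in real-world settings.
Link: https://arxiv.org/abs/2507.20451
Authors: Pritom Ray Nobin,Imran Ahammad Rifat
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages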
Abstract:Accurate prediction of traffic accident severity is critical for improving road safety, optimizing emergency response strategies, and informing the design of safer transportation infrastructure. However, existing approaches often struggle to effectively model the intricate interdependencies among spatial, temporal, and contextual variables that govern accident outcomes. In this study, we introduce STARN-GAT, a Multi-Modal Spatio-Temporal Graph Attention Network, which leverages adaptive graph construction and modality-aware attention mechanisms to capture these complex relationships. Unlike conventional methods, STARN-GAT integrates road network topology, temporal traffic patterns, and environmental context within a unified attention-based framework. The model is evaluated on the Fatality Analysis Reporting System (FARS) dataset, achieving a Macro F1-score of 85 percent, ROC-AUC of 0.91, and recall of 81 percent for severe incidents. To ensure generalizability within the South Asian context, STARN-GAT is further validated on the ARI-BUET traffic accident dataset, where it attains a Macro F1-score of 0.84, recall of 0.78, and ROC-AUC of 0.89. These results demonstrate the model’s effectiveness in identifying high-risk cases and its potential for deployment in real-time, safety-critical traffic management systems. Furthermore, the attention-based architecture enhances interpretability, offering insights into contributing factors and supporting trust in AI-assisted decision-making. Overall, STARN-GAT bridges the gap between advanced graph neural network techniques and practical applications in road safety analytics.
[AI-43] Enhancing QoS in Edge Computing through Federated Layering Techniques: A Pathway to Resilient AI Lifelong Learning Systems
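【Quick Read】: This paper addresses the decline in Quality of Service (QoS) in edge computing environments caused by surging data volume and complexity in the context of 6G communication networks. The core solution is a small-model collaborative mechanism based on Federated Layering Techniques (FLT). It improves the operational efficiency and response time of AI models in resource-constrained settings, and combines model layering with privacy protection to keep parameter transmission secure while strengthening learning and reasoning. The key innovation is a negotiation and debate mechanism among small models that improves decision accuracy and yields an efficient, scalable, privacy-friendly lifelong learning system for large models, markedly improving QoS in edge computing environments.
Link: https://arxiv.org/abs/2507.20444
Authors: Chengzhuo Han
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: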
Abstract:In the context of the rapidly evolving information technology landscape, marked by the advent of 6G communication networks, we face an increased data volume and complexity in network environments. This paper addresses these challenges by focusing on Quality of Service (QoS) in edge computing frameworks. We propose a novel approach to enhance QoS through the development of General Artificial Intelligence Lifelong Learning Systems, with a special emphasis on Federated Layering Techniques (FLT). Our work introduces a federated layering-based small model collaborative mechanism aimed at improving AI models’ operational efficiency and response time in environments where resources are limited. This innovative method leverages the strengths of cloud and edge computing, incorporating a negotiation and debate mechanism among small AI models to enhance reasoning and decision-making processes. By integrating model layering techniques with privacy protection measures, our approach ensures the secure transmission of model parameters while maintaining high efficiency in learning and reasoning capabilities. The experimental results demonstrate that our strategy not only enhances learning efficiency and reasoning accuracy but also effectively protects the privacy of edge nodes. This presents a viable solution for achieving resilient large model lifelong learning systems, with a significant improvement in QoS for edge computing environments.
[AI-44] When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous Contradictory and Incomplete Task Descriptions
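【Quick Read】: This paper addresses the marked drop in code generation performance of Large Language Models (LLMs) when task descriptions are ambiguous, incomplete, or contradictory. The core challenge is that existing benchmarks (such as HumanEval and MBPP) assume clear task descriptions, whereas real user instructions often contain natural-language flaws. The key to the solution is a new, systematically perturbed dataset: guided mutation strategies introduce realistic flaws into the original task descriptions, mimicking the messiness of informal developer instructions. On this basis, multiple LLMs of different sizes and architectures are evaluated empirically, revealing how sensitive functional correctness and failure modes are to such flaws. The results underscore the need for more robust models and inform training strategies, benchmark design, and engineering deployment.
Link: https://arxiv.org/abs/2507.20439
Authors: Maya Larbi,Amal Akli,Mike Papadakis,Rihab Bouyousfi,Maxime Cordy,Federica Sarro,Yves Le Traon
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: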
Abstract:Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions. In this paper, we present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions. We extend the HumanEval and MBPP benchmarks by systematically introducing realistic task description flaws through guided mutation strategies, producing a dataset that mirrors the messiness of informal developer instructions. We evaluate multiple LLMs of varying sizes and architectures, analyzing their functional correctness and failure modes across task description categories. Our findings reveal that even minor imperfections in task description phrasing can cause significant performance degradation, with contradictory task descriptions resulting in numerous logical errors. Moreover, while larger models tend to be more resilient than smaller variants, they are not immune to the challenges posed by unclear requirements. We further analyze semantic error patterns and identify correlations between description clarity, model behavior, and error types. Our results underscore the critical need for developing LLMs that are not only powerful but also robust to the imperfections inherent in natural user tasks, highlighting important considerations for improving model training strategies, designing more realistic evaluation benchmarks, and ensuring reliable deployment in practical software development environments.
[AI-45] FAST: Similarity-based Knowledge Transfer for Efficient Policy Learning
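【Quick Read】: This paper addresses three key problems facing Transfer Learning (TL) in dynamic environments: negative transfer, domain adaptation, and inefficient source-policy selection. These problems are especially pronounced in evolving settings such as game development. To improve cross-task knowledge transfer, boost agent performance, and cut computational cost, the authors propose FAST (Framework for Adaptive Similarity-based Transfer). Its core innovation is to build a latent representation of task dynamics from visual frames and textual descriptions, use it to estimate similarity scores between environments, and use those scores to choose the best transfer source among candidate policies, which simplifies learning of new tasks. Experiments show that FAST reaches final performance comparable to training from scratch across multiple racing tracks while requiring significantly fewer training steps, supporting embedding-based task similarity estimation.
Link: https://arxiv.org/abs/2507.20433
Authors: Alessandro Capurso,Elia Piccoli,Davide Bacciu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Conference on Games (CoG) 2025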
Abstract:Transfer Learning (TL) offers the potential to accelerate learning by transferring knowledge across tasks. However, it faces critical challenges such as negative transfer, domain adaptation, and inefficiency in selecting solid source policies. These issues often represent critical problems in evolving domains, e.g., game development, where scenarios transform and agents must adapt. The continuous release of new agents is costly and inefficient. In this work we tackle the key issues in TL to improve knowledge transfer and agents' performance across tasks and to reduce computational costs. The proposed methodology, called FAST (Framework for Adaptive Similarity-based Transfer), leverages visual frames and textual descriptions to create a latent representation of task dynamics, which is exploited to estimate similarity between environments. The similarity scores guide our method in choosing candidate policies from which to transfer abilities, simplifying the learning of novel tasks. Experimental results, over multiple racing tracks, demonstrate that FAST achieves competitive final performance compared to learning-from-scratch methods while requiring significantly fewer training steps. These findings highlight the potential of embedding-driven task similarity estimations.
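The selection step described in this abstract (score candidate source tasks by similarity, then transfer from the best one) can be sketched as follows. The abstract does not say which similarity measure FAST uses, so cosine similarity, the function names, and the track names are assumptions; the embeddings would come from the paper's latent task representation, which is not reproduced here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_source_policy(new_task_emb, source_embs):
    """Return the name of the most similar source task and all similarity scores.

    source_embs maps a (hypothetical) task name to its embedding vector.
    """
    scores = {name: cosine_similarity(new_task_emb, emb)
              for name, emb in source_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

For example, `pick_source_policy([0.9, 0.1], {"track_a": [1.0, 0.0], "track_b": [0.0, 1.0]})` selects `"track_a"`, whose policy would then initialize training on the new track.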
[AI-46] ResCap-DBP: A Lightweight Residual-Capsule Network for Accurate DNA-Binding Protein Prediction Using Global ProteinBERT Embeddings
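【Quick Read】: This paper addresses the accurate identification of DNA-binding proteins (DBPs), a key step in understanding gene regulation and disease mechanisms. Because experimental methods are costly and slow, efficient computational prediction is needed. The core of the solution is ResCap-DBP, a new deep learning framework that combines a residual-learning encoder with a one-dimensional Capsule Network (1D-CapsNet). Dilated convolutions inside the residual blocks mitigate vanishing gradients and extract rich sequence features, while capsule layers with dynamic routing capture hierarchical and spatial relationships in the feature space, so DBPs are predicted directly from raw protein sequences. The method outperforms existing state-of-the-art models on multiple public benchmark datasets, showing strong performance and generalization.
Link: https://arxiv.org/abs/2507.20426
Authors: Samiul Based Shuvo,Tasnia Binte Mamun,U Rajendra Acharya
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Biomolecules (q-bio.BM)
Comments: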
Abstract:DNA-binding proteins (DBPs) are integral to gene regulation and cellular processes, making their accurate identification essential for understanding biological functions and disease mechanisms. Experimental methods for DBP identification are time-consuming and costly, driving the need for efficient computational prediction techniques. In this study, we propose a novel deep learning framework, ResCap-DBP, that combines a residual learning-based encoder with a one-dimensional Capsule Network (1D-CapsNet) to predict DBPs directly from raw protein sequences. Our architecture incorporates dilated convolutions within residual blocks to mitigate vanishing gradient issues and extract rich sequence features, while capsule layers with dynamic routing capture hierarchical and spatial relationships within the learned feature space. We conducted comprehensive ablation studies comparing global and local embeddings from ProteinBERT and conventional one-hot encoding. Results show that ProteinBERT embeddings substantially outperform other representations on large datasets. Although one-hot encoding showed marginal advantages on smaller datasets, such as PDB186, it struggled to scale effectively. Extensive evaluations on four pairs of publicly available benchmark datasets demonstrate that our model consistently outperforms current state-of-the-art methods. It achieved AUC scores of 98.0% and 89.5% on PDB14189 and PDB1075, respectively. On independent test sets PDB2272 and PDB186, the model attained top AUCs of 83.2% and 83.3%, while maintaining competitive performance on larger datasets such as PDB20000. Notably, the model maintains a well-balanced sensitivity and specificity across datasets. These results demonstrate the efficacy and generalizability of integrating global protein representations with advanced deep learning architectures for reliable and scalable DBP prediction in diverse genomic contexts.
[AI-47] MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
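【Quick Read】: This paper addresses the lack of evaluation of how Large Language Models (LLMs) perform spatial navigation without visual cues, which matters for their reliable deployment in robotics and embodied AI. The core of the solution is a new benchmark, MazeEval, which isolates and quantifies pure spatial reasoning through coordinate-based maze navigation tasks. Its key design is a function-calling interface that provides only coordinate feedback and distance-to-wall information, excluding visual input in order to test basic spatial cognition. Experiments show that most models fail through severe looping once mazes exceed 9×9 grids, and that performance drops significantly in Icelandic, indicating that spatial reasoning depends heavily on linguistic patterns in the training data rather than on language-agnostic mechanisms.
Link: https://arxiv.org/abs/2507.20395
Authors: Hafsteinn Einarsson
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: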
Abstract:As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity (5×5 to 15×15 grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI’s O3 achieves perfect navigation for mazes up to size 30×30, other models exhibit catastrophic failure beyond 9×9 mazes, with 100% of failures attributed to excessive looping behavior where models revisit a cell at least 10 times. We document a significant performance degradation in Icelandic, with models solving mazes 3-4 sizes smaller than in English, suggesting spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms. These results have important implications for global deployment of LLM-powered autonomous systems, showing spatial intelligence remains fundamentally constrained by training data availability and highlighting the need for architectural innovations to achieve reliable navigation across linguistic contexts.
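The looping criterion quoted in this abstract (a cell revisited at least 10 times) is easy to check on a recorded trajectory. The sketch below is an illustration, not the benchmark's code: it assumes a path is a list of `(row, col)` tuples and that a "revisit" is any visit to a cell after the first one.

```python
from collections import Counter

def max_revisits(path):
    """Largest number of revisits to any single cell (visits beyond the first)."""
    if not path:
        return 0
    return max(Counter(path).values()) - 1

def is_looping(path, threshold=10):
    """Flag a trajectory as looping if some cell is revisited at least `threshold` times."""
    return max_revisits(path) >= threshold
```

A model that oscillates between two neighbouring cells, for instance, crosses the threshold after eleven visits to the same cell.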
[AI-48] Multi-Agent Reinforcement Learning for Dynamic Mobility Resource Allocation with Hierarchical Adaptive Grouping
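【Quick Read】: This paper addresses two key problems in allocating dynamic mobility resources (such as shared bikes, e-scooters, and ride-sharing vehicles) in urban environments: how to share the allocation policy dynamically and adaptively across agents (which represent regional coordinators), and how to achieve memory-efficient parameter sharing at urban scale. The core of the solution is a multi-agent reinforcement learning framework called Hierarchical Adaptive Grouping-based Parameter Sharing (HAG-PS). Its key techniques are: a hierarchical structure that fuses global and local mobility-resource state information to support dynamic policy sharing; an adaptive grouping mechanism that splits and merges agent groups based on the relative closeness of their encoded trajectories (states, actions, and rewards); and learnable identity (ID) embeddings that enable agent specialization beyond simple parameter copying. Experiments on real-world NYC bike-sharing data (more than 1.2 million trips) show the superior performance of HAG-PS, for example in improved bike availability.
Link: https://arxiv.org/abs/2507.20377
Authors: Farshid Nooshi,Suining He
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 5 pages, UrbComp 2025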
Abstract:Allocating mobility resources (e.g., shared bikes/e-scooters, ride-sharing vehicles) is crucial for rebalancing the mobility demand and supply in urban environments. In this work, we propose a novel multi-agent reinforcement learning framework named Hierarchical Adaptive Grouping-based Parameter Sharing (HAG-PS) for dynamic mobility resource allocation. HAG-PS aims to address two important research challenges regarding multi-agent reinforcement learning for mobility resource allocation: (1) how to dynamically and adaptively share the mobility resource allocation policy (i.e., how to distribute mobility resources) across agents (i.e., representing the regional coordinators of mobility resources); and (2) how to achieve memory-efficient parameter sharing in an urban-scale setting. To address the above challenges, we provide the following novel designs within HAG-PS. To enable dynamic and adaptive parameter sharing, we have designed a hierarchical approach that combines global and local information of the mobility resource states (e.g., distribution of mobility resources). We have developed an adaptive agent grouping approach that splits or merges groups of agents based on the relative closeness of their encoded trajectories (i.e., states, actions, and rewards). We have designed learnable identity (ID) embeddings to enable agent specialization beyond simple parameter copying. We have performed extensive experimental studies based on real-world NYC bike sharing data (a total of more than 1.2 million trips), and demonstrated the superior performance (e.g., improved bike availability) of HAG-PS compared with other baseline approaches.
[AI-49] WBHT: A Generative Attention Architecture for Detecting Black Hole Anomalies in Backbone Networks
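【Quick Read】: This paper addresses the detection of "black hole" (BH) anomalies in communication networks. These anomalies drop packets without triggering failure notifications, disrupting connectivity and causing financial losses. The key to the solution is the Wasserstein Black Hole Transformer (WBHT) framework, which combines generative modeling, sequential learning, and attention mechanisms: a Wasserstein generative adversarial network improves training stability, long short-term memory (LSTM) layers capture long-term dependencies, convolutional layers extract local temporal patterns, and a latent-space encoding mechanism helps distinguish abnormal network behavior. On real-world data, WBHT improves the F1 score over existing models by 1.65% to 58.76%. Its efficiency and its ability to detect previously undetected anomalies make it suitable for proactive monitoring and security in mission-critical networks.
Link: https://arxiv.org/abs/2507.20373
Authors: Kiymet Kaya,Elif Ak,Sule Gunduz Oguducu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: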
Abstract:We propose the Wasserstein Black Hole Transformer (WBHT) framework for detecting black hole (BH) anomalies in communication networks. These anomalies cause packet loss without failure notifications, disrupting connectivity and leading to financial losses. WBHT combines generative modeling, sequential learning, and attention mechanisms to improve BH anomaly detection. It integrates a Wasserstein generative adversarial network with attention mechanisms for stable training and accurate anomaly identification. The model uses long short-term memory (LSTM) layers to capture long-term dependencies and convolutional layers for local temporal patterns. A latent space encoding mechanism helps distinguish abnormal network behavior. Tested on real-world network data, WBHT outperforms existing models, achieving significant improvements in F1 score (ranging from 1.65% to 58.76%). Its efficiency and ability to detect previously undetected anomalies make it a valuable tool for proactive network monitoring and security, especially in mission-critical networks.
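[AI-50] Clustering by Attention: Leveraging Prior Fitted Transformers for Data Partitioning
【Quick Read】: This paper addresses practical challenges of traditional clustering methods: reliance on parameter tuning, high computational complexity, poor interpretability, and insufficient accuracy on large-scale datasets. The key to the solution is a new clustering framework based on meta-learning. A pre-trained Prior-Data Fitted Transformer Network (PFN) uses a few pre-clustered samples as guidance and assigns clusters for the whole dataset in a single forward pass. The method computes attention between the pre-clustered and unclustered samples, learns the relation between them, and infers cluster assignments for the whole dataset, outperforming state-of-the-art techniques without any parameter optimization.
Link: https://arxiv.org/abs/2507.20369
Authors: Ahmed Shokry,Ayman Khalafallah
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: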
Abstract:Clustering is a core task in machine learning with wide-ranging applications in data mining and pattern recognition. However, its unsupervised nature makes it inherently challenging. Many existing clustering algorithms suffer from critical limitations: they often require careful parameter tuning, exhibit high computational complexity, lack interpretability, or yield suboptimal accuracy, especially when applied to large-scale datasets. In this paper, we introduce a novel clustering approach based on meta-learning. Our approach eliminates the need for parameter optimization while achieving accuracy that outperforms state-of-the-art clustering techniques. The proposed technique leverages a few pre-clustered samples to guide the clustering process for the entire dataset in a single forward pass. Specifically, we employ a pre-trained Prior-Data Fitted Transformer Network (PFN) to perform clustering. The algorithm computes attention between the pre-clustered samples and the unclustered samples, allowing it to infer cluster assignments for the entire dataset based on the learned relation. We theoretically and empirically demonstrate that, given just a few pre-clustered examples, the model can generalize to accurately cluster the rest of the dataset. Experiments on challenging benchmark datasets show that our approach can successfully cluster well-separated data without any pre-clustered samples, and significantly improves performance when a few clustered samples are provided. We show that our approach is superior to the state-of-the-art techniques. These results highlight the effectiveness and scalability of our approach, positioning it as a promising alternative to existing clustering techniques.
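As a rough intuition for "attention between the pre-clustered samples and the unclustered samples", the sketch below lets each unclustered point attend over the labelled points and take a softmax-weighted vote on their cluster labels. This is not the pre-trained PFN from the paper: there are no learned weights, and the scaled dot-product scoring and the name `attention_assign` are assumptions for illustration.

```python
import numpy as np

def attention_assign(support_x, support_y, query_x, n_clusters):
    """Assign clusters to query points via one untrained attention step.

    support_x / support_y: the few pre-clustered samples and their cluster ids.
    query_x: unclustered samples. Returns (labels, soft assignment matrix).
    """
    support_x = np.asarray(support_x, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    d = support_x.shape[1]
    # Scaled dot-product attention logits, shape (n_query, n_support).
    logits = query_x @ support_x.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Attention-weighted vote over the one-hot labels of the support samples.
    one_hot = np.eye(n_clusters)[np.asarray(support_y)]
    probs = weights @ one_hot
    return probs.argmax(axis=1), probs
```

All queries are processed in one matrix product, which mirrors the single-forward-pass property the abstract emphasizes, although a trained transformer would replace the raw dot products with learned projections.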
[AI-51] VLMPlanner: Integrating Visual Language Models with Motion Planning ACM-MM2025
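【Quick Read】: This paper addresses the fact that current motion planning methods for autonomous driving based on Large Language Models (LLMs) rely on abstracted perception or map inputs and so miss crucial visual context in complex driving scenes, such as fine-grained road cues, accident aftermath, or unexpected obstacles, all of which matter for robust decisions. The key to the solution is VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) able to process raw images. The VLM captures rich detail from multi-view images and uses its common-sense reasoning to guide the planner toward safe, reliable trajectories. A Context-Adaptive Inference Gate (CAI-Gate) adjusts the VLM's inference frequency to scene complexity, balancing planning performance against computational cost.
Link: https://arxiv.org/abs/2507.20342
Authors: Zhipeng Tang,Sha Zhang,Jiajun Deng,Chenjie Wang,Guoliang You,Yuting Huang,Xinrui Lin,Yanyong Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 3 figures, this paper has been accepted by ACM MM 2025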
Abstract:Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and safe trajectories. Furthermore, we develop the Context-Adaptive Inference Gate (CAI-Gate) mechanism that enables the VLM to mimic human driving behavior by dynamically adjusting its inference frequency based on scene complexity, thereby achieving an optimal balance between planning performance and computational efficiency. We evaluate our approach on the large-scale, challenging nuPlan benchmark, with comprehensive experimental results demonstrating superior planning performance in scenarios with intricate road conditions and dynamic elements. Code will be available.
[AI-52] Cultivating Helpful Personalized and Creative AI Tutors: A Framework for Pedagogical Alignment using Reinforcement Learning
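【Quick Read】: This paper addresses the lack of alignment between current Large Language Models (LLMs) and pedagogical principles in educational settings. Standard LLMs usually act as generic information providers and fall short of core educational goals such as student-centered personalization, fostering creativity, and teaching effectiveness. The key to the solution is the EduAlign framework, which has two stages. First, a dataset of 8,000 educational interactions is built and annotated, both manually and automatically, along three dimensions (Helpfulness, Personalization, and Creativity), and is used to train the multi-dimensional reward model HPC-RM. Second, that reward model is used with Group Relative Policy Optimization (GRPO) to fine-tune a pre-trained LLM, significantly improving its pedagogical alignment and its performance on educational tasks.
Link: https://arxiv.org/abs/2507.20335
Authors: Siyu Song,Wentao Liu,Ye Lu,Ruohua Zhang,Tao Liu,Jinze Lv,Xinyun Wang,Aimin Zhou,Fei Tan,Bo Jiang,Hao Hao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: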
Abstract:The integration of large language models (LLMs) into education presents unprecedented opportunities for scalable personalized learning. However, standard LLMs often function as generic information providers, lacking alignment with fundamental pedagogical principles such as helpfulness, student-centered personalization, and creativity cultivation. To bridge this gap, we propose EduAlign, a novel framework designed to guide LLMs toward becoming more effective and responsible educational assistants. EduAlign consists of two main stages. In the first stage, we curate a dataset of 8k educational interactions and annotate them, both manually and automatically, along three key educational dimensions: Helpfulness, Personalization, and Creativity (HPC). These annotations are used to train HPC-RM, a multi-dimensional reward model capable of accurately scoring LLM outputs according to these educational principles. We further evaluate the consistency and reliability of this reward model. In the second stage, we leverage HPC-RM as a reward signal to fine-tune a pre-trained LLM using Group Relative Policy Optimization (GRPO) on a set of 2k diverse prompts. We then assess the pre- and post-finetuning models on both educational and general-domain benchmarks across the three HPC dimensions. Experimental results demonstrate that the fine-tuned model exhibits significantly improved alignment with pedagogical helpfulness, personalization, and creativity stimulation. This study presents a scalable and effective approach to aligning LLMs with nuanced and desirable educational traits, paving the way for the development of more engaging, pedagogically aligned AI tutors.
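The second stage relies on GRPO, whose central step is to score each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that normalization follows. How EduAlign collapses the three HPC scores into one scalar is not stated in the abstract, so the unweighted mean in `hpc_reward` is an assumption, as are the function names.

```python
import numpy as np

def hpc_reward(helpfulness, personalization, creativity):
    """Collapse three reward-model scores into one scalar (assumed: plain mean)."""
    return (helpfulness + personalization + creativity) / 3.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of responses."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

Responses scoring above the group mean get positive advantages and are reinforced; no separate value network is needed, which is the main practical appeal of the group-relative formulation.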
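[AI-53] The Blessing and Curse of Dimensionality in Safety Alignment
【Quick Read】: This paper addresses challenges in the safety alignment of Large Language Models (LLMs), specifically that the linear structures arising from high-dimensional internal representations can be exploited: attackers can use activation engineering to bypass safety mechanisms and jailbreak the model. The key to the solution is to project the model's representations onto a lower-dimensional subspace, preserving enough information for alignment while removing the exploitable linear structures. Empirical results show that this dimensionality reduction significantly lowers the model's susceptibility to jailbreaking through representation engineering.
Link: https://arxiv.org/abs/2507.20333
Authors: Rachel S.Y. Teo,Laziz U. Abdullaev,Tan M. Nguyen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published as a conference paper at COLM 2025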
Abstract:The focus on safety alignment in large language models (LLMs) has increased significantly due to their widespread adoption across different domains. The scale of LLMs plays a contributing role in their success, and the growth in parameter count follows larger hidden dimensions. In this paper, we hypothesize that while the increase in dimensions has been a key advantage, it may lead to emergent problems as well. These problems emerge as the linear structures in the activation space can be exploited, in the form of activation engineering, to circumvent a model's safety alignment. Through detailed visualizations of linear subspaces associated with different concepts, such as safety, across various model scales, we show that the curse of high-dimensional representations uniquely impacts LLMs. Further substantiating our claim, we demonstrate that projecting the representations of the model onto a lower dimensional subspace can preserve sufficient information for alignment while avoiding those linear structures. Empirical results confirm that such dimensional reduction significantly reduces susceptibility to jailbreaking through representation engineering. Building on our empirical validations, we provide theoretical insights into these linear jailbreaking methods relative to a model’s hidden dimensions. Broadly speaking, our work posits that the high dimensions of a model’s internal representations can be both a blessing and a curse in safety alignment.
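The projection step mentioned in this abstract can be illustrated with plain linear algebra. The abstract does not say how the lower-dimensional subspace is chosen, so the sketch below assumes the top-k principal directions of a batch of hidden states; `pca_basis` and `project` are hypothetical helper names, and the hidden states would come from a real model rather than random data.

```python
import numpy as np

def pca_basis(hidden, k):
    """Orthonormal basis (d, k) spanning the top-k principal directions of `hidden`."""
    hidden = np.asarray(hidden, dtype=float)
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def project(hidden, basis):
    """Orthogonally project hidden states onto the subspace spanned by `basis`."""
    return np.asarray(hidden, dtype=float) @ basis @ basis.T
```

Any steering direction orthogonal to the retained subspace is mapped to zero by `project`, which gives some intuition for why such a projection could blunt activation-engineering attacks; whether enough alignment-relevant information survives is the empirical question the paper studies.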
[AI-54] MIPS: a Multimodal Infinite Polymer Sequence Pre-training Framework for Polymer Property Prediction
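【Quick Read】: This paper addresses the difficulty that existing polymer property prediction models have in capturing how properties change during polymerization. The core challenge is that conventional methods usually build the model from the monomer alone and ignore the topological and spatial information that arises as the polymer forms. The key to the solution is the Multimodal Infinite Polymer Sequence (MIPS) pre-training framework, which represents polymers as infinite monomer sequences and fuses topological and spatial information. On the topological side, it generalizes the message passing mechanism (MPM) and the graph attention mechanism (GAM) to infinite sequences, introducing localized graph attention (LGA) and a "star linking" strategy. On the spatial side, it extracts 3D descriptors of the repeating monomers to capture spatial configuration, and a cross-modal fusion mechanism unifies the two information sources. Experiments show that MIPS achieves state-of-the-art performance on eight different polymer property prediction tasks.
Link: https://arxiv.org/abs/2507.20326
Authors: Jiaxi Wang,Yaosen Min,Xun Zhu,Miao Li,Ji Wu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures, accepted by ACM Multimedia 2025 (oral)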
Abstract:Polymers, composed of repeating structural units called monomers, are fundamental materials in daily life and industry. Accurate property prediction for polymers is essential for their design, development, and application. However, existing modeling approaches, which typically represent polymers by their constituent monomers, struggle to capture the overall properties of a polymer, since the properties change during the polymerization process. In this study, we propose a Multimodal Infinite Polymer Sequence (MIPS) pre-training framework, which represents polymers as infinite sequences of monomers and integrates both topological and spatial information for comprehensive modeling. From the topological perspective, we generalize the message passing mechanism (MPM) and the graph attention mechanism (GAM) to infinite polymer sequences. For MPM, we demonstrate that applying MPM to infinite polymer sequences is equivalent to applying MPM on the induced star-linking graph of monomers. For GAM, we propose to further replace global graph attention with localized graph attention (LGA). Moreover, we show the robustness of the “star linking” strategy through the Repeat and Shift Invariance Test (RSIT). Despite its robustness, the “star linking” strategy exhibits limitations when monomer side chains contain ring structures, a common characteristic of polymers, as it fails the Weisfeiler-Lehman (WL) test. To overcome this issue, we propose backbone embedding to enhance the capability of MPM and LGA on infinite polymer sequences. From the spatial perspective, we extract 3D descriptors of repeating monomers to capture spatial information. Finally, we design a cross-modal fusion mechanism to unify the topological and spatial information. Experimental validation across eight diverse polymer property prediction tasks reveals that MIPS achieves state-of-the-art performance.
[AI-55] Artificial Intelligence In Patent And Market Intelligence: A New Paradigm For Technology Scouting
【Quick Read】: This paper aims to address the inefficiency of technology scouting and solution discovery in industrial research and development (R&D): traditional approaches are manual, time-consuming, and heavily dependent on domain expertise, and they contend with fragmented data sources such as patent repositories, product catalogs, and competitive intelligence, which lead to incomplete information and insufficient insight. The key to the solution is a software platform based on generative AI that uses advanced large language models (LLMs) for semantic understanding, contextual reasoning, and cross-domain knowledge extraction. The platform systematically identifies potential innovations that match the problem context from unstructured patent text, combines them with commercial intelligence to assess feasibility, scalability, and sustainability, and organizes the results under a standardized technical taxonomy, with the goal of improving the efficiency of R&D decision-making and the quality of innovation.
Link: https://arxiv.org/abs/2507.20322
Authors: Manish Verma,Vivek Sharma,Vishal Singh
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Page 4-Figure 1 and Page 11-Figure 2. A preprint describing a system for AI-powered technology scouting
Abstract:This paper presents the development of an AI-powered software platform that leverages advanced large language models (LLMs) to transform technology scouting and solution discovery in industrial R&D. Traditional approaches to solving complex research and development challenges are often time-consuming, manually driven, and heavily dependent on domain-specific expertise. These methods typically involve navigating fragmented sources such as patent repositories, commercial product catalogs, and competitor data, leading to inefficiencies and incomplete insights. The proposed platform utilizes cutting-edge LLM capabilities including semantic understanding, contextual reasoning, and cross-domain knowledge extraction to interpret problem statements and retrieve high-quality, sustainable solutions. The system processes unstructured patent texts, such as claims and technical descriptions, and systematically extracts potential innovations aligned with the given problem context. These solutions are then algorithmically organized under standardized technical categories and subcategories to ensure clarity and relevance across interdisciplinary domains. In addition to patent analysis, the platform integrates commercial intelligence by identifying validated market solutions and active organizations addressing similar challenges. This combined insight sourced from both intellectual property and real-world product data enables R&D teams to assess not only technical novelty but also feasibility, scalability, and sustainability. The result is a comprehensive, AI-driven scouting engine that reduces manual effort, accelerates innovation cycles, and enhances decision-making in complex R&D environments.
[AI-56] A Comparative Study of OpenMP Scheduling Algorithm Selection Strategies
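【Quick Read】: This paper addresses the scheduling algorithm selection problem in the OpenMP parallel programming framework: how to dynamically choose the most suitable scheduling algorithm for a given workload and system so as to improve performance. The key to the solution is the use of learning-based methods, namely expert rule-driven methods and reinforcement learning (RL) methods, and their combination for more efficient, adaptive scheduling decisions. RL methods can learn high-performing policies from the environment but require substantial exploration and depend on the design of the reward function; expert methods use prior knowledge to reduce exploration cost but do not always find the optimal algorithm. Combining expert knowledge with RL-based learning yielded improved performance and greater adaptability across six applications and three systems, indicating that dynamic selection of scheduling algorithms at run time is viable and beneficial for OpenMP programs and can be extended to MPI-based programs.
Link: https://arxiv.org/abs/2507.20312
Authors: Jonas H. Müller Korndörfer,Ali Mohammed,Ahmed Eleliemy,Quentin Guilloteau,Reto Krummenacher,Florina M. Ciorba
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: To appear at IEEE ACCESS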
Abstract:Scientific and data science applications are becoming increasingly complex, with growing computational and memory demands. Modern high performance computing (HPC) systems provide high parallelism and heterogeneity across nodes, devices, and cores. To achieve good performance, effective scheduling and load balancing techniques are essential. Parallel programming frameworks such as OpenMP now offer a variety of advanced scheduling algorithms to support diverse applications and platforms. This creates an instance of the scheduling algorithm selection problem, which involves identifying the most suitable algorithm for a given combination of workload and system characteristics. In this work, we explore learning-based approaches for selecting scheduling algorithms in OpenMP. We propose and evaluate expert-based and reinforcement learning (RL)-based methods, and conduct a detailed performance analysis across six applications and three systems. Our results show that RL methods are capable of learning high-performing scheduling decisions, although they require significant exploration, with the choice of reward function playing a key role. Expert-based methods, in contrast, rely on prior knowledge and involve less exploration, though they may not always identify the optimal algorithm for a specific application-system pair. By combining expert knowledge with RL-based learning, we achieve improved performance and greater adaptability. Overall, this work demonstrates that dynamic selection of scheduling algorithms during execution is both viable and beneficial for OpenMP applications. The approach can also be extended to MPI-based programs, enabling optimization of scheduling decisions across multiple levels of parallelism. 
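The selection problem described in this abstract can be illustrated with a very small bandit loop. The sketch below is a generic epsilon-greedy selector, not the expert-based or RL methods evaluated in the paper (the abstract does not specify them). `static`, `dynamic`, and `guided` are standard OpenMP schedule kinds, while the mean loop times, the 5% measurement noise, and the function name `select_schedule` are made up for illustration.

```python
import random

def select_schedule(mean_times, rounds=200, epsilon=0.1, seed=0):
    """Epsilon-greedy choice among scheduling algorithms (toy simulation)."""
    rng = random.Random(seed)
    names = list(mean_times)
    totals = {n: 0.0 for n in names}   # accumulated observed loop time
    counts = {n: 0 for n in names}     # how often each algorithm was tried
    for _ in range(rounds):
        untried = [n for n in names if counts[n] == 0]
        if untried:
            choice = untried[0]                       # try everything once
        elif rng.random() < epsilon:
            choice = rng.choice(names)                # explore
        else:
            # exploit: lowest average observed time so far
            choice = min(names, key=lambda n: totals[n] / counts[n])
        # Hypothetical noisy measurement of one loop execution.
        observed = mean_times[choice] * rng.uniform(0.95, 1.05)
        totals[choice] += observed
        counts[choice] += 1
    best = min(names, key=lambda n: totals[n] / counts[n])
    return best, counts

# Assumed mean loop times in seconds; not taken from the paper.
best, counts = select_schedule({"static": 1.0, "dynamic": 0.6, "guided": 0.8})
print(best, counts)
```

With these simulated timings the loop settles on `dynamic`. In a real setting the measurement would come from timing an actual parallel loop, and the choice of reward would matter, as the abstract points out.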
[AI-57] Towards Generalized Parameter Tuning in Coherent Ising Machines: A Portfolio-Based Approach
【Quick Read】: This paper addresses the difficulty of hyperparameter tuning for the Chaotic Amplitude Control (CAC) algorithm in Coherent Ising Machines (CIMs): CAC delivers high-quality solutions, but its performance is highly sensitive to a large number of hyperparameters, so an efficient tuning strategy is needed. The key to the solution is an algorithm portfolio approach that incorporates multiple search strategies to adapt flexibly to the characteristics of the hyperparameter space. It includes two representative methods: Method A optimizes the hyperparameters one at a time under a fixed total number of trials, while Method B first prioritizes the hyperparameters based on initial evaluations and then applies Method A in that order. In experiments on the supercomputer "Flow", using planted Wishart instances and Time to Solution (TTS) as the metric, Method A and Method B achieved up to 1.47x and 1.65x improvement, respectively, over the baseline with best-known hyperparameters.
Link: https://arxiv.org/abs/2507.20295
Authors: Tatsuro Hanyu,Takahiro Katagiri,Daichi Mukunoki,Tetsuya Hoshino
Affiliation: unknown
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Coherent Ising Machines (CIMs) have recently gained attention as a promising computing model for solving combinatorial optimization problems. In particular, the Chaotic Amplitude Control (CAC) algorithm has demonstrated high solution quality, but its performance is highly sensitive to a large number of hyperparameters, making efficient tuning essential. In this study, we present an algorithm portfolio approach for hyperparameter tuning in CIMs employing the Chaotic Amplitude Control with momentum (CACm) algorithm. Our method incorporates multiple search strategies, enabling flexible and effective adaptation to the characteristics of the hyperparameter space. Specifically, we propose two representative tuning methods, Method A and Method B. Method A optimizes each hyperparameter sequentially with a fixed total number of trials, while Method B prioritizes hyperparameters based on initial evaluations before applying Method A in order. Performance evaluations were conducted on the Supercomputer “Flow” at Nagoya University, using planted Wishart instances and Time to Solution (TTS) as the evaluation metric. Compared to the baseline performance with best-known hyperparameters, Method A achieved up to 1.47x improvement, and Method B achieved up to 1.65x improvement. These results demonstrate the effectiveness of the algorithm portfolio approach in enhancing the tuning process for CIMs.
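Method A, as the abstract describes it, tunes hyperparameters one after another under a fixed total number of trials. A minimal sketch of that idea follows, assuming a simple grid of candidate values per hyperparameter. The candidate grids, the names `beta` and `gamma`, and the quadratic `toy_tts` objective are hypothetical stand-ins, and the paper's actual search strategies may differ.

```python
def tune_sequentially(objective, start, candidates, total_trials):
    """Tune one hyperparameter at a time, sharing a fixed trial budget."""
    best = dict(start)
    best_score = objective(best)
    trials_used = 1
    for name, values in candidates.items():       # one hyperparameter at a time
        for value in values:
            if trials_used >= total_trials:       # budget exhausted
                return best, best_score, trials_used
            trial = dict(best, **{name: value})   # change only this hyperparameter
            score = objective(trial)
            trials_used += 1
            if score < best_score:                # lower is better, like TTS
                best, best_score = trial, score
    return best, best_score, trials_used

# Hypothetical stand-in for a TTS measurement; minimum at beta=0.3, gamma=0.8.
def toy_tts(params):
    return (params["beta"] - 0.3) ** 2 + (params["gamma"] - 0.8) ** 2

best, score, used = tune_sequentially(
    toy_tts,
    start={"beta": 0.1, "gamma": 0.5},
    candidates={"beta": [0.2, 0.3, 0.4], "gamma": [0.6, 0.8, 1.0]},
    total_trials=10,
)
print(best, score, used)
```

Because Python dictionaries preserve insertion order, something like Method B could be layered on top by reordering `candidates` according to an initial sensitivity estimate before calling the same function.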
[AI-58] Learning from Expert Factors: Trajectory-level Reward Shaping for Formulaic Alpha Mining
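【Quick Read】: This paper addresses the inefficient exploration and unstable training that sparse rewards cause when reinforcement learning (RL) is used to mine formulaic alpha factors. Existing methods are limited by the sparse reward signal of the underlying Markov Decision Process (MDP) and struggle to cover the vast symbolic search space. The paper proposes Trajectory-level Reward Shaping (TLRS), whose core idea is to provide dense intermediate rewards by measuring the subsequence-level similarity between partially generated expressions and expert-designed formulas; a reward centering mechanism is also introduced to reduce training variance. The reported results are a 9.29% increase in the Rank Information Coefficient over existing potential-based shaping algorithms, and a reduction in time complexity with respect to the feature dimension from linear to constant, an improvement over distance-based baselines.
Link: https://arxiv.org/abs/2507.20263
Authors: Junjie Zhao,Chengxi Zhang,Chenkai Wang,Peng Yang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
Comments: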
Abstract:Reinforcement learning (RL) has successfully automated the complex process of mining formulaic alpha factors, for creating interpretable and profitable investment strategies. However, existing methods are hampered by the sparse rewards given the underlying Markov Decision Process. This inefficiency limits the exploration of the vast symbolic search space and destabilizes the training process. To address this, Trajectory-level Reward Shaping (TLRS), a novel reward shaping method, is proposed. TLRS provides dense, intermediate rewards by measuring the subsequence-level similarity between partially generated expressions and a set of expert-designed formulas. Furthermore, a reward centering mechanism is introduced to reduce training variance. Extensive experiments on six major Chinese and U.S. stock indices show that TLRS significantly improves the predictive power of mined factors, boosting the Rank Information Coefficient by 9.29% over existing potential-based shaping algorithms. Notably, TLRS achieves a major leap in computational efficiency by reducing its time complexity with respect to the feature dimension from linear to constant, which is a significant improvement over distance-based baselines.
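To make the idea of a dense, subsequence-based reward concrete, here is a toy sketch. The abstract does not define the similarity measure, so the choice below (longest shared contiguous token run, normalized by the length of the partial expression) is an assumption for illustration only, as are the token names and the function names; it should not be read as the TLRS definition.

```python
def longest_common_run(partial, expert):
    """Length of the longest contiguous token run shared by two sequences."""
    best = 0
    for i in range(len(partial)):
        for j in range(len(expert)):
            k = 0
            while (i + k < len(partial) and j + k < len(expert)
                   and partial[i + k] == expert[j + k]):
                k += 1
            best = max(best, k)
    return best

def shaping_reward(partial, expert_formulas):
    """Dense reward in [0, 1]: best normalized overlap with any expert formula."""
    if not partial:
        return 0.0
    return max(longest_common_run(partial, e) / len(partial)
               for e in expert_formulas)

# Hypothetical expert formula written as a token sequence.
experts = [["close", "open", "sub", "volume", "div"]]
r1 = shaping_reward(["close", "open"], experts)   # both tokens match a run
r2 = shaping_reward(["close", "high"], experts)   # only one token matches
print(r1, r2)
```

The point of such a signal is that every partial expression receives feedback, rather than only completed formulas.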
[AI-59] Protein-SE(3): Benchmarking SE(3)-based Generative Models for Protein Structure Design
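【Quick Read】: This paper addresses the lack of a modular benchmark for protein geometry modeling and structure design, which makes it hard to compare different generative models fairly and evaluate them systematically. The key to the solution is Protein-SE(3), a benchmark built on a unified training framework for SE(3)-based models. It integrates several advanced generative models (DDPM, Score Matching, and Flow Matching methods) and uses the same training dataset and diverse evaluation metrics so that the algorithms can be compared fairly. It also provides a high-level mathematical abstraction that enables fast prototyping of new algorithms without reliance on explicit protein structures.
Link: https://arxiv.org/abs/2507.20243
Authors: Lang Yu,Zhangyang Gao,Cheng Tan,Qin Chen,Jie Zhou,Liang He
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: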
Abstract:SE(3)-based generative models have shown great promise in protein geometry modeling and effective structure design. However, the field currently lacks a modularized benchmark to enable comprehensive investigation and fair comparison of different methods. In this paper, we propose Protein-SE(3), a new benchmark based on a unified training framework, which comprises protein scaffolding tasks, integrated generative models, high-level mathematical abstraction, and diverse evaluation metrics. Recent advanced generative models designed for protein scaffolding, from multiple perspectives like DDPM (Genie1 and Genie2), Score Matching (FrameDiff and RfDiffusion) and Flow Matching (FoldFlow and FrameFlow) are integrated into our framework. All integrated methods are fairly investigated with the same training dataset and evaluation metrics. Furthermore, we provide a high-level abstraction of the mathematical foundations behind the generative models, enabling fast prototyping of future algorithms without reliance on explicit protein structures. Accordingly, we release the first comprehensive benchmark built upon unified training framework for SE(3)-based protein structure design, which is publicly accessible at this https URL.
[AI-60] Improving Subgraph Matching by Combining Algorithms and Graph Neural Networks
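【Quick Read】: This paper addresses the subgraph homomorphism problem: given a graph and a pattern, find a mapping from the pattern to the graph such that adjacent vertices in the pattern are mapped to adjacent vertices in the graph. Unlike subgraph isomorphism, homomorphism allows multiple pattern vertices to map to the same graph vertex, which the authors describe as making the problem more complex. The key to the solution is HFrame, the first graph neural network (GNN)-based framework for subgraph homomorphism, which integrates traditional algorithms with machine learning techniques. HFrame can distinguish more graph pairs in which the pattern is not homomorphic to the graph than standard GNNs, and it comes with a generalization error bound. In experiments on real-world and synthetic graphs, HFrame is up to 101.91 times faster than exact matching algorithms and achieves an average accuracy of 0.962.
Link: https://arxiv.org/abs/2507.20226
Authors: Shuyang Guo,Wenjin Xie,Ping Lu,Ting Deng,Richong Zhang,Jianxin Li,Xiangping Huang,Zhongyi Liu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: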
Abstract:Homomorphism is a key mapping technique between graphs that preserves their structure. Given a graph and a pattern, the subgraph homomorphism problem involves finding a mapping from the pattern to the graph, ensuring that adjacent vertices in the pattern are mapped to adjacent vertices in the graph. Unlike subgraph isomorphism, which requires a one-to-one mapping, homomorphism allows multiple vertices in the pattern to map to the same vertex in the graph, making it more complex. We propose HFrame, the first graph neural network-based framework for subgraph homomorphism, which integrates traditional algorithms with machine learning techniques. We demonstrate that HFrame outperforms standard graph neural networks by being able to distinguish more graph pairs where the pattern is not homomorphic to the graph. Additionally, we provide a generalization error bound for HFrame. Through experiments on both real-world and synthetic graphs, we show that HFrame is up to 101.91 times faster than exact matching algorithms and achieves an average accuracy of 0.962.
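The definition in this abstract can be checked directly on tiny graphs. The sketch below is a brute-force search over all vertex assignments, written only to illustrate the definition for undirected graphs; it is exponential in the pattern size and has nothing to do with how HFrame works. The function name and the example graphs are illustrative choices.

```python
from itertools import product

def find_homomorphism(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """Brute-force search for a map sending every pattern edge to a graph edge."""
    # Treat the graph as undirected: store each edge in both directions.
    adjacent = set(graph_edges) | {(v, u) for u, v in graph_edges}
    for images in product(graph_nodes, repeat=len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, images))
        if all((mapping[u], mapping[v]) in adjacent for u, v in pattern_edges):
            return mapping          # the mapping need not be one-to-one
    return None

single_edge = ([1, 2], [(1, 2)])

# A path a-b-c folds onto a single edge: a and c share the same image.
path_map = find_homomorphism(["a", "b", "c"], [("a", "b"), ("b", "c")], *single_edge)

# A triangle cannot be mapped onto a single edge.
triangle_map = find_homomorphism(
    ["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")], *single_edge)
print(path_map, triangle_map)
```

The path example shows the difference from isomorphism noted in the abstract: two pattern vertices are sent to the same graph vertex, which a one-to-one mapping would forbid.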
[AI-61] StepFun-Prover Preview: Let's Think and Verify Step by Step
【Quick Read】: This paper addresses insufficient reasoning ability and missing tool integration in models for automated theorem proving, in particular the lack of efficient, accurate strategies for generating formal proofs. The key to the solution is a reinforcement learning pipeline built around tool-integrated reasoning, which lets the model iteratively refine proofs in the Lean 4 environment based on real-time feedback, emulating human problem-solving strategies. The model achieves a pass@1 success rate of 70.0% on the miniF2F-test benchmark, and the work introduces an end-to-end training framework for tool-integrated reasoning models aimed at Math AI assistants.
Link: https://arxiv.org/abs/2507.20199
Authors: Shijie Shang,Ruosi Wan,Yue Peng,Yutong Wu,Xiong-hui Chen,Jie Yan,Xiangyu Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 4 figures
Abstract:We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning. Using a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover can achieve strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of 70.0%. Beyond advancing benchmark performance, we introduce an end-to-end training framework for developing tool-integrated reasoning models, offering a promising direction for automated theorem proving and Math AI assistants.
[AI-62] Partial Domain Adaptation via Importance Sampling-based Shift Correction
【Quick Read】: This paper addresses label distribution shift in partial domain adaptation (PDA), where the support set of the source label distribution contains that of the target, and asks how to transfer knowledge effectively and improve generalization on the target domain. Earlier methods correct the shift by weighting source samples, but this cannot explore the latent structure and is prone to over-fitting. The paper proposes importance sampling-based shift correction (IS$^2$C). Its key idea is to build a sampling domain whose label distribution matches that of the target domain and to draw new labeled data from it, which characterizes the latent structure and improves generalization. An optimal transport-based independence criterion is introduced for conditional distribution alignment, and its computation can be adjusted to reduce the complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$. The theoretical analysis shows that IS$^2$C controls the generalization error and that the extent of shift between the source and sampling domains can be connected to that error in an interpretable way.
Link: https://arxiv.org/abs/2507.20191
Authors: Cheng-Jun Guo,Chuan-Xian Ren,You-Wei Luo,Xiao-Lin Xu,Hong Yan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Partial domain adaptation (PDA) is a challenging task in real-world machine learning scenarios. It aims to transfer knowledge from a labeled source domain to a related unlabeled target domain, where the support set of the source label distribution subsumes the target one. Previous PDA works managed to correct the label distribution shift by weighting samples in the source domain. However, the simple reweighting technique cannot explore the latent structure and sufficiently use the labeled data, and then models are prone to over-fitting on the source domain. In this work, we propose a novel importance sampling-based shift correction (IS$^2$C) method, where new labeled data are sampled from a built sampling domain, whose label distribution is supposed to be the same as the target domain, to characterize the latent structure and enhance the generalization ability of the model. We provide theoretical guarantees for IS$^2$C by proving that the generalization error can be sufficiently dominated by IS$^2$C. In particular, by implementing sampling with the mixture distribution, the extent of shift between source and sampling domains can be connected to generalization error, which provides an interpretable way to build IS$^2$C. To improve knowledge transfer, an optimal transport-based independence criterion is proposed for conditional distribution alignment, where the computation of the criterion can be adjusted to reduce the complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ in realistic PDA scenarios. Extensive experiments on PDA benchmarks validate the theoretical results and demonstrate the effectiveness of our IS$^2$C over existing methods.
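The basic mechanism behind sampling-based label shift correction can be sketched in a few lines: draw source samples with probability proportional to the ratio of target to source label frequency, so that source-only classes are never drawn. This is a simplified illustration, not IS$^2$C itself; in particular it assumes the target label distribution is known, whereas in PDA the target domain is unlabeled and that distribution has to be estimated. The data and the function name are hypothetical.

```python
import random
from collections import Counter

def resample_by_label(source, target_label_dist, size, seed=0):
    """Draw labeled source samples so that labels follow target_label_dist."""
    rng = random.Random(seed)
    source_counts = Counter(label for _, label in source)
    n = len(source)
    # Importance weight per sample: p_target(y) / p_source(y).
    # Classes absent from the target get weight 0 and are never drawn.
    weights = [target_label_dist.get(label, 0.0) / (source_counts[label] / n)
               for _, label in source]
    return rng.choices(source, weights=weights, k=size)

# Toy source domain with one class ("car") that the target does not contain.
source = [("x1", "cat"), ("x2", "cat"), ("x3", "dog"), ("x4", "car")]
sampled = resample_by_label(source, {"cat": 0.5, "dog": 0.5}, size=1000)
labels = Counter(label for _, label in sampled)
print(labels)
```

Here the two `cat` samples each get weight 1 and the single `dog` sample gets weight 2, so both classes are drawn about equally often and `car` is excluded.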
[AI-63] High-Performance Parallel Optimization of the Fish School Behaviour on the Setonix Platform Using OpenMP
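【Quick Read】: This paper addresses the need for high-performance parallel algorithms and computing structures in complex, large-scale computations, specifically the efficient parallel optimization of the Fish School Behaviour (FSB) algorithm on the Setonix supercomputing platform. The key to the solution is to use the OpenMP framework to analyze multi-threading parameters systematically, including thread counts, scheduling strategies, and OpenMP constructs, in order to identify patterns and strategies that improve program performance. The results inform the parallelization of FSB on Setonix and serve as a reference for other OpenMP-based parallel computing research.
Link: https://arxiv.org/abs/2507.20173
Authors: Haitian Wang,Long Qin
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: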
Abstract:This paper presents an in-depth investigation into the high-performance parallel optimization of the Fish School Behaviour (FSB) algorithm on the Setonix supercomputing platform using the OpenMP framework. Given the increasing demand for enhanced computational capabilities for complex, large-scale calculations across diverse domains, there’s an imperative need for optimized parallel algorithms and computing structures. The FSB algorithm, inspired by nature’s social behavior patterns, provides an ideal platform for parallelization due to its iterative and computationally intensive nature. This study leverages the capabilities of the Setonix platform and the OpenMP framework to analyze various aspects of multi-threading, such as thread counts, scheduling strategies, and OpenMP constructs, aiming to discern patterns and strategies that can elevate program performance. Experiments were designed to rigorously test different configurations, and our results not only offer insights for parallel optimization of FSB on Setonix but also provide valuable references for other parallel computational research using OpenMP. Looking forward, other factors, such as cache behavior and thread scheduling strategies at micro and macro levels, hold potential for further exploration and optimization.
[AI-64] ASNN: Learning to Suggest Neural Architectures from Performance Distributions
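【Quick Read】: This paper addresses the absence of an explicit mathematical mapping in neural network (NN) architecture design: no closed-form function links network structure to model performance, so design relies mainly on heuristics or search. The key to the solution is the Architecture Suggesting Neural Network (ASNN), a model that learns the relationship between NN architecture and test accuracy and uses it to suggest improved architectures iteratively. ASNN is trained with supervision, taking accuracy as input and architectural parameters as output, and is then refined through repeated cycles of prediction and retraining. In experiments with 2-layer and 3-layer networks, it suggested architectures that outperformed the best results in the original dataset.
Link: https://arxiv.org/abs/2507.20164
Authors: Jinwook Hong
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages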
Abstract:The architecture of a neural network (NN) plays a critical role in determining its performance. However, there is no general closed-form function that maps between network structure and accuracy, making the process of architecture design largely heuristic or search-based. In this study, we propose the Architecture Suggesting Neural Network (ASNN), a model designed to learn the relationship between NN architecture and its test accuracy, and to suggest improved architectures accordingly. To train ASNN, we constructed datasets using TensorFlow-based models with varying numbers of layers and nodes. Experimental results were collected for both 2-layer and 3-layer architectures across a grid of configurations, each evaluated with 10 repeated trials to account for stochasticity. Accuracy values were treated as inputs, and architectural parameters as outputs. The trained ASNN was then used iteratively to predict architectures that yield higher performance. In both 2-layer and 3-layer cases, ASNN successfully suggested architectures that outperformed the best results found in the original training data. Repeated prediction and retraining cycles led to the discovery of architectures with improved mean test accuracies, demonstrating the model’s capacity to generalize the performance-structure relationship. These results suggest that ASNN provides an efficient alternative to random search for architecture optimization, and offers a promising approach toward automating neural network design. “Parts of the manuscript, including text editing and expression refinement, were supported by OpenAI’s ChatGPT. All content was reviewed and verified by the authors.”
[AI-65] Awesome-OL: An Extensible Toolkit for Online Learning
【Quick Read】: This paper addresses the lack of unified, reproducible tool support for algorithm development and practical deployment in online learning research. Online learning must cope with streaming, non-stationary data, and algorithms are hard to compare, which slows research. The key to the solution is Awesome-OL, an extensible Python toolkit that integrates state-of-the-art algorithms and provides unified benchmark datasets and multi-modal visualization. It is built on the open-source scikit-multiflow framework and aims to be user-friendly without compromising research flexibility, supporting efficient development and fair comparison of online learning methods.
Link: https://arxiv.org/abs/2507.20144
Authors: Zeyi Liu,Songqiao Hu,Pengyu Han,Jiaming Liu,Xiao He
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages
Abstract:In recent years, online learning has attracted increasing attention due to its adaptive capability to process streaming and non-stationary data. To facilitate algorithm development and practical deployment in this area, we introduce Awesome-OL, an extensible Python toolkit tailored for online learning research. Awesome-OL integrates state-of-the-art algorithms and provides a unified framework for reproducible comparisons, curated benchmark datasets, and multi-modal visualization. Built upon the scikit-multiflow open-source infrastructure, Awesome-OL emphasizes user-friendly interactions without compromising research flexibility or extensibility. The source code is publicly available at: this https URL.
[AI-66] Concept Learning for Cooperative Multi-Agent Reinforcement Learning
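【Quick Read】: This paper addresses the lack of transparency and interoperability of neural network models in multi-agent reinforcement learning (MARL), in particular the unclear cooperation mechanism that results from black-box networks. The key to the solution is CMQ (Concepts learning for Multi-agent Q-learning), an interpretable value decomposition framework based on concept bottleneck models. Its core innovation is to represent each cooperation concept as a supervised vector, in contrast to the concept-agnostic information flow of conventional end-to-end models. Conditioning individual action values on global state embeddings increases the capacity to represent cooperation, which improves interpretability and controllability while maintaining high performance.
Link: https://arxiv.org/abs/2507.20143
Authors: Zhonghan Ge,Yuanyang Zhu,Chunlin Chen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: IEEE-China Conference on System Simulation Technology and its Applications, 2025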
Abstract:Despite substantial progress in applying neural networks (NN) to multi-agent reinforcement learning (MARL) areas, they still largely suffer from a lack of transparency and interoperability. However, its implicit cooperative mechanism is not yet fully understood due to black-box networks. In this work, we study an interpretable value decomposition framework via concept bottleneck models, which promote trustworthiness by conditioning credit assignment on an intermediate level of human-like cooperation concepts. To address this problem, we propose a novel value-based method, named Concepts learning for Multi-agent Q-learning (CMQ), that goes beyond the current performance-vs-interpretability trade-off by learning interpretable cooperation concepts. CMQ represents each cooperation concept as a supervised vector, as opposed to existing models where the information flowing through their end-to-end mechanism is concept-agnostic. Intuitively, using individual action value conditioning on global state embeddings to represent each concept allows for extra cooperation representation capacity. Empirical evaluations on the StarCraft II micromanagement challenge and level-based foraging (LBF) show that CMQ achieves superior performance compared with the state-of-the-art counterparts. The results also demonstrate that CMQ provides more cooperation concept representation capturing meaningful cooperation modes, and supports test-time concept interventions for detecting potential biases of cooperation mode and identifying spurious artifacts that impact cooperation.
[AI-67] Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech ICML2025
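【Quick Read】: This paper addresses the voice privacy risk that arises when zero-shot text-to-speech (ZS-TTS) systems can replicate a specific speaker's identity, that is, how to selectively remove from pre-trained model parameters the knowledge needed to reproduce unwanted individual voices. The key to the solution is Teacher-Guided Unlearning (TGU), presented as the first machine unlearning framework for ZS-TTS, which incorporates randomness to prevent the model from consistently replicating the forgotten speakers' voices so that unlearned identities remain untraceable. To quantify forgetting, the paper also proposes a new metric, speaker-Zero Retrain Forgetting (spk-ZRF), which measures the model's ability to disregard prompts associated with forgotten speakers. Experiments show that TGU prevents the model from reproducing the designated speakers' voices while maintaining quality for other speakers.
Link: https://arxiv.org/abs/2507.20140
Authors: Taesoo Kim,Jinju Kim,Dongchan Kim,Jong Hwan Ko,Gyeong-Moon Park
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), Vancouver, Canada. PMLR 267, 2025. Authors Jinju Kim and Taesoo Kim contributed equally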
Abstract:The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research to selectively remove the knowledge to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, especially Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers’ voices, assuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF). This assesses the model’s ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers’ voices while maintaining high quality for other speakers. The demo is available at this https URL
[AI-68] Aggregation-aware MLP: An Unsupervised Approach for Graph Message-passing
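【Quick Read】: This paper addresses the limited performance of Graph Neural Networks (GNNs) on heterophilous graphs, which stems in particular from fixed aggregation functions (such as Mean, Max, or Sum) that do not adapt to different graph structures. The key to the solution is an unsupervised framework, the Aggregation-aware Multilayer Perceptron (AMLP), which shifts the paradigm from hand-crafting aggregation functions to making the MLP adaptive to aggregation. The method has two key steps: it first uses a graph reconstruction method that facilitates high-order grouping effects, and it then uses a single-layer network to encode varying degrees of heterophily. The authors report superior performance on node clustering and classification.
Link: https://arxiv.org/abs/2507.20127
Authors: Xuanting Xie,Bingheng Li,Erlin Pan,Zhao Kang,Wenyu Chen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 11 pages, 6 figures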
Abstract:Graph Neural Networks (GNNs) have become a dominant approach to learning graph representations, primarily because of their message-passing mechanisms. However, GNNs typically adopt a fixed aggregator function such as Mean, Max, or Sum without principled reasoning behind the selection. This rigidity, especially in the presence of heterophily, often leads to poor, problem dependent performance. Although some attempts address this by designing more sophisticated aggregation functions, these methods tend to rely heavily on labeled data, which is often scarce in real-world tasks. In this work, we propose a novel unsupervised framework, “Aggregation-aware Multilayer Perceptron” (AMLP), which shifts the paradigm from directly crafting aggregation functions to making MLP adaptive to aggregation. Our lightweight approach consists of two key steps: First, we utilize a graph reconstruction method that facilitates high-order grouping effects, and second, we employ a single-layer network to encode varying degrees of heterophily, thereby improving the capacity and applicability of the model. Extensive experiments on node clustering and classification demonstrate the superior performance of AMLP, highlighting its potential for diverse graph learning scenarios.
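For readers unfamiliar with the fixed aggregators the abstract refers to, the following sketch applies Sum, Mean, and Max to the neighbors of one node, using scalar features for brevity. It illustrates only the conventional message-passing step that AMLP moves away from; the graph, the features, and the function name are made up for the example.

```python
def aggregate(features, neighbors, node, how):
    """Combine the scalar features of a node's neighbors with a fixed aggregator."""
    values = [features[n] for n in neighbors[node]]
    if how == "sum":
        return sum(values)
    if how == "mean":
        return sum(values) / len(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown aggregator: {how}")

# Tiny star graph: node "a" is connected to "b" and "c".
features = {"a": 1.0, "b": 3.0, "c": 5.0}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}

results = {how: aggregate(features, neighbors, "a", how)
           for how in ("sum", "mean", "max")}
print(results)
```

The three aggregators give different summaries of the same neighborhood (8.0, 4.0, and 5.0 here), which is why a fixed choice can suit one graph and not another.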
[AI-69] Packet-Level DDoS Data Augmentation Using Dual-Stream Temporal-Field Diffusion
【Quick Read】: This paper addresses the limited performance of machine learning (ML)-based detection of Distributed Denial of Service (DDoS) attacks caused by the scarcity of high-quality labeled data. Existing synthetic traffic generation methods struggle to capture the complex temporal patterns and spatial distributions of emerging DDoS attacks, so the generated data resembles real traffic poorly and detection accuracy suffers. The key to the solution is Dual-Stream Temporal-Field Diffusion (DSTF-Diffusion), a multi-view, multi-stream network traffic generator based on diffusion models. The field stream uses spatial mapping to align network data characteristics with pre-trained stable diffusion models, translating complex network interactions into a format the diffusion model can process; the spatial stream adopts dynamic temporal modeling to capture the intrinsic temporal patterns of network traffic. The generated data shows higher statistical similarity to real traces and improves performance on downstream tasks.
Link: https://arxiv.org/abs/2507.20115
Authors: Gongli Xi,Ye Tian,Yannan Hu,Yuchao Zhang,Yapeng Niu,Xiangyang Gong
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures
Abstract:In response to Distributed Denial of Service (DDoS) attacks, recent research efforts increasingly rely on Machine Learning (ML)-based solutions, whose effectiveness largely depends on the quality of labeled training datasets. To address the scarcity of such datasets, data augmentation with synthetic traces is often employed. However, current synthetic trace generation methods struggle to capture the complex temporal patterns and spatial distributions exhibited in emerging DDoS attacks. This results in insufficient resemblance to real traces and unsatisfactory detection accuracy when applied to ML tasks. In this paper, we propose Dual-Stream Temporal-Field Diffusion (DSTF-Diffusion), a multi-view, multi-stream network traffic generative model based on diffusion models, featuring two main streams: The field stream utilizes spatial mapping to bridge network data characteristics with pre-trained realms of stable diffusion models, effectively translating complex network interactions into formats that stable diffusion can process, while the spatial stream adopts a dynamic temporal modeling approach, meticulously capturing the intrinsic temporal patterns of network traffic. Extensive experiments demonstrate that data generated by our model exhibits higher statistical similarity to originals compared to current state-of-the-art solutions, and enhances performance on a wide range of downstream tasks.
[AI-70] Online Learning with Probing for Sequential User-Centric Selection
【Quick Read】: This paper addresses sequential decision-making with information acquisition. The core challenge is that resources and rewards are both unknown, and side information must be obtained through costly probing before the subsequent assignment is made. The authors propose the Probing-Augmented User-Centric Selection (PUCS) framework, which covers applications such as ridesharing, wireless scheduling, and content recommendation. For the offline setting, they design a greedy probing algorithm with a constant-factor approximation guarantee $\zeta = (e-1)/(2e-1)$. For the online setting, they propose OLPA, a stochastic combinatorial bandit algorithm with a regret bound of $\mathcal{O}(\sqrt{T} + \ln^2 T)$, and prove a lower bound of $\Omega(\sqrt{T})$, showing that the upper bound is tight up to logarithmic factors.
Link: https://arxiv.org/abs/2507.20112
Authors: Tianyi Xu,Yiting Chen,Henger Li,Zheyong Bian,Emiliano Dall’Anese,Zizhan Zheng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments:
Abstract:We formalize sequential decision-making with information acquisition as the probing-augmented user-centric selection (PUCS) framework, where a learner first probes a subset of arms to obtain side information on resources and rewards, and then assigns K plays to M arms. PUCS covers applications such as ridesharing, wireless scheduling, and content recommendation, in which both resources and payoffs are initially unknown and probing is costly. For the offline setting with known distributions, we present a greedy probing algorithm with a constant-factor approximation guarantee $\zeta = (e-1)/(2e-1)$. For the online setting with unknown distributions, we introduce OLPA, a stochastic combinatorial bandit algorithm that achieves a regret bound $\mathcal{O}(\sqrt{T} + \ln^2 T)$. We also prove a lower bound $\Omega(\sqrt{T})$, showing that the upper bound is tight up to logarithmic factors. Experiments on real-world data demonstrate the effectiveness of our solutions.
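The constants in this abstract are easy to evaluate numerically. The short script below computes the approximation factor (e-1)/(2e-1) and compares the two terms inside the stated regret bound for a few horizons T. It ignores the constants hidden by the O-notation, so it only gives a rough feel for when each term is larger; the function name `regret_terms` is an illustrative choice.

```python
import math

# Offline guarantee stated in the abstract: zeta = (e - 1) / (2e - 1).
zeta = (math.e - 1) / (2 * math.e - 1)

def regret_terms(T):
    """Return the two terms inside the O(sqrt(T) + ln^2 T) regret bound."""
    return math.sqrt(T), math.log(T) ** 2

for T in (100, 10**4, 10**6):
    sqrt_term, log_term = regret_terms(T)
    print(f"T={T:>7}: sqrt(T)={sqrt_term:8.1f}  ln^2(T)={log_term:6.1f}")
print(f"zeta = {zeta:.4f}")
```

The factor evaluates to roughly 0.387. For small horizons the squared-logarithm term is the larger of the two, while for large T the square-root term takes over.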
[AI-71] Learning to Align Human Code Preferences
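【Quick Read】: This paper addresses how to choose the best alignment training strategy for Large Language Models (LLMs) in code generation across different types of code preference scenarios. Existing methods rely mainly on Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), but it is unclear which suits which scenario. Through theoretical analysis and empirical study, the paper hypothesizes that SFT performs better when an objectively verifiable optimal solution exists, while SFT followed by DPO (the SD strategy) is better suited to exploratory scenarios without one. The key to the solution is Adaptive Preference Optimization (APO), which dynamically amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training, adapting to different code preference scenarios.
Link: https://arxiv.org/abs/2507.20109
Authors: Xin Yin,Chao Ni,Liushan Chen,Xiaohu Yang
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: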
Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (SD) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training. Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and SD strategies. Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment scenarios.
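As a reference point for the SFT-versus-DPO comparison above, the following is a minimal sketch of the standard per-pair DPO loss. It is not APO, whose exact update the abstract does not specify. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Inputs are summed token log-probabilities of the chosen and rejected
    responses under the policy and under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response.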
[AI-72] Irredundant k-Fold Cross-Validation
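Summary: This paper targets the instance redundancy of traditional k-fold cross-validation, in which each sample is reused many times for training, so that some samples can influence learning disproportionately, which may encourage overfitting and bias model comparisons. The proposed Irredundant k-fold cross-validation guarantees that each sample is used exactly once for training and once for testing over the whole validation procedure, giving balanced use of the data. The method preserves stratification, is classifier-agnostic, yields less optimistic variance estimates because the training partitions do not overlap, and substantially reduces overall computational cost.
Link: https://arxiv.org/abs/2507.20048
Authors: Jesus S. Aguilar-Ruiz
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: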
Abstract:In traditional k-fold cross-validation, each instance is used $k-1$ times for training and once for testing, leading to redundancy that lets many instances disproportionately influence the learning phase. We introduce Irredundant k-fold cross-validation, a novel method that guarantees each instance is used exactly once for training and once for testing across the entire validation procedure. This approach ensures a more balanced utilization of the dataset, mitigates overfitting due to instance repetition, and enables sharper distinctions in comparative model analysis. The method preserves stratification and remains model-agnostic, i.e., compatible with any classifier. Experimental results demonstrate that it delivers consistent performance estimates across diverse datasets, comparable to k-fold cross-validation, while providing less optimistic variance estimates because training partitions are non-overlapping, and significantly reducing the overall computational cost.
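The abstract does not give the partitioning procedure. One simple scheme that satisfies the stated property (each instance trains exactly once and tests exactly once) is to split the data into k disjoint folds and, in round i, test on fold i while training only on fold (i + 1) mod k. The sketch below assumes this scheme and omits stratification; `irredundant_kfold` is a hypothetical name, not the author's API.

```python
import random

def irredundant_kfold(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index trains once and tests once.

    Assumed scheme (not taken from the paper): round i tests on fold i and
    trains on the single neighbouring fold (i + 1) % k.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # k disjoint folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        yield folds[(i + 1) % k], folds[i]

splits = list(irredundant_kfold(10, 5))
```

In standard k-fold cross-validation the training set of each round is the union of the other k - 1 folds, so every instance is trained on k - 1 times; here each training set is a single fold, which is also why the training partitions do not overlap.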
[AI-73] When Engineering Outruns Intelligence: A Re-evaluation of Instruction-Guided Navigation
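Summary: This paper questions how much large language models (LLMs) actually contribute to ObjectGoal Navigation, where it remains unclear whether the credited “LLM intelligence” improves planning. Using ablation, the authors strip the LLM-related components (the Dynamic Chain-of-Navigation prompt, the open-vocabulary detector, and the Intuition saliency map) and replace them with a simple geometry-only heuristic, the Distance-Weighted Frontier Explorer (DWFE), then add a lightweight language prior (SHF) on top. The geometric heuristic alone markedly improves Success and SPL (Success weighted by Path Length), and the language prior further improves path efficiency. The results suggest that most gains come from frontier geometry rather than LLM semantic reasoning, and that metric-aware prompts or offline semantic graphs are needed before navigation success can be attributed to LLMs.
Link: https://arxiv.org/abs/2507.20021
Authors: Matin Aghaei, Mohammad Ali Alomrani, Yingxue Zhang, Mahdi Biparva
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: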
Abstract:Large language models (LLMs) are often credited with recent leaps in ObjectGoal Navigation, yet the extent to which they improve planning remains unclear. We revisit this question on the HM3D-v1 validation split. First, we strip InstructNav of its Dynamic Chain-of-Navigation prompt, open-vocabulary GLEE detector and Intuition saliency map, and replace them with a simple Distance-Weighted Frontier Explorer (DWFE). This geometry-only heuristic raises Success from 58.0% to 61.1% and lifts SPL from 20.9% to 36.0% over 2,000 validation episodes, outperforming all previous training-free baselines. Second, we add a lightweight language prior (SHF); on a 200-episode subset this yields a further +2% Success and +0.9% SPL while shortening paths by five steps on average. Qualitative trajectories confirm the trend: InstructNav backtracks and times out, DWFE reaches the goal after a few islands, and SHF follows an almost straight route. Our results indicate that frontier geometry, not emergent LLM reasoning, drives most reported gains, and suggest that metric-aware prompts or offline semantic graphs are necessary before attributing navigation success to “LLM intelligence.”
[AI-74] FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging ICML2025
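Summary: This paper studies the loss of generalization in federated learning (FL) caused by data heterogeneity, observing in particular that FedSAM usually performs worse than FedAvg on highly heterogeneous data. It proposes two algorithms: FedSWA, which uses Stochastic Weight Averaging (SWA) to find flatter minima, and FedMoSWA, a momentum-based stochastic controlled weight averaging method designed to better align local and global models. The theoretical analysis shows that the optimization and generalization errors of FedMoSWA are smaller than those of FedSAM and its variants, and experiments on CIFAR10/100 and Tiny ImageNet confirm the advantage of the proposed methods.
Link: https://arxiv.org/abs/2507.20016
Authors: Liu junkang, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, Wei Feng
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025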
Abstract:For federated learning (FL) algorithms such as FedSAM, their generalization capability is crucial for real-world applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called FedSWA), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (FedMoSWA), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for FedSWA and FedMoSWA. We also prove that the optimization and generalization errors of FedMoSWA are smaller than those of their counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts. Open source code at: this https URL.
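Stochastic Weight Averaging keeps a running mean of the weights visited during training. The sketch below shows only that generic averaging step on plain Python lists; it is not FedSWA or FedMoSWA, whose client/server updates and momentum terms are not detailed in the abstract. `swa_update` and the toy snapshots are hypothetical.

```python
def swa_update(avg_weights, new_weights, n_averaged):
    """Return the running mean after folding in one more weight vector.

    avg_weights: mean of the first n_averaged snapshots.
    new_weights: the next snapshot of the model weights.
    """
    return [
        (a * n_averaged + w) / (n_averaged + 1)
        for a, w in zip(avg_weights, new_weights)
    ]

# Toy weight snapshots taken at three points during training.
snapshots = [[1.0, 4.0], [2.0, 2.0], [3.0, 0.0]]
avg = snapshots[0]
for n, w in enumerate(snapshots[1:], start=1):
    avg = swa_update(avg, w, n)
```

The incremental form avoids storing every snapshot: only the current mean and the count are kept.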
[AI-75] Policy-Driven AI in Dataspaces: Taxonomy, Explainability, and Pathways for Compliant Innovation
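Summary: This paper addresses the trade-offs among privacy protection, performance, and policy compliance in AI-driven dataspaces. Its main contribution is a taxonomy that classifies privacy-preserving techniques (Federated Learning, Differential Privacy, Trusted Execution Environments, Homomorphic Encryption, Secure Multi-Party Computation) by privacy level, performance impact, and compliance complexity, considered alongside regulatory frameworks such as GDPR and the EU AI Act and multi-dimensional metrics (latency, throughput, cost overhead, model utility, fairness, and explainability), giving practitioners and researchers a clear basis for decisions. The paper also identifies research gaps, including the lack of standardized privacy-performance KPIs, limited explainable AI for federated ecosystems, and semantic policy enforcement under regulatory fragmentation, and outlines future directions centered on policy-driven alignment, automated compliance validation, standardized benchmarking, and integration with European initiatives such as GAIA-X and IDS.
Link: https://arxiv.org/abs/2507.20014
Authors: Joydeep Chandra, Satyam Kumar Navneet
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: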
Abstract:As AI-driven dataspaces become integral to data sharing and collaborative analytics, ensuring privacy, performance, and policy compliance presents significant challenges. This paper provides a comprehensive review of privacy-preserving and policy-aware AI techniques, including Federated Learning, Differential Privacy, Trusted Execution Environments, Homomorphic Encryption, and Secure Multi-Party Computation, alongside strategies for aligning AI with regulatory frameworks such as GDPR and the EU AI Act. We propose a novel taxonomy to classify these techniques based on privacy levels, performance impacts, and compliance complexity, offering a clear framework for practitioners and researchers to navigate trade-offs. Key performance metrics – latency, throughput, cost overhead, model utility, fairness, and explainability – are analyzed to highlight the multi-dimensional optimization required in dataspaces. The paper identifies critical research gaps, including the lack of standardized privacy-performance KPIs, challenges in explainable AI for federated ecosystems, and semantic policy enforcement amidst regulatory fragmentation. Future directions are outlined, proposing a conceptual framework for policy-driven alignment, automated compliance validation, standardized benchmarking, and integration with European initiatives like GAIA-X, IDS, and Eclipse EDC. By synthesizing technical, ethical, and regulatory perspectives, this work lays the groundwork for developing trustworthy, efficient, and compliant AI systems in dataspaces, fostering innovation in secure and responsible data-driven ecosystems.
[AI-76] Finding Personalized Good-Enough Solutions to Unsatisfiable Stable Roommates Problems
Summary: This paper tackles Stable Roommates problems that have no stable matching, aiming to compute “good-enough” matchings instead. In addition to agents’ habits and habitual preferences, the method takes into account their networks of preferred friends and generates personalized solutions, making matchings more acceptable and useful when no stable solution exists.
Link: https://arxiv.org/abs/2507.20010
Authors: Müge Fidan, Esra Erdem
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Logic in Computer Science (cs.LO)
Comments:
Abstract:The Stable Roommates problems are characterized by the preferences of agents over other agents as roommates. A solution is a partition of the agents into pairs that are acceptable to each other (i.e., they are in the preference lists of each other), and the matching is stable (i.e., there do not exist any two agents who prefer each other to their roommates, and thus block the matching). Motivated by real-world applications, and considering that stable roommates problems do not always have solutions, we continue our studies to compute “good-enough” matchings. In addition to the agents’ habits and habitual preferences, we consider their networks of preferred friends, and introduce a method to generate personalized solutions to stable roommates problems. We illustrate the usefulness of our method with examples and empirical evaluations.
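The stability condition in the abstract (no two agents prefer each other to their assigned roommates) can be checked directly. The sketch below assumes preferences are given as ranked lists and that an unmatched agent prefers any acceptable partner to staying alone; `blocking_pairs` and the example data are illustrative, not from the paper.

```python
def blocking_pairs(prefs, matching):
    """Return the pairs (a, b) that block a roommate matching.

    prefs: dict agent -> list of acceptable agents, most preferred first.
    matching: dict agent -> roommate (symmetric); unmatched agents are absent.
    """
    def prefers(agent, candidate):
        # True if `agent` would rather room with `candidate` than with
        # the current roommate (or has no acceptable roommate at present).
        ranking = prefs[agent]
        if candidate not in ranking:
            return False
        current = matching.get(agent)
        if current is None or current not in ranking:
            return True
        return ranking.index(candidate) < ranking.index(current)

    agents = sorted(prefs)
    return [
        (a, b)
        for i, a in enumerate(agents)
        for b in agents[i + 1:]
        if matching.get(a) != b and prefers(a, b) and prefers(b, a)
    ]

# Four agents whose top choices form a cycle a -> b -> c -> a.
prefs = {
    "a": ["b", "c", "d"],
    "b": ["c", "a", "d"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
}
matching = {"a": "b", "b": "a", "c": "d", "d": "c"}
blocked = blocking_pairs(prefs, matching)
```

For this matching, agents b and c each prefer the other to their assigned roommate, so the pair blocks it. A "good-enough" approach could, for example, compare candidate matchings by their number of blocking pairs, though the abstract does not state which criterion the authors use.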
[AI-77] Robust Taxi Fare Prediction Under Noisy Conditions: A Comparative Study of GAT, TimesNet, and XGBoost
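Summary: This paper studies accurate taxi fare prediction for urban mobility platforms, focusing on how model performance degrades with noisy data and out-of-distribution (OOD) samples. It systematically compares three machine learning models, Graph Attention Networks (GAT), XGBoost, and TimesNet, on raw (noisy) and denoised data, and explores pre-processing strategies (KNN imputation, Gaussian noise injection, and autoencoder-based denoising). The models are evaluated on predictive accuracy, calibration, uncertainty estimation, OOD robustness, and feature sensitivity. The study reveals key differences between classical and deep learning models under realistic conditions and offers practical guidance for building robust, scalable fare prediction systems.
Link: https://arxiv.org/abs/2507.20008
Authors: Padmavathi Moorthy (SUNY Buffalo)
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures, prepared with LaTeX, GitHub link: this https URL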
Abstract:Precise fare prediction is crucial in ride-hailing platforms and urban mobility systems. This study examines three machine learning models, Graph Attention Networks (GAT), XGBoost, and TimesNet, to evaluate their predictive capabilities for taxi fares using a real-world dataset comprising over 55 million records. Both raw (noisy) and denoised versions of the dataset are analyzed to assess the impact of data quality on model performance. The study evaluates the models along multiple axes, including predictive accuracy, calibration, uncertainty estimation, out-of-distribution (OOD) robustness, and feature sensitivity. We also explore pre-processing strategies, including KNN imputation, Gaussian noise injection, and autoencoder-based denoising. The study reveals critical differences between classical and deep learning models under realistic conditions, offering practical guidelines for building robust and scalable models in urban fare prediction systems.
[AI-78] Matching Game Preferences Through Dialogical Large Language Models: A Perspective
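Summary: This perspective paper addresses the limited transparency and personalization of current generative AI systems in understanding human conversations and modeling preferences, and asks how AI reasoning can be made interpretable and traceable to improve user trust. It proposes a framework of Dialogical Large Language Models (D-LLMs) that embeds user preferences into the model’s decision-making and combines them with GRAPHYP’s search experience networks. The framework has three components: (1) reasoning processes that analyze different search experiences and guide performance, (2) classification systems that identify user preference patterns, and (3) dialogue approaches that help humans resolve conflicting information. The goal is an interpretable AI system in which users can see how responses are produced from multiple parties’ preferences.
Link: https://arxiv.org/abs/2507.20000
Authors: Renaud Fabre, Daniel Egret, Patrice Bellot
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 28 pages, 1 figure. Published in Applied Sciences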
Abstract:This perspective paper explores the future potential of “conversational intelligence” by examining how Large Language Models (LLMs) could be combined with GRAPHYP’s network system to better understand human conversations and preferences. Using recent research and case studies, we propose a conceptual framework that could make AI reasoning transparent and traceable, allowing humans to see and understand how AI reaches its conclusions. We present the conceptual perspective of “Matching Game Preferences through Dialogical Large Language Models (D-LLMs),” a proposed system that would allow multiple users to share their different preferences through structured conversations. This approach envisions personalizing LLMs by embedding individual user preferences directly into how the model makes decisions. The proposed D-LLM framework would require three main components: (1) reasoning processes that could analyze different search experiences and guide performance, (2) classification systems that would identify user preference patterns, and (3) dialogue approaches that could help humans resolve conflicting information. This perspective framework aims to create an interpretable AI system where users could examine, understand, and combine the different human preferences that influence AI responses, detected through GRAPHYP’s search experience networks. The goal of this perspective is to envision AI systems that would not only provide answers but also show users how those answers were reached, making artificial intelligence more transparent and trustworthy for human decision-making.
[AI-79] CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints
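Summary: This paper addresses the limited generality of home service robots across clothes types (T-shirts, shorts, skirts, long dresses, and so on) and manipulation tasks (folding, flattening, hanging, and so on), which stems from the complex, high-dimensional geometry of clothes. The key idea is semantic keypoints, a sparse spatial-semantic representation (e.g., “left sleeve”, “right shoulder”) that can be reliably extracted from RGB-D images and serves as an intermediate representation linking high-level task planning to low-level action execution. CLASP uses vision language models (VLMs) to predict task plans over the semantic keypoints and executes them with a pre-built manipulation skill library, achieving strong generalization across clothes types and tasks.
Link: https://arxiv.org/abs/2507.19983
Authors: Yuhong Deng, Chao Tang, Cunjun Yu, Linfeng Li, David Hsu
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: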
Abstract:Clothes manipulation, such as folding or hanging, is a critical capability for home service robots. Despite recent advances, most existing methods remain limited to specific tasks and clothes types, due to the complex, high-dimensional geometry of clothes. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation over different clothes types (T-shirts, shorts, skirts, long dresses, …) as well as different tasks (folding, flattening, hanging, …). The core idea of CLASP is semantic keypoints (e.g., “left sleeve”, “right shoulder”), a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be reliably extracted from RGB-D images and provide an effective intermediate representation of clothes manipulation policies. CLASP uses semantic keypoints to bridge high-level task planning and low-level action execution. At the high level, it exploits vision language models (VLMs) to predict task plans over the semantic keypoints. At the low level, it executes the plans with the help of a simple pre-built manipulation skill library. Extensive simulation experiments show that CLASP outperforms state-of-the-art baseline methods on multiple tasks across diverse clothes types, demonstrating strong performance and generalization. Further experiments with a Franka dual-arm system on four distinct tasks (folding, flattening, hanging, and placing) confirm CLASP’s performance on a real robot.
[AI-80] A roadmap for AI in robotics
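Summary: This paper considers how AI can be applied effectively in robotics to meet the particular challenges of acting and sensing in the physical world and to advance the deployment of robots in daily life. The central questions are which of the many available network architectures and learning models are best suited to robots, how to adapt them to specific robot designs, tasks, and environments, and how to overcome obstacles such as insufficiently diverse data, weak generalization, biased prediction of human behavior in human-robot collaboration, and the lack of explainability in control decisions. The proposed roadmap calls for AI algorithms that are tailored to robotics yet generic enough to transfer across platforms and applications, and for lifelong learning with safe deployment and sustainable computational costs.
Link: https://arxiv.org/abs/2507.19975
Authors: Aude Billard, Alin Albu-Schaeffer, Michael Beetz, Wolfram Burgard, Peter Corke, Matei Ciocarlie, Ravinder Dahiya, Danica Kragic, Ken Goldberg, Yukie Nagai, Davide Scaramuzza
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: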
Abstract:AI technologies, including deep learning and large language models, have gone from one breakthrough to the next. As a result, we are witnessing growing excitement in robotics at the prospect of leveraging the potential of AI to tackle some of the outstanding barriers to the full deployment of robots in our daily lives. However, action and sensing in the physical world pose greater and different challenges than analysing data in isolation. As the development and application of AI in robotic products advances, it is important to reflect on which technologies, among the vast array of network architectures and learning models now available in the AI field, are most likely to be successfully applied to robots; how they can be adapted to specific robot designs, tasks, and environments; and which challenges must be overcome. This article offers an assessment of what AI for robotics has achieved since the 1990s and proposes a short- and medium-term research roadmap listing challenges and promises. These range from keeping large datasets up to date and representative of the diversity of tasks robots may have to perform and of the environments they may encounter, to designing AI algorithms tailored specifically to robotics problems but generic enough to apply to a wide range of applications and transfer easily to a variety of robotic platforms. For robots to collaborate effectively with humans, they must predict human behavior without relying on bias-based profiling. Explainability and transparency in AI-driven robot control are not optional but essential for building trust, preventing misuse, and attributing responsibility in accidents. We close on what we view as the primary long-term challenges, that is, to design robots capable of lifelong learning, while guaranteeing safe deployment and usage, and sustainable computational costs.
[AI-81] Digital Twin Channel-Enabled Online Resource Allocation for 6G: Principle Architecture and Application
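Summary: This paper addresses flexible, low-latency, and reliable resource allocation in 6G networks for emerging applications such as holographic communication, autonomous driving, and the industrial Internet of Things. Conventional methods based on statistical modeling struggle to reach optimal performance in dynamic environments, and acquiring real-time channel state information (CSI) usually requires heavy pilot overhead. The proposed solution is an online optimization framework based on a digital twin channel (DTC): the DTC predicts CSI from environmental sensing, and lightweight game-theoretic algorithms use the predicted CSI for timely online resource allocation. Simulations show throughput improvements of up to 11.5% over pilot-based ideal CSI schemes while reducing pilot overhead.
Link: https://arxiv.org/abs/2507.19974
Authors: Tongjie Li, Jianhua Zhang, Li Yu, Yuxiang Zhang, Yunlong Cai, Fan Xu, Guangyi Liu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: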
Abstract:Emerging applications such as holographic communication, autonomous driving, and the industrial Internet of Things impose stringent requirements on flexible, low-latency, and reliable resource allocation in 6G networks. Conventional methods, which rely on statistical modeling, have proven effective in general contexts but may fail to achieve optimal performance in specific and dynamic environments. Furthermore, acquiring real-time channel state information (CSI) typically requires excessive pilot overhead. To address these challenges, a digital twin channel (DTC)-enabled online optimization framework is proposed, in which DTC is employed to predict CSI based on environmental sensing. The predicted CSI is then utilized by lightweight game-theoretic algorithms to perform online resource allocation in a timely and efficient manner. Simulation results based on a digital replica of a realistic industrial workshop demonstrate that the proposed method achieves throughput improvements of up to 11.5% compared with pilot-based ideal CSI schemes, validating its effectiveness for scalable, low-overhead, and environment-aware communication in future 6G networks.
[AI-82] Dimer-Enhanced Optimization: A First-Order Approach to Escaping Saddle Points in Neural Network Training
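Summary: This paper addresses the difficulty that first-order optimizers such as SGD and Adam have in escaping saddle points and flat regions of the loss landscape when training deep neural networks, which can slow or stall training. It proposes Dimer-Enhanced Optimization (DEO), which adapts the Dimer method used in molecular dynamics to locate saddle points: two closely spaced points are used to estimate local curvature from gradient information alone and to approximate the Hessian’s smallest eigenvector without computing the full matrix. The gradient is then periodically projected onto the subspace orthogonal to this minimum-curvature direction, steering the optimizer away from saddle points and flat regions with non-stepwise updates.
Link: https://arxiv.org/abs/2507.19968
Authors: Yue Hu, Zanxia Cao, Yingchao Liu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 8 pages, 2 figures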
Abstract:First-order optimization methods, such as SGD and Adam, are widely used for training large-scale deep neural networks due to their computational efficiency and robust performance. However, relying solely on gradient information, these methods often struggle to navigate complex loss landscapes with flat regions, plateaus, and saddle points. Second-order methods, which use curvature information from the Hessian matrix, can address these challenges but are computationally infeasible for large models. The Dimer method, a first-order technique that constructs two closely spaced points to probe the local geometry of a potential energy surface, efficiently estimates curvature using only gradient information. Inspired by its use in molecular dynamics simulations for locating saddle points, we propose Dimer-Enhanced Optimization (DEO), a novel framework to escape saddle points in neural network training. DEO adapts the Dimer method to explore a broader region of the loss landscape, approximating the Hessian’s smallest eigenvector without computing the full matrix. By periodically projecting the gradient onto the subspace orthogonal to the minimum curvature direction, DEO guides the optimizer away from saddle points and flat regions, enhancing training efficiency with non-stepwise updates. Preliminary experiments on a Transformer toy model show DEO achieves competitive performance compared to standard first-order methods, improving navigation of complex loss landscapes. Our work repurposes physics-inspired, first-order curvature estimation to enhance neural network training in high-dimensional spaces.
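The two ingredients named in the abstract, gradient-only curvature estimation from two nearby points and projection of the gradient orthogonal to a chosen direction, can be sketched on a toy saddle. This is an illustration under assumptions, not the DEO algorithm: the full Dimer method also rotates the point pair to search for the minimum-curvature direction, which is omitted here, and all function names are hypothetical.

```python
import math

def directional_curvature(grad, x, direction, eps=1e-3):
    """Estimate n^T H n from two gradient evaluations at x +/- eps * n."""
    norm = math.sqrt(sum(d * d for d in direction))
    n = [d / norm for d in direction]
    g_plus = grad([xi + eps * ni for xi, ni in zip(x, n)])
    g_minus = grad([xi - eps * ni for xi, ni in zip(x, n)])
    # Central difference of the gradient, projected onto the direction.
    return sum((gp - gm) * ni for gp, gm, ni in zip(g_plus, g_minus, n)) / (2 * eps)

def project_out(vector, direction):
    """Remove the component of `vector` along `direction`."""
    norm_sq = sum(d * d for d in direction)
    coeff = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - coeff * d for v, d in zip(vector, direction)]

# Toy saddle f(x, y) = x^2 - y^2 with gradient (2x, -2y).
def saddle_grad(p):
    return [2.0 * p[0], -2.0 * p[1]]

point = [0.5, 0.5]
curv_x = directional_curvature(saddle_grad, point, [1.0, 0.0])  # positive curvature
curv_y = directional_curvature(saddle_grad, point, [0.0, 1.0])  # negative curvature
projected = project_out(saddle_grad(point), [0.0, 1.0])
```

On this quadratic the estimates recover the Hessian's diagonal entries (2 and -2), so the y axis is the minimum-curvature direction, and the projected gradient retains only its x component.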
[AI-83] What Does Human-Centred AI Mean?
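Summary: This paper argues that current AI research and engineering often misunderstand “human-centred” AI by overlooking the central role of human cognition in AI systems. It reframes AI as a relationship between technology and humans in which artifacts appear to perform human cognitive labour, and introduces an analysis of such sociotechnical relationships into displacement, enhancement, and replacement of that labour. It stresses that all AI implicates human cognition and that facing the human-in-the-loop is necessary to de-fetishise AI and to truly centre humans in the design of AI systems.
Link: https://arxiv.org/abs/2507.19960
Authors: Olivia Guest
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: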
Abstract:While it seems sensible that human-centred artificial intelligence (AI) means centring “human behaviour and experience,” it cannot be any other way. AI, I argue, is usefully seen as a relationship between technology and humans where it appears that artifacts can perform, to a greater or lesser extent, human cognitive labour. This is evinced using examples that juxtapose technology with cognition, inter alia: abacus versus mental arithmetic; alarm clock versus knocker-upper; camera versus vision; and sweatshop versus tailor. Using novel definitions and analyses, sociotechnical relationships can be analysed into varying types of: displacement (harmful), enhancement (beneficial), and/or replacement (neutral) of human cognitive labour. Ultimately, all AI implicates human cognition; no matter what. Obfuscation of cognition in the AI context – from clocks to artificial neural networks – results in distortion, in slowing critical engagement, perverting cognitive science, and indeed in limiting our ability to truly centre humans and humanity in the engineering of AI systems. To even begin to de-fetishise AI, we must look the human-in-the-loop in the eyes.
[AI-84] CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation
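Summary: This paper addresses the limited ability of large language models (LLMs) to generate code for cross-programming-language (CPL) interoperability, a capability that has not been systematically evaluated, in particular for components that cooperate through inter-process communication (IPC). It introduces CrossPL, the first benchmark dedicated to CPL interoperability, with 1,982 IPC-centered tasks covering six widely used programming languages and seven representative CPL techniques. The benchmark is built by (i) analyzing 19,169 multi-language GitHub repositories with 156 hand-crafted finite state machines (FSMs) to extract real-world CPL code patterns, and (ii) an LLM-based pipeline that extracts code snippets, generates task instructions, and validates functional correctness.
Link: https://arxiv.org/abs/2507.19904
Authors: Zhanhang Xiong, Dongxia Wang, Yuekang Li, Xinyuan An, Wenhai Wang
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: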
Abstract:As large language models (LLMs) become increasingly embedded in software engineering workflows, a critical capability remains underexplored: generating correct code that enables cross-programming-language (CPL) interoperability. This skill is essential for building complex systems that integrate components written in multiple languages via mechanisms like inter-process communication (IPC). To bridge this gap, we present CrossPL, the first benchmark designed to systematically evaluate LLMs’ ability to generate CPL-interoperating code. CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used programming languages and seven representative CPL techniques. We construct this benchmark by (i) analyzing 19,169 multi-language GitHub repositories using 156 hand-crafted finite state machines (FSMs), and (ii) developing an LLM-based pipeline that automatically extracts CPL code snippets, generates task instructions, and validates functional correctness. We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation. Results reveal that even the best-performing models struggle with CPL scenarios, underscoring the need for more targeted research in this space. Our benchmark and code are available at: this https URL.
[AI-85] AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation
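Summary: This paper addresses the inefficiency of software development that arises from task complexity and the need for multi-role collaboration, which is hard to automate and scale with manual workflows. It proposes AgentMesh, a Python framework in which multiple cooperating agents powered by large language models (LLMs) automate the development process. The agents are a Planner, a Coder, a Debugger, and a Reviewer: the Planner decomposes high-level requirements into concrete subtasks, the Coder implements them, the Debugger tests and fixes the code, and the Reviewer validates the final output for correctness and quality. Dividing responsibilities in this way draws on the strengths of LLMs at each stage while mitigating the limitations of a single agent.
Link: https://arxiv.org/abs/2507.19902
Authors: Sourena Khanzadeh
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: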
Abstract:Software development is a complex, multi-phase process traditionally requiring collaboration among individuals with diverse expertise. We propose AgentMesh, a Python-based framework that uses multiple cooperating LLM-powered agents to automate software development tasks. In AgentMesh, specialized agents (a Planner, Coder, Debugger, and Reviewer) work in concert to transform a high-level requirement into fully realized code. The Planner agent first decomposes user requests into concrete subtasks; the Coder agent implements each subtask in code; the Debugger agent tests and fixes the code; and the Reviewer agent validates the final output for correctness and quality. We describe the architecture and design of these agents and their communication, and provide implementation details including prompt strategies and workflow orchestration. A case study illustrates AgentMesh handling a non-trivial development request via sequential task planning, code generation, iterative debugging, and final code review. We discuss how dividing responsibilities among cooperative agents leverages the strengths of large language models while mitigating single-agent limitations. Finally, we examine current limitations, such as error propagation and context scaling, and outline future work toward more robust, scalable multi-agent AI systems for software engineering automation.
[AI-86] TS-Insight: Visualizing Thompson Sampling for Verification and XAI IEEE-VIS2025
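Summary: This paper addresses the difficulty of debugging and interpreting Bayesian Multi-Armed Bandit (MAB) algorithms such as Thompson Sampling (TS), whose probabilistic decisions make them a “black box” and reduce developers’ understanding of, and trust in, their exploration-exploitation dynamics. It presents TS-Insight, a visual analytics tool for model developers that uses multiple views to trace, for each arm, the evolving posterior, evidence counts, and sampling outcomes, supporting verification, diagnosis, and explainability, especially in sensitive domains that require interpretable decisions.
Link: https://arxiv.org/abs/2507.19898
Authors: Parsa Vares, Éloi Durant, Jun Pang, Nicolas Médoc, Mohammad Ghoniem
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted as a poster at IEEE VIS 2025 (“TS-Insight: Visual Fingerprinting of Multi-Armed Bandits”). Open-source tool available at this https URL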
Abstract:Thompson Sampling (TS) and its variants are powerful Multi-Armed Bandit algorithms used to balance exploration and exploitation strategies in active learning. Yet, their probabilistic nature often turns them into a “black box”, hindering debugging and trust. We introduce TS-Insight, a visual analytics tool explicitly designed to shed light on the internal decision mechanisms of Thompson Sampling-based algorithms, for model developers. It comprises multiple plots, tracing for each arm the evolving posteriors, evidence counts, and sampling outcomes, enabling the verification, diagnosis, and explainability of exploration/exploitation dynamics. This tool aims at fostering trust and facilitating effective debugging and deployment in complex binary decision-making scenarios, especially in sensitive domains requiring interpretable decision-making.
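To make concrete what "evolving posteriors, evidence counts, and sampling outcomes" refer to, here is a minimal Beta-Bernoulli Thompson Sampling loop that records those quantities at every step. It is a generic sketch, not TS-Insight's code or data format; `thompson_trace`, the record keys, and the arm probabilities are assumptions.

```python
import random

def thompson_trace(true_probs, steps, seed=0):
    """Run Beta-Bernoulli Thompson Sampling and log per-step quantities.

    Returns one record per step: the chosen arm, the posterior samples,
    the reward, and the (alpha, beta) evidence counts after the update.
    """
    rng = random.Random(seed)
    alpha = [1] * len(true_probs)  # 1 + observed successes per arm
    beta = [1] * len(true_probs)   # 1 + observed failures per arm
    trace = []
    for _ in range(steps):
        # Draw one sample from each arm's Beta posterior; play the largest.
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        arm = max(range(len(samples)), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        trace.append({
            "arm": arm, "samples": samples, "reward": reward,
            "alpha": list(alpha), "beta": list(beta),
        })
    return trace

trace = thompson_trace([0.2, 0.8], steps=200)
```

Plotting `alpha` and `beta` per arm over time shows how evidence accumulates, while comparing `samples` with the chosen `arm` shows whether a pull was driven by a confident posterior or by an exploratory draw.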
[AI-87] Causality-aligned Prompt Learning via Diffusion-based Counterfactual Generation
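Summary: This paper addresses the difficulty existing prompt learning methods have, owing to weak theoretical foundations, in obtaining causally invariant prompts, which prevents them from capturing robust features that generalize across categories. It proposes DiCap, a diffusion-based counterfactual prompt learning framework that iteratively samples gradients from the marginal and conditional distributions of the causal model to generate counterfactuals satisfying the minimal sufficiency criterion, with theoretical guarantees on the identifiability of counterfactual outcomes and strict bounds on estimation error. A contrastive learning framework then uses the generated counterfactuals to extract prompts aligned with the causal features of the data.
Link: https://arxiv.org/abs/2507.19882
Authors: Xinshu Li, Ruoyu Wang, Erdun Gao, Mingming Gong, Lina Yao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: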
Abstract:Prompt learning has garnered attention for its efficiency over traditional model training and fine-tuning. However, existing methods, constrained by inadequate theoretical foundations, encounter difficulties in achieving causally invariant prompts, ultimately falling short of capturing robust features that generalize effectively across categories. To address these challenges, we introduce the DiCap model, a theoretically grounded Diffusion-based Counterfactual prompt learning framework, which leverages a diffusion process to iteratively sample gradients from the marginal and conditional distributions of the causal model, guiding the generation of counterfactuals that satisfy the minimal sufficiency criterion. Grounded in rigorous theoretical derivations, this approach guarantees the identifiability of counterfactual outcomes while imposing strict bounds on estimation errors. We further employ a contrastive learning framework that leverages the generated counterfactuals, thereby enabling the refined extraction of prompts that are precisely aligned with the causal features of the data. Extensive experimental results demonstrate that our method performs excellently across tasks such as image classification, image-text retrieval, and visual question answering, with particularly strong advantages in unseen categories.
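[AI-88] Trivial Trojans: How Minimal MCP Servers Enable Cross-Tool Exfiltration of Sensitive Data WWW
[Quick Read]: This paper addresses new security vulnerabilities introduced by the Model Context Protocol (MCP) as it enables seamless communication between AI agents and external services, in particular how low-skill attackers can exploit MCP's trust model for cross-server data theft. The key contribution is to show that the implicit trust relationships in current MCP implementations create a cross-server attack surface: even when each MCP server appears trustworthy on its own, their combination can be exploited. For example, a malicious MCP server posing as a weather service can discover and invoke legitimate banking tools to steal a user's account balances. The study finds that basic programming ability (undergraduate-level Python) is enough to build an effective social engineering attack, and it therefore calls for immediate mitigations and protocol design improvements against such low-barrier, high-impact attacks.
Link: https://arxiv.org/abs/2507.19880
Authors: Nicola Croce,Tobin South
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Abstract submitted to the Technical AI Governance Forum 2025 ( this https URL )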
Abstract:The Model Context Protocol (MCP) represents a significant advancement in AI-tool integration, enabling seamless communication between AI agents and external services. However, this connectivity introduces novel attack vectors that remain largely unexplored. This paper demonstrates how unsophisticated threat actors, requiring only basic programming skills and free web tools, can exploit MCP’s trust model to exfiltrate sensitive financial data. We present a proof-of-concept attack where a malicious weather MCP server, disguised as benign functionality, discovers and exploits legitimate banking tools to steal user account balances. The attack chain requires no advanced technical knowledge, server infrastructure, or monetary investment. The findings reveal a critical security gap in the emerging MCP ecosystem: while individual servers may appear trustworthy, their combination creates unexpected cross-server attack surfaces. Unlike traditional cybersecurity threats that assume sophisticated adversaries, our research shows that the barrier to entry for MCP-based attacks is alarmingly low. A threat actor with undergraduate-level Python knowledge can craft convincing social engineering attacks that exploit the implicit trust relationships MCP establishes between AI agents and tool providers. This work contributes to the nascent field of MCP security by demonstrating that current MCP implementations allow trivial cross-server attacks and proposing both immediate mitigations and protocol improvements to secure this emerging ecosystem.
[AI-89] VAE-GAN Based Price Manipulation in Coordinated Local Energy Markets
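[Quick Read]: This paper addresses how prosumers with heterogeneous distributed energy resources (DERs) can coordinate efficiently and make real-time decisions in a local energy market (LEM). The core challenge is ensuring trading efficiency and fairness under dynamic multi-party interaction, especially robustness when there is a risk of price manipulation. The key to the solution is a data-driven, model-free reinforcement learning framework based on multi-agent deep deterministic policy gradient (MADDPG), which lets prosumers decide autonomously under uncertainty whether to buy, sell, or take no action, thereby optimizing local energy trading. The paper also introduces a price manipulation mechanism modeled with a variational autoencoder-generative adversarial network (VAE-GAN), showing that utilities can adjust price signals to inflict financial losses on prosumer groups that lack generation capability. It further indicates that the proposed LEMs, across different sizes, can achieve stable and fair trading through emergent cooperation among agents.
Link: https://arxiv.org/abs/2507.19844
Authors: Biswarup Mukherjee,Li Zhou,S. Gokul Krishnan,Milad Kabirifar,Subhash Lakshminarayana,Charalambos Konstantinou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 2025 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm)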
Abstract:This paper introduces a model for coordinating prosumers with heterogeneous distributed energy resources (DERs), participating in the local energy market (LEM) that interacts with the market-clearing entity. The proposed LEM scheme utilizes a data-driven, model-free reinforcement learning approach based on the multi-agent deep deterministic policy gradient (MADDPG) framework, enabling prosumers to make real-time decisions on whether to buy, sell, or refrain from any action while facilitating efficient coordination for optimal energy trading in a dynamic market. In addition, we investigate a price manipulation strategy using a variational auto encoder-generative adversarial network (VAE-GAN) model, which allows utilities to adjust price signals in a way that induces financial losses for the prosumers. Our results show that under adversarial pricing, heterogeneous prosumer groups, particularly those lacking generation capabilities, incur financial losses. The same outcome holds across LEMs of different sizes. As the market size increases, trading stabilizes and fairness improves through emergent cooperation among agents.
[AI-90] A Cooperative Approach for Knowledge-based Business Process Design in a Public Authority
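[Quick Read]: This paper addresses how enterprises undergoing digital transformation can design business processes (BPs) effectively to stay competitive, in particular the difficulty small and medium-sized enterprises face in carrying out complex process modeling without specialist expertise. The key to the solution is a knowledge-based method that requires no prior Knowledge Engineering skills: a structured sequence of steps guides business experts from simple text-based knowledge artefacts to formal, diagrammatic process models, so that all stakeholders can take part in business process design.
Link: https://arxiv.org/abs/2507.19842
Authors: Mohammad Azarijafari,Luisa Mich,Michele Missikoff,Oleg Missikoff
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: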
Abstract:Enterprises are currently undergoing profound transformations due to the unpostponable digital transformation. Then, to remain competitive, enterprises must adapt their organisational structures and operations. This organisational shift is also important for small and medium-sized enterprises. A key innovation frontier is the adoption of process-oriented production models. This paper presents a knowledge-based method to support business experts in designing business processes. The method requires no prior expertise in Knowledge Engineering and guides designers through a structured sequence of steps to produce a diagrammatic workflow of the target process. The construction of the knowledge base starts from simple, text-based, knowledge artefacts and then progresses towards more structured, formal representations. The approach has been conceived to allow a shared approach for all stakeholders and actors who participate in the BP design.
[AI-91] From Few-Label to Zero-Label: An Approach for Cross-System Log-Based Anomaly Detection with Meta-Learning
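[Quick Read]: This paper addresses the cold-start problem in cross-system log anomaly detection, where existing methods degrade sharply or fail when the target system has no labeled logs. The key to its solution is FreeLog, a system-agnostic representation meta-learning method that needs no target-system labels. It uses meta-learning to transfer general log representation ability between source and target systems, enabling cross-system anomaly detection under zero-label conditions. Experiments show performance comparable to state-of-the-art methods that rely on a small number of target labels.
Link: https://arxiv.org/abs/2507.19806
Authors: Xinlong Zhao,Tong Jia,Minghua He,Yihan Wu,Ying Li,Gang Huang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 1 figure, FSE 2025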
Abstract:Log anomaly detection plays a critical role in ensuring the stability and reliability of software systems. However, existing approaches rely on large amounts of labeled log data, which poses significant challenges in real-world applications. To address this issue, cross-system transfer has been identified as a key research direction. State-of-the-art cross-system approaches achieve promising performance with only a few labels from the target system. However, their reliance on labeled target logs makes them susceptible to the cold-start problem when labeled logs are insufficient. To overcome this limitation, we explore a novel yet underexplored setting: zero-label cross-system log anomaly detection, where the target system logs are entirely unlabeled. To this end, we propose FreeLog, a system-agnostic representation meta-learning method that eliminates the need for labeled target system logs, enabling cross-system log anomaly detection under zero-label conditions. Experimental results on three public log datasets demonstrate that FreeLog achieves performance comparable to state-of-the-art methods that rely on a small amount of labeled data from the target system.
[AI-92] AI-Based Clinical Rule Discovery for NMIBC Recurrence through Tsetlin Machines
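[Quick Read]: This paper addresses the insufficient accuracy of recurrence prediction for bladder cancer patients, especially those with non-muscle-invasive bladder cancer (NMIBC). The EORTC risk scoring system currently used in clinical practice is outdated and unreliable, particularly for intermediate-risk patients. To improve accuracy and interpretability, the study uses the Tsetlin Machine (TM), a symbolic-learning AI model whose core advantage is that it outputs human-readable logical rules rather than black-box decisions. On the PHOTO trial dataset (n=330), TM reaches an F1-score of 0.80, ahead of XGBoost (0.78), logistic regression (0.60), and EORTC (0.42). It also exposes how key clinical features such as tumour count, surgeon experience, and length of hospital stay contribute to each recurrence prediction, combining high accuracy with full transparency and showing potential for clinical use.
Link: https://arxiv.org/abs/2507.19803
Authors: Saram Abbas,Naeem Soomro,Rishad Shafik,Rakesh Heer,Kabita Adhikari
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to ISTM 2025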
Abstract:Bladder cancer claims one life every 3 minutes worldwide. Most patients are diagnosed with non-muscle-invasive bladder cancer (NMIBC), yet up to 70% recur after treatment, triggering a relentless cycle of surgeries, monitoring, and risk of progression. Clinical tools like the EORTC risk tables are outdated and unreliable, especially for intermediate-risk cases. We propose an interpretable AI model using the Tsetlin Machine (TM), a symbolic learner that outputs transparent, human-readable logic. Tested on the PHOTO trial dataset (n=330), TM achieved an F1-score of 0.80, outperforming XGBoost (0.78), Logistic Regression (0.60), and EORTC (0.42). TM reveals the exact clauses behind each prediction, grounded in clinical features like tumour count, surgeon experience, and hospital stay, offering accuracy and full transparency. This makes TM a powerful, trustworthy decision-support tool ready for real-world adoption.
[AI-93] Reinforcement Learning for Multi-Objective Multi-Echelon Supply Chain Optimisation
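[Quick Read]: This paper addresses multi-objective, multi-echelon supply chain optimisation, in particular how to balance economic, environmental, and social objectives in non-stationary markets and approximate the Pareto front efficiently. The key to the solution is a general multi-objective optimisation model based on a Markov decision process (MDP), solved with multi-objective reinforcement learning (MORL). Compared with a single-objective RL method using a weighted sum and with a multi-objective evolutionary algorithm (MOEA), the approach uses a shared experience buffer to transfer knowledge between policies, which improves the combined optimality, diversity, and density of the solutions. In complex network settings it achieves up to 75% higher hypervolume than the MOEA-based method and a solution set about 11 times denser than the single-objective RL method, while keeping production and inventory stable and minimising demand loss.
Link: https://arxiv.org/abs/2507.19788
Authors: Rifny Rachman,Josh Tingey,Richard Allmendinger,Pradyumn Shukla,Wei Pan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: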
Abstract:This study develops a generalised multi-objective, multi-echelon supply chain optimisation model with non-stationary markets based on a Markov decision process, incorporating economic, environmental, and social considerations. The model is evaluated using a multi-objective reinforcement learning (RL) method, benchmarked against an originally single-objective RL algorithm modified with weighted sum using predefined weights, and a multi-objective evolutionary algorithm (MOEA)-based approach. We conduct experiments on varying network complexities, mimicking typical real-world challenges using a customisable simulator. The model determines production and delivery quantities across supply chain routes to achieve near-optimal trade-offs between competing objectives, approximating Pareto front sets. The results demonstrate that the primary approach provides the most balanced trade-off between optimality, diversity, and density, further enhanced with a shared experience buffer that allows knowledge transfer among policies. In complex settings, it achieves up to 75% higher hypervolume than the MOEA-based method and generates solutions that are approximately eleven times denser, signifying better robustness, than those produced by the modified single-objective RL method. Moreover, it ensures stable production and inventory levels while minimising demand loss.
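The abstract compares methods by how well they approximate the Pareto front and by hypervolume. As a rough illustration of those two notions, the sketch below filters non-dominated points and computes the hypervolume for the two-objective maximisation case. The function names, the toy points, and the reference point are assumptions for illustration; the paper's own objectives and evaluation code are not shown in the abstract.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (all objectives maximised)."""
    def dominates(p, q):
        # p dominates q if it is at least as good everywhere and better somewhere.
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(front, ref):
    """Area dominated by a 2-D maximisation front, measured from reference point ref.

    Assumes every point in front is at least as good as ref in both objectives.
    """
    area = 0.0
    best_y = ref[1]
    # Sweep from the largest first objective to the smallest,
    # adding the new horizontal strip each point contributes.
    for x, y in sorted(front, key=lambda p: p[0], reverse=True):
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


candidates = [(3, 1), (2, 2), (1, 3), (1, 1)]  # toy objective vectors
front = pareto_front(candidates)
volume = hypervolume_2d(front, ref=(0, 0))
```

A larger hypervolume means the front covers more of the objective space relative to the reference point, which is why it is commonly used as a single score for a set of trade-off solutions.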
[AI-94] Large Language Model Agent for Structural Drawing Generation Using ReAct Prompt Engineering and Retrieval Augmented Generation
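[Quick Read]: This paper addresses the labor-intensive and time-consuming nature of producing structural drawings; traditional methods rely on manual drafting, which is inefficient and error-prone. The key to the solution is a generative AI method in which a large language model (LLM) agent, combined with retrieval-augmented generation (RAG), automates the conversion of natural language descriptions into AutoCAD drawings. The method interprets varied natural language input, extracts the key design information, and generates executable code that directly produces structural drawings conforming to engineering conventions, reducing engineers' workload and speeding up design expression and iteration.
Link: https://arxiv.org/abs/2507.19771
Authors: Xin Zhang,Lissette Iturburu,Juan Nicolas Villamizar,Xiaoyu Liu,Manuel Salmeron,Shirley J.Dyke,Julio Ramirez
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: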
Abstract:Structural drawings are widely used in many fields, e.g., mechanical engineering, civil engineering, etc. In civil engineering, structural drawings serve as the main communication tool between architects, engineers, and builders to avoid conflicts, act as legal documentation, and provide a reference for future maintenance or evaluation needs. They are often organized using key elements such as title/subtitle blocks, scales, plan views, elevation view, sections, and detailed sections, which are annotated with standardized symbols and line types for interpretation by engineers and contractors. Despite advances in software capabilities, the task of generating a structural drawing remains labor-intensive and time-consuming for structural engineers. Here we introduce a novel generative AI-based method for generating structural drawings employing a large language model (LLM) agent. The method incorporates a retrieval-augmented generation (RAG) technique using externally-sourced facts to enhance the accuracy and reliability of the language model. This method is capable of understanding varied natural language descriptions, processing these to extract necessary information, and generating code to produce the desired structural drawing in AutoCAD. The approach developed, demonstrated and evaluated herein enables the efficient and direct conversion of a structural drawing’s natural language description into an AutoCAD drawing, significantly reducing the workload compared to current working process associated with manual drawing production, facilitating the typical iterative process of engineers for expressing design ideas in a simplified way.
[AI-95] Modeling enzyme temperature stability from sequence segment perspective
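[Quick Read]: This paper addresses the problem of predicting enzyme temperature stability, which is critical in industrial and research applications; traditional experimental methods are slow and costly, and existing computational methods are limited by scarce and unevenly distributed data. The key to the solution is a carefully curated temperature stability dataset together with a new deep learning framework called Segment Transformer. By introducing a hierarchical representation based on protein sequence segments, the framework captures the biological observation that different regions contribute unequally to thermal behavior, giving more efficient and accurate predictions (RMSE = 24.03, MAE = 18.09, Pearson correlation = 0.33). It was applied to guide the engineering of a cutinase, where only 17 mutations produced a 1.64-fold improvement in relative activity after heat treatment without affecting catalytic function.
Link: https://arxiv.org/abs/2507.19755
Authors: Ziqi Zhang,Shiheng Chen,Runze Yang,Zhisheng Wei,Wei Zhang,Lei Wang,Zhanzhi Liu,Fengshan Zhang,Jing Wu,Xiaoyong Pan,Hongbin Shen,Longbing Cao,Zhaohong Deng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments: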
Abstract:Developing enzymes with desired thermal properties is crucial for a wide range of industrial and research applications, and determining temperature stability is an essential step in this process. Experimental determination of thermal parameters is labor-intensive, time-consuming, and costly. Moreover, existing computational approaches are often hindered by limited data availability and imbalanced distributions. To address these challenges, we introduce a curated temperature stability dataset designed for model development and benchmarking in enzyme thermal modeling. Leveraging this dataset, we present the Segment Transformer, a novel deep learning framework that enables efficient and accurate prediction of enzyme temperature stability. The model achieves state-of-the-art performance with an RMSE of 24.03, MAE of 18.09, and Pearson and Spearman correlations of 0.33, respectively. These results highlight the effectiveness of incorporating segment-level representations, grounded in the biological observation that different regions of a protein sequence contribute unequally to thermal behavior. As a proof of concept, we applied the Segment Transformer to guide the engineering of a cutinase enzyme. Experimental validation demonstrated a 1.64-fold improvement in relative activity following heat treatment, achieved through only 17 mutations and without compromising catalytic function.
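For readers less familiar with the metrics quoted in the abstract (RMSE, MAE, and Pearson correlation), the sketch below shows how each is computed for a regression model. The helper names and the toy values are assumptions chosen for illustration only; they are not the paper's data or evaluation code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large deviations more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: average size of the deviation."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation: strength of the linear relationship, in [-1, 1]."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy values: predictions are perfectly correlated but systematically too large.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
scores = (rmse(measured, predicted), mae(measured, predicted), pearson(measured, predicted))
```

The toy case shows why error and correlation are reported together: a model can rank samples perfectly (correlation of 1) and still carry a sizeable absolute error, or the reverse.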
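[AI-96] Can LLMs Solve ASP Problems? Insights from a Benchmarking Study (Extended Version) KR2025
[Quick Read]: This paper addresses the inadequate evaluation of large language models (LLMs) on Answer Set Programming (ASP) tasks. Existing studies mostly use oversimplified ASP programs, lack support for negation, disjunction, and multiple answer sets, and lack benchmarks designed specifically for ASP solving. To fill this gap, the authors propose ASPBench, a comprehensive benchmark with three ASP-specific tasks: ASP entailment, answer set verification, and answer set computation. The key contribution is a systematic, clearly structured evaluation framework. Large-scale experiments show that although mainstream LLMs do reasonably well on the first two tasks, they have significant difficulty with the core task of answer set computation, exposing current limits in integrating symbolic reasoning and motivating more effective combinations of symbolic and neural methods.
Link: https://arxiv.org/abs/2507.19749
Authors: Lin Ren,Guohui Xiao,Guilin Qi,Yishuai Geng,Haohan Xue
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 22nd International Conference on Principles of Knowledge Representation and Reasoning (KR 2025). The code is available at this https URL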
Abstract:Answer Set Programming (ASP) is a powerful paradigm for non-monotonic reasoning. Recently, large language models (LLMs) have demonstrated promising capabilities in logical reasoning. Despite this potential, current evaluations of LLM capabilities in ASP are often limited. Existing works normally employ overly simplified ASP programs, do not support negation, disjunction, or multiple answer sets. Furthermore, there is a lack of benchmarks that introduce tasks specifically designed for ASP solving. To bridge this gap, we introduce ASPBench, a comprehensive ASP benchmark, including three ASP specific tasks: ASP entailment, answer set verification, and answer set computation. Our extensive evaluations on ASPBench reveal that while 14 state-of-the-art LLMs, including deepseek-r1, o4-mini, and gemini-2.5-flash-thinking, perform relatively well on the first two simpler tasks, they struggle with answer set computation, which is the core of ASP solving. These findings offer insights into the current limitations of LLMs in ASP solving. This highlights the need for new approaches that integrate symbolic reasoning capabilities more effectively. The code and dataset are available at this https URL.
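To give a feel for the answer set verification task named in the abstract, here is a minimal sketch of the standard stable-model check via the Gelfond-Lifschitz reduct. It is restricted, by assumption, to ground normal programs (one atom in the head, no disjunction, no constraints), so it covers less than the benchmark does. The rule encoding `(head, positive_body, negative_body)` and the function name are hypothetical and not taken from ASPBench.

```python
def is_answer_set(rules, candidate):
    """Check whether candidate is a stable model of a ground normal program.

    Each rule is (head, positive_body, negative_body), read as:
        head :- positive_body, not negative_body.
    """
    candidate = set(candidate)
    # Gelfond-Lifschitz reduct: drop every rule blocked by the candidate,
    # then forget the negative literals of the rules that remain.
    reduct = [(head, set(pos)) for head, pos, neg in rules
              if not (set(neg) & candidate)]
    # Least model of the negation-free reduct, computed by fixpoint iteration.
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in reduct:
            if pos <= model and head not in model:
                model.add(head)
                changed = True
    # The candidate is an answer set exactly when it reproduces itself.
    return model == candidate


# p :- not q.   q :- not p.   (two answer sets: {p} and {q})
choice_program = [("p", [], ["q"]), ("q", [], ["p"])]
# p :- p.       (only the empty set is an answer set; {p} is unfounded)
loop_program = [("p", ["p"], [])]
```

Verification only has to test one given candidate, whereas answer set computation has to search over candidates, which is consistent with the abstract describing computation as the harder, core task.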
[AI-97] Defining ethically sourced code generation
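[Quick Read]: This paper addresses the ethical and sustainability problems in developing and deploying code generation models, especially opaque data sourcing, privacy leakage, lack of fairness, and environmental impact, with the aim of advancing responsible AI. The key to the solution is to introduce and systematize the new concept of Ethically Sourced Code Generation (ES-CodeGen). Through a two-phase literature review and a practitioner survey, the authors build an ES-CodeGen taxonomy with 11 dimensions, including a newly added code quality dimension that reflects an important practical concern, and identify the related consequences, artifacts, and stages. This provides a theoretical framework and empirical basis for ethical and sustainable practice across the whole pipeline, from data collection to post-deployment.
Link: https://arxiv.org/abs/2507.19743
Authors: Zhuolin Xu,Chenglin Li,Qiushi Li,Shin Hwei Tan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: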
Abstract:Several code generation models have been proposed to help reduce time and effort in solving software-related tasks. To ensure responsible AI, there are growing interests over various ethical issues (e.g., unclear licensing, privacy, fairness, and environment impact). These studies have the overarching goal of ensuring ethically sourced generation, which has gained growing attentions in speech synthesis and image generation. In this paper, we introduce the novel notion of Ethically Sourced Code Generation (ES-CodeGen) to refer to managing all processes involved in code generation model development from data collection to post-deployment via ethical and sustainable practices. To build a taxonomy of ES-CodeGen, we perform a two-phase literature review where we read 803 papers across various domains and specific to AI-based code generation. We identified 71 relevant papers with 10 initial dimensions of ES-CodeGen. To refine our dimensions and gain insights on consequences of ES-CodeGen, we surveyed 32 practitioners, which include six developers who submitted GitHub issues to opt-out from the Stack dataset (these impacted users have real-world experience of ethically sourcing issues in code generation models). The results lead to 11 dimensions of ES-CodeGen with a new dimension on code quality as practitioners have noted its importance. We also identified consequences, artifacts, and stages relevant to ES-CodeGen. Our post-survey reflection showed that most practitioners tend to ignore social-related dimensions despite their importance. Most practitioners either agreed or strongly agreed that our survey help improve their understanding of ES-CodeGen. Our study calls for attentions of various ethical issues towards ES-CodeGen.
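[AI-98] Predicting Human Mobility in Disasters via LLM-Enhanced Cross-City Learning
[Quick Read]: This paper addresses the failure of existing deep mobility prediction models to adapt when human mobility patterns change markedly during natural disasters. Conventional models are designed for normal scenarios and break down under the abrupt shifts in behavioral intention that disasters cause, which hurts downstream tasks such as early disaster warning and pre-allocation of rescue resources. The key to the solution is the DisasterMobLLM framework. Its core idea is to use large language models (LLMs) to model mobility intention in disaster scenarios, improve intention prediction with a retrieval-augmented generation (RAG) mechanism, and then map intentions to locations with an intention-modulated location prediction module. The framework can be integrated into existing deep mobility prediction methods, enabling cross-city transfer of disaster knowledge and intention-guided mobility prediction.
Link: https://arxiv.org/abs/2507.19737
Authors: Yinzhou Tang,Huandong Wang,Xiaochen Fan,Yong Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: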
Abstract:The vulnerability of cities to natural disasters has increased with urbanization and climate change, making it more important to predict human mobility in the disaster scenarios for downstream tasks including location-based early disaster warning and pre-allocating rescue resources, etc. However, existing human mobility prediction models are mainly designed for normal scenarios, and fail to adapt to disaster scenarios due to the shift of human mobility patterns under disaster. To address this issue, we introduce DisasterMobLLM, a mobility prediction framework for disaster scenarios that can be integrated into existing deep mobility prediction methods by leveraging LLMs to model the mobility intention and transferring the common knowledge of how different disasters affect mobility intentions between cities. This framework utilizes a RAG-Enhanced Intention Predictor to forecast the next intention, refines it with an LLM-based Intention Refiner, and then maps the intention to an exact location using an Intention-Modulated Location Predictor. Extensive experiments illustrate that DisasterMobLLM can achieve a 32.8% improvement in terms of Acc@1 and a 35.0% improvement in terms of the F1-score of predicting immobility compared to the baselines. The code is available at this https URL.
[AI-99] Integrating Activity Predictions in Knowledge Graphs
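[Quick Read]: This paper addresses how ontology-structured knowledge graphs can improve the accuracy and interpretability of predictions about future events. The core challenge is turning spatiotemporal data (such as a fishing vessel's movement track) into a computable probabilistic model while avoiding the conventional view's conflation of likelihood with probability and its reliance on the notion of "measurements of future entities". The solution has three parts. First, it builds a semantically clear knowledge graph structure using Basic Formal Ontology (BFO) and the Common Core Ontologies (CCO), and introduces the term "spatiotemporal instant" to complete the semantics needed to describe dynamic processes. Second, it builds Markov chain models from query results and treats probabilities as being about process profiles rather than static attributes, which better reflects how real-world phenomena evolve. Third, it writes the computed probabilities back into the knowledge graph, closing the loop between prediction, analysis, and decision-making.
Link: https://arxiv.org/abs/2507.19733
Authors: Alec Scully,Cameron Stockton,Forrest Hare
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 7 pages. 18 figures. Semantic Technology for Intelligence, Defense, and Security (STIDS 2024)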
Abstract:We argue that ontology-structured knowledge graphs can play a crucial role in generating predictions about future events. By leveraging the semantic framework provided by Basic Formal Ontology (BFO) and Common Core Ontologies (CCO), we demonstrate how data such as the movements of a fishing vessel can be organized in and retrieved from a knowledge graph. These query results are then used to create Markov chain models, allowing us to predict future states based on the vessel’s history. To fully support this process, we introduce the term "spatiotemporal instant" to complete the necessary structural semantics. Additionally, we critique the prevailing ontological model of probability, which conflates probability with likelihood and relies on the problematic concept of modal measurements: measurements of future entities. We propose an alternative view, where probabilities are treated as being about process profiles, which better captures the dynamics of real world phenomena. Finally, we demonstrate how our Markov chain based probability calculations can be seamlessly integrated back into the knowledge graph, enabling further analysis and decision-making. Keywords: predictive analytics, ontology, Markov chains, probability, Basic Formal Ontology (BFO), knowledge graphs, SPARQL.
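The step from query results to a Markov chain can be pictured with a small sketch: given an ordered history of states, count the observed transitions and normalise them into probabilities. The state labels (`port`, `transit`, `fishing`) and the function name are hypothetical stand-ins for whatever the knowledge graph query would return; the paper's actual SPARQL queries and ontology terms are not reproduced here.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate first-order Markov transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    # Count each observed (current state -> next state) pair.
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    probabilities = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        # Normalise the counts of each row so they sum to 1.
        probabilities[current] = {nxt: n / total for nxt, n in successors.items()}
    return probabilities


# Hypothetical vessel history, e.g. the ordered result of a graph query.
history = ["port", "transit", "fishing", "fishing", "transit", "port"]
model = transition_probabilities(history)
```

Here `model["transit"]` gives the estimated distribution over the vessel's next state when it is currently in transit; values of this kind are what the abstract describes being written back into the knowledge graph.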
[AI-100] HypKG: Hypergraph-based Knowledge Graph Contextualization for Precision Healthcare ISWC2025
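[Quick Read]: This paper addresses the lack of patient-specific context in general knowledge graphs (KGs) used for precision healthcare, which makes them hard to adapt to individualized care. Existing KGs hold many medical facts but cannot incorporate personalized data from electronic health records (EHRs), such as diagnoses and medications, which limits their accuracy in tasks like disease prediction. The key to the solution is the HypKG framework. It uses advanced entity-linking techniques to connect patient information in EHRs with knowledge in the KG, and a hypergraph model to contextualize that knowledge. Hypergraph transformers guided by downstream prediction tasks then jointly learn representations of the KG and the patients, drawing on both prior knowledge in the KG and individual context in the EHRs to give more accurate healthcare predictions.
Link: https://arxiv.org/abs/2507.19726
Authors: Yuzhang Xie,Xu Han,Ran Xu,Xiao Hu,Jiaying Lu,Carl Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Extended version of paper accepted at the 24th International Semantic Web Conference (ISWC 2025), Main Tracks, Research Track, Oral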
Abstract:Knowledge graphs (KGs) are important products of the semantic web, which are widely used in various application domains. Healthcare is one of such domains where KGs are intensively used, due to the high requirement for knowledge accuracy and interconnected nature of healthcare data. However, KGs storing general factual information often lack the ability to account for important contexts of the knowledge such as the status of specific patients, which are crucial in precision healthcare. Meanwhile, electronic health records (EHRs) provide rich personal data, including various diagnoses and medications, which provide natural contexts for general KGs. In this paper, we propose HypKG, a framework that integrates patient information from EHRs into KGs to generate contextualized knowledge representations for accurate healthcare predictions. Using advanced entity-linking techniques, we connect relevant knowledge from general KGs with patient information from EHRs, and then utilize a hypergraph model to “contextualize” the knowledge with the patient information. Finally, we employ hypergraph transformers guided by downstream prediction tasks to jointly learn proper contextualized representations for both KGs and patients, fully leveraging existing knowledge in KGs and patient contexts in EHRs. In experiments using a large biomedical KG and two real-world EHR datasets, HypKG demonstrates significant improvements in healthcare prediction tasks across multiple evaluation metrics. Additionally, by integrating external contexts, HypKG can learn to adjust the representations of entities and relations in KG, potentially improving the quality and real-world utility of knowledge.
[AI-101] Minding Motivation: The Effect of Intrinsic Motivation on Agent Behaviors
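[Quick Read]: This paper addresses the difficulty reinforcement learning (RL) agents have exploring in reward-sparse environments. Intrinsic Motivation (IM) methods improve exploration efficiency but can cause "reward hacking", where the agent over-optimizes the intrinsic reward and drifts from the original task objective. The key element of the solution is Generalized Reward Matching (GRM), a method that can be combined with any intrinsic reward function and comes with a theoretical guarantee that the agent's behavior remains optimal, mitigating the behavioral shift that IM introduces. Experiments show that GRM suppresses reward hacking in some scenarios while preserving the exploration benefit.
Link: https://arxiv.org/abs/2507.19725
Authors: Leonardo Villalobos-Arias,Grant Forbes,Jianxun Wang,David L Roberts,Arnav Jhala
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, 3 tables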
Abstract:Games are challenging for Reinforcement Learning (RL) agents due to their reward-sparsity, as rewards are only obtainable after long sequences of deliberate actions. Intrinsic Motivation (IM) methods, which introduce exploration rewards, are an effective solution to reward-sparsity. However, IM also causes an issue known as "reward hacking", where the agent optimizes for the new reward at the expense of properly playing the game. The larger problem is that reward hacking itself is largely unknown; there is no answer to whether, and to what extent, IM rewards change the behavior of RL agents. This study takes a first step by empirically evaluating the impact on behavior of three IM techniques on the MiniGrid game-like environment. We compare these IM models with Generalized Reward Matching (GRM), a method that can be used with any intrinsic reward function to guarantee optimality. Our results suggest that IM causes noticeable change by increasing the initial rewards, but also altering the way the agent plays; and that GRM mitigated reward hacking in some scenarios.
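As a rough picture of what an exploration reward looks like, the sketch below adds a count-based novelty bonus to the environment reward: states visited less often earn a larger bonus. This is a generic textbook-style example, not necessarily one of the three IM techniques the paper evaluates (the abstract does not name them), and the class name `CountBonus` and the `scale` parameter are assumptions.

```python
import math
from collections import Counter


class CountBonus:
    """Count-based intrinsic reward: bonus = scale / sqrt(visits to the state)."""

    def __init__(self, scale=0.1):
        self.scale = scale
        self.visits = Counter()

    def shaped_reward(self, state, extrinsic_reward):
        # Record the visit, then add a bonus that shrinks as the state becomes familiar.
        self.visits[state] += 1
        bonus = self.scale / math.sqrt(self.visits[state])
        return extrinsic_reward + bonus


bonus = CountBonus(scale=1.0)
first = bonus.shaped_reward((0, 0), extrinsic_reward=0.0)   # novel state
second = bonus.shaped_reward((0, 0), extrinsic_reward=0.0)  # same state again
```

Because the agent is trained on the shaped value rather than the game reward alone, it can learn to chase the bonus itself, which is the reward hacking concern the abstract raises.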
[AI-102] Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning
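[Quick Read]: This paper addresses suboptimal decisions in mission assignment and task offloading for autonomous vehicles using mobile edge computing (MEC) in an Open RAN-enabled intelligent transportation system (ITS), which arise when dependencies between missions and offloading costs are ignored. The key to the solution is the Oranits system model, with two main innovations: a metaheuristic evolutionary algorithm, the Chaotic Gaussian-based Global ARO (CGG-ARO), for one-slot optimization; and an enhanced reward-based deep reinforcement learning framework, the Multi-agent Double Deep Q-Network (MA-DDQN), which combines multi-agent coordination with multi-action selection to improve mission assignment efficiency and adaptability. Simulations show that the scheme outperforms conventional methods in both mission completion rate and overall benefit.
Link: https://arxiv.org/abs/2507.19712
Authors: Ngoc Hung Nguyen,Nguyen Van Thieu,Quang-Trung Luu,Anh Tuan Nguyen,Senura Wanasekara,Nguyen Cong Luong,Fatemeh Kavehmadavani,Van-Dinh Nguyen
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 15 pages, 13 figures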
Abstract:In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision-making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, namely the Chaotic Gaussian-based Global ARO (CGG-ARO), serving as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, referred to as the Multi-agent Double Deep Q-Network (MA-DDQN), that integrates both multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG-ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA-DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.
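The abstract builds on a Double Deep Q-Network. For orientation, the sketch below shows the standard single-agent Double DQN target: the online network picks the next action and the target network scores it. How the paper extends this to multiple agents and multi-action selection is not described in the abstract, so none of that is modelled here; the function name and the plain-list Q-values are assumptions.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Standard Double DQN bootstrap target for one transition.

    q_online_next / q_target_next: Q-values of the next state for every action,
    from the online and the target network respectively.
    """
    if done:
        # Terminal transition: nothing to bootstrap from.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * q_target_next[best_action]


target = double_dqn_target(
    reward=1.0,
    done=False,
    q_online_next=[0.2, 0.9],
    q_target_next=[0.5, 0.3],
    gamma=0.5,
)
```

Separating selection from evaluation is the usual argument for Double DQN over plain DQN, since taking the maximum over a single network's noisy estimates tends to overestimate values.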
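[AI-103] The wall confronting large language models
[Quick Read]: This paper addresses a fundamental limit on the ability of large language models (LLMs) to improve the uncertainty of their predictions: performance is constrained by scaling laws, which makes the reliability required for scientific inquiry hard to reach. The paper argues that a core mechanism of LLMs, the ability to generate non-Gaussian output distributions from Gaussian inputs, may be the root of error pileup, information catastrophes, and degenerative AI behaviour, and that this tension between learning and accuracy is a key mechanism behind the observed low values of the scaling components. The key to a solution is a much stronger emphasis on understanding and insight into the structural characteristics of the problems being studied, so as to counter the surge of spurious correlations that comes with growing data size and keep the degenerative pathway from becoming inevitable.
Link: https://arxiv.org/abs/2507.19703
Authors: Peter V. Coveney,Sauro Succi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: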
Abstract:We show that the scaling laws which determine the performance of large language models (LLMs) severely limit their ability to improve the uncertainty of their predictions. As a result, raising their reliability to meet the standards of scientific inquiry is intractable by any reasonable measure. We argue that the very mechanism which fuels much of the learning power of LLMs, namely the ability to generate non-Gaussian output distributions from Gaussian input ones, might well be at the roots of their propensity to produce error pileup, ensuing information catastrophes and degenerative AI behaviour. This tension between learning and accuracy is a likely candidate mechanism underlying the observed low values of the scaling components. It is substantially compounded by the deluge of spurious correlations pointed out by Calude and Longo which rapidly increase in any data set merely as a function of its size, regardless of its nature. The fact that a degenerative AI pathway is a very probable feature of the LLM landscape does not mean that it must inevitably arise in all future AI research. Its avoidance, which we also discuss in this paper, necessitates putting a much higher premium on insight and understanding of the structural characteristics of the problems being investigated.
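[AI-104] KD-GAT: Combining Knowledge Distillation and Graph Attention Transformer for a Controller Area Network Intrusion Detection System
[Quick read]: This paper addresses the vulnerability of the in-vehicle Controller Area Network (CAN) protocol to cyberattacks, which stems from its lack of built-in security mechanisms. The key to the solution is KD-GAT, an intrusion detection framework that combines knowledge distillation with Graph Attention Networks (GATs): CAN traffic is modeled as graphs using a sliding window to capture temporal and relational patterns, a multi-layer GAT serves as the teacher model, and a student model only 6.32% of its size is trained in two phases, supervised pretraining followed by knowledge distillation with both soft and hard labels. This maintains high detection accuracy while substantially reducing computational complexity.
Link: https://arxiv.org/abs/2507.19686
Authors: Robert Frenken,Sidra Ghayour Bhatti,Hanqin Zhang,Qadeer Ahmed
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: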
Abstract:The Controller Area Network (CAN) protocol is widely adopted for in-vehicle communication but lacks inherent security mechanisms, making it vulnerable to cyberattacks. This paper introduces KD-GAT, an intrusion detection framework that combines Graph Attention Networks (GATs) with knowledge distillation (KD) to enhance detection accuracy while reducing computational complexity. In our approach, CAN traffic is represented as graphs using a sliding window to capture temporal and relational patterns. A multi-layer GAT with jumping knowledge aggregation acts as the teacher model, while a compact student GAT, only 6.32% the size of the teacher, is trained via a two-phase process involving supervised pretraining and knowledge distillation with both soft and hard label supervision. Experiments on three benchmark datasets (Car-Hacking, Car-Survival, and can-train-and-test) demonstrate that both teacher and student models achieve strong results, with the student model attaining 99.97% and 99.31% accuracy on Car-Hacking and Car-Survival, respectively. However, significant class imbalance in can-train-and-test has led to reduced performance for both models on this dataset. Addressing this imbalance remains an important direction for future work.
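The "soft and hard label supervision" mentioned above is commonly implemented as a weighted sum of a temperature-scaled KL term against the teacher and a cross-entropy term against the ground truth. The sketch below follows that textbook formulation. The abstract does not give the paper's exact loss, so `temperature`, `alpha`, and the toy logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, true_label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # Soft-label term: how far the student is from the teacher's softened distribution.
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student_soft) if pt > 0)
    # Hard-label term: ordinary cross-entropy on the ground-truth class.
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


# Toy two-class logits (made up), e.g. benign vs. attack for one CAN graph window.
print(distillation_loss([1.5, 0.2], [2.0, 0.0], true_label=0))
```

The `T^2` factor keeps the gradient scale of the soft term comparable to the hard term as the temperature changes, which is the usual reason it appears in this formulation.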
[AI-105] Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges
[Quick read]: This paper addresses the alignment of large language models (LLMs) with human values and intentions, a core challenge given how widely LLMs are now deployed. The key contribution is a systematic review and analysis of alignment techniques, including supervised fine-tuning (SFT), preference-based methods such as Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ). These methods seek a balance between quality and efficiency, and the survey characterizes the trade-offs between core alignment objectives across paradigms, giving researchers and practitioners an actionable guiding framework.
Link: https://arxiv.org/abs/2507.19672
Authors: Haoran Lu,Luyang Fang,Ruidong Zhang,Xinliang Li,Jiazhang Cai,Huimin Cheng,Lin Tang,Ziyu Liu,Zeliang Sun,Tao Wang,Yingchuan Zhang,Arif Hassan Zidan,Jinwen Xu,Jincheng Yu,Meizhi Yu,Hanqi Jiang,Xilin Gong,Weidi Luo,Bolun Sun,Yongkai Chen,Terry Ma,Shushan Wu,Yifan Zhou,Junhao Chen,Haotian Xiang,Jing Zhang,Afrar Jahin,Wei Ruan,Ke Deng,Yi Pan,Peilong Wang,Jiahui Li,Zhengliang Liu,Lu Zhang,Lin Zhao,Wei Liu,Dajiang Zhu,Xin Xing,Fei Dou,Wei Zhang,Chao Huang,Rongjie Liu,Mengrui Zhang,Yiwen Liu,Xiaoxiao Sun,Qin Lu,Zhen Xiang,Wenxuan Zhong,Tianming Liu,Ping Ma
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 119 pages, 10 figures, 7 tables
Abstract:Due to the remarkable capabilities and growing impact of large language models (LLMs), they have been deeply integrated into many aspects of society. Thus, ensuring their alignment with human values and intentions has emerged as a critical challenge. This survey provides a comprehensive overview of practical alignment techniques, training protocols, and empirical findings in LLM alignment. We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives. Our analysis shows that while supervised fine-tuning enables basic instruction-following, preference-based methods offer more flexibility for aligning with nuanced human intent. We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ), highlighting their approaches to balancing quality and efficiency. We review existing evaluation frameworks and benchmarking datasets, emphasizing limitations such as reward misspecification, distributional robustness, and scalable oversight. We summarize strategies adopted by leading AI labs to illustrate the current state of practice. We conclude by outlining open problems in oversight, value pluralism, robustness, and continuous alignment. This survey aims to inform both researchers and practitioners navigating the evolving landscape of LLM alignment.
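[AI-106] "X of Information" Continuum: A Survey on AI-Driven Multi-dimensional Metrics for Next-Generation Networked Systems
[Quick read]: This paper addresses the inability of traditional network metrics, such as latency and packet loss, to capture the complex information-quality requirements of modern intelligent applications; in scenarios such as autonomous driving, digital twins, and the metaverse, data volume alone no longer reflects the actual value of information. The key to the solution is a systematic four-dimensional taxonomy of information metrics covering the temporal, quality/utility, reliability/robustness, and network/communication dimensions. The survey exposes the coupling among these dimensions (temporal freshness triggers quality evaluation, which informs reliability appraisal, which in turn enables effective network delivery) to support multi-dimensional quantification and adaptive optimization of information quality. It also highlights the central role of AI techniques such as deep reinforcement learning, multi-agent systems, and neural optimization models in context-aware optimization of competing objectives, providing a theoretical basis and practical path for next-generation information-aware network design.
Link: https://arxiv.org/abs/2507.19657
Authors: Beining Wu,Jun Huang,Shui Yu
Affiliations: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 48 pages, 14 figures, submitted to IEEE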
Abstract:The development of next-generation networking systems has inherently shifted from throughput-based paradigms towards intelligent, information-aware designs that emphasize the quality, relevance, and utility of transmitted information, rather than sheer data volume. While classical network metrics, such as latency and packet loss, remain significant, they are insufficient to quantify the nuanced information quality requirements of modern intelligent applications, including autonomous vehicles, digital twins, and metaverse environments. In this survey, we present the first comprehensive study of the "X of Information" continuum by introducing a systematic four-dimensional taxonomic framework that structures information metrics along temporal, quality/utility, reliability/robustness, and network/communication dimensions. We uncover the increasing interdependencies among these dimensions, whereby temporal freshness triggers quality evaluation, which in turn helps with reliability appraisal, ultimately enabling effective network delivery. Our analysis reveals that artificial intelligence technologies, such as deep reinforcement learning, multi-agent systems, and neural optimization models, enable adaptive, context-aware optimization of competing information quality objectives. In our extensive study of six critical application domains, covering autonomous transportation, industrial IoT, healthcare digital twins, UAV communications, LLM ecosystems, and metaverse settings, we illustrate the revolutionary promise of multi-dimensional information metrics for meeting diverse operational needs. Our survey identifies prominent implementation challenges, including …
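[AI-107] On the Limitations of Ray-Tracing for Learning-Based RF Tasks in Urban Environments
[Quick read]: This paper studies the realism of ray-tracing simulation for outdoor cellular links, that is, how well simulated results agree with real measurements. Focusing on an outdoor scenario in central Rome, it uses the actual positions of 1,664 user equipments (UEs) and six base stations (BSs) and systematically varies path depth, diffuse/specular/refraction flags, carrier frequency, and antenna altitude, radiation pattern, and orientation to evaluate simulation accuracy under each configuration. The key finding is that simulator hyper-parameters have a negligible effect on the metrics, whereas antenna locations and orientations are decisive: simple greedy optimization improves the Spearman correlation by 5% to 130%, and the error of k-nearest-neighbor (kNN) localization based on RSSI fingerprints drops by one-third, although it remains higher than the error obtained with purely real data. Precise geometry and credible antenna models are therefore necessary but not sufficient; transferable, high-fidelity outdoor RF simulation still requires capturing the residual urban noise.
Link: https://arxiv.org/abs/2507.19653
Authors: Armen Manukyan,Hrant Khachatrian,Edvard Ghukasyan,Theofanis P. Raptis
Affiliations: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication. This work was supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)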
Abstract:We study the realism of Sionna v1.0.2 ray-tracing for outdoor cellular links in central Rome. We use a real measurement set of 1,664 user-equipments (UEs) and six nominal base-station (BS) sites. Using these fixed positions, we systematically vary the main simulation parameters, including path depth, diffuse/specular/refraction flags, and carrier frequency, as well as antenna properties such as altitude, radiation pattern, and orientation. Simulator fidelity is scored for each base station via the Spearman correlation between measured and simulated powers, and by a k-nearest-neighbor localization algorithm using RSSI-based fingerprints. Across all experiments, solver hyper-parameters have an immaterial effect on the chosen metrics. In contrast, antenna locations and orientations prove decisive. By simple greedy optimization we improve the Spearman correlation by 5% to 130% for various base stations, and the kNN-based localization error on real-world samples, using only simulated data as reference points, decreases by one-third, although it remains twice as high as the error with purely real data. Precise geometry and credible antenna models are therefore necessary but not sufficient; faithfully capturing the residual urban noise remains an open challenge for transferable, high-fidelity outdoor RF simulation.
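The two fidelity scores described above can be sketched in a few lines: a Spearman rank correlation between measured and simulated powers, and a kNN position estimate from RSSI fingerprints. This is a minimal standard-library illustration rather than the authors' evaluation code. All numbers are made up, and averaging the positions of the k nearest references is an assumed (common) choice of estimator.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def knn_localize(query_rssi, ref_rssi, ref_xy, k=3):
    """Average the positions of the k reference fingerprints closest in RSSI space."""
    ranked = sorted(range(len(ref_rssi)),
                    key=lambda i: math.dist(query_rssi, ref_rssi[i]))
    nearest = ranked[:k]
    x = sum(ref_xy[i][0] for i in nearest) / len(nearest)
    y = sum(ref_xy[i][1] for i in nearest) / len(nearest)
    return (x, y)


# Made-up powers in dBm for one base station: measured vs. simulated.
measured = [-71.0, -85.5, -63.2, -90.1]
simulated = [-68.0, -80.0, -66.0, -95.0]
print(spearman(measured, simulated))

# Made-up simulated fingerprints (RSSI from two BSs) with known positions in metres.
refs = [[-60.0, -80.0], [-70.0, -75.0], [-90.0, -50.0]]
positions = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0)]
print(knn_localize([-61.0, -79.0], refs, positions, k=2))
```

Because Spearman only compares rankings, a simulator can score well even if its absolute power levels are biased, which is one reason the localization error is a useful complementary metric.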
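[AI-108] Can You Share Your Story? Modeling Clients Metacognition and Openness for LLM Therapist Evaluation ACL2025
[Quick read]: This paper addresses the lack of evaluation of whether large language models (LLMs) acting as virtual therapists can recognize and understand clients' implicit mental states, such as beliefs and thoughts that are not explicitly expressed. Existing evaluation methods rely on simulated clients who openly disclose their internal states, so they cannot test an LLM therapist's ability to uncover deeper content in complex interactions. The key to the solution is MindVoyager, a framework with a controllable, realistic client simulator that adapts its behavior dynamically as the counseling session progresses, together with new evaluation metrics that quantify how deeply and broadly an LLM therapist explores the client's mental state, yielding a challenging evaluation environment closer to clinical practice.
Link: https://arxiv.org/abs/2507.19643
Authors: Minju Kim,Dongje Yoo,Yeonjun Hwang,Minseok Kang,Namyoung Kim,Minju Gwak,Beong-woo Kwak,Hyungjoo Chae,Harim Kim,Yunjoong Lee,Min Hee Kim,Dayi Jung,Kyong-Mee Chung,Jinyoung Yeo
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Published at ACL 2025 Findings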
Abstract:Understanding clients’ thoughts and beliefs is fundamental in counseling, yet current evaluations of LLM therapists often fail to assess this ability. Existing evaluation methods rely on client simulators that clearly disclose internal states to the therapist, making it difficult to determine whether an LLM therapist can uncover unexpressed perspectives. To address this limitation, we introduce MindVoyager, a novel evaluation framework featuring a controllable and realistic client simulator which dynamically adapts itself based on the ongoing counseling session, offering a more realistic and challenging evaluation environment. We further introduce evaluation metrics that assess the exploration ability of LLM therapists by measuring their thorough understanding of client’s beliefs and thoughts.
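[AI-109] Efficient and Scalable Agentic AI with Heterogeneous Systems
[Quick read]: This paper addresses the efficient deployment and serving of AI agent workloads on heterogeneous compute infrastructure. These workloads are dynamic and structurally complex, typically involving multi-modal data processing, vector database lookups, multiple large language model (LLM) inferences, and tool calls. The core challenge is to schedule them dynamically across CPUs and accelerators from different vendors and performance tiers while meeting an end-to-end service-level agreement (SLA). The key to the solution is a system-level design with three parts: a cost-model-based planning and optimization framework that accounts for the compute, memory, and bandwidth constraints of the hardware; an MLIR (Multi-Level Intermediate Representation) based representation and compilation system that decomposes agent execution graphs into fine-grained operators and generates code for multiple hardware targets; and a dynamic orchestration mechanism that places these operators across the heterogeneous environment and stitches them together, optimizing total cost of ownership (TCO). Preliminary experiments suggest that mixing older-generation GPUs with newer accelerators can match the TCO of a homogeneous latest-generation GPU design, potentially extending the life of existing infrastructure.
Link: https://arxiv.org/abs/2507.19635
Authors: Zain Asgar,Michelle Nguyen,Sachin Katti
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Early access preprint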
Abstract:AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi-modal data input and conversion, data processing and context gathering (e.g., vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. To tackle this challenge, in this paper, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; an MLIR-based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design performs a systems-level TCO optimization and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A preliminary surprising finding is that for some workloads a heterogeneous combination of older generation GPUs with newer accelerators can deliver similar TCO as the latest generation homogeneous GPU infrastructure design, potentially extending the life of deployed infrastructure.
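[AI-110] DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
[Quick read]: This paper addresses the high computational cost and memory footprint of deploying large language models (LLMs) on resource-constrained edge devices, in particular the quadratic growth of attention computation with sequence length. Existing dynamic attention pruning methods mostly target massively parallel hardware such as GPUs or TPUs and long-context settings (e.g., 64K), which makes them a poor fit for the edge. The key to the solution is DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns for efficient inference in both the prefilling and decoding stages. It has two components: a delta matrix construction strategy that balances accuracy and memory consumption while introducing temporal sparsity, and a context-aware hybrid attention mechanism that uses full attention inside a local window and a delta approximation outside it, preserving accuracy while raising sparsity to as much as 60%. Experiments on BitNet-b1.58-2B-4T and Llama3.2-1B-Instruct show high sparsity without noticeable accuracy loss, making the approach a feasible option for efficient edge deployment.
Link: https://arxiv.org/abs/2507.19608
Authors: Jiawen Qi,Chang Gao,Zhaochun Ren,Qinyu Chen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: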
Abstract:Deploying Large Language Models (LLMs) on edge devices remains challenging due to their quadratically increasing computations with the sequence length. Existing studies for dynamic attention pruning are designed for hardware with massively parallel computation capabilities, such as GPUs or TPUs, and aim at long context lengths (e.g., 64K), making them unsuitable for edge scenarios. We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference across both the prefilling and decoding stages, on resource-constrained edge devices. DeltaLLM introduces an accuracy- and memory-aware delta matrix construction strategy that introduces temporal sparsity, and a context-aware hybrid attention mechanism that combines full attention in a local context window with delta approximation outside it to increase accuracy. We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T model and Llama3.2-1B-Instruct model across diverse language tasks. The results show that on BitNet, our framework increases the attention sparsity from 0% to 60% during the prefilling stage with slight accuracy improvement on the WG task, and 0% to 57% across both the prefilling and decoding stages, with even higher F1 score from 29.63 to 30.97 on SQuAD-v2 task. On the Llama model, it can also achieve up to 60% sparsity during the prefilling stage and around 57% across both stages with negligible accuracy drop. These results demonstrate that DeltaLLM offers a promising solution for efficient edge deployment, requiring no fine-tuning and seamlessly integrating with existing inference pipelines.
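The abstract does not describe how the delta matrix is built, so the following is only a generic illustration of how thresholded deltas create temporal sparsity: each row (one token's vector) is encoded as its change relative to the last retained values, and changes below a threshold are zeroed. The function names, the threshold, and the toy data are all assumptions, and this is not presented as DeltaLLM's actual construction.

```python
def delta_encode(rows, threshold):
    """Encode each row as thresholded changes against the last retained values."""
    reference = list(rows[0])   # last value that was actually propagated, per element
    deltas = [list(rows[0])]    # the first row is kept in full
    for row in rows[1:]:
        encoded = []
        for j, value in enumerate(row):
            change = value - reference[j]
            if abs(change) >= threshold:
                encoded.append(change)
                reference[j] = value   # update the reference only when we propagate
            else:
                encoded.append(0.0)    # small change: skipped, which creates sparsity
        deltas.append(encoded)
    return deltas


def sparsity(deltas):
    """Fraction of zero entries among the delta rows (the first, full row excluded)."""
    values = [v for row in deltas[1:] for v in row]
    return sum(1 for v in values if v == 0.0) / len(values)


# Made-up 2-dimensional token vectors that drift slowly over time.
tokens = [[1.0, 2.0], [1.05, 3.0], [1.02, 3.01]]
encoded = delta_encode(tokens, threshold=0.1)
print(encoded, sparsity(encoded))
```

Comparing against the last propagated value rather than the previous row keeps small drifts from accumulating unnoticed, at the cost of storing one reference vector.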
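[AI-111] Hypergames: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems
[Quick read]: This paper addresses the limited real-world applicability of classical game-theoretic models in multi-agent systems (MAS), which assume rational agents, complete information, and common knowledge of payoffs, particularly in settings with uncertainty, misaligned perceptions, and nested beliefs. The key to the solution is to introduce and systematically review hypergame theory, which explicitly models agents' subjective perceptions of the strategic scenario (perceptual games) and allows agents to hold different beliefs about the game's structure, payoffs, or available actions. Building on two major extensions, hierarchical hypergames and the hypergame normal form (HNF), the paper proposes agent-compatibility criteria and an agent-based classification framework to assess integration patterns and practical applicability in dynamic, interactive environments, with the aim of making strategic modeling more realistic and effective.
Link: https://arxiv.org/abs/2507.19593
Authors: Vince Trencsenyi,Agnieszka Mensfelt,Kostas Stathis
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: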
Abstract:Classical game-theoretic models typically assume rational agents, complete information, and common knowledge of payoffs - assumptions that are often violated in real-world MAS characterized by uncertainty, misaligned perceptions, and nested beliefs. To overcome these limitations, researchers have proposed extensions that incorporate models of cognitive constraints, subjective beliefs, and heterogeneous reasoning. Among these, hypergame theory extends the classical paradigm by explicitly modeling agents’ subjective perceptions of the strategic scenario, known as perceptual games, in which agents may hold divergent beliefs about the structure, payoffs, or available actions. We present a systematic review of agent-compatible applications of hypergame theory, examining how its descriptive capabilities have been adapted to dynamic and interactive MAS contexts. We analyze 44 selected studies from cybersecurity, robotics, social simulation, communications, and general game-theoretic modeling. Building on a formal introduction to hypergame theory and its two major extensions - hierarchical hypergames and HNF - we develop agent-compatibility criteria and an agent-based classification framework to assess integration patterns and practical applicability. Our analysis reveals prevailing tendencies, including the prevalence of hierarchical and graph-based models in deceptive reasoning and the simplification of extensive theoretical frameworks in practical applications. We identify structural gaps, including the limited adoption of HNF-based models, the lack of formal hypergame languages, and unexplored opportunities for modeling human-agent and agent-agent misalignment. By synthesizing trends, challenges, and open research directions, this review provides a new roadmap for applying hypergame theory to enhance the realism and effectiveness of strategic modeling in dynamic multi-agent environments.
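[AI-112] Programmable Virtual Humans Toward Human Physiologically-Based Drug Discovery
[Quick read]: This paper addresses the translational gap between early-stage discovery and late-stage development in drug discovery, a gap that drives high failure rates. Conventional AI approaches merely digitize existing high-throughput experiments and remain constrained by conventional pipelines, so they cannot effectively predict drug effects in humans; biomedical digital twins, although grounded in real-world data and mechanistic models, lack the resolution to simulate molecular interactions and their systemic consequences. The key to the solution is to build programmable virtual humans: dynamic, multiscale computational models that simulate drug action from the molecular to the phenotypic level and enable virtual experiments that are impossible in the real world, such as testing novel compounds directly in the human body. This bridges early discovery and clinical development and makes it possible to optimize therapeutic efficacy and safety earlier.
Link: https://arxiv.org/abs/2507.19568
Authors: You Wu,Philip E. Bourne,Lei Xie
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Under Review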
Abstract:Artificial intelligence (AI) has sparked immense interest in drug discovery, but most current approaches only digitize existing high-throughput experiments. They remain constrained by conventional pipelines. As a result, they do not address the fundamental challenges of predicting drug effects in humans. Similarly, biomedical digital twins, largely grounded in real-world data and mechanistic models, are tailored for late-phase drug development and lack the resolution to model molecular interactions or their systemic consequences, limiting their impact in early-stage discovery. This disconnect between early discovery and late development is one of the main drivers of high failure rates in drug discovery. The true promise of AI lies not in augmenting current experiments but in enabling virtual experiments that are impossible in the real world: testing novel compounds directly in silico in the human body. Recent advances in AI, high-throughput perturbation assays, and single-cell and spatial omics across species now make it possible to construct programmable virtual humans: dynamic, multiscale models that simulate drug actions from molecular to phenotypic levels. By bridging the translational gap, programmable virtual humans offer a transformative path to optimize therapeutic efficacy and safety earlier than ever before. This perspective introduces the concept of programmable virtual humans, explores their roles in a new paradigm of drug discovery centered on human physiology, and outlines key opportunities, challenges, and roadmaps for their realization.
[AI-113] Differentiating hype from practical applications of large language models in medicine - a primer for healthcare professionals ALT
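[Quick read]: This paper addresses a twofold challenge in applying AI, and large language models (LLMs) in particular, within the medical ecosystem: how to use machine learning and LLMs to improve efficiency and accuracy in clinical training, practice, and adjacent research, and how to avoid the risks that arise because LLMs have no understanding of objective, reality-based truth and may disclose protected health information (PHI). The key to the solution is careful evaluation and context-specific design of AI deployments, so that automation and programmatic assistance deliver their benefits while strict risk controls minimize potential harm in each setting.
Link: https://arxiv.org/abs/2507.19567
Authors: Elisha D.O. Roberson
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)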
Comments: 7 pages main document text, 2 figures. A basic primer on the potential and dangers of AI generally and LLMs specifically in the medical care system. Targeted to non-expert healthcare workers without experience in AI or LLMs
Abstract:The medical ecosystem consists of the training of new clinicians and researchers, the practice of clinical medicine, and areas of adjacent research. There are many aspects of these domains that could benefit from the application of task automation and programmatic assistance. Machine learning and artificial intelligence techniques, including large language models (LLMs), have been promised to deliver on healthcare innovation, improving care speed and accuracy, and reducing the burden on staff for manual interventions. However, LLMs have no understanding of objective truth that is based in reality. They also represent real risks to the disclosure of protected information when used by clinicians and researchers. The use of AI in medicine in general, and the deployment of LLMs in particular, therefore requires careful consideration and thoughtful application to reap the benefits of these technologies while avoiding the dangers in each context.
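[AI-114] Towards Sustainability Model Cards
[Quick read]: This paper addresses the environmental and sustainability challenges posed by the sharply rising energy consumption of training and using machine learning (ML) models, and the fact that existing Green AI assessment metrics are not yet integrated with Quality Models and Service-Level Agreements (SLAs), which limits automated analysis of model energy information and its use in model comparison, selection, and certification. The key to the solution is to bring in the concept of quality models and merge it with existing ML model reporting initiatives (such as Model Cards) and Green/Frugal AI proposals, yielding a Sustainable Quality Model that formally describes the sustainability of AI/ML models. As a first step, a new Domain-Specific Language (DSL) precisely characterizes the energy costs of a model's individual tasks; this information can be exported as an extended Model Card and is formal enough for automatic processing.
Link: https://arxiv.org/abs/2507.19559
Authors: Gwendal Jouneaux,Jordi Cabot
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: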
Abstract:The growth of machine learning (ML) models and associated datasets triggers a consequent dramatic increase in energy costs for the use and training of these models. In the current context of environmental awareness and global sustainability concerns involving ICT, Green AI is becoming an important research topic. Initiatives like the AI Energy Score Ratings are a good example. Nevertheless, these benchmarking attempts are still to be integrated with existing work on Quality Models and Service-Level Agreements common in other, more mature, ICT subfields. This limits the (automatic) analysis of these model energy descriptions and their use in (semi)automatic model comparison, selection, and certification processes. We aim to leverage the concept of quality models and merge it with existing ML model reporting initiatives and Green/Frugal AI proposals to formalize a Sustainable Quality Model for AI/ML models. As a first step, we propose a new Domain-Specific Language to precisely define the sustainability aspects of an ML model (including the energy costs for its different tasks). This information can then be exported as an extended version of the well-known Model Cards initiative while, at the same time, being formal enough to serve as input to any other automatic model description process.
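[AI-115] PEMUTA: Pedagogically-Enriched Multi-Granular Undergraduate Thesis Assessment
[Quick read]: This paper addresses the limitation that current large language models (LLMs) assess undergraduate theses (UGTEs) with a single score and ignore multi-dimensional criteria, so they fail to reflect structural criteria, pedagogical objectives, and diverse academic competencies. The key to the solution is PEMUTA, a framework grounded in Vygotsky's theory and Bloom's Taxonomy that uses a hierarchical prompting scheme to evaluate UGTEs at multiple granularities along six fine-grained dimensions: Structure, Logic, Originality, Writing, Proficiency, and Rigor (SLOWPR). Two in-context learning techniques, few-shot prompting and role-play prompting, improve agreement with expert judgments without fine-tuning, and a dataset of authentic UGTEs with SLOWPR annotations supports fine-grained evaluation.
Link: https://arxiv.org/abs/2507.19556
Authors: Jialu Zhang,Qingyang Sun,Qianyi Wang,Weiyi Zhang,Zunjie Xiao,Xiaoqing Zhang,Jianfeng Ren,Jiang Liu
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: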
Abstract:The undergraduate thesis (UGTE) plays an indispensable role in assessing a student’s cumulative academic development throughout their college years. Although large language models (LLMs) have advanced education intelligence, they typically focus on holistic assessment with only one single evaluation score, but ignore the intricate nuances across multifaceted criteria, limiting their ability to reflect structural criteria, pedagogical objectives, and diverse academic competencies. Meanwhile, pedagogical theories have long informed manual UGTE evaluation through multi-dimensional assessment of cognitive development, disciplinary thinking, and academic performance, yet remain underutilized in automated settings. Motivated by the research gap, we pioneer PEMUTA, a pedagogically-enriched framework that effectively activates domain-specific knowledge from LLMs for multi-granular UGTE assessment. Guided by Vygotsky’s theory and Bloom’s Taxonomy, PEMUTA incorporates a hierarchical prompting scheme that evaluates UGTEs across six fine-grained dimensions: Structure, Logic, Originality, Writing, Proficiency, and Rigor (SLOWPR), followed by holistic synthesis. Two in-context learning techniques, i.e., few-shot prompting and role-play prompting, are also incorporated to further enhance alignment with expert judgments without fine-tuning. We curate a dataset of authentic UGTEs with expert-provided SLOWPR-aligned annotations to support multi-granular UGTE assessment. Extensive experiments demonstrate that PEMUTA achieves strong alignment with expert evaluations, and exhibits strong potential for fine-grained, pedagogically-informed UGTE evaluations.
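[AI-116] Extending Group Relative Policy Optimization to Continuous Control: A Theoretical Framework for Robotic Reinforcement Learning
[Quick read]: This paper addresses the unexplored application of Group Relative Policy Optimization (GRPO) to continuous control, where continuous action spaces are essential, especially in robotics. The core challenges are high-dimensional action spaces, sparse rewards, and temporal dynamics. The key to the solution is to introduce trajectory-based policy clustering, state-aware advantage estimation, and regularized policy updates designed for robotic tasks, building a theoretical framework for continuous control and laying the groundwork for later empirical validation on locomotion and manipulation tasks.
Link: https://arxiv.org/abs/2507.19555
Authors: Rajat Khanda,Mohammad Baqar,Sambuddha Chakrabarti,Satyasaran Changdar
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures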
Abstract:Group Relative Policy Optimization (GRPO) has shown promise in discrete action spaces by eliminating value function dependencies through group-based advantage estimation. However, its application to continuous control remains unexplored, limiting its utility in robotics where continuous actions are essential. This paper presents a theoretical framework extending GRPO to continuous control environments, addressing challenges in high-dimensional action spaces, sparse rewards, and temporal dynamics. Our approach introduces trajectory-based policy clustering, state-aware advantage estimation, and regularized policy updates designed for robotic applications. We provide theoretical analysis of convergence properties and computational complexity, establishing a foundation for future empirical validation in robotic systems including locomotion and manipulation tasks.
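The "group-based advantage estimation" that lets GRPO drop the value function can be sketched as normalizing each sampled return against the statistics of its own group. The snippet below shows only this basic normalization. The paper's trajectory-based clustering and state-aware variants are not specified in the abstract and are not reproduced, and the `eps` constant and toy returns are assumptions.

```python
import statistics


def group_relative_advantages(returns, eps=1e-8):
    """Advantage of each sample relative to its group: (R_i - mean) / (std + eps)."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)   # population std over the group
    return [(r - mean) / (std + eps) for r in returns]


# Made-up episode returns for four trajectories sampled from the same start state.
group_returns = [12.0, 15.0, 9.0, 12.0]
print(group_relative_advantages(group_returns))
```

Trajectories that beat the group average receive positive advantages and the rest negative ones, so no learned critic is needed; `eps` only guards against a group whose returns are all identical.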
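[AI-117] AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code
[Quick read]: This paper addresses Web accessibility violations: the vast majority of Web pages fail to comply with established accessibility guidelines, which prevents users with diverse abilities from accessing their content. To reduce manual correction effort for Web page providers and promote inclusiveness, it tackles the automatic detection and correction of accessibility violations in HTML code. The key to the solution is a new taxonomy that divides violations into three categories (Syntactic, Semantic, and Layout) and, built on it, a method called AccessGuru that combines existing accessibility testing tools with Large Language Models (LLMs) and uses taxonomy-driven prompting strategies tailored to the three categories for detection and automatic correction. On a real-world benchmark, the method achieves up to an 84% average decrease in violation score, well above the at most 50% achieved by prior methods.
Link: https://arxiv.org/abs/2507.19549
Authors: Nadeen Fathallah,Daniel Hernández,Steffen Staab
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: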
Abstract:The vast majority of Web pages fail to comply with established Web accessibility guidelines, excluding a range of users with diverse abilities from interacting with their content. Making Web pages accessible to all users requires dedicated expertise and additional manual efforts from Web page providers. To lower their efforts and promote inclusiveness, we aim to automatically detect and correct Web accessibility violations in HTML code. While previous work has made progress in detecting certain types of accessibility violations, the problem of automatically detecting and correcting accessibility violations remains an open challenge that we address. We introduce a novel taxonomy classifying Web accessibility violations into three key categories - Syntactic, Semantic, and Layout. This taxonomy provides a structured foundation for developing our detection and correction method and redefining evaluation metrics. We propose a novel method, AccessGuru, which combines existing accessibility testing tools and Large Language Models (LLMs) to detect violations and applies taxonomy-driven prompting strategies to correct all three categories. To evaluate these capabilities, we develop a benchmark of real-world Web accessibility violations. Our benchmark quantifies syntactic and layout compliance and judges semantic accuracy through comparative analysis with human expert corrections. Evaluation against our benchmark shows that AccessGuru achieves up to 84% average violation score decrease, significantly outperforming prior methods that achieve at most 50%.
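To give a feel for what a syntactic violation looks like in practice, here is a small rule-based check for one well-known case, an `<img>` element with no `alt` attribute. It uses only Python's standard `html.parser` and is a hypothetical illustration of the detection step, not AccessGuru's pipeline, which pairs existing testing tools with LLM prompting.

```python
from html.parser import HTMLParser


class MissingAltChecker(HTMLParser):
    """Collects the source position of every <img> tag that lacks an alt attribute."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # alt="" is allowed (decorative image); only a missing attribute is flagged.
        if tag == "img" and "alt" not in dict(attrs):
            self.violations.append(self.getpos())   # (line, column) of the tag


def find_missing_alt(html: str):
    checker = MissingAltChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


page = '<p><img src="logo.png"><img src="chart.png" alt="Sales by quarter"></p>'
print(find_missing_alt(page))
```

A check like this can flag the missing attribute but cannot write a meaningful description of the image, which is the kind of semantic correction the paper delegates to an LLM.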
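[AI-118] Justifications for Democratizing AI Alignment and Their Prospects
[Quick read]: This paper addresses the normative dimension of the AI alignment problem, namely how to determine the ethical and value constraints that AI systems should follow. The core challenge is how to justify the source of these constraints under normative and metanormative uncertainty. The paper argues that relying purely on expert judgment (epistocratic approaches) or purely on public participation (democratic approaches) each has limits: the former may lack legitimacy, while the latter struggles to prevent illegitimate coercion. The key to the solution is hybrid frameworks that combine expert judgment with participatory input, backed by institutional safeguards against AI monopolization, to keep the norm-setting process legitimate and effective.
Link: https://arxiv.org/abs/2507.19548
Authors: André Steingrüber,Kevin Baum
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: accepted for the LNCS on-site proceedings of the AISoLA 2025 conference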
Abstract:The AI alignment problem comprises both technical and normative dimensions. While technical solutions focus on implementing normative constraints in AI systems, the normative problem concerns determining what these constraints should be. This paper examines justifications for democratic approaches to the normative problem – where affected stakeholders determine AI alignment – as opposed to epistocratic approaches that defer to normative experts. We analyze both instrumental justifications (democratic approaches produce better outcomes) and non-instrumental justifications (democratic approaches prevent illegitimate authority or coercion). We argue that normative and metanormative uncertainty create a justificatory gap that democratic approaches aim to fill through political rather than theoretical justification. However, we identify significant challenges for democratic approaches, particularly regarding the prevention of illegitimate coercion through AI alignment. Our analysis suggests that neither purely epistocratic nor purely democratic approaches may be sufficient on their own, pointing toward hybrid frameworks that combine expert judgment with participatory input alongside institutional safeguards against AI monopolization.
[AI-119] Agent WARPP: Workflow Adherence via Runtime Parallel Personalization ICML2025
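Quick read: This paper targets the poor workflow adherence of large language models (LLMs) in task-oriented dialogue (TOD) systems when they handle long, conditional workflows, especially those involving external tool calls and user-specific information. The key to the solution is WARPP (Workflow Adherence via Runtime Parallel Personalization), a training-free, modular framework that combines multi-agent orchestration with runtime personalization: a dedicated Personalizer agent dynamically prunes conditional branches based on user attributes, reducing reasoning overhead and narrowing tool selection, while modular, domain-specific agents run in parallel to tailor execution paths in real time. The method is validated on five intents of varying complexity across three domains (banking, flights, and healthcare), improving tool accuracy and parameter fidelity without any additional training.
Link: https://arxiv.org/abs/2507.19543
Authors: Maria Emilia Mazzolenis,Ruirui Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at the ICML 2025 Workshop on Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges, and Futures. Code repo: this https URL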
Abstract:Large language models (LLMs) are increasingly applied in task-oriented dialogue (TOD) systems but often struggle with long, conditional workflows that involve external tool calls and depend on user-specific information. We present Workflow Adherence via Runtime Parallel Personalization, or WARPP, a training-free, modular framework that combines multi-agent orchestration with runtime personalization to improve workflow adherence in LLM-based systems. By dynamically pruning conditional branches based on user attributes, the framework reduces reasoning overhead and narrows tool selection at runtime. WARPP deploys a parallelized architecture where a dedicated Personalizer agent operates alongside modular, domain-specific agents to dynamically tailor execution paths in real time. The framework is evaluated across five representative user intents of varying complexity within three domains: banking, flights, and healthcare. Our evaluation leverages synthetic datasets and LLM-powered simulated users to test scenarios with conditional dependencies. Our results demonstrate that WARPP outperforms both the non-personalized method and the ReAct baseline, achieving increasingly larger gains in parameter fidelity and tool accuracy as intent complexity grows, while also reducing average token usage, without any additional training.
[AI-120] Swift-Sarsa: Fast and Robust Linear Control
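Quick read: This paper addresses on-policy control in reinforcement learning, in particular effective feature selection and credit assignment when the input is high-dimensional and dominated by noisy signals. The core challenge is that, under non-stationary noise, the agent must identify the few signals that actually matter for decision making and attribute prediction errors to the corresponding weight parameters. The key to the solution is to combine the core ideas of SwiftTD (step-size optimization, a bound on the effective learning rate, and step-size decay) with True Online Sarsa(λ), yielding a new on-policy algorithm, Swift-Sarsa. Without prior knowledge of the problem structure, the method learns to separate relevant signals from noise and shows robust, efficient credit assignment on the operant conditioning benchmark, offering a feasible path to representation learning over very large feature spaces.
Link: https://arxiv.org/abs/2507.19539
Authors: Khurram Javed,Richard S. Sutton
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Presented at RLDM 2025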
Abstract:Javed, Sharifnassab, and Sutton (2024) introduced a new algorithm for TD learning – SwiftTD – that augments True Online TD(λ) with step-size optimization, a bound on the effective learning rate, and step-size decay. In their experiments SwiftTD outperformed True Online TD(λ) and TD(λ) on a variety of prediction tasks derived from Atari games, and its performance was robust to the choice of hyper-parameters. In this extended abstract we extend SwiftTD to work for control problems. We combine the key ideas behind SwiftTD with True Online Sarsa(λ) to develop an on-policy reinforcement learning algorithm called Swift-Sarsa. We propose a simple benchmark for linear on-policy control called the operant conditioning benchmark. The key challenge in the operant conditioning benchmark is that a very small subset of input signals are relevant for decision making. The majority of the signals are noise sampled from a non-stationary distribution. To learn effectively, the agent must learn to differentiate between the relevant signals and the noisy signals, and minimize prediction errors by assigning credit to the weight parameters associated with the relevant signals. Swift-Sarsa, when applied to the operant conditioning benchmark, learned to assign credit to the relevant signals without any prior knowledge of the structure of the problem. It opens the door for solution methods that learn representations by searching over hundreds of millions of features in parallel without performance degradation due to noisy or bad features.
[AI-121] Graph Learning Metallic Glass Discovery from Wikipedia
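Quick read: This paper addresses the difficulty of efficiently synthesizing new materials such as metallic glasses, whose formation depends strongly on optimal combinations of multiple elements that resist crystallization, so that only a few thousand candidates have been explored in a vast material space. Conventional data-driven methods are limited by data scarcity and immature material encoding, and typically apply statistical learning to tabular data, with limited predictive power and generalizability. The key to the solution is a learning paradigm based on material network representations: node (element) embeddings are extracted from Wikipedia by a language model, and graph neural networks (GNNs) with various architectures serve as recommendation systems that uncover hidden relationships among materials. Wikipedia embeddings from different languages are also used to assess the capability of natural languages in materials design, pointing toward a new AI-based paradigm for materials discovery.
Link: https://arxiv.org/abs/2507.19536
Authors: K.-C. Ouyang,S.-Y. Zhang,S.-L. Liu,J. Tian,Y.-H. Li,H. Tong,H.-Y. Bai,W.-H. Wang,Y.-C. Hu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 7 figures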
Abstract:Synthesizing new materials efficiently is in high demand in various research fields. However, this process is usually slow and expensive, especially for metallic glasses, whose formation strongly depends on the optimal combinations of multiple elements to resist crystallization. Because of this constraint, only several thousand candidates have been explored in the vast material space since 1960. Recently, data-driven approaches armed with advanced machine learning techniques have provided alternative routes for intelligent materials design. Due to data scarcity and immature material encoding, the conventional tabular data is usually mined by statistical learning algorithms, giving limited model predictability and generalizability. Here, we propose sophisticated data learning from material network representations. The node elements are encoded from Wikipedia by a language model. Graph neural networks with versatile architectures are designed to serve as recommendation systems to explore hidden relationships among materials. By employing Wikipedia embeddings from different languages, we assess the capability of natural languages in materials design. Our study proposes a new paradigm for harvesting new amorphous materials and beyond with artificial intelligence.
[AI-122] Clinical-Grade Blood Pressure Prediction in ICU Settings: An Ensemble Framework with Uncertainty Quantification and Cross-Institutional Validation
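Quick read: This paper addresses three problems that hinder clinical deployment of electronic health record (EHR) based blood pressure (BP) prediction models in intensive care units (ICUs): lack of external validation, insufficient uncertainty quantification, and missing data leakage prevention. The core of the solution is a comprehensive framework that, for the first time, combines systematic data leakage prevention, uncertainty quantification through quantile regression, and cross-institutional validation between the MIMIC-III and eICU databases to assess how well the model generalizes. The framework ensembles Gradient Boosting, Random Forest, and XGBoost over 74 features across five physiological domains. Internal validation reaches clinically acceptable performance (SBP: R² = 0.86, RMSE = 6.03 mmHg; DBP: R² = 0.49, RMSE = 7.13 mmHg), with prediction interval coverage of 80.3% (SBP) and 79.9% (DBP), supporting risk-stratified protocols: narrow intervals (<15 mmHg) for standard monitoring and wide intervals (>30 mmHg) for manual verification. The design provides a realistic basis for deploying AI-assisted BP monitoring in ICU settings.
Link: https://arxiv.org/abs/2507.19530
Authors: Md Basit Azam,Sarangthem Ibotombi Singh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: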
Abstract:Blood pressure (BP) monitoring is critical in intensive care units (ICUs) where hemodynamic instability can rapidly progress to cardiovascular collapse. Current machine learning (ML) approaches suffer from three limitations: lack of external validation, absence of uncertainty quantification, and inadequate data leakage prevention. This study presents the first comprehensive framework with novel algorithmic leakage prevention, uncertainty quantification, and cross-institutional validation for electronic health record (EHR) based BP predictions. Our methodology implemented systematic data leakage prevention, uncertainty quantification through quantile regression, and external validation between the MIMIC-III and eICU databases. An ensemble framework combines Gradient Boosting, Random Forest, and XGBoost with 74 features across five physiological domains. Internal validation achieved a clinically acceptable performance (for SBP: R^2 = 0.86, RMSE = 6.03 mmHg; DBP: R^2 = 0.49, RMSE = 7.13 mmHg), meeting AAMI standards. External validation showed 30% degradation with critical limitations in hypotensive patients. Uncertainty quantification generated valid prediction intervals (80.3% SBP and 79.9% DBP coverage), enabling risk-stratified protocols with narrow intervals (<15 mmHg) for standard monitoring and wide intervals (>30 mmHg) for manual verification. This framework provides realistic deployment expectations for cross-institutional AI-assisted BP monitoring in critical care settings. The source code is publicly available at this https URL mdbasit897/clinical-bp-prediction-ehr.
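The interval-based protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the choice of quantile levels, and the "intermediate" band between the two thresholds are assumptions made for illustration; only the 15 mmHg and 30 mmHg thresholds and the idea of checking empirical coverage come from the abstract.

```python
def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for one sample at quantile level tau in (0, 1)."""
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1.0) * diff)


def interval_coverage(y_true, lower, upper):
    """Fraction of observed values that fall inside their prediction interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def triage(lower, upper, narrow=15.0, wide=30.0):
    """Map interval width (mmHg) to a monitoring action.

    The thresholds follow the abstract; the 'intermediate' label for widths
    between them is a hypothetical placeholder, not from the paper.
    """
    width = upper - lower
    if width < narrow:
        return "standard monitoring"
    if width > wide:
        return "manual verification"
    return "intermediate"


# Hypothetical systolic BP readings with lower/upper quantile predictions.
observed = [120, 95, 140, 110]
lower_q = [110, 100, 130, 100]
upper_q = [130, 115, 150, 105]
print(interval_coverage(observed, lower_q, upper_q))
print([triage(lo, hi) for lo, hi in zip(lower_q, upper_q)])
```

A coverage value close to the nominal level of the interval (for example, about 0.8 for an interval built from the 10th and 90th percentiles) is what "valid prediction intervals" would mean under this reading.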
[AI-123] Machine Learning Risk Intelligence for Green Hydrogen Investment: Insights for Duqm R3 Auction
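Quick read: This paper addresses the difficulty of assessing risk for green hydrogen infrastructure in desert environments, where historical operations and maintenance data are lacking; for large-scale projects in emerging markets such as Oman, equipment maintenance pressure and operational risk are hard to quantify. The key to the solution is an AI-based decision support system that uses publicly available meteorological data to build a Maintenance Pressure Index (MPI). From measurable environmental factors such as dust storms, extreme temperatures, and humidity fluctuations, the MPI predicts future maintenance demands on infrastructure, bringing temporal risk insight to auction evaluation and long-term planning in the absence of operational data.
Link: https://arxiv.org/abs/2507.19529
Authors: Obumneme Nwafor,Mohammed Abdul Majeed Al Hooti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: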
Abstract:As green hydrogen emerges as a major component of global decarbonisation, Oman has positioned itself strategically through national auctions and international partnerships. Following two successful green hydrogen project rounds, the country launched its third auction (R3) in the Duqm region. While this area exhibits relative geospatial homogeneity, it is still vulnerable to environmental fluctuations that pose inherent risks to productivity. Despite growing global investment in green hydrogen, operational data remains scarce, with major projects like Saudi Arabia’s NEOM facility not expected to commence production until 2026, and Oman’s ACME Duqm project scheduled for 2028. This absence of historical maintenance and performance data from large-scale hydrogen facilities in desert environments creates a major knowledge gap for accurate risk assessment for infrastructure planning and auction decisions. Given this data void, environmental conditions emerge as accessible and reliable proxy for predicting infrastructure maintenance pressures, because harsh desert conditions such as dust storms, extreme temperatures, and humidity fluctuations are well-documented drivers of equipment degradation in renewable energy systems. To address this challenge, this paper proposes an Artificial Intelligence decision support system that leverages publicly available meteorological data to develop a predictive Maintenance Pressure Index (MPI), which predicts risk levels and future maintenance demands on hydrogen infrastructure. This tool strengthens regulatory foresight and operational decision-making by enabling temporal benchmarking to assess and validate performance claims over time. It can be used to incorporate temporal risk intelligence into auction evaluation criteria despite the absence of historical operational benchmarks.
[AI-124] Quantizing Text-attributed Graphs for Semantic-Structural Integration KDD’2025
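Quick read: This paper addresses two core challenges in embedding graph structural information into formats compatible with large language models (LLMs): existing methods either depend on computationally expensive alignment mechanisms or use manual graph verbalization that tends to lose critical structural details, and most require labeled source-domain data for transfer learning, which limits generalization. The key to the solution is STAG, a self-supervised framework that quantizes graph structural information directly into discrete tokens using a frozen codebook, with soft assignment and KL divergence guided quantization to handle the lack of natural tokenization structure in graph data. The method supports true zero-shot transfer learning without source-domain labels, is compatible with different LLM architectures, and achieves state-of-the-art performance on multiple node classification benchmarks.
Link: https://arxiv.org/abs/2507.19526
Authors: Jianyuan Bo,Hao Wu,Yuan Fang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at KDD’2025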
Abstract:Text-attributed graphs (TAGs) have emerged as a powerful representation for modeling complex relationships across diverse domains. With the rise of large language models (LLMs), there is growing interest in leveraging their capabilities for graph learning. However, current approaches face significant challenges in embedding structural information into LLM-compatible formats, requiring either computationally expensive alignment mechanisms or manual graph verbalization techniques that often lose critical structural details. Moreover, these methods typically require labeled data from source domains for effective transfer learning, significantly constraining their adaptability. We propose STAG, a novel self-supervised framework that directly quantizes graph structural information into discrete tokens using a frozen codebook. Unlike traditional quantization approaches, our method employs soft assignment and KL divergence guided quantization to address the unique challenges of graph data, which lacks natural tokenization structures. Our framework enables both LLM-based and traditional learning approaches, supporting true zero-shot transfer learning without requiring labeled data even in the source domain. Extensive experiments demonstrate state-of-the-art performance across multiple node classification benchmarks while maintaining compatibility with different LLM architectures, offering an elegant solution to bridging graph learning with LLMs.
[AI-125] MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
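Quick read: This paper addresses the lack of a comprehensive benchmark for evaluating multimodal large language models (MLLMs) in electronic design automation (EDA). Existing benchmarks are narrow in scope and cannot accurately measure MLLM capability across the circuit design flow. The authors propose MMCircuitEval, the first multimodal benchmark designed specifically for EDA tasks. It comprises 3614 curated question-answer (QA) pairs spanning digital and analog circuits and key EDA stages from general knowledge to front-end and back-end design, each reviewed by experts for accuracy and relevance. Questions are categorized by design stage, circuit type, tested ability (knowledge, comprehension, reasoning, computation), and difficulty level, enabling fine-grained analysis that exposes clear weaknesses of existing models in back-end design and complex computation and gives direction for targeted training data and model improvement.
Link: https://arxiv.org/abs/2507.19525
Authors: Chenchen Zhao,Zhengyuan Shi,Xiangyu Wen,Chengjie Liu,Yi Liu,Yunhao Zhou,Yuxiang Zhao,Hefei Feng,Yinan Zhu,Gwok-Waa Wan,Xin Cheng,Weiyu Chen,Yongqi Fu,Chujie Chen,Chenhao Xue,Guangyu Sun,Ying Wang,Yibo Lin,Jun Yang,Ning Xu,Xi Wang,Qiang Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure, 5 tables. To appear in ICCAD 2025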
Abstract:The emergence of multimodal large language models (MLLMs) presents promising opportunities for automation and enhancement in Electronic Design Automation (EDA). However, comprehensively evaluating these models in circuit design remains challenging due to the narrow scope of existing benchmarks. To bridge this gap, we introduce MMCircuitEval, the first multimodal benchmark specifically designed to assess MLLM performance comprehensively across diverse EDA tasks. MMCircuitEval comprises 3614 meticulously curated question-answer (QA) pairs spanning digital and analog circuits across critical EDA stages - ranging from general knowledge and specifications to front-end and back-end design. Derived from textbooks, technical question banks, datasheets, and real-world documentation, each QA pair undergoes rigorous expert review for accuracy and relevance. Our benchmark uniquely categorizes questions by design stage, circuit type, tested abilities (knowledge, comprehension, reasoning, computation), and difficulty level, enabling detailed analysis of model capabilities and limitations. Extensive evaluations reveal significant performance gaps among existing LLMs, particularly in back-end design and complex computations, highlighting the critical need for targeted training datasets and modeling approaches. MMCircuitEval provides a foundational resource for advancing MLLMs in EDA, facilitating their integration into real-world circuit design workflows. Our benchmark is available at this https URL.
[AI-126] Language Models for Controllable DNA Sequence Design
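Quick read: This paper addresses controllable DNA sequence design, that is, generating DNA sequences that satisfy specified biological properties. Earlier methods are limited in how well they control functional properties, while the success of language models (LMs) in natural language generation suggests their potential for modeling biological sequences. The key to the solution is ATGC-Gen, an Automated Transformer Generator based on cross-modal encoding that integrates diverse biological signals and supports both decoder-only and encoder-only transformer architectures, trained and sampled under autoregressive or masked recovery objectives. The method improves the functional relevance, diversity, and controllability of generated sequences over prior methods, supporting the promise of language models for programmable genomic design.
Link: https://arxiv.org/abs/2507.19523
Authors: Xingyu Su,Xiner Li,Yuchao Lin,Ziqian Xie,Degui Zhi,Shuiwang Ji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: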
Abstract:We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at (this https URL).
[AI-127] Exoplanet Detection Using Machine Learning Models Trained on Synthetic Light Curves
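Quick read: This paper addresses the slow pace of exoplanet discovery caused by inefficient manual inspection. Only about 5,000 exoplanets have been confirmed since the late 20th century, which suggests that traditional methods cannot keep up with growing data volumes. The key to the solution is to train several well-known machine learning (ML) models (logistic regression, k-nearest neighbors, and random forest) on data from NASA's Kepler space telescope to automate identification; to reduce bias from dataset imbalance, the study applies data augmentation, which significantly improves recall and precision and supports fairer predictions and better generalization.
Link: https://arxiv.org/abs/2507.19520
Authors: Ethan Lo,Dan C. Lo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: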
Abstract:With manual searching processes, the rate at which scientists and astronomers discover exoplanets is slow because of inefficiencies that require an extensive time of laborious inspections. In fact, as of now there have been about only 5,000 confirmed exoplanets since the late 1900s. Recently, machine learning (ML) has proven to be extremely valuable and efficient in various fields, capable of processing massive amounts of data in addition to increasing its accuracy by learning. Though ML models for discovering exoplanets owned by large corporations (e.g. NASA) exist already, they largely depend on complex algorithms and supercomputers. In an effort to reduce such complexities, in this paper, we report the results and potential benefits of various, well-known ML models in the discovery and validation of extrasolar planets. The ML models that are examined in this study include logistic regression, k-nearest neighbors, and random forest. The dataset on which the models train and predict is acquired from NASA’s Kepler space telescope. The initial results show promising scores for each model. However, potential biases and dataset imbalances necessitate the use of data augmentation techniques to further ensure fairer predictions and improved generalization. This study concludes that, in the context of searching for exoplanets, data augmentation techniques significantly improve the recall and precision, while the accuracy varies for each model.
[AI-128] Physics-informed transfer learning for SHM via feature selection
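Quick read: This paper addresses the limited generalization of structural health monitoring (SHM) systems caused by the high cost and impracticality of obtaining labeled training data. Across structures, distribution differences between source and target domains (for example, different dynamic responses) prevent conventional machine learning methods from transferring well. The key to the solution is to use physics knowledge to guide feature selection, with the modal assurance criterion (MAC) quantifying the correspondence between the modes of healthy structures, so as to select features whose conditional distributions are invariant under damage across structures. The MAC is shown to correlate strongly with a supervised metric of joint-distribution similarity, so it can serve as a criterion for choosing source structures and features in unsupervised transfer learning and improve generalization to the target structure.
Link: https://arxiv.org/abs/2507.19519
Authors: J. Poole,P. Gardner,A. J. Hughes,N. Dervilis,R. S. Mills,T. A. Dardeno,K. Worden
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: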
Abstract:Data used for training structural health monitoring (SHM) systems are expensive and often impractical to obtain, particularly labelled data. Population-based SHM presents a potential solution to this issue by considering the available data across a population of structures. However, differences between structures will mean the training and testing distributions will differ; thus, conventional machine learning methods cannot be expected to generalise between structures. To address this issue, transfer learning (TL) can be used to leverage information across related domains. An important consideration is that the lack of labels in the target domain limits data-based metrics to quantifying the discrepancy between the marginal distributions. Thus, a prerequisite for the application of typical unsupervised TL methods is to identify suitable source structures (domains), and a set of features, for which the conditional distributions are related to the target structure. Generally, the selection of domains and features is reliant on domain expertise; however, for complex mechanisms, such as the influence of damage on the dynamic response of a structure, this task is not trivial. In this paper, knowledge of physics is leveraged to select more similar features: the modal assurance criterion (MAC) is used to quantify the correspondence between the modes of healthy structures. The MAC is shown to have high correspondence with a supervised metric that measures joint-distribution similarity, which is the primary indicator of whether a classifier will generalise between domains. The MAC is proposed as a measure for selecting a set of features that behave consistently across domains when subjected to damage, i.e. features with invariance in the conditional distributions. This approach is demonstrated on numerical and experimental case studies to verify its effectiveness in various applications.
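For readers unfamiliar with the MAC, the following minimal sketch computes it for real-valued mode shapes using the standard textbook definition, MAC(a, b) = (aᵀb)² / ((aᵀa)(bᵀb)). It illustrates the metric only and is not the paper's feature-selection procedure; the function names and the example mode shapes are hypothetical.

```python
def mac(phi_a, phi_b):
    """Modal assurance criterion between two real mode-shape vectors.

    Returns a value in [0, 1]; 1 means the shapes are identical up to scale
    and sign, 0 means they are orthogonal.
    """
    if len(phi_a) != len(phi_b):
        raise ValueError("mode shapes must have the same length")
    cross = sum(a * b for a, b in zip(phi_a, phi_b))
    norm_a = sum(a * a for a in phi_a)
    norm_b = sum(b * b for b in phi_b)
    if norm_a == 0 or norm_b == 0:
        raise ValueError("mode shapes must be non-zero")
    return (cross * cross) / (norm_a * norm_b)


def mac_matrix(modes_a, modes_b):
    """Pairwise MAC values between the modes of two structures."""
    return [[mac(pa, pb) for pb in modes_b] for pa in modes_a]


# Hypothetical mode shapes sampled at three sensor locations on two structures.
source_modes = [[1.0, 2.0, 3.0], [1.0, 0.0, -1.0]]
target_modes = [[2.0, 4.0, 6.0], [1.0, -2.0, 1.0]]
for row in mac_matrix(source_modes, target_modes):
    print(row)
```

Under the paper's proposal, modes with high MAC across structures would be the candidates for features that behave consistently between domains; how a threshold is chosen is not stated in the abstract.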
[AI-129] Target Circuit Matching in Large-Scale Netlists using GNN-Based Region Prediction
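Quick read: This paper addresses the efficiency and generalization of subgraph matching in large-scale circuits. Traditional rule-based methods generalize poorly to arbitrary target circuits, node-to-node matching is computationally inefficient, and existing deep learning methods either fail to capture global subgraph embeddings efficiently or rely on inefficient matching matrices. The key to the solution is to use graph neural networks (GNNs) to predict regions with a high probability of containing the target circuit, constructing diverse negative samples so that the GNN learns to recognize the presence of target circuits accurately. The paper also proposes extracting subgraph embeddings directly from the entire circuit, which captures global subgraph information and avoids applying GNNs to every candidate subgraph, improving both time efficiency and prediction accuracy and giving a scalable approach to subgraph matching in large circuits.
Link: https://arxiv.org/abs/2507.19518
Authors: Sangwoo Seo,Jimin Seo,Yoonho Lee,Donghyeon Kim,Hyejin Shin,Banghyun Sung,Chanyoung Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICCAD 2025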
Abstract:Subgraph matching plays an important role in electronic design automation (EDA) and circuit verification. Traditional rule-based methods have limitations in generalizing to arbitrary target circuits. Furthermore, node-to-node matching approaches tend to be computationally inefficient, particularly for large-scale circuits. Deep learning methods have emerged as a potential solution to address these challenges, but existing models fail to efficiently capture global subgraph embeddings or rely on inefficient matching matrices, which limits their effectiveness for large circuits. In this paper, we propose an efficient graph matching approach that utilizes Graph Neural Networks (GNNs) to predict regions of high probability for containing the target circuit. Specifically, we construct various negative samples to enable GNNs to accurately learn the presence of target circuits and develop an approach to directly extracting subgraph embeddings from the entire circuit, which captures global subgraph information and addresses the inefficiency of applying GNNs to all candidate subgraphs. Extensive experiments demonstrate that our approach significantly outperforms existing methods in terms of time efficiency and target region prediction, offering a scalable and effective solution for subgraph matching in large-scale circuits.
[AI-130] BikeVAE-GNN: A Variational Autoencoder-Augmented Hybrid Graph Neural Network for Sparse Bicycle Volume Estimation ITSC2025
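Quick read: This paper addresses accurate link-level estimation of Average Daily Bicycle (ADB) counts in urban bicycling networks where count data are extremely sparse. The core challenge is that most road segments have no labeled counts, which limits traditional machine learning and graph neural network (GNN) models. The key to the solution is BikeVAE-GNN, a dual-task framework that combines a Hybrid-GNN with a Variational Autoencoder (VAE): the Hybrid-GNN combines GCN, GAT, and GraphSAGE to model complex spatial dependencies, while the VAE generates synthetic nodes and edges to enrich the sparse graph, improving both ADB estimation and traffic level classification. With 99% of count data missing, the method achieves an MAE of 30.82 bicycles per day, 99% accuracy, and an F1-score of 0.99, outperforming baseline models.
Link: https://arxiv.org/abs/2507.19517
Authors: Mohit Gupta,Debjit Bhowmick,Ben Beck
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for publication in the Proceedings of the 28th IEEE International Conference on Intelligent Transportation Systems (ITSC 2025). This is the author’s version of the work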
Abstract:Accurate link-level bicycle volume estimation is essential for informed urban and transport planning but it is challenged by extremely sparse count data in urban bicycling networks worldwide. We propose BikeVAE-GNN, a novel dual-task framework augmenting a Hybrid Graph Neural Network (GNN) with Variational Autoencoder (VAE) to estimate Average Daily Bicycle (ADB) counts, addressing sparse bicycle networks. The Hybrid-GNN combines Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and GraphSAGE to effectively model intricate spatial relationships in sparse networks while VAE generates synthetic nodes and edges to enrich the graph structure and enhance the estimation performance. BikeVAE-GNN simultaneously performs - regression for bicycling volume estimation and classification for bicycling traffic level categorization. We demonstrate the effectiveness of BikeVAE-GNN using OpenStreetMap data and publicly available bicycle count data within the City of Melbourne - where only 141 of 15,933 road segments have labeled counts (resulting in 99% count data sparsity). Our experiments show that BikeVAE-GNN outperforms machine learning and baseline GNN models, achieving a mean absolute error (MAE) of 30.82 bicycles per day, accuracy of 99% and F1-score of 0.99. Ablation studies further validate the effective role of Hybrid-GNN and VAE components. Our research advances bicycling volume estimation in sparse networks using novel and state-of-the-art approaches, providing insights for sustainable bicycling infrastructures.
[AI-131] Enhancing Spatiotemporal Networks with xLSTM: A Scalar LSTM Approach for Cellular Traffic Forecasting
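Quick read: This paper addresses accurate spatiotemporal traffic forecasting for intelligent resource management in 5G and beyond, where conventional AI approaches struggle to capture the complex spatial and temporal patterns induced by user mobility. The key to the solution is a lightweight, dual-path spatiotemporal network: a Scalar LSTM (sLSTM) for efficient temporal modeling, a three-layer Conv3D module for spatial feature extraction, and a fusion layer that merges both streams into a single representation, improving gradient stability and convergence speed while reducing prediction error. On real-world datasets the method reduces mean absolute error (MAE) by 23% relative to a ConvLSTM baseline and improves generalization by 30%, making it suitable for large-scale next-generation network deployments.
Link: https://arxiv.org/abs/2507.19513
Authors: Khalid Ali,Zineddine Bettouche,Andreas Kassler,Andreas Fischer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: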
Abstract:Accurate spatiotemporal traffic forecasting is vital for intelligent resource management in 5G and beyond. However, conventional AI approaches often fail to capture the intricate spatial and temporal patterns that exist, due to e.g., the mobility of users. We introduce a lightweight, dual-path Spatiotemporal Network that leverages a Scalar LSTM (sLSTM) for efficient temporal modeling and a three-layer Conv3D module for spatial feature extraction. A fusion layer integrates both streams into a cohesive representation, enabling robust forecasting. Our design improves gradient stability and convergence speed while reducing prediction error. Evaluations on real-world datasets show superior forecast performance over ConvLSTM baselines and strong generalization to unseen regions, making it well-suited for large-scale, next-generation network deployments. Experimental evaluation shows a 23% MAE reduction over ConvLSTM, with a 30% improvement in model generalization.
[AI-132] Beyond 9-to-5: A Generative Model for Augmenting Mobility Data of Underrepresented Shift Workers
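Quick read: This paper addresses the systematic underrepresentation of shift workers in urban mobility modeling. Shift workers make up 15-20% of the workforce in industrialized societies, yet traditional transportation surveys and planning often overlook their travel behavior outside standard hours. By comparing GPS trajectory data with survey data, the study shows that shift workers follow bimodal temporal patterns that differ markedly from conventional 9-to-5 schedules. The key to the solution is a transformer-based approach that generates complete, behaviorally valid activity patterns from fragmented GPS trajectories, using period-aware temporal embeddings and a transition-focused loss function to capture the distinctive activity rhythms of shift workers and mitigate bias in conventional transportation datasets. On GPS data from Los Angeles County, the generated data achieves very close distributional alignment (average JSD < 0.02), giving transportation planners a data augmentation tool for more accurate and inclusive planning of 24/7 urban mobility.
Link: https://arxiv.org/abs/2507.19510
Authors: Haoxuan Ma,Xishun Liao,Yifan Liu,Chris Stanford,Jiaqi Ma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: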
Abstract:This paper addresses a critical gap in urban mobility modeling by focusing on shift workers, a population segment comprising 15-20% of the workforce in industrialized societies yet systematically underrepresented in traditional transportation surveys and planning. This underrepresentation is revealed in this study by a comparative analysis of GPS and survey data, highlighting stark differences between the bimodal temporal patterns of shift workers and the conventional 9-to-5 schedules recorded in surveys. To address this bias, we introduce a novel transformer-based approach that leverages fragmented GPS trajectory data to generate complete, behaviorally valid activity patterns for individuals working non-standard hours. Our method employs period-aware temporal embeddings and a transition-focused loss function specifically designed to capture the unique activity rhythms of shift workers and mitigate the inherent biases in conventional transportation datasets. Evaluation shows that the generated data achieves remarkable distributional alignment with GPS data from Los Angeles County (Average JSD < 0.02 for all evaluation metrics). By transforming incomplete GPS traces into complete, representative activity patterns, our approach provides transportation planners with a powerful data augmentation tool to fill critical gaps in understanding the 24/7 mobility needs of urban populations, enabling precise and inclusive transportation planning.
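The JSD figure quoted in the abstract is a Jensen-Shannon divergence between distributions of generated and observed data. A minimal sketch of the standard definition follows; the base-2 logarithm (which bounds the value by 1) and the example histograms are assumptions, since the abstract does not state the log base or the binning used.

```python
import math


def _normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    return [w / total for w in weights]


def _kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence in [0, 1] between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    p, q = _normalize(p), _normalize(q)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl_divergence(p, mixture) + 0.5 * _kl_divergence(q, mixture)


# Hypothetical histograms of activity start times in four time-of-day bins.
observed_counts = [40, 10, 15, 35]
generated_counts = [38, 12, 16, 34]
print(jensen_shannon_divergence(observed_counts, generated_counts))
```

Identical histograms give 0 and histograms with no overlapping bins give 1, so values below 0.02 indicate that the generated and observed distributions are very close under this measure.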
[AI-133] Gaze-Aware AI: Mathematical modeling of epistemic experience of the Marginalized for Human-Computer Interaction AI Systems
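[Quick read]: This paper examines how, under pressure from dominant sociocultural norms, individuals subconsciously adjust their self-expression to fit the mainstream culture, which suppresses the authentic self and in turn affects prosocial behaviors and group harmony. The key to the solution is a quantifiable Gaze Pressure Index (GPI)-Diff Composite Metric that models the tension between different conversational spaces, together with a mathematical equation built on it for training Large Language Models (LLMs), in support of more inclusive Human-Computer Interaction (HCI) design. The approach combines perspectives from postmodern philosophy and psychology with principles of neuroplasticity, and aims to use AI systems to increase psychological spaciousness and foster sincere exchange and social empathy across diverse groups.
Link: https://arxiv.org/abs/2507.19500
Authors: Omkar Suresh Hatti
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: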
Abstract:The proliferation of artificial intelligence provides an opportunity to create psychological spaciousness in society. Spaciousness is defined as the ability to hold diverse interpersonal interactions and forms the basis for vulnerability that leads to authenticity that leads to prosocial behaviors and thus to societal harmony. This paper demonstrates an attempt to quantify the human conditioning to subconsciously modify authentic self-expression to fit the norms of the dominant culture. Gaze is explored across various marginalized and intersectional groups, using concepts from postmodern philosophy and psychology. The effects of gaze are studied through analyzing a few redacted Reddit posts, only to be discussed in discourse and not endorsement. A mathematical formulation for the Gaze Pressure Index (GPI)-Diff Composite Metric is presented to model the analysis of two sets of conversational spaces in relation to one another. The outcome includes an equation to train Large Language Models (LLMs) - the working mechanism of AI products such as Chat-GPT; and an argument for affirming and inclusive HCI, based on the equation, is presented. The argument is supported by a few principles of neuroplasticity, the brain’s lifelong capacity to rewire.
[AI-134] ChatMyopia: An AI Agent for Pre-consultation Education in Primary Eye Care Settings
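[Quick read]: This paper addresses the limited interpretability and difficult multi-task integration of Large Language Models (LLMs) in personalized healthcare communication for myopia, as well as the fact that their real-world effectiveness as patient education tools has not yet been validated. The key to the solution is ChatMyopia, an LLM-driven AI agent that integrates an image classification tool with a retrieval-augmented knowledge base built from literature, expert consensus, and clinical guidelines, so that it can respond accurately to text- and image-based myopia questions. A single-question examination, human evaluation, and a randomized controlled trial (n=70) validate its advantages in accuracy, safety, interpretability, scalability, and patient satisfaction; in particular, it clearly outperforms traditional leaflets in improving patients' disease awareness, the quality of patient-practitioner communication, and the overall education experience.
Link: https://arxiv.org/abs/2507.19498
Authors: Yue Wu,Xiaolan Chen,Weiyi Zhang,Shunming Liu,Wing Man Rita Sum,Xinyuan Wu,Xianwen Shang,Chea-su Kee,Mingguang He,Danli Shi
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 35 pages, 4 figures, 1 table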
Abstract:Large language models (LLMs) show promise for tailored healthcare communication but face challenges in interpretability and multi-task integration particularly for domain-specific needs like myopia, and their real-world effectiveness as patient education tools has yet to be demonstrated. Here, we introduce ChatMyopia, an LLM-based AI agent designed to address text and image-based inquiries related to myopia. To achieve this, ChatMyopia integrates an image classification tool and a retrieval-augmented knowledge base built from literature, expert consensus, and clinical guidelines. Myopic maculopathy grading task, single question examination and human evaluations validated its ability to deliver personalized, accurate, and safe responses to myopia-related inquiries with high scalability and interpretability. In a randomized controlled trial (n=70, NCT06607822), ChatMyopia significantly improved patient satisfaction compared to traditional leaflets, enhancing patient education in accuracy, empathy, disease awareness, and patient-eyecare practitioner communication. These findings highlight ChatMyopia’s potential as a valuable supplement to enhance patient education and improve satisfaction with medical services in primary eye care settings.
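[AI-135] Unlimited Editions: Documenting Human Style in AI Art Generation
[Quick read]: This paper addresses the one-sided view of artistic value in current Generative AI art research: existing work focuses heavily on image detection, authenticity, and automation, while overlooking that artistic value arises from the creative struggle artists undergo as they contend with influence and technical constraints. The key to the solution is to redefine the research focus of HCI (human-computer interaction), moving from the pursuit of visual output quality alone toward automatically documenting and tracing the origins and evolution of artistic style. By capturing stylistic lineage and traces of creative decisions in generated images, the approach preserves the intent and choices distinctive to human artistic practice, enabling the digital transmission and interpretation of what artistic style is.
Link: https://arxiv.org/abs/2507.19497
Authors: Alex Leitch,Celia Chen
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: this http URL 2025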
Abstract:As AI art generation becomes increasingly sophisticated, HCI research has focused primarily on questions of detection, authenticity, and automation. This paper argues that such approaches fundamentally misunderstand how artistic value emerges from the concerns that drive human image production. Through examination of historical precedents, we demonstrate that artistic style is not only visual appearance but the resolution of creative struggle, as artists wrestle with influence and technical constraints to develop unique ways of seeing. Current AI systems flatten these human choices into reproducible patterns without preserving their provenance. We propose that HCI’s role lies not only in perfecting visual output, but in developing means to document the origins and evolution of artistic style as it appears within generated visual traces. This reframing suggests new technical directions for HCI research in generative AI, focused on automatic documentation of stylistic lineage and creative choice rather than simple reproduction of aesthetic effects.
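[AI-136] Simulating Human Behavior with the Psychological-mechanism Agent: Integrating Feeling, Thought, and Action
[Quick read]: This paper addresses the oversimplified emotional modeling and strong task orientation that limit the authenticity of current Generative AI simulations of human behavior. The key to the solution is the Psychological-mechanism Agent (PSYA) framework, based on the Cognitive Triangle (Feeling-Thought-Action), which simulates behavior closer to human psychological mechanisms through three core modules: the Feeling module uses a layered model of affect to capture short-, medium-, and long-term emotional changes; the Thought module builds on the Triple Network Model to support goal-directed and spontaneous thinking; and the Action module integrates emotions, needs, and plans to optimize behavioral output. The framework reproduces human behavior patterns in several classic psychological experiments and improves the authenticity, consistency, and diversity of generated behavior.
Link: https://arxiv.org/abs/2507.19495
Authors: Qing Dong,Pengyuan Liu,Dong Yu,Chen Kang
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: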
Abstract:Generative agents have made significant progress in simulating human behavior, but existing frameworks often simplify emotional modeling and focus primarily on specific tasks, limiting the authenticity of the simulation. Our work proposes the Psychological-mechanism Agent (PSYA) framework, based on the Cognitive Triangle (Feeling-Thought-Action), designed to more accurately simulate human behavior. The PSYA consists of three core modules: the Feeling module (using a layer model of affect to simulate changes in short-term, medium-term, and long-term emotions), the Thought module (based on the Triple Network Model to support goal-directed and spontaneous thinking), and the Action module (optimizing agent behavior through the integration of emotions, needs and plans). To evaluate the framework’s effectiveness, we conducted daily life simulations and extended the evaluation metrics to self-influence, one-influence, and group-influence, selecting five classic psychological experiments for simulation. The results show that the PSYA framework generates more natural, consistent, diverse, and credible behaviors, successfully replicating human experimental outcomes. Our work provides a richer and more accurate emotional and cognitive modeling approach for generative agents and offers an alternative to human participants in psychological experiments.
[AI-137] Confirmation bias: A challenge for scalable oversight
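[Quick read]: This paper asks how to design scalable oversight protocols that let evaluators accurately verify AI models more capable than themselves. The key lies in identifying and mitigating the systematic biases human evaluators may exhibit during oversight, so that oversight mechanisms remain effective and reliable. The study finds that simple oversight protocols show no overall advantage, and that evaluators' confidence in the model's answers tends to grow after consulting external information (such as online search), even when those answers are wrong. In addition, seemingly optimistic results in earlier work may stem from evaluators possessing knowledge the models lacked, an advantage that weakens as model capability grows. The paper therefore stresses systematically testing how robust oversight protocols are to evaluator bias, whether they outperform simple deference to the model, and how they perform as problem difficulty and model capability increase.
Link: https://arxiv.org/abs/2507.19486
Authors: Gabriel Recchia,Chatrik Singh Mangat,Jinu Nyachhyon,Mridul Sharma,Callum Canavan,Dylan Epstein-Gross,Muhammed Abdulbari
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 61 pages, 8 figures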
Abstract:Scalable oversight protocols aim to empower evaluators to accurately verify AI models more capable than themselves. However, human evaluators are subject to biases that can lead to systematic errors. We conduct two studies examining the performance of simple oversight protocols where evaluators know that the model is “correct most of the time, but not all of the time”. We find no overall advantage for the tested protocols, although in Study 1, showing arguments in favor of both answers improves accuracy in cases where the model is incorrect. In Study 2, participants in both groups become more confident in the system’s answers after conducting online research, even when those answers are incorrect. We also reanalyze data from prior work that was more optimistic about simple protocols, finding that human evaluators possessing knowledge absent from models likely contributed to their positive results–an advantage that diminishes as models continue to scale in capability. These findings underscore the importance of testing the degree to which oversight protocols are robust to evaluator biases, whether they outperform simple deference to the model under evaluation, and whether their performance scales with increasing problem difficulty and model capability.
[AI-138] Creativity as a Human Right: Design Considerations for Computational Creativity Systems
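[Quick read]: This paper addresses how to translate the human-rights character of creativity expressed in the Universal Declaration of Human Rights (UDHR) into design principles for Computational Creativity (CC) systems, providing an ethical and functional guiding framework for such systems. The key to the solution is to analyze five UDHR articles, illustrate each with concrete actualizations, and derive specific design considerations for CC systems from them. The paper emphasizes the interactive, "shared intelligence" character of creativity as a fourth-generation human right, and on that basis builds a theoretical foundation and practical path for human-machine co-creation.
Link: https://arxiv.org/abs/2507.19485
Authors: Alayt Issak
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: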
Abstract:We investigate creativity that is underlined in the Universal Declaration of Human Rights (UDHR) to present design considerations for Computational Creativity (CC) systems. We find this declaration to describe creativity in salient aspects and bring to light creativity as a Human Right attributed to the Fourth Generation of such rights. This generation of rights attributes CC systems and the evolving nature of interaction with entities of shared intelligence. Our methodology examines five of thirty articles from the UDHR and demonstrates each article with actualizations concluding with design considerations for each. We contribute our findings to ground the relationship between creativity and CC systems.
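[AI-139] The Architecture of Cognitive Amplification: Enhanced Cognitive Scaffolding as a Resolution to the Comfort-Growth Paradox in Human-AI Cognitive Integration
[Quick read]: This paper addresses the "comfort-growth paradox" that arises when AI systems serve as cognitive extensions: because AI is so user-friendly and low-friction, it may induce cognitive complacency and hinder the development of users' cognitive abilities. The key to the solution is the Enhanced Cognitive Scaffolding framework, which turns AI from an assistive tool into a dynamic mentor along three core dimensions: (1) Progressive Autonomy, gradually reducing AI intervention as user competence grows; (2) Adaptive Personalization, tailoring support strategies to individual learning trajectories; and (3) Cognitive Load Optimization, promoting learning while limiting unnecessary complexity. The framework aims to put cognitive development ahead of convenience, support genuine cognitive amplification in human-AI collaboration, and guard against risks such as dependency, skill atrophy, and bias amplification.
Link: https://arxiv.org/abs/2507.19483
Authors: Giuseppe Riva
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 39 Pages, no figures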
Abstract:AI systems now function as cognitive extensions, evolving from tools to active cognitive collaborators within human-AI integrated systems. While these systems can amplify cognition - enhancing problem-solving, learning, and creativity - they present a fundamental “comfort-growth paradox”: AI’s user-friendly nature may foster intellectual stagnation by minimizing cognitive friction necessary for development. As AI aligns with user preferences and provides frictionless assistance, it risks inducing cognitive complacency rather than promoting growth. We introduce Enhanced Cognitive Scaffolding to resolve this paradox - reconceptualizing AI from convenient assistant to dynamic mentor. Drawing from Vygotskian theories, educational scaffolding principles, and AI ethics, our framework integrates three dimensions: (1) Progressive Autonomy, where AI support gradually fades as user competence increases; (2) Adaptive Personalization, tailoring assistance to individual needs and learning trajectories; and (3) Cognitive Load Optimization, balancing mental effort to maximize learning while minimizing unnecessary complexity. Research across educational, workplace, creative, and healthcare domains supports this approach, demonstrating accelerated skill acquisition, improved self-regulation, and enhanced higher-order thinking. The framework includes safeguards against risks like dependency, skill atrophy, and bias amplification. By prioritizing cognitive development over convenience in human-AI interaction, Enhanced Cognitive Scaffolding offers a pathway toward genuinely amplified cognition while safeguarding autonomous thought and continuous learning.
[AI-140] Multivariate Conformal Prediction via Conformalized Gaussian Scoring
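[Quick read]: This paper addresses the difficulty of achieving exact conditional coverage in conformal prediction, namely how to obtain practically realizable approximations to conditional guarantees without relying on strong, untestable regularity assumptions. The core challenge is that existing approaches, such as nonconformity scores based on the empirical cumulative distribution function (CDF), are computationally costly and require expensive sampling, which makes them hard to use in practice. The key to the solution is the observation that, under a Gaussian assumption, the CDF-based score reduces to a Mahalanobis distance, which gives a closed-form expression that can be conformalized directly. This simplification lowers the computational cost substantially and extends the reach of conformal methods, for example to missing output values, refinement of sets as partial information arrives, and set construction on transformations of the output space. Empirical results indicate that the method more closely approximates conditional coverage in multivariate settings.
Link: https://arxiv.org/abs/2507.20941
Authors: Sacha Braun,Eugène Berta,Michael I. Jordan,Francis Bach
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Other Statistics (stat.OT)
Comments: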
Abstract:While achieving exact conditional coverage in conformal prediction is unattainable without making strong, untestable regularity assumptions, the promise of conformal prediction hinges on finding approximations to conditional guarantees that are realizable in practice. A promising direction for obtaining conditional dependence for conformal sets–in particular capturing heteroskedasticity–is through estimating the conditional density $\mathbb{P}_{Y|X}$ and conformalizing its level sets. Previous work in this vein has focused on nonconformity scores based on the empirical cumulative distribution function (CDF). Such scores are, however, computationally costly, typically requiring expensive sampling methods. To avoid the need for sampling, we observe that the CDF-based score reduces to a Mahalanobis distance in the case of Gaussian scores, yielding a closed-form expression that can be directly conformalized. Moreover, the use of a Gaussian-based score opens the door to a number of extensions of the basic conformal method; in particular, we show how to construct conformal sets with missing output values, refine conformal sets as partial information about Y becomes available, and construct conformal sets on transformations of the output space. Finally, empirical results indicate that our approach produces conformal sets that more closely approximate conditional coverage in multivariate settings compared to alternative methods.
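The Mahalanobis-distance score described in this entry can be illustrated with ordinary split conformal prediction. The sketch below is an assumption-laden illustration, not the authors' implementation: it takes the Gaussian predictive mean and covariance as given (in practice both would come from a fitted conditional model and depend on x), and the helper names and synthetic calibration data are hypothetical.

```python
import numpy as np


def mahalanobis_score(y, mean, cov):
    """Nonconformity score: Mahalanobis distance of y from a Gaussian prediction."""
    diff = np.asarray(y, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])


def in_conformal_set(y, mean, cov, qhat):
    """The conformal set is the ellipsoid of points with score <= qhat."""
    return mahalanobis_score(y, mean, cov) <= qhat


# Synthetic calibration data (illustration only): one fixed Gaussian prediction.
rng = np.random.default_rng(0)
mean = np.zeros(2)
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
calibration = rng.multivariate_normal(mean, cov, size=200)
scores = [mahalanobis_score(y, mean, cov) for y in calibration]
qhat = conformal_quantile(scores, alpha=0.1)
print(qhat, in_conformal_set(np.array([0.5, -0.5]), mean, cov, qhat))
```

Because the score has a closed form, building the set needs no sampling: the set is an ellipsoid centered at the predicted mean whose radius is the calibrated threshold.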
[AI-141] Aligning Large Language Model Agents with Rational and Moral Preferences: A Supervised Fine-Tuning Approach
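[Quick read]: This paper addresses the way Large Language Model (LLM) agents deviate from human economic decision-making in strategic interactions, in particular their limited sensitivity to incentives and their tendency toward excessive cooperation in moral and economic settings. The key to the solution is a supervised fine-tuning pipeline based on synthetic data derived from economic reasoning, which aligns LLM agent behavior with two structured preference models: homo economicus, whose utility depends only on individual payoffs, and homo moralis, which additionally incorporates Kantian universalizability. Experiments show that even small datasets are enough for fine-tuning to shift LLM agent behavior toward the strategy of the corresponding economic agent, and the fine-tuned agents are further examined in two applications: moral dilemmas involving autonomous vehicles and algorithmic pricing in competitive markets.
Link: https://arxiv.org/abs/2507.20796
Authors: Wei Lu,Daniel L. Chen,Christian B. Hansen
Affiliations: Unknown
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: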
Abstract:Understanding how large language model (LLM) agents behave in strategic interactions is essential as these systems increasingly participate autonomously in economically and morally consequential decisions. We evaluate LLM preferences using canonical economic games, finding substantial deviations from human behavior. Models like GPT-4o show excessive cooperation and limited incentive sensitivity, while reasoning models, such as o3-mini, align more consistently with payoff-maximizing strategies. We propose a supervised fine-tuning pipeline that uses synthetic datasets derived from economic reasoning to align LLM agents with economic preferences, focusing on two stylized preference structures. In the first, utility depends only on individual payoffs (homo economicus), while utility also depends on a notion of Kantian universalizability in the second preference structure (homo moralis). We find that fine-tuning based on small datasets shifts LLM agent behavior toward the corresponding economic agent. We further assess the fine-tuned agents’ behavior in two applications: Moral dilemmas involving autonomous vehicles and algorithmic pricing in competitive markets. These examples illustrate how different normative objectives embedded via realizations from structured preference structures can influence market and moral outcomes. This work contributes a replicable, cost-efficient, and economically grounded pipeline to align AI preferences using moral-economic principles.
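The two preference structures named in this entry can be made concrete with a small worked example. The sketch below assumes the convex-combination form of homo moralis utility common in the economics literature, u(x, y) = (1 - kappa) * pi(x, y) + kappa * pi(x, x), where kappa = 0 recovers homo economicus; whether the paper uses exactly this parameterization is an assumption, and the Prisoner's Dilemma payoffs are hypothetical.

```python
# Hypothetical Prisoner's Dilemma: PAYOFF[(mine, theirs)] is my material payoff.
PAYOFF = {
    ("C", "C"): 3.0,
    ("C", "D"): 0.0,
    ("D", "C"): 5.0,
    ("D", "D"): 1.0,
}


def utility(mine, theirs, kappa):
    """Weighted mix of my actual payoff and my payoff if everyone acted as I do.

    kappa = 0 is pure payoff maximization (homo economicus);
    kappa = 1 evaluates an action only by its universalized payoff.
    """
    material = PAYOFF[(mine, theirs)]
    universalized = PAYOFF[(mine, mine)]
    return (1.0 - kappa) * material + kappa * universalized


def best_response(theirs, kappa):
    """Action that maximizes utility against a fixed opponent action."""
    return max(("C", "D"), key=lambda action: utility(action, theirs, kappa))


for kappa in (0.0, 0.8, 1.0):
    print(kappa, best_response("C", kappa), best_response("D", kappa))
```

In this toy game a payoff maximizer always defects, while a sufficiently high weight on the universalized payoff makes cooperation the best response, which is the kind of behavioral shift the fine-tuning targets.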
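[AI-142] MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection
[Quick read]: This paper addresses the lack of real anomalous sound data when evaluating the performance of unsupervised anomalous sound detection (UASD) systems across different machine types. Conventional keyword-based data augmentation relies on manually defined labels, tends to produce unrealistic audio, and scales poorly; advanced audio generative models such as MIMII-Gen typically need anomalous training samples and are less effective when diverse anomalous examples are unavailable. The key to the solution is to use Large Language Models (LLMs) to interpret textual fault descriptions and automatically select suitable audio transformation functions, converting normal machine sounds into diverse and plausible anomalous sounds. This enables relative performance evaluation, across machine types, of UASD systems trained only on normal sounds.
Link: https://arxiv.org/abs/2507.20666
Authors: Harsh Purohit,Tomoya Nishida,Kota Dohi,Takashi Endo,Yohei Kawaguchi
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Comments: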
Abstract:This paper proposes a method for generating machine-type-specific anomalies to evaluate the relative performance of unsupervised anomalous sound detection (UASD) systems across different machine types, even in the absence of real anomaly sound data. Conventional keyword-based data augmentation methods often produce unrealistic sounds due to their reliance on manually defined labels, limiting scalability as machine types and anomaly patterns diversify. Advanced audio generative models, such as MIMII-Gen, show promise but typically depend on anomalous training data, making them less effective when diverse anomalous examples are unavailable. To address these limitations, we propose a novel synthesis approach leveraging large language models (LLMs) to interpret textual descriptions of faults and automatically select audio transformation functions, converting normal machine sounds into diverse and plausible anomalous sounds. We validate this approach by evaluating a UASD system trained only on normal sounds from five machine types, using both real and synthetic anomaly data. Experimental results reveal consistent trends in relative detection difficulty across machine types between synthetic and real anomalies. This finding supports our hypothesis and highlights the effectiveness of the proposed LLM-based synthesis approach for relative evaluation of UASD systems.
[AI-143] Implicit Spatiotemporal Bandwidth Enhancement Filter by Sine-activated Deep Learning Model for Fast 3D Photoacoustic Tomography
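[Quick read]: This paper addresses the loss of image quality in 3D photoacoustic tomography (3D-PAT) with high-frequency hemispherical ultrasound transducers caused by a limited number of channels and limited bandwidth. The key to the solution is to introduce sine activation into a deep learning (DL) model to restore the high-frequency components of the broadband photoacoustic radio-frequency (PARF) signal. A simplified training strategy based on simulated random spherical absorbers encourages the model, under data scarcity, to learn bandwidth characteristics rather than memorize the training set, achieving both higher effective sensor density and recovery of spatiotemporal bandwidth.
Link: https://arxiv.org/abs/2507.20575
Authors: I Gede Eka Sulistyawan,Takuro Ishii,Riku Suzuki,Yoshifumi Saijo
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 13 figures. This work has been submitted to the IEEE for possible publication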
Abstract:3D photoacoustic tomography (3D-PAT) using high-frequency hemispherical transducers offers near-omnidirectional reception and enhanced sensitivity to the finer structural details encoded in the high-frequency components of the broadband photoacoustic (PA) signal. However, practical constraints such as a limited number of channels with a band-limited sampling rate often result in sparse and band-limited sensors that degrade image quality. To address this, we revisit the 2D deep learning (DL) approach applied directly to sensor-wise PA radio-frequency (PARF) data. Specifically, we introduce sine activation into the DL model to restore the broadband nature of PARF signals given the observed band-limited and high-frequency PARF data. Given the scarcity of 3D training data, we employ simplified training strategies by simulating random spherical absorbers. This combination of sine-activated model and randomized training is designed to emphasize bandwidth learning over dataset memorization. Our model was evaluated on a leaf skeleton phantom, a micro-CT-verified 3D spiral phantom and in-vivo human palm vasculature. The results showed that the proposed training mechanism on the sine-activated model generalized well across the different tests by effectively increasing the sensor density and recovering the spatiotemporal bandwidth. Qualitatively, the sine-activated model uniquely enhanced high-frequency content that produces clearer vascular structure with fewer artefacts. Quantitatively, the sine-activated model exhibits full bandwidth at -12 dB spectrum and significantly higher contrast-to-noise ratio with minimal loss of structural similarity index. Lastly, we optimized our approach to enable fast enhanced 3D-PAT at 2 volumes-per-second for better practical imaging of free-moving targets.
[AI-144] A Multi-Stage Hybrid CNN-Transformer Network for Automated Pediatric Lung Sound Classification
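[Quick read]: This paper addresses the difficulty of accurately diagnosing pediatric respiratory diseases from respiratory sounds in children (particularly those aged 6 years), where developmental changes in the lungs alter the acoustic properties of the sounds. The key to the solution is a multi-stage hybrid CNN-Transformer framework that combines features extracted by a Convolutional Neural Network (CNN) with an attention-based Transformer architecture to classify scalogram images from both full recordings and individual breath events. A class-wise focal loss mitigates data imbalance, improving performance on binary event, multiclass event, and recording-level classification.
Link: https://arxiv.org/abs/2507.20408
Authors: Samiul Based Shuvo,Taufiq Hasan
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: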
Abstract:Automated analysis of lung sound auscultation is essential for monitoring respiratory health, especially in regions facing a shortage of skilled healthcare workers. While respiratory sound classification has been widely studied in adults, its application in pediatric populations, particularly in children aged 6 years, remains an underexplored area. The developmental changes in pediatric lungs considerably alter the acoustic properties of respiratory sounds, necessitating specialized classification approaches tailored to this age group. To address this, we propose a multistage hybrid CNN-Transformer framework that combines CNN-extracted features with an attention-based architecture to classify pediatric respiratory diseases using scalogram images from both full recordings and individual breath events. Our model achieved an overall score of 0.9039 in binary event classification and 0.8448 in multiclass event classification by employing class-wise focal loss to address data imbalance. At the recording level, the model attained scores of 0.720 for ternary and 0.571 for multiclass classification. These scores outperform the previous best models by 3.81% and 5.94%, respectively. This approach offers a promising solution for scalable pediatric respiratory disease diagnosis, especially in resource-limited settings.
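The class-wise focal loss mentioned in this entry can be sketched as follows. This is a minimal NumPy illustration, assuming the common form FL = -alpha_c * (1 - p_t)^gamma * log(p_t) with one weight alpha_c per class; the paper's exact weighting scheme and gamma are not given here, so the weights and logits below are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def classwise_focal_loss(logits, targets, class_weights, gamma=2.0):
    """Mean focal loss with a separate weight per class.

    logits: (n_samples, n_classes); targets: integer class indices;
    class_weights: one weight per class (larger for rarer classes).
    """
    probs = softmax(np.asarray(logits, dtype=float))
    targets = np.asarray(targets)
    p_t = probs[np.arange(len(targets)), targets]  # probability of the true class
    alpha_t = np.asarray(class_weights, dtype=float)[targets]
    # (1 - p_t)^gamma down-weights examples the model already classifies well.
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    return float(loss.mean())


# Hypothetical three-class example with the rare class (index 2) up-weighted.
logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5], [0.3, 2.2, 0.4]]
targets = [0, 2, 1]
print(classwise_focal_loss(logits, targets, class_weights=[0.5, 0.5, 2.0]))
```

Setting gamma to 0 and all weights to 1 recovers plain cross-entropy, which is a convenient sanity check when tuning the loss for an imbalanced dataset.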
[AI-145] A Theory of θ-Expectations
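[Quick read]: This paper addresses the "identifiability impasse": the classical theory of stochastic calculus under ambiguity, built on sub-additivity, is insensitive to non-convex uncertainty structures. The key to the solution is a new θ-BSDE (backward stochastic differential equation) whose driver is defined by pointwise maximization over a possibly non-convex uncertainty set. The framework does not rely on convexity but on a global analytic condition: the existence of a unique, globally Lipschitz maximizer map for the driver. Under this condition, well-posedness is established by a fixed-point argument. For a class of geometrically regular models, the paper further proves that, under non-degeneracy conditions from Malliavin calculus, the maximizer is unique along any solution path, which ensures the model's internal consistency. The resulting construction yields a dynamically consistent expectation operator, linked to fully nonlinear PDEs through a Feynman-Kac formula.
Link: https://arxiv.org/abs/2507.20353
Authors: Qian Qi
Affiliations: Unknown
Subjects: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: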
Abstract:The canonical theory of stochastic calculus under ambiguity, founded on sub-additivity, is insensitive to non-convex uncertainty structures, leading to an identifiability impasse. This paper develops a mathematical framework for an identifiable calculus sensitive to non-convex geometry. We introduce the $\theta$-BSDE, a class of backward stochastic differential equations where the driver is determined by a pointwise maximization over a primitive, possibly non-convex, uncertainty set. The system’s tractability is predicated not on convexity, but on a global analytic hypothesis: the existence of a unique and globally Lipschitz maximizer map for the driver function. Under this hypothesis, which carves out a tractable class of models, we establish well-posedness via a fixed-point argument. For a distinct, geometrically regular class of models, we prove a result of independent interest: under non-degeneracy conditions from Malliavin calculus, the maximizer is unique along any solution path, ensuring the model’s internal consistency. We clarify the fundamental logical gap between this pathwise property and the global regularity required by our existence proof. The resulting valuation operator defines a dynamically consistent expectation, and we establish its connection to fully nonlinear PDEs via a Feynman-Kac formula.
[AI-146] NeuroCLIP: A Multimodal Contrastive Learning Method for rTMS-treated Methamphetamine Addiction Analysis
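[Quick read]: This paper addresses the uncertainty introduced by relying on subjective self-report scales to assess methamphetamine addiction and to evaluate the efficacy of repetitive transcranial magnetic stimulation (rTMS), and also the limitations of single neuroimaging modalities, such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS), in feature extraction and biomarker reliability. The key to the solution is NeuroCLIP, a new deep learning framework that fuses simultaneously recorded EEG and fNIRS data through a progressive learning strategy to build a multimodal neural biomarker with stronger discriminative power and trustworthiness. The method improves discrimination between methamphetamine-dependent individuals and healthy controls, enables objective, brain-activity-based evaluation of rTMS treatment effects, and yields a biomarker that correlates strongly with psychometrically validated craving scores, showing better robustness than single-modality approaches and potential for clinical use.
Link: https://arxiv.org/abs/2507.20189
Authors: Chengkai Wang,Di Wu,Yunsheng Liao,Wenyao Zheng,Ziyi Zeng,Xurong Gao,Hemmings Wu,Zhoule Zhu,Jie Yang,Lihua Zhong,Weiwei Cheng,Yun-Hsuan Chen,Mohamad Sawan
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: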
Abstract:Methamphetamine dependence poses a significant global health challenge, yet its assessment and the evaluation of treatments like repetitive transcranial magnetic stimulation (rTMS) frequently depend on subjective self-reports, which may introduce uncertainties. While objective neuroimaging modalities such as electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) offer alternatives, their individual limitations and the reliance on conventional, often hand-crafted, feature extraction can compromise the reliability of derived biomarkers. To overcome these limitations, we propose NeuroCLIP, a novel deep learning framework integrating simultaneously recorded EEG and fNIRS data through a progressive learning strategy. This approach offers a robust and trustworthy biomarker for methamphetamine addiction. Validation experiments show that NeuroCLIP significantly improves discriminative capabilities among the methamphetamine-dependent individuals and healthy controls compared to models using either EEG or fNIRS alone. Furthermore, the proposed framework facilitates objective, brain-based evaluation of rTMS treatment efficacy, demonstrating measurable shifts in neural patterns towards healthy control profiles after treatment. Critically, we establish the trustworthiness of the multimodal data-driven biomarker by showing its strong correlation with psychometrically validated craving scores. These findings suggest that a biomarker derived from EEG-fNIRS data via NeuroCLIP offers enhanced robustness and reliability over single-modality approaches, providing a valuable tool for addiction neuroscience research and potentially improving clinical assessments.
[AI-147] Iterative Pretraining Framework for Interatomic Potentials
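[Quick read]: This paper addresses three problems with Machine Learning Interatomic Potentials (MLIPs) in practice: heavy reliance on large-scale labeled data, a mismatch between pretraining objectives and downstream tasks, and the insufficient accuracy of general-purpose foundation models on specific systems. The key to the solution is Iterative Pretraining for Interatomic Potentials (IPIP), a framework that introduces a forgetting mechanism to keep iterative training from converging to suboptimal local minima and uses lightweight architectures for efficient, accurate prediction. On the Mo-S-O system, it achieves over 80% lower prediction error and up to 4x speedup compared with general-purpose force fields.
Link: https://arxiv.org/abs/2507.20118
Authors: Taoyong Cui,Zhongyao Wang,Dongzhan Zhou,Yuqiang Li,Lei Bai,Wanli Ouyang,Mao Su,Shufei Zhang
Affiliations: Unknown
Subjects: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
Comments: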
Abstract:Machine learning interatomic potentials (MLIPs) enable efficient molecular dynamics (MD) simulations with ab initio accuracy and have been applied across various domains in physical science. However, their performance often relies on large-scale labeled training data. While existing pretraining strategies can improve model performance, they often suffer from a mismatch between the objectives of pretraining and downstream tasks or rely on extensive labeled datasets and increasingly complex architectures to achieve broad generalization. To address these challenges, we propose Iterative Pretraining for Interatomic Potentials (IPIP), a framework designed to iteratively improve the predictive performance of MLIP models. IPIP incorporates a forgetting mechanism to prevent iterative training from converging to suboptimal local minima. Unlike general-purpose foundation models, which frequently underperform on specialized tasks due to a trade-off between generality and system-specific accuracy, IPIP achieves higher accuracy and efficiency using lightweight architectures. Compared to general-purpose force fields, this approach achieves over 80% reduction in prediction error and up to 4x speedup in the challenging Mo-S-O system, enabling fast and accurate simulations.
[AI-148] NIRS: An Ontology for Non-Invasive Respiratory Support in Acute Care
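[Quick read]: This paper addresses the difficulty of representing and reasoning over clinical knowledge about Non-Invasive Respiratory Support (NIRS) in acute care settings in a structured way, with the goal of making clinical decision support more accurate and consistent. The key to the solution is a NIRS ontology built with the Web Ontology Language (OWL), extended with Semantic Web Rule Language (SWRL) rules to enable rule-based clinical reasoning beyond hierarchical structures. SPARQL queries, together with data from the eICU Collaborative Research Database, are used to validate hypothetical patient scenarios, demonstrating automated reasoning and recommendations about the relationship between treatment strategies (such as high-flow nasal cannula, HFNC) and clinical outcomes (such as avoiding endotracheal intubation).
Link: https://arxiv.org/abs/2507.19992
Authors: Md Fantacher Islam,Jarrod Mosier,Vignesh Subbian
Affiliations: Unknown
Subjects: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Comments: Submitted to the Journal of the American Medical Informatics Association (JAMIA)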
Abstract:Objective: Develop a Non-Invasive Respiratory Support (NIRS) ontology to support knowledge representation in acute care settings. Materials and Methods: We developed the NIRS ontology using Web Ontology Language (OWL) semantics and Protege to organize clinical concepts and relationships. To enable rule-based clinical reasoning beyond hierarchical structures, we added Semantic Web Rule Language (SWRL) rules. We evaluated logical reasoning by adding 17 hypothetical patient clinical scenarios. We used SPARQL queries and data from the Electronic Intensive Care Unit (eICU) Collaborative Research Database to retrieve and test targeted inferences. Results: The ontology has 132 classes, 12 object properties, and 17 data properties across 882 axioms that establish concept relationships. To standardize clinical concepts, we added 350 annotations, including descriptive definitions based on controlled vocabularies. SPARQL queries successfully validated all test cases (rules) by retrieving appropriate patient outcomes, for instance, a patient treated with HFNC (high-flow nasal cannula) for 2 hours due to acute respiratory failure may avoid endotracheal intubation. Discussion: The NIRS ontology formally represents domain-specific concepts, including ventilation modalities, patient characteristics, therapy parameters, and outcomes. SPARQL query evaluations on clinical scenarios confirmed the ability of the ontology to support rule-based reasoning and therapy recommendations, providing a foundation for consistent documentation practices, integration into clinical data models, and advanced analysis of NIRS outcomes. Conclusion: We unified NIRS concepts into an ontological framework and demonstrated its applicability through the evaluation of hypothetical patient scenarios and alignment with standardized vocabularies.
Cite as: arXiv:2507.19992 [q-bio.OT], https://doi.org/10.48550/arXiv.2507.19992. Submission history: [v1] Sat, 26 Jul 2025 16:05:20 UTC (524 KB), from Md Fantacher Islam.
[AI-149] Deep Learning Based Joint Channel Estimation and Positioning for Sparse XL-MIMO OFDM Systems
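[Quick read]: This paper addresses joint channel estimation and positioning in near-field sparse extra-large multiple-input multiple-output (XL-MIMO) orthogonal frequency division multiplexing (OFDM) systems, where the core challenge is improving the accuracy and cooperative performance of both tasks under resource constraints. The key to the solution is a deep learning-based two-stage framework. First, the CP-Mamba network predicts the user's coordinates, which are then used in channel estimation, yielding cooperative gains between positioning and channel estimation. Second, a U-shaped Mamba architecture (CP-Mamba) combines the Mamba model's ability to capture long-range temporal dependencies with the strength of U-shaped convolutional networks in extracting local spatial features, improving both channel estimation and positioning accuracy. Numerical simulations show that the method outperforms existing baselines, and that sparse arrays (SA) outperform conventional compact arrays on both tasks.
Link: https://arxiv.org/abs/2507.19936
Authors: Zhongnian Li,Chao Zheng,Jian Xiao,Ji Wang,Gongpu Wang,Ming Zeng,Octavia A. Dobre
Affiliations: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 8 figures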
Abstract:This paper investigates joint channel estimation and positioning in near-field sparse extra-large multiple-input multiple-output (XL-MIMO) orthogonal frequency division multiplexing (OFDM) systems. To achieve cooperative gains between channel estimation and positioning, we propose a deep learning-based two-stage framework comprising positioning and channel estimation. In the positioning stage, the user’s coordinates are predicted and utilized in the channel estimation stage, thereby enhancing the accuracy of channel estimation. Within this framework, we propose a U-shaped Mamba architecture for channel estimation and positioning, termed as CP-Mamba. This network integrates the strengths of the Mamba model with the structural advantages of U-shaped convolutional networks, enabling effective capture of local spatial features and long-range temporal dependencies of the channel. Numerical simulation results demonstrate that the proposed two-stage approach with CP-Mamba architecture outperforms existing baseline methods. Moreover, sparse arrays (SA) exhibit significantly superior performance in both channel estimation and positioning accuracy compared to conventional compact arrays.
[AI-150] DynamiX: Large-Scale Dynamic Social Network Simulator
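[Quick read]: This paper addresses the neglect of the dynamic evolution of social relationships in existing social network simulation studies, in particular how to accurately model user role switching and the dynamic adjustment of relationships among different user types as agent populations scale up. The key to the solution is DynamiX, a new large-scale social network simulator with two core innovations: (1) a dynamic hierarchy module that selects core agents with key characteristics at each timestep, accurately modeling the adaptive switching of user roles seen in the real world; (2) distinct social relationship modeling strategies for different user types. For opinion leaders, an information-stream-based link prediction method generates homogeneous connections and autonomous behavior decisions; for ordinary users, an inequality-oriented behavior decision-making module captures unequal social interactions and the relationship adjustment patterns driven by multi-dimensional factors.
Link: https://arxiv.org/abs/2507.19929
Authors: Yanhui Sun,Wu Liu,Wentao Wang,Hantao Yao,Jiebo Luo,Yongdong Zhang
Affiliations: unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
Comments: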
Abstract:Understanding the intrinsic mechanisms of social platforms is an urgent demand to maintain social stability. The rise of large language models provides significant potential for social network simulations to capture attitude dynamics and reproduce collective behaviors. However, existing studies mainly focus on scaling up agent populations, neglecting the dynamic evolution of social relationships. To address this gap, we introduce DynamiX, a novel large-scale social network simulator dedicated to dynamic social network modeling. DynamiX uses a dynamic hierarchy module for selecting core agents with key characteristics at each timestep, enabling accurate alignment of real-world adaptive switching of user roles. Furthermore, we design distinct dynamic social relationship modeling strategies for different user types. For opinion leaders, we propose an information-stream-based link prediction method recommending potential users with similar stances, simulating homogeneous connections, and autonomous behavior decisions. For ordinary users, we construct an inequality-oriented behavior decision-making module, effectively addressing unequal social interactions and capturing the patterns of relationship adjustments driven by multi-dimensional factors. Experimental results demonstrate that DynamiX exhibits marked improvements in attitude evolution simulation and collective behavior analysis compared to static networks. Besides, DynamiX opens a new theoretical perspective on follower growth prediction, providing empirical evidence for opinion leaders cultivation.
[AI-151] Ultracoarse Equilibria and Ordinal-Folding Dynamics in Operator-Algebraic Models of Infinite Multi-Agent Games
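[Quick read]: This paper addresses strategy evolution and equilibrium existence in infinite games with a continuum of agents, focusing on how regret-based learning dynamics converge to a quantal response equilibrium. The key to the solution is an operator-algebraic framework that assigns each game a von Neumann algebra, so that the evolution of strategy distributions is described by a noncommutative continuity equation; a reflective regret operator within the algebra drives the strategy flow, and its fixed point characterizes equilibrium. The framework combines functional analysis, coarse geometry, and game theory. It introduces the ordinal folding index, a computable ordinal-valued metric that quantifies the self-referential depth of the dynamics, and proves that this index bounds the transfinite time needed for convergence, giving a rigorous mathematical characterization of stability in large-scale multi-agent systems.
Link: https://arxiv.org/abs/2507.19694
Authors: Faruk Alpay,Hamdi Alakkad,Bugra Kilictas,Taylan Alpay
Affiliations: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 15 pages, 2 figures; companion implementation available at this https URL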
Abstract:We develop an operator algebraic framework for infinite games with a continuum of agents and prove that regret based learning dynamics governed by a noncommutative continuity equation converge to a unique quantal response equilibrium under mild regularity assumptions. The framework unifies functional analysis, coarse geometry and game theory by assigning to every game a von Neumann algebra that represents collective strategy evolution. A reflective regret operator within this algebra drives the flow of strategy distributions and its fixed point characterises equilibrium. We introduce the ordinal folding index, a computable ordinal valued metric that measures the self referential depth of the dynamics, and show that it bounds the transfinite time needed for convergence, collapsing to zero on coarsely amenable networks. The theory yields new invariant subalgebra rigidity results, establishes existence and uniqueness of envy free and maximin share allocations in continuum economies, and links analytic properties of regret flows with empirical stability phenomena in large language models. These contributions supply a rigorous mathematical foundation for large scale multi agent systems and demonstrate the utility of ordinal metrics for equilibrium selection.
[AI-152] Quantum Reinforcement Learning by Adaptive Non-local Observables
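[Quick read]: This paper addresses the limited expressive power of variational quantum circuits (VQCs) in quantum reinforcement learning (QRL) that results from their reliance on local measurements. The key to the solution is an adaptive non-local observable (ANO) paradigm that jointly optimizes circuit parameters and multi-qubit measurements, enlarging the function approximation space without increasing circuit depth and thereby improving quantum agents based on the Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) algorithms.
Link: https://arxiv.org/abs/2507.19629
Authors: Hsin-Yi Lin,Samuel Yen-Chi Chen,Huan-Hsin Tseng,Shinjae Yoo
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at IEEE Quantum Week 2025 (QCE 2025)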
Abstract:Hybrid quantum-classical frameworks leverage quantum computing for machine learning; however, variational quantum circuits (VQCs) are limited by the need for local measurements. We introduce an adaptive non-local observable (ANO) paradigm within VQCs for quantum reinforcement learning (QRL), jointly optimizing circuit parameters and multi-qubit measurements. The ANO-VQC architecture serves as the function approximator in Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) algorithms. On multiple benchmark tasks, ANO-VQC agents outperform baseline VQCs. Ablation studies reveal that adaptive measurements enhance the function space without increasing circuit depth. Our results demonstrate that adaptive multi-qubit observables can enable practical quantum advantages in reinforcement learning.
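[AI-153] PennyCoder: Efficient Domain-Specific LLMs for PennyLane-Based Quantum Code Generation
[Quick read]: This paper addresses the privacy risks, high latency, and high usage costs that arise because current large language model (LLM) based quantum programming assistants rely heavily on remote APIs. The key to the solution is PennyCoder, a lightweight quantum code generation framework designed for local and embedded deployment. It fine-tunes LLaMA 3.1-8B with parameter-efficient Low-Rank Adaptation (LoRA) combined with domain-specific instruction tuning, so that the model can understand and generate quantum programs that follow the syntax and computational logic of PennyLane, including quantum machine learning and quantum reinforcement learning tasks, enabling high-quality on-device quantum code generation without external APIs.
Link: https://arxiv.org/abs/2507.19562
Authors: Abdul Basit,Minghao Shao,Muhammad Haider Asif,Nouhaila Innan,Muhammad Kashif,Alberto Marchisio,Muhammad Shafique
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures, 3 tables, paper accepted to QCE 2025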
Abstract:The growing demand for robust quantum programming frameworks has unveiled a critical limitation: current large language model (LLM) based quantum code assistants heavily rely on remote APIs, introducing challenges related to privacy, latency, and excessive usage costs. Addressing this gap, we propose PennyCoder, a novel lightweight framework for quantum code generation, explicitly designed for local and embedded deployment to enable on-device quantum programming assistance without external API dependence. PennyCoder leverages a fine-tuned version of the LLaMA 3.1-8B model, adapted through parameter-efficient Low-Rank Adaptation (LoRA) techniques combined with domain-specific instruction tuning optimized for the specialized syntax and computational logic of quantum programming in PennyLane, including tasks in quantum machine learning and quantum reinforcement learning. Unlike prior work focused on cloud-based quantum code generation, our approach emphasizes device-native operability while maintaining high model efficacy. We rigorously evaluated PennyCoder over a comprehensive quantum programming dataset, achieving 44.3% accuracy with our fine-tuned model (compared to 33.7% for the base LLaMA 3.1-8B and 40.1% for the RAG-augmented baseline), demonstrating a significant improvement in functional correctness.
Machine Learning
[LG-0] Flow Matching Policy Gradients
Link: https://arxiv.org/abs/2507.21053
Authors: David McAllister,Songwei Ge,Brent Yi,Chung Min Kim,Ethan Weber,Hongsuk Choi,Haiwen Feng,Angjoo Kanazawa
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: See our blog post: this https URL
Abstract:Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.
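The PPO-clip style objective mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only says the ratio is "computed from the conditional flow matching loss", so the specific form used here (the exponential of the old loss minus the new loss) is an assumption, and the function name `clipped_surrogate` and its parameters are hypothetical.

```python
import math

def clipped_surrogate(advantage, loss_old, loss_new, clip_eps=0.2):
    """PPO-clip style surrogate for one sample.

    ASSUMPTION: the ratio is exp(loss_old - loss_new), where both losses are
    conditional flow matching losses for the same action under the old and
    the current policy. The abstract does not spell out this form.
    """
    ratio = math.exp(loss_old - loss_new)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Standard PPO-clip: take the more pessimistic of the two terms.
    return min(ratio * advantage, clipped_ratio * advantage)

# Unchanged loss gives ratio 1, so the surrogate equals the advantage.
unchanged = clipped_surrogate(advantage=0.5, loss_old=1.0, loss_new=1.0)
# A large loss decrease with positive advantage is capped at 1 + clip_eps.
capped = clipped_surrogate(advantage=1.0, loss_old=1.0, loss_new=0.0)
```

The point of the sketch is that no likelihood appears anywhere: only loss values are needed, which is what lets this kind of objective sidestep exact likelihood computation.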
[LG-1] Transformers as Unrolled Inference in Probabilistic Laplacian Eigenmaps: An Interpretation and Potential Improvements
Link: https://arxiv.org/abs/2507.21040
Authors: Aditya Ravuri,Neil D. Lawrence
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Initial version
Abstract:We propose a probabilistic interpretation of transformers as unrolled inference steps assuming a probabilistic Laplacian Eigenmaps model from the ProbDR framework. Our derivation shows that at initialisation, transformers perform “linear” dimensionality reduction. We also show that within the transformer block, a graph Laplacian term arises from our arguments, rather than an attention matrix (which we interpret as an adjacency matrix). We demonstrate that simply subtracting the identity from the attention matrix (and thereby taking a graph diffusion step) improves validation performance on a language model and a simple vision transformer.
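As a rough illustration of the modification described in the abstract, the sketch below builds a row-softmax attention matrix A for a toy sequence and compares the usual update A V with the update (A - I) V. The helper names and the toy inputs are assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(queries, keys):
    """Row-stochastic scaled dot-product attention weights."""
    dim = len(queries[0])
    scores = [[sum(q * k for q, k in zip(qv, kv)) / math.sqrt(dim)
               for kv in keys] for qv in queries]
    return [softmax(row) for row in scores]

def subtract_identity(matrix):
    """Return A - I for a square matrix A."""
    return [[a - (1.0 if i == j else 0.0) for j, a in enumerate(row)]
            for i, row in enumerate(matrix)]

def matmul(matrix, values):
    """Plain matrix product of `matrix` (n x n) and `values` (n x d)."""
    return [[sum(matrix[i][j] * values[j][c] for j in range(len(values)))
             for c in range(len(values[0]))] for i in range(len(matrix))]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy token embeddings
A = attention_matrix(tokens, tokens)
standard = matmul(A, tokens)                      # usual attention output
diffusion = matmul(subtract_identity(A), tokens)  # (A - I) V
```

Because each row of A sums to one, each row of A - I sums to zero, as for a (negated) random-walk graph Laplacian, and (A - I) V equals A V minus the input tokens.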
[LG-2] When Brain Foundation Model Meets Cauchy-Schwarz Divergence: A New Framework for Cross-Subject Motor Imagery Decoding
Link: https://arxiv.org/abs/2507.21037
Authors: Jinzhou Wu,Baoping Tang,Qikang Li,Yi Wang,Cheng Li,Shujian Yu
Categories: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Decoding motor imagery (MI) electroencephalogram (EEG) signals, a key non-invasive brain-computer interface (BCI) paradigm for controlling external systems, has been significantly advanced by deep learning. However, MI-EEG decoding remains challenging due to substantial inter-subject variability and limited labeled target data, which necessitate costly calibration for new users. Many existing multi-source domain adaptation (MSDA) methods indiscriminately incorporate all available source domains, disregarding the large inter-subject differences in EEG signals, which leads to negative transfer and excessive computational costs. Moreover, while many approaches focus on feature distribution alignment, they often neglect the explicit dependence between features and decision-level outputs, limiting their ability to preserve discriminative structures. To address these gaps, we propose a novel MSDA framework that leverages a pretrained large Brain Foundation Model (BFM) for dynamic and informed source subject selection, ensuring only relevant sources contribute to adaptation. Furthermore, we employ Cauchy-Schwarz (CS) and Conditional CS (CCS) divergences to jointly perform feature-level and decision-level alignment, enhancing domain invariance while maintaining class discriminability. Extensive evaluations on two benchmark MI-EEG datasets demonstrate that our framework outperforms a broad range of state-of-the-art baselines. Additional experiments with a large source pool validate the scalability and efficiency of BFM-guided selection, which significantly reduces training time without sacrificing performance.
[LG-3] Optimization Performance of Factorization Machine with Annealing under Limited Training Data
Link: https://arxiv.org/abs/2507.21024
Authors: Mayumi Nakano,Yuya Seki,Shuta Kikuchi,Shu Tanaka
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures
Abstract:Black-box (BB) optimization problems aim to identify an input that minimizes the output of a function (the BB function) whose input-output relationship is unknown. Factorization machine with annealing (FMA) is a promising approach to this task, employing a factorization machine (FM) as a surrogate model to iteratively guide the solution search via an Ising machine. Although FMA has demonstrated strong optimization performance across various applications, its performance often stagnates as the number of optimization iterations increases. One contributing factor to this stagnation is the growing number of data points in the dataset used to train FM. It is hypothesized that as more data points are accumulated, the contribution of newly added data points becomes diluted within the entire dataset, thereby reducing their impact on improving the prediction accuracy of FM. To address this issue, we propose a novel method for sequential dataset construction that retains at most a specified number of the most recently added data points. This strategy is designed to enhance the influence of newly added data points on the surrogate model. Numerical experiments demonstrate that the proposed FMA achieves lower-cost solutions with fewer BB function evaluations compared to the conventional FMA.
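The dataset construction described in the abstract, which keeps at most a fixed number of the most recently added points, can be sketched with a bounded queue. The class name `RecentDataset`, the capacity of 3, and the toy inputs are hypothetical; the surrogate training and the Ising machine step are left out.

```python
from collections import deque

class RecentDataset:
    """Training set that retains only the most recently added points."""

    def __init__(self, max_points):
        # A deque with maxlen silently drops the oldest entry when full.
        self._points = deque(maxlen=max_points)

    def add(self, x, y):
        """Record one black-box evaluation: input x and observed output y."""
        self._points.append((tuple(x), y))

    def training_data(self):
        """Points the surrogate model would be retrained on, oldest first."""
        return list(self._points)

dataset = RecentDataset(max_points=3)
evaluations = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 1)]  # toy binary inputs
for step, bits in enumerate(evaluations):
    dataset.add(bits, float(step))  # float(step) stands in for the BB output

window = dataset.training_data()
```

In an FMA-style loop, the surrogate would be refit on `training_data()` after every new evaluation, so each new point makes up a fixed share of the training set instead of a shrinking one.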
[LG-4] On Using the Shapley Value for Anomaly Localization: A Statistical Investigation
Link: https://arxiv.org/abs/2507.21023
Authors: Rick S. Blum,Franziska Freytag
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Recent publications have suggested using the Shapley value for anomaly localization for sensor data systems. Using a reasonable mathematical anomaly model for full control, experiments indicate that using a single fixed term in the Shapley value calculation achieves a lower complexity anomaly localization test, with the same probability of error, as a test using the Shapley value for all cases tested. A proof demonstrates these conclusions must be true for all independent observation cases. For dependent observation cases, no proof is available.
[LG-5] Behavior-Specific Filtering for Enhanced Pig Behavior Classification in Precision Livestock Farming
Link: https://arxiv.org/abs/2507.21021
Authors: Zhen Zhang(1),Dong Sam Ha(1),Gota Morota(2,3),Sook Shin(1) ((1) The Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, Virginia, USA, (2) Department of Animal and Poultry Sciences, Virginia Tech, Blacksburg, Virginia, USA, (3) Laboratory of Biometry and Bioinformatics, Department of Agricultural Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan)
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 4 tables, 3 figures
Abstract:This study proposes a behavior-specific filtering method to improve behavior classification accuracy in Precision Livestock Farming. While traditional filtering methods, such as wavelet denoising, achieved an accuracy of 91.58%, they apply uniform processing to all behaviors. In contrast, the proposed behavior-specific filtering method combines Wavelet Denoising with a Low Pass Filter, tailored to active and inactive pig behaviors, and achieved a peak accuracy of 94.73%. These results highlight the effectiveness of behavior-specific filtering in enhancing animal behavior monitoring, supporting better health management and farm efficiency.
[LG-6] Predicting Cognition from fMRI: A Comparative Study of Graph Transformer and Kernel Models Across Task and Rest Conditions
Link: https://arxiv.org/abs/2507.21016
Authors: Jagruti Patel(1),Mikkel Schöttner(1),Thomas A. W. Bolton(1),Patric Hagmann(1) ((1) Department of Radiology, Lausanne University Hospital and University of Lausanne (CHUV-UNIL), Lausanne, Switzerland)
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Preliminary version; a revised version will be uploaded later
Abstract:Predicting cognition from neuroimaging data in healthy individuals offers insights into the neural mechanisms underlying cognitive abilities, with potential applications in precision medicine and early detection of neurological and psychiatric conditions. This study systematically benchmarked classical machine learning (Kernel Ridge Regression (KRR)) and advanced deep learning (DL) models (Graph Neural Networks (GNN) and Transformer-GNN (TGNN)) for cognitive prediction using Resting-state (RS), Working Memory, and Language task fMRI data from the Human Connectome Project Young Adult dataset. Our results, based on R2 scores, Pearson correlation coefficient, and mean absolute error, revealed that task-based fMRI, eliciting neural responses directly tied to cognition, outperformed RS fMRI in predicting cognitive behavior. Among the methods compared, a GNN combining structural connectivity (SC) and functional connectivity (FC) consistently achieved the highest performance across all fMRI modalities; however, its advantage over KRR using FC alone was not statistically significant. The TGNN, designed to model temporal dynamics with SC as a prior, performed competitively with FC-based approaches for task-fMRI but struggled with RS data, where its performance aligned with the lower-performing GNN that directly used fMRI time-series data as node features. These findings emphasize the importance of selecting appropriate model architectures and feature representations to fully leverage the spatial and temporal richness of neuroimaging data. This study highlights the potential of multimodal graph-aware DL models to combine SC and FC for cognitive prediction, as well as the promise of Transformer-based approaches for capturing temporal dynamics. By providing a comprehensive comparison of models, this work serves as a guide for advancing brain-behavior modeling using fMRI, SC and DL. 
[LG-7] Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs
Link: https://arxiv.org/abs/2507.20977
Authors: Maria Camporese,Fabio Massacci
Categories: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Background: Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair. Recent studies show that large language models (LLMs) outperform traditional techniques, extending their success beyond code generation and fault detection. Hypothesis: These gains may be driven by hidden factors – “invisible hands” such as training-data leakage or perfect fault localization – that let an LLM reproduce human-authored fixes for the same code. Objective: We replicate prior AVR studies under controlled conditions by deliberately adding errors to the reported vulnerability location in the prompt. If LLMs merely regurgitate memorized fixes, both small and large localization errors should yield the same number of correct patches, because any offset should divert the model from the original fix. Method: Our pipeline repairs vulnerabilities from the Vul4J and VJTrans benchmarks after shifting the fault location by n lines from the ground truth. A first LLM generates a patch, a second LLM reviews it, and we validate the result with regression and proof-of-vulnerability tests. Finally, we manually audit a sample of patches and estimate the error rate with the Agresti-Coull-Wilson method.
Submission history: [v1] Mon, 28 Jul 2025 16:39:16 UTC (484 KB), from Maria Camporese.
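To make the final step of the method concrete, here is a minimal sketch of an Agresti-Coull confidence interval for an error rate estimated from a manually audited sample. The abstract names an "Agresti-Coull-Wilson method" without further detail, so treating it as the standard Agresti-Coull interval is an assumption, and the counts below are made-up illustration values, not results from the paper.

```python
import math

def agresti_coull_interval(errors, sample_size, z=1.96):
    """Agresti-Coull interval for a binomial proportion.

    Adds z^2 pseudo-observations (half of them counted as errors) before
    applying the normal approximation; z=1.96 gives roughly 95% coverage.
    """
    n_adj = sample_size + z * z
    p_adj = (errors + z * z / 2.0) / n_adj
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# Hypothetical audit: 3 wrongly validated patches in a sample of 50.
low, high = agresti_coull_interval(errors=3, sample_size=50)
```

The adjusted center `p_adj` coincides with the center of the Wilson score interval, which is one reason the two names are often mentioned together.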
[LG-8] PROVCREATOR: Synthesizing Complex Heterogenous Graphs with Node and Edge Attributes
Link: https://arxiv.org/abs/2507.20967
Authors: Tianhao Wang,Simon Klancher,Kunal Mukherjee,Josh Wiedemeier,Feng Chen,Murat Kantarcioglu,Kangkook Jee
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The rise of graph-structured data has driven interest in graph learning and synthetic data generation. While successful in text and image domains, synthetic graph generation remains challenging – especially for real-world graphs with complex, heterogeneous schemas. Existing research has focused mostly on homogeneous structures with simple attributes, limiting their usefulness and relevance for application domains requiring semantic fidelity. In this research, we introduce ProvCreator, a synthetic graph framework designed for complex heterogeneous graphs with high-dimensional node and edge attributes. ProvCreator formulates graph synthesis as a sequence generation task, enabling the use of transformer-based large language models. It features a versatile graph-to-sequence encoder-decoder that 1. losslessly encodes graph structure and attributes, 2. efficiently compresses large graphs for contextual modeling, and 3. supports end-to-end, learnable graph generation. To validate our research, we evaluate ProvCreator on two challenging domains: system provenance graphs in cybersecurity and knowledge graphs from IntelliGraph Benchmark Dataset. In both cases, ProvCreator captures intricate dependencies between structure and semantics, enabling the generation of realistic and privacy-aware synthetic datasets.
[LG-9] PySHRED: A Python package for SHallow REcurrent Decoding for sparse sensing model reduction and scientific discovery
Link: https://arxiv.org/abs/2507.20954
Authors: David Ye,Jan Williams,Mars Gao,Stefano Riva,Matteo Tomasetto,David Zoro,J. Nathan Kutz
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
Comments: 15 pages, 9 figures
Abstract:SHallow REcurrent Decoders (SHRED) provide a deep learning strategy for modeling high-dimensional dynamical systems and/or spatiotemporal data from dynamical system snapshot observations. PySHRED is a Python package that implements SHRED and several of its major extensions, including for robust sensing, reduced order modeling and physics discovery. In this paper, we introduce the version 1.0 release of PySHRED, which includes data preprocessors and a number of cutting-edge SHRED methods specifically designed to handle real-world data that may be noisy, multi-scale, parameterized, prohibitively high-dimensional, and strongly nonlinear. The package is easy to install, thoroughly-documented, supplemented with extensive code examples, and modularly-structured to support future additions. The entire codebase is released under the MIT license and is available at this https URL.
[LG-10] Breaking the Precision Ceiling in Physics-Informed Neural Networks: A Hybrid Fourier-Neural Architecture for Ultra-High Accuracy
Link: https://arxiv.org/abs/2507.20929
Authors: Wei Shan Lee,Chi Kiu Althina Chau,Kei Chon Sio,Kam Ian Leong
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
Comments:
Abstract:Physics-informed neural networks (PINNs) have plateaued at errors of $10^{-3}$ to $10^{-4}$ for fourth-order partial differential equations, creating a perceived precision ceiling that limits their adoption in engineering applications. We break through this barrier with a hybrid Fourier-neural architecture for the Euler-Bernoulli beam equation, achieving unprecedented L2 error of $1.94 \times 10^{-7}$, a 17-fold improvement over standard PINNs and $15$-$500\times$ better than traditional numerical methods. Our approach synergistically combines a truncated Fourier series capturing dominant modal behavior with a deep neural network providing adaptive residual corrections. A systematic harmonic optimization study revealed a counter-intuitive discovery: exactly 10 harmonics yield optimal performance, with accuracy catastrophically degrading from $10^{-7}$ to $10^{-1}$ beyond this threshold. The two-phase optimization strategy (Adam followed by L-BFGS) and adaptive weight balancing enable stable ultra-precision convergence. GPU-accelerated implementation achieves sub-30-minute training despite fourth-order derivative complexity. By addressing 12 critical gaps in existing approaches, from architectural rigidity to optimization landscapes, this work demonstrates that ultra-precision is achievable through proper design, opening new paradigms for scientific computing where machine learning can match or exceed traditional numerical methods.
[LG-11] Zero-Shot Learning with Subsequence Reordering Pretraining for Compound-Protein Interaction
Link: https://arxiv.org/abs/2507.20925
Authors: Hongzhi Zhang,Zhonglie Liu,Kun Meng,Jiameng Chen,Jia Wu,Bo Du,Di Lin,Yan Che,Wenbin Hu
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Given the vastness of chemical space and the ongoing emergence of previously uncharacterized proteins, zero-shot compound-protein interaction (CPI) prediction better reflects the practical challenges and requirements of real-world drug development. Although existing methods perform adequately during certain CPI tasks, they still face the following challenges: (1) Representation learning from local or complete protein sequences often overlooks the complex interdependencies between subsequences, which are essential for predicting spatial structures and binding properties. (2) Dependence on large-scale or scarce multimodal protein datasets demands significant training data and computational resources, limiting scalability and efficiency. To address these challenges, we propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering, explicitly capturing the dependencies between protein subsequences. Furthermore, we apply length-variable protein augmentation to ensure excellent pretraining performance on small training datasets. To evaluate the model’s effectiveness and zero-shot learning ability, we combine it with various baseline methods. The results demonstrate that our approach can improve the baseline model’s performance on the CPI task, especially in the challenging zero-shot scenario. Compared to existing pre-training models, our model demonstrates superior performance, particularly in data-scarce scenarios where training samples are limited. Our implementation is available at this https URL.
[LG-12] Online hierarchical partitioning of the output space in extreme multi-label data stream ECAI2025
Link: https://arxiv.org/abs/2507.20894
Authors: Lara Neves,Afonso Lourenço,Alberto Cano,Goreti Marreiros
Categories: Machine Learning (cs.LG)
Comments: Accepted at 28th European Conference on Artificial Intelligence (ECAI 2025)
Abstract:Mining data streams with multi-label outputs poses significant challenges due to evolving distributions, high-dimensional label spaces, sparse label occurrences, and complex label dependencies. Moreover, concept drift affects not only input distributions but also label correlations and imbalance ratios over time, complicating model adaptation. To address these challenges, structured learners are categorized into local and global methods. Local methods break down the task into simpler components, while global methods adapt the algorithm to the full output space, potentially yielding better predictions by exploiting label correlations. This work introduces iHOMER (Incremental Hierarchy Of Multi-label Classifiers), an online multi-label learning framework that incrementally partitions the label space into disjoint, correlated clusters without relying on predefined hierarchies. iHOMER leverages online divisive-agglomerative clustering based on Jaccard similarity and a global tree-based learner driven by a multivariate Bernoulli process to guide instance partitioning. To address non-stationarity, it integrates drift detection mechanisms at both global and local levels, enabling dynamic restructuring of label partitions and subtrees. Experiments across 23 real-world datasets show iHOMER outperforms 5 state-of-the-art global baselines, such as MLHAT, MLHT of Pruned Sets and iSOUPT, by 23%, and 12 local baselines, such as binary relevance transformations of kNN, EFDT, ARF, and ADWIN bagging/boosting ensembles, by 32%, establishing its robustness for online multi-label classification.
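As a rough illustration of the label-similarity measure named in the abstract above, the sketch below computes the Jaccard similarity between two labels from the sets of instances on which each label occurs. The function name and the toy stream are hypothetical, and this is not iHOMER's clustering procedure, only the similarity it is said to build on.

```python
def jaccard_similarity(instances_a, instances_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of instance ids."""
    a, b = set(instances_a), set(instances_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two labels that never occur
    return len(a & b) / len(union)


# Made-up stream: each instance id maps to the labels it carries.
stream = {
    0: {"sports", "news"},
    1: {"sports"},
    2: {"news", "politics"},
    3: {"sports", "news"},
}

# Invert the stream to get, for each label, the set of instances where it occurs.
occurrences = {}
for instance_id, labels in stream.items():
    for label in labels:
        occurrences.setdefault(label, set()).add(instance_id)

sim_sports_news = jaccard_similarity(occurrences["sports"], occurrences["news"])
sim_sports_politics = jaccard_similarity(occurrences["sports"], occurrences["politics"])
```

Labels that co-occur often get a similarity close to 1 and would be natural candidates for the same cluster; labels that never co-occur get 0.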
[LG-13] Testbed and Software Architecture for Enhancing Security in Industrial Private 5G Networks
Link: https://arxiv.org/abs/2507.20873
Authors: Song Son Ha,Florian Foerster,Thomas Robert Doebbert,Tim Kittel,Dominik Merli,Gerd Scholl
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:In the era of Industry 4.0, the growing need for secure and efficient communication systems has driven the development of fifth-generation (5G) networks characterized by extremely low latency, massive device connectivity and high data transfer speeds. However, the deployment of 5G networks presents significant security challenges, requiring advanced and robust solutions to counter increasingly sophisticated cyber threats. This paper proposes a testbed and software architecture to strengthen the security of Private 5G Networks, particularly in industrial communication environments.
[LG-14] FedABC: Attention-Based Client Selection for Federated Learning with Long-Term View
Link: https://arxiv.org/abs/2507.20871
Authors: Wenxuan Ye,Xueli An,Junfan Wang,Xueqiang Yan,Georg Carle
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: Accepted to ICC 2025
Abstract:Native AI support is a key objective in the evolution of 6G networks, with Federated Learning (FL) emerging as a promising paradigm. FL allows decentralized clients to collaboratively train an AI model without directly sharing their data, preserving privacy. Clients train local models on private data and share model updates, which a central server aggregates to refine the global model and redistribute it for the next iteration. However, client data heterogeneity slows convergence and reduces model accuracy, and frequent client participation imposes communication and computational burdens. To address these challenges, we propose FedABC, an innovative client selection algorithm designed to take a long-term view in managing data heterogeneity and optimizing client participation. Inspired by attention mechanisms, FedABC prioritizes informative clients by evaluating both model similarity and each model’s unique contributions to the global model. Moreover, considering the evolving demands of the global model, we formulate an optimization problem to guide FedABC throughout the training process. Following the “later-is-better” principle, FedABC adaptively adjusts the client selection threshold, encouraging greater participation in later training stages. Extensive simulations on CIFAR-10 demonstrate that FedABC significantly outperforms existing approaches in model accuracy and client participation efficiency, achieving comparable performance with 32% fewer clients than the classical FL algorithm FedAvg, and 3.5% higher accuracy with 2% fewer clients than the state-of-the-art. This work marks a step toward deploying FL in heterogeneous, resource-constrained environments, thereby supporting native AI capabilities in 6G networks.
[LG-15] Bi-cephalic self-attended model to classify Parkinson's disease patients with freezing of gait
Link: https://arxiv.org/abs/2507.20862
Authors: Shomoita Jahid Mitin(1,2),Rodrigue Rizk(2),Maximilian Scherer(3),Thomas Koeglsperger(3),Daniel Lench(4),KC Santosh(2),Arun Singh(1,5) ((1) Biomedical and Translational Sciences, University of South Dakota, Vermillion, SD, USA and (2) Artificial Intelligence Research lab, Department of Computer Science, University of South Dakota, Vermillion, SD and (3) Department of Neurology, Ludwig Maximilian University, Munich, Germany and (4) Department of Neurology, Medical University of South Carolina, Charleston, SC, USA and (5) Department of Neuroscience, Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, USA)
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 5944 words, 4 figures, 2 tables, European Journal of Neuroscience: Special edition FOG
Abstract:Parkinson Disease (PD) often results in motor and cognitive impairments, including gait dysfunction, particularly in patients with freezing of gait (FOG). Current detection methods are either subjective or reliant on specialized gait analysis tools. This study aims to develop an objective, data-driven, and multi-modal classification model to detect gait dysfunction in PD patients using resting-state EEG signals combined with demographic and clinical variables. We utilized a dataset of 124 participants: 42 PD patients with FOG (PDFOG+), 41 without FOG (PDFOG-), and 41 age-matched healthy controls. Features extracted from resting-state EEG and descriptive variables (age, education, disease duration) were used to train a novel Bi-cephalic Self-Attention Model (BiSAM). We tested three modalities: signal-only, descriptive-only, and multi-modal, across different EEG channel subsets (BiSAM-63, -16, -8, and -4). Signal-only and descriptive-only models showed limited performance, achieving a maximum accuracy of 55% and 68%, respectively. In contrast, the multi-modal models significantly outperformed both, with BiSAM-8 and BiSAM-4 achieving the highest classification accuracy of 88%. These results demonstrate the value of integrating EEG with objective descriptive features for robust PDFOG+ detection. This study introduces a multi-modal, attention-based architecture that objectively classifies PDFOG+ using minimal EEG channels and descriptive variables. This approach offers a scalable and efficient alternative to traditional assessments, with potential applications in routine clinical monitoring and early diagnosis of PD-related gait dysfunction.
[LG-16] Towards Explainable Deep Clustering for Time Series Data ECML-PKDD2025
Link: https://arxiv.org/abs/2507.20840
Authors: Udo Schlegel,Gabriel Marques Tavares,Thomas Seidl
Categories: Machine Learning (cs.LG)
Comments: 14 pages, accepted at TempXAI Workshop at ECML-PKDD 2025
Abstract:Deep clustering uncovers hidden patterns and groups in complex time series data, yet its opaque decision-making limits use in safety-critical settings. This survey offers a structured overview of explainable deep clustering for time series, collecting current methods and their real-world applications. We thoroughly discuss and compare peer-reviewed and preprint papers through application domains across healthcare, finance, IoT, and climate science. Our analysis reveals that most work relies on autoencoder and attention architectures, with limited support for streaming, irregularly sampled, or privacy-preserved series, and interpretability is still primarily treated as an add-on. To push the field forward, we outline six research opportunities: (1) combining complex networks with built-in interpretability; (2) setting up clear, faithfulness-focused evaluation metrics for unsupervised explanations; (3) building explainers that adapt to live data streams; (4) crafting explanations tailored to specific domains; (5) adding human-in-the-loop methods that refine clusters and explanations together; and (6) improving our understanding of how time series clustering models work internally. By making interpretability a primary design goal rather than an afterthought, we propose the groundwork for the next generation of trustworthy deep clustering time series analytics.
[LG-17] BuildSTG: A Multi-building Energy Load Forecasting Method using Spatio-Temporal Graph Neural Network
Link: https://arxiv.org/abs/2507.20838
Authors: Yongzheng Liu,Yiming Wang,Po Xu,Yingjie Xu,Yuntian Chen,Dongxiao Zhang
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:Due to the extensive availability of operation data, data-driven methods show strong capabilities in predicting building energy loads. Buildings with similar features often share energy patterns, reflected by spatial dependencies in their operational data, which conventional prediction methods struggle to capture. To overcome this, we propose a multi-building prediction approach using spatio-temporal graph neural networks, comprising graph representation, graph learning, and interpretation. First, a graph is built based on building characteristics and environmental factors. Next, a multi-level graph convolutional architecture with attention is developed for energy prediction. Lastly, a method interpreting the optimized graph structure is introduced. Experiments on the Building Data Genome Project 2 dataset confirm superior performance over baselines such as XGBoost, SVR, FCNN, GRU, and Naive, highlighting the method’s robustness, generalization, and interpretability in capturing meaningful building similarities and spatial relationships.
[LG-18] Understanding Bias in Perceiving Dimensionality Reduction Projections
Link: https://arxiv.org/abs/2507.20805
Authors: Seoyoung Doh,Hyeon Jeon,Sungbok Shin,Ghulam Jilani Quadri,Nam Wook Kim,Jinwook Seo
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 6 pages
Abstract:Selecting the dimensionality reduction technique that faithfully represents the structure is essential for reliable visual communication and analytics. In reality, however, practitioners favor projections for other attractions, such as aesthetics and visual saliency, over the projection’s structural faithfulness, a bias we define as visual interestingness. In this research, we conduct a user study that (1) verifies the existence of such bias and (2) explains why the bias exists. Our study suggests that visual interestingness biases practitioners’ preferences when selecting projections for analysis, and this bias intensifies with color-encoded labels and shorter exposure time. Based on our findings, we discuss strategies to mitigate bias in perceiving and interpreting DR projections.
[LG-19] Uncertainty-driven Embedding Convolution
Link: https://arxiv.org/abs/2507.20718
Authors: Sungjun Lim,Kangjun Noh,Youngjun Choi,Heeyoung Lee,Kyungwoo Song
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Text embeddings are essential components in modern NLP pipelines. While numerous embedding models have been proposed, their performance varies across domains, and no single model consistently excels across all tasks. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble weights based on embedding uncertainty, grounded in a Bayes-optimal solution under a surrogate loss. Additionally, UEC introduces an uncertainty-aware similarity function that directly incorporates uncertainty into similarity scoring. Extensive experiments on retrieval, classification, and semantic similarity benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
[LG-20] Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks
Link: https://arxiv.org/abs/2507.20708
Authors: Valentin Lafargue,Adriana Laurindo Monteiro,Emmanuelle Claeys,Laurent Risser,Jean-Michel Loubes
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP)
Comments:
Abstract:Proving the compliance of AI algorithms has become an important challenge with the growing deployment of such algorithms for real-life applications. Inspecting possible biased behaviors is mandatory to satisfy the constraints of the regulations of the EU Artificial Intelligence’s Act. Regulation-driven audits increasingly rely on global fairness metrics, with Disparate Impact being the most widely used. Yet such global measures depend highly on the distribution of the sample on which the measures are computed. We investigate first how to manipulate data samples to artificially satisfy fairness criteria, creating minimally perturbed datasets that remain statistically indistinguishable from the original distribution while satisfying prescribed fairness constraints. Then we study how to detect such manipulation. Our analysis (i) introduces mathematically sound methods for modifying empirical distributions under fairness constraints using entropic or optimal transport projections, (ii) examines how an auditee could potentially circumvent fairness inspections, and (iii) offers recommendations to help auditors detect such data manipulations. These results are validated through experiments on classical tabular datasets in bias detection.
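To make the abstract's point about sample dependence concrete, here is a minimal sketch of Disparate Impact under one common definition: the positive-outcome rate of the protected group divided by that of the reference group. The function name, group labels, and toy data are hypothetical, and the hand-picked subset only illustrates how the metric moves with the audited sample; it is not the entropic or optimal-transport projection studied in the paper.

```python
def disparate_impact(predictions, groups, protected, reference):
    """Ratio of positive-prediction rates: protected group over reference group."""

    def positive_rate(group):
        selected = [p for p, g in zip(predictions, groups) if g == group]
        if not selected:
            raise ValueError(f"no samples for group {group!r}")
        return sum(selected) / len(selected)

    reference_rate = positive_rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has no positive predictions")
    return positive_rate(protected) / reference_rate


# Made-up binary predictions (1 = favourable outcome) and group membership.
predictions = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

di_full = disparate_impact(predictions, groups, protected="a", reference="b")

# Auditing only a chosen subset of the same data changes the metric.
subset = [0, 1, 4, 7]
di_subset = disparate_impact(
    [predictions[i] for i in subset],
    [groups[i] for i in subset],
    protected="a",
    reference="b",
)
```

On the full toy sample the ratio is about 0.33, while the subset yields exactly 1.0, which is why an audit that sees only a curated sample can be misled.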
[LG-21] Novel Pivoted Cholesky Decompositions for Efficient Gaussian Process Inference
Link: https://arxiv.org/abs/2507.20678
Authors: Filip de Roos,Fabio Muratore
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The Cholesky decomposition is a fundamental tool for solving linear systems with symmetric and positive definite matrices which are ubiquitous in linear algebra, optimization, and machine learning. Its numerical stability can be improved by introducing a pivoting strategy that iteratively permutes the rows and columns of the matrix. The order of pivoting indices determines how accurately the intermediate decomposition can reconstruct the original matrix, thus is decisive for the algorithm’s efficiency in the case of early termination. Standard implementations select the next pivot from the largest value on the diagonal. In the case of Bayesian nonparametric inference, this strategy corresponds to greedy entropy maximization, which is often used in active learning and design of experiments. We explore this connection in detail and deduce novel pivoting strategies for the Cholesky decomposition. The resulting algorithms are more efficient at reducing the uncertainty over a data set, can be updated to include information about observations, and additionally benefit from a tailored implementation. We benchmark the effectiveness of the new selection strategies on two tasks important to Gaussian processes: sparse regression and inference based on preconditioned iterative solvers. Our results show that the proposed selection strategies are either on par or, in most cases, outperform traditional baselines while requiring a negligible amount of additional computation.
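The standard pivoting rule mentioned in the abstract (pick the largest remaining diagonal entry) can be sketched in a few lines of plain Python. This is the textbook greedy variant with early termination, not the novel strategies proposed in the paper; the function name, the tolerance, and the test matrix are assumptions made for illustration.

```python
import math


def pivoted_cholesky(K, max_rank, tol=1e-12):
    """Partial Cholesky factorization with greedy diagonal pivoting.

    Returns (columns, pivots) such that sum_k columns[k][i] * columns[k][j]
    approximates K[i][j]; the approximation is exact when run to full rank.
    """
    n = len(K)
    residual_diag = [K[i][i] for i in range(n)]  # diagonal of the current residual
    columns, pivots = [], []
    for _ in range(min(max_rank, n)):
        # Greedy rule: choose the index with the largest residual diagonal entry.
        p = max(range(n), key=lambda i: residual_diag[i])
        if residual_diag[p] <= tol:
            break  # residual is numerically zero, stop early
        root = math.sqrt(residual_diag[p])
        # New column: residual column p, scaled by the square root of the pivot.
        col = [
            (K[i][p] - sum(c[i] * c[p] for c in columns)) / root
            for i in range(n)
        ]
        columns.append(col)
        pivots.append(p)
        for i in range(n):
            residual_diag[i] -= col[i] ** 2
    return columns, pivots


# Small symmetric positive definite matrix (made-up values).
K = [[4.0, 2.0, 0.0], [2.0, 5.0, 3.0], [0.0, 3.0, 10.0]]
columns, pivots = pivoted_cholesky(K, max_rank=3)
reconstruction_error = max(
    abs(K[i][j] - sum(c[i] * c[j] for c in columns))
    for i in range(3)
    for j in range(3)
)
```

Stopping after fewer than `n` steps gives a low-rank approximation whose quality depends on the pivot order, which is the lever the abstract says the paper explores.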
[LG-22] Deep Generative Models of Evolution: SNP-level Population Adaptation by Genomic Linkage Incorporation
Link: https://arxiv.org/abs/2507.20644
Authors: Julia Siekiera,Christian Schlötterer,Stefan Kramer
Categories: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Comments: 10 pages, 5 figures
Abstract:The investigation of allele frequency trajectories in populations evolving under controlled environmental pressures has become a popular approach to study evolutionary processes on the molecular level. Statistical models based on well-defined evolutionary concepts can be used to validate different hypotheses about empirical observations. Despite their popularity, classic statistical models like the Wright-Fisher model suffer from simplified assumptions such as the independence of selected loci along a chromosome and uncertainty about the parameters. Deep generative neural networks offer a powerful alternative known for the integration of multivariate dependencies and noise reduction. Due to their high data demands and challenging interpretability they have, so far, not been widely considered in the area of population genomics. To address the challenges in the area of Evolve and Resequencing experiments (ER) based on pooled sequencing (Pool-Seq) data, we introduce a deep generative neural network that aims to model a concept of evolution based on empirical observations over time. The proposed model estimates the distribution of allele frequency trajectories by embedding the observations from single nucleotide polymorphisms (SNPs) with information from neighboring loci. Evaluation on simulated ER experiments demonstrates the model’s ability to capture the distribution of allele frequency trajectories and illustrates the representational power of deep generative models on the example of linkage disequilibrium (LD) estimation. Inspecting the internally learned representations enables estimating pairwise LD, which is typically inaccessible in Pool-Seq data. Our model provides competitive LD estimation in Pool-Seq data with a high degree of LD when compared to existing methods.
[LG-23] PhaseNAS: Language-Model Driven Architecture Search with Dynamic Phase Adaptation
Link: https://arxiv.org/abs/2507.20592
Authors: Fei Kong,Xiaohan Shan,Yanwei Hu,Jianmin Li
Categories: Machine Learning (cs.LG)
Comments: 14 pages
Abstract:Neural Architecture Search (NAS) is challenged by the trade-off between search space exploration and efficiency, especially for complex tasks. While recent LLM-based NAS methods have shown promise, they often suffer from static search strategies and ambiguous architecture representations. We propose PhaseNAS, an LLM-based NAS framework with dynamic phase transitions guided by real-time score thresholds and a structured architecture template language for consistent code generation. On the NAS-Bench-Macro benchmark, PhaseNAS consistently discovers architectures with higher accuracy and better rank. For image classification (CIFAR-10/100), PhaseNAS reduces search time by up to 86% while maintaining or improving accuracy. In object detection, it automatically produces YOLOv8 variants with higher mAP and lower resource cost. These results demonstrate that PhaseNAS enables efficient, adaptive, and generalizable NAS across diverse vision tasks.
[LG-24] A note on the Artstein-Avidan-Milman's generalized Legendre transforms
Link: https://arxiv.org/abs/2507.20577
Authors: Frank Nielsen
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 11 pages
Abstract:Artstein-Avidan and Milman [Annals of mathematics (2009), (169):661-674] characterized invertible reverse-ordering transforms on the space of lower-semi-continuous extended real-valued convex functions as affine deformations of the ordinary Legendre transform. In this note, we prove that all those generalized Legendre transforms on functions correspond to the ordinary Legendre transform on dually corresponding affine-deformed functions. That is, generalized convex conjugates are convex conjugates of affine-deformed functions. We conclude this note by sketching how this result can be interpreted from the lens of information geometry.
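For readers unfamiliar with the ordinary Legendre transform that the abstract builds on, the following sketch approximates the convex conjugate f*(y) = sup_x (x·y − f(x)) on a finite grid and checks that applying it twice recovers a convex function. The grid, the test function f(x) = x²/2, and all names are assumptions for illustration; the affine-deformed generalized transforms of the paper are not implemented here.

```python
def legendre_transform(f, slope, grid):
    """Approximate f*(slope) = sup_x (slope * x - f(x)) by a maximum over grid points."""
    return max(slope * x - f(x) for x in grid)


def f(x):
    # Convex test function; its exact conjugate is f*(y) = y**2 / 2.
    return 0.5 * x * x


grid = [i / 10 for i in range(-50, 51)]  # points in [-5, 5] with step 0.1

conjugate_at_1 = legendre_transform(f, 1.0, grid)

# Tabulate f* on the same grid, then transform once more to get the biconjugate.
conjugate_table = {y: legendre_transform(f, y, grid) for y in grid}


def biconjugate(x):
    return max(x * y - conjugate_table[y] for y in grid)


biconjugate_at_2 = biconjugate(2.0)
```

The grid maximum is only an approximation of the supremum in general; it is exact in this toy case because the maximizer lies on the grid.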
[LG-25] Fusing CFD and measurement data using transfer learning
Link: https://arxiv.org/abs/2507.20576
Authors: Alexander Barklage,Philipp Bekemeyer
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Aerodynamic analysis during aircraft design usually involves methods of varying accuracy and spatial resolution, which all have their advantages and disadvantages. It is therefore desirable to create data-driven models which effectively combine these advantages. Such data fusion methods for distributed quantities mainly rely on proper orthogonal decomposition as of now, which is a linear method. In this paper, we introduce a non-linear method based on neural networks combining simulation and measurement data via transfer learning. The network training accounts for the heterogeneity of the data, as simulation data usually features a high spatial resolution, while measurement data is sparse but more accurate. In a first step, the neural network is trained on simulation data to learn spatial features of the distributed quantities. The second step involves transfer learning on the measurement data to correct for systematic errors between simulation and measurement by only re-training a small subset of the entire neural network model. This approach is applied to a multilayer perceptron architecture and shows significant improvements over the established method based on proper orthogonal decomposition by producing more physical solutions near nonlinearities. In addition, the neural network provides solutions at arbitrary flow conditions, thus making the model useful for flight mechanical design, structural sizing, and certification. As the proposed training strategy is very general, it can also be applied to more complex neural network architectures in the future.
[LG-26] Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy ICCV2025
Link: https://arxiv.org/abs/2507.20573
Authors: Yaxin Xiao,Qingqing Ye,Li Hu,Huadi Zheng,Haibo Hu,Zi Liang,Haoyang Li,Yijie Jiao
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICCV 2025
Abstract:Machine unlearning enables the removal of specific data from ML models to uphold the right to be forgotten. While approximate unlearning algorithms offer efficient alternatives to full retraining, this work reveals that they fail to adequately protect the privacy of unlearned data. In particular, these algorithms introduce implicit residuals which facilitate privacy attacks targeting unlearned data. We observe that these residuals persist regardless of model architectures, parameters, and unlearning algorithms, exposing a new attack surface beyond conventional output-based leakage. Based on this insight, we propose the Reminiscence Attack (ReA), which amplifies the correlation between residuals and membership privacy through targeted fine-tuning processes. ReA achieves up to 1.90x and 1.12x higher accuracy than prior attacks when inferring class-wise and sample-wise membership, respectively. To mitigate such residual-induced privacy risk, we develop a dual-phase approximate unlearning framework that first eliminates deep-layer unlearned data traces and then enforces convergence stability to prevent models from “pseudo-convergence”, where their outputs are similar to retrained models but still preserve unlearned residuals. Our framework works for both classification and generation tasks. Experimental evaluations confirm that our approach maintains high unlearning efficacy, while reducing the adaptive privacy attack accuracy to nearly random guessing, at the computational cost of 2-12% of full retraining from scratch.
[LG-27] Improving Group Fairness in Tensor Completion via Imbalance Mitigating Entity Augmentation
Link: https://arxiv.org/abs/2507.20542
Authors: Dawon Ahn,Jun-Gi Jang,Evangelos E. Papalexakis
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Group fairness is important to consider in tensor decomposition to prevent discrimination based on social grounds such as gender or age. Although a few works have studied group fairness in tensor decomposition, they suffer from performance degradation. To address this, we propose STAFF (Sparse Tensor Augmentation For Fairness) to improve group fairness by minimizing the gap in completion errors of different groups while reducing the overall tensor completion error. Our main idea is to augment a tensor with augmented entities including sufficient observed entries to mitigate imbalance and group bias in the sparse tensor. We evaluate STAFF on tensor completion with various datasets under conventional and deep learning-based tensor models. STAFF consistently shows the best trade-off between completion error and group fairness; at most, it yields 36% lower MSE and 59% lower MADE than the second-best baseline.
[LG-28] Kernel Learning for Sample Constrained Black-Box Optimization AAAI2025
Link: https://arxiv.org/abs/2507.20533
Authors: Rajalaxmi Rajagopalan,Yu-Lin Wei,Romit Roy Choudhury
Categories: Machine Learning (cs.LG)
Comments: Accepted to AAAI 2025
Abstract:Black box optimization (BBO) focuses on optimizing unknown functions in high-dimensional spaces. In many applications, sampling the unknown function is expensive, imposing a tight sample budget. Ongoing work is making progress on reducing the sample budget by learning the shape/structure of the function, known as kernel learning. We propose a new method to learn the kernel of a Gaussian Process. Our idea is to create a continuous kernel space in the latent space of a variational autoencoder, and run an auxiliary optimization to identify the best kernel. Results show that the proposed method, Kernel Optimized Blackbox Optimization (KOBO), outperforms the state of the art by estimating the optimum at considerably lower sample budgets. Results hold not only across synthetic benchmark functions but also in real applications. We show that a hearing aid may be personalized with fewer audio queries to the user, or a generative model could converge to desirable images from limited user ratings.
[LG-29] Efficient Proxy Raytracer for Optical Systems using Implicit Neural Representations
Link: https://arxiv.org/abs/2507.20513
Authors: Shiva Sinaei,Chuanjun Zheng,Kaan Akşit,Daisuke Iwai
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Ray tracing is a widely used technique for modeling optical systems, involving sequential surface-by-surface computations, which can be computationally intensive. We propose Ray2Ray, a novel method that leverages implicit neural representations to model optical systems with greater efficiency, eliminating the need for surface-by-surface computations in a single-pass, end-to-end model. Ray2Ray learns the mapping between rays emitted from a given source and their corresponding rays after passing through a given optical system in a physically accurate manner. We train Ray2Ray on nine off-the-shelf optical systems, achieving positional errors on the order of 1 μm and angular deviations on the order of 0.01 degrees in the estimated output rays. Our work highlights the potential of neural representations as a proxy for optical ray tracers.
[LG-30] Attributed Graph Clustering with Multi-Scale Weight-Based Pairwise Coarsening and Contrastive Learning CCL
Link: https://arxiv.org/abs/2507.20505
Authors: Binxiong Li,Yuefei Wang,Binyu Zhao,Heyang Gao,Benhan Yang,Quanzhou Luo,Xue Li,Xu Xiang,Yujie Liu,Huijie Tang
Categories: Machine Learning (cs.LG)
Comments: The source code for this study is available at this https URL
Abstract:This study introduces the Multi-Scale Weight-Based Pairwise Coarsening and Contrastive Learning (MPCCL) model, a novel approach for attributed graph clustering that effectively bridges critical gaps in existing methods, including long-range dependency, feature collapse, and information loss. Traditional methods often struggle to capture high-order graph features due to their reliance on low-order attribute information, while contrastive learning techniques face limitations in feature diversity by overemphasizing local neighborhood structures. Similarly, conventional graph coarsening methods, though reducing graph scale, frequently lose fine-grained structural details. MPCCL addresses these challenges through an innovative multi-scale coarsening strategy, which progressively condenses the graph while prioritizing the merging of key edges based on global node similarity to preserve essential structural information. It further introduces a one-to-many contrastive learning paradigm, integrating node embeddings with augmented graph views and cluster centroids to enhance feature diversity, while mitigating feature masking issues caused by the accumulation of high-frequency node weights during multi-scale coarsening. By incorporating a graph reconstruction loss and KL divergence into its self-supervised learning framework, MPCCL ensures cross-scale consistency of node representations. Experimental evaluations reveal that MPCCL achieves a significant improvement in clustering performance, including a remarkable 15.24% increase in NMI on the ACM dataset and notable robust gains on smaller-scale datasets such as Citeseer, Cora and DBLP.
[LG-31] Mixture of Length and Pruning Experts for Knowledge Graphs Reasoning
Link: https://arxiv.org/abs/2507.20498
Authors: Enjun Du,Siyi Liu,Yongqi Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Knowledge Graph (KG) reasoning, which aims to infer new facts from structured knowledge repositories, plays a vital role in Natural Language Processing (NLP) systems. Its effectiveness critically depends on constructing informative and contextually relevant reasoning paths. However, existing graph neural networks (GNNs) often adopt rigid, query-agnostic path-exploration strategies, limiting their ability to adapt to diverse linguistic contexts and semantic nuances. To address these limitations, we propose MoKGR, a mixture-of-experts framework that personalizes path exploration through two complementary components: (1) a mixture of length experts that adaptively selects and weights candidate path lengths according to query complexity, providing query-specific reasoning depth; and (2) a mixture of pruning experts that evaluates candidate paths from a complementary perspective, retaining the most informative paths for each query. Through comprehensive experiments on diverse benchmarks, MoKGR demonstrates superior performance in both transductive and inductive settings, validating the effectiveness of personalized path exploration in KG reasoning.
[LG-32] HIAL: A New Paradigm for Hypergraph Active Learning via Influence Maximization
Link: https://arxiv.org/abs/2507.20490
Authors: Yanheng Hou,Xunkai Li,Zhenjun Li,Bing Zhou,Ronghua Li,Guoren Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In recent years, Hypergraph Neural Networks (HNNs) have demonstrated immense potential in handling complex systems with high-order interactions. However, acquiring large-scale, high-quality labeled data for these models is costly, making Active Learning (AL) a critical technique. Existing Graph Active Learning (GAL) methods, when applied to hypergraphs, often rely on techniques like “clique expansion,” which destroys the high-order structural information crucial to a hypergraph’s success, thereby leading to suboptimal performance. To address this challenge, we introduce HIAL (Hypergraph Active Learning), a native active learning framework designed specifically for hypergraphs. We innovatively reformulate the Hypergraph Active Learning (HAL) problem as an Influence Maximization task. The core of HIAL is a dual-perspective influence function that, based on our novel “High-Order Interaction-Aware (HOI-Aware)” propagation mechanism, synergistically evaluates a node’s feature-space coverage (via Magnitude of Influence, MoI) and its topological influence (via Expected Diffusion Value, EDV). We prove that this objective function is monotone and submodular, thus enabling the use of an efficient greedy algorithm with a formal (1-1/e) approximation guarantee. Extensive experiments on seven public datasets demonstrate that HIAL significantly outperforms state-of-the-art baselines in terms of performance, efficiency, generality, and robustness, establishing an efficient and powerful new paradigm for active learning on hypergraphs.
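The greedy selection with a (1-1/e) guarantee referred to in the abstract is the classical algorithm for maximizing a monotone submodular function under a cardinality budget. The sketch below applies it to set coverage, a textbook monotone submodular objective used here as a stand-in; HIAL's actual influence function (MoI and EDV) is not reproduced, and all names and data are hypothetical.

```python
def greedy_select(coverage, budget):
    """Greedily pick up to `budget` candidates, each time taking the one
    with the largest marginal gain in newly covered elements."""
    selected, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for candidate, reach in coverage.items():
            if candidate in selected:
                continue
            gain = len(reach - covered)  # marginal gain of adding this candidate
            if gain > best_gain:
                best, best_gain = candidate, gain
        if best is None:
            break  # no remaining candidate adds anything new
        selected.append(best)
        covered |= coverage[best]
    return selected, covered


# Made-up example: each candidate node "influences" a set of other nodes.
coverage = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}
selected, covered = greedy_select(coverage, budget=2)
```

With a budget of two, the procedure first takes `"c"` (four new elements) and then `"a"` (three more). For any monotone submodular objective, this greedy rule is guaranteed to reach at least a (1-1/e) fraction of the optimal value.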
[LG-33] Conditional Diffusion Models for Global Precipitation Map Inpainting
链接: https://arxiv.org/abs/2507.20478
作者: Daiko Kishikawa,Yuka Muto,Shunji Kotsuki
类目: Machine Learning (cs.LG)
*备注:
Abstract:Incomplete satellite-based precipitation presents a significant challenge in global monitoring. For example, the Global Satellite Mapping of Precipitation (GSMaP) from JAXA suffers from substantial missing regions due to the orbital characteristics of satellites that have microwave sensors, and its current interpolation methods often result in spatial discontinuities. In this study, we formulate the completion of the precipitation map as a video inpainting task and propose a machine learning approach based on conditional diffusion models. Our method employs a 3D U-Net with a 3D condition encoder to reconstruct complete precipitation maps by leveraging spatio-temporal information from infrared images, latitude-longitude grids, and physical time inputs. Training was carried out on ERA5 hourly precipitation data from 2020 to 2023. We generated a pseudo-GSMaP dataset by randomly applying GSMaP masks to ERA maps. Performance was evaluated for the calendar year 2024, and our approach produces more spatio-temporally consistent inpainted precipitation maps compared to conventional methods. These results indicate the potential to improve global precipitation monitoring using the conditional diffusion models.
[LG-34] Diagonally-Weighted Generalized Method of Moments Estimation for Gaussian Mixture Modeling
链接: https://arxiv.org/abs/2507.20459
作者: Liu Zhang,Oscar Mickelin,Sheng Xu,Amit Singer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Since Pearson [Philosophical Transactions of the Royal Society of London. A, 185 (1894), pp. 71-110] first applied the method of moments (MM) for modeling data as a mixture of one-dimensional Gaussians, moment-based estimation methods have proliferated. Among these methods, the generalized method of moments (GMM) improves the statistical efficiency of MM by weighting the moments appropriately. However, the computational complexity and storage complexity of MM and GMM grow exponentially with the dimension, making these methods impractical for high-dimensional data or when higher-order moments are required. Such computational bottlenecks are more severe in GMM since it additionally requires estimating a large weighting matrix. To overcome these bottlenecks, we propose the diagonally-weighted GMM (DGMM), which achieves a balance among statistical efficiency, computational complexity, and numerical stability. We apply DGMM to study the parameter estimation problem for weakly separated heteroscedastic low-rank Gaussian mixtures and design a computationally efficient and numerically stable algorithm that obtains the DGMM estimator without explicitly computing or storing the moment tensors. We implement the proposed algorithm and empirically validate the advantages of DGMM: in numerical studies, DGMM attains smaller estimation errors while requiring substantially shorter runtime than MM and GMM. The code and data will be available upon publication at this https URL.
[LG-35] Your Attention Matters: to Improve Model Robustness to Noise and Spurious Correlations
链接: https://arxiv.org/abs/2507.20453
作者: Camilo Tamayo-Rousseau,Yunjia Zhao,Yiqun Zhang,Randall Balestriero
类目: Machine Learning (cs.LG)
*备注:
Abstract:Self-attention mechanisms are foundational to Transformer architectures, supporting their impressive success in a wide range of tasks. While there are many self-attention variants, their robustness to noise and spurious correlations has not been well studied. This study evaluates Softmax, Sigmoid, Linear, Doubly Stochastic, and Cosine attention within Vision Transformers under different data corruption scenarios. Through testing across the CIFAR-10, CIFAR-100, and Imagenette datasets, we show that Doubly Stochastic attention is the most robust. Our findings inform self-attention selection in contexts with imperfect data.
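For readers unfamiliar with the term, a doubly stochastic attention matrix has rows and columns that each sum to one, whereas softmax attention only normalizes rows. One common way to obtain such a matrix is Sinkhorn normalization; whether the paper uses exactly this procedure is not stated in the abstract, so the following is a sketch under that assumption, with `sinkhorn_attention` as a hypothetical helper name.

```python
import math
from typing import List


def sinkhorn_attention(scores: List[List[float]], n_iters: int = 100) -> List[List[float]]:
    """Alternately normalize rows and columns of exp(scores) for a square matrix."""
    n = len(scores)
    shift = max(max(row) for row in scores)  # subtract the max for numerical stability
    mat = [[math.exp(s - shift) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: after this step each row sums to 1 (as in softmax).
        mat = [[v / sum(row) for v in row] for row in mat]
        # Column normalization: after this step each column sums to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


toy_scores = [[1.0, 0.5, -0.2], [0.3, 2.0, 0.1], [-1.0, 0.4, 0.9]]
attn = sinkhorn_attention(toy_scores)
row_sums = [sum(row) for row in attn]
col_sums = [sum(attn[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

With a single iteration and no column step this reduces to ordinary softmax attention, which makes the two variants easy to compare in the same code path.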
[LG-36] BOASF: A Unified Framework for Speeding up Automatic Machine Learning via Adaptive Successive Filtering
链接: https://arxiv.org/abs/2507.20446
作者: Guanghui Zhu,Xin Fang,Lei Wang,Wenzhong Chen,Rong Gu,Chunfeng Yuan,Yihua Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning has achieved great success in many application areas. However, for non-expert practitioners, it is often very challenging to address a machine learning task successfully and efficiently. Finding the optimal machine learning model or hyperparameter combination from a large number of possible alternatives usually requires considerable expert knowledge and experience. To tackle this problem, we propose a combined Bayesian Optimization and Adaptive Successive Filtering algorithm (BOASF) under a unified multi-armed bandit framework to automate model selection or hyperparameter optimization. Specifically, BOASF consists of multiple evaluation rounds, in each of which we select promising configurations for each arm using Bayesian optimization. Then, ASF adaptively discards poorly performing arms early using a Gaussian UCB-based probabilistic model. Furthermore, a Softmax model is employed to adaptively allocate available resources to each promising arm that advances to the next round. An arm with a higher probability of advancing is allocated more resources. Experimental results show that BOASF is effective in speeding up the model selection and hyperparameter optimization processes while achieving robust and better prediction performance than existing state-of-the-art automatic machine learning methods. Moreover, BOASF achieves better anytime performance under various time budgets.
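The softmax-based resource allocation described above can be sketched in a few lines. The abstract does not specify what quantity enters the softmax, so the per-arm scores, the `temperature` parameter, and the function name `softmax_allocate` below are all illustrative assumptions rather than the paper's definitions.

```python
import math
from typing import Dict


def softmax_allocate(
    scores: Dict[str, float], total_budget: float, temperature: float = 1.0
) -> Dict[str, float]:
    """Split `total_budget` across arms in proportion to softmax(scores / temperature)."""
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {arm: math.exp((s - top) / temperature) for arm, s in scores.items()}
    norm = sum(weights.values())
    return {arm: total_budget * w / norm for arm, w in weights.items()}


# Hypothetical validation scores for three surviving arms (model families).
arm_scores = {"random_forest": 0.91, "svm": 0.88, "knn": 0.80}
allocation = softmax_allocate(arm_scores, total_budget=60.0, temperature=0.05)
print(allocation)
```

A lower temperature concentrates the budget on the leading arm, while a higher one spreads it more evenly; how BOASF sets this trade-off is not described in the abstract.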
[LG-37] Provable In-Context Learning of Nonlinear Regression with Transformers
链接: https://arxiv.org/abs/2507.20443
作者: Hongbo Li,Lingjie Duan,Yingbin Liang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The transformer architecture, which processes sequences of input tokens to produce outputs for query tokens, has revolutionized numerous areas of machine learning. A defining feature of transformers is their ability to perform previously unseen tasks using task-specific prompts without updating parameters, a phenomenon known as in-context learning (ICL). Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks such as linear regression and binary classification. To advance the theoretical understanding of ICL, this paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities in these settings. We analyze the stage-wise dynamics of attention during training: attention scores between a query token and its target features grow rapidly in the early phase, then gradually converge to one, while attention to irrelevant features decays more slowly and exhibits oscillatory behavior. Our analysis introduces new proof techniques that explicitly characterize how the nature of general non-degenerate L-Lipschitz task functions affects attention weights. Specifically, we identify the Lipschitz constant L of nonlinear function classes as a key factor governing the convergence dynamics of transformers in ICL. Leveraging these insights, for two distinct regimes depending on whether L is below or above a threshold, we derive different time bounds to guarantee near-zero prediction error. Notably, despite the convergence time depending on the underlying task functions, we prove that query tokens consistently attend to prompt tokens with highly relevant features at convergence, demonstrating the ICL capability of transformers for unseen functions.
[LG-38] BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Tool
链接: https://arxiv.org/abs/2507.20440
作者: Vicente Ramos(1),Sundous Hussein(1),Mohamed Abdel-Hafiz(1),Arunangshu Sarkar(2),Weixuan Liu(2),Katerina J. Kechris(2),Russell P. Bowler(3),Leslie Lange(4),Farnoush Banaei-Kashani(1) ((1) Department of Computer Science and Engineering, University of Colorado Denver, Denver, USA, (2) Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, USA, (3) Genomic Medicine Institute, Cleveland Clinic, Cleveland, USA, (4) Division of Biomedical Informatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, USA)
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 6 pages, 1 figure, 2 tables; Software available on PyPI as BioNeuralNet. For documentation, tutorials, and workflows see this https URL
Abstract:Multi-omics data offer unprecedented insights into complex biological systems, yet their high dimensionality, sparsity, and intricate interactions pose significant analytical challenges. Network-based approaches have advanced multi-omics research by effectively capturing biologically relevant relationships among molecular entities. While these methods are powerful for representing molecular interactions, there remains a need for tools specifically designed to effectively utilize these network representations across diverse downstream analyses. To fulfill this need, we introduce BioNeuralNet, a flexible and modular Python framework tailored for end-to-end network-based multi-omics data analysis. BioNeuralNet leverages Graph Neural Networks (GNNs) to learn biologically meaningful low-dimensional representations from multi-omics networks, converting these complex molecular networks into versatile embeddings. BioNeuralNet supports all major stages of multi-omics network analysis, including several network construction techniques, generation of low-dimensional representations, and a broad range of downstream analytical tasks. Its extensive utilities, including diverse GNN architectures, and compatibility with established Python packages (e.g., scikit-learn, PyTorch, NetworkX), enhance usability and facilitate quick adoption. BioNeuralNet is an open-source, user-friendly, and extensively documented framework designed to support flexible and reproducible multi-omics network analysis in precision medicine.
[LG-39] Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
链接: https://arxiv.org/abs/2507.20424
作者: Tolga Dimlioglu,Anna Choromanska
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 9 pages main body, 32 pages of supplementary material for detailed derivations and more experiment results
Abstract:We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while significantly reducing communication overhead. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
[LG-40] Bipedalism for Quadrupedal Robots: Versatile Loco-Manipulation through Risk-Adaptive Reinforcement Learning
链接: https://arxiv.org/abs/2507.20382
作者: Yuyou Zhang,Radu Corcodel,Ding Zhao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Humanoids 2025
Abstract:Loco-manipulation of quadrupedal robots has broadened robotic applications, but using legs as manipulators often compromises locomotion, while mounting arms complicates the system. To mitigate this issue, we introduce bipedalism for quadrupedal robots, thus freeing the front legs for versatile interactions with the environment. We propose a risk-adaptive distributional Reinforcement Learning (RL) framework designed for quadrupedal robots walking on their hind legs, balancing worst-case conservativeness with optimal performance in this inherently unstable task. During training, the adaptive risk preference is dynamically adjusted based on the uncertainty of the return, measured by the coefficient of variation of the estimated return distribution. Extensive experiments in simulation show our method’s superior performance over baselines. Real-world deployment on a Unitree Go2 robot further demonstrates the versatility of our policy, enabling tasks like cart pushing, obstacle probing, and payload transport, while showcasing robustness against challenging dynamics and external disturbances.
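The uncertainty measure named in the abstract, the coefficient of variation, is the standard deviation divided by the magnitude of the mean. A minimal sketch follows, assuming the estimated return distribution is available as a list of samples or quantiles; how the paper maps this value to a risk preference is not given in the abstract and is not reproduced here.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(returns: Sequence[float], eps: float = 1e-8) -> float:
    """Return std / |mean| for a set of return samples.

    `eps` guards against division by zero when the mean return is near zero;
    it is an implementation convenience, not something taken from the paper.
    """
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population standard deviation
    return std / (abs(mean) + eps)


# Two hypothetical return distributions with the same mean but different spread.
confident = [9.5, 10.0, 10.5]
uncertain = [5.0, 15.0]
cv_confident = coefficient_of_variation(confident)
cv_uncertain = coefficient_of_variation(uncertain)
print(cv_confident, cv_uncertain)
```

A larger value signals a more uncertain return estimate, which is the situation in which a risk-adaptive scheme would plausibly act more conservatively.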
[LG-41] Set-based Implicit Likelihood Inference of Galaxy Cluster Mass ICML
链接: https://arxiv.org/abs/2507.20378
作者: Bonny Y. Wang,Leander Thiele
类目: Machine Learning (cs.LG); Cosmology and Nongalactic Astrophysics (astro-ph.CO)
*备注: 5 pages, 4 figures; accepted as a spotlight talk at ICML-colocated ML4Astro 2025 workshop
Abstract:We present a set-based machine learning framework that infers posterior distributions of galaxy cluster masses from projected galaxy dynamics. Our model combines Deep Sets and conditional normalizing flows to incorporate both positional and velocity information of member galaxies to predict residual corrections to the M-σ relation for improved interpretability. Trained on the Uchuu-UniverseMachine simulation, our approach significantly reduces scatter and provides well-calibrated uncertainties across the full mass range compared to traditional dynamical estimates.
[LG-42] Sequence-Aware Inline Measurement Attribution for Good-Bad Wafer Diagnosis
链接: https://arxiv.org/abs/2507.20364
作者: Kohei Miyaguchi,Masao Joko,Rebekah Sheraw,Tsuyoshi Idé
类目: Machine Learning (cs.LG)
*备注: Published as K. Miyaguchi, M. Joko, R. Sheraw and T. Idé, “Sequence-Aware Inline Measurement Attribution for Good-Bad Wafer Diagnosis : DM: Big Data Management and Machine Learning,” 2025 36th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC), Albany, NY, USA, 2025, pp. 1-6, doi: https://doi.org/10.1109/ASMC64512.2025.11010308
Abstract:How can we identify problematic upstream processes when a certain type of wafer defect starts appearing at a quality checkpoint? Given the complexity of modern semiconductor manufacturing, which involves thousands of process steps, cross-process root cause analysis for wafer defects has been considered highly challenging. This paper proposes a novel framework called Trajectory Shapley Attribution (TSA), an extension of Shapley values (SV), a widely used attribution algorithm in explainable artificial intelligence research. TSA overcomes key limitations of standard SV, including its disregard for the sequential nature of manufacturing processes and its reliance on an arbitrarily chosen reference point. We applied TSA to a good-bad wafer diagnosis task in experimental front-end-of-line processes at the NY CREATES Albany NanoTech fab, aiming to identify measurement items (serving as proxies for process parameters) most relevant to abnormal defect occurrence.
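As background for the attribution method being extended, standard Shapley values can be computed exactly for a small number of players by averaging marginal contributions over all orderings. The sketch below shows that baseline only, not Trajectory Shapley Attribution; the process names and the toy defect-score function are invented for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values: mean marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_defect_score(active: FrozenSet[str]) -> float:
    """Hypothetical defect score when only the processes in `active` deviate."""
    score = 0.0
    if "etch" in active:
        score += 2.0
    if "litho" in active:
        score += 1.0
    if "etch" in active and "cmp" in active:
        score += 3.0  # interaction effect between two processes
    return score


processes = ["etch", "litho", "cmp"]
phi = shapley_values(processes, toy_defect_score)
print(phi)
```

Note that every ordering is weighted equally here, which is precisely the disregard for process sequence that the abstract identifies as a limitation. The cost also grows factorially with the number of players, so this exact form is only practical for toy cases.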
[LG-43] MH-GIN: Multi-scale Heterogeneous Graph-based Imputation Network for AIS Data (Extended Version)
链接: https://arxiv.org/abs/2507.20362
作者: Hengyu Liu,Tianyi Li,Yuqiang He,Kristian Torp,Yushuai Li,Christian S. Jensen
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 18 pages, 4 figures
Abstract:Location-tracking data from the Automatic Identification System, much of which is publicly available, plays a key role in a range of maritime safety and monitoring applications. However, the data suffers from missing values that hamper downstream applications. Imputing the missing values is challenging because the values of different heterogeneous attributes are updated at diverse rates, resulting in the occurrence of multi-scale dependencies among attributes. Existing imputation methods that assume similar update rates across attributes are unable to capture and exploit such dependencies, limiting their imputation accuracy. We propose MH-GIN, a Multi-scale Heterogeneous Graph-based Imputation Network that aims to improve imputation accuracy by capturing multi-scale dependencies. Specifically, MH-GIN first extracts multi-scale temporal features for each attribute while preserving their intrinsic heterogeneous characteristics. Then, it constructs a multi-scale heterogeneous graph to explicitly model dependencies between heterogeneous attributes to enable more accurate imputation of missing values through graph propagation. Experimental results on two real-world datasets show that MH-GIN achieves an average 57% reduction in imputation errors compared to state-of-the-art methods, while maintaining computational efficiency. The source code and implementation details of MH-GIN are publicly available at this https URL.
[LG-44] Wafer Defect Root Cause Analysis with Partial Trajectory Regression
链接: https://arxiv.org/abs/2507.20357
作者: Kohei Miyaguchi,Masao Joko,Rebekah Sheraw,Tsuyoshi Idé
类目: Machine Learning (cs.LG)
*备注: Published as K. Miyaguchi, M. Joko, R. Sheraw and T. Idé, “Wafer Defect Root Cause Analysis with Partial Trajectory Regression,” Proceedings of the 36th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC 2025), Albany, NY, USA, 2025, pp. 1-6, doi: https://doi.org/10.1109/ASMC64512.2025.11010733
Abstract:Identifying upstream processes responsible for wafer defects is challenging due to the combinatorial nature of process flows and the inherent variability in processing routes, which arises from factors such as rework operations and random process waiting times. This paper presents a novel framework for wafer defect root cause analysis, called Partial Trajectory Regression (PTR). The proposed framework is carefully designed to address the limitations of conventional vector-based regression models, particularly in handling variable-length processing routes that span a large number of heterogeneous physical processes. To compute the attribution score of each process given a detected high defect density on a specific wafer, we propose a new algorithm that compares two counterfactual outcomes derived from partial process trajectories. This is enabled by new representation learning methods, proc2vec and route2vec. We demonstrate the effectiveness of the proposed framework using real wafer history data from the NY CREATES fab in Albany.
[LG-45] Computational Advantages of Multi-Grade Deep Learning: Convergence Analysis and Performance Insights
链接: https://arxiv.org/abs/2507.20351
作者: Ronglong Fang,Yuesheng Xu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Multi-grade deep learning (MGDL) has been shown to significantly outperform the standard single-grade deep learning (SGDL) across various applications. This work aims to investigate the computational advantages of MGDL focusing on its performance in image regression, denoising, and deblurring tasks, and comparing it to SGDL. We establish convergence results for the gradient descent (GD) method applied to these models and provide mathematical insights into MGDL’s improved performance. In particular, we demonstrate that MGDL is more robust to the choice of learning rate under GD than SGDL. Furthermore, we analyze the eigenvalue distributions of the Jacobian matrices associated with the iterative schemes arising from the GD iterations, offering an explanation for MGDL’s enhanced training stability.
[LG-46] From Observations to Causations: A GNN-based Probabilistic Prediction Framework for Causal Discovery
链接: https://arxiv.org/abs/2507.20349
作者: Rezaur Rashid,Gabriel Terejanu
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Causal discovery from observational data is challenging, especially with large datasets and complex relationships. Traditional methods often struggle with scalability and capturing global structural information. To overcome these limitations, we introduce a novel graph neural network (GNN)-based probabilistic framework that learns a probability distribution over the entire space of causal graphs, unlike methods that output a single deterministic graph. Our framework leverages a GNN that encodes both node and edge attributes into a unified graph representation, enabling the model to learn complex causal structures directly from data. The GNN model is trained on a diverse set of synthetic datasets augmented with statistical and information-theoretic measures, such as mutual information and conditional entropy, capturing both local and global data properties. We frame causal discovery as a supervised learning problem, directly predicting the entire graph structure. Our approach demonstrates superior performance, outperforming both traditional and recent non-GNN-based methods, as well as a GNN-based approach, in terms of accuracy and scalability on synthetic and real-world datasets without further training. This probabilistic framework significantly improves causal structure learning, with broad implications for decision-making and scientific discovery across various fields.
[LG-47] Approximating Full Conformal Prediction for Neural Network Regression with Gauss-Newton Influence ICLR2025
链接: https://arxiv.org/abs/2507.20272
作者: Dharmesh Tailor,Alvaro H.C. Correia,Eric Nalisnick,Christos Louizos
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 13th International Conference on Learning Representations (ICLR 2025)
Abstract:Uncertainty quantification is an important prerequisite for the deployment of deep learning models in safety-critical areas. Yet, this hinges on the uncertainty estimates being useful to the extent the prediction intervals are well-calibrated and sharp. In the absence of inherent uncertainty estimates (e.g. pretrained models predicting only point estimates), popular approaches that operate post-hoc include Laplace’s method and split conformal prediction (split-CP). However, Laplace’s method can be miscalibrated when the model is misspecified and split-CP requires sample splitting, and thus comes at the expense of statistical efficiency. In this work, we construct prediction intervals for neural network regressors post-hoc without held-out data. This is achieved by approximating the full conformal prediction method (full-CP). Whilst full-CP nominally requires retraining the model for every test point and candidate label, we propose to train just once and locally perturb model parameters using Gauss-Newton influence to approximate the effect of retraining. Coupled with linearization of the network, we express the absolute residual nonconformity score as a piecewise linear function of the candidate label allowing for an efficient procedure that avoids the exhaustive search over the output space. On standard regression benchmarks and bounding box localization, we show the resulting prediction intervals are locally-adaptive and often tighter than those of split-CP.
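For context, the split conformal baseline that the abstract compares against can be written compactly. This sketch covers split-CP with the absolute residual score only, not the Gauss-Newton approximation of full-CP proposed in the paper; the function name and the toy calibration data are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple


def split_cp_interval(
    cal_preds: Sequence[float],
    cal_targets: Sequence[float],
    test_pred: float,
    alpha: float = 0.1,
) -> Tuple[float, float]:
    """Split conformal interval using absolute residuals on held-out calibration data."""
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (test_pred - q, test_pred + q)


# Toy calibration set: predictions are all 0, targets are 1..9, so residuals are 1..9.
cal_preds = [0.0] * 9
cal_targets = [float(k) for k in range(1, 10)]
interval = split_cp_interval(cal_preds, cal_targets, test_pred=5.0, alpha=0.25)
print(interval)
```

The interval half-width `q` is the same for every test point, which is why split-CP intervals are not locally adaptive unless the score is normalized, and the calibration data must be held out from training, which is the statistical cost the abstract refers to.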
[LG-48] Data-Efficient Prediction-Powered Calibration via Cross-Validation
链接: https://arxiv.org/abs/2507.20268
作者: Seonghoon Yoo,Houssem Sifaou,Sangwoo Park,Joonhyuk Kang,Osvaldo Simeone
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:
Abstract:Calibration data are necessary to formally quantify the uncertainty of the decisions produced by an existing artificial intelligence (AI) model. To overcome the common issue of scarce calibration data, a promising approach is to employ synthetic labels produced by a (generally different) predictive model. However, fine-tuning the label-generating predictor on the inference task of interest, as well as estimating the residual bias of the synthetic labels, demand additional data, potentially exacerbating the calibration data scarcity problem. This paper introduces a novel approach that efficiently utilizes limited calibration data to simultaneously fine-tune a predictor and estimate the bias of the synthetic labels. The proposed method yields prediction sets with rigorous coverage guarantees for AI-generated decisions. Experimental results on an indoor localization problem validate the effectiveness and performance gains of our solution.
[LG-49] Technical Indicator Networks (TINs): An Interpretable Neural Architecture Modernizing Classical Technical Analysis for Adaptive Algorithmic Trading
链接: https://arxiv.org/abs/2507.20202
作者: Longfei Lu
类目: Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
*备注: Patent Application No. DE10202502351 filed on July 8, 2025 with DPMA
Abstract:This work proposes that a vast majority of classical technical indicators in financial analysis are, in essence, special cases of neural networks with fixed and interpretable weights. It is shown that nearly all such indicators, such as moving averages, momentum-based oscillators, volatility bands, and other commonly used technical constructs, can be reconstructed topologically as modular neural network components. Technical Indicator Networks (TINs) are introduced as a general neural architecture that replicates and structurally upgrades traditional indicators by supporting n-dimensional inputs such as price, volume, sentiment, and order book data. By encoding domain-specific knowledge into neural structures, TINs modernize the foundational logic of technical analysis and propel algorithmic trading into a new era, bridging the legacy of proven indicators with the potential of contemporary AI systems.
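The central observation, that a classical indicator is a neural unit with fixed weights, is easy to check for the simplest cases. The sketch below expresses a simple moving average and a momentum indicator as a single linear neuron slid over a price series; it illustrates the general idea only and is not the TIN architecture itself, and all function names are hypothetical.

```python
from typing import List, Sequence


def linear_neuron(inputs: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """A single linear unit: weighted sum of inputs plus bias, with no activation."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def slide(prices: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Apply the same fixed-weight neuron to every window of len(weights) prices."""
    k = len(weights)
    return [linear_neuron(prices[i - k + 1 : i + 1], weights) for i in range(k - 1, len(prices))]


def sma_weights(window: int) -> List[float]:
    """Simple moving average: uniform weights 1/window."""
    return [1.0 / window] * window


def momentum_weights(window: int) -> List[float]:
    """Momentum over the window: newest price minus oldest price."""
    return [-1.0] + [0.0] * (window - 2) + [1.0]


prices = [1.0, 2.0, 3.0, 4.0, 5.0]
sma = slide(prices, sma_weights(3))
mom = slide(prices, momentum_weights(3))
print(sma, mom)
```

Making these weights trainable, or widening the input beyond price, is the kind of generalization the abstract describes, though the specific construction used in the paper is not shown here.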
[LG-50] Practical Multi-Task Learning for Rare Conversions in Ad Tech RECSYS2025
链接: https://arxiv.org/abs/2507.20161
作者: Yuval Dishi,Ophir Friedler,Yonatan Karni,Natalia Silberstein,Yulia Stolin
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted to RecSys 2025
Abstract:We present a Multi-Task Learning (MTL) approach for improving predictions for rare (e.g., 1%) conversion events in online advertising. The conversions are classified into “rare” or “frequent” types based on historical statistics. The model learns shared representations across all signals while specializing through separate task towers for each type. The approach was tested and fully deployed to production, demonstrating consistent improvements in both offline (0.69% AUC lift) and online KPI performance metric (2% Cost per Action reduction).
[LG-51] Generative molecule evolution using 3D pharmacophore for efficient Structure-Based Drug Design
链接: https://arxiv.org/abs/2507.20130
作者: Yi He,Ailun Wang,Zhi Wang,Yu Liu,Xingyuan Xu,Wen Yan
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Recent advances in generative models, particularly diffusion and auto-regressive models, have revolutionized fields like computer vision and natural language processing. However, their application to structure-based drug design (SBDD) remains limited due to critical data constraints. To address the limitation of training data for models targeting SBDD tasks, we propose an evolutionary framework named MEVO, which bridges the gap between the billion-scale small-molecule dataset and the scarce protein-ligand complex dataset and effectively increases the abundance of training data for generative SBDD models. MEVO is composed of three key components: a high-fidelity VQ-VAE for molecule representation in latent space, a diffusion model for pharmacophore-guided molecule generation, and a pocket-aware evolutionary strategy for molecule optimization with a physics-based scoring function. This framework efficiently generates high-affinity binders for various protein targets, validated with predicted binding affinities using free energy perturbation (FEP) methods. In addition, we showcase the capability of MEVO in designing potent inhibitors of KRAS G12D, a challenging target in cancer therapeutics, with similar affinity to the known highly active inhibitor as evaluated by FEP calculations. With high versatility and generalizability, MEVO offers an effective and data-efficient model for various tasks in structure-based ligand design.
[LG-52] Wine Characterisation with Spectral Information and Predictive Artificial Intelligence
链接: https://arxiv.org/abs/2507.20114
作者: Jianping Yao,Son N. Tran,Hieu Nguyen,Samantha Sawyer,Rocco Longo
类目: Machine Learning (cs.LG)
*备注:
Abstract:The purpose of this paper is to use absorbance data obtained by human tasting and an ultraviolet-visible (UV-Vis) scanning spectrophotometer to predict the attributes of grape juice (GJ) and to classify the wine’s origin, respectively. The approach combined machine learning (ML) techniques with spectroscopy to find a relatively simple way to apply them in two stages of winemaking and help improve the traditional wine analysis methods regarding sensory data and wine’s origins. This new technique has overcome the disadvantages of the complex sensors by taking advantage of spectral fingerprinting technology and forming a comprehensive study of the employment of AI in the wine analysis domain. In the results, Support Vector Machine (SVM) was the most efficient and robust in both attributes and origin prediction tasks. Both the accuracy and F1 score of the origin prediction exceed 91%. The feature ranking approach found that the more influential wavelengths usually appear at the lower end of the scan range, 250 nm (nanometers) to 420 nm, which is believed to be of great help for selecting appropriate validation methods and sensors to extract wine data in future research. The knowledge of this research provides new ideas and early solutions for the wine industry or other beverage industries to integrate big data and IoT in the future, which significantly promotes the development of ‘Smart Wineries’.
[LG-53] Graded Transformers: A Symbolic-Geometric Approach to Structured Learning
链接: https://arxiv.org/abs/2507.20108
作者: Tony Shaska Sr
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
Abstract:We introduce the Graded Transformer framework, a novel class of sequence models that embeds algebraic inductive biases through grading transformations on vector spaces. Extending the theory of Graded Neural Networks (GNNs), we propose two architectures: the Linearly Graded Transformer (LGT) and the Exponentially Graded Transformer (EGT). These models apply parameterized scaling operators, governed by fixed or learnable grading tuples and, for EGT, exponential factors, to infuse hierarchical structure into attention and representation layers, enhancing efficiency for structured data. We derive rigorous theoretical guarantees, including universal approximation theorems for continuous and Sobolev functions, reduced sample complexity via effective VC dimension bounds, Lipschitz continuity of graded operations, and robustness to adversarial perturbations. A graded loss function ensures gradient stability and alignment with domain priors during optimization. By treating grades as differentiable parameters, the framework enables adaptive feature prioritization, overcoming limitations of fixed grades in prior work. The Graded Transformer holds transformative potential for hierarchical learning and neurosymbolic reasoning, with applications spanning algebraic geometry (e.g., moduli spaces and zeta functions), physics (e.g., multiscale simulations), natural language processing (e.g., syntactic parsing), biological sequence analysis (e.g., variant prediction), and emerging areas like graph neural networks and financial modeling. This work advances structured deep learning by fusing geometric and algebraic principles with attention mechanisms, offering a mathematically grounded alternative to data-driven models and paving the way for interpretable, efficient systems in complex domains.
[LG-54] Meta Fusion: A Unified Framework For Multimodality Fusion with Mutual Learning
Link: https://arxiv.org/abs/2507.20089
Authors: Ziyi Liang,Annie Qu,Babak Shahbaba
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Abstract:Developing effective multimodal data fusion strategies has become increasingly essential for improving the predictive power of statistical machine learning methods across a wide range of applications, from autonomous driving to medical diagnosis. Traditional fusion methods, including early, intermediate, and late fusion, integrate data at different stages, each offering distinct advantages and limitations. In this paper, we introduce Meta Fusion, a flexible and principled framework that unifies these existing strategies as special cases. Motivated by deep mutual learning and ensemble learning, Meta Fusion constructs a cohort of models based on various combinations of latent representations across modalities, and further boosts predictive performance through soft information sharing within the cohort. Our approach is model-agnostic in learning the latent representations, allowing it to flexibly adapt to the unique characteristics of each modality. Theoretically, our soft information sharing mechanism reduces the generalization error. Empirically, Meta Fusion consistently outperforms conventional fusion strategies in extensive simulation studies. We further validate our approach on real-world applications, including Alzheimer’s disease detection and neural decoding.
[LG-55] Feed-anywhere ANN (I) Steady Discrete to Diffusing on Graph Hidden States
Link: https://arxiv.org/abs/2507.20088
Authors: Dmitry Pasechnyuk-Vilensky,Daniil Doroshenko
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 11 pages, 1 algorithm
Abstract:We propose a novel framework for learning hidden graph structures from data using geometric analysis and nonlinear dynamics. Our approach: (1) Defines discrete Sobolev spaces on graphs for scalar/vector fields, establishing key functional properties; (2) Introduces gauge-equivalent nonlinear Schrödinger and Landau–Lifshitz dynamics with provable stable stationary solutions smoothly dependent on input data and graph weights; (3) Develops a stochastic gradient algorithm over graph moduli spaces with sparsity regularization. Theoretically, we guarantee: topological correctness (homology recovery), metric convergence (Gromov–Hausdorff), and efficient search space utilization. Our dynamics-based model achieves stronger generalization bounds than standard neural networks, with complexity dependent on the data manifold’s topology.
[LG-56] Cluster Purge Loss: Structuring Transformer Embeddings for Equivalent Mutants Detection
Link: https://arxiv.org/abs/2507.20078
Authors: Adelaide Danilov,Aria Nourbakhsh,Christoph Schommer
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 6 figures
Abstract:Recent pre-trained transformer models achieve superior performance in various code processing objectives. However, although effective at optimizing decision boundaries, common approaches for fine-tuning them for downstream classification tasks - distance-based methods or training an additional classification head - often fail to thoroughly structure the embedding space to reflect nuanced intra-class semantic relationships. Equivalent code mutant detection is one of these tasks, where the quality of the embedding space is crucial to the performance of the models. We introduce a novel framework that integrates cross-entropy loss with a deep metric learning objective, termed Cluster Purge Loss. This objective, unlike conventional approaches, concentrates on adjusting fine-grained differences within each class, encouraging the separation of instances based on semantical equivalency to the class center using dynamically adjusted borders. Employing UniXCoder as the base model, our approach demonstrates state-of-the-art performance in the domain of equivalent mutant detection and produces a more interpretable embedding space.
[LG-57] Sparse Equation Matching: A Derivative-Free Learning for General-Order Dynamical Systems
Link: https://arxiv.org/abs/2507.20072
Authors: Jiaqiang Li,Jianbin Tan,Xueqin Wang
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Equation discovery is a fundamental learning task for uncovering the underlying dynamics of complex systems, with wide-ranging applications in areas such as brain connectivity analysis, climate modeling, gene regulation, and physical system simulation. However, many existing approaches rely on accurate derivative estimation and are limited to first-order dynamical systems, restricting their applicability to real-world scenarios. In this work, we propose sparse equation matching (SEM), a unified framework that encompasses several existing equation discovery methods under a common formulation. SEM introduces an integral-based sparse regression method using Green’s functions, enabling derivative-free estimation of differential operators and their associated driving functions in general-order dynamical systems. The effectiveness of SEM is demonstrated through extensive simulations, benchmarking its performance against derivative-based approaches. We then apply SEM to electroencephalographic (EEG) data recorded during multiple oculomotor tasks, collected from 52 participants in a brain-computer interface experiment. Our method identifies active brain regions across participants and reveals task-specific connectivity patterns. These findings offer valuable insights into brain connectivity and the underlying neural mechanisms.
[LG-58] PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data
Link: https://arxiv.org/abs/2507.20068
Authors: Aishwarya Mandyam,Jason Meng,Ge Gao,Jiankai Sun,Mac Schwager,Barbara E. Engelhardt,Emma Brunskill
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation for OPE in RL lack principled uncertainty quantification. In high stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state V^\pi(s_0) – such intervals are particularly important for human-centered applications. To do so we introduce a new conformal prediction method for high dimensional state MDPs. Second, we consider the more common task of estimating the average policy performance over many initial states; to do so we draw on ideas from doubly robust estimation and prediction powered inference. Across simulators spanning robotics, healthcare and inventory management, and a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground truth values, unlike previously proposed methods.
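Illustration for [LG-58]: the abstract does not spell out the new conformal method, so the sketch below shows only standard split conformal prediction as background, not the authors' procedure. It wraps a point estimate of a policy value in an interval built from calibration residuals. The function name, the absolute-residual score, and the toy numbers are all assumptions made for illustration.

```python
import math


def split_conformal_interval(calib_preds, calib_targets, new_pred, alpha=0.1):
    """Standard split conformal interval around a point prediction.

    calib_preds / calib_targets: predictions and true values on held-out
    calibration data (hypothetical inputs). alpha is the miscoverage level.
    """
    # Nonconformity score: absolute residual on each calibration point.
    scores = sorted(abs(y - p) for p, y in zip(calib_preds, calib_targets))
    n = len(scores)
    # Rank of the finite-sample-corrected quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)


if __name__ == "__main__":
    preds = [0.0, 0.0, 0.0]
    targets = [1.0, -2.0, 3.0]
    print(split_conformal_interval(preds, targets, new_pred=10.0, alpha=0.5))
```

With more calibration points the interval tightens toward the empirical residual quantile; with too few, the function returns an unbounded interval rather than overstating confidence.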
[LG-59] Geometric Operator Learning with Optimal Transport
Link: https://arxiv.org/abs/2507.20065
Authors: Xinyi Li,Zongyi Li,Nikola Kovachki,Anima Anandkumar
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We propose integrating optimal transport (OT) into operator learning for partial differential equations (PDEs) on complex geometries. Classical geometric learning methods typically represent domains as meshes, graphs, or point clouds. Our approach generalizes discretized meshes to mesh density functions, formulating geometry embedding as an OT problem that maps these functions to a uniform density in a reference space. Compared to previous methods relying on interpolation or shared deformation, our OT-based method employs instance-dependent deformation, offering enhanced flexibility and effectiveness. For 3D simulations focused on surfaces, our OT-based neural operator embeds the surface geometry into a 2D parameterized latent space. By performing computations directly on this 2D representation of the surface manifold, it achieves significant computational efficiency gains compared to volumetric simulation. Experiments with Reynolds-averaged Navier-Stokes equations (RANS) on the ShapeNet-Car and DrivAerNet-Car datasets show that our method achieves better accuracy and also reduces computational expenses in terms of both time and memory usage compared to existing machine learning models. Additionally, our model demonstrates significantly improved accuracy on the FlowBench dataset, underscoring the benefits of employing instance-dependent deformation for datasets with highly variable geometries.
[LG-60] Strategic Filtering for Content Moderation: Free Speech or Free of Distortion?
Link: https://arxiv.org/abs/2507.20061
Authors: Saba Ahmadi,Avrim Blum,Haifeng Xu,Fan Yao
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:User-generated content (UGC) on social media platforms is vulnerable to incitements and manipulations, necessitating effective regulations. To address these challenges, those platforms often deploy automated content moderators tasked with evaluating the harmfulness of UGC and filtering out content that violates established guidelines. However, such moderation inevitably gives rise to strategic responses from users, who strive to express themselves within the confines of guidelines. Such phenomena call for a careful balance between: 1. ensuring freedom of speech – by minimizing the restriction of expression; and 2. reducing social distortion – measured by the total amount of content manipulation. We tackle the problem of optimizing this balance through the lens of mechanism design, aiming at optimizing the trade-off between minimizing social distortion and maximizing free speech. Although determining the optimal trade-off is NP-hard, we propose practical methods to approximate the optimal solution. Additionally, we provide generalization guarantees determining the amount of finite offline data required to approximate the optimal moderator effectively.
[LG-61] ModShift: Model Privacy via Designed Shifts
Link: https://arxiv.org/abs/2507.20060
Authors: Nomaan A. Kherani,Urbashi Mitra
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: To appear in the 2025 Asilomar Conference on Signals, Systems and Computers
Abstract:In this paper, shifts are introduced to preserve model privacy against an eavesdropper in federated learning. Model learning is treated as a parameter estimation problem. This perspective allows us to derive the Fisher Information matrix of the model updates from the shifted updates and drive them to singularity, thus posing a hard estimation problem for Eve. The shifts are securely shared with the central server to maintain model accuracy at the server and participating devices. A convergence test is proposed to detect if model updates have been tampered with and we show that our scheme passes this test. Numerical results show that our scheme achieves a higher model shift when compared to a noise injection scheme while requiring a lesser bandwidth secret channel.
[LG-62] What Can Grokking Teach Us About Learning Under Nonstationarity?
Link: https://arxiv.org/abs/2507.20057
Authors: Clare Lyle,Gharda Sokar,Razvan Pascanu,Andras Gyorgy
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In continual learning problems, it is often necessary to overwrite components of a neural network’s learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network’s ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e. the ratio between parameter and update norms. We show that this approach both facilitates feature-learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
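Illustration for [LG-62]: a minimal sketch of the "effective learning rate" quantity named in the abstract. It assumes the ratio is taken as update norm over parameter norm and that both are measured with the L2 norm over a flattened vector; the abstract fixes neither detail, and the function names are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def effective_learning_rate(params, update):
    """Ratio of update norm to parameter norm (assumed orientation).

    Assumes params is not the all-zero vector.
    """
    return l2_norm(update) / l2_norm(params)


if __name__ == "__main__":
    params = [3.0, 4.0]   # norm 5
    update = [0.3, 0.4]   # norm 0.5
    print(effective_learning_rate(params, update))
    # Halving the parameter norm with the same update doubles the ratio.
    print(effective_learning_rate([1.5, 2.0], update))
```

Under this reading, the ratio can be raised either by enlarging the update or by shrinking the parameter norm, which is why the two norms are tracked together rather than the optimizer step size alone.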
[LG-63] Improving Deep Learning-based Respiratory Sound Analysis with Frequency Selection and Attention Mechanism
Link: https://arxiv.org/abs/2507.20052
Authors: Nouhaila Fraihi,Ouassim Karrakchou,Mounir Ghogho
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments:
Abstract:Accurate classification of respiratory sounds requires deep learning models that effectively capture fine-grained acoustic features and long-range temporal dependencies. Convolutional Neural Networks (CNNs) are well-suited for extracting local time-frequency patterns but are limited in modeling global context. In contrast, transformer-based models can capture long-range dependencies, albeit with higher computational demands. To address these limitations, we propose a compact CNN-Temporal Self-Attention (CNN-TSA) network that integrates lightweight self-attention into an efficient CNN backbone. Central to our approach is a Frequency Band Selection (FBS) module that suppresses noisy and non-informative frequency regions, substantially improving accuracy and reducing FLOPs by up to 50%. We also introduce age-specific models to enhance robustness across diverse patient groups. Evaluated on the SPRSound-2022/2023 and ICBHI-2017 lung sound datasets, CNN-TSA with FBS sets new benchmarks on SPRSound and achieves state-of-the-art performance on ICBHI, all with a significantly smaller computational footprint. Furthermore, integrating FBS into an existing transformer baseline yields a new record on ICBHI, confirming FBS as an effective drop-in enhancement. These results demonstrate that our framework enables reliable, real-time respiratory sound analysis suitable for deployment in resource-constrained settings.
[LG-64] Improving Audio Classification by Transitioning from Zero- to Few-Shot INTERSPEECH2025
Link: https://arxiv.org/abs/2507.20036
Authors: James Taylor,Wolfgang Mack
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Submitted to Interspeech 2025
Abstract:State-of-the-art audio classification often employs a zero-shot approach, which involves comparing audio embeddings with embeddings from text describing the respective audio class. These embeddings are usually generated by neural networks trained through contrastive learning to align audio and text representations. Identifying the optimal text description for an audio class is challenging, particularly when the class comprises a wide variety of sounds. This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach. Specifically, audio embeddings are grouped by class and processed to replace the inherently noisy text embeddings. Our results demonstrate that few-shot classification typically outperforms the zero-shot baseline.
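Illustration for [LG-64]: the abstract says audio embeddings are "grouped by class and processed" to replace text embeddings but does not say how. The sketch below assumes the simplest option, averaging each class's few-shot embeddings into a prototype and classifying a query by cosine similarity. The function names and the toy 2-D embeddings are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_prototypes(embeddings, labels):
    """Average the few-shot audio embeddings of each class."""
    grouped = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        grouped[label].append(emb)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(query, prototypes):
    """Return the class whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


if __name__ == "__main__":
    shots = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    labels = ["dog", "dog", "siren", "siren"]
    protos = class_prototypes(shots, labels)
    print(classify([0.8, 0.2], protos))
```

In the zero-shot setup the `prototypes` dictionary would instead hold one text embedding per class; the classification step is unchanged, which is what makes the swap cheap.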
[LG-65] Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion
Link: https://arxiv.org/abs/2507.19991
Authors: Hei Shing Cheung,Boya Zhang
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures
Abstract:We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses critical limitations in existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps, enabling efficient capture of multi-scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220 times parameter reduction compared to state-of-the-art systems while delivering 52 times faster inference. Experimental evaluation demonstrates competitive performance with only 15M parameters, outperforming OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence. The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.
[LG-66] Visual Analytics Using Tensor Unified Linear Comparative Analysis IEEE-VIS2025
Link: https://arxiv.org/abs/2507.19988
Authors: Naoki Okami,Kazuki Miyake,Naohisa Sakamoto,Jorji Nonaka,Takanori Fujiwara
Subjects: Human-Computer Interaction (cs.HC); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: To appear in IEEE Transactions on Visualization and Computer Graphics and IEEE VIS 2025
Abstract:Comparing tensors and identifying their (dis)similar structures is fundamental in understanding the underlying phenomena for complex data. Tensor decomposition methods help analysts extract tensors’ essential characteristics and aid in visual analytics for tensors. In contrast to dimensionality reduction (DR) methods designed only for analyzing a matrix (i.e., second-order tensor), existing tensor decomposition methods do not support flexible comparative analysis. To address this analysis limitation, we introduce a new tensor decomposition method, named tensor unified linear comparative analysis (TULCA), by extending its DR counterpart, ULCA, for tensor analysis. TULCA integrates discriminant analysis and contrastive learning schemes for tensor decomposition, enabling flexible comparison of tensors. We also introduce an effective method to visualize a core tensor extracted from TULCA into a set of 2D visualizations. We integrate TULCA’s functionalities into a visual analytics interface to support analysts in interpreting and refining the TULCA results. We demonstrate the efficacy of TULCA and the visual analytics interface with computational evaluations and two case studies, including an analysis of log data collected from a supercomputer.
[LG-67] Who Owns This Sample: Cross-Client Membership Inference Attack in Federated Graph Neural Networks
Link: https://arxiv.org/abs/2507.19964
Authors: Kunhao Li,Di Wu,Jun Bai,Jing Xu,Lei Yang,Ziyi Zhang,Yiliao Song,Wencheng Yang,Taotao Cai,Yan Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Graph-structured data is prevalent in many real-world applications, including social networks, financial systems, and molecular biology. Graph Neural Networks (GNNs) have become the de facto standard for learning from such data due to their strong representation capabilities. As GNNs are increasingly deployed in federated learning (FL) settings to preserve data locality and privacy, new privacy threats arise from the interaction between graph structures and decentralized training. In this paper, we present the first systematic study of cross-client membership inference attacks (CC-MIA) against node classification tasks of federated GNNs (FedGNNs), where a malicious client aims to infer which client owns the given data. Unlike prior centralized-focused work that focuses on whether a sample was included in training, our attack targets sample-to-client attribution, a finer-grained privacy risk unique to federated settings. We design a general attack framework that exploits FedGNNs’ aggregation behaviors, gradient updates, and embedding proximity to link samples to their source clients across training rounds. We evaluate our attack across multiple graph datasets under realistic FL setups. Results show that our method achieves high performance on both membership inference and ownership identification. Our findings highlight a new privacy threat in federated graph learning: client identity leakage through structural and model-level cues, motivating the need for attribution-robust GNN design.
[LG-68] A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction
Link: https://arxiv.org/abs/2507.19894
Authors: Xiaohua Feng,Jiaming Zhang,Fengyuan Yu,Chengye Wang,Li Zhang,Kaixiang Li,Yuyuan Li,Chaochao Chen,Jianwei Yin
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:With the rapid advancement of generative models, associated privacy concerns have attracted growing attention. To address this, researchers have begun adapting machine unlearning techniques from traditional classification models to generative settings. Although notable progress has been made in this area, a unified framework for systematically organizing and integrating existing work is still lacking. The substantial differences among current studies in terms of unlearning objectives and evaluation protocols hinder the objective and fair comparison of various approaches. While some studies focus on specific types of generative models, they often overlook the commonalities and systematic characteristics inherent in Generative Model Unlearning (GenMU). To bridge this gap, we provide a comprehensive review of current research on GenMU and propose a unified analytical framework for categorizing unlearning objectives, methodological strategies, and evaluation metrics. In addition, we explore the connections between GenMU and related techniques, including model editing, reinforcement learning from human feedback, and controllable generation. We further highlight the potential practical value of unlearning techniques in real-world applications. Finally, we identify key challenges and outline future research directions aimed at laying a solid foundation for further advancements in this field. We consistently maintain the related open-source materials at this https URL.
[LG-69] RestoreAI - Pattern-based Risk Estimation Of Remaining Explosives
Link: https://arxiv.org/abs/2507.19873
Authors: Björn Kischelewski,Benjamin Guedj,David Wahl
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
Comments:
Abstract:Landmine removal is a slow, resource-intensive process affecting over 60 countries. While AI has been proposed to enhance explosive ordnance (EO) detection, existing methods primarily focus on object recognition, with limited attention to prediction of landmine risk based on spatial pattern information. This work aims to answer the following research question: How can AI be used to predict landmine risk from landmine patterns to improve clearance time efficiency? To that effect, we introduce RestoreAI, an AI system for pattern-based risk estimation of remaining explosives. RestoreAI is the first AI system that leverages landmine patterns for risk prediction, improving the accuracy of estimating the residual risk of missing EO prior to land release. We particularly focus on the implementation of three instances of RestoreAI, respectively, linear, curved and Bayesian pattern deminers. First, the linear pattern deminer uses linear landmine patterns from a principal component analysis (PCA) for the landmine risk prediction. Second, the curved pattern deminer uses curved landmine patterns from principal curves. Finally, the Bayesian pattern deminer incorporates prior expert knowledge by using a Bayesian pattern risk prediction. Evaluated on real-world landmine data, RestoreAI significantly boosts clearance efficiency. The top-performing pattern-based deminers achieved a 14.37 percentage point increase in the average share of cleared landmines per timestep and required 24.45% less time than the best baseline deminer to locate all landmines. Interestingly, linear and curved pattern deminers showed no significant performance difference, suggesting that more efficient linear patterns are a viable option for risk prediction.
[LG-70] Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning
Link: https://arxiv.org/abs/2507.19855
Authors: Aditya Sharma,Linh Nguyen,Ananya Gupta,Chengyu Wang,Chiamaka Adebayo,Jakub Kowalski
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 4 figures
Abstract:Large Language Models (LLMs), despite their advanced linguistic capabilities, fundamentally lack an intuitive understanding of physical dynamics, which limits their effectiveness in real-world scenarios that require causal reasoning. In this paper, we introduce Causal World Model Induction (CWMI), a novel framework designed to embed an explicit model of causal physics within an LLM. Our approach incorporates a dedicated Causal Physics Module (CPM) and a new training objective called Causal Intervention Loss, encouraging the model to learn cause-and-effect relationships from multimodal data. By training the model to predict the outcomes of hypothetical interventions instead of merely capturing statistical correlations, CWMI develops a robust internal representation of physical laws. Experimental results show that CWMI significantly outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks, including the PIQA benchmark and our newly proposed PhysiCa-Bench dataset. These findings demonstrate that inducing a causal world model is a critical step toward more reliable and generalizable AI systems.
[LG-71] A Scalable and High Availability Solution for Recommending Resolutions to Problem Tickets
Link: https://arxiv.org/abs/2507.19846
Authors: Harish S,Chetana K Nayak,Joy Bose
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 9 pages, 7 figures
Abstract:Resolution of incidents or problem tickets is a common theme in service industries in any sector, including billing and charging systems in the telecom domain. Machine learning can help to identify patterns and suggest resolutions for the problem tickets, based on patterns in the historical data of the tickets. However, this process may be complicated by phenomena such as data drift and by issues such as missing data, lack of data pertaining to resolutions of past incidents, and too many similar-sounding resolutions arising from free text. This paper proposes a robust ML-driven solution employing clustering, supervised learning, and advanced NLP models to tackle these challenges effectively. Building on previous work, we demonstrate clustering-based resolution identification and supervised classification with LDA, Siamese networks, one-shot learning, and index embedding. Additionally, we present a real-time dashboard and a highly available Kubernetes-based production deployment. Our experiments with both the open-source Bitext customer-support dataset and proprietary telecom datasets demonstrate high prediction accuracy.
[LG-72] Debunking Optimization Myths in Federated Learning for Medical Image Classification MICCAI2025
Link: https://arxiv.org/abs/2507.19822
Authors: Youngjoon Lee,Hyukjoon Lee,Jinu Gong,Yang Cao,Joonhyuk Kang
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: Accepted to Efficient Medical AI Workshop - MICCAI 2025
Abstract:Federated Learning (FL) is a collaborative learning method that enables decentralized model training while preserving data privacy. Despite its promise in medical imaging, recent FL methods are often sensitive to local factors such as optimizers and learning rates, limiting their robustness in practical deployments. In this work, we revisit vanilla FL to clarify the impact of edge device configurations, benchmarking recent FL methods on colorectal pathology and blood cell classification tasks. We numerically show that the choice of local optimizer and learning rate has a greater effect on performance than the specific FL method. Moreover, we find that increasing local training epochs can either enhance or impair convergence, depending on the FL method. These findings indicate that appropriate edge-specific configuration is more crucial than algorithmic complexity for achieving effective FL.
[LG-73] Analyzing and Mitigating Repetitions in Trip Recommendation SIGIR2024
Link: https://arxiv.org/abs/2507.19798
Authors: Wenzheng Shu,Kangqi Xu,Wenxin Tai,Ting Zhong,Yong Wang,Fan Zhou
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by ACM SIGIR 2024 Short Paper Track
Abstract:Trip recommendation has emerged as a highly sought-after service over the past decade. Although current studies significantly understand human intention consistency, they struggle with undesired repetitive outcomes that need resolution. We make two pivotal discoveries using statistical analyses and experimental designs: (1) The occurrence of repetitions is intricately linked to the models and decoding strategies. (2) During training and decoding, adding perturbations to logits can reduce repetition. Motivated by these observations, we introduce AR-Trip (Anti Repetition for Trip Recommendation), which incorporates a cycle-aware predictor comprising three mechanisms to avoid duplicate Points-of-Interest (POIs) and demonstrates their effectiveness in alleviating repetition. Experiments on four public datasets illustrate that AR-Trip successfully mitigates repetition issues while enhancing precision.
[LG-74] DOA: A Degeneracy Optimization Agent with Adaptive Pose Compensation Capability based on Deep Reinforcement Learning
Link: https://arxiv.org/abs/2507.19742
Authors: Yanbin Li,Canran Xiao,Hongyang He,Shenghai Yuan,Zong Ke,Jiajie Yu,Zixiong Qin,Zhiguo Zhang,Wenzheng Chi,Wei Zhang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 10 pages, 9 figures
Abstract:Particle filter-based 2D-SLAM is widely used in indoor localization tasks due to its efficiency. However, indoor environments such as long straight corridors can cause severe degeneracy problems in SLAM. In this paper, we use Proximal Policy Optimization (PPO) to train an adaptive degeneracy optimization agent (DOA) to address the degeneracy problem. We propose a systematic methodology to address three critical challenges in traditional supervised learning frameworks: (1) data acquisition bottlenecks in degenerate datasets, (2) inherent quality deterioration of training samples, and (3) ambiguity in annotation protocol design. We design a specialized reward function to guide the agent in developing perception capabilities for degenerate environments. Using the output degeneracy factor as a reference weight, the agent can dynamically adjust the contribution of different sensors to pose optimization. Specifically, the observation distribution is shifted towards the motion model distribution, with the step size determined by a linear interpolation formula related to the degeneracy factor. In addition, we employ a transfer learning module to endow the agent with generalization capabilities across different environments and address the inefficiency of training in degenerate environments. Finally, we conduct ablation studies to demonstrate the rationality of our model design and the role of transfer learning. We also compare the proposed DOA with SOTA methods to prove its superior degeneracy detection and optimization capabilities across various environments.
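Illustration for [LG-74]: the abstract mentions shifting the observation distribution toward the motion-model distribution by a linearly interpolated step, without giving the formula. The sketch below assumes only the distribution means are interpolated and that the degeneracy factor lies in [0, 1], with 1 meaning fully degenerate. The function name and pose layout (x, y, heading) are hypothetical.

```python
def compensate_pose(obs_mean, motion_mean, degeneracy):
    """Linearly interpolate from the observation mean to the motion-model mean.

    degeneracy = 0 keeps the scan-based observation estimate;
    degeneracy = 1 falls back entirely on the motion model (assumed convention).
    Heading is interpolated naively, so angle wrap-around is not handled.
    """
    d = min(max(degeneracy, 0.0), 1.0)  # clamp to [0, 1]
    return tuple((1.0 - d) * o + d * m for o, m in zip(obs_mean, motion_mean))


if __name__ == "__main__":
    observation = (0.0, 0.0, 0.0)   # pose estimate from scan matching
    motion = (2.0, 4.0, 1.0)        # pose estimate from odometry
    for factor in (0.0, 0.5, 1.0):
        print(factor, compensate_pose(observation, motion, factor))
```

In a long corridor, where scans barely constrain motion along the corridor axis, a factor near 1 would make the odometry-based estimate dominate under this convention.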
[LG-75] GSCache: Real-Time Radiance Caching for Volume Path Tracing using 3D Gaussian Splatting
Link: https://arxiv.org/abs/2507.19718
Authors: David Bauer,Qi Wu,Hamid Gadirov,Kwan-Liu Ma
Categories: Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:Real-time path tracing is rapidly becoming the standard for rendering in entertainment and professional applications. In scientific visualization, volume rendering plays a crucial role in helping researchers analyze and interpret complex 3D data. Recently, photorealistic rendering techniques have gained popularity in scientific visualization, yet they face significant challenges. One of the most prominent issues is slow rendering performance and high pixel variance caused by Monte Carlo integration. In this work, we introduce a novel radiance caching approach for path-traced volume rendering. Our method leverages advances in volumetric scene representation and adapts 3D Gaussian splatting to function as a multi-level, path-space radiance cache. This cache is designed to be trainable on the fly, dynamically adapting to changes in scene parameters such as lighting configurations and transfer functions. By incorporating our cache, we achieve less noisy, higher-quality images without increasing rendering costs. To evaluate our approach, we compare it against a baseline path tracer that supports uniform sampling and next-event estimation and the state-of-the-art for neural radiance caching. Through both quantitative and qualitative analyses, we demonstrate that our path-space radiance cache is a robust solution that is easy to integrate and significantly enhances the rendering quality of volumetric visualization applications while maintaining comparable computational efficiency.
[LG-76] Beyond Nearest Neighbors: Semantic Compression and Graph-Augmented Retrieval for Enhanced Vector Search
Link: https://arxiv.org/abs/2507.19715
Authors: Rahul Raja,Arpita Vats
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Vector databases typically rely on approximate nearest neighbor (ANN) search to retrieve the top-k closest vectors to a query in embedding space. While effective, this approach often yields semantically redundant results, missing the diversity and contextual richness required by applications such as retrieval-augmented generation (RAG), multi-hop QA, and memory-augmented agents. We introduce a new retrieval paradigm: semantic compression, which aims to select a compact, representative set of vectors that captures the broader semantic structure around a query. We formalize this objective using principles from submodular optimization and information geometry, and show that it generalizes traditional top-k retrieval by prioritizing coverage and diversity. To operationalize this idea, we propose graph-augmented vector retrieval, which overlays semantic graphs (e.g., kNN or knowledge-based links) atop vector spaces to enable multi-hop, context-aware search. We theoretically analyze the limitations of proximity-based retrieval under high-dimensional concentration and highlight how graph structures can improve semantic coverage. Our work outlines a foundation for meaning-centric vector search systems, emphasizing hybrid indexing, diversity-aware querying, and structured semantic retrieval. We make our implementation publicly available to foster future research in this area.
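To make the coverage-and-diversity idea concrete, here is a minimal sketch of greedy selection under a facility-location objective, a standard monotone submodular function. It is an illustrative stand-in, not the authors' formulation: the function names are hypothetical, the candidate pool is assumed to come from an ordinary ANN lookup, and negative cosine similarities are clipped to zero.

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_cover(candidates, k):
    """Greedily pick up to k indices that best 'cover' the candidate pool.

    Objective: f(S) = sum_i max_{j in S} max(sim(i, j), 0).
    """
    n = len(candidates)
    sim = [[cosine(candidates[i], candidates[j]) for j in range(n)]
           for i in range(n)]
    best = [0.0] * n  # how well each candidate is covered so far
    selected = []
    for _ in range(min(k, n)):
        top_gain, top_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j to the selected set.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > top_gain:
                top_gain, top_j = gain, j
        selected.append(top_j)
        best = [max(best[i], sim[i][top_j]) for i in range(n)]
    return selected
```

For the pool `[(1, 0), (0.99, 0.1), (0, 1)]` with `k=2`, the sketch skips the near-duplicate of its first pick and selects the orthogonal vector instead, which is the behavior plain top-k by distance would not give.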
[LG-77] A Lightweight Deep Learning-based Model for Ranking Influential Nodes in Complex Networks
Link: https://arxiv.org/abs/2507.19702
Authors: Mohammed A. Ramadhan,Abdulhakeem O. Mohammed
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments:
Abstract:Identifying influential nodes in complex networks is a critical task with a wide range of applications across different domains. However, existing approaches often face trade-offs between accuracy and computational efficiency. To address these challenges, we propose 1D-CGS, a lightweight and effective hybrid model that integrates the speed of one-dimensional convolutional neural networks (1D-CNN) with the topological representation power of GraphSAGE for efficient node ranking. The model uses a lightweight input representation built on two straightforward and significant topological features: node degree and average neighbor degree. These features are processed through 1D convolutions to extract local patterns, followed by GraphSAGE layers to aggregate neighborhood information. We formulate the node ranking task as a regression problem and use the Susceptible-Infected-Recovered (SIR) model to generate ground truth influence scores. 1D-CGS is initially trained on synthetic networks generated by the Barabasi-Albert model and then applied to real-world networks for identifying influential nodes. Experimental evaluations on twelve real-world networks demonstrate that 1D-CGS significantly outperforms traditional centrality measures and recent deep learning models in ranking accuracy, while running very fast. The proposed model achieves an average improvement of 4.73% in Kendall’s Tau correlation and 7.67% in Jaccard Similarity over the best-performing deep learning baselines. It also achieves an average Monotonicity Index (MI) score of 0.99 and produces near-perfect rank distributions, indicating highly unique and discriminative rankings. Furthermore, all experiments confirm that 1D-CGS operates in a highly reasonable time, running significantly faster than existing deep learning methods, making it suitable for large-scale applications.
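The two input features named in this abstract, node degree and average neighbor degree, are cheap to compute. The sketch below assumes an undirected graph stored as a dict of neighbor sets; the function name `degree_features` is hypothetical, and the convolutional and GraphSAGE stages of the model are not shown.

```python
def degree_features(adj):
    """Return {node: (degree, average neighbor degree)}.

    adj maps each node to the set of its neighbors (undirected graph).
    Isolated nodes get an average neighbor degree of 0.0 by convention.
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    feats = {}
    for v, nbrs in adj.items():
        avg = sum(deg[u] for u in nbrs) / len(nbrs) if nbrs else 0.0
        feats[v] = (deg[v], avg)
    return feats
```

On a star graph with one hub and three leaves, the hub gets `(3, 1.0)` and each leaf gets `(1, 3.0)`, so the second feature already hints that the leaves sit next to a well-connected node.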
[LG-78] Disjoint Generative Models
Link: https://arxiv.org/abs/2507.19700
Authors: Anton Danholt Lautrup,Muhammad Rajabinasab,Tobias Hyrup,Arthur Zimek,Peter Schneider-Kamp
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We propose a new framework for generating cross-sectional synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices that one may make. The principal benefit of disjoint generative models is significantly increased privacy at only a low utility cost. Additional findings include increased effectiveness and feasibility for certain model types and the possibility for mixed-model synthesis.
[LG-79] NAICS-Aware Graph Neural Networks for Large-Scale POI Co-visitation Prediction: A Multi-Modal Dataset and Methodology
Link: https://arxiv.org/abs/2507.19697
Authors: Yazeed Alrubyli,Omar Alomeir,Abrar Wafa,Diána Hidvégi,Hend Alrasheed,Mohsen Bahrami
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Understanding where people go after visiting one business is crucial for urban planning, retail analytics, and location-based services. However, predicting these co-visitation patterns across millions of venues remains challenging due to extreme data sparsity and the complex interplay between spatial proximity and business relationships. Traditional approaches using only geographic distance fail to capture why coffee shops attract different customer flows than fine dining restaurants, even when co-located. We introduce NAICS-aware GraphSAGE, a novel graph neural network that integrates business taxonomy knowledge through learnable embeddings to predict population-scale co-visitation patterns. Our key insight is that business semantics, captured through detailed industry codes, provide crucial signals that pure spatial models cannot explain. The approach scales to massive datasets (4.2 billion potential venue pairs) through efficient state-wise decomposition while combining spatial, temporal, and socioeconomic features in an end-to-end framework. Evaluated on our POI-Graph dataset comprising 94.9 million co-visitation records across 92,486 brands and 48 US states, our method achieves significant improvements over state-of-the-art baselines: the R-squared value increases from 0.243 to 0.625 (a 157 percent improvement), with strong gains in ranking quality (32 percent improvement in NDCG at 10).
[LG-80] Feature learning is decoupled from generalization in high capacity neural networks
Link: https://arxiv.org/abs/2507.19680
Authors: Niclas Alexander Göring,Charles London,Abdurrahman Hadi Erturk,Chris Mingard,Yoonsoo Nam,Ard A. Louis
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Neural networks outperform kernel methods, sometimes by orders of magnitude, e.g. on staircase functions. This advantage stems from the ability of neural networks to learn features, adapting their hidden representations to better capture the data. We introduce a concept we call feature quality to measure this performance improvement. We examine existing theories of feature learning and demonstrate empirically that they primarily assess the strength of feature learning, rather than the quality of the learned features themselves. Consequently, current theories of feature learning do not provide a sufficient foundation for developing theories of neural network generalization.
[LG-81] Directly Learning Stock Trading Strategies Through Profit Guided Loss Functions
Link: https://arxiv.org/abs/2507.19639
Authors: Devroop Kar,Zimeng Lyu,Sheeraja Rajakrishnan,Hao Zhang,Alex Ororbia,Travis Desell,Daniel Krutz
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages, 4 figures, Submitted to Neural Information Processing Systems 2025
Abstract:Stock trading has always been a challenging task due to the highly volatile nature of the stock market. Making sound trading decisions to generate profit is particularly difficult under such conditions. To address this, we propose four novel loss functions to drive decision-making for a portfolio of stocks. These functions account for the potential profits or losses with respect to buying or shorting respective stocks, enabling potentially any artificial neural network to directly learn an effective trading strategy. Despite the high volatility in stock market fluctuations over time, training time-series models such as transformers on these loss functions resulted in trading strategies that generated significant profits on a portfolio of 50 different S&P 500 company stocks as compared to benchmark reinforcement learning techniques and a baseline buy and hold method. As an example, using 2021, 2022 and 2023 as three test periods, the Crossformer model adapted with our best loss function was most consistent, resulting in returns of 51.42%, 51.04% and 48.62% respectively. In comparison, the best performing state-of-the-art reinforcement learning methods, PPO and DDPG, only delivered maximum profits of around 41%, 2.81% and 41.58% for the same periods. The code is available at this https URL.
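The abstract does not spell out the four loss functions, so the following is only a generic illustration of what a profit-guided loss can look like. It assumes the model outputs one position per stock in [-1, 1] (negative meaning short) and that next-period returns are known at training time; the name `negative_profit_loss` is hypothetical and this is not one of the paper's losses.

```python
def negative_profit_loss(positions, returns):
    """Mean negative profit of a set of positions.

    positions: model outputs in [-1, 1]; negative values short the stock.
    returns:   realized next-period returns for the same stocks.
    Minimizing this value maximizes the average realized profit.
    """
    if len(positions) != len(returns):
        raise ValueError("positions and returns must have equal length")
    profit = sum(p * r for p, r in zip(positions, returns))
    return -profit / len(positions)
```

Because the loss is linear in each position, its gradient with respect to a position is simply the negated return divided by the portfolio size, which is what lets a network learn buy or short decisions directly from price movements.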
[LG-82] Federated Calculation of the Free-Support Transportation Barycenter by Single-Loop Dual Decomposition
Link: https://arxiv.org/abs/2507.19627
Authors: Zhengqi Lin,Andrzej Ruszczyński
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:We propose an efficient federated dual decomposition algorithm for calculating the Wasserstein barycenter of several distributions, including choosing the support of the solution. The algorithm does not access local data and uses only highly aggregated information. It also does not require repeated solutions to mass transportation problems. Because of the absence of any matrix-vector operations, the algorithm exhibits a very low complexity of each iteration and significant scalability. We illustrate its virtues and compare it to the state-of-the-art methods on several examples of mixture models.
[LG-83] Harnessing intuitive local evolution rules for physical learning
Link: https://arxiv.org/abs/2507.19561
Authors: Roie Ezraty,Menachem Stern,Shmuel M. Rubinstein
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 6 figures (with appendices). Submitted to Physical Review E
Abstract:Machine Learning, however popular and accessible, is computationally intensive and highly power-consuming, prompting interest in alternative physical implementations of learning tasks. We introduce a training scheme for physical systems that minimize power dissipation in which only boundary parameters (i.e. inputs and outputs) are externally controlled. Using this scheme, these Boundary-Enabled Adaptive State Tuning Systems (BEASTS) learn by exploiting local physical rules. Our scheme, BEASTAL (BEAST-Adaline), is the closest analog of the Adaline algorithm for such systems. We demonstrate this autonomous learning in silico for regression and classification tasks. Our approach advances previous physical learning schemes by using intuitive, local evolution rules without requiring large-scale memory or complex internal architectures. BEASTAL can perform any linear task, achieving best performance when the local evolution rule is non-linear.
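For readers unfamiliar with the reference point, the classical Adaline update (the Widrow-Hoff least-mean-squares rule) can be sketched in a few lines. This is the textbook software algorithm that BEASTAL is described as an analog of, not the physical scheme itself; the function name and hyperparameters are illustrative.

```python
def adaline_fit(xs, ys, lr=0.05, epochs=500):
    """Fit a linear unit y ~ w.x + b with the Widrow-Hoff (LMS) rule.

    xs: list of input vectors, ys: list of scalar targets.
    """
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Error is measured on the linear output, before any threshold.
            err = y - (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b
```

Each update uses only the current input, the current output error, and the weight being changed, which is the kind of locality the abstract emphasizes.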
[LG-84] Latent Representations of Intracardiac Electrograms for Atrial Fibrillation Driver Detection
Link: https://arxiv.org/abs/2507.19547
Authors: Pablo Peiro-Corbacho,Long Lin,Pablo Ávila,Alejandro Carta-Bergaz,Ángel Arenal,Carlos Sevilla-Salcedo,Gonzalo R. Ríos-Muñoz
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:Atrial Fibrillation (AF) is the most prevalent sustained arrhythmia, yet current ablation therapies, including pulmonary vein isolation, are frequently ineffective in persistent AF due to the involvement of non-pulmonary vein drivers. This study proposes a deep learning framework using convolutional autoencoders for unsupervised feature extraction from unipolar and bipolar intracavitary electrograms (EGMs) recorded during AF in ablation studies. These latent representations of atrial electrical activity enable the characterization and automation of EGM analysis, facilitating the detection of AF drivers. The database consisted of 11,404 acquisitions recorded from 291 patients, containing 228,080 unipolar EGMs and 171,060 bipolar EGMs. The autoencoders successfully learned latent representations with low reconstruction loss, preserving the morphological features. The extracted embeddings allowed downstream classifiers to detect rotational and focal activity with moderate performance (AUC 0.73-0.76) and achieved high discriminative performance in identifying atrial EGM entanglement (AUC 0.93). The proposed method can operate in real-time and enables integration into clinical electroanatomical mapping systems to assist in identifying arrhythmogenic regions during ablation procedures. This work highlights the potential of unsupervised learning to uncover physiologically meaningful features from intracardiac signals.
[LG-85] Comparing Behavioural Cloning and Reinforcement Learning for Spacecraft Guidance and Control Networks
Link: https://arxiv.org/abs/2507.19535
Authors: Harry Holt,Sebastien Origer,Dario Izzo
Categories: Systems and Control (eess.SY); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments:
Abstract:Guidance control networks (GCNETs) provide a promising alternative to on-board guidance and control (GC) architectures for spacecraft, offering a differentiable, end-to-end representation of the guidance and control architecture. When training GCNETs, two predominant paradigms emerge: behavioural cloning (BC), which mimics optimal trajectories, and reinforcement learning (RL), which learns optimal behaviour through trial and error. Although both approaches have been adopted in GCNET-related literature, direct comparisons are notably absent. To address this, we conduct a systematic evaluation of BC and RL specifically for training GCNETs on continuous-thrust spacecraft trajectory optimisation tasks. We introduce a novel RL training framework tailored to GCNETs, incorporating decoupled action and control frequencies alongside reward redistribution strategies to stabilise training and to provide a fair comparison. Our results show that BC-trained GCNETs excel at closely replicating expert policy behaviour, and thus the optimal control structure of a deterministic environment, but can be negatively constrained by the quality and coverage of the training dataset. In contrast, RL-trained GCNETs, beyond demonstrating a superior adaptability to stochastic conditions, can also discover solutions that improve upon suboptimal expert demonstrations, sometimes revealing globally optimal strategies that eluded the generation of training samples.
[LG-86] Research on the application of graph data structure and graph neural network in node classification/clustering tasks
Link: https://arxiv.org/abs/2507.19527
Authors: Yihan Wang,Jianing Zhao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Graph-structured data are pervasive across domains including social networks, biological networks, and knowledge graphs. Due to their non-Euclidean nature, such data pose significant challenges to conventional machine learning methods. This study investigates graph data structures, classical graph algorithms, and Graph Neural Networks (GNNs), providing comprehensive theoretical analysis and comparative evaluation. Through comparative experiments, we quantitatively assess performance differences between traditional algorithms and GNNs in node classification and clustering tasks. Results show GNNs achieve substantial accuracy improvements of 43% to 70% over traditional methods. We further explore integration strategies between classical algorithms and GNN architectures, providing theoretical guidance for advancing graph representation learning research.
[LG-87] Kolmogorov Arnold Network Autoencoder in Medicine
Link: https://arxiv.org/abs/2507.19524
Authors: Ugo Lomoio,Pierangelo Veltri,Pietro Hiram Guzzi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Deep learning neural network architectures such as Multi Layer Perceptrons (MLP) and Convolutional blocks still play a crucial role in today's research advancements. From a topological point of view, these architectures may be represented as graphs in which we learn the functions related to the nodes while fixed edges convey the information from the input to the output. A recent work introduced a new architecture called Kolmogorov Arnold Networks (KAN) and reported that putting learnable activation functions on the edges of the neural network leads to better performance in multiple scenarios. Multiple studies are focusing on optimizing the KAN architecture by adding important features such as dropout regularization, Autoencoders (AE), model benchmarking and, last but not least, the KAN Convolutional Network (KCN) that introduced matrix convolution with KAN learning. This study aims to benchmark multiple versions of vanilla AEs (such as Linear, Convolutional and Variational) against their Kolmogorov-Arnold counterparts that have the same or fewer parameters. Using cardiological signals as model input, a total of five different classic AE tasks were studied: reconstruction, generation, denoising, inpainting and anomaly detection. The proposed experiments use a medical dataset, AbnormalHeartbeat, that contains audio signals obtained from the stethoscope.
[LG-88] Applications and Manipulations of Physics-Informed Neural Networks in Solving Differential Equations
Link: https://arxiv.org/abs/2507.19522
Authors: Aarush Gupta,Kendric Hsu,Syna Mathod
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Mathematical models in neural networks are powerful tools for solving complex differential equations and optimizing their parameters; that is, solving the forward and inverse problems, respectively. A forward problem predicts the output of a network for a given input by optimizing weights and biases. An inverse problem finds equation parameters or coefficients that effectively model the data. A Physics-Informed Neural Network (PINN) can solve both problems. PINNs inject prior analytical information about the data into the cost function to improve model performance outside the training set boundaries. This also allows PINNs to efficiently solve problems with sparse data without overfitting by extrapolating the model to fit larger trends in the data. The prior information we implement is in the form of differential equations. Residuals are the differences between the left-hand and right-hand sides of corresponding differential equations; PINNs minimize these residuals to effectively solve the differential equation and take advantage of prior knowledge. In this way, the solution and parameters are embedded into the loss function and optimized, allowing both the weights of the neural network and the model parameters to be found simultaneously, solving both the forward and inverse problems in the process. In this paper, we will create PINNs with residuals of varying complexity, beginning with linear and quadratic models and then expanding to fit models for the heat equation and other complex differential equations. We will mainly use Python as the computing language, using the PyTorch library to aid us in our research.
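The joint optimization described in this abstract (fit the data and minimize the equation residual at once) can be shown on the simplest case it mentions, a linear model. The sketch below is a deliberately reduced stand-in: a linear trial function u(x) = a·x + b replaces the neural network, its derivative is written analytically instead of being obtained by automatic differentiation, and the assumed differential equation is u'(x) = c with unknown coefficient c. All names are hypothetical and PyTorch is not used.

```python
def fit_linear_pinn(xs, ys, lr=0.1, steps=5000):
    """Jointly fit u(x) = a*x + b to data and recover c in u'(x) = c.

    Loss = mean((a*x + b - y)^2) + (a - c)^2
           data misfit            equation residual (u' = a for this u)
    Plain gradient descent updates a, b (the 'network') and c (the
    equation parameter) simultaneously.
    """
    a = b = c = 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = sum(2.0 * e * x for e, x in zip(errs, xs)) / n + 2.0 * (a - c)
        grad_b = sum(2.0 * e for e in errs) / n
        grad_c = -2.0 * (a - c)
        a -= lr * grad_a
        b -= lr * grad_b
        c -= lr * grad_c
    return a, b, c
```

With data sampled from y = 2x + 1 on [0, 1], the routine recovers a ≈ 2, b ≈ 1 (the forward solution) and c ≈ 2 (the inverse problem) in the same loop. A real PINN follows the same pattern with a network in place of the linear function and autodiff in place of the hand-written derivative.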
[LG-89] A Comparative Analysis of Traditional and Deep Learning Time Series Architectures for Influenza A Infectious Disease Forecasting
Link: https://arxiv.org/abs/2507.19515
Authors: Edmund F. Agyemang,Hansapani Rodrigo,Vincent Agbenyeavu
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:Influenza A is responsible for 290,000 to 650,000 respiratory deaths a year, though this estimate is an improvement from years past due to improvements in sanitation, healthcare practices, and vaccination programs. In this study, we perform a comparative analysis of traditional and deep learning models to predict Influenza A outbreaks. Using historical data from January 2009 to December 2023, we compared the performance of traditional ARIMA and Exponential Smoothing (ETS) models with six distinct deep learning architectures: Simple RNN, LSTM, GRU, BiLSTM, BiGRU, and Transformer. The results reveal a clear superiority of all the deep learning models, especially the state-of-the-art Transformer, which achieved average testing MSE and MAE of 0.0433 ± 0.0020 and 0.1126 ± 0.0016, respectively, in capturing the temporal complexities associated with Influenza A data, outperforming the well-known traditional baseline ARIMA and ETS models. The findings of this study provide evidence that state-of-the-art deep learning architectures can enhance predictive modeling for infectious diseases and indicate a more general trend toward using deep learning methods to enhance public health forecasting and intervention planning strategies. Future work should focus on how these models can be incorporated into real-time forecasting and preparedness systems at an epidemic level, and integrated into existing surveillance systems.
[LG-90] Wavelet Logic Machines: Learning and Reasoning in the Spectral Domain Without Neural Networks
Link: https://arxiv.org/abs/2507.19514
Authors: Andrew Kiruluta
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We introduce a fully spectral learning framework that eliminates traditional neural layers by operating entirely in the wavelet domain. The model applies learnable nonlinear transformations, including soft-thresholding and gain-phase modulation, directly to wavelet coefficients. It also includes a differentiable wavelet basis selection mechanism, enabling adaptive processing using families such as Haar, Daubechies, and Biorthogonal wavelets. Implemented in PyTorch with full 3D support, the model maintains a spectral pipeline without spatial convolutions or attention. On synthetic 3D denoising and natural language tasks from the GLUE benchmark, including SST-2 sentiment classification, the model achieves 89.3 percent accuracy, close to a 4-layer Transformer baseline (90.1 percent), while using 72 percent fewer parameters and 58 percent less peak memory. Faster early convergence is observed due to spectral sparsity priors. In contrast to the quadratic complexity of self-attention and large matrix multiplications in Transformers, our approach uses linear-time wavelet transforms and pointwise nonlinearities, significantly reducing inference cost. This yields a compact, interpretable, and efficient alternative to neural models. Our results support the viability of principled spectral learning in both vision and language tasks, offering new directions for model design without overparameterized architectures.
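Soft-thresholding of wavelet coefficients, one of the operations this abstract lists, can be sketched with a one-level orthonormal Haar transform on a 1D signal. This is a minimal illustration under stated assumptions: the threshold `tau` is a fixed number here, whereas the paper describes it as learnable, and the function names are hypothetical.

```python
import math

def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(coeffs, tau):
    """Shrink each coefficient toward zero by tau, zeroing small ones."""
    return [math.copysign(max(abs(c) - tau, 0.0), c) for c in coeffs]

def denoise(signal, tau):
    """Threshold only the detail band and reconstruct."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, tau))
```

Both the transform and the thresholding touch each sample a constant number of times, which is the linear-time, pointwise structure the abstract contrasts with self-attention.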
[LG-91] InkStream: Real-time GNN Inference on Streaming Graphs via Incremental Update
Link: https://arxiv.org/abs/2309.11071
Authors: Dan Wu,Zhaoying Li,Tulika Mitra
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments:
Abstract:Classic Graph Neural Network (GNN) inference approaches, designed for static graphs, are ill-suited for streaming graphs that evolve with time. The dynamism intrinsic to streaming graphs necessitates constant updates, posing unique challenges to acceleration on GPU. We address these challenges based on two key insights: (1) Inside the k-hop neighborhood, a significant fraction of the nodes is not impacted by the modified edges when the model uses min or max as aggregation function; (2) When the model weights remain static while the graph structure changes, node embeddings can incrementally evolve over time by computing only the impacted part of the neighborhood. With these insights, we propose a novel method, InkStream, designed for real-time inference with minimal memory access and computation, while ensuring an identical output to conventional methods. InkStream operates on the principle of propagating and fetching data only when necessary. It uses an event-based system to control inter-layer effect propagation and intra-layer incremental updates of node embedding. InkStream is highly extensible and easily configurable by allowing users to create and process customized events. We showcase that less than 10 lines of additional user code are needed to support popular GNN models such as GCN, GraphSAGE, and GIN. Our experiments with three GNN models on four large graphs demonstrate that InkStream accelerates by 2.5-427× on a CPU cluster and 2.4-343× on two different GPU clusters while producing identical outputs as GNN model inference on the latest graph snapshot.
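Insight (1) in this abstract can be illustrated with a small sketch of incremental max aggregation. It assumes scalar node features and a single layer, and it is not InkStream's event system: the function names and data layout are hypothetical. The point is that an edge change only forces work, and only needs to propagate further, when the aggregate actually changes.

```python
def add_edge(agg, feats, nbrs, v, u):
    """Add neighbor u to node v; return True if v's max-aggregate changed."""
    nbrs.setdefault(v, set()).add(u)
    if agg.get(v) is None or feats[u] > agg[v]:
        agg[v] = feats[u]
        return True
    return False  # new neighbor is dominated; nothing to propagate

def remove_edge(agg, feats, nbrs, v, u):
    """Remove neighbor u from node v; return True if the aggregate changed."""
    nbrs[v].discard(u)
    if feats[u] < agg[v]:
        return False  # u was not the maximizer; aggregate is untouched
    # u attained the max, so recompute from the remaining neighbors.
    new = max((feats[w] for w in nbrs[v]), default=None)
    changed = new != agg[v]
    agg[v] = new
    return changed
```

In a multi-layer model, a `False` return would mean the change stops at this node, which is how a large share of the k-hop neighborhood can be skipped.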
[LG-92] Locally Adaptive Conformal Inference for Operator Models
Link: https://arxiv.org/abs/2507.20975
Authors: Trevor Harris,Yan Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures, 2 tables
Abstract:Operator models are regression algorithms for functional data and have become a key tool for emulating large-scale dynamical systems. Recent advances in deep neural operators have dramatically improved the accuracy and scalability of operator modeling, but these models lack an inherent notion of predictive uncertainty. We introduce Local Spectral Conformal Inference (LSCI), a new framework for locally adaptive, distribution-free uncertainty quantification for neural operator models. LSCI uses projection-based depth scoring and localized conformal inference to generate function-valued prediction sets with statistical guarantees. We prove approximate finite-sample marginal coverage under local exchangeability, and demonstrate significant gains in adaptivity and coverage across synthetic and real-world operator learning tasks.
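As background for the coverage guarantee mentioned here, the sketch below shows plain split conformal prediction for scalar outputs. It is the standard non-adaptive baseline, not LSCI: there is no localization, depth scoring, or function-valued output, and the function name is hypothetical.

```python
import math

def conformal_radius(cal_scores, alpha):
    """Split conformal quantile of calibration nonconformity scores.

    cal_scores: e.g. absolute residuals |y - prediction| on held-out data.
    Returns q such that [prediction - q, prediction + q] covers a new
    exchangeable point with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.inf  # too few calibration points for this level
    return sorted(cal_scores)[k - 1]
```

With nine calibration scores and `alpha=0.1`, the rank is 9, so the radius is the largest score; a locally adaptive method would instead let this radius vary with the input.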
[LG-93] Mean-Field Langevin Diffusions with Density-dependent Temperature
Link: https://arxiv.org/abs/2507.20958
Authors: Yu-Jui Huang,Zachariah Malik
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:In the context of non-convex optimization, we let the temperature of a Langevin diffusion depend on the diffusion’s own density function. The rationale is that the induced density reveals to some extent the landscape imposed by the non-convex function to be minimized, such that a density-dependent temperature can provide location-wise random perturbation that may better react to, for instance, the location and depth of local minimizers. As the Langevin dynamics is now self-regulated by its own density, it forms a mean-field stochastic differential equation (SDE) of the Nemytskii type, distinct from the standard McKean-Vlasov equations. Relying on Wasserstein subdifferential calculus, we first show that the corresponding (nonlinear) Fokker-Planck equation has a unique solution. Next, a weak solution to the SDE is constructed from the solution to the Fokker-Planck equation, by Trevisan’s superposition principle. As time goes to infinity, we further show that the density induced by the SDE converges to an invariant distribution, which admits an explicit formula in terms of the Lambert W function.
[LG-94] Towards trustworthy AI in materials mechanics through domain-guided attention
Link: https://arxiv.org/abs/2507.20658
Authors: Jesco Talies,Eric Breitbarth,David Melching
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:
Abstract:Ensuring the trustworthiness and robustness of deep learning models remains a fundamental challenge, particularly in high-stakes scientific applications. In this study, we present a framework called attention-guided training that combines explainable artificial intelligence techniques with quantitative evaluation and domain-specific priors to guide model attention. We demonstrate that domain-specific feedback on model explanations during training can enhance the model’s generalization capabilities. We validate our approach on the task of semantic crack tip segmentation in digital image correlation data, which is a key application in the fracture mechanical characterization of materials. By aligning model attention with physically meaningful stress fields, such as those described by Williams’ analytical solution, attention-guided training ensures that the model focuses on physically relevant regions. This ultimately leads to improved generalization and more faithful explanations.
[LG-95] Comparing and Scaling fMRI Features for Brain-Behavior Prediction
Link: https://arxiv.org/abs/2507.20601
Authors: Mikkel Schöttner Sieler,Thomas A.W. Bolton,Jagruti Patel,Patric Hagmann
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:
Abstract:Predicting behavioral variables from neuroimaging modalities such as magnetic resonance imaging (MRI) has the potential to allow the development of neuroimaging biomarkers of mental and neurological disorders. A crucial processing step to this aim is the extraction of suitable features. These can differ in how well they predict the target of interest, and how this prediction scales with sample size and scan time. Here, we compare nine feature subtypes extracted from resting-state functional MRI recordings for behavior prediction, ranging from regional measures of functional activity to functional connectivity (FC) and metrics derived with graph signal processing (GSP), a principled approach for the extraction of structure-informed functional features. We study 979 subjects from the Human Connectome Project Young Adult dataset, predicting summary scores for mental health, cognition, processing speed, and substance use, as well as age and sex. The scaling properties of the features are investigated for different combinations of sample size and scan time. FC comes out as the best feature for predicting cognition, age, and sex. Graph power spectral density is the second best for predicting cognition and age, while for sex, variability-based features show potential as well. When predicting sex, the low-pass graph filtered coupled FC slightly outperforms the simple FC variant. None of the other targets were predicted significantly. The scaling results point to higher performance reserves for the better-performing features. They also indicate that it is important to balance sample size and scan time when acquiring data for prediction studies. The results confirm FC as a robust feature for behavior prediction, but also show the potential of GSP and variability-based measures. We discuss the implications for future prediction studies in terms of strategies for acquisition and sample composition.
[LG-96] Statistical Inference for Differentially Private Stochastic Gradient Descent
Link: https://arxiv.org/abs/2507.20560
Authors: Xintao Xia,Linjun Zhang,Zhanrui Cai
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Privacy preservation in machine learning, particularly through Differentially Private Stochastic Gradient Descent (DP-SGD), is critical for sensitive data analysis. However, existing statistical inference methods for SGD predominantly focus on cyclic subsampling, while DP-SGD requires randomized subsampling. This paper first bridges this gap by establishing the asymptotic properties of SGD under the randomized rule and extending these results to DP-SGD. For the output of DP-SGD, we show that the asymptotic variance decomposes into statistical, sampling, and privacy-induced components. Two methods are proposed for constructing valid confidence intervals: the plug-in method and the random scaling method. We also perform extensive numerical analysis, which shows that the proposed confidence intervals achieve nominal coverage rates while maintaining privacy.
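To make the contrast between cyclic and randomized subsampling concrete, here is a minimal sketch of one DP-SGD update in its commonly described form: each example is included independently with some probability (Poisson subsampling), per-example gradients are clipped, and Gaussian noise scaled to the clipping norm is added. The function names, the normalization by the expected batch size, and the toy gradient are assumptions made for illustration; the abstract does not give the paper's exact algorithm.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(theta, data, grad_fn, lr, sample_rate, max_norm,
                noise_multiplier, rng):
    """One DP-SGD update with randomized (Poisson) subsampling."""
    # Randomized rule: every example joins the batch independently.
    batch = [z for z in data if rng.random() < sample_rate]
    total = [0.0] * len(theta)
    for z in batch:
        g = clip(grad_fn(theta, z), max_norm)
        total = [t + gi for t, gi in zip(total, g)]
    # Gaussian noise calibrated to the clipping norm (privacy component).
    noisy = [t + rng.gauss(0.0, noise_multiplier * max_norm) for t in total]
    # Assumed convention: normalize by the expected batch size.
    expected_batch = sample_rate * len(data)
    return [th - lr * n / expected_batch for th, n in zip(theta, noisy)]


# Toy usage: estimate a mean with loss 0.5 * (theta - z)^2.
rng = random.Random(0)
theta = [0.0]
data = [1.0, 3.0, 2.0, 2.5]
for _ in range(200):
    theta = dp_sgd_step(theta, data, lambda th, z: [th[0] - z],
                        lr=0.1, sample_rate=0.5, max_norm=5.0,
                        noise_multiplier=0.1, rng=rng)
```

The three sources of randomness in this sketch (the data, the subsampling, and the injected noise) correspond informally to the statistical, sampling, and privacy-induced variance components named in the abstract.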
[LG-97] Deep Reputation Scoring in DeFi: zScore-Based Wallet Ranking from Liquidity and Trading Signals
Link: https://arxiv.org/abs/2507.20494
Authors: Dhanashekar Kandaswamy,Ashutosh Sahoo,Akshay SP,Gurukiran S,Parag Paul,Girish G N
Categories: General Finance (q-fin.GN); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. Independently developed system by Zeru Finance for decentralized user scoring. Not submitted to any conference or journal
Abstract:As decentralized finance (DeFi) evolves, distinguishing between user behaviors - liquidity provision versus active trading - has become vital for risk modeling and on-chain reputation. We propose a behavioral scoring framework for Uniswap that assigns two complementary scores: a Liquidity Provision Score that assesses strategic liquidity contributions, and a Swap Behavior Score that reflects trading intent, volatility exposure, and discipline. The scores are constructed using rule-based blueprints that decompose behavior into volume, frequency, holding time, and withdrawal patterns. To handle edge cases and learn feature interactions, we introduce a deep residual neural network with densely connected skip blocks inspired by the U-Net architecture. We also incorporate pool-level context such as total value locked (TVL), fee tiers, and pool size, allowing the system to differentiate similar user behaviors across pools with varying characteristics. Our framework enables context-aware and scalable DeFi user scoring, supporting improved risk assessment and incentive design. Experiments on Uniswap v3 data show its usefulness for user segmentation and protocol-aligned reputation systems. Although we refer to our metric as zScore, it is independently developed and methodologically different from the cross-protocol system proposed by Udupi et al. Our focus is on role-specific behavioral modeling within Uniswap using blueprint logic and supervised learning.
[LG-98] Building crypto portfolios with agentic AI
Link: https://arxiv.org/abs/2507.20468
Authors: Antonino Castelli,Paolo Giudici,Alessandro Piergallini
Categories: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
Comments: 12 pages, 2 figures
Abstract:The rapid growth of crypto markets has opened new opportunities for investors, but at the same time exposed them to high volatility. To address the challenge of managing dynamic portfolios in such an environment, this paper presents a practical application of a multi-agent system designed to autonomously construct and evaluate crypto-asset allocations. Using daily data on the ten most capitalized cryptocurrencies from 2020 to 2025, we compare two automated investment strategies. These are a static equal weighting strategy and a rolling-window optimization strategy, both implemented to maximize the evaluation metrics of the Modern Portfolio Theory (MPT), such as Expected Return, Sharpe and Sortino ratios, while minimizing volatility. Each step of the process is handled by dedicated agents, integrated through a collaborative architecture in Crew AI. The results show that the dynamic optimization strategy achieves significantly better performance in terms of risk-adjusted returns, both in-sample and out-of-sample. This highlights the benefits of adaptive techniques in portfolio management, particularly in volatile markets such as cryptocurrency markets. The proposed methodology also demonstrates how multi-agent systems can provide scalable, auditable, and flexible solutions in financial automation.
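The evaluation metrics named in the abstract can be sketched for the static equal-weighting baseline as follows. This is a minimal illustration using common textbook definitions (per-period, not annualized); the function names and sample numbers are hypothetical and are not taken from the paper.

```python
import math
import statistics


def equal_weights(n_assets):
    """Static equal-weighting strategy: every asset gets weight 1/n."""
    return [1.0 / n_assets] * n_assets


def portfolio_returns(asset_returns, weights):
    """asset_returns holds one list of per-asset returns for each period."""
    return [sum(w * r for w, r in zip(weights, period))
            for period in asset_returns]


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by the downside deviation below target."""
    excess = [r - target for r in returns]
    downside = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    return statistics.mean(excess) / downside


# Hypothetical two-asset example over four periods.
periods = [[0.02, 0.0], [-0.04, 0.0], [0.06, 0.0], [0.0, 0.0]]
rets = portfolio_returns(periods, equal_weights(2))
print(sharpe_ratio(rets), sortino_ratio(rets))
```

Both ratios divide by a dispersion measure, so a real implementation would need to guard against constant return series and series with no downside periods.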
[LG-99] Operator Inference Aware Quadratic Manifolds with Isotropic Reduced Coordinates for Nonintrusive Model Reduction
Link: https://arxiv.org/abs/2507.20463
Authors: Paul Schwerdtner,Prakash Mohan,Julie Bessac,Marc T. Henry de Frahan,Benjamin Peherstorfer
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 23 pages, 8 figures
Abstract:Quadratic manifolds for nonintrusive reduced modeling are typically trained to minimize the reconstruction error on snapshot data, which means that the error of models fitted to the embedded data in downstream learning steps is ignored. In contrast, we propose a greedy training procedure that takes into account both the reconstruction error on the snapshot data and the prediction error of reduced models fitted to the data. Because our procedure learns quadratic manifolds with the objective of achieving accurate reduced models, it avoids oscillatory and other non-smooth embeddings that can hinder learning accurate reduced models. Numerical experiments on transport and turbulent flow problems show that quadratic manifolds trained with the proposed greedy approach lead to reduced models with up to two orders of magnitude higher accuracy than quadratic manifolds trained with respect to the reconstruction error alone.
[LG-100] A General Framework for Estimating Preferences Using Response Time Data
Link: https://arxiv.org/abs/2507.20403
Authors: Federico Echenique,Alireza Fallah,Michael I. Jordan
Categories: Theoretical Economics (econ.TH); Machine Learning (cs.LG)
Comments:
Abstract:We propose a general methodology for recovering preference parameters from data on choices and response times. Our methods yield estimates with fast (1/n for n data points) convergence rates when specialized to the popular Drift Diffusion Model (DDM), but are broadly applicable to generalizations of the DDM as well as to alternative models of decision making that make use of response time data. The paper develops an empirical application to an experiment on intertemporal choice, showing that the use of response times delivers predictive accuracy and matters for the estimation of economically relevant parameters.
[LG-101] Predicting Parkinson’s Disease Progression Using Statistical and Neural Mixed Effects Models: A Comparative Study on Longitudinal Biomarkers
Link: https://arxiv.org/abs/2507.20058
Authors: Ran Tong,Lanruo Wang,Tong Wang,Wei Yan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 20 pages, 3 figures, currently under review
Abstract:Predicting Parkinson’s Disease (PD) progression is crucial, and voice biomarkers offer a non-invasive method for tracking symptom severity (UPDRS scores) through telemonitoring. Analyzing this longitudinal data is challenging due to within-subject correlations and complex, nonlinear patient-specific progression patterns. This study benchmarks LMMs against two advanced hybrid approaches: the Generalized Neural Network Mixed Model (GNMM) (Mandel 2021), which embeds a neural network within a GLMM structure, and the Neural Mixed Effects (NME) model (Wortwein 2023), allowing nonlinear subject-specific parameters throughout the network. Using the Oxford Parkinson’s telemonitoring voice dataset, we evaluate these models’ performance in predicting Total UPDRS to offer practical guidance for PD research and clinical applications.
[LG-102] Extreme value theory for singular subspace estimation in the matrix denoising model
Link: https://arxiv.org/abs/2507.19978
Authors: Junhyung Chang,Joshua Cape
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 64 pages, 8 figures
Abstract:This paper studies fine-grained singular subspace estimation in the matrix denoising model where a deterministic low-rank signal matrix is additively perturbed by a stochastic matrix of Gaussian noise. We establish that the maximum Euclidean row norm (i.e., the two-to-infinity norm) of the aligned difference between the leading sample and population singular vectors approaches the Gumbel distribution in the large-matrix limit, under suitable signal-to-noise conditions and after appropriate centering and scaling. We apply our novel asymptotic distributional theory to test hypotheses of low-rank signal structure encoded in the leading singular vectors and their corresponding principal subspace. We provide de-biased estimators for the corresponding nuisance signal singular values and show that our proposed plug-in test statistic has desirable properties. Notably, compared to using the Frobenius norm subspace distance, our test statistic based on the two-to-infinity norm has higher power to detect structured alternatives that differ from the null in only a few matrix entries or rows. Our main results are obtained by a novel synthesis of and technical analysis involving entrywise matrix perturbation analysis, extreme value theory, saddle point approximation methods, and random matrix theory. Our contributions complement the existing literature for matrix denoising focused on minimaxity, mean squared error analysis, unitarily invariant distances between subspaces, component-wise asymptotic distributional theory, and row-wise uniform error bounds. Numerical simulations illustrate our main results and demonstrate the robustness properties of our testing procedure to non-Gaussian noise distributions.
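The claim about power against alternatives that differ in only a few rows can be illustrated by comparing the two norms directly. The sketch below is only an illustration of the norms themselves, not of the paper's test statistic; the function names and the two example perturbations are hypothetical.

```python
import math


def two_to_infinity_norm(matrix):
    """Maximum Euclidean row norm of a matrix given as a list of rows."""
    return max(math.sqrt(sum(x * x for x in row)) for row in matrix)


def frobenius_norm(matrix):
    """Square root of the sum of all squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


# Two perturbations with the same Frobenius norm (both equal to 1):
# one spread evenly over 100 rows, one concentrated in a single row.
spread = [[0.1, 0.0] for _ in range(100)]
concentrated = [[1.0, 0.0]] + [[0.0, 0.0] for _ in range(99)]

print(frobenius_norm(spread), frobenius_norm(concentrated))
print(two_to_infinity_norm(spread), two_to_infinity_norm(concentrated))
```

The Frobenius norm cannot tell the two perturbations apart, while the two-to-infinity norm is ten times larger for the concentrated one, which is the intuition behind a row-wise statistic being more sensitive to localized differences.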
[LG-103] Nonconvex Optimization Framework for Group-Sparse Feedback Linear-Quadratic Optimal Control. II: Non-Penalty Approach
Link: https://arxiv.org/abs/2507.19895
Authors: Lechen Feng,Xun Li,Yuan-Hua Ni
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2507.18114
Abstract:This work is a companion paper of [8], where the distributed linear-quadratic problem with fixed communication topology (DFT-LQ) and the sparse feedback LQ problem (SF-LQ) are formulated into a nonsmooth and nonconvex optimization problem with affine constraints. Moreover, a penalty approach is considered in [8], and the PALM (proximal alternating linearized minimization) algorithm is studied with convergence and complexity analysis. In this paper, we aim to address the inherent drawbacks of the penalty approach, such as the challenge of tuning the penalty parameter and the risk of introducing spurious stationary points. Specifically, we first reformulate the SF-LQ problem and the DFT-LQ problem from an epi-composition function perspective, aiming to solve the constrained problem directly. Then, from a theoretical viewpoint, we revisit the alternating direction method of multipliers (ADMM) and establish its convergence to the set of cluster points under certain assumptions. When these assumptions do not hold, we can effectively utilize alternative approaches combining subgradient descent with Difference-of-Convex relaxation methods. In summary, our results enable the direct design of group-sparse feedback gains with theoretical guarantees, without resorting to convex surrogates, restrictive structural assumptions, or penalty formulations that incorporate constraints into the cost function.
[LG-104] Quantum-Informed Machine Learning for Chaotic Systems
Link: https://arxiv.org/abs/2507.19861
Authors: Maida Wang,Xiao Xue,Peter V. Coveney
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 33 pages, 4 figures
Abstract:Learning the behaviour of chaotic systems remains challenging due to instability in long-term predictions and difficulties in accurately capturing invariant statistical properties. While quantum machine learning offers a promising route to efficiently capture physical properties from high-dimensional data, its practical deployment is hindered by current hardware noise and limited scalability. We introduce a quantum-informed machine learning framework for learning partial differential equations, with an application focus on chaotic systems. A quantum circuit Born machine is employed to learn the invariant properties of chaotic dynamical systems, achieving substantial memory efficiency by representing these complex physical statistics with a compact set of trainable circuit parameters. This approach reduces the data storage requirement by over two orders of magnitude compared to the raw simulation data. The resulting statistical quantum-informed prior is then incorporated into a Koopman-based auto-regressive model to address issues such as gradient vanishing or explosion, while maintaining long-term statistical fidelity. The framework is evaluated on three representative systems: the Kuramoto-Sivashinsky equation, two-dimensional Kolmogorov flow and turbulent channel flow. In all cases, the quantum-informed model achieves superior performance compared to its classical counterparts without quantum priors. This hybrid architecture offers a practical route for learning dynamical systems using near-term quantum hardware.
[LG-105] Sequence-based protein-protein interaction prediction and its applications in drug discovery
Link: https://arxiv.org/abs/2507.19805
Authors: François Charih,James R. Green,Kyle K. Biggar
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 32 pages, 6 figures, 3 tables
Abstract:Aberrant protein-protein interactions (PPIs) underpin a plethora of human diseases, and disruption of these harmful interactions constitutes a compelling treatment avenue. Advances in computational approaches to PPI prediction have closely followed progress in deep learning and natural language processing. In this review, we outline the state-of-the-art for sequence-based PPI prediction methods and explore their impact on target identification and drug discovery. We begin with an overview of commonly used training data sources and techniques used to curate these data to enhance the quality of the training set. Subsequently, we survey various PPI predictor types, including traditional similarity-based approaches, and deep learning-based approaches with a particular emphasis on the transformer architecture. Finally, we provide examples of PPI prediction in systems-level proteomics analyses, target identification, and design of therapeutic peptides and antibodies. We also take the opportunity to showcase the potential of PPI-aware drug discovery models in accelerating therapeutic development.
[LG-106] Enhancing Materials Discovery with Valence Constrained Design in Generative Modeling
Link: https://arxiv.org/abs/2507.19799
Authors: Mouyang Cheng,Weiliang Luo,Hao Tang,Bowen Yu,Yongqiang Cheng,Weiwei Xie,Ju Li,Heather J. Kulik,Mingda Li
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures
Abstract:Diffusion-based deep generative models have emerged as powerful tools for inverse materials design. Yet, many existing approaches overlook essential chemical constraints such as oxidation state balance, which can lead to chemically invalid structures. Here we introduce CrysVCD (Crystal generator with Valence-Constrained Design), a modular framework that integrates chemical rules directly into the generative process. CrysVCD first employs a transformer-based elemental language model to generate valence-balanced compositions, followed by a diffusion model to generate crystal structures. The valence constraint enables orders-of-magnitude more efficient chemical valence checking, compared to pure data-driven approaches with post-screening. When fine-tuned on stability metrics, CrysVCD achieves 85% thermodynamic stability and 68% phonon stability. Moreover, CrysVCD supports conditional generation of functional materials, enabling discovery of candidates such as high thermal conductivity semiconductors and high-κ dielectric compounds. Designed as a general-purpose plugin, CrysVCD can be integrated into diverse generative pipelines to promote chemical validity, offering a reliable, scientifically grounded path for materials discovery.
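For readers unfamiliar with oxidation state balance, the following sketch shows the kind of post-screening check the abstract contrasts with CrysVCD: enumerate oxidation-state assignments for a composition and accept it if some assignment has zero net charge. This is not CrysVCD itself. The oxidation-state table is a tiny illustrative subset, and the sketch assumes, as a simplification, that all atoms of one element share a single oxidation state.

```python
from itertools import product

# Tiny illustrative subset; real elements often have more oxidation states.
OXIDATION_STATES = {"Na": [1], "Cl": [-1], "Fe": [2, 3], "O": [-2]}


def is_valence_balanced(composition, states=OXIDATION_STATES):
    """Return True if some oxidation-state assignment gives zero net charge.

    composition maps element symbols to atom counts, e.g. {"Fe": 2, "O": 3}.
    Simplifying assumption: one oxidation state per element.
    """
    elements = list(composition)
    for assignment in product(*(states[e] for e in elements)):
        charge = sum(composition[e] * s for e, s in zip(elements, assignment))
        if charge == 0:
            return True
    return False


print(is_valence_balanced({"Na": 1, "Cl": 1}))  # NaCl
print(is_valence_balanced({"Fe": 2, "O": 3}))   # Fe2O3 with Fe(+3)
print(is_valence_balanced({"Na": 1, "Cl": 2}))  # unbalanced
```

Because of the one-state-per-element simplification, a mixed-valence compound such as Fe3O4 is rejected by this check; handling it would require enumerating per-atom assignments, which is where brute-force screening becomes expensive.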
[LG-107] Sparse-mode Dynamic Mode Decomposition for Disambiguating Local and Global Structures
Link: https://arxiv.org/abs/2507.19787
Authors: Sara M. Ichinaga,Steven L. Brunton,Aleksandr Y. Aravkin,J. Nathan Kutz
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:The dynamic mode decomposition (DMD) is a data-driven approach that extracts the dominant features from spatiotemporal data. In this work, we introduce sparse-mode DMD, a new variant of the optimized DMD framework that specifically leverages sparsity-promoting regularization in order to approximate DMD modes which have localized spatial structure. The algorithm maintains the noise-robust properties of optimized DMD while disambiguating between modes which are spatially local versus global in nature. In many applications, such modes are associated with discrete and continuous spectra respectively, thus allowing the algorithm to explicitly construct, in an unsupervised manner, the distinct portions of the spectrum. We demonstrate this by analyzing synthetic and real-world systems, including examples from optical waveguides, quantum mechanics, and sea surface temperature data.
[LG-108] Bag of Coins: A Statistical Probe into Neural Confidence Structures
Link: https://arxiv.org/abs/2507.19774
Authors: Agnideep Aich,Ashit Baran Aich,Md Monzur Murshed,Sameera Hewage,Bruce Wade
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Modern neural networks, despite their high accuracy, often produce poorly calibrated confidence scores, limiting their reliability in high-stakes applications. Existing calibration methods typically post-process model outputs without interrogating the internal consistency of the predictions themselves. In this work, we introduce a novel, non-parametric statistical probe, the Bag-of-Coins (BoC) test, that examines the internal consistency of a classifier’s logits. The BoC test reframes confidence estimation as a frequentist hypothesis test: does the model’s top-ranked class win 1-v-1 contests against random competitors at a rate consistent with its own stated softmax probability? When applied to modern deep learning architectures, this simple probe reveals a fundamental dichotomy. On Vision Transformers (ViTs), the BoC output serves as a state-of-the-art confidence score, achieving near-perfect calibration with an ECE of 0.0212, an 88% improvement over a temperature-scaled baseline. Conversely, on Convolutional Neural Networks (CNNs) like ResNet, the probe reveals a deep inconsistency between the model’s predictions and its internal logit structure, a property missed by traditional metrics. We posit that BoC is not merely a calibration method, but a new diagnostic tool for understanding and exposing the differing ways that popular architectures represent uncertainty.
[LG-109] TokenBlowUp: Resolving Representational Singularities in LLM Token Spaces via Monoidal Transformations
Link: https://arxiv.org/abs/2507.19747
Authors: Dongfang Zhao
Categories: Algebraic Geometry (math.AG); Machine Learning (cs.LG)
Comments:
Abstract:Recent work has provided compelling evidence challenging the foundational manifold hypothesis for the token embedding spaces of Large Language Models (LLMs). These findings reveal the presence of geometric singularities around polysemous tokens, which can lead to representational instability. Existing methodologies, which presuppose a smooth data manifold, are ill-equipped to address such intrinsic structural flaws. In this paper, we formalize this problem in the language of scheme theory and propose a rigorous resolution by applying the scheme-theoretic blow-up at each singular point. This procedure replaces a singular point in the ambient affine scheme with its exceptional divisor, which we identify as a canonical geometric space – a projective space of directions – that houses the disambiguated semantic meanings of the token. This process of “representational desingularization” constructs a new geometric landscape for embeddings. We prove a formal theorem guaranteeing the geometric regularization of this new space, showing that the original pathologies are resolved. Finally, we outline the architectural implications of our framework, arguing for a paradigm shift from static look-ups to dynamic, geometrically-grounded computation.
[LG-110] Adaptive Bayesian Data-Driven Design of Reliable Solder Joints for Micro-electronic Devices
Link: https://arxiv.org/abs/2507.19663
Authors: Leo Guo,Adwait Inamdar,Willem D. van Driel,GuoQi Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: data-driven design, adaptive hyperparameters, Bayesian optimization, solder joint reliability, micro-electronics
Abstract:Solder joint reliability related to failures due to thermomechanical loading is a critically important yet physically complex engineering problem. As a result, simulated behavior is oftentimes computationally expensive. In an increasingly data-driven world, the usage of efficient data-driven design schemes is a popular choice. Among them, Bayesian optimization (BO) with Gaussian process regression is one of the most important representatives. The authors argue that computational savings can be obtained from exploiting thorough surrogate modeling and selecting a design candidate based on multiple acquisition functions. This is feasible due to the relatively low computational cost, compared to the expensive simulation objective. This paper addresses the shortcomings in the adjacent literature by providing and implementing a novel heuristic framework to perform BO with adaptive hyperparameters across the various optimization iterations. Adaptive BO is subsequently compared to regular BO when faced with synthetic objective minimization problems. The results show the efficiency of adaptive BO when compared with the worst-performing regular Bayesian schemes. As an engineering use case, the solder joint reliability problem is tackled by minimizing the accumulated non-linear creep strain under a cyclic thermal load. Results show that adaptive BO outperforms regular BO by 3% on average at any given computational budget threshold, critically saving half of the computational expense budget. This practical result underlines the methodological potential of the adaptive Bayesian data-driven methodology to achieve better results and cut optimization-related expenses. Lastly, in order to promote the reproducibility of the results, the data-driven implementations are made available on an open-source basis.
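As background for "multiple acquisition functions", the sketch below evaluates two widely used ones, expected improvement and the lower confidence bound, on surrogate predictions for a minimization problem. The abstract does not say which acquisition functions the authors use or how their proposals are combined, so the `propose` helper and the candidate values are purely hypothetical.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Expected improvement over the incumbent `best` (minimization)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)


def lower_confidence_bound(mu, sigma, kappa=2.0):
    """Optimistic estimate of the objective; smaller is more promising."""
    return mu - kappa * sigma


def propose(candidates, best):
    """candidates: list of (x, mu, sigma) surrogate predictions.

    Returns the design favored by each acquisition function (hypothetical
    helper; a real scheme must still decide how to reconcile them).
    """
    by_ei = max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))
    by_lcb = min(candidates, key=lambda c: lower_confidence_bound(c[1], c[2]))
    return {"ei": by_ei[0], "lcb": by_lcb[0]}


# Hypothetical surrogate predictions at two candidate designs.
print(propose([(0.0, 1.0, 0.1), (1.0, 1.2, 1.0)], best=1.0))
```

Evaluating several acquisition functions on the surrogate is cheap relative to a single thermomechanical simulation, which is the cost asymmetry the abstract relies on.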
[LG-111] Street network sub-patterns and travel mode
Link: https://arxiv.org/abs/2507.19648
Authors: Juan Fernando Riascos Goyes,Michael Lowry,Nicolás Guarín Zapata,Juan Pablo Ospina
Categories: Physics and Society (physics.soc-ph); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Urban morphology has long been recognized as a factor shaping human mobility, yet comparative and formal classifications of urban form across metropolitan areas remain limited. Building on theoretical principles of urban structure and advances in unsupervised learning, we systematically classified the built environment of nine U.S. metropolitan areas using structural indicators such as density, connectivity, and spatial configuration. The resulting morphological types were linked to mobility patterns through descriptive statistics, marginal effects estimation, and post hoc statistical testing. Here we show that distinct urban forms are systematically associated with different mobility behaviors, such as reticular morphologies being linked to significantly higher public transport use (marginal effect = 0.49) and reduced car dependence (-0.41), while organic forms are associated with increased car usage (0.44), and substantial declines in public transport (-0.47) and active mobility (-0.30). These effects are statistically robust (p < 1e-19), highlighting that the spatial configuration of urban areas plays a fundamental role in shaping transportation choices. Our findings extend previous work by offering a reproducible framework for classifying urban form and demonstrate the added value of morphological analysis in comparative urban research. These results suggest that urban form should be treated as a key variable in mobility planning and provide empirical support for incorporating spatial typologies into sustainable urban policy design.
[LG-112] State evolution beyond first-order methods I: Rigorous predictions and finite-sample guarantees
Link: https://arxiv.org/abs/2507.19611
Authors: Michael Celentano,Chen Cheng,Ashwin Pananjady,Kabir Aladin Verchand
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:We develop a toolbox for exact analysis of iterative algorithms on a class of high-dimensional nonconvex optimization problems with random data. While prior work has shown that low-dimensional statistics of (generalized) first-order methods can be predicted by a deterministic recursion known as state evolution, our focus is on developing such a prediction for a more general class of algorithms. We provide a state evolution for any method whose iterations are given by (possibly interleaved) first-order and saddle point updates, showing two main results. First, we establish a rigorous state evolution prediction that holds even when the updates are not coordinate-wise separable. Second, we establish finite-sample guarantees bounding the deviation of the empirical updates from the established state evolution. In the process, we develop a technical toolkit that may prove useful in related problems. One component of this toolkit is a general Hilbert space lifting technique to prove existence and uniqueness of a convenient parameterization of the state evolution. Another component of the toolkit combines a generic application of Bolthausen’s conditioning method with a sequential variant of Gordon’s Gaussian comparison inequality, and provides additional ingredients that enable a general finite-sample analysis.
[LG-113] Bayesian symbolic regression: Automated equation discovery from a physicist’s perspective
Link: https://arxiv.org/abs/2507.19540
Authors: Roger Guimera,Marta Sales-Pardo
Categories: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:Symbolic regression automates the process of learning closed-form mathematical models from data. Standard approaches to symbolic regression, as well as newer deep learning approaches, rely on heuristic model selection criteria, heuristic regularization, and heuristic exploration of model space. Here, we discuss the probabilistic approach to symbolic regression, an alternative to such heuristic approaches with direct connections to information theory and statistical physics. We show how the probabilistic approach establishes model plausibility from basic considerations and explicit approximations, and how it provides guarantees of performance that heuristic approaches lack. We also discuss how the probabilistic approach compels us to consider model ensembles, as opposed to single models.
Information Retrieval
[IR-0] Watermarking Large Language Model-based Time Series Forecasting
Link: https://arxiv.org/abs/2507.20762
Authors: Wei Yuan,Chaoqun Yang,Yu Xing,Tong Chen,Nguyen Quoc Viet Hung,Hongzhi Yin
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Model-based Time Series Forecasting (LLMTS) has shown remarkable promise in handling complex and diverse temporal data, representing a significant step toward foundation models for time series analysis. However, this emerging paradigm introduces two critical challenges. First, the substantial commercial potential and resource-intensive development raise urgent concerns about intellectual property (IP) protection. Second, their powerful time series forecasting capabilities may be misused to produce misleading or fabricated deepfake time series data. To address these concerns, we explore watermarking the outputs of LLMTS models, that is, embedding imperceptible signals into the generated time series data that remain detectable by specialized algorithms. We propose a novel post-hoc watermarking framework, Waltz, which is broadly compatible with existing LLMTS models. Waltz is inspired by the empirical observation that time series patch embeddings are rarely aligned with a specific set of LLM tokens, which we term “cold tokens”. Leveraging this insight, Waltz embeds watermarks by rewiring the similarity statistics between patch embeddings and cold token embeddings, and detects watermarks using similarity z-scores. To minimize potential side effects, we introduce a similarity-based embedding position identification strategy and employ projected gradient descent to constrain the watermark noise within a defined boundary. Extensive experiments using two popular LLMTS models across seven benchmark datasets demonstrate that Waltz achieves high watermark detection accuracy with minimal impact on the quality of the generated time series.
[IR-1] Improving Community Detection in Academic Networks by Handling Publication Bias
Link: https://arxiv.org/abs/2507.20449
Authors: Md Asaduzzaman Noor,John Sheppard,Jason Clark
Categories: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: This paper is an extended version of a work accepted at ASONAM 2025
Abstract:Finding potential research collaborators is a challenging task, especially in today’s fast-growing and interdisciplinary research landscape. While traditional methods often rely on observable relationships such as co-authorships and citations to construct the research network, in this work, we focus solely on publication content to build a topic-based research network using BERTopic with a fine-tuned SciBERT model that connects and recommends researchers across disciplines based on shared topical interests. A major challenge we address is publication imbalance, where some researchers publish much more than others, often across several topics. Without careful handling, their less frequent interests are hidden under dominant topics, limiting the network’s ability to detect their full research scope. To tackle this, we introduce a cloning strategy that clusters a researcher’s publications and treats each cluster as a separate node. This allows researchers to be part of multiple communities, improving the detection of interdisciplinary links. Evaluation on the proposed method shows that the cloned network structure leads to more meaningful communities and uncovers a broader set of collaboration opportunities.
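The cloning idea can be sketched in a few lines: once each publication carries a topic-cluster label, every (researcher, cluster) pair becomes its own node, so a prolific author's minor interests are no longer absorbed by the dominant one. The sketch assumes cluster labels are already available (the abstract obtains topics with BERTopic and SciBERT, which is not reproduced here), and the edge rule linking clones that share a cluster is a hypothetical simplification.

```python
from collections import defaultdict


def clone_researchers(publications):
    """publications: list of (researcher, topic_cluster) pairs.

    Returns a dict mapping each clone node (researcher, cluster) to the
    number of publications it represents.
    """
    nodes = defaultdict(int)
    for researcher, cluster in publications:
        nodes[(researcher, cluster)] += 1
    return dict(nodes)


def shared_topic_edges(nodes):
    """Link clones of different researchers that share a topic cluster."""
    by_cluster = defaultdict(set)
    for researcher, cluster in nodes:
        by_cluster[cluster].add(researcher)
    edges = set()
    for cluster, researchers in by_cluster.items():
        ordered = sorted(researchers)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                edges.add((ordered[i], ordered[j], cluster))
    return edges


# Hypothetical data: alice publishes mostly on "nlp" but once on "vision".
pubs = [("alice", "nlp"), ("alice", "nlp"), ("alice", "vision"), ("bob", "vision")]
nodes = clone_researchers(pubs)
print(nodes)
print(shared_topic_edges(nodes))
```

Without cloning, alice would be represented by her dominant "nlp" topic alone and the potential link to bob through "vision" would be missed.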
[IR-2] TIMEST: Temporal Information Motif Estimator Using Sampling Trees
Link: https://arxiv.org/abs/2507.20441
Authors: Yunjie Pan,Omkar Bhalerao,C. Seshadhri,Nishil Talati
Categories: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Comments:
Abstract:The mining of pattern subgraphs, known as motifs, is a core task in the field of graph mining. Edges in real-world networks often have timestamps, so there is a need for temporal motif mining. A temporal motif is a richer structure that imposes timing constraints on the edges of the motif. Temporal motifs have been used to analyze social networks, financial transactions, and biological networks. Motif counting in temporal graphs is particularly challenging. A graph with millions of edges can have trillions of temporal motifs, since the same edge can occur with multiple timestamps. There is a combinatorial explosion of possibilities, and state-of-the-art algorithms cannot manage motifs with more than four vertices. In this work, we present TIMEST: a general, fast, and accurate estimation algorithm to count temporal motifs of arbitrary sizes in temporal networks. Our approach introduces a temporal spanning tree sampler that leverages weighted sampling to generate substructures of target temporal motifs. This method carefully takes a subset of temporal constraints of the motif that can be jointly and efficiently sampled. TIMEST uses randomized estimation techniques to obtain accurate estimates of motif counts. We give theoretical guarantees on the running time and approximation guarantees of TIMEST. We perform an extensive experimental evaluation and show that TIMEST is both faster and more accurate than previous algorithms. Our CPU implementation exhibits an average speedup of 28x over state-of-the-art GPU implementation of the exact algorithm, and 6x speedup over SOTA approximate algorithms while consistently showcasing less than 5% error in most cases. For example, TIMEST can count the number of instances of a financial fraud temporal motif in four minutes with 0.6% error, while exact methods take more than two days. 
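To show what is being counted, here is a brute-force exact counter for one small temporal motif: a directed 3-cycle u→v, v→w, w→u whose edges have strictly increasing timestamps and all fall within a window `delta`. This is the kind of exhaustive enumeration that becomes infeasible at scale, not the TIMEST sampler; the motif choice, the strict-ordering convention, and the function name are assumptions made for illustration.

```python
def count_temporal_cycles(edges, delta):
    """Count temporal 3-cycles in a list of (src, dst, timestamp) edges.

    A match is three edges (u, v, t1), (v, w, t2), (w, u, t3) on distinct
    vertices with t1 < t2 < t3 and t3 - t1 <= delta.
    """
    edges = sorted(edges, key=lambda e: e[2])
    n = len(edges)
    count = 0
    for i in range(n):
        u, v, t1 = edges[i]
        if u == v:
            continue
        for j in range(i + 1, n):
            src2, w, t2 = edges[j]
            if t2 - t1 > delta:
                break  # edges are time-sorted, so later ones are too late
            if src2 != v or w in (u, v) or t2 == t1:
                continue
            for k in range(j + 1, n):
                src3, dst3, t3 = edges[k]
                if t3 - t1 > delta:
                    break
                if src3 == w and dst3 == u and t3 > t2:
                    count += 1
    return count


# The edge c -> a occurs twice with different timestamps.
edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3), ("c", "a", 10)]
print(count_temporal_cycles(edges, delta=5))
print(count_temporal_cycles(edges, delta=10))
```

The repeated edge c→a closes the same cycle twice once the window is wide enough, which is a small instance of the multiplicity effect the abstract describes.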
Cite as: arXiv:2507.20441 [cs.DB] (or arXiv:2507.20441v1 [cs.DB] for this version); DOI: https://doi.org/10.48550/arXiv.2507.20441 (arXiv-issued DOI via DataCite, pending registration). Submission history: [v1] Sun, 27 Jul 2025 23:31:55 UTC (731 KB), submitted by Omkar Bhalerao.
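To make the counting problem in this abstract concrete, here is a minimal sketch for the simplest temporal motif, a two-edge path u -> v -> w whose second edge occurs after the first and within a window `delta`. It is not TIMEST: the exact counter is brute force, and the estimator is a generic uniform edge-sampling scheme rather than the paper's temporal spanning tree sampler. The function names and the toy edge list are assumptions made for illustration.

```python
import random

def extensions(edges, first, delta):
    """Number of edges v -> w that extend `first` = (u, v, t1) into a temporal wedge."""
    u, v, t1 = first
    return sum(
        1
        for (x, w, t2) in edges
        if x == v and len({u, v, w}) == 3 and t1 < t2 <= t1 + delta
    )

def count_temporal_wedges(edges, delta):
    """Exact count: sum the extensions of every possible first edge (O(m^2))."""
    return sum(extensions(edges, e, delta) for e in edges)

def estimate_temporal_wedges(edges, delta, num_samples, rng):
    """Unbiased estimate: sample first edges uniformly, scale the mean by m."""
    sampled = [rng.choice(edges) for _ in range(num_samples)]
    mean_ext = sum(extensions(edges, e, delta) for e in sampled) / num_samples
    return len(edges) * mean_ext

# Toy temporal graph: the static graph has a single wedge a -> b -> c,
# but repeated timestamps on the same edges create several temporal instances.
toy_edges = [("a", "b", 1), ("a", "b", 2), ("b", "c", 3), ("b", "c", 4), ("b", "c", 10)]
exact = count_temporal_wedges(toy_edges, delta=5)
approx = estimate_temporal_wedges(toy_edges, delta=5, num_samples=2000, rng=random.Random(0))
```

On the toy graph the single static wedge yields four temporal instances, which is the multiplication effect the abstract describes. Uniform sampling has high variance when a few edges carry most of the instances, which is one motivation for weighted sampling.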
[IR-3] TADT-CSA: Temporal Advantage Decision Transformer with Contrastive State Abstraction for Generative Recommendation
Link: https://arxiv.org/abs/2507.20327
Authors: Xiang Gao, Tianyuan Liu, Yisha Li, Jingxin Liu, Lexi Gao, Xin Li, Haiyang Lu, Liyin Hong
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:With the rapid advancement of Transformer-based Large Language Models (LLMs), generative recommendation has shown great potential in enhancing both the accuracy and semantic understanding of modern recommender systems. Compared to LLMs, the Decision Transformer (DT) is a lightweight generative model applied to sequential recommendation tasks. However, DT faces challenges in trajectory stitching, often producing suboptimal trajectories. Moreover, due to the high dimensionality of user states and the vast state space inherent in recommendation scenarios, DT can incur significant computational costs and struggle to learn effective state representations. To overcome these issues, we propose a novel Temporal Advantage Decision Transformer with Contrastive State Abstraction (TADT-CSA) model. Specifically, we combine the conventional Return-To-Go (RTG) signal with a novel temporal advantage (TA) signal that encourages the model to capture both long-term returns and their sequential trend. Furthermore, we integrate a contrastive state abstraction module into the DT framework to learn more effective and expressive state representations. Within this module, we introduce a TA-conditioned State Vector Quantization (TAC-SVQ) strategy, where the TA score guides the state codebooks to incorporate contextual token information. Additionally, a reward prediction network and a contrastive transition prediction (CTP) network are employed to ensure the state codebook preserves both the reward information of the current state and the transition information between adjacent states. Empirical results on both public datasets and an online recommendation system demonstrate the effectiveness of the TADT-CSA model and its superiority over baseline methods.
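The Return-To-Go (RTG) signal mentioned in this abstract is the standard conditioning input of a Decision Transformer: at each step it is the sum of rewards from that step to the end of the trajectory. The sketch below computes it for a single trajectory. The `gamma` discount parameter is an assumption added for generality (the usual formulation is undiscounted, `gamma=1.0`), and the paper's temporal advantage (TA) signal is not sketched because the abstract does not define it.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at step i: rewards[i] + gamma * rewards[i+1] + gamma^2 * ..."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum of the step after it.
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        rtg[i] = running
    return rtg

# Example: per-step rewards of one user session (e.g. clicks or purchases).
session_rtg = returns_to_go([1, 0, 2, 1])  # [4.0, 3.0, 3.0, 1.0]
```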
[IR-4] CTR-Driven Ad Text Generation via Online Feedback Preference Optimization
Link: https://arxiv.org/abs/2507.20227
Authors: Yanda Chen, Zihui Ren, Qixiang Gao, Jiale Chen, Si Chen, Xubin Li, Tiezheng Ge, Bo Zheng
Categories: Information Retrieval (cs.IR)
Comments: 9 pages, 6 figures, 5 tables
Abstract:Advertising text plays a critical role in determining click-through rates (CTR) in online advertising. Large Language Models (LLMs) offer significant efficiency advantages over manual ad text creation. However, LLM-generated ad texts do not guarantee higher CTR performance compared to human-crafted texts, revealing a gap between generation quality and online performance of ad texts. In this work, we propose a novel ad text generation method which optimizes for CTR through preference optimization from online feedback. Our approach adopts an innovative two-stage framework: (1) diverse ad text sampling via one-shot in-context learning, using retrieval-augmented generation (RAG) to provide exemplars with chain-of-thought (CoT) reasoning; (2) CTR-driven preference optimization from online feedback, which weighs preference pairs according to their CTR gains and confidence levels. Through our method, the resulting model enables end-to-end generation of high-CTR ad texts. Extensive experiments have demonstrated the effectiveness of our method in both offline and online metrics. Notably, we have applied our method on a large-scale online shopping platform and achieved significant CTR improvements, showcasing its strong applicability and effectiveness in advertising systems.
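The second stage described in this abstract weights preference pairs by CTR gain and confidence. The sketch below shows one way such a weighting could enter a DPO-style pairwise loss. Both the choice of a DPO-style objective and the weight formula `ctr_gain * confidence` are assumptions for illustration; the abstract specifies neither. All field names are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def weighted_preference_loss(pairs, beta=0.1):
    """Weighted average of DPO-style losses over preference pairs.

    Each pair holds log-probabilities of the higher-CTR ("win") and lower-CTR
    ("lose") ad text under the policy and a frozen reference model, plus a
    hypothetical weight built from CTR gain and statistical confidence.
    """
    total, weight_sum = 0.0, 0.0
    for p in pairs:
        margin = (p["logp_win"] - p["ref_logp_win"]) - (p["logp_lose"] - p["ref_logp_lose"])
        weight = p["ctr_gain"] * p["confidence"]  # assumed weighting scheme
        total += -weight * log_sigmoid(beta * margin)
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0

example_pairs = [
    {"logp_win": -12.0, "ref_logp_win": -13.0, "logp_lose": -11.0,
     "ref_logp_lose": -10.5, "ctr_gain": 0.02, "confidence": 0.9},
]
loss = weighted_preference_loss(example_pairs)
```

With this weighting, pairs whose CTR difference is small or statistically uncertain contribute less to the gradient than pairs with a large, well-measured gain.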
[IR-5] Integrating LLM-Derived Multi-Semantic Intent into Graph Model for Session-based Recommendation
Link: https://arxiv.org/abs/2507.20147
Authors: Shuo Zhang, Xiao Li, Jiayi Wu, Fan Yang, Xiang Li, Ming Gao
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Session-based recommendation (SBR) uses anonymous user interaction sequences to recommend the items a user is most likely to click next. Currently, the most popular and high-performing SBR methods primarily leverage graph neural networks (GNNs), which model session sequences as graph-structured data to effectively capture user intent. However, most GNN-based SBR methods primarily focus on modeling the ID sequence information of session sequences, while neglecting the rich semantic information embedded within them. This limitation significantly hampers the model's ability to accurately infer users' true intentions. To address this challenge, this paper proposes a novel SBR approach called Integrating LLM-Derived Multi-Semantic Intent into Graph Model for Session-based Recommendation (LLM-DMsRec). The method utilizes a pre-trained GNN model to select the top-k items as candidate item sets and designs prompts along with a large language model (LLM) to infer multi-semantic intents from these candidate items. Specifically, we propose an alignment mechanism that effectively integrates the semantic intent inferred by the LLM with the structural intent captured by GNNs. Extensive experiments conducted on the Beauty and ML-1M datasets demonstrate that the proposed method can be seamlessly integrated into GNN frameworks, significantly enhancing their recommendation performance.
[IR-6] A Non-Parametric Choice Model That Learns How Users Choose Between Recommended Options
Link: https://arxiv.org/abs/2507.20035
Authors: Thorsten Krause, Harrie Oosterhuis
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Choice models predict which items users choose from presented options. In recommendation settings, they can infer user preferences while countering exposure bias. In contrast with traditional univariate recommendation models, choice models consider which competitors appeared with the chosen item. This ability allows them to distinguish whether a user chose an item due to preference, i.e., they liked it; or competition, i.e., it was the best available option. Each choice model assumes specific user behavior, e.g., the multinomial logit model. However, it is currently unclear how accurately these assumptions capture actual user behavior, how wrong assumptions impact inference, and whether better models exist. In this work, we propose the learned choice model for recommendation (LCM4Rec), a non-parametric method for estimating the choice model. By applying kernel density estimation, LCM4Rec infers the most likely error distribution that describes the effect of inter-item cannibalization and thereby characterizes the users’ choice model. Thus, it simultaneously infers what users prefer and how they make choices. Our experimental results indicate that our method (i) can accurately recover the choice model underlying a dataset; (ii) provides robust user preference inference, in contrast with existing choice models that are only effective when their assumptions match user behavior; and (iii) is more resistant against exposure bias than existing choice models. Thereby, we show that learning choice models, instead of assuming them, can produce more robust predictions. We believe this work provides an important step towards better understanding users’ choice behavior. 
Cite as: arXiv:2507.20035 [cs.IR] (or arXiv:2507.20035v1 [cs.IR] for this version); DOI: https://doi.org/10.48550/arXiv.2507.20035 (arXiv-issued DOI via DataCite, pending registration). Related DOI: https://doi.org/10.1145/3705328.3748090
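The multinomial logit (MNL) model named in this abstract is a convenient way to see the difference between preference and competition. Under MNL, the probability of choosing an item is a softmax over the utilities of the items presented together, so the same item is chosen more or less often depending on its competitors. The sketch below is the textbook MNL, not LCM4Rec; the utility values are made up for illustration.

```python
import math

def mnl_choice_probs(utilities, presented):
    """Multinomial logit: softmax of utilities restricted to the presented items."""
    top = max(utilities[i] for i in presented)  # subtract the max for stability
    exps = {i: math.exp(utilities[i] - top) for i in presented}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}

# Hypothetical user utilities for three items.
utilities = {"A": 1.0, "B": 1.0, "C": 3.0}

weak_competition = mnl_choice_probs(utilities, ["A", "B"])    # A chosen half the time
strong_competition = mnl_choice_probs(utilities, ["A", "C"])  # A rarely chosen
```

Item A has the same utility in both slates, yet its choice probability drops from 0.5 to about 0.12 when shown next to C. A model that ignores the slate would misread this as a change in preference.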
[IR-7] CleANN: Efficient Full Dynamism in Graph-based Approximate Nearest Neighbor Search
Link: https://arxiv.org/abs/2507.19802
Authors: Ziyu Zhang, Yuanhao Wei, Joshua Engels, Julian Shun
Categories: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Comments:
Abstract:Approximate nearest neighbor search (ANNS) has become a quintessential algorithmic problem underlying various foundational data tasks in AI workloads. Graph-based ANNS indexes have superb empirical trade-offs in indexing cost, query efficiency, and query approximation quality. Most existing graph-based indexes are designed for the static scenario, where there are no updates to the data after the index is constructed. However, full dynamism (insertions, deletions, and searches) is crucial to providing up-to-date responses in applications using vector databases. It is desirable that the index efficiently supports updates and search queries concurrently. Existing dynamic graph-based indexes suffer from at least one of the following problems: (1) the query quality degrades as updates happen; and (2) the graph structure updates used to maintain the index quality upon updates are global and thus expensive. To solve these problems, we propose the CleANN system, which consists of three main components: (1) workload-aware linking of diverse search tree descendants to combat distribution shift; (2) query-adaptive on-the-fly neighborhood consolidation to efficiently handle deleted nodes; and (3) semi-lazy memory cleaning to clean up stale information in the data structure and reduce the work spent by the first two components. We evaluate CleANN on 7 diverse datasets on fully dynamic workloads and find that CleANN has query quality at least as good as if the index had been built statically using the corresponding data. In the in-memory setting using 56 hyper-threads, with all types of queries running concurrently, at the same recall level, CleANN achieves 7-1200x throughput improvement on million-scale real-world datasets. To the best of our knowledge, CleANN is the first concurrent ANNS index to achieve such efficiency while maintaining quality under full dynamism.
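For readers unfamiliar with graph-based ANNS, the sketch below shows a generic beam search over a neighbor graph with lazy deletion via tombstones: deleted nodes are still traversed so the graph stays connected, but they are filtered from the results. This is a common baseline technique and not CleANN's consolidation scheme; the function names, the toy graph, and the parameters `beam_width` and `k` are assumptions for illustration.

```python
import heapq

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def beam_search(vectors, neighbors, deleted, query, entry, beam_width=4, k=2):
    """Greedy beam search; tombstoned nodes are traversed but never returned."""
    visited = {entry}
    d0 = sq_dist(vectors[entry], query)
    frontier = [(d0, entry)]  # min-heap of nodes still to expand
    best = [(-d0, entry)]     # max-heap (negated) of the closest nodes seen
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= beam_width and d > -best[0][0]:
            break  # the closest unexpanded node cannot improve the beam
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = sq_dist(vectors[nb], query)
            if len(best) < beam_width or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > beam_width:
                    heapq.heappop(best)  # drop the farthest candidate
    live = sorted((-nd, n) for nd, n in best if n not in deleted)
    return [n for _, n in live[:k]]

# Toy index: five points on a line, linked as a chain 0-1-2-3-4.
vectors = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,), 4: (4.0,)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = beam_search(vectors, neighbors, deleted={3}, query=(3.2,), entry=0)
```

In the toy chain, node 4 is reachable only through the deleted node 3, so removing node 3 outright without repairing its neighborhood would make node 4 unreachable. Tombstones avoid that, but they accumulate and slow down queries over time, which is the kind of cost that deletion handling and memory cleaning in a dynamic index have to address.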
[IR-8] A Unified Framework for Interactive Visual Graph Matching via Attribute-Structure Synchronization
Link: https://arxiv.org/abs/2507.19750
Authors: Yuhua Liu, Haoxuan Wang, Jiajia Kou, Ling Sun, Heyu Wang, Yongheng Wang, Yigang Wang, Jinchang Lic, Zhiguang Zhou
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:In traditional graph retrieval tools, graph matching is commonly used to retrieve desired graphs from extensive graph datasets according to their structural similarities. However, in real applications, graph nodes have numerous attributes which also contain valuable information for evaluating similarities between graphs. Thus, to achieve superior graph matching results, it is crucial for graph retrieval tools to make full use of the attribute information in addition to structural information. We propose a novel framework for interactive visual graph matching. In the proposed framework, an attribute-structure synchronization method is developed for representing structural and attribute features in a unified embedding space based on Canonical Correlation Analysis (CCA). To support fast and interactive matching, our method provides users with intuitive visual query interfaces for conveniently traversing, filtering, and searching for the target graph in the embedding space. With the designed interfaces, users can also specify a new target graph with desired structural and semantic features. In addition, evaluation views are designed for easy validation and interpretation of the matching results. Case studies and quantitative comparisons on real-world datasets demonstrate the superiority of our proposed framework in graph matching and large graph exploration.