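This post lists the latest papers retrieved from arXiv.org on 2025-09-04. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.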

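Contents

Overview (2025-09-04)

405 papers were updated today, including:

  • Natural language processing: 35 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 108 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 87 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 126 papers (Machine Learning (cs.LG))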

Natural Language Processing

[NLP-0] LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

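Quick read: This paper addresses the lack of general-purpose foundation models for structured data: existing methods typically design a dedicated architecture or training pipeline per task, which makes it hard to transfer and generalize across tabular tasks such as classification, regression, missing-value imputation, and data generation. The key to the solution is LimiX, a large structured-data model (LDM) built on joint-distribution modeling. It treats structured data as a joint distribution over variables and missingness patterns and handles diverse tabular tasks within a single model through query-based conditional prediction. Its core innovation is an episodic, context-conditional pretraining objective that lets the model adapt to new tasks at inference time without retraining, giving strong performance and efficiency across tasks and scenarios.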

Link: https://arxiv.org/abs/2509.03505
Authors: Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui
Affiliations: Stable AI; Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 56 pages

Abstract:We argue that progress toward general intelligence requires complementary foundation models grounded in language, the physical world, and structured data. This report presents LimiX, the first installment of our large structured-data models (LDMs). LimiX treats structured data as a joint distribution over variables and missingness, thus capable of addressing a wide range of tabular tasks through query-based conditional prediction via a single model. LimiX is pretrained using masked joint-distribution modeling with an episodic, context-conditional objective, where the model predicts for query subsets conditioned on dataset-specific contexts, supporting rapid, training-free adaptation at inference. We evaluate LimiX across 10 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios. With a single model and a unified interface, LimiX consistently surpasses strong baselines including gradient-boosting trees, deep tabular networks, recent tabular foundation models, and automated ensembles, as shown in Figure 1 and Figure 2. The superiority holds across a wide range of tasks, such as classification, regression, missing value imputation, and data generation, often by substantial margins, while avoiding task-specific architectures or bespoke training per task. All LimiX models are publicly accessible under Apache 2.0.

[NLP-1] Design and Optimization of Reinforcement Learning-Based Agents in Text-Based Games

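Quick read: This paper addresses how reinforcement learning (RL) can improve an agent's decision-making and task-completion efficiency in text-based games. The key to the solution is a deep-learning world model that processes game text and represents the environment state, followed by a policy-gradient deep RL method that trains the agent to map from state values to an optimal policy. The authors report significantly higher game completion ratios and win rates than previous agents, providing new understanding and empirical grounding for RL in text-interaction settings.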

Link: https://arxiv.org/abs/2509.03479
Authors: Haonan Wang, Mingjia Zhao, Junfeng Sun, Wei Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 6 pages

Abstract: As AI technology advances, research in playing text-based games with agents has become progressively popular. In this paper, a novel approach to agent design and agent learning is presented with the context of reinforcement learning. A model of deep learning is first applied to process game text and build a world model. Next, the agent is learned through a policy gradient-based deep reinforcement learning method to facilitate conversion from state value to optimal policy. The enhanced agent works better in several text-based game experiments and significantly surpasses previous agents on game completion ratio and win rate. Our study introduces novel understanding and empirical ground for using reinforcement learning for text games and sets the stage for developing and optimizing reinforcement learning agents for more general domains and problems.

[NLP-2] Continuous Saudi Sign Language Recognition: A Vision Transformer Approach

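Quick read: This paper addresses the scarcity of translation resources and the limited recognition accuracy for Saudi Sign Language (SSL), in particular the fact that existing methods mostly target non-Arabic sign languages and that datasets mostly contain isolated words rather than continuous sentences. The key to the solution is KAU-CSSL, the first continuous-sentence SSL dataset, together with a Transformer-based hybrid model that uses a pretrained ResNet-18 for spatial feature extraction and a Transformer encoder with a bidirectional LSTM for temporal dependencies. The model reaches 99.02% accuracy in signer-dependent mode and 77.71% in signer-independent mode, providing a foundation for SSL recognition and translation systems and for accessible sign language communication.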

Link: https://arxiv.org/abs/2509.03467
Authors: Soukeina Elhassen, Lama Al Khuzayem, Areej Alhothali, Ohoud Alzamzami, Nahed Alowaidi
Affiliations: King Abdulaziz University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 13 figures, 5 tables

Abstract:Sign language (SL) is an essential communication form for hearing-impaired and deaf people, enabling engagement within the broader society. Despite its significance, limited public awareness of SL often leads to inequitable access to educational and professional opportunities, thereby contributing to social exclusion, particularly in Saudi Arabia, where over 84,000 individuals depend on Saudi Sign Language (SSL) as their primary form of communication. Although certain technological approaches have helped to improve communication for individuals with hearing impairments, there continues to be an urgent requirement for more precise and dependable translation techniques, especially for Arabic sign language variants like SSL. Most state-of-the-art solutions have primarily focused on non-Arabic sign languages, resulting in a considerable absence of resources dedicated to Arabic sign language, specifically SSL. The complexity of the Arabic language and the prevalence of isolated sign language datasets that concentrate on individual words instead of continuous speech contribute to this issue. To address this gap, our research represents an important step in developing SSL resources. To address this, we introduce the first continuous Saudi Sign Language dataset called KAU-CSSL, focusing on complete sentences to facilitate further research and enable sophisticated recognition systems for SSL recognition and translation. Additionally, we propose a transformer-based model, utilizing a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy at signer dependent mode and 77.71% accuracy at signer independent mode. This development leads the way to not only improving communication tools for the SSL community but also making a substantial contribution to the wider field of sign language.

[NLP-3] Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

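Quick read: This paper addresses the reliability of large language models (LLMs) acting as judges on complex tasks, especially the evaluation distortion caused by Auxiliary Information Induced Biases. The key to the solution is ComplexEval, a challenge benchmark that systematically exposes and quantifies these biases. By validating 6 previously unexplored biases across 12 basic and 3 advanced scenarios, the authors show that all evaluated models are significantly susceptible to them and that bias magnitude grows with task complexity. Notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. These findings offer insights for improving the accuracy and verifiability of evaluation signals and for building more general and robust evaluation models.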

Link: https://arxiv.org/abs/2509.03419
Authors: Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 4 figures, conference

Abstract:As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks–where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical–remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.

[NLP-4] Learning Mechanism Underlying NLP Pre-Training and Fine-Tuning

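Quick read: This paper investigates the mechanism underlying successful NLP pre-training and its relationship to fine-tuning performance on downstream classification tasks. The central finding is that pre-training breaks the symmetry among tokens and groups them into finite, small clusters of strong match tokens, a feature that sharpens along the transformer blocks toward the output layer. Although the cost function targets only single-token prediction, higher-order language structures emerge, and this is reflected in improved fine-tuning accuracy along the transformer blocks. The accuracy per token (APT), averaged over all tokens, serves as an order parameter that quantifies pre-training success and increases along the transformer blocks.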

Link: https://arxiv.org/abs/2509.03407
Authors: Yarden Tzach, Ronit D. Gross, Ella Koresh, Shalom Rosner, Or Shpringer, Tal Halevi, Ido Kanter
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 46 pages, 18 figures, 10 tables

Abstract:Natural language processing (NLP) enables the understanding and generation of meaningful human language, typically using a pre-trained complex architecture on a large dataset to learn the language and next fine-tune its weights to implement a specific task. Twofold goals are examined; to understand the mechanism underlying successful pre-training and to determine the interplay between the pre-training accuracy and the fine-tuning of classification tasks. The following main results were obtained; the accuracy per token (APT) increased with its appearance frequency in the dataset, and its average over all tokens served as an order parameter to quantify pre-training success, which increased along the transformer blocks. Pre-training broke the symmetry among tokens and grouped them into finite, small, strong match token clusters, as inferred from the presented token confusion matrix. This feature was sharpened along the transformer blocks toward the output layer, enhancing its performance considerably compared with that of the embedding layer. Consequently, higher-order language structures were generated by pre-training, even though the learning cost function was directed solely at identifying a single token. These pre-training findings were reflected by the improved fine-tuning accuracy along the transformer blocks. Additionally, the output label prediction confidence was found to be independent of the average input APT, as the input meaning was preserved since the tokens are replaced primarily by strong match tokens. Finally, although pre-training is commonly absent in image classification tasks, its underlying mechanism is similar to that used in fine-tuning NLP classification tasks, hinting at its universality. The results were based on the BERT-6 architecture pre-trained on the Wikipedia dataset and fine-tuned on the FewRel and DBpedia classification tasks.
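
The accuracy per token (APT) and its average can be illustrated with a small sketch. It assumes that the APT of a token is the fraction of its occurrences that the model predicts correctly, and that the order parameter is the unweighted mean over distinct tokens. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_per_token(targets, predictions):
    """Return {token: fraction of its occurrences predicted correctly}."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for target, predicted in zip(targets, predictions):
        counts[target] += 1
        if predicted == target:
            hits[target] += 1
    return {token: hits[token] / counts[token] for token in counts}


def mean_apt(apt):
    """Unweighted average of APT over distinct tokens (assumed order parameter)."""
    return sum(apt.values()) / len(apt)


# Toy data: ground-truth tokens and the model's predictions at the same positions.
targets = ["the", "cat", "the", "sat", "the", "cat"]
predictions = ["the", "dog", "the", "sat", "a", "cat"]

apt = accuracy_per_token(targets, predictions)
order_parameter = mean_apt(apt)
print(apt, order_parameter)
```

Computing the same quantity from the outputs of each transformer block would give the per-block curve that the abstract describes as increasing toward the output layer.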

[NLP-5] LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations ACL

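Quick read: This paper addresses how language models (LMs) turn pretraining data into representations of knowledge and beliefs about the world, with the goal of enabling knowledge representations that are more consistent, robust, and complete. The key to the solution is LMEnt, a suite for analyzing knowledge acquisition during pretraining, with three components: (1) a knowledge-rich pretraining corpus based on Wikipedia and fully annotated with entity mentions; (2) an entity-based retrieval method that outperforms previous approaches by as much as 80.4%; and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, whose performance on knowledge benchmarks is comparable to popular open-source models. The suite provides a controlled environment for studying the link between entity mentions in pretraining and downstream performance, as well as the effects of causal interventions in pretraining data, supporting work on knowledge representations, plasticity, editing, attribution, and learning dynamics.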

Link: https://arxiv.org/abs/2509.03405
Authors: Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva
Affiliations: Tel Aviv University; Mila – Quebec AI Institute & McGill University
Categories: Computation and Language (cs.CL)
Comments: Submitted to TACL, August 2025

Abstract:Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world, are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.

[NLP-6] Situating AI Agents in their World: Aspective Agentic AI for Dynamic Partially Observable Information Systems

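Quick read: This paper addresses information leakage in current LLM-based agentic systems, where a typical architecture leaks information up to 83% of the time. The key to the solution is a bottom-up framework that situates AI agents in their environment, with all behaviors triggered by environmental changes. It introduces the notion of aspects, similar to the idea of umwelt, in which sets of agents perceive their environment differently from one another, so that each agent works within its own information niche. In the authors' illustrative implementation this design yields zero information leakage, and they anticipate improvements to both security and efficiency.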

Link: https://arxiv.org/abs/2509.03380
Authors: Peter J. Bentley, Soo Ling Lim, Fuyuki Ishikawa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages

Abstract:Agentic LLM AI agents are often little more than autonomous chatbots: actors following scripts, often controlled by an unreliable director. This work introduces a bottom-up framework that situates AI agents in their environment, with all behaviors triggered by changes in their environments. It introduces the notion of aspects, similar to the idea of umwelt, where sets of agents perceive their environment differently to each other, enabling clearer control of information. We provide an illustrative implementation and show that compared to a typical architecture, which leaks up to 83% of the time, aspective agentic AI enables zero information leakage. We anticipate that this concept of specialist agents working efficiently in their own information niches can provide improvements to both security and efficiency.

[NLP-7] Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning

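Quick read: This paper addresses the fact that evaluations of reasoning in large language models (LLMs) focus almost exclusively on deductive reasoning and neglect inductive and abductive reasoning, which are equally essential for real-world problem solving but less explored. The key to the solution is InAbHyD (Inductive and Abductive Hypothesis Dataset), a programmable, synthetic dataset in which each example consists of an incomplete world model and a set of observations, and the agent must produce hypotheses that explain the observations. The authors also propose a new metric, based on Occam's Razor, for evaluating hypothesis quality, enabling a systematic assessment of LLMs on these reasoning tasks.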

Link: https://arxiv.org/abs/2509.03345
Authors: Yunxin Sun, Abulhair Saparov
Affiliations: Purdue University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reasoning is a core capability in artificial intelligence systems, for which large language models (LLMs) have recently shown remarkable progress. However, most work focuses exclusively on deductive reasoning, which is problematic since other types of reasoning are also essential in solving real-world problems, and they are less explored. This work focuses on evaluating LLMs’ inductive and abductive reasoning capabilities. We introduce a programmable and synthetic dataset, InAbHyD (pronounced in-a-bid), where each reasoning example consists of an incomplete world model and a set of observations. The task for the intelligent agent is to produce hypotheses to explain observations under the incomplete world model to solve each reasoning example. We propose a new metric to evaluate the quality of hypotheses based on Occam’s Razor. We evaluate and analyze some state-of-the-art LLMs. Our analysis shows that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and producing high-quality hypotheses, even with popular reasoning-enhancing techniques such as in-context learning and RLVR.

[NLP-8] SESGO: Spanish Evaluation of Stereotypical Generative Outputs

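Quick read: This paper addresses a significant gap in bias evaluation for multilingual large language models (LLMs), specifically bias detection for Spanish in culturally aware Latin American contexts. Current evaluations remain predominantly US-English-centric, leaving potential harms in other linguistic and cultural contexts underexamined. The key to the solution is a new, culturally grounded bias detection framework that adapts the underspecified question methodology of the BBQ dataset, incorporating regional sayings and expressions that encode stereotypes across four social categories: gender, race, socioeconomic class, and national origin. The authors also design a new metric that combines accuracy with the direction of error to balance model performance and bias alignment in both ambiguous and disambiguated contexts. They present what is, to their knowledge, the first systematic evaluation of how leading commercial LLMs respond to culturally specific bias in Spanish, revealing varying bias patterns across models and showing that bias mitigation techniques optimized for English do not transfer effectively to Spanish tasks.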

Link: https://arxiv.org/abs/2509.03329
Authors: Melissa Robles, Catalina Bernal, Denniss Raigoso, Mateo Dulce Rubio
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:This paper addresses the critical gap in evaluating bias in multilingual Large Language Models (LLMs), with a specific focus on Spanish language within culturally-aware Latin American contexts. Despite widespread global deployment, current evaluations remain predominantly US-English-centric, leaving potential harms in other linguistic and cultural contexts largely underexamined. We introduce a novel, culturally-grounded framework for detecting social biases in instruction-tuned LLMs. Our approach adapts the underspecified question methodology from the BBQ dataset by incorporating culturally-specific expressions and sayings that encode regional stereotypes across four social categories: gender, race, socioeconomic class, and national origin. Using more than 4,000 prompts, we propose a new metric that combines accuracy with the direction of error to effectively balance model performance and bias alignment in both ambiguous and disambiguated contexts. To our knowledge, our work presents the first systematic evaluation examining how leading commercial LLMs respond to culturally specific bias in the Spanish language, revealing varying patterns of bias manifestation across state-of-the-art models. We also contribute evidence that bias mitigation techniques optimized for English do not effectively transfer to Spanish tasks, and that bias patterns remain largely consistent across different sampling temperatures. Our modular framework offers a natural extension to new stereotypes, bias categories, or languages and cultural contexts, representing a significant step toward more equitable and culturally-aware evaluation of AI systems in the diverse linguistic environments where they operate.

[NLP-9] AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

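Quick read: This paper addresses failure attribution in multi-agent systems: pinpointing the specific agent or step responsible for an error within long, complex execution traces. Current reasoning LLMs generally achieve accuracy below 10% on this task. The key to the solution is AgenTracer, a framework that automatically annotates failed trajectories through counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. On top of it, the authors train AgenTracer-8B, a lightweight failure tracer optimized with multi-granular reinforcement learning. On the WhoWhen benchmark it outperforms large proprietary models such as Gemini-2.5-Pro and Claude-4-Sonnet, and it gives actionable feedback to off-the-shelf multi-agent systems such as MetaGPT and MaAS, yielding 4.8-14.2% performance gains and supporting self-correcting, self-evolving agents.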

Link: https://arxiv.org/abs/2509.03312
Authors: Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the WhoWhen benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
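
The idea of counterfactual replay can be sketched in a few lines. This is only a toy illustration under strong assumptions, not the AgenTracer implementation: a trajectory is a list of (agent, action) pairs, `succeeds` is a hypothetical checker for the final outcome, `oracle_action` is a hypothetical source of corrected actions, and the recorded later steps are reused as-is after a correction, whereas a real system would re-execute them.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (agent name, action)


def attribute_failure(
    trajectory: List[Step],
    succeeds: Callable[[List[Step]], bool],
    oracle_action: Callable[[int, Step], str],
) -> Optional[Tuple[str, int]]:
    """Return (agent, step index) of the earliest step whose correction
    turns a failed run into a successful one, or None if there is none."""
    if succeeds(trajectory):
        return None  # the run did not fail, so there is nothing to attribute
    for index, (agent, action) in enumerate(trajectory):
        corrected = list(trajectory)
        corrected[index] = (agent, oracle_action(index, (agent, action)))
        if succeeds(corrected):
            return agent, index
    return None  # no single-step correction repairs the run


def succeeds(trajectory: List[Step]) -> bool:
    # Toy success check: the run passes only if no step produced a "bug".
    return all(action != "bug" for _, action in trajectory)


def oracle_action(index: int, step: Step) -> str:
    # Toy oracle: the corrected action is always "ok".
    return "ok"


failed_run = [("planner", "ok"), ("coder", "bug"), ("tester", "ok")]
culprit = attribute_failure(failed_run, succeeds, oracle_action)
print(culprit)
```

Scanning from the first step means the earliest decisive error is reported, which fits the goal of naming both the responsible agent and the step at which the run went wrong.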

[NLP-10] LatPhon: Lightweight Multilingual G2P for Romance Languages and English

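Quick read: This paper addresses general-purpose, efficient grapheme-to-phoneme (G2P) conversion across multiple Latin-script languages, which serves as a front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST), and related speech systems. The key to the solution is LatPhon, a lightweight, jointly trained Transformer with only 7.5M parameters, trained on six Latin-script languages (English, Spanish, French, Italian, Portuguese, and Romanian). It attains a mean phoneme error rate (PER) of 3.5%, outperforming the byte-level ByT5 baseline (5.4%) and approaching language-specific weighted finite-state transducers (WFSTs, 3.2%), while occupying only 30 MB of memory, which makes on-device deployment feasible.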

Link: https://arxiv.org/abs/2509.03300
Authors: Luis Felipe Chary, Miguel Arjona Ramirez
Affiliations: Escola Politécnica, Universidade de São Paulo
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, especially across multiple Latin-script languages. We present LatPhon, a 7.5M-parameter Transformer jointly trained on six such languages: English, Spanish, French, Italian, Portuguese, and Romanian. On the public ipa-dict corpus, it attains a mean phoneme error rate (PER) of 3.5%, outperforming the byte-level ByT5 baseline (5.4%) and approaching language-specific WFSTs (3.2%) while occupying 30 MB of memory, which makes on-device deployment feasible when needed. These results indicate that compact multilingual G2P can serve as a universal front-end for Latin-language speech pipelines.
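
The phoneme error rate (PER) quoted above is commonly computed as the edit distance between predicted and reference phoneme sequences divided by the number of reference phonemes. The sketch below assumes that definition and pools all words before dividing; how the paper averages across words and languages is not stated here, and the toy transcriptions are made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def phoneme_error_rate(pairs):
    """Total edits over total reference phonemes for (reference, hypothesis) pairs."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_reference = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_reference


# Toy (reference, hypothesis) phoneme sequences: one exact match,
# one substitution, and one deletion.
pairs = [
    (["k", "æ", "t"], ["k", "æ", "t"]),
    (["d", "ɔ", "g"], ["d", "ɑ", "g"]),
    (["f", "ɪ", "ʃ"], ["f", "ɪ"]),
]
per = phoneme_error_rate(pairs)
print(f"PER = {per:.3f}")
```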

[NLP-11] Comparison of End-to-end Speech Assessment Models for the NOCASA 2025 Challenge

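Quick read: This paper addresses automatic word-level pronunciation assessment for children learning Norwegian as a second language. The key to the solution is a new model that integrates alignment-free goodness-of-pronunciation (GOP) features computed via connectionist temporal classification (CTC), together with a weighted ordinal cross-entropy loss tailored to optimize metrics such as unweighted average recall and mean absolute error. This GOP-CTC-based model substantially surpasses the challenge baselines and attains top leaderboard scores.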

Link: https://arxiv.org/abs/2509.03256
Authors: Aleksei Žavoronkov, Tanel Alumäe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Published at IEEE MLSP 2025

Abstract:This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, aimed at automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
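
The two metrics named in the abstract have standard definitions that are easy to state in code. The sketch below implements unweighted average recall (the plain mean of per-class recall) and mean absolute error on integer scores. The toy labels are hypothetical, and the paper's weighted ordinal cross-entropy loss itself is not reproduced here.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally
    regardless of how many examples it has."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy pronunciation scores with an imbalanced class distribution.
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 2, 2, 3, 3]

uar = unweighted_average_recall(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(uar, mae)
```

The pairing is informative for ordinal scores: recall averaged per class rewards getting rare classes right, while the absolute error penalizes predictions that land far from the true score more than near misses.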

[NLP-12] SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala

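Quick read: This paper addresses the insufficient evaluation of large language models (LLMs) on low-resource languages and culturally specific content: under English-dominated evaluation, languages such as Sinhala lack dedicated benchmarks. The key to the solution is SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala. It contains over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, including both general academic and culturally grounded knowledge. Evaluating 26 LLMs on the benchmark shows that current models perform poorly in culturally rich domains such as the Humanities, underscoring the need to better adapt LLMs to low-resource and culturally specific contexts.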

Link: https://arxiv.org/abs/2509.03162
Authors: Ashmari Pramodya, Nirasha Nelki, Heshan Shalinda, Chamila Liyanage, Yusuke Sakai, Randil Pushpananda, Ruvan Weerasinghe, Hidetaka Kamigaito, Taro Watanabe
Affiliations: Nara Institute of Science and Technology; University of Colombo School of Computing; Informatics Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 11 figures

Abstract:Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

[NLP-13] Domain Adaptation of LLMs for Process Data

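Quick read: This paper addresses efficiency and performance bottlenecks in applying large language models (LLMs) to process mining (PM): how to adapt pretrained LLMs directly to process data, without reformulating event logs in natural language, in order to improve predictive performance. The key to the solution is parameter-efficient fine-tuning (PEFT), which lets LLMs operate directly on process sequence data such as event logs. The results indicate a potential improvement in predictive performance over state-of-the-art recurrent neural network (RNN) approaches and narrative-style solutions, particularly in the multi-task setting, along with faster convergence and significantly less hyperparameter optimization.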

Link: https://arxiv.org/abs/2509.03161
Authors: Rafael Seidi Oyamada, Jari Peeperkorn, Jochen De Weerdt, Johannes De Smedt
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, Large Language Models (LLMs) have emerged as a prominent area of interest across various research domains, including Process Mining (PM). Current applications in PM have predominantly centered on prompt engineering strategies or the transformation of event logs into narrative-style datasets, thereby exploiting the semantic capabilities of LLMs to address diverse tasks. In contrast, this study investigates the direct adaptation of pretrained LLMs to process data without natural language reformulation, motivated by the fact that these models excel in generating sequences of tokens, similar to the objective in PM. More specifically, we focus on parameter-efficient fine-tuning techniques to mitigate the computational overhead typically associated with such models. Our experimental setup focuses on Predictive Process Monitoring (PPM), and considers both single- and multi-task predictions. The results demonstrate a potential improvement in predictive performance over state-of-the-art recurrent neural network (RNN) approaches and recent narrative-style-based solutions, particularly in the multi-task setting. Additionally, our fine-tuned models exhibit faster convergence and require significantly less hyperparameter optimization.
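
Feeding process data to a sequence model without natural-language reformulation can be pictured as follows. The sketch turns each trace of an event log into integer token ids and builds (prefix, next activity) training pairs, a common setup for next-activity prediction in predictive process monitoring. The vocabulary scheme, the `<eos>` marker, and the toy log are assumptions for illustration and are not the paper's pipeline.

```python
def build_vocabulary(traces):
    """Map each distinct activity label to an integer id; id 0 marks end of trace."""
    vocabulary = {"<eos>": 0}
    for trace in traces:
        for activity in trace:
            vocabulary.setdefault(activity, len(vocabulary))
    return vocabulary


def prefix_examples(traces, vocabulary):
    """Return (prefix ids, next id) pairs for every non-empty prefix of every trace."""
    examples = []
    for trace in traces:
        ids = [vocabulary[activity] for activity in trace] + [vocabulary["<eos>"]]
        for cut in range(1, len(ids)):
            examples.append((ids[:cut], ids[cut]))
    return examples


# Toy event log: each trace is the ordered list of activities of one case.
traces = [
    ["register", "check", "approve"],
    ["register", "reject"],
]
vocabulary = build_vocabulary(traces)
examples = prefix_examples(traces, vocabulary)
for prefix, target in examples:
    print(prefix, "->", target)
```

A multi-task variant would attach further targets to each prefix, for example remaining time alongside the next activity.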

[NLP-14] Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader

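Quick read: This paper addresses the scarcity of resources for evaluating machine translation (MT) for Romansh. The key to the solution is a benchmark covering six Romansh varieties (the supra-regional variety Rumantsch Grischun and five regional varieties). The reference translations were produced by human translators and are based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages and provides a reliable, standardized basis for evaluating multilingual MT systems.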

Link: https://arxiv.org/abs/2509.03148
Authors: Jannis Vamvas, Ignacio Pérez Prat, Not Battesta Soliva, Sandra Baltermia-Guetg, Andrina Beeli, Simona Beeli, Madlaina Capeder, Laura Decurtins, Gian Peder Gregori, Flavia Hobi, Gabriela Holderegger, Arina Lazzarini, Viviana Lazzarini, Walter Rosselli, Bettina Vital, Anna Rutkiewicz, Rico Sennrich
Affiliations: University of Zurich; Lia Rumantscha
Categories: Computation and Language (cs.CL)
Comments: Submitted to WMT25 (Open Language Data Initiative Shared Task)

Abstract:The Romansh language, spoken in Switzerland, has limited resources for machine translation evaluation. In this paper, we present a benchmark for six varieties of Romansh: Rumantsch Grischun, a supra-regional variety, and five regional varieties: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. Our reference translations were created by human translators based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages. An automatic evaluation of existing MT systems and LLMs shows that translation out of Romansh into German is handled relatively well for all the varieties, but translation into Romansh is still challenging.

[NLP-15] An experimental and computational study of an Estonian single-person word naming

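Quick read: This paper addresses the prediction of lexical processing, specifically whether measures generated by a computational model of the mental lexicon (the Discriminative Lexicon Model, DLM) can predict eye-movement and speech response variables in a word naming task. The approach has two parts. First, a large-scale single-subject experiment with eye-tracking collects five response variables: first fixation duration, total fixation duration, number of fixations, word naming latency, and spoken word duration. Second, generalized additive models compare the predictive power of DLM-based measures with classical predictors such as word frequency, neighborhood size, and inflectional paradigm size. DLM-based measures are powerful predictors for variables such as total fixation duration and naming latency, which indicates that meaning is heavily involved in word naming, and DLM measures based on deep learning are not necessarily more precise than those based on linear mappings.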

Link: https://arxiv.org/abs/2509.03143
Authors: Kaidi Lõo, Arvi Tavast, Maria Heitmeier, Harald Baayen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This study investigates lexical processing in Estonian. A large-scale single-subject experiment is reported that combines the word naming task with eye-tracking. Five response variables (first fixation duration, total fixation duration, number of fixations, word naming latency, and spoken word duration) are analyzed with the generalized additive model. Of central interest is the question of whether measures for lexical processing generated by a computational model of the mental lexicon (the Discriminative Lexicon Model, DLM) are predictive for these response variables, and how they compare to classical predictors such as word frequency, neighborhood size, and inflectional paradigm size. Computational models were implemented both with linear and deep mappings. Central findings are, first, that DLM-based measures are powerful predictors for lexical processing, second, that DLM-measures using deep learning are not necessarily more precise predictors of lexical processing than DLM-measures using linear mappings, third, that classical predictors tend to provide somewhat more precise fits compared to DLM-based predictors (except for total fixation duration, where the two provide equivalent goodness of fit), and fourth, that in the naming task lexical variables are not predictive for first fixation duration and the total number of fixations. As the DLM works with mappings from form to meaning, the predictivity of DLM-based measures for total fixation duration, naming latencies, and spoken word duration indicates that meaning is heavily involved in the present word naming task.

[NLP-16] From Evaluation to Defense: Constructing Persistent Edit-Based Fingerprints for Large Language Models

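Quick read: This paper addresses the performance degradation, high computational cost, and poor persistence of fingerprint injection techniques used for intellectual property (IP) protection of large language models (LLMs). Existing methods inject fingerprints through instruction tuning, which often degrades model performance and does not survive large-scale fine-tuning well. The paper applies knowledge editing to fingerprint injection for the first time as a lightweight, efficient alternative. Its key innovation is Fingerprint Subspace-aware Fine-Tuning (FSFT), which reduces fingerprint degradation by constraining updates to the fingerprint subspace and outperforms standard fine-tuning by 10% even in the worst case. The study also finds that fingerprint-injected models struggle to distinguish fingerprints from similar texts at the feature level, highlighting the need for more robust and fine-grained fingerprint injection mechanisms.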

链接: https://arxiv.org/abs/2509.03122
作者: Yue Li,Xin Yi,Dongsheng Shi,Yongyi Cui,Gerard de Melo,Xiaoling Wang,Linlin Wang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. University of Amsterdam (阿姆斯特丹大学); 3. Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:The intellectual property (IP) protection of Large Language Models (LLMs) is increasingly critical. Injecting specialized fingerprints into LLMs through instruction tuning is a common IP protection technique. However, this may significantly degrade model performance, requires substantial computational resources, and exhibits poor persistence under model modifications. We argue that knowledge editing offers a lightweight alternative that is more suitable for fingerprint injection. Accordingly, we apply knowledge editing to fingerprint injection for the first time and demonstrate its strong capability. Despite using scrambled text as fingerprints to prevent them from being overwritten during fine-tuning, degradation still occurs under large-scale fine-tuning. To address this, we propose Fingerprint Subspace-aware Fine-Tuning (FSFT), which reduces fingerprint degradation by constraining the update of the fingerprint subspace. The performance of FSFT exceeds fine-tuning by 10% even in the worst-case scenario. Additionally, we observe that the fingerprint-injected models struggle to distinguish between fingerprints and similar texts due to the high similarity of their features. This finding underscores the urgent need for more robust and fine-grained fingerprinting injection methods for LLMs.

[NLP-17] Measuring Scalar Constructs in Social Science with LLMs EMNLP2025

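Quick read: This paper asks how large language models (LLMs) can best be used to measure scalar constructs in social science, that is, constructs with a naturally continuous semantic structure such as language complexity or emotionality. Pointwise scores produced by direct prompting have discontinuous distributions that bunch at arbitrary numbers, which lowers measurement quality. The study systematically evaluates four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and fine-tuning of smaller models. The key finding is that weighting pointwise scores by token probability clearly improves the continuity and accuracy of the measurements. In addition, fine-tuning smaller models on as few as 1,000 training pairs can match or exceed prompted LLMs, which gives applied researchers an efficient and reliable alternative.

Link: https://arxiv.org/abs/2509.03116
Authors: Hauke Licht,Rupak Sarkar,Patrick Y. Wu,Pranav Goel,Niklas Stoehr,Elliott Ash,Alexander Miserlis Hoyle
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 (Main)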

Abstract:Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just “simple” or “complex,” but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study yields actionable findings for applied researchers. First, LLMs prompted to generate pointwise scores directly from texts produce discontinuous distributions with bunching at arbitrary numbers. The quality of the measurements improves with pairwise comparisons made by LLMs, but it improves even more by taking pointwise scores and weighting them by token probability. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.
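
The token-probability weighting mentioned in the abstract can be illustrated with a minimal sketch. It assumes the model exposes log-probabilities for each candidate score token at the answer position; the function name and the example values are hypothetical and are not taken from the paper.

```python
import math

def probability_weighted_score(token_logprobs):
    """Expected score under the model's distribution over score tokens.

    token_logprobs maps each candidate score token (e.g. "1".."5") to its
    log-probability at the answer position (hypothetical input format).
    """
    # Convert to probabilities and renormalize over the score tokens only,
    # since non-score tokens may hold some of the probability mass.
    probs = {int(tok): math.exp(lp) for tok, lp in token_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p / total for score, p in probs.items())

# Made-up log-probabilities for a 1-5 complexity scale.
example = {
    "1": math.log(0.05),
    "2": math.log(0.10),
    "3": math.log(0.40),
    "4": math.log(0.35),
    "5": math.log(0.10),
}
direct_score = max(example, key=example.get)        # most likely token only
weighted_score = probability_weighted_score(example)  # continuous estimate
```

Here the direct score collapses to the single most likely integer, while the weighted score lands between 3 and 4, which is the kind of continuous output the abstract argues for.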

[NLP-18] Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

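Quick read: This paper targets hallucinations in multimodal large language models (MLLMs), which it attributes to two biases: text-visual bias, where the model over-relies on text information when making decisions, and co-occurrence bias, which stems from statistical object-pairing patterns in the training data. The key to the solution is a gradient-based self-reflection method that estimates how much each token type (visual, prompt, and previous outputs) influences the generated output. The influence estimates are then used to detect object-related visual tokens, which are integrated into an influence-aware contrastive decoding framework that mitigates both biases at once. The method needs no additional resources such as costly fine-tuning, extra models, or data statistics, and experiments show that it effectively reduces hallucinations, with up to a 92% accuracy increase on LLaVA-QA90.

Link: https://arxiv.org/abs/2509.03113
Authors: Shan Wang,Maying Shen,Nadine Chang,Chuong Nguyen,Hongdong Li,Jose M. Alvarez
Affiliations: NVIDIA; Data61, CSIRO; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: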

Abstract:Hallucinations in multimodal large language models are caused by the text-visual bias and the co-occurrence bias. The former reflects an over-reliance on text information in the decision-making process, while the latter arises from the statistical object-pairing patterns abstracted from the training data. Existing mitigation methods heuristically address these biases without understanding the fluctuating bias level across the instances. We first propose estimating the influence of respective token types (visual, prompt, and previous outputs) using a gradient-based self-reflection method. The estimated token influence further enables the detection of object-related visual tokens and their integration into an influence-aware contrastive decoding framework to mitigate both types of biases simultaneously. Our method operates without the need for additional resources, such as costly fine-tuning, extra models, or data statistics. Extensive experiments show it effectively reduces hallucinations, achieving up to a 92% accuracy increase on LLaVA-QA90.
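
For readers unfamiliar with contrastive decoding, the following is a minimal sketch of the generic idea only, not the paper's influence-aware variant. It assumes two sets of next-token logits are available, one from the full input and one from a view that exaggerates the bias (for example, with the relevant visual tokens suppressed); the vocabulary, logit values, and `alpha` are hypothetical.

```python
def contrastive_logits(full_logits, biased_logits, alpha=1.0):
    """Boost what the full input supports, penalize what the biased view supports."""
    return [(1 + alpha) * f - alpha * b for f, b in zip(full_logits, biased_logits)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

vocab = ["cat", "dog", "frisbee"]
full_logits = [2.0, 2.2, 0.5]     # made-up logits given image and prompt
biased_logits = [0.5, 2.5, 0.4]   # made-up logits when the text prior dominates

adjusted = contrastive_logits(full_logits, biased_logits, alpha=1.0)
plain_choice = vocab[argmax(full_logits)]
contrastive_choice = vocab[argmax(adjusted)]
```

In this toy case the token favored mainly by the biased view ("dog") is demoted, and the token that the full input supports more strongly than the biased view ("cat") is selected instead.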

[NLP-19] A Long Short-Term Memory (LSTM) Model for Business Sentiment Analysis Based on Recurrent Neural Network

Quick read: This paper addresses the performance limits that the vanishing gradient problem imposes on conventional recurrent neural networks (RNNs) in business sentiment analysis (BSA). The solution is a modified long short-term memory (LSTM) model whose gating mechanism mitigates vanishing gradients and so improves the modeling of sequential data. On a product review dataset, the model reaches about 91.33% accuracy and outperforms the conventional RNN models it is compared against, so it can identify more reliably which products customers like or dislike and help businesses evaluate their marketing strategy.

Link: https://arxiv.org/abs/2509.03060
Authors: Md. Jahidul Islam Razin,Md. Abdul Karim,M. F. Mridha,S M Rafiuddin,Tahira Alam
Affiliations: University of Asia Pacific; Bangladesh University of Business and Technology
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 9 figures, 3 tables, published in Sustainable Communication Networks and Application: Proceedings of ICSCN 2020 (2021). Paper presents an LSTM-based business sentiment analysis model with 91.33% accuracy, compares against KNN, SVM, and Naive Bayes, and discusses methodology, dataset, training/testing, results, and implementation tools

Abstract:Business sentiment analysis (BSA) is one of the significant and popular topics of natural language processing. It is one kind of sentiment analysis techniques for business purposes. Different categories of sentiment analysis techniques like lexicon-based techniques and different types of machine learning algorithms are applied for sentiment analysis on different languages like English, Hindi, Spanish, etc. In this paper, long short-term memory (LSTM) is applied for business sentiment analysis, where a recurrent neural network is used. An LSTM model is used in a modified approach to prevent the vanishing gradient problem rather than applying the conventional recurrent neural network (RNN). To apply the modified RNN model, product review dataset is used. In this experiment, 70% of the data is trained for the LSTM and the rest 30% of the data is used for testing. The result of this modified RNN model is compared with other conventional RNN models, and a comparison is made among the results. It is noted that the proposed model performs better than the other conventional RNN models. Here, the proposed model, i.e., the modified RNN model approach has achieved around 91.33% of accuracy. By applying this model, any business company or e-commerce business site can identify the feedback from their customers about different types of products that customers like or dislike. Based on the customer reviews, a business company or e-commerce platform can evaluate its marketing strategy.

[NLP-20] Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models

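Quick read: This paper addresses parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. The key to the solution is an adapter-based fine-tuning method that introduces differentiable gating functions and structural sparsity control variables to automatically optimize adapter insertion points, activation paths, and module combinations, so that the model can adjust its structure to the characteristics of each task in multi-task settings. With the backbone parameters frozen, a structure search mechanism dynamically builds efficient task-specific substructures during training, which improves parameter utilization and representational capacity.

Link: https://arxiv.org/abs/2509.03057
Authors: Ming Gong,Yingnan Deng,Nia Qi,Yujun Zou,Zhihao Xue,Yun Zi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: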

Abstract:This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.
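
As a rough illustration of how a differentiable gate can switch an adapter on or off, here is a minimal sketch. It assumes a sigmoid gate on a residual bottleneck adapter and an L1-style penalty on the gate values; these choices, and all names and numbers, are assumptions for illustration and may differ from the paper's formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def gated_adapter(hidden, down, up, gate_logit):
    """Residual bottleneck adapter scaled by a learnable gate in (0, 1)."""
    gate = sigmoid(gate_logit)
    bottleneck = [max(0.0, z) for z in matvec(down, hidden)]  # ReLU
    update = matvec(up, bottleneck)
    return [h + gate * u for h, u in zip(hidden, update)]

def sparsity_penalty(gate_logits, weight):
    """Penalty that pushes gates toward zero, i.e. toward removing adapters."""
    return weight * sum(sigmoid(g) for g in gate_logits)

hidden = [1.0, -2.0]
down = [[1.0, 0.0]]       # hypothetical 2 -> 1 down-projection
up = [[0.5], [0.5]]       # hypothetical 1 -> 2 up-projection

adapter_off = gated_adapter(hidden, down, up, gate_logit=-20.0)
adapter_on = gated_adapter(hidden, down, up, gate_logit=20.0)
penalty = sparsity_penalty([-20.0, 20.0], weight=0.1)
```

With a strongly negative gate logit the layer behaves like the frozen backbone, and with a strongly positive one the adapter update is applied in full, so training the gate logits under the penalty amounts to a soft search over where adapters are inserted.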

[NLP-21] Training LLMs to be Better Text Embedders through Bidirectional Reconstruction EMNLP2025

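Quick read: This paper addresses a limitation of large language models (LLMs) used as text embedders: the final token, typically a special token such as [EOS], is not trained to capture the semantics of the whole context, which limits performance especially on retrieval and re-ranking tasks. The key to the solution is an additional training stage before contrastive learning that enriches the semantics of the [EOS] embedding through two bidirectional generative reconstruction tasks, Embedding-Based Query-to-Document (EBQ2D) and Embedding-Based Document-to-Query (EBD2Q). The two tasks are interleaved; they anchor the [EOS] embedding and reconstruct either side of a query-document pair. This stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB) and achieves new state-of-the-art results.

Link: https://arxiv.org/abs/2509.03020
Authors: Chang Su,Dengliang Shi,Siyuan Huang,Jintao Du,Changhua Meng,Yu Cheng,Weiqiang Wang,Zhouhan Lin
Affiliations: LUMIA Lab, Shanghai Jiao Tong University; Tiansuan Lab, Ant Group Co., Ltd.
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: accepted by EMNLP 2025 Main Conference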

Abstract:Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.

[NLP-22] Mitigating Data Imbalance in Automated Speaking Assessment

Quick read: This paper addresses class imbalance in automated speaking assessment (ASA) models, which biases predictions against minority classes and harms both fairness and accuracy. The key to the solution is a new training objective, the Balancing Logit Variation (BLV) loss, which perturbs model predictions to improve the feature representation of minority classes without modifying the dataset. Experiments show that integrating the BLV loss into a text-based BERT model significantly improves classification accuracy and fairness on the ICNALE benchmark dataset.

Link: https://arxiv.org/abs/2509.03010
Authors: Fong-Chun Tsai,Kuan-Tang Huang,Bi-Cheng Yan,Tien-Hong Lo,Berlin Chen
Affiliations: National Taiwan Normal University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to APSIPA 2025

Abstract:Automated Speaking Assessment (ASA) plays a crucial role in evaluating second-language (L2) learners' proficiency. However, ASA models often suffer from class imbalance, leading to biased predictions. To address this, we introduce a novel objective for training ASA models, dubbed the Balancing Logit Variation (BLV) loss, which perturbs model predictions to improve feature representation for minority classes without modifying the dataset. Evaluations on the ICNALE benchmark dataset show that integrating the BLV loss into a celebrated text-based (BERT) model significantly enhances classification accuracy and fairness, making automated speech evaluation more robust for diverse learners.

[NLP-23] DiaCBT: A Long-Periodic Dialogue Corpus Guided by Cognitive Conceptualization Diagram for CBT-based Psychological Counseling

Quick read: This paper addresses the limited accessibility of psychotherapy: because of social stigma and the limited availability of therapists, only a small fraction of people with mental disorders receive effective intervention. The key to the solution is DiaCBT, a long-periodic dialogue corpus for counseling based on cognitive behavioral therapy (CBT). The corpus contains multiple sessions per counseling case and uses cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios, providing high-quality, structured data for training large language models (LLMs) with professional CBT skills.

Link: https://arxiv.org/abs/2509.02999
Authors: Yougen Zhou,Ningning Zhou,Qin Chen,Jie Zhou,Aimin Zhou,Liang He
Affiliations: Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University; School of Psychology and Cognitive Science, East China Normal University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Psychotherapy reaches only a small fraction of individuals suffering from mental disorders due to social stigma and the limited availability of therapists. Large language models (LLMs), when equipped with professional psychotherapeutic skills, offer a promising solution to expand access to mental health services. However, the lack of psychological conversation datasets presents significant challenges in developing effective psychotherapy-guided conversational agents. In this paper, we construct a long-periodic dialogue corpus for counseling based on cognitive behavioral therapy (CBT). Our curated dataset includes multiple sessions for each counseling and incorporates cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios. To evaluate the utility of our dataset, we train an in-depth counseling model and present a comprehensive evaluation framework to benchmark it against established psychological criteria for CBT-based counseling. Results demonstrate that DiaCBT effectively enhances LLMs’ ability to emulate psychologists with CBT expertise, underscoring its potential for training more professional counseling agents.

[NLP-24] ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

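Quick read: This paper addresses the lack of testbeds for evaluating assembly-task assistants in practical settings, particularly for generative-AI-driven procedural-activity assistants. Without such testbeds it is hard to assess system performance in real assembly environments, which holds back the field. The key to the solution is ProMQA-Assembly, a new multimodal question answering (QA) dataset of 391 QA pairs that require joint understanding of human-activity recordings and their instruction manuals. Annotation is semi-automated: large language models (LLMs) generate candidate questions and humans verify them, and fine-grained action labels are integrated to diversify question types. The authors also create instruction task graphs for toy-vehicle assembly tasks, which are used in the benchmarking experiments and to support human verification during annotation. Experiments show that current multimodal models leave considerable room for improvement on this dataset, which supports its value for developing assembly assistants.

Link: https://arxiv.org/abs/2509.02949
Authors: Kimihiro Hasegawa,Wiradee Imrattanatrai,Masaki Asada,Susan Holm,Yuran Wang,Vincent Zhou,Ken Fukuda,Teruko Mitamura
Affiliations: Carnegie Mellon University; National Institute of Advanced Industrial Science and Technology (AIST)
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages. Code and data: this https URL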

Abstract:Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.

[NLP-25] Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities EMNLP2025

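Quick read: This paper addresses the implicit and inconsistent moderation standards of online communities such as subreddits, which make effective content moderation systems hard to build. The key to the solution is an interpretable architecture that extracts these implicit criteria from historical moderation data and represents them as score tables of lexical expressions associated with content removal, enabling systematic comparison across communities and transparent analysis of decisions. The extracted patterns replicate the performance of neural moderation models and reveal how seemingly shared norms are enforced differently in practice, uncovering previously undocumented moderation patterns such as community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.

Link: https://arxiv.org/abs/2509.02926
Authors: Youngwoo Kim,Himanshu Beniwal,Steven L. Johnson,Thomas Hartvigsen
Affiliations: University of Virginia; Indian Institute of Technology Gandhinagar
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 Main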

Abstract:Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.
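
To make the idea of a score table concrete, the sketch below scores a comment by summing the weights of the lexical expressions it contains and compares two communities. The tables, the weights, the whitespace tokenization, and the removal threshold are all hypothetical; the paper's extraction procedure is not reproduced here.

```python
def removal_score(text, score_table):
    """Sum the table weights of the expressions that occur in the text."""
    tokens = text.lower().split()
    return sum(score_table.get(token, 0.0) for token in tokens)

# Hypothetical score tables for two communities with different tolerances.
community_a = {"idiot": 2.0, "spam": 1.5}
community_b = {"idiot": 0.3, "spam": 1.5}
threshold = 1.0  # assumed decision threshold, for illustration only

comment = "you idiot"
score_a = removal_score(comment, community_a)
score_b = removal_score(comment, community_b)
removed_in_a = score_a > threshold
removed_in_b = score_b > threshold
```

Because each decision decomposes into per-expression weights, the two tables can be compared entry by entry, which is what allows differences in how a shared norm is enforced to be read off directly.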

[NLP-26] English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM

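Quick read: This paper addresses the fact that pronunciation training for English L2 learners requires both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD), which conventional methods handle with complex architectural changes or separate training procedures, resulting in redundant systems that are hard to integrate. The key to the solution is fine-tuning a Multimodal Large Language Model (MLLM) with Low-Rank Adaptation (LoRA): updating only the LoRA layers is enough to perform APA and MDD jointly, with no full fine-tuning or extra modules. Fine-tuned on the Speechocean762 dataset, Phi-4-multimodal-instruct predicts scores that correlate strongly with human-assigned scores (PCC = 0.7) while keeping Word Error Rate (WER) and Phoneme Error Rate (PER) low (both below 0.15). The approach simplifies training and offers a lighter, more efficient basis for Computer-Assisted Pronunciation Training (CAPT).

Link: https://arxiv.org/abs/2509.02915
Authors: Taekyung Ahn,Hosung Nam
Affiliations: Enuma, Inc.; Korea University
Categories: Computation and Language (cs.CL)
Comments: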

Abstract:This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft’s Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the pronunciation evaluation scores predicted by the model exhibited a strong Pearson Correlation Coefficient (PCC 0.7) with human-assigned scores, while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both below 0.15). Notably, fine-tuning only the LoRA layers was sufficient to achieve performance levels comparable to those achieved by fine-tuning all audio layers. This research highlights that an integrated pronunciation assessment system can be established by adapting large multimodal models without full fine-tuning, utilizing a significantly simpler training methodology compared to previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
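
The LoRA update that this approach relies on can be sketched in a few lines. The example uses the standard formulation, in which a frozen weight W is augmented by a scaled low-rank product (alpha / r) * B A with B initialized to zero; the matrix sizes and values are toy assumptions and have nothing to do with the actual Phi-4-multimodal-instruct layers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(frozen, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A without modifying the frozen weight."""
    rank = len(lora_a)                 # A has shape (rank, d_in)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)     # (d_out, rank) @ (rank, d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen, delta)]

def lora_param_count(d_out, d_in, rank):
    """Number of trainable parameters in the two low-rank factors."""
    return rank * (d_out + d_in)

frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # toy 2 x 3 frozen weight
lora_a = [[0.5, 0.0, -0.5]]                   # rank-1 factor A
zero_b = [[0.0], [0.0]]                       # B starts at zero
trained_b = [[1.0], [2.0]]                    # made-up values after training

at_init = lora_effective_weight(frozen, zero_b, lora_a, alpha=2.0)
after_update = lora_effective_weight(frozen, trained_b, lora_a, alpha=2.0)
```

At initialization the effective weight equals the frozen one, so fine-tuning starts from the pretrained behavior. For a 1024 x 1024 layer at rank 8, `lora_param_count` gives 16,384 trainable parameters against 1,048,576 in the full matrix.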

[NLP-27] Advancing Minority Stress Detection with Transformers: Insights from the Social Media Datasets

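Quick read: This paper addresses how to reliably detect minority stress, as experienced by sexual and gender minority groups, in online discourse, in order to support digital health interventions and public health policy. The key to the solution is graph-augmented transformer models that capture social connectivity and conversational context, which sharpens the identification of key linguistic markers such as identity concealment, internalized stigma, and calls for support. Experiments show that this approach outperforms transformer-only models and zero-shot and few-shot baselines.

Link: https://arxiv.org/abs/2509.02908
Authors: Santosh Chapagain,Cory J Cascalheira,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi,Jillian R. Scheer
Affiliations: University of South Utah; Syracuse University
Categories: Computation and Language (cs.CL)
Comments: Accepted in Social Network Analysis and Mining Journal (SNAM)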

Abstract:Individuals from sexual and gender minority groups experience disproportionately high rates of poor health outcomes and mental disorders compared to their heterosexual and cisgender counterparts, largely as a consequence of minority stress as described by Meyer’s (2003) model. This study presents the first comprehensive evaluation of transformer-based architectures for detecting minority stress in online discourse. We benchmark multiple transformer models including ELECTRA, BERT, RoBERTa, and BART against traditional machine learning baselines and graph-augmented variants. We further assess zero-shot and few-shot learning paradigms to assess their applicability on underrepresented datasets. Experiments are conducted on the two largest publicly available Reddit corpora for minority stress detection, comprising 12,645 and 5,789 posts, and are repeated over five random seeds to ensure robustness. Our results demonstrate that integrating graph structure consistently improves detection performance across transformer-only models and that supervised fine-tuning with relational context outperforms zero and few-shot approaches. Theoretical analysis reveals that modeling social connectivity and conversational context via graph augmentation sharpens the models’ ability to identify key linguistic markers such as identity concealment, internalized stigma, and calls for support, suggesting that graph-enhanced transformers offer the most reliable foundation for digital health interventions and public health policy.

[NLP-28] A-SEA3L-QA: A Fully Automated Self-Evolving Adversarial Workflow for Arabic Long-Context Question-Answer Generation

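Quick read: This paper addresses long-context question-answer (QA) generation in Arabic, specifically how to keep improving models' understanding of, and generation over, multi-page, multi-domain Arabic documents without human intervention. Conventional static pipelines adapt poorly to complex document structure and cannot optimize dynamically, which limits the quality, difficulty, and relevance of the generated questions. The key to the solution is an end-to-end self-evolving adversarial workflow that orchestrates several specialized Large Vision Language Models (LVLMs), namely a question generator, an evaluator, and a swarm of answer generators, in a closed feedback loop. The question generator produces fine-grained, context-aware questions, the answer generator swarm tackles them, and the evaluator feeds back quality metrics; low-confidence outputs trigger automated re-generation and model updates, so performance improves iteratively. The quality metrics are exposed as a tunable hyperparameter, which makes question difficulty controllable and customizable, and the workflow markedly improves the long-context comprehension of leading Arabic LVLMs.

Link: https://arxiv.org/abs/2509.02864
Authors: Kesen Wang,Daulet Toibazar,Pedro J. Moreno
Affiliations: Humain
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: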

Abstract:We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized LVLMs: a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperforms static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic Large Vision Language Models (LVLMs). Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.

[NLP-29] Speech DF Arena: A Leaderboard for Speech DeepFake Detection Models

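Quick read: This paper addresses the lack of a standardized, comprehensive evaluation benchmark for audio deepfake detection. Although generative AI has driven rapid progress in deepfake audio generation, the evaluation of detection methods remains fragmented and inconsistent, which makes it hard to compare the performance and robustness of different systems objectively. The key to the solution is Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. It comprises (1) a unified toolkit for evaluating systems across 14 diverse datasets and attack scenarios, (2) standardized evaluation metrics and protocols for reproducibility and transparency, (3) a leaderboard for fair comparison of systems, and (4) an open-source toolchain for reproducing results. The benchmark shows that many existing detection systems perform poorly in out-of-domain scenarios, which highlights the need for extensive cross-domain evaluation.

Link: https://arxiv.org/abs/2509.02859
Authors: Sandipana Dowerah,Atharva Kulkarni,Ajinkya Kulkarni,Hoan My Tran,Joonas Kalda,Artem Fedorchenko,Benoit Fauve,Damien Lolive,Tanel Alumäe,Matthew Magimai Doss
Affiliations: Tallinn University of Technology; MBZUAI; Idiap Research Institute; Univ. Rennes, CNRS, Irisa; Univ. Bretagne Sud, CNRS, Irisa; Validsoft Ltd.
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: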

Abstract:Parallel to the development of advanced deepfake audio generation, audio deepfake detection has also seen significant progress. However, a standardized and comprehensive benchmark is still missing. To address this, we introduce Speech DeepFake (DF) Arena, the first comprehensive benchmark for audio deepfake detection. Speech DF Arena provides a toolkit to uniformly evaluate detection systems, currently across 14 diverse datasets and attack scenarios, standardized evaluation metrics and protocols for reproducibility and transparency. It also includes a leaderboard to compare and rank the systems to help researchers and developers enhance their reliability and robustness. We include 14 evaluation sets, 12 state-of-the-art open-source and 3 proprietary detection systems. Our study presents many systems exhibiting high EER in out-of-domain scenarios, highlighting the need for extensive cross-domain evaluation. The leaderboard is hosted on Hugging Face and a toolkit for reproducing results across the listed datasets is available on GitHub.
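
The equal error rate (EER) referred to in the abstract is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below is a simple threshold sweep, assuming that higher scores mean "more likely bona fide"; it is not the Speech DF Arena toolkit, and the scores are made up.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores."""
    best_gap, eer = None, None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # Spoofed audio accepted as bona fide at this threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # Bona fide audio rejected at this threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

overlapping = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
separable = equal_error_rate([0.9, 0.8], [0.2, 0.1])
```

A detector whose score distributions overlap gets a nonzero EER, while a perfectly separable one gets zero, which is why a high out-of-domain EER signals that the score distributions have shifted.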

[NLP-30] IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations

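Quick read: This paper addresses the problem of evaluating at scale how well generative AI agrees with expert human annotators on open-ended, interpretive annotation tasks, for which no validated, scalable similarity measure currently exists. The key to the solution is IDEAlign, a benchmarking paradigm built on a “pick-the-odd-one-out” triplet judgment task that captures experts' similarity judgments as a quantifiable human benchmark, against which various similarity metrics, including vector-based ones and LLM-as-a-judge, are evaluated. Experiments show that prompting LLMs via IDEAlign improves alignment with expert judgments by 9-30% over traditional lexical and vector-based metrics, providing a reliable evaluation framework for the responsible deployment of LLMs in education and other fields.

Link: https://arxiv.org/abs/2509.02855
Authors: Hyunji Nam,Lucia Langlois,James Malamut,Mei Tan,Dorottya Demszky
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages, 9 pages for appendix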

Abstract:Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks, such as thematic analysis by researchers or generating feedback on student work by teachers. These tasks involve free-text annotations requiring expert-level judgments grounded in specific objectives (e.g., research questions or instructional goals). Evaluating whether LLM-generated annotations align with those generated by expert humans is challenging to do at scale, and currently, no validated, scalable measure of similarity in ideas exists. In this paper, we (i) introduce the scalable evaluation of interpretive annotation by LLMs as a critical and understudied task, (ii) propose IDEAlign, an intuitive benchmarking paradigm for capturing expert similarity ratings via a “pick-the-odd-one-out” triplet judgment task, and (iii) evaluate various similarity metrics, including vector-based ones (topic models, embeddings) and LLM-as-a-judge via IDEAlign, against these human benchmarks. Applying this approach to two real-world educational datasets (interpretive analysis and feedback generation), we find that vector-based metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlign significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. These results establish IDEAlign as a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, informing responsible deployment of LLMs in education and beyond.
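
One way to score a vector-based similarity metric against “pick-the-odd-one-out” judgments is sketched below. It assumes each annotation has already been embedded as a vector and that the metric's odd one out is the item left out of the most similar pair; the embeddings and expert picks are made up, and the paper's exact scoring protocol may differ.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def odd_one_out(vectors):
    """Index (0, 1, or 2) of the item left out of the most similar pair."""
    pair_similarity = {
        2: cosine(vectors[0], vectors[1]),
        1: cosine(vectors[0], vectors[2]),
        0: cosine(vectors[1], vectors[2]),
    }
    return max(pair_similarity, key=pair_similarity.get)

def agreement(triplets, expert_picks):
    """Fraction of triplets where the metric picks the same item as the expert."""
    hits = sum(odd_one_out(t) == pick for t, pick in zip(triplets, expert_picks))
    return hits / len(triplets)

# Toy 2-D "embeddings" for two triplets of annotations.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0], [0.1, 0.9]),
]
expert_picks = [2, 1]   # hypothetical expert odd-one-out choices
metric_picks = [odd_one_out(t) for t in triplets]
score = agreement(triplets, expert_picks)
```

The same `agreement` function could be applied to picks produced by an LLM judge, so different metrics are compared on one scale against the expert benchmark.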

[NLP-31] Clustering Discourses: Racial Biases in Short Stories about Women Generated by Large Language Models

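Quick read: This paper examines how large language models (LLMs), when generating short stories about Black and white women, implicitly reproduce and reinforce the gender and racial inequalities of colonial historical structures. The key to the approach is an integrated method: computational techniques first cluster a large set of generated texts by semantic similarity to select representative samples, and machine learning is then combined with manual, qualitative discourse analysis. The analysis identifies three main discursive representations (social overcoming, ancestral mythification, and subjective self-realization) and shows how seemingly neutral language crystallizes a colonially structured framing of the female body, offering both empirical and theoretical grounding for understanding bias in generative AI.

Link: https://arxiv.org/abs/2509.02834
Authors: Gustavo Bonil,João Gondim,Marina dos Santos,Simone Hashiguti,Helena Maia,Nadia Silva,Helio Pedrini,Sandra Avila
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures. Accepted at STIL @ BRACIS 2025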

Abstract:This study investigates how large language models, in particular LLaMA 3.2-3B, construct narratives about Black and white women in short stories generated in Portuguese. From 2100 texts, we applied computational methods to group semantically similar stories, allowing a selection for qualitative analysis. Three main discursive representations emerge: social overcoming, ancestral mythification and subjective self-realization. The analysis uncovers how grammatically coherent, seemingly neutral texts materialize a crystallized, colonially structured framing of the female body, reinforcing historical inequalities. The study proposes an integrated approach that combines machine learning techniques with qualitative, manual discourse analysis.

[NLP-32] SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR

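Quick read: This paper addresses the trade-off between parameter efficiency and performance when fine-tuning large foundation models for speech domains such as child speech and dialectal variation. Mainstream parameter-efficient fine-tuning (PEFT) methods, such as LoRA and its advanced variants (VeRA, DoRA, PiSSA, and SVFT), were developed mainly for natural language processing and computer vision and have seen little systematic validation in speech. The authors integrate and benchmark these PEFT methods in ESPnet for the first time and propose structured SVD-guided (SSVD) fine-tuning, which selectively rotates the input-associated right singular vectors while keeping the output-associated singular vectors fixed, enabling more robust domain adaptation while preserving semantic mappings. The design greatly reduces the number of trainable parameters and improves fine-tuning efficiency, and it is evaluated on speech recognition models from 0.1B to 2B parameters.

Link: https://arxiv.org/abs/2509.02830
Authors: Pu Wang,Shinji Watanabe,Hugo Van hamme
Affiliations: KU Leuven; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE ASRU 2025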

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a scalable solution for adapting large foundation models. While low-rank adaptation (LoRA) is widely used in speech applications, its state-of-the-art variants, e.g., VeRA, DoRA, PiSSA, and SVFT, are developed mainly for language and vision tasks, with limited validation in speech. This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. We further introduce structured SVD-guided (SSVD) fine-tuning, which selectively rotates input-associated right singular vectors while keeping output-associated vectors fixed to preserve semantic mappings. This design enables robust domain adaptation with minimal trainable parameters and improved efficiency. We evaluate all methods on domain-shifted speech recognition tasks, including child speech and dialectal variation, across model scales from 0.1B to 2B. All implementations are released in ESPnet to support reproducibility and future work.
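
The rotation idea in the abstract can be illustrated on a toy 2 x 2 weight with a known decomposition W = U diag(s) V^T. The sketch rotates only the right singular vectors V and leaves U and s untouched; it covers just this one aspect, uses a single rotation angle as the trainable parameter, and all values are assumptions rather than the paper's parameterization.

```python
import math

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def compose(u, singular_values, v):
    """Build W = U diag(s) V^T from its factors."""
    sigma = [[singular_values[0], 0.0], [0.0, singular_values[1]]]
    return matmul(matmul(u, sigma), transpose(v))

u = rotation(0.3)       # output-associated left singular vectors (kept fixed)
v = rotation(-1.1)      # input-associated right singular vectors
s = [3.0, 0.5]          # singular values (kept fixed)

w = compose(u, s, v)
v_adapted = matmul(v, rotation(0.2))   # rotate V; the angle is the trainable part
w_adapted = compose(u, s, v_adapted)

# W W^T = U diag(s)^2 U^T depends only on U and s, so it should not change.
gram_before = matmul(w, transpose(w))
gram_after = matmul(w_adapted, transpose(w_adapted))
```

The adapted weight differs from the original, yet W W^T is unchanged, which reflects that the output-side structure is preserved while the input-side basis is re-aligned.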

[NLP-33] DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off EMNLP

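[Quick Read]: This paper addresses the efficiency-quality trade-off in long-text generation. The solution rests on three techniques. First, a dynamic expert scheduling mechanism allocates computational resources during the diffusion process according to text complexity, so that tasks of varying difficulty are handled more efficiently. Second, a Hierarchical Sparse Attention (HSA) mechanism adapts attention patterns to the input length, reducing computational complexity from O(n^2) to O(n) while maintaining model performance. Third, a soft absorption guidance optimization strategy is combined with DPM-solver++ to reduce the number of diffusion steps and significantly speed up generation.

Link: https://arxiv.org/abs/2509.02785
Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Zimeng Huang, Xiaofei Sun, Jian Wang, Chengpei Tang, Keze Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted 2025 EMNLP (MainConference)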

Abstract:This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of text generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to a variety of input lengths, reducing computational complexity from O(n^2) to O(n) while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that combines with DPM-solver++ to reduce diffusion steps, significantly improving generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of our DrDiff over the existing SOTA methods.
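To make the complexity claim concrete, the sketch below counts query-key pairs under full attention and under a fixed sliding window. The sliding window is an assumption chosen for illustration: it is only one simple sparse pattern and is not the HSA mechanism described in the abstract.

```python
def full_attention_pairs(n):
    """Every query attends to every key: n * n pairs."""
    return n * n

def windowed_attention_pairs(n, window):
    """Each query attends to itself and up to `window` keys on each side."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)          # first key index visible to query i
        hi = min(n - 1, i + window)      # last key index visible to query i
        total += hi - lo + 1
    return total

# Doubling the sequence length roughly quadruples the full cost
# but only roughly doubles the windowed cost.
for n in (1000, 2000):
    print(n, full_attention_pairs(n), windowed_attention_pairs(n, window=4))
```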

[NLP-34] Identifiability and minimality bounds of quantum and post-quantum models of classical stochastic processes

[Quick Read]: This paper addresses identifiability for models of classical stochastic processes, that is, deciding whether two different models produce the same observable behavior. The key step is to map every model, whether classical, quantum, or "post-quantum", to a canonical generalized hidden Markov model. This gives a common framework for comparing any two models and, in turn, yields bounds (sometimes tight) on the minimal dimension a quantum model needs to generate a given classical stochastic process.

Link: https://arxiv.org/abs/2509.03004
Authors: Paul M. Riechers, Thomas J. Elliott
Affiliations: Beyond Institute for Theoretical Science (BITS); University of Manchester
Categories: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Information Theory (cs.IT)
Comments: 11 pages, 4 figures

Abstract:To make sense of the world around us, we develop models, constructed to enable us to replicate, describe, and explain the behaviours we see. Focusing on the broad case of sequences of correlated random variables, i.e., classical stochastic processes, we tackle the question of determining whether or not two different models produce the same observable behavior. This is the problem of identifiability. Curiously, the physics of the model need not correspond to the physics of the observations; recent work has shown that it is even advantageous – in terms of memory and thermal efficiency – to employ quantum models to generate classical stochastic processes. We resolve the identifiability problem in this regime, providing a means to compare any two models of a classical process, be the models classical, quantum, or 'post-quantum', by mapping them to a canonical 'generalized' hidden Markov model. Further, this enables us to place (sometimes tight) bounds on the minimal dimension required of a quantum model to generate a given classical stochastic process.
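The notion of two models producing the same observable behavior can be illustrated with ordinary hidden Markov models written in labeled-matrix form, where the probability of a word is an initial vector multiplied by one matrix per symbol and then summed. The sketch below is an assumption-laden illustration, not the canonical-form construction of the paper: it only compares word probabilities up to a chosen length, and the two example models and all function names are hypothetical.

```python
from itertools import product

def word_probability(pi, T, word):
    """P(word) = pi @ T[x1] @ ... @ T[xn] @ 1 for labeled transition matrices T[x]."""
    vec = list(pi)
    for x in word:
        m = T[x]
        vec = [sum(vec[i] * m[i][j] for i in range(len(vec)))
               for j in range(len(m[0]))]
    return sum(vec)

def agree_up_to(model_a, model_b, alphabet, max_len, tol=1e-12):
    """Check that both models assign equal probability to every word up to max_len."""
    for n in range(1, max_len + 1):
        for word in product(alphabet, repeat=n):
            pa = word_probability(model_a[0], model_a[1], word)
            pb = word_probability(model_b[0], model_b[1], word)
            if abs(pa - pb) > tol:
                return False
    return True

# Two structurally different models of the same fair-coin process (illustrative).
fair_1 = ([1.0], {0: [[0.5]], 1: [[0.5]]})
fair_2 = ([0.5, 0.5], {0: [[0.25, 0.25], [0.5, 0.0]],
                       1: [[0.0, 0.5], [0.25, 0.25]]})
# A biased coin, which generates a different process.
biased = ([1.0], {0: [[0.7]], 1: [[0.3]]})
```

Here `fair_1` and `fair_2` have different numbers of states yet agree on every word, which is the situation the identifiability question is about.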

Computer Vision

[CV-0] Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage but Not Direct the Play?

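[Quick Read]: This paper addresses the limitations of current benchmarks for evaluating the composition and reasoning capabilities of text-to-image (T2I) models, in particular their failure to cover the complexity of both capabilities and the high-density, multi-step inference demands of real-world scenarios. The proposed solution is T2I-CoReBench, a comprehensive and complex benchmark. First, it structures composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), forming a 12-dimensional evaluation taxonomy. Second, it uses complex prompts with high compositional density and multi-step inference, each paired with a checklist of questions that allows fine-grained, reliable evaluation of every intended element. Experiments on 27 current T2I models show that composition remains limited in complex high-density scenarios, while reasoning is an even more pronounced bottleneck: all models struggle to infer implicit elements from prompts.

Link: https://arxiv.org/abs/2509.03516
Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Fuli Feng
Affiliations: University of Science and Technology of China; Nanyang Technological University; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL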

Abstract:Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, with the emerging advances of T2I models in reasoning beyond composition, existing benchmarks reveal clear limitations in providing comprehensive evaluations across and within these capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent complexities of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist that specifies individual yes/no questions to assess each intended element independently to facilitate fine-grained and reliable evaluation. In statistics, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability still remains limited in complex high-density scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Our project page: this https URL.
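The checklist idea in this abstract can be sketched as a simple aggregation over yes/no answers. The aggregation rule below (fraction of "yes" answers per prompt, then a mean per dimension) is an assumption made for illustration; the abstract does not state how the benchmark combines answers, and the function names and the `"relation"` key are hypothetical.

```python
def checklist_score(answers):
    """Fraction of checklist questions answered 'yes' for one generated image."""
    if not answers:
        raise ValueError("checklist must contain at least one question")
    return sum(1 for a in answers if a) / len(answers)

def dimension_scores(results):
    """Average the per-prompt checklist scores within each evaluation dimension.

    `results` maps a dimension name to a list of per-prompt answer lists.
    """
    return {
        dim: sum(checklist_score(answers) for answers in prompts) / len(prompts)
        for dim, prompts in results.items()
    }

# Two prompts in one hypothetical dimension, each with its own yes/no answers.
example = {"relation": [[True, False], [True, True]]}
print(dimension_scores(example))
```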

[CV-1] A comprehensive Persian offline handwritten database for investigating the effects of heritability and family relationships on handwriting

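[Quick Read]: This paper addresses the question of whether handwriting is heritable: whether it has a genetic component and whether family members share similar handwriting styles. The key contribution is a database of handwriting samples from 210 families covering eight kinds of relatives (grandparents, parents, uncles, aunts, siblings, cousins, nephews, and nieces). The samples include digits, letters, shapes, and free paragraphs, and the family relationships of all writers are recorded. By comparing the handwriting features of family members, the authors detect similarities in features and writing styles. The freely available database gives the pattern recognition community a basis for studying how inheritance and family relationships affect handwriting.

Link: https://arxiv.org/abs/2509.03510
Authors: Abbas Zohrevand, Javad Sadri, Zahra Imani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: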

Abstract:This paper introduces a comprehensive database for research and investigation on the effects of inheritance on handwriting. A database has been created that can be used to answer questions such as: Is there a genetic component to handwriting? Is handwriting inherited? Do family relationships affect handwriting? A variety of handwritten samples, such as digits, letters, shapes, and free paragraphs, have been collected from 210 families (including grandparents, parents, uncles, aunts, siblings, cousins, nephews, and nieces) using specially designed forms, and the family relationships of all writers are captured. To the best of our knowledge, no such database is presently available. Based on comparisons and investigation of features of the handwriting of family members, similarities among their features and writing styles are detected. Our database is freely available to the pattern recognition community, and we hope it will pave the way for investigations on the effects of inheritance and family relationships on handwriting.

[CV-2] Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data ICCV2025

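[Quick Read]: This paper addresses the difficulty current Video Large Language Models (Video LLMs) have with fine-grained spatiotemporal reasoning in dynamic real-world environments, especially when user queries rely on time-based event references for temporal anchoring or on gestural cues for spatial anchoring. The solution is Strefer, a framework for generating synthetic instruction data. Its data engine pseudo-annotates temporally dense, fine-grained video metadata in a structured form, including subjects, objects, their locations as masklets, and their action descriptions and timelines. The resulting instruction-tuning data gives Video LLMs stronger space-time referring and reasoning abilities, and models trained on it outperform baselines without relying on proprietary models, costly human annotation, or annotation of large volumes of new videos.

Link: https://arxiv.org/abs/2509.03501
Authors: Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
Affiliations: Salesforce AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: This technical report serves as the archival version of our paper accepted at the ICCV 2025 Workshop. For more information, please visit our project website: this https URL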

Abstract:Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.

[CV-3] DeepSea MOT: A benchmark dataset for multi-object tracking on deep-sea video

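[Quick Read]: This paper addresses the lack of a standardized benchmark dataset for evaluating multi-object tracking (MOT) models on deep-sea video. Without one, the performance of different detection and tracking algorithms is hard to compare under consistent test conditions, which limits model optimization and cross-study comparison. The key contribution is what the authors describe as the first publicly available benchmark for this setting: four video sequences representing midwater and benthic deep-sea habitats. Performance is evaluated with Higher Order Tracking Accuracy (HOTA), a metric that balances detection, localization, and association accuracy, applied to several Monterey Bay Aquarium Research Institute object detection models and a FathomNet single-class object detection model combined with several trackers. The authors also provide a documented workflow for generating additional benchmark videos and example Python notebooks for computing metrics.

Link: https://arxiv.org/abs/2509.03499
Authors: Kevin Barnard, Elaine Liu, Kristine Walz, Brian Schlining, Nancy Jacobsen Stout, Lonny Lundsten
Affiliations: Monterey Bay Aquarium Research Institute; University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, dataset available at this https URL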

Abstract:Benchmarking multi-object tracking and object detection model performance is an essential step in machine learning model development, as it allows researchers to evaluate model detection and tracker performance on human-generated ‘test’ data, facilitating consistent comparisons between models and trackers and aiding performance optimization. In this study, a novel benchmark video dataset was developed and used to assess the performance of several Monterey Bay Aquarium Research Institute object detection models and a FathomNet single-class object detection model together with several trackers. The dataset consists of four video sequences representing midwater and benthic deep-sea habitats. Performance was evaluated using Higher Order Tracking Accuracy, a metric that balances detection, localization, and association accuracy. To the best of our knowledge, this is the first publicly available benchmark for multi-object tracking in deep-sea video footage. We provide the benchmark data, a clearly documented workflow for generating additional benchmark videos, as well as example Python notebooks for computing metrics.
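HOTA at a single localization threshold is the geometric mean of a detection score and an association score, and the final metric averages this value over a range of thresholds. The sketch below computes the single-threshold value from counts that are assumed to be already matched. The function names and the input layout are hypothetical, and this is not the benchmark's notebook code.

```python
import math

def det_a(tp, fn, fp):
    """Detection accuracy: TP / (TP + FN + FP)."""
    return tp / (tp + fn + fp)

def ass_a(per_tp_counts):
    """Association accuracy averaged over true-positive detections.

    `per_tp_counts` holds one (TPA, FNA, FPA) triple per true positive, where
    the triple counts association matches and misses for that detection's track.
    """
    scores = [tpa / (tpa + fna + fpa) for tpa, fna, fpa in per_tp_counts]
    return sum(scores) / len(scores)

def hota_alpha(tp, fn, fp, per_tp_counts):
    """HOTA at one threshold: sqrt(DetA * AssA)."""
    return math.sqrt(det_a(tp, fn, fp) * ass_a(per_tp_counts))

# Illustrative counts: 8 true positives, 1 miss, 1 false alarm.
counts = [(4, 1, 0)] * 8
print(hota_alpha(8, 1, 1, counts))
```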

[CV-4] OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

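[Quick Read]: This paper addresses the tension between architectural complexity and inference efficiency in current unified multimodal models, in particular the computational overhead and deployment bottleneck caused by relying on external vision encoders such as a Vision Transformer (ViT) or on a vision tokenizer. The solution is OneCAT, a unified multimodal model built on a pure decoder-only transformer. It uses a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which integrates understanding, generation, and editing and natively supports dynamic resolutions. In addition, its multi-scale visual autoregressive mechanism drastically reduces decoding steps compared with diffusion-based methods, so the model keeps state-of-the-art performance while improving inference efficiency.

Link: https://arxiv.org/abs/2509.03498
Authors: Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
Affiliations: Meituan Inc; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report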

Abstract:We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.

[CV-5] Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA

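[Quick Read]: This paper addresses the poor parameter efficiency of adapting Multimodal Large Language Models (MLLMs) to No-Reference Image Quality Assessment (NR-IQA), where conventional full fine-tuning is computationally expensive and costly to train. The solution is a parameter-efficient adaptation method based on visual prompts optimized in pixel space: at most 600K parameters are trained (0.01% of the base model) while the underlying model stays frozen. At inference, the optimized visual prompt is added to the input image and the result is processed by mPLUG-Owl2 with the textual query "Rate the technical quality of the image." to obtain a quality score. Across synthetic, realistic, and AI-generated distortions, the method is competitive with fully fine-tuned methods and specialized NR-IQA models. According to the authors, it is the first work to use pixel-space visual prompts for NR-IQA, enabling efficient MLLM adaptation to low-level vision tasks.

Link: https://arxiv.org/abs/2509.03494
Authors: Yahya Benmahane, Mohammed El Hassouni
Affiliations: Faculty of Sciences, Rabat; FLSH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: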

Abstract:In this paper, we propose a novel parameter-efficient adaptation method for No-Reference Image Quality Assessment (NR-IQA) using visual prompts optimized in pixel-space. Unlike full fine-tuning of Multimodal Large Language Models (MLLMs), our approach trains only 600K parameters at most (0.01% of the base model), while keeping the underlying model fully frozen. During inference, these visual prompts are combined with images via addition and processed by mPLUG-Owl2 with the textual query “Rate the technical quality of the image.” Evaluations across distortion types (synthetic, realistic, AI-generated) on KADID-10k, KonIQ-10k, and AGIQA-3k demonstrate competitive performance against full finetuned methods and specialized NR-IQA models, achieving 0.93 SRCC on KADID-10k. To our knowledge, this is the first work to leverage pixel-space visual prompts for NR-IQA, enabling efficient MLLM adaptation for low-level vision tasks. The source code is publicly available at https://github.com/yahya-ben/mplug2-vp-for-nriqa .
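Combining a prompt with an image "via addition" can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: images are nested lists in channel, row, column order with values in [0, 1], the prompt has the same shape as the image, and the result is clipped to the valid range. The clipping step, the shapes, and the function names are assumptions and are not taken from the paper or its repository.

```python
def apply_visual_prompt(image, prompt, lo=0.0, hi=1.0):
    """Add a learned pixel-space prompt to an image and clip to [lo, hi]."""
    return [
        [
            [min(hi, max(lo, pixel + delta)) for pixel, delta in zip(img_row, prm_row)]
            for img_row, prm_row in zip(img_ch, prm_ch)
        ]
        for img_ch, prm_ch in zip(image, prompt)
    ]

def prompt_parameter_count(channels, height, width):
    """A full-image prompt has one trainable value per pixel per channel."""
    return channels * height * width

# Tiny 1-channel, 2x2 example; the frozen model would receive `prompted`.
image = [[[0.2, 0.9], [0.5, 0.0]]]
prompt = [[[0.1, 0.3], [-0.7, 0.0]]]
prompted = apply_visual_prompt(image, prompt)
```

Only the prompt values would be updated during training in such a setup, which is why the trainable parameter count scales with the prompt's pixel footprint rather than with the size of the frozen model.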

[CV-6] Robult: Leveraging Redundancy and Modality Specific Features for Robust Multimodal Learning IJCAI2025

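[Quick Read]: This paper addresses two key challenges in multimodal learning: missing modalities and limited labeled data. The authors propose Robult, a scalable framework whose core idea is an information-theoretic approach that preserves modality-specific information while leveraging redundancy across modalities. It optimizes two objectives: (1) a soft Positive-Unlabeled (PU) contrastive loss that maximizes task-relevant feature alignment while making effective use of limited labeled data in semi-supervised settings, and (2) a latent reconstruction loss that ensures unique modality-specific information is retained. Embedded in a modular design, these objectives improve performance on downstream tasks and make the model resilient to incomplete modalities at inference.

Link: https://arxiv.org/abs/2509.03477
Authors: Duy A. Nguyen, Abhi Kamboj, Minh N. Do
Affiliations: Siebel School of Computing and Data Science, UIUC, US; Department of Electrical and Computer Engineering, UIUC, US; VinUni-Illinois Smart Health Center, VinUniversity, Vietnam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and presented at IJCAI 2025 in Montreal, Canada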

Abstract:Addressing missing modalities and limited labeled data is crucial for advancing robust multimodal learning. We propose Robult, a scalable framework designed to mitigate these challenges by preserving modality-specific information and leveraging redundancy through a novel information-theoretic approach. Robult optimizes two core objectives: (1) a soft Positive-Unlabeled (PU) contrastive loss that maximizes task-relevant feature alignment while effectively utilizing limited labeled data in semi-supervised settings, and (2) a latent reconstruction loss that ensures unique modality-specific information is retained. These strategies, embedded within a modular design, enhance performance across various downstream tasks and ensure resilience to incomplete modalities during inference. Experimental results across diverse datasets validate that Robult achieves superior performance over existing approaches in both semi-supervised learning and missing modality contexts. Furthermore, its lightweight design promotes scalability and seamless integration with existing architectures, making it suitable for real-world multimodal applications.

[CV-7] Joint Training of Image Generator and Detector for Road Defect Detection ICCV2025

[Quick Read]: This paper addresses the resource constraints of deploying road defect detection on edge devices, namely how to achieve strong detection performance without ensemble-based methods or test-time augmentation (TTA). The solution is to Jointly Train the image Generator and Detector (JTGD). Dual discriminators push both the synthesized defect patches and the overall images to look plausible, and a CLIP-based Fréchet Inception Distance loss improves synthesized image quality. Training the generative model jointly with the detector encourages it to synthesize harder examples for the detector. JTGD outperforms the state-of-the-art method on the RDD2022 benchmark under the no-ensemble, no-TTA condition while using less than 20% of the parameters of the competing baseline, which makes it better suited to edge deployment.

Link: https://arxiv.org/abs/2509.03465
Authors: Kuan-Chuan Peng
Affiliations: Mitsubishi Electric Research Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted to ICCV 2025 Workshop on Representation Learning with Very Limited Resources: When Data, Modalities, Labels, and Computing Resources are Scarce as an oral paper

Abstract:Road defect detection is important for road authorities to reduce the vehicle damage caused by road defects. Considering the practical scenarios where the defect detectors are typically deployed on edge devices with limited memory and computational resource, we aim at performing road defect detection without using ensemble-based methods or test-time augmentation (TTA). To this end, we propose to Jointly Train the image Generator and Detector for road defect detection (dubbed as JTGD). We design the dual discriminators for the generative model to enforce both the synthesized defect patches and overall images to look plausible. The synthesized image quality is improved by our proposed CLIP-based Fréchet Inception Distance loss. The generative model in JTGD is trained jointly with the detector to encourage the generative model to synthesize harder examples for the detector. Since harder synthesized images of better quality caused by the aforesaid design are used in the data augmentation, JTGD outperforms the state-of-the-art method in the RDD2022 road defect detection benchmark across various countries under the condition of no ensemble and TTA. JTGD only uses less than 20% of the number of parameters compared with the competing baseline, which makes it more suitable for deployment on edge devices in practice.

[CV-8] SAM-LLM: Interpretable Lane Change Trajectory Prediction via Parametric Finetuning

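[Quick Read]: This paper addresses the limited interpretability and physical plausibility of trajectory prediction for autonomous driving, especially in lane change scenarios, where conventional coordinate-based outputs cannot guarantee physically feasible trajectories and are hard to interpret. The solution is SAM-LLM, a hybrid architecture that fine-tunes a Large Language Model (LLM) to output the core physical parameters of an enhanced Sinusoidal Acceleration Model (SAM), namely lateral displacement, maneuver duration, initial lateral velocity, and longitudinal velocity change, instead of raw coordinate sequences. The approach yields continuous, physically plausible, and interpretable trajectories, reduces output size by 80%, and reaches an overall intention prediction accuracy of 98.73%, equivalent to traditional LLM predictors.

Link: https://arxiv.org/abs/2509.03462
Authors: Zhuo Cao, Yunxiao Shi, Min Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 5 pages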

Abstract:This work introduces SAM-LLM, a novel hybrid architecture that bridges the gap between the contextual reasoning of Large Language Models (LLMs) and the physical precision of kinematic lane change models for autonomous driving. The system is designed for interpretable lane change trajectory prediction by finetuning an LLM to output the core physical parameters of a trajectory model instead of raw coordinates. For lane-keeping scenarios, the model predicts discrete coordinates, but for lane change maneuvers, it generates the parameters for an enhanced Sinusoidal Acceleration Model (SAM), including lateral displacement, maneuver duration, initial lateral velocity, and longitudinal velocity change. This parametric approach yields a complete, continuous, and physically plausible trajectory model that is inherently interpretable and computationally efficient, achieving an 80% reduction in output size compared to coordinate-based methods. The SAM-LLM achieves a state-of-the-art overall intention prediction accuracy of 98.73%, demonstrating performance equivalent to traditional LLM predictors while offering significant advantages in explainability and resource efficiency.
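To show why a handful of parameters can stand in for a full coordinate sequence, the sketch below evaluates the classic sinusoidal acceleration lane-change model, in which lateral acceleration follows one sine period over the maneuver. This is the textbook form with two parameters (lateral displacement `D` and duration `T`), used here as an assumption; the paper's enhanced SAM also uses initial lateral velocity and longitudinal velocity change, which are not reproduced. The constant longitudinal speed `v_long` and all function names are hypothetical.

```python
import math

def lateral_acceleration(t, D, T):
    """a(t) = (2*pi*D / T^2) * sin(2*pi*t / T) for 0 <= t <= T."""
    return (2 * math.pi * D / T ** 2) * math.sin(2 * math.pi * t / T)

def lateral_position(t, D, T):
    """Integrate a(t) twice with zero initial offset and lateral velocity."""
    return D * (t / T - math.sin(2 * math.pi * t / T) / (2 * math.pi))

def trajectory(D, T, v_long, steps):
    """Sample (x, y) points along the maneuver at constant longitudinal speed."""
    points = []
    for k in range(steps + 1):
        t = k * T / steps
        points.append((v_long * t, lateral_position(t, D, T)))
    return points

# A 3.5 m lane change over 4 s at 20 m/s, sampled at 9 points.
for x, y in trajectory(D=3.5, T=4.0, v_long=20.0, steps=8):
    print(round(x, 1), round(y, 3))
```

Any number of trajectory points can be regenerated from the same parameters, which is the sense in which a parametric output is more compact than a list of coordinates.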

[CV-9] SmartPoser: Arm Pose Estimation with a Smartphone and Smartwatch Using UWB and IMU Data

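[Quick Read]: This paper addresses how to accurately estimate a user's arm pose without cameras or multiple worn inertial measurement units (IMUs). Existing systems either raise privacy issues (cameras) or require extra worn hardware (multiple IMUs or markers). The key idea is to use the ultra-wideband (UWB) functionality already present on an off-the-shelf smartphone and smartwatch to measure the absolute distance between the two devices. This measurement complements inertial data, which is relative and suffers from drift, and enables a software-only approach that estimates the wrist and elbow joint positions without requiring the user to provide training data.

Link: https://arxiv.org/abs/2509.03451
Authors: Nathan DeVrio, Vimal Mollyn, Chris Harrison
Affiliations: Carnegie Mellon University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: The first two listed authors contributed equally. Published at UIST 2023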

Abstract:The ability to track a user’s arm pose could be valuable in a wide range of applications, including fitness, rehabilitation, augmented reality input, life logging, and context-aware assistants. Unfortunately, this capability is not readily available to consumers. Systems either require cameras, which carry privacy issues, or utilize multiple worn IMUs or markers. In this work, we describe how an off-the-shelf smartphone and smartwatch can work together to accurately estimate arm pose. Moving beyond prior work, we take advantage of more recent ultra-wideband (UWB) functionality on these devices to capture absolute distance between the two devices. This measurement is the perfect complement to inertial data, which is relative and suffers from drift. We quantify the performance of our software-only approach using off-the-shelf devices, showing it can estimate the wrist and elbow joints with a median positional error of 11.0 cm, without the user having to provide training data.

[CV-10] Decoding Visual Neural Representations by Multimodal with Dynamic Balancing

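[Quick Read]: This paper addresses the difficulty of decoding visual neural representations from low signal-to-noise ratio electroencephalography (EEG) signals, with a focus on improving the semantic alignment between EEG and image content. The solution is a framework that integrates EEG, image, and text data. The text modality provides explicit semantic labels, so that image and EEG features of the same category align more closely with the corresponding text representations in a shared multimodal space. An adapter module alleviates the instability of high-dimensional representations and facilitates the alignment and fusion of cross-modal features. A Modal Consistency Dynamic Balance (MCDB) strategy dynamically adjusts the contribution weight of each modality to counter the imbalance introduced by the textual representations. Finally, a stochastic perturbation regularization (SPR) term injects dynamic Gaussian noise during modality optimization to improve the generalization of semantic perturbation-based models.

Link: https://arxiv.org/abs/2509.03433
Authors: Kaili sun, Xingyu Miao, Bing Zhai, Haoran Duan, Yang Long
Affiliations: Durham University; Northumbria University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: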

Abstract:In this work, we propose an innovative framework that integrates EEG, image, and text data, aiming to decode visual neural representations from low signal-to-noise ratio EEG signals. Specifically, we introduce text modality to enhance the semantic correspondence between EEG signals and visual content. With the explicit semantic labels provided by text, image and EEG features of the same category can be more closely aligned with the corresponding text representations in a shared multimodal space. To fully utilize pre-trained visual and textual representations, we propose an adapter module that alleviates the instability of high-dimensional representation while facilitating the alignment and fusion of cross-modal features. Additionally, to alleviate the imbalance in multimodal feature contributions introduced by the textual representations, we propose a Modal Consistency Dynamic Balance (MCDB) strategy that dynamically adjusts the contribution weights of each modality. We further propose a stochastic perturbation regularization (SPR) term to enhance the generalization ability of semantic perturbation-based models by introducing dynamic Gaussian noise in the modality optimization process. The evaluation results on the ThingsEEG dataset show that our method surpasses previous state-of-the-art methods in both Top-1 and Top-5 accuracy metrics, improving by 2.0% and 4.7% respectively.

[CV-11] EclipseTouch: Touch Segmentation on Ad Hoc Surfaces using Worn Infrared Shadow Casting

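[Quick Read]: This paper addresses how mixed reality systems can accurately detect touch events on uninstrumented, everyday surfaces. Prior work has shown that virtual interfaces bound to physical surfaces offer performance and ergonomic benefits over interfaces floating in the air, but this requires reliable contact sensing. The proposed solution is EclipseTouch, a new headset-integrated technique. It combines a computer-triggered camera with one or more infrared emitters to create structured shadows, from which it estimates hover distance (mean error of 6.9 mm) and touch contact (98.0% accuracy). The authors discuss how the technique works across a range of surface materials, interaction orientations, and environmental lighting conditions.

Link: https://arxiv.org/abs/2509.03430
Authors: Vimal Mollyn, Nathan DeVrio, Chris Harrison
Affiliations: Carnegie Mellon University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: Accepted to UIST 2025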

Abstract:The ability to detect touch events on uninstrumented, everyday surfaces has been a long-standing goal for mixed reality systems. Prior work has shown that virtual interfaces bound to physical surfaces offer performance and ergonomic benefits over tapping at interfaces floating in the air. A wide variety of approaches have been previously developed, to which we contribute a new headset-integrated technique called EclipseTouch. We use a combination of a computer-triggered camera and one or more infrared emitters to create structured shadows, from which we can accurately estimate hover distance (mean error of 6.9 mm) and touch contact (98.0% accuracy). We discuss how our technique works across a range of conditions, including surface material, interaction orientation, and environmental lighting.

[CV-12] Time-Scaling State-Space Models for Dense Video Captioning BMVC2025

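[Quick Read]: This paper addresses the high computational complexity and memory footprint of processing long videos for dense video captioning, and the inability of traditional methods to process video online. Existing methods struggle with long video sequences and usually need the entire video as input before producing a result, which limits real-time use. The solution is an improved architecture based on State-Space Models (SSMs), called State-Space Models with Transfer State. It combines the long-sequence and recurrent properties of SSMs and addresses their main limitation, the inability to sustain state over very long contexts, which scales SSMs further in time. The model supports generating captions in an online or streaming manner and uses 7x fewer FLOPs.

Link: https://arxiv.org/abs/2509.03426
Authors: AJ Piergiovanni, Ganesh Satish Mallya, Dahun Kim, Anelia Angelova
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2025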

Abstract:Dense video captioning is a challenging video understanding task which aims to simultaneously segment the video into a sequence of meaningful consecutive events and to generate detailed captions to accurately describe each event. Existing methods often encounter difficulties when working with the long videos associated with dense video captioning, due to the computational complexity and memory limitations. Furthermore, traditional approaches require the entire video as input, in order to produce an answer, which precludes online processing of the video. We address these challenges by time-scaling State-Space Models (SSMs) to even longer sequences than before. Our approach, State-Space Models with Transfer State, combines both the long-sequence and recurrent properties of SSMs and addresses the main limitation of SSMs which are otherwise not able to sustain their state for very long contexts, effectively scaling SSMs further in time. The proposed model is particularly suitable for generating captions on-the-fly, in an online or streaming manner, without having to wait for the full video to be processed, which is more beneficial in practice. When applied to dense video captioning, our approach scales well with video lengths and uses 7x fewer FLOPs.
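The recurrent property that makes streaming possible can be illustrated with a scalar linear state-space recurrence: if the final state of one chunk is handed to the next chunk, chunked processing reproduces the full-sequence output. The sketch below is a generic toy, not the paper's State-Space Models with Transfer State; the coefficients `a`, `b`, `c` and the function names are hypothetical, and the paper's mechanism for sustaining state over very long contexts may differ.

```python
def ssm_scan(xs, h0=0.0, a=0.9, b=0.5, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t; return outputs and final state."""
    h, ys = h0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys, h

def ssm_streaming(xs, chunk_size, a=0.9, b=0.5, c=1.0):
    """Process the sequence chunk by chunk, carrying the state across chunks."""
    h, ys = 0.0, []
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        chunk_ys, h = ssm_scan(chunk, h0=h, a=a, b=b, c=c)  # state handed over here
        ys.extend(chunk_ys)
    return ys

xs = [0.1 * k for k in range(10)]
full_outputs, _ = ssm_scan(xs)
streamed_outputs = ssm_streaming(xs, chunk_size=3)
```

Outputs for early chunks are available before later chunks arrive, which is what allows results to be emitted on the fly.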

[CV-13] Scalable and Loosely-Coupled Multimodal Deep Learning for Breast Cancer Subtyping

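[Quick Read]: This paper addresses the difficulty of accurate breast cancer molecular subtyping caused by heterogeneous multimodal data and by differences in which modalities are available across clinical settings. The core challenge is to fuse data from different sources, such as copy number variation (CNV), clinical records, and histopathology images, to improve subtyping accuracy while keeping the framework scalable and flexible. The solution is a scalable, loosely-coupled multimodal integration framework. First, it introduces a dual-based representation for whole slide images (WSIs) that combines traditional image-based and graph-based representations and significantly improves performance. Second, it presents a new multimodal fusion strategy; the framework can add or remove modalities without re-training existing ones. The combined approach outperforms state-of-the-art methods in breast cancer subtyping and is designed to be applicable to other cancer types.

Link: https://arxiv.org/abs/2509.03408
Authors: Mohammed Amer, Mohamed A. Suliman, Tu Bui, Nuria Garcia, Serban Georgescu
Affiliations: Fujitsu Research of Europe Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: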

Abstract:Healthcare applications are inherently multimodal, benefiting greatly from the integration of diverse data sources. However, the modalities available in clinical settings can vary across different locations and patients. A key area that stands to gain from multimodal integration is breast cancer molecular subtyping, an important clinical task that can facilitate personalized treatment and improve patient prognosis. In this work, we propose a scalable and loosely-coupled multimodal framework that seamlessly integrates data from various modalities, including copy number variation (CNV), clinical records, and histopathology images, to enhance breast cancer subtyping. While our primary focus is on breast cancer, our framework is designed to easily accommodate additional modalities, offering the flexibility to scale up or down with minimal overhead without requiring re-training of existing modalities, making it applicable to other types of cancers as well. We introduce a dual-based representation for whole slide images (WSIs), combining traditional image-based and graph-based WSI representations. This novel dual approach results in significant performance improvements. Moreover, we present a new multimodal fusion strategy, demonstrating its ability to enhance performance across a range of multimodal conditions. Our comprehensive results show that integrating our dual-based WSI representation with CNV and clinical health records, along with our pipeline and fusion strategy, outperforms state-of-the-art methods in breast cancer subtyping.

[CV-14] Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation ICCV

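[Quick Read]: This paper addresses the difficulty of evaluating concept customization, in particular how to measure a model's fidelity to single or multiple concepts and to the interactions among them. Existing metrics tend to be either overly generalized or overly narrow and align poorly with human preference. The solution is a new human-aligned evaluation method, Decomposed GPT Score (D-GPTScore), which decomposes the evaluation criteria into finer aspects and uses a Multimodal Large Language Model (MLLM) to assess each aspect separately, giving a finer evaluation that is closer to human judgment. The authors also release the CC-AlignBench benchmark dataset, which covers tasks from single concepts to multi-person interactions and supports stage-wise evaluation; on it, the method correlates more strongly with human preferences than existing approaches.

Link: https://arxiv.org/abs/2509.03385
Authors: Reina Ishikawa, Ryo Fujii, Hideo Saito, Ryo Hachiuma
Affiliations: Keio University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV Workshop 2025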

点击查看摘要

Abstract:Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to generative prompts and concept images. Moreover, evaluating multiple concepts is considerably more difficult than evaluating a single concept, as it demands detailed assessment not only for each individual concept but also for the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preference. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes evaluation criteria into finer aspects and incorporates aspect-wise assessments using Multimodal Large Language Model (MLLM). Additionally, we release Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide difficulty range – from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at this https URL.

[CV-15] TinyDrop: Tiny Model Guided Token Dropping for Vision Transformers

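Quick read: This paper addresses the high computational cost of Vision Transformers (ViTs) in image classification, especially when running inference with large models. The core challenge is to cut computation without a significant loss of accuracy. The key to the solution is TinyDrop, a training-free token dropping framework: at inference time a lightweight vision model estimates the importance of each image token, and low-importance tokens are selectively discarded to reduce the cost of the attention computation. The method is plug-and-play, requires no change to the ViT architecture, and is compatible with a range of ViT variants. On standard image classification benchmarks it reduces floating-point operations (FLOPs) by up to 80% with minimal accuracy loss.

Link: https://arxiv.org/abs/2509.03379
Authors: Guoxin Wang, Qingyuan Wang, Binhua Huang, Shaowu Chen, Deepu John
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: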

Abstract: Vision Transformers (ViTs) achieve strong performance in image classification but incur high computational costs from processing all image tokens. To reduce inference costs in large ViTs without compromising accuracy, we propose TinyDrop, a training-free token dropping framework guided by a lightweight vision model. The guidance model estimates the importance of each token during inference, so that low-importance tokens can be selectively discarded before the large ViT performs its attention computations. The framework operates plug-and-play, requires no architectural modifications, and is compatible with diverse ViT architectures. Evaluations on standard image classification benchmarks demonstrate that our framework reduces FLOPs by up to 80% for ViTs with minimal accuracy degradation, highlighting its generalization capability and practical utility for efficient ViT-based classification.
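
The token-dropping step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `keep_top_tokens`, the `keep_ratio` parameter, and the example scores are all hypothetical, and the abstract does not say how the guidance model's scores are turned into a keep/drop decision. The sketch simply assumes a top-k rule.

```python
def keep_top_tokens(tokens, importance, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    `importance` stands in for per-token scores from a small guidance
    model (an assumption; the scoring rule is not given in the abstract).
    """
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices by importance, highest first, and keep the top n_keep.
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore spatial order for the large ViT
    return [tokens[i] for i in kept], kept


tokens = ["patch0", "patch1", "patch2", "patch3", "patch4"]
scores = [0.05, 0.40, 0.10, 0.30, 0.15]  # hypothetical guidance-model scores
kept_tokens, kept_idx = keep_top_tokens(tokens, scores, keep_ratio=0.4)
print(kept_tokens, kept_idx)
```

Since self-attention cost grows quadratically with the number of tokens, keeping a fraction of them shrinks the attention cost by roughly the square of that fraction, which is why dropping tokens early can save a large share of FLOPs.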

[CV-16] Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing

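Quick read: This paper addresses the difficulty of modelling global dependencies and local consistency at the same time in hyperspectral unmixing (HU), which leaves existing deep learning methods weak at preserving both long-range interactions and boundary details. The key to the solution is a transformer-guided content-adaptive graph unmixing framework (T-CAGU): a transformer module captures global dependencies, and a content-adaptive graph neural network strengthens local relationships. T-CAGU also combines multiple propagation orders to learn the graph structure dynamically, which improves robustness to noise, and uses a graph residual mechanism to preserve global information and stabilize training.

Link: https://arxiv.org/abs/2509.03376
Authors: Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu
Affiliations: Shanghai University of Electric Power; Shanghai University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: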

Abstract: Hyperspectral unmixing (HU) aims to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over the state-of-the-art methods. Our code is available at this https URL.

[CV-17] InfraDiffusion: zero-shot depth map restoration with diffusion models and prompted segmentation from sparse infrastructure point clouds

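Quick read: This paper addresses fine-grained defect identification on masonry structures, such as brick tunnels, in low-light environments. Conventional methods rely on high-resolution RGB images, which are hard to acquire in the dark; point clouds are robust to dim lighting but are unstructured, sparse, and noisy, which makes brick-level segmentation difficult. The key to the solution is the InfraDiffusion framework, which projects point clouds into depth maps using virtual cameras and restores them with the Denoising Diffusion Null-space Model (DDNM) into clearer, geometrically consistent depth maps. Without task-specific training, this markedly improves visual quality and geometric consistency, allowing the Segment Anything Model (SAM) to perform brick-level segmentation and advancing automated inspection of masonry assets.

Link: https://arxiv.org/abs/2509.03324
Authors: Yixiong Jing, Cheng Zhang, Haibing Wu, Guangming Wang, Olaf Wysocki, Brian Sheil
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: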

Abstract: Point clouds are widely used for infrastructure monitoring by providing geometric information, where segmentation is required for downstream tasks such as defect detection. Existing research has automated semantic segmentation of structural components, while brick-level segmentation (identifying defects such as spalling and mortar loss) has been primarily conducted from RGB images. However, acquiring high-resolution images is impractical in low-light environments like masonry tunnels. Point clouds, though robust to dim lighting, are typically unstructured, sparse, and noisy, limiting fine-grained segmentation. We present InfraDiffusion, a zero-shot framework that projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null-space Model (DDNM). Without task-specific training, InfraDiffusion enhances visual clarity and geometric consistency of depth maps. Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick-level segmentation using the Segment Anything Model (SAM), underscoring its potential for automated inspection of masonry assets. Our code and data are available at this https URL.
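
To make the "virtual camera" projection concrete, here is a minimal sketch of rendering a point cloud into a depth map with a pinhole model and a z-buffer. It is an illustration under stated assumptions, not the paper's pipeline: the function name, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the convention that points are already in the camera frame with `z` pointing forward are all hypothetical choices.

```python
def project_to_depth_map(points, width, height, fx, fy, cx, cy):
    """Project camera-frame 3D points into a depth image.

    Pixels that receive no point stay at 0.0; with sparse clouds these
    holes are what a later restoration step would have to fill.
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # point is behind the virtual camera
        u = int(round(fx * x / z + cx))  # column index
        v = int(round(fy * y / z + cy))  # row index
        if 0 <= u < width and 0 <= v < height:
            # z-buffer: keep the nearest surface seen through this pixel
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.5), (0.5, 0.0, 1.0), (0.0, 0.0, -1.0)]
depth = project_to_depth_map(points, width=4, height=4, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
for row in depth:
    print(row)
```

Most pixels in the output remain empty, which mirrors the sparsity problem the abstract describes.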

[CV-18] Heatmap Guided Query Transformers for Robust Astrocyte Detection across Immunostains and Resolutions

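Quick read: This paper addresses automated detection of astrocytes in neuropathology images, which is difficult because of their intricate branching and stain-dependent variability. The key to the solution is a hybrid detector that combines a convolutional neural network (CNN) with a Transformer: a heatmap-guided query mechanism generates spatially grounded anchors to capture small and faint astrocytes, while a lightweight Transformer module improves discrimination in dense clusters, raising detection sensitivity and accuracy.

Link: https://arxiv.org/abs/2509.03323
Authors: Xizhe Zhang, Jiayang Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: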

Abstract: Astrocytes are critical glial cells whose altered morphology and density are hallmarks of many neurological disorders. However, their intricate branching and stain-dependent variability make automated detection in histological images a highly challenging task. To address these challenges, we propose a hybrid CNN-Transformer detector that combines local feature extraction with global contextual reasoning. A heatmap-guided query mechanism generates spatially grounded anchors for small and faint astrocytes, while a lightweight Transformer module improves discrimination in dense clusters. Evaluated on ALDH1L1- and GFAP-stained astrocyte datasets, the model consistently outperformed Faster R-CNN, YOLOv11 and DETR, achieving higher sensitivity with fewer false positives, as confirmed by FROC analysis. These results highlight the potential of hybrid CNN-Transformer architectures for robust astrocyte detection and provide a foundation for advanced computational pathology tools.

[CV-19] Empowering Lightweight MLLMs with Reasoning via Long CoT SFT

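Quick read: This paper addresses the weak reasoning ability of lightweight multimodal language models (MLLMs), specifically how to improve reasoning in models with fewer than seven billion parameters. The key to the solution is supervised fine-tuning (SFT) on long Chain-of-Thought (long CoT) data. The study finds that this stage is a necessary prerequisite for eliciting reasoning in lightweight MLLMs, and that a subsequent reinforcement learning (RL) stage can improve performance further.

Link: https://arxiv.org/abs/2509.03321
Authors: Linyu Ou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: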

Abstract: While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent RL stage. We conclude that an SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.

[CV-20] PointAD: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection

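Quick read: This paper addresses generalization in 3D anomaly detection across categories and unseen objects, focusing on how to transfer the robustness that CLIP (Contrastive Language-Image Pretraining) shows on 2D images to 3D scenes. The key to the solution is the PointAD+ framework, which models rendering anomalies and spatial anomalies jointly through implicit and explicit 3D anomaly representations. The implicit representation uses point-pixel correspondence to capture texture and appearance anomalies; the explicit representation uses G-aggregation to bring in geometric information and perceive anomalies in spatial structure. Hierarchical representation learning then builds separate rendering prompts and geometry prompts, and a cross-hierarchy contrastive alignment promotes interaction and mutual learning between the two layers. Finally, the anomaly semantics of both layers are integrated to give a generalized understanding of anomalies in 3D objects from unseen categories.

Link: https://arxiv.org/abs/2509.03277
Authors: Qihang Zhou, Shibo He, Jiangtao Yan, Wenchao Meng, Jiming Chen
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to TPAMI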

Abstract: In this paper, we aim to transfer CLIP's robust 2D generalization capabilities to identify 3D anomalies across unseen objects of highly diverse class semantics. To this end, we propose a unified framework to comprehensively detect and segment 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering pixel representations. This approach is referred to as implicit 3D representation, as it focuses solely on rendering pixel anomalies but neglects the inherent spatial relationships within point clouds. Then, we propose PointAD+ to further broaden the interpretation of 3D anomalies by introducing explicit 3D representation, emphasizing spatial abnormality to uncover abnormal spatial relationships. To this end, we propose G-aggregation, which incorporates geometry information to make the aggregated point representations spatially aware. To simultaneously capture rendering and spatial abnormality, PointAD+ proposes hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote the interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates anomaly semantics from both layers to capture the generalized anomaly semantics. At test time, PointAD+ can integrate RGB information in a plug-and-play manner and further improve its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in zero-shot (ZS) 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.

[CV-21] SynBT: High-quality Tumor Synthesis for Breast Tumor Segmentation by 3D Diffusion Model MICCAI2025

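Quick read: This paper addresses the poor performance of existing tumor synthesis methods on tumors that occupy a large spatial volume, such as breast tumors in MRI with a large field-of-view (FOV); conventional methods based on small image patches struggle to reproduce realistic tumor shape and distribution. The key to the solution is SynBT, a 3D medical diffusion model with two core parts: a patch-to-volume autoencoder, which compresses high-resolution MRI into a compact latent space while preserving the resolution of large-FOV volumes, and a mask-conditioned diffusion model, which generates realistic-looking breast tumors (BT) within selected regions of breast tissue. This gives high-quality, controllable tumor synthesis and improves downstream segmentation models by 2-3% Dice Score.

Link: https://arxiv.org/abs/2509.03267
Authors: Hongxu Yang, Edina Timko, Levente Lippenszky, Vanda Czipczer, Lehel Ferenczi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025 Deep-Breath Workshop. Supported by IHI SYNTHIA project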

Abstract: Synthetic tumors in medical images offer controllable characteristics that facilitate the training of machine learning models, leading to improved segmentation performance. However, existing tumor synthesis methods perform suboptimally when the tumor occupies a large spatial volume, as in breast tumor segmentation in MRI with a large field-of-view (FOV), because commonly used tumor generation methods are based on small patches. In this paper, we propose a 3D medical diffusion model, called SynBT, to generate high-quality breast tumors (BT) in contrast-enhanced MRI images. The proposed model consists of a patch-to-volume autoencoder, which is able to compress high-resolution MRIs into a compact latent space while preserving the resolution of volumes with large FOV. Using the obtained latent space feature vector, a mask-conditioned diffusion model is used to synthesize breast tumors within selected regions of breast tissue, resulting in realistic tumor appearances. We evaluated the proposed method on a tumor segmentation task, which demonstrated that the proposed high-quality tumor synthesis method improves common segmentation models by 2-3% Dice Score on a large public dataset, and therefore provides benefits for tumor segmentation in MRI images.
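
The improvement above is reported in Dice Score. For readers unfamiliar with the metric, the sketch below computes the standard Dice coefficient, 2|A∩B| / (|A| + |B|), on flattened binary masks. The masks are made-up toy values, not data from the paper, and returning 1.0 when both masks are empty is a common convention chosen here as an assumption.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (convention)
    return 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]    # toy predicted tumor mask
target = [1, 0, 0, 0, 1, 1]  # toy ground-truth mask
print(dice_score(pred, target))
```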

[CV-22] PI3DETR: Parametric Instance Detection of 3D Point Cloud Edges with a Geometry-Aware 3DETR

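Quick read: This paper addresses the direct prediction of 3D parametric curve instances from raw point clouds. Conventional methods usually depend on intermediate representations and multi-stage processing, which complicates the pipeline and makes it sensitive to noise and to changes in sampling density. The key to the solution is the PI3DETR framework, which extends 3DETR with a geometry-aware matching strategy and specialized loss functions, enabling unified detection of different parametric curve types (cubic Bézier curves, line segments, circles, and arcs) in a single forward pass. Optional post-processing steps refine the predictions further without adding model complexity, and the design markedly improves robustness to the noise and uneven sampling found in real LiDAR and 3D sensing scenarios.

Link: https://arxiv.org/abs/2509.03262
Authors: Fabio F. Oberweger, Michael Schwingshackl, Vanessa Staderini
Affiliations: AIT Austrian Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: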

Abstract:We present PI3DETR, an end-to-end framework that directly predicts 3D parametric curve instances from raw point clouds, avoiding the intermediate representations and multi-stage processing common in prior work. Extending 3DETR, our model introduces a geometry-aware matching strategy and specialized loss functions that enable unified detection of differently parameterized curve types, including cubic Bézier curves, line segments, circles, and arcs, in a single forward pass. Optional post-processing steps further refine predictions without adding complexity. This streamlined design improves robustness to noise and varying sampling densities, addressing critical challenges in real world LiDAR and 3D sensing scenarios. PI3DETR sets a new state-of-the-art on the ABC dataset and generalizes effectively to real sensor data, offering a simple yet powerful solution for 3D edge and curve estimation.
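
One of the curve types named above, the cubic Bézier curve, is fully described by four 3D control points. As a small illustration (not code from the paper; the function names and control points are made up), the sketch below evaluates the standard Bernstein form B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3 and samples points along the curve.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    # Bernstein basis weights for the four control points
    w = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)
    return tuple(
        w[0] * a + w[1] * b + w[2] * c + w[3] * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_curve(control_points, n):
    """Return n evenly spaced (in t) points along the curve, n >= 2."""
    return [cubic_bezier(*control_points, i / (n - 1)) for i in range(n)]


ctrl = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (2.0, 2.0, 1.0), (3.0, 0.0, 1.0)]
for point in sample_curve(ctrl, 5):
    print(point)
```

A detector that outputs such control points yields a compact, resolution-independent description of an edge, in contrast to a per-point edge label.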

[CV-23] LGBP-OrgaNet: Learnable Gaussian Band Pass Fusion of CNN and Transformer Features for Robust Organoid Segmentation and Tracking

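Quick read: This paper addresses the inaccurate assessment of organoid morphology and developmental status caused by traditional fluorescence labeling, which can damage organoid structure. To achieve non-invasive, automated organoid segmentation and tracking, the authors propose LGBP-OrgaNet, a deep learning model whose core innovation is a Learnable Gaussian Band Pass Fusion (LGBP) module that merges the complementary features extracted by convolutional neural network (CNN) and Transformer branches. In the decoder, a Bidirectional Cross Fusion Block fuses multi-scale features, and decoding is completed through progressive concatenation and upsampling, improving the accuracy and robustness of organoid segmentation.

Link: https://arxiv.org/abs/2509.03221
Authors: Jing Zhang, Siying Tao, Jiao Li, Tianhe Wang, Junchen Wu, Ruqian Hao, Xiaohui Du, Ruirong Tan, Rui Li
Affiliations: UESTC; SCU; CDUTCM; Aimining Medical; SCSZLYY
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: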

Abstract: Organoids replicate organ structure and function, playing a crucial role in fields such as tumor treatment and drug screening. Their shape and size can indicate their developmental status, but traditional fluorescence labeling methods risk compromising their structure. Therefore, this paper proposes an automated, non-destructive approach to organoid segmentation and tracking. We introduce LGBP-OrgaNet, a deep learning-based system that accurately segments, tracks, and quantifies organoids. The model leverages complementary information extracted from CNN and Transformer modules and introduces a feature fusion module, Learnable Gaussian Band Pass Fusion, to merge data from the two branches. Additionally, in the decoder, the model uses a Bidirectional Cross Fusion Block to fuse multi-scale features, and finally completes the decoding through progressive concatenation and upsampling. SROrga demonstrates satisfactory segmentation accuracy and robustness on organoid segmentation datasets, providing a potent tool for organoid research.

[CV-24] RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

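Quick read: This paper addresses three obstacles to using functional magnetic resonance imaging (fMRI) for clinical diagnosis of brain disorders: low signal-to-noise ratio, large inter-subject variability, and the limited frequency awareness of existing convolutional neural network (CNN) and Transformer models. In addition, most fMRI datasets lack textual annotations that explain regional activation and connectivity patterns. The key to the solution is the RTGMFF framework, built from three components: (i) an ROI-driven fMRI text generation module that encodes each subject's activation, connectivity, age, and sex as reproducible text tokens; (ii) a hybrid frequency-spatial encoder that fuses a wavelet-Mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure together with long-range spatial dependencies; and (iii) an adaptive semantic alignment module that aligns the text token sequence and visual features in a shared embedding space, using a regularized cosine-similarity loss to narrow the modality gap. The method improves diagnostic accuracy, sensitivity, specificity, and area under the ROC curve on the ADHD-200 and ABIDE benchmarks.

Link: https://arxiv.org/abs/2509.03214
Authors: Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min
Affiliations: Hangzhou Dianzi University; Zhejiang University; Shenzhen Research Institute of Big Data; Yunnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: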

Abstract:Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation deterministically condenses each subject’s activation, connectivity, age, and sex into reproducible text tokens; (ii) Hybrid frequency-spatial encoder fuses a hierarchical wavelet-mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) Adaptive semantic alignment module embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at this https URL.

[CV-25] AIVA: An AI-based Virtual Companion for Emotion-aware Interaction

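Quick read: This paper addresses a limitation of Large Language Models (LLMs) in Human-Computer Interaction (HCI): because they process only unimodal text, they cannot perceive non-verbal emotional cues such as facial expression and tone of voice, which limits how immersive and empathetic an interaction can be. The key to the solution is AIVA, an AI virtual companion system. It introduces a Multimodal Sentiment Perception Network (MSPN) that uses a cross-modal fusion transformer and supervised contrastive learning to extract and integrate emotional features from visual, speech, and other sources. It also designs an emotion-aware prompt engineering strategy to generate more empathetic responses, and integrates a Text-to-Speech (TTS) system and an animated avatar module for emotionally expressive output, forming a framework for emotion-aware agents.

Link: https://arxiv.org/abs/2509.03212
Authors: Chenxi Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: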

Abstract: Recent advances in Large Language Models (LLMs) have significantly improved natural language understanding and generation, enhancing Human-Computer Interaction (HCI). However, LLMs are limited to unimodal text processing and lack the ability to interpret emotional cues from non-verbal signals, hindering more immersive and empathetic interactions. This work explores integrating multimodal sentiment perception into LLMs to create emotion-aware agents. We propose AIVA, an AI-based virtual companion that captures multimodal sentiment cues, enabling emotionally aligned and animated HCI. AIVA introduces a Multimodal Sentiment Perception Network (MSPN) using a cross-modal fusion transformer and supervised contrastive learning to provide emotional cues. Additionally, we develop an emotion-aware prompt engineering strategy for generating empathetic responses and integrate a Text-to-Speech (TTS) system and animated avatar module for expressive interactions. AIVA provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.

[CV-26] Efficient Active Training for Deep LiDAR Odometry

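Quick read: This paper addresses the low training efficiency and limited generalization of deep LiDAR odometry models, which in practice depend on large and diverse training sets. The key to the solution is an active training framework built on two strategies: Initial Training Set Selection (ITSS) and Active Incremental Selection (AIS). ITSS breaks motion sequences recorded in general weather into nodes and edges for trajectory analysis and prioritizes diverse sequences to build a high-quality initial training set. AIS uses scene reconstruction and prediction inconsistency to select hard samples iteratively in complex conditions such as snow, continually improving the model's adaptation to varied real-world conditions. The method matches full-dataset performance with only 52% of the sequence volume, improving training efficiency and model robustness.

Link: https://arxiv.org/abs/2509.03211
Authors: Beibei Zhou, Zhiyuan Zhang, Zhenbo Song, Jianhui Guo, Hui Kong
Affiliations: Shanghai Polytechnic University; Singapore Management University; Nanjing University of Science and Technology; University of Macau
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: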

Abstract:Robust and efficient deep LiDAR odometry models are crucial for accurate localization and 3D reconstruction, but typically require extensive and diverse training data to adapt to diverse environments, leading to inefficiencies. To tackle this, we introduce an active training framework designed to selectively extract training data from diverse environments, thereby reducing the training load and enhancing model generalization. Our framework is based on two key strategies: Initial Training Set Selection (ITSS) and Active Incremental Selection (AIS). ITSS begins by breaking down motion sequences from general weather into nodes and edges for detailed trajectory analysis, prioritizing diverse sequences to form a rich initial training dataset for training the base model. For complex sequences that are difficult to analyze, especially under challenging snowy weather conditions, AIS uses scene reconstruction and prediction inconsistency to iteratively select training samples, refining the model to handle a wide range of real-world scenarios. Experiments across datasets and weather conditions validate our approach’s effectiveness. Notably, our method matches the performance of full-dataset training with just 52% of the sequence volume, demonstrating the training efficiency and robustness of our active training paradigm. By optimizing the training process, our approach sets the stage for more agile and reliable LiDAR odometry systems, capable of navigating diverse environmental conditions with greater precision.

[CV-27] PPORLD-EDNetLDCT: A Proximal Policy Optimization-Based Reinforcement Learning Framework for Adaptive Low-Dose CT Denoising

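Quick read: This paper addresses the increased noise and reduced image quality that result from lowering the radiation dose in Low-dose Computed Tomography (LDCT). Conventional denoising methods, such as iterative optimization or supervised learning, struggle to remove noise while preserving key structural information, which affects diagnostic accuracy. The key to the solution is PPORLD-EDNetLDCT, a reinforcement learning (RL) model with an encoder-decoder architecture. It uses the Proximal Policy Optimization (PPO) algorithm to optimize denoising policies dynamically in real time within a custom Gym environment, trained on image quality feedback. The model denoises LDCT images effectively and improves peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and downstream classification accuracy.

Link: https://arxiv.org/abs/2509.03185
Authors: Debopom Sutradhar, Ripon Kumar Debnath, Mohaimenul Azam Khan Raiaan, Yan Zhang, Reem E. Mohamed, Sami Azam
Affiliations: United International University; Charles Darwin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 5 figures, 5 tables. Submitted to Computers in Biology and Medicine for peer review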

Abstract: Low-dose computed tomography (LDCT) is critical for minimizing radiation exposure, but it often leads to increased noise and reduced image quality. Traditional denoising methods, such as iterative optimization or supervised learning, often fail to preserve image quality. To address these challenges, we introduce PPORLD-EDNetLDCT, a reinforcement learning-based (RL) approach with an Encoder-Decoder for LDCT. Our method utilizes a dynamic RL-based approach in which an advanced proximal policy optimization (PPO) algorithm is used to optimize denoising policies in real time, based on image quality feedback, trained via a custom gym environment. The experimental results on the low dose CT image and projection dataset demonstrate that the proposed PPORLD-EDNetLDCT model outperforms traditional denoising techniques and other DL-based methods, achieving a peak signal-to-noise ratio of 41.87, a structural similarity index measure of 0.9814 and a root mean squared error of 0.00236. Moreover, on the NIH-AAPM-Mayo Clinic Low Dose CT Challenge dataset our method achieved a PSNR of 41.52, SSIM of 0.9723 and RMSE of 0.0051. Furthermore, we validated the quality of denoising using a classification task on the COVID-19 LDCT dataset, where the images processed by our method improved the classification accuracy to 94%, achieving 4% higher accuracy compared to denoising without RL-based denoising. This method offers a promising solution for safer and more accurate LDCT imaging.
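
For context on the optimizer named in the title, the sketch below computes the standard clipped surrogate objective from the PPO literature: the mean over samples of min(r·A, clip(r, 1-ε, 1+ε)·A), where r is the probability ratio between the new and old policy and A is the advantage. It is a generic illustration, not the paper's training code; the abstract does not state which PPO variant, reward, or hyperparameters are used, so the function name, `clip_eps=0.2`, and the toy inputs are assumptions.

```python
import math


def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic value so large policy jumps are not rewarded.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


new_logp = [math.log(1.5), math.log(0.5)]  # toy log-probabilities
old_logp = [0.0, 0.0]
advantages = [1.0, -1.0]
print(ppo_clipped_objective(new_logp, old_logp, advantages))
```

In a denoising setting like the one described, the advantage would presumably be derived from an image-quality reward, but that mapping is not specified in the abstract.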

[CV-28] AutoDetect: Designing an Autoencoder-based Detection Method for Poisoning Attacks on Object Detection Applications in the Military Domain

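Quick read: This paper addresses the threat of poisoning attacks on object detection systems in the military domain and the difficulty of detecting them. The widespread use of open-source datasets and pretrained models exposes military object detectors to malicious data contamination that undermines their security and robustness, yet research on such attacks remains very limited. To study the problem, the authors build MilCivVeh, a small custom dataset of military vehicles, and implement a patch-based poisoning attack derived from BadDet to assess detector vulnerability. Experiments show that a positive attack success rate is achievable but requires a large share of the data to be poisoned, which limits practical feasibility. The paper then proposes AutoDetect, a lightweight, fast autoencoder-based patch detection method that separates clean from poisoned samples using the reconstruction error of image slices. It outperforms existing detection methods while using less time and memory, offering a practical route to securing military AI systems.

Link: https://arxiv.org/abs/2509.03179
Authors: Alma M. Liezenga, Stefan Wijnja, Puck de Haan, Niels W. T. Brink, Jip J. van Stijn, Yori Kamphuis, Klamer Schutte
Affiliations: TNO
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: To be presented at SPIE: Sensors + Imaging, Artificial Intelligence for Security and Defence Applications II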

Abstract: Poisoning attacks pose an increasing threat to the security and robustness of Artificial Intelligence systems in the military domain. The widespread use of open-source datasets and pretrained models exacerbates this risk. Despite the severity of this threat, there is limited research on the application and detection of poisoning attacks on object detection systems. This is especially problematic in the military domain, where attacks can have grave consequences. In this work, we investigate both the effect of poisoning attacks on military object detectors in practice and the best approach to detect these attacks. To support this research, we create a small, custom dataset featuring military vehicles: MilCivVeh. We explore the vulnerability of military object detectors to poisoning attacks by implementing a modified version of the BadDet attack: a patch-based poisoning attack. We then assess its impact, finding that while a positive attack success rate is achievable, it requires a substantial portion of the data to be poisoned, raising questions about its practical applicability. To address the detection challenge, we test both specialized poisoning detection methods and anomaly detection methods from the visual industrial inspection domain. Since our research shows that both classes of methods are lacking, we introduce our own patch detection method: AutoDetect, a simple, fast, and lightweight autoencoder-based method. Our method shows promising results in separating clean from poisoned samples using the reconstruction error of image slices, outperforming existing methods, while being less time- and memory-intensive. We urge that the availability of large, representative datasets in the military domain is a prerequisite to further evaluate risks of poisoning attacks and opportunities for patch detection.
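
The idea of flagging poisoned samples by "the reconstruction error of image slices" can be sketched as follows. This is only an illustration of the principle: the autoencoder itself is not modelled (its output is passed in as `reconstruction`), and the slice size, the max-over-slices rule, and the threshold are hypothetical choices, not details given in the abstract.

```python
def slice_errors(image, reconstruction, slice_size):
    """Mean squared error for each non-overlapping square slice."""
    height, width = len(image), len(image[0])
    errors = []
    for top in range(0, height, slice_size):
        for left in range(0, width, slice_size):
            squared, count = 0.0, 0
            for r in range(top, min(top + slice_size, height)):
                for c in range(left, min(left + slice_size, width)):
                    diff = image[r][c] - reconstruction[r][c]
                    squared += diff * diff
                    count += 1
            errors.append(squared / count)
    return errors


def looks_poisoned(image, reconstruction, slice_size, threshold):
    """Flag the image if any slice is reconstructed unusually badly."""
    return max(slice_errors(image, reconstruction, slice_size)) > threshold


clean = [[0.5] * 4 for _ in range(4)]
patched = [row[:] for row in clean]
for r in (2, 3):
    for c in (2, 3):
        patched[r][c] = 1.0  # toy trigger patch in the bottom-right corner
# Assumption: an autoencoder trained on clean data reproduces the clean
# background but fails to reproduce the unfamiliar patch.
reconstruction = [[0.5] * 4 for _ in range(4)]
print(looks_poisoned(clean, reconstruction, 2, 0.1))
print(looks_poisoned(patched, reconstruction, 2, 0.1))
```

Scoring slices rather than the whole image keeps a small patch from being averaged away by a well-reconstructed background.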

[CV-29] Count2Density: Crowd Density Estimation without Location-level Annotations

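Quick read: This paper addresses the heavy dependence of crowd density estimation on fine-grained location-level annotations (a point placed on each individual), which are time-consuming to collect and limit scalability in real-world settings. The key to the solution is the Count2Density pipeline, which trains a model to produce density maps with quantitative spatial information using only count-level annotations (the total number of people). Its core ideas are: (1) a Historical Map Bank, initialised with an unsupervised saliency estimator and updated iteratively with an exponential moving average (EMA) of predictions, from which pseudo-density maps are generated to reduce confirmation bias; (2) sampling locations from the estimated crowd areas with a hypergeometric distribution, with the number of samples set by the count annotation; and (3) a self-supervised contrastive spatial regulariser that encourages similar features within crowded regions and maximises dissimilarity with background regions. Experiments show that the method outperforms cross-domain adaptation methods and recent state-of-the-art approaches in semi-supervised settings.

Link: https://arxiv.org/abs/2509.03170
Authors: Mattia Litrico, Feng Chen, Michael Pound, Sotirios A Tsaftaris, Sebastiano Battiato, Mario Valerio Giuffrida
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: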

Abstract: Crowd density estimation is a well-known computer vision task aimed at estimating the density distribution of people in an image. The main challenge in this domain is the reliance on fine-grained location-level annotations (i.e., points placed on top of each individual) to train deep networks. Collecting such detailed annotations is tedious and time-consuming, and poses a significant barrier to scalability for real-world applications. To alleviate this burden, we present Count2Density: a novel pipeline designed to predict meaningful density maps containing quantitative spatial information using only count-level annotations (i.e., total number of people) during training. To achieve this, Count2Density generates pseudo-density maps leveraging past predictions stored in a Historical Map Bank, thereby reducing confirmation bias. This bank is initialised using an unsupervised saliency estimator to provide an initial spatial prior and is iteratively updated with an EMA of predicted density maps. These pseudo-density maps are obtained by sampling locations from estimated crowd areas using a hypergeometric distribution, with the number of samplings determined by the count-level annotations. To further enhance the spatial awareness of the model, we add a self-supervised contrastive spatial regulariser to encourage similar feature representations within crowded regions while maximising dissimilarity with background regions. Experimental results demonstrate that our approach significantly outperforms cross-domain adaptation methods and achieves better results than recent state-of-the-art approaches in semi-supervised settings across several datasets. Additional analyses validate the effectiveness of each individual component of our pipeline, confirming the ability of Count2Density to effectively retrieve spatial information from count-level annotations and enabling accurate subregion counting.
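
Two of the steps above, the EMA update of the map bank and count-driven sampling of pseudo-locations, can be sketched in a few lines. This is a loose illustration under explicit assumptions, not the authors' procedure: density maps are flattened to lists, the `momentum` and `resolution` parameters are hypothetical, and the hypergeometric draw is realised by sampling without replacement from an urn in which each pixel owns a number of balls proportional to its share of the bank map.

```python
import random
from collections import Counter


def ema_update(bank_map, predicted_map, momentum=0.9):
    """Blend the stored map with the newest prediction, pixel by pixel."""
    return [momentum * b + (1.0 - momentum) * p
            for b, p in zip(bank_map, predicted_map)]


def sample_pseudo_points(density, count, rng, resolution=100):
    """Draw `count` locations without replacement from a pixel urn.

    Drawing without replacement from an urn of labelled balls gives
    multivariate hypergeometric counts per pixel.
    """
    total = sum(density)
    if total <= 0:
        raise ValueError("density map must have positive mass")
    urn = []
    for idx, value in enumerate(density):
        urn.extend([idx] * round(resolution * value / total))
    hits = Counter(rng.sample(urn, min(count, len(urn))))
    return [hits.get(i, 0) for i in range(len(density))]


bank = [0.0, 0.2, 0.8, 0.0]        # toy flattened map from the bank
prediction = [0.0, 0.4, 0.6, 0.0]  # toy current model prediction
bank = ema_update(bank, prediction, momentum=0.5)
pseudo = sample_pseudo_points(bank, count=5, rng=random.Random(0))
print(bank, pseudo, sum(pseudo))
```

By construction the pseudo-map sums to the annotated count and places no points where the bank assigns zero mass, which is the property a count-only label can enforce.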

[CV-30] Preserving instance continuity and length in segmentation through connectivity-aware loss computation

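Quick read: This paper addresses discontinuous segmentation of elongated structures, such as the axon initial segment (AIS), caused by signal dropout in biomedical segmentation tasks; the issue matters most where continuity is more important than voxel-wise accuracy. The key to the solution is two new loss functions, Negative Centerline Loss and Simplified Topology Loss, which embed structural priors in the loss design and guide a convolutional neural network (CNN) to produce outputs that preserve the connectivity of each instance. Combined with experiment design choices such as downscaling and spacing correction, they markedly reduce the number of breaks in the segmentation, especially where the input signal is missing, making instance length calculation in downstream applications more accurate and reliable.

Link: https://arxiv.org/abs/2509.03154
Authors: Karol Szustakowski, Luk Frank, Julia Esser, Jan Gründemann, Marie Piraud
Affiliations: Helmholtz AI, Helmholtz Munich; German Center for Neurodegenerative Diseases; University of Bonn, Medical Faculty
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works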

Abstract:In many biomedical segmentation tasks, the preservation of elongated structure continuity and length is more important than voxel-wise accuracy. We propose two novel loss functions, Negative Centerline Loss and Simplified Topology Loss, that, applied to Convolutional Neural Networks (CNNs), help preserve connectivity of output instances. Moreover, we discuss characteristics of experiment design, such as downscaling and spacing correction, that help obtain continuous segmentation masks. We evaluate our approach on a 3D light-sheet fluorescence microscopy dataset of axon initial segments (AIS), a task prone to discontinuity due to signal dropout. Compared to standard CNNs and existing topology-aware losses, our methods reduce the number of segmentation discontinuities per instance, particularly in regions with missing input signal, resulting in improved instance length calculation in downstream applications. Our findings demonstrate that structural priors embedded in the loss design can significantly enhance the reliability of segmentation for biological applications.

[CV-31] Temporally-Aware Diffusion Model for Brain Progression Modelling with Bidirectional Temporal Regularisation

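[Quick read]: This paper addresses three limitations of existing generative MRI models for predicting how brain structure changes over time: (i) they do not explicitly model the relationship between structural changes and time intervals, especially when trained on age-imbalanced datasets; (ii) some rely only on scan interpolation, producing intermediate images between timepoints rather than future pathological progression; and (iii) most use 2D slice-based architectures that disregard full 3D anatomical context. The key to the solution is a 3D Temporally-Aware Diffusion Model (TADM-3D) with two core components: (1) a pre-trained Brain-Age Estimator (BAE) that guides generation so that the follow-up MRI reflects the expected age difference between the baseline and generated follow-up scans; and (2) Back-In-Time Regularisation (BITR), which trains the model bidirectionally (baseline to follow-up and follow-up to baseline) to improve the temporal accuracy of the generated scans.

Link: https://arxiv.org/abs/2509.03141
Authors: Mattia Litrico,Francesco Guarnera,Mario Valerio Giuffrida,Daniele Ravì,Sebastiano Battiato
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: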

Abstract:Generating realistic MRIs to accurately predict future changes in the structure of the brain is an invaluable tool for clinicians in assessing clinical outcomes and analysing disease progression at the patient level. However, existing methods present some limitations: (i) some approaches fail to explicitly capture the relationship between structural changes and time intervals, especially when trained on age-imbalanced datasets; (ii) others rely only on scan interpolation, which lacks clinical utility, as it generates intermediate images between timepoints rather than future pathological progression; and (iii) most approaches rely on 2D slice-based architectures, thereby disregarding full 3D anatomical context, which is essential for accurate longitudinal predictions. We propose a 3D Temporally-Aware Diffusion Model (TADM-3D), which accurately predicts brain progression on MRI volumes. To better model the relationship between time interval and brain changes, TADM-3D uses a pre-trained Brain-Age Estimator (BAE) that guides the diffusion model in the generation of MRIs that accurately reflect the expected age difference between baseline and generated follow-up scans. Additionally, to further improve the temporal awareness of TADM-3D, we propose the Back-In-Time Regularisation (BITR), by training TADM-3D to predict bidirectionally from the baseline to follow-up (forward), as well as from the follow-up to baseline (backward). Although predicting past scans has limited clinical applications, this regularisation helps the model generate temporally more accurate scans. We train and evaluate TADM-3D on the OASIS-3 dataset, and we validate the generalisation performance on an external test set from the NACC dataset. The code will be available upon acceptance.
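The bidirectional training idea behind BITR can be pictured as building two training samples from each longitudinal scan pair. The sketch below is only an illustration: the function name, the dictionary layout, and the use of a signed age gap as the conditioning signal are assumptions, not details taken from the paper.

```python
def make_bidirectional_pairs(baseline, followup, baseline_age, followup_age):
    """Return a forward and a backward training sample for one scan pair.

    Assumption: the model is conditioned on a signed age gap, positive when
    predicting the future scan and negative when predicting the past scan.
    """
    gap = followup_age - baseline_age
    forward = {"source": baseline, "target": followup, "age_gap": gap}
    backward = {"source": followup, "target": baseline, "age_gap": -gap}
    return [forward, backward]


# Scan identifiers are placeholders standing in for 3D MRI volumes.
pairs = make_bidirectional_pairs("scan_age_70", "scan_age_73_5", 70.0, 73.5)
```

Each scan pair therefore contributes twice to training, once in each temporal direction, which is the regularising effect the abstract describes.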

[CV-32] Towards Realistic Hand-Object Interaction with Gravity-Field Based Diffusion Bridge

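[Quick read]: This paper addresses a shortcoming of existing hand-object interaction reconstruction and hand-object pose estimation methods: because hand and object geometry is complex and diverse, their results often show interpenetration or noticeable gaps in regions that should be in contact, and they struggle to capture the non-negligible deformation of a real hand during interaction. The key to the solution is to formulate hand-object interaction as an attraction-driven process and to propose a Gravity-Field Based Diffusion Bridge (GravityDB) that simulates the interaction between a deformable hand surface and rigid objects, generating results that are free of interpenetration, ensure stable grasping, and capture realistic hand deformations. In addition, semantic information from textual descriptions guides the construction of the gravitational field, yielding more semantically meaningful interaction regions.

Link: https://arxiv.org/abs/2509.03114
Authors: Miao Xu,Xiangyu Zhu,Xusheng Liang,Zidu Wang,Jinlin Wu,Zhen Lei
Affiliations: Institute of Automation, Chinese Academy of Sciences; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science; School of Artificial Intelligence, University of Chinese Academy of Sciences; China Mobile Financial Technology Co., Ltd.; ZKTeco Co., Ltd.; School of Computer Science and Engineering, the Faculty of Innovation Engineering, M.U.S.T
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: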

Abstract:Existing reconstruction or hand-object pose estimation methods are capable of producing coarse interaction states. However, due to the complex and diverse geometry of both human hands and objects, these approaches often suffer from interpenetration or leave noticeable gaps in regions that are supposed to be in contact. Moreover, the surface of a real human hand undergoes non-negligible deformations during interaction, which are difficult to capture and represent with previous methods. To tackle these challenges, we formulate hand-object interaction as an attraction-driven process and propose a Gravity-Field Based Diffusion Bridge (GravityDB) to simulate interactions between a deformable hand surface and rigid objects. Our approach effectively resolves the aforementioned issues by generating physically plausible interactions that are free of interpenetration, ensure stable grasping, and capture realistic hand deformations. Furthermore, we incorporate semantic information from textual descriptions to guide the construction of the gravitational field, enabling more semantically meaningful interaction regions. Extensive qualitative and quantitative experiments on multiple datasets demonstrate the effectiveness of our method.

[CV-33] Information transmission: Inferring change area from change moment in time series remote sensing images

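[Quick read]: This paper addresses the lack of consistency between change area detection and change moment identification in time series remote sensing images: current deep learning methods treat the two as distinct tasks, which can produce inconsistent results. The key to the solution is CAIM-Net (Change Area Inference from Moment Network), which infers the change area from the change moment based on the intrinsic relationship between time series analysis and spatial change detection. CAIM-Net comprises three steps: Difference Extraction and Enhancement, Coarse Change Moment Extraction, and Fine Change Moment Extraction and Change Area Inference. A multiscale temporal Class Activation Mapping (CAM) module weights the change moments, and the change area is inferred from the weighted moments, which keeps the two results consistent.

Link: https://arxiv.org/abs/2509.03112
Authors: Jialu Li,Chen Wu,Meiqi Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: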

Abstract:Time series change detection is a critical task for exploring ecosystem dynamics using time series remote sensing images, because it can simultaneously indicate where and when changes occur. While deep learning has shown excellent performance in this domain, it continues to approach change area detection and change moment identification as distinct tasks. Given that change area can be inferred from change moment, we propose a time series change detection network, named CAIM-Net (Change Area Inference from Moment Network), to ensure consistency between change area and change moment results. CAIM-Net infers change area from change moment based on the intrinsic relationship between time series analysis and spatial change detection. CAIM-Net comprises three key steps: Difference Extraction and Enhancement, Coarse Change Moment Extraction, and Fine Change Moment Extraction and Change Area Inference. In the Difference Extraction and Enhancement, a lightweight encoder with batch dimension stacking is designed to rapidly extract difference features. Subsequently, boundary enhancement convolution is applied to amplify these difference features. In the Coarse Change Moment Extraction, the enhanced difference features from the first step are used for spatiotemporal correlation analysis, and then two distinct methods are employed to determine coarse change moments. In the Fine Change Moment Extraction and Change Area Inference, a multiscale temporal Class Activation Mapping (CAM) module first increases the weight of the change-occurring moment from coarse change moments. Then the weighted change moment is used to infer change area based on the fact that pixels with the change moment must have undergone a change.
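The final inference step rests on a simple observation: a pixel that has a change moment must belong to the change area. A minimal sketch of that rule follows; the score layout, the threshold, and the function name are hypothetical and not taken from CAIM-Net itself.

```python
def infer_change_area(moment_scores, threshold=0.5):
    """Derive per-pixel change moment and change area from temporal scores.

    moment_scores[t][i] is a (hypothetical) score that pixel i changed at step t.
    A pixel is in the change area exactly when it has a change moment, so the
    two outputs cannot contradict each other.
    """
    num_pixels = len(moment_scores[0])
    change_moment, change_area = [], []
    for i in range(num_pixels):
        scores = [frame[i] for frame in moment_scores]
        best_t = max(range(len(scores)), key=scores.__getitem__)
        changed = scores[best_t] >= threshold
        change_moment.append(best_t if changed else None)
        change_area.append(changed)
    return change_moment, change_area


# Three time steps, three pixels.
scores = [
    [0.1, 0.9, 0.2],
    [0.2, 0.1, 0.3],
    [0.8, 0.0, 0.1],
]
moments, area = infer_change_area(scores)
```

Because the area mask is computed from the moment map rather than predicted by a separate head, consistency between the two results holds by construction.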

[CV-34] Backdoor Poisoning Attack Against Face Spoofing Attack Detection Methods

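[Quick read]: This paper studies the latent threat of backdoor poisoning attacks against face anti-spoofing detection: an attacker injects crafted samples into the training data so that the model detects spoofing correctly under normal conditions but classifies a specific spoofing attack as live, bypassing detection. The key contribution is a novel backdoor poisoning attack method that embeds features extracted from a spoofing attack's face image into a live face image without any perceptible visual change, allowing that specific spoofing attack to evade detection covertly. Experiments on public datasets show that the method is a realistic threat to existing spoofing attack detection systems.

Link: https://arxiv.org/abs/2509.03108
Authors: Shota Iwamatsu,Koichi Ito,Takafumi Aoki
Affiliations: Tohoku University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)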

Abstract:Face recognition systems are robust against environmental changes and noise, and thus may be vulnerable to illegal authentication attempts using user face photos, such as spoofing attacks. To prevent such spoofing attacks, it is crucial to discriminate whether the input image is a live user image or a spoofed image prior to the face recognition process. Most existing spoofing attack detection methods utilize deep learning, which necessitates a substantial amount of training data. Consequently, if malicious data is injected into a portion of the training dataset, a specific spoofing attack may be erroneously classified as live, leading to false […]. In this paper, we propose a novel backdoor poisoning attack method to demonstrate the latent threat of backdoor poisoning within face anti-spoofing detection. The proposed method enables certain spoofing attacks to bypass detection by embedding features extracted from the spoofing attack’s face image into a live face image without inducing any perceptible visual […]. In experiments conducted on public datasets, we demonstrate that the proposed method constitutes a realistic threat to existing spoofing attack detection systems.

[CV-35] TRELLIS-Enhanced Surface Features for Comprehensive Intracranial Aneurysm Analysis

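[Quick read]: This paper addresses the difficulty of detecting, segmenting, and modelling intracranial aneurysms, where the core challenge is the scarcity of annotated 3D medical data. The key to the solution is a cross-domain feature-transfer approach that uses the latent geometric embeddings learned by TRELLIS, a generative model trained on large-scale non-medical 3D data, to augment neural networks for aneurysm analysis. Replacing conventional point normals or mesh descriptors with TRELLIS surface features improves three downstream tasks: aneurysm classification, segmentation of aneurysm and vessel regions, and prediction of time-evolving blood-flow fields with a graph neural network. The experiments report gains in accuracy, F1-score, and segmentation quality over state-of-the-art baselines, and a 15% reduction in simulation error.

Link: https://arxiv.org/abs/2509.03095
Authors: Clément Hervé,Paul Garnier,Jonathan Viquerat,Elie Hachem
Affiliations: Mines Paris PSL
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: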

Abstract:Intracranial aneurysms pose a significant clinical risk yet are difficult to detect, delineate and model due to limited annotated 3D data. We propose a cross-domain feature-transfer approach that leverages the latent geometric embeddings learned by TRELLIS, a generative model trained on large-scale non-medical 3D datasets, to augment neural networks for aneurysm analysis. By replacing conventional point normals or mesh descriptors with TRELLIS surface features, we systematically enhance three downstream tasks: (i) classifying aneurysms versus healthy vessels in the Intra3D dataset, (ii) segmenting aneurysm and vessel regions on 3D meshes, and (iii) predicting time-evolving blood-flow fields using a graph neural network on the AnXplore dataset. Our experiments show that the inclusion of these features yields strong gains in accuracy, F1-score and segmentation quality over state-of-the-art baselines, and reduces simulation error by 15%. These results illustrate the broader potential of transferring 3D representations from general-purpose generative models to specialized medical tasks.

[CV-36] High Cursive Complex Character Recognition using GAN External Classifier

[Quick read]: This paper addresses the classification of handwritten characters, in particular complex, cursive characters on which the accuracy of conventional Convolutional Neural Networks (CNNs) drops. The key to the solution is ADA-GAN, a Generative Adversarial Network paired with an external classifier: the generator synthesises fake handwritten character images, adversarially perturbed noise is added, and only samples whose discriminator confidence score exceeds a threshold are used to augment the training data, which makes the model more robust on complex and cursive characters.

Link: https://arxiv.org/abs/2509.03062
Authors: S M Rafiuddin
Affiliations: University of Asia Pacific
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures, published in the Proceedings of the 2nd International Conference on Computing Advancements (ICCA 2022). Paper introduces ADA-GAN with an external classifier for complex cursive handwritten character recognition, evaluated on MNIST and BanglaLekha datasets, showing improved robustness compared to CNN baselines

Abstract:Handwritten characters can be trickier to classify due to their complex and cursive nature compared to simple and non-cursive characters. We present an external classifier along with a Generative Adversarial Network that can classify highly cursive and complex characters. The generator network produces fake handwritten character images, which are then used to augment the training data after adding adversarially perturbed noise and achieving a confidence score above a threshold with the discriminator network. The results show that the accuracy of convolutional neural networks decreases as character complexity increases, but our proposed model, ADA-GAN, remains more robust and effective for both cursive and complex characters.

[CV-37] Isolated Bangla Handwritten Character Classification using Transfer Learning

[Quick read]: This paper addresses accurate recognition of Bangla handwritten characters, covering the fifty basic characters and the many compound characters. The key to the solution is transfer learning with deep neural network models, namely a 3D Convolutional Neural Network (3DCNN), a Residual Neural Network (ResNet), and MobileNet, to build an end-to-end classifier for isolated Bangla handwritten character images while avoiding the vanishing gradient problem. On the Bangla Lekha Isolated dataset the model reaches 99.82% training accuracy and 99.46% test accuracy, which the authors report as better than state-of-the-art benchmarks.

Link: https://arxiv.org/abs/2509.03061
Authors: Abdul Karim,S M Rafiuddin,Jahidul Islam Razin,Tahira Alam
Affiliations: University of Asia Pacific
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 14 figures, published in the Proceedings of the 2nd International Conference on Computing Advancements (ICCA 2022), IEEE. Strong experimental section with comparisons across models (3DCNN, ResNet50, MobileNet)

Abstract:Bangla language consists of fifty distinct characters and many compound characters. Several notable studies have been performed to recognize Bangla characters, both handwritten and optical. Our approach uses transfer learning to classify the basic, distinct, as well as compound Bangla handwritten characters while avoiding the vanishing gradient problem. Deep Neural Network techniques such as 3D Convolutional Neural Network (3DCNN), Residual Neural Network (ResNet), and MobileNet are applied to generate an end-to-end classification of all possible standard formations of handwritten characters in the Bangla language. The Bangla Lekha Isolated dataset, which contains 166,105 Bangla character image samples categorized into 84 distinct classes, is used for this classification model. The model achieved 99.82% accuracy on training data and 99.46% accuracy on test data. Comparisons with various state-of-the-art benchmarks of Bangla handwritten character classification show that the proposed model achieves better accuracy in classifying the data.

[CV-38] DCDB: Dynamic Conditional Dual Diffusion Bridge for Ill-posed Multi-Tasks

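[Quick read]: This paper addresses the difficulty conditional diffusion models have in exploiting the intrinsic correlation between tasks in multi-task image processing, particularly in ill-posed tasks with scarce training data, where traditional static condition control cannot follow the dynamically evolving characteristics of the tasks and makes the network hard to train. The key to the solution is a dynamic conditional double diffusion bridge training paradigm with two main ideas: (1) decoupling the diffusion process from the condition generation process, which reduces the dependence on supervised data; and (2) generating dynamic conditions with the same noise schedule so that their statistical characteristics are adjusted gradually and time-related information is embedded naturally, which eases network learning. An analysis of the learning objectives in the single-step denoising process and of the changes in attention weights supports the advantage of dynamic conditions.

Link: https://arxiv.org/abs/2509.03044
Authors: Chengjie Huang,Jiafeng Yan,Jing Li,Lu Bai
Affiliations: Zhejiang University; Central University of Finance and Economics; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures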

Abstract:Conditional diffusion models have made impressive progress in the field of image processing, but the characteristics of constructing data distribution pathways make it difficult to exploit the intrinsic correlation between tasks in multi-task scenarios, which is even worse in ill-posed tasks with a lack of training data. In addition, traditional static condition control makes it difficult for networks to learn in multi-task scenarios with its dynamically evolving characteristics. To address these challenges, we propose a dynamic conditional double diffusion bridge training paradigm to build a general framework for ill-posed multi-tasks. Firstly, this paradigm decouples the diffusion and condition generation processes, avoiding the dependence of the diffusion model on supervised data in ill-posed tasks. Secondly, generated by the same noise schedule, dynamic conditions are used to gradually adjust their statistical characteristics, naturally embed time-related information, and reduce the difficulty of network learning. We analyze the learning objectives of the network under different conditional forms in the single-step denoising process and compare the changes in its attention weights in the network, demonstrating the superiority of our dynamic conditions. Taking dehazing and visible-infrared fusion as typical ill-posed multi-task scenarios, we achieve the best performance in multiple indicators on public datasets. The code has been publicly released at: this https URL.

[CV-39] MedLiteNet: Lightweight Hybrid Medical Image Segmentation Model

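[Quick read]: This paper addresses two obstacles in skin-lesion segmentation: Convolutional Neural Networks (CNNs) have limited receptive fields and struggle to model long-range dependencies, while Vision Transformers have quadratic complexity and large parameter counts that hinder their use on small-sample medical datasets. The key to the solution is MedLiteNet, a lightweight CNN-Transformer hybrid that achieves high-precision segmentation through hierarchical feature extraction and multi-scale context aggregation. Its encoder stacks depth-wise Mobile Inverted Bottleneck blocks to reduce computation, inserts a bottleneck-level cross-scale token-mixing unit to exchange information between resolutions, and embeds a boundary-aware self-attention module to sharpen lesion contours.

Link: https://arxiv.org/abs/2509.03041
Authors: Pengyang Yu,Haoquan Wang,Gerard Marks,Tahar Kechadi,Laurence T. Yang,Sahraoui Dhelim,Nyothiri Aung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: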

Abstract:Accurate skin-lesion segmentation remains a key technical challenge for computer-aided diagnosis of skin cancer. Convolutional neural networks, while effective, are constrained by limited receptive fields and thus struggle to model long-range dependencies. Vision Transformers capture global context, yet their quadratic complexity and large parameter budgets hinder use on the small-sample medical datasets common in dermatology. We introduce MedLiteNet, a lightweight CNN-Transformer hybrid tailored for dermoscopic segmentation that achieves high precision through hierarchical feature extraction and multi-scale context aggregation. The encoder stacks depth-wise Mobile Inverted Bottleneck blocks to curb computation, inserts a bottleneck-level cross-scale token-mixing unit to exchange information between resolutions, and embeds a boundary-aware self-attention module to sharpen lesion contours.
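The computational saving from depth-wise convolution, the building block inside Mobile Inverted Bottleneck layers, can be checked with simple parameter arithmetic. The sketch below counts weights only (no biases, no expansion layer) and uses arbitrary example channel sizes; it is not MedLiteNet's actual configuration.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depth-wise convolution followed by a 1 x 1 point-wise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixes channels
    return depthwise + pointwise


# Example sizes chosen for illustration only.
standard = standard_conv_params(c_in=64, c_out=128, k=3)
separable = depthwise_separable_params(c_in=64, c_out=128, k=3)
reduction = standard / separable
```

For these sizes the separable form needs roughly one eighth of the weights, which is the kind of saving that makes such blocks attractive on small medical datasets.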

[CV-40] Background Matters Too: A Language-Enhanced Adversarial Framework for Person Re-Identification

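[Quick read]: This paper addresses two core challenges in person re-identification (ReID): precisely locating the foreground target while suppressing background noise, and extracting fine-grained features from the target region. Visual-only methods rely on costly manual annotations and struggle with complex occlusions, while current multimodal methods introduce semantic cues but focus only on the foreground and overlook the potential value of background semantics. The authors propose an end-to-end framework that jointly models foreground and background information in a dual-branch cross-modal feature extraction pipeline. The key innovation is an intra-semantic alignment and inter-semantic adversarial learning strategy: visual and textual features that share the same semantics are aligned, while similarity between foreground and background features is penalised, which drives the model to attend to identity-relevant foreground cues and actively suppress noisy background regions.

Link: https://arxiv.org/abs/2509.03032
Authors: Kaicong Huang,Talha Azfar,Jack M. Reilly,Thomas Guggisberg,Ruimin Ke
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: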

Abstract:Person re-identification faces two core challenges: precisely locating the foreground target while suppressing background noise and extracting fine-grained features from the target region. Numerous visual-only approaches address these issues by partitioning an image and applying attention modules, yet they rely on costly manual annotations and struggle with complex occlusions. Recent multimodal methods, motivated by CLIP, introduce semantic cues to guide visual understanding. However, they focus solely on foreground information, but overlook the potential value of background cues. Inspired by human perception, we argue that background semantics are as important as the foreground semantics in ReID, as humans tend to eliminate background distractions while focusing on target appearance. Therefore, this paper proposes an end-to-end framework that jointly models foreground and background information within a dual-branch cross-modal feature extraction pipeline. To help the network distinguish between the two domains, we propose an intra-semantic alignment and inter-semantic adversarial learning strategy. Specifically, we align visual and textual features that share the same semantics across domains, while simultaneously penalizing similarity between foreground and background features to enhance the network’s discriminative power. This strategy drives the model to actively suppress noisy background regions and enhance attention toward identity-relevant foreground cues. Comprehensive experiments on two holistic and two occluded ReID benchmarks demonstrate the effectiveness and generality of the proposed method, with results that match or surpass those of current state-of-the-art approaches.

[CV-41] Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens EMNLP2025

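[Quick read]: This paper addresses a problem in Large Vision-Language Models (LVLMs): they often mistake text inputs that lack visual evidence for part of the image, which leads to erroneous responses. The authors identify a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that signal through a distinctive activation pattern when a textual concept is not visually grounded in the image. The key to the solution is a detection module built on these activation patterns that classifies whether an input token is visually grounded; guided by its prediction, the outputs are refined by reinterpreting the question prompt or replacing the detected absent tokens during generation, which mitigates the models' tendency to falsely presume the visual presence of text input.

Link: https://arxiv.org/abs/2509.03025
Authors: Sohee Kim,Soohyun Ryu,Joonhyung Park,Eunho Yang
Affiliations: KAIST AI; AITRICS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted to EMNLP 2025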

Abstract:Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, we find that they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal the visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its prediction, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models’ tendency to falsely presume the visual presence of text input and generalizes across various LVLMs.

[CV-42] Uncertainty-aware Test-Time Training (UT3) for Efficient On-the-fly Domain Adaptive Dense Regression

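[Quick read]: This paper addresses the poor generalisation of Deep Neural Networks (DNNs) under continuous domain shift, especially in resource-constrained, latency-sensitive robotics applications, where standard test-time training requires multiple forward and backward passes per test sample and sharply increases inference time. The key to the solution is a new framework, UT³, built around an uncertainty-aware self-supervision task: the quantified uncertainty of the model's predictions is used to apply test-time training selectively, which cuts inference time while performing comparably to standard test-time training, and offers a continuous setting for identifying keyframes so that the end user can control how often training is applied.

Link: https://arxiv.org/abs/2509.03012
Authors: Uddeshya Upadhyay
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: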

Abstract:Deep neural networks (DNNs) are increasingly being used in autonomous systems. However, DNNs do not generalize well to domain shift. Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous systems deployed to the real world. Recent work on test-time training proposes methods that adapt to a new test distribution on the fly by optimizing the DNN model for each test input using self-supervision. However, these techniques result in a sharp increase in inference time as multiple forward and backward passes are required for a single test sample (for test-time training) before finally making the prediction based on the fine-tuned features. This is undesirable for real-world robotics applications where these models may be deployed to resource-constrained hardware with strong latency requirements. In this work, we propose a new framework (called UT$^3$) that leverages test-time training for improved performance in the presence of continuous domain shift while also decreasing the inference time, making it suitable for real-world applications. Our method proposes an uncertainty-aware self-supervision task for efficient test-time training that leverages the quantified uncertainty to selectively apply the training, leading to sharp improvements in the inference time while performing comparably to the standard test-time training protocol. Our proposed protocol offers a continuous setting to identify the selected keyframes, allowing the end-user to control how often to apply test-time training. We demonstrate the efficacy of our method on a dense regression task, monocular depth estimation.
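The selective use of test-time training described above can be sketched as an uncertainty gate in the inference loop. The callables `predict` and `adapt`, the scalar uncertainty, and the toy model are hypothetical stand-ins; the abstract does not specify how UT³ computes uncertainty or performs its update.

```python
def run_stream(frames, predict, adapt, threshold):
    """Process frames in order, adapting only on high-uncertainty keyframes.

    predict(frame) -> (output, uncertainty); adapt(frame) performs one
    test-time training update. Both are assumed interfaces.
    """
    outputs, keyframes = [], []
    for index, frame in enumerate(frames):
        output, uncertainty = predict(frame)
        if uncertainty > threshold:
            adapt(frame)                  # the costly step, run only when needed
            keyframes.append(index)
            output, _ = predict(frame)    # re-predict with the adapted model
        outputs.append(output)
    return outputs, keyframes


# Toy model: uncertainty is the distance from the domain it last adapted to.
state = {"center": 0.0}


def predict(frame):
    return frame - state["center"], abs(frame - state["center"])


def adapt(frame):
    state["center"] = frame


frames = [0.1, 0.2, 5.0, 5.1, 5.2]       # a domain shift occurs at index 2
outputs, keyframes = run_stream(frames, predict, adapt, threshold=1.0)
```

Raising or lowering `threshold` changes how many frames become keyframes, which corresponds to the user-controlled trade-off between adaptation frequency and latency mentioned in the abstract.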

[CV-43] Lesion-Aware Visual-Language Fusion for Automated Image Captioning of Ulcerative Colitis Endoscopic Examinations MICCAI

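[Quick read]: This paper addresses the lack of clinical semantic guidance and interpretability in caption generation for ulcerative colitis (UC) endoscopic images. The key to the solution is a lesion-aware image captioning framework that integrates ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention, and injects clinical metadata (MES score, vascular pattern, bleeding, erythema, friability, ulceration) as natural-language prompts into a T5 decoder. The framework generates structured descriptions aligned with clinical practice, provides MES classification and lesion tags, and improves caption quality and MES classification accuracy over baselines.

Link: https://arxiv.org/abs/2509.03011
Authors: Alexis Ivan Lopez Escamilla,Gilberto Ochoa,Sharib Al
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: MICCAI DEMI Conference 2025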

Abstract:We present a lesion-aware image captioning framework for ulcerative colitis (UC). The model integrates ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention with a T5 decoder. Clinical metadata (MES score 0-3, vascular pattern, bleeding, erythema, friability, ulceration) is injected as natural-language prompts to guide caption generation. The system produces structured, interpretable descriptions aligned with clinical practice and provides MES classification and lesion tags. Compared with baselines, our approach improves caption quality and MES classification accuracy, supporting reliable endoscopic reporting.

[CV-44] Enhancing Robustness in Post-Processing Watermarking: An Ensemble Attack Network Using CNNs and Transformers

[Quick read]: This paper addresses copyright protection for images produced by generative AI, specifically the limited robustness of post-processing watermarking under a variety of attacks. The core of the solution is to train with an ensemble attack network that combines a CNN-based attack network in the spatial domain with a Transformer-based attack network in the frequency domain, which makes the watermarking model more robust. On the WAVES benchmark the method strengthens baseline watermarking schemes against stress tests such as the Regeneration Attack; for example, it improves the average bit accuracy of StegaStamp by 18.743%.

Link: https://arxiv.org/abs/2509.03006
Authors: Tzuhsuan Huang,Cheng Yu Yeo,Tsai-Ling Huang,Hong-Han Shuai,Wen-Huang Cheng,Jun-Cheng Chen
Affiliations: Academia Sinica; National Yang Ming Chiao Tung University; National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Recent studies on deep watermarking have predominantly focused on in-processing watermarking, which integrates the watermarking process into image generation. However, post-processing watermarking, which embeds watermarks after image generation, offers more flexibility. It can be applied to outputs from any generative model (e.g., GANs, diffusion models) without needing access to the model’s internal structure. It also allows users to embed unique watermarks into individual images. Therefore, this study focuses on post-processing watermarking and enhances its robustness by incorporating an ensemble attack network during training. We construct various versions of attack networks using CNN and Transformer in both spatial and frequency domains to investigate how each combination influences the robustness of the watermarking model. Our results demonstrate that combining a CNN-based attack network in the spatial domain with a Transformer-based attack network in the frequency domain yields the highest robustness in watermarking models. Extensive evaluation on the WAVES benchmark, using average bit accuracy as the metric, demonstrates that our ensemble attack network significantly enhances the robustness of baseline watermarking methods under various stress tests. In particular, for the Regeneration Attack defined in WAVES, our method improves StegaStamp by 18.743%. The code is released at: this https URL.
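Average bit accuracy, the metric named in the abstract, is the fraction of watermark bits recovered correctly, averaged over images. A minimal sketch follows; the function names and the toy bit strings are illustrative, and the WAVES benchmark's own evaluation code may differ in detail.

```python
def bit_accuracy(embedded, decoded):
    """Fraction of positions where the decoded bit matches the embedded bit."""
    if len(embedded) != len(decoded):
        raise ValueError("bit strings must have the same length")
    matches = sum(1 for a, b in zip(embedded, decoded) if a == b)
    return matches / len(embedded)


def average_bit_accuracy(pairs):
    """Mean bit accuracy over (embedded, decoded) pairs, one pair per image."""
    return sum(bit_accuracy(e, d) for e, d in pairs) / len(pairs)


# Two attacked images: one decoded perfectly, one with two flipped bits.
pairs = [
    ([1, 0, 1, 1], [1, 0, 1, 1]),
    ([1, 0, 1, 1], [1, 1, 1, 0]),
]
score = average_bit_accuracy(pairs)
```

A score near 0.5 means the decoder does no better than guessing, so robustness gains are best read relative to that floor rather than to zero.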

[CV-45] SOPSeg: Prompt-based Small Object Instance Segmentation in Remote Sensing Imagery

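[Quick read]: This paper addresses the under-explored problem of small object instance segmentation in remote sensing imagery, which is held back by the high cost of pixel-level annotation and by the blurred boundaries and lost detail that existing models show on small objects. The solution is the SOPSeg framework, with three key elements: (1) a region-adaptive magnification strategy that preserves fine-grained spatial detail; (2) a customized decoder that integrates edge prediction and progressive refinement for more accurate boundaries; and (3) a novel prompting mechanism tailored to the oriented bounding boxes widely used in remote sensing.

Link: https://arxiv.org/abs/2509.03002
Authors: Chenhao Wang,Yingrui Ji,Yu Meng,Yunjian Zhang,Yao Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: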

Abstract:Extracting small objects from remote sensing imagery plays a vital role in various applications, including urban planning, environmental monitoring, and disaster management. While current research primarily focuses on small object detection, instance segmentation for small objects remains underexplored, with no dedicated datasets available. This gap stems from the technical challenges and high costs of pixel-level annotation for small objects. While the Segment Anything Model (SAM) demonstrates impressive zero-shot generalization, its performance on small-object segmentation deteriorates significantly, largely due to the coarse 1/16 feature resolution that causes severe loss of fine spatial details. To this end, we propose SOPSeg, a prompt-based framework specifically designed for small object segmentation in remote sensing imagery. It incorporates a region-adaptive magnification strategy to preserve fine-grained details, and employs a customized decoder that integrates edge prediction and progressive refinement for accurate boundary delineation. Moreover, we introduce a novel prompting mechanism tailored to the oriented bounding boxes widely adopted in remote sensing applications. SOPSeg outperforms existing methods in small object segmentation and facilitates efficient dataset construction for remote sensing tasks. We further construct a comprehensive small object instance segmentation dataset based on SODA-A, and will release both the model and dataset to support future research.

[CV-46] SPENet: Self-guided Prototype Enhancement Network for Few-shot Medical Image Segmentation MICCAI2025

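[Quick read]: This paper addresses a weakness of prototype-based methods for Few-Shot Medical Image Segmentation (FSMIS): they generate a single global prototype and therefore overlook intra-class variations. The key to the solution is the Self-guided Prototype Enhancement Network (SPENet), which has two core modules. The Multi-level Prototype Generation (MPG) module generates a global prototype and an adaptive number of local prototypes at the same time, enabling multi-granularity matching between support and query images. The Query-guided Local Prototype Enhancement (QLPE) module uses the query image to adaptively refine the local prototypes of the support image, which mitigates the negative effect of large discrepancies between support and query images.

Link: https://arxiv.org/abs/2509.02993
Authors: Chao Fan,Xibin Jia,Anqi Xiao,Hongyuan Yu,Zhenghan Yang,Dawei Yang,Hui Xu,Yan Huang,Liang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI2025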

Abstract:Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel classes of medical objects using only a few labeled images. Prototype-based methods have made significant progress in addressing FSMIS. However, they typically generate a single global prototype for the support image to match with the query image, overlooking intra-class variations. To address this issue, we propose a Self-guided Prototype Enhancement Network (SPENet). Specifically, we introduce a Multi-level Prototype Generation (MPG) module, which enables multi-granularity measurement between the support and query images by simultaneously generating a global prototype and an adaptive number of local prototypes. Additionally, we observe that not all local prototypes in the support image are beneficial for matching, especially when there are substantial discrepancies between the support and query images. To alleviate this issue, we propose a Query-guided Local Prototype Enhancement (QLPE) module, which adaptively refines support prototypes by incorporating guidance from the query image, thus mitigating the negative effects of such discrepancies. Extensive experiments on three public medical datasets demonstrate that SPENet outperforms existing state-of-the-art methods, achieving superior performance.
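To make the "single global prototype" baseline that the abstract criticizes concrete, here is a minimal NumPy sketch of masked average pooling followed by cosine-similarity matching. It is not SPENet itself: the function names, the toy two-channel features, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import numpy as np

def masked_average_pooling(features, mask):
    """Average the (C, H, W) features over foreground pixels of a binary (H, W) mask."""
    weights = mask.astype(features.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("support mask has no foreground pixels")
    return (features * weights).sum(axis=(1, 2)) / total

def cosine_similarity_map(features, prototype, eps=1e-8):
    """Cosine similarity between every pixel feature and the prototype, shape (H, W)."""
    num = np.tensordot(prototype, features, axes=([0], [0]))
    denom = np.linalg.norm(features, axis=0) * np.linalg.norm(prototype) + eps
    return num / denom

# Toy support image (hypothetical): foreground pixels carry feature [1, 0],
# background pixels carry feature [0, 1].
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
features = np.stack([mask, 1.0 - mask])            # shape (2, 4, 4)

prototype = masked_average_pooling(features, mask)  # one global prototype
similarity = cosine_similarity_map(features, prototype)
predicted = (similarity > 0.5).astype(int)          # illustrative threshold
```

A single vector like `prototype` averages away any variation inside the foreground region, which is the gap that generating additional local prototypes is meant to close.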

[CV-47] DUViN: Diffusion-Based Underwater Visual Navigation via Knowledge-Transferred Depth Features

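[Quick Read]: This paper addresses the challenges that limited sensing and the difficulty of building accurate maps pose for autonomous underwater navigation, in particular vision-based end-to-end four-degree-of-freedom (4-DoF) motion control in unknown environments. The key to the solution is DUViN, a diffusion-based underwater visual navigation policy whose core innovation is a knowledge-transferred depth feature extractor that keeps the model robust under the domain shift from in-air to underwater settings. Training has two phases: first, the diffusion-based navigation policy is trained on in-air datasets using a pre-trained depth feature extractor; then the extractor is fine-tuned on an underwater depth estimation task and integrated into the policy trained in the first phase, yielding obstacle avoidance and terrain-aware altitude keeping without pre-built maps.

Link: https://arxiv.org/abs/2509.02983
Authors: Jinghe Yang,Minh-Quan Le,Mingming Gong,Ye Pu
Affiliations: The University of Melbourne
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: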

Abstract:Autonomous underwater navigation remains a challenging problem due to limited sensing capabilities and the difficulty of constructing accurate maps in underwater environments. In this paper, we propose a Diffusion-based Underwater Visual Navigation policy via knowledge-transferred depth features, named DUViN, which enables vision-based end-to-end 4-DoF motion control for underwater vehicles in unknown environments. DUViN guides the vehicle to avoid obstacles and maintain a safe, perception-aware altitude relative to the terrain without relying on pre-built maps. To address the difficulty of collecting large-scale underwater navigation datasets, we propose a method that ensures robust generalization under domain shifts from in-air to underwater environments by leveraging depth features and introducing a novel model transfer strategy. Specifically, our training framework consists of two phases. First, we train the diffusion-based visual navigation policy on in-air datasets using a pre-trained depth feature extractor. Second, we retrain the extractor on an underwater depth estimation task and integrate the adapted extractor into the trained navigation policy from the first phase. Experiments in both simulated and real-world underwater environments demonstrate the effectiveness and generalization of our approach. The experimental videos are available at this https URL.

[CV-48] InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System

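[Quick Read]: This paper addresses two challenges in building instance segmentation datasets: annotation is labor-intensive and inefficient, and class distributions are significantly imbalanced. The authors propose InstaDA, a training-free Dual-Agent system in which a Text-Agent (T-Agent) and an Image-Agent (I-Agent) work together. The T-Agent relies on deep collaboration between a large language model (LLM) and a diffusion model, and uses a novel Prompt Rethink mechanism to iteratively refine prompts, improving the quality and diversity of generated images. The I-Agent generates new instances conditioned on the training images to enrich the overall data distribution. Both agents are designed as independent, automated workflows for efficiency and practicality. Experiments on the LVIS 1.0 validation set show significant gains over the baseline, and the method also outperforms the current leading model, DiverGen.

Link: https://arxiv.org/abs/2509.02973
Authors: Xianbao Hou,Yonghao He,Zeyd Boukhers,John See,Hu Su,Wei Sui,Cong Yang
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. University of Cambridge; 4. University of Oxford; 5. Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: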

Abstract:Acquiring high-quality instance segmentation data is challenging due to the labor-intensive nature of the annotation process and significant class imbalances within datasets. Recent studies have utilized the integration of Copy-Paste and diffusion models to create more diverse datasets. However, these studies often lack deep collaboration between large language models (LLMs) and diffusion models, and underutilize the rich information within the existing training data. To address these limitations, we propose InstaDA, a novel, training-free Dual-Agent system designed to augment instance segmentation datasets. First, we introduce a Text-Agent (T-Agent) that enhances data diversity through collaboration between LLMs and diffusion models. This agent features a novel Prompt Rethink mechanism, which iteratively refines prompts based on the generated images. This process not only fosters collaboration but also increases image utilization and optimizes the prompts themselves. Additionally, we present an Image-Agent (I-Agent) aimed at enriching the overall data distribution. This agent augments the training set by generating new instances conditioned on the training images. To ensure practicality and efficiency, both agents operate as independent and automated workflows, enhancing usability. Experiments conducted on the LVIS 1.0 validation set indicate that InstaDA achieves significant improvements, with an increase of +4.0 in box average precision (AP) and +3.3 in mask AP compared to the baseline. Furthermore, it outperforms the leading model, DiverGen, by +0.3 in box AP and +0.1 in mask AP, with a notable +0.7 gain in box AP on common categories and mask AP gains of +0.2 on common categories and +0.5 on frequent categories.

[CV-49] VQualA 2025 Challenge on Engagement Prediction for Short Videos: Methods and Results ICCV2025

[Quick Read]: This paper addresses engagement prediction for user-generated content (UGC) short videos on social media platforms; the core challenge is modeling the complex, multi-factor influences on user interaction. The key elements are a new short-form UGC dataset built from real user interaction data, and encouraging participants to model multi-modal features, including visual content, audio, and creator-provided metadata, thereby promoting robust and generalizable engagement prediction methods.

Link: https://arxiv.org/abs/2509.02969
Authors: Dasong Li,Sizhuo Ma,Hang Hua,Wenjie Li,Jian Wang,Chris Wei Zhou,Fengbin Guan,Xin Li,Zihao Yu,Yiting Lu,Ru-Ling Liao,Yan Ye,Zhibo Chen,Wei Sun,Linhan Cao,Yuqin Cao,Weixia Zhang,Wen Wen,Kaiwei Zhang,Zijian Chen,Fangfang Lu,Xiongkuo Min,Guangtao Zhai,Erjia Xiao,Lingfeng Zhang,Zhenjie Su,Hao Cheng,Yu Liu,Renjing Xu,Long Chen,Xiaoshuai Hao,Zhenpeng Zeng,Jianqin Wu,Xuxu Wang,Qian Yu,Bo Hu,Weiwei Wang,Pinxin Liu,Yunlong Tang,Luchuan Song,Jinxi He,Jiaru Wu,Hanjia Lyu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Comments: ICCV 2025 VQualA workshop EVQA track

Abstract:This paper presents an overview of the VQualA 2025 Challenge on Engagement Prediction for Short Videos, held in conjunction with ICCV 2025. The challenge focuses on understanding and modeling the popularity of user-generated content (UGC) short videos on social media platforms. To support this goal, the challenge uses a new short-form UGC dataset featuring engagement metrics derived from real-world user interactions. The objective of the challenge is to promote robust modeling strategies that capture the complex factors influencing user engagement. Participants explored a variety of multi-modal features, including visual content, audio, and metadata provided by creators. The challenge attracted 97 participants and received 15 valid test submissions, contributing significantly to progress in short-form UGC video engagement prediction.

[CV-50] KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models

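[Quick Read]: This paper addresses the difficulty current vision-language models (VLMs) have in combining scene dynamics with domain knowledge for short-horizon trajectory prediction, which affects the safety and reliability of autonomous driving systems. The proposed KEPT framework rests on three elements: 1) a temporal frequency-spatial fusion (TFSF) video encoder, trained with self-supervised learning and hard-negative mining, that better models the temporal features of driving scenes; 2) a scalable retrieval module based on k-means clustering and an HNSW index that supplies scene-aligned prior exemplars; 3) embedding the retrieved priors into chain-of-thought (CoT) prompts with explicit planning constraints, combined with a three-stage fine-tuning schedule that progressively aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning, achieving accurate trajectory prediction with a low collision rate.

Link: https://arxiv.org/abs/2509.02966
Authors: Yujin Wang,Tianyi Wang,Quanfeng Liu,Wenxian Fan,Junfeng Jiao,Christian Claudel,Yunbing Yan,Bingzhao Gao,Jianqiang Wang,Hong Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: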

Abstract:Accurate short-horizon trajectory prediction is pivotal for safe and reliable autonomous driving, yet existing vision-language models (VLMs) often fail to effectively ground their reasoning in scene dynamics and domain knowledge. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT couples a temporal frequency-spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a scalable k-means + HNSW retrieval stack that supplies scene-aligned exemplars. Retrieved priors are embedded into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning schedule incrementally aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning. Evaluated on nuScenes dataset, KEPT achieves state-of-the-art performance across open-loop protocols: under NoAvg, it achieves 0.70m average L2 with a 0.21% collision rate; under TemAvg with lightweight ego status, it attains 0.31m average L2 and a 0.07% collision rate. Ablation studies show that all three fine-tuning stages contribute complementary benefits, and that using Top-2 retrieved exemplars yields the best accuracy-safety trade-off. The k-means-clustered HNSW index delivers sub-millisecond retrieval latency, supporting practical deployment. These results indicate that retrieval-augmented, CoT-guided VLMs offer a promising, data-efficient pathway toward interpretable and trustworthy autonomous driving.
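The two-stage retrieval idea (coarse clustering, then a nearest-neighbour search inside the selected cluster) can be sketched in plain NumPy. This is a simplified stand-in, not KEPT's implementation: an exact brute-force search replaces the HNSW index, and the function names, embedding size, and cluster count are assumptions for illustration.

```python
import numpy as np

def assign(x, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each row of x."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids and labels computed from the final centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(x, centroids)

def retrieve(query, x, centroids, labels, top_k=2):
    """Pick the nearest cluster, then search it exactly (brute force stands in for HNSW)."""
    cluster = assign(query[None, :], centroids)[0]
    candidates = np.flatnonzero(labels == cluster)
    d = np.linalg.norm(x[candidates] - query, axis=1)
    order = np.argsort(d)[:top_k]
    return candidates[order], d[order]

# Hypothetical database of 200 scene embeddings with 8 dimensions each.
rng = np.random.default_rng(1)
database = rng.normal(size=(200, 8))
centroids, labels = kmeans(database, k=4)
indices, distances = retrieve(database[17], database, centroids, labels, top_k=2)
```

Restricting the search to one cluster trades a small risk of missing a true neighbour near a cluster boundary for a much smaller candidate set; an approximate index such as HNSW would then replace the brute-force step inside each cluster.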

[CV-51] EdgeAttNet: Towards Barb-Aware Filament Segmentation

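[Quick Read]: This paper addresses insufficient segmentation accuracy for solar filaments in H-alpha observations, especially the difficulty of capturing fine-grained structures such as barbs. Existing methods are limited in modeling long-range dependencies and spatial detail, producing blurry segmentations with poorly defined boundaries. The key to the solution is EdgeAttNet, which introduces a learnable edge map derived directly from the input image and embeds it into the self-attention mechanism of a U-Net backbone: the edge information is linearly transformed and used to adjust the attention Key and Query matrices, guiding the model at the bottleneck to attend more precisely to filament boundaries and barbs. This design improves spatial sensitivity and segmentation accuracy while reducing the number of trainable parameters and speeding up inference, making it suitable for deployment in practical solar observation settings.

Link: https://arxiv.org/abs/2509.02964
Authors: Victor Solomon,Piet Martens,Jingyu Liu,Rafal Angryk
Affiliations: Georgia State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Solar and Stellar Astrophysics (astro-ph.SR); Image and Video Processing (eess.IV)
Comments: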

Abstract:Accurate segmentation of solar filaments in H-alpha observations is critical for determining filament chirality, a key factor in the behavior of Coronal Mass Ejections (CMEs). However, existing methods often fail to capture fine-scale filament structures, particularly barbs, due to a limited ability to model long-range dependencies and spatial detail. We propose EdgeAttNet, a segmentation architecture built on a U-Net backbone by introducing a novel, learnable edge map derived directly from the input image. This edge map is incorporated into the model by linearly transforming the attention Key and Query matrices with the edge information, thereby guiding the self-attention mechanism at the network’s bottleneck to more effectively capture filament boundaries and barbs. By explicitly integrating this structural prior into the attention computations, EdgeAttNet enhances spatial sensitivity and segmentation accuracy while reducing the number of trainable parameters. Trained end-to-end, EdgeAttNet outperforms U-Net and other U-Net-based transformer baselines on the MAGFILO dataset. It achieves higher segmentation accuracy and significantly better recognition of filament barbs, with faster inference performance suitable for practical deployment.
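The abstract does not spell out how the edge map enters the Key and Query matrices, so the following NumPy sketch shows only one plausible reading: each token's query and key are shifted by a learned linear map of its edge features before standard scaled dot-product attention. The additive form, the function name, and all weight shapes are assumptions, not the authors' formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def edge_guided_attention(x, edge, w_q, w_k, w_v, w_eq, w_ek):
    """x: (N, D) bottleneck tokens; edge: (N, E) per-token edge features (hypothetical)."""
    q = x @ w_q + edge @ w_eq      # query adjusted by a linear map of edge features
    k = x @ w_k + edge @ w_ek      # key adjusted the same way
    v = x @ w_v                    # values are left untouched in this sketch
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

# Random weights and inputs purely for shape checking.
rng = np.random.default_rng(0)
n_tokens, dim, edge_dim = 16, 8, 4
x = rng.normal(size=(n_tokens, dim))
edge = rng.normal(size=(n_tokens, edge_dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
w_eq, w_ek = (rng.normal(size=(edge_dim, dim)) for _ in range(2))
out, attn = edge_guided_attention(x, edge, w_q, w_k, w_v, w_eq, w_ek)
```

Setting `w_eq` and `w_ek` to zero recovers ordinary self-attention, which makes the edge term easy to ablate in this formulation.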

[CV-52] Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability

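[Quick Read]: This paper studies the modality-missing problem in Multimodal Industrial Surface Defect Detection (MISDD) caused by uncertain sensor availability, aiming to resolve the resulting cross-modal inconsistency and information vacancy. The core solution is a cross-modal prompt learning mechanism comprising: a cross-modal consistency prompt that establishes information consistency between the two visual modalities; a modality-specific prompt that adapts to different input patterns; and a missing-aware prompt that compensates for the information vacancy caused by dynamically missing modalities. In addition, symmetric contrastive learning uses the text modality as a bridge for fusing the two visual modalities: paired antithetical text prompts generate binary text semantics, and tri-modal contrastive pre-training accomplishes multimodal learning.

Link: https://arxiv.org/abs/2509.02962
Authors: Shuai Jiang,Yunfeng Ma,Jingyu Zhou,Yuan Bian,Yaonan Wang,Min Liu
Affiliations: Hunan University; National Engineering Research Center for Robot Visual Perception and Control Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE/ASME Transactions on Mechatronics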

Abstract:Multimodal industrial surface defect detection (MISDD) aims to identify and locate defects in industrial products by fusing RGB and 3D modalities. This article focuses on modality-missing problems caused by uncertain sensor availability in MISDD. In this context, the fusion of multiple modalities encounters several difficulties, including learning mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) a cross-modal consistency prompt that establishes information consistency between the dual visual modalities; ii) a modality-specific prompt that is inserted to adapt to different input patterns; iii) a missing-aware prompt that is attached to compensate for the information vacancy caused by dynamically missing modalities. In addition, we propose symmetric contrastive learning, which utilizes the text modality as a bridge for fusing the dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is offered to accomplish multimodal learning. Experimental results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC with a total missing rate of 0.7 for RGB and 3D modalities (exceeding state-of-the-art methods by 3.84% and 5.58%, respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at this https URL.

[CV-53] STAR: A Fast and Robust Rigid Registration Framework for Serial Histopathological Images

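[Quick Read]: This paper addresses fast, robust rigid registration of serial-section whole-slide histopathological images (WSIs) across staining types (such as HE, special stains, and immunohistochemistry, IHC), in support of AI workflows such as virtual staining and biomarker prediction. Existing methods mostly rely on deformable or deep learning models that are computationally heavy and hard to reproduce, while lightweight rigid frameworks remain immature in practice. The key to the solution is STAR (Serial Tissue Alignment for Rigid registration), which combines stain-conditioned preprocessing, a hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control. It achieves stable, efficient rigid alignment across tissue types and staining protocols in minutes per slide, improving reproducibility and clinical practicality.

Link: https://arxiv.org/abs/2509.02952
Authors: Zeyu Liu,Shengwei Ding
Affiliations: Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL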

Abstract:Registration of serial whole-slide histopathological images (WSIs) is critical for enabling direct comparison across diverse stains and for preparing paired datasets in artificial intelligence (AI) workflows such as virtual staining and biomarker prediction. While existing methods often rely on complex deformable or deep learning approaches that are computationally intensive and difficult to reproduce, lightweight rigid frameworks, which are sufficient for many consecutive-section scenarios, remain underdeveloped. We introduce STAR (Serial Tissue Alignment for Rigid registration), a fast and robust open-source framework for multi-WSI alignment. STAR integrates stain-conditioned preprocessing with a hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control, achieving reliable rigid registration across heterogeneous tissue types and staining protocols, including hematoxylin-eosin (HE), special histochemical stains (e.g., PAS, PASM, Masson’s), and immunohistochemical (IHC) markers (e.g., CD31, KI67). Evaluated on the ANHIR 2019 and ACROBAT 2022 datasets spanning multiple organs and scanning conditions, STAR consistently produced stable alignments within minutes per slide, demonstrating robustness to cross-stain variability and partial tissue overlap. Beyond benchmarks, we present case studies on HE-IHC alignment, construction of multi-IHC panels, and typical failure modes, underscoring both utility and limitations. Released as an open and lightweight tool, STAR provides a reproducible baseline that lowers the barrier for clinical adoption and enables large-scale paired data preparation for next-generation computational pathology.
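As a rough illustration of correlation-based alignment, the sketch below estimates a pure integer translation between two images with FFT phase correlation. It is deliberately narrower than STAR: it ignores rotation, stain-conditioned preprocessing, and kernel scaling, and the function name and test images are assumptions made for this example.

```python
import numpy as np

def phase_correlation_shift(fixed, moving, eps=1e-12):
    """Return integer (dy, dx) such that np.roll(moving, (dy, dx), axis=(0, 1)) matches fixed."""
    f = np.fft.fft2(fixed)
    m = np.fft.fft2(moving)
    cross = f * np.conj(m)
    cross /= np.abs(cross) + eps           # keep only the phase difference
    corr = np.fft.ifft2(cross).real        # peaks at the circular shift
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peak coordinates in [0, size) to signed shifts around zero.
    shift = [p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape)]
    return tuple(int(v) for v in shift)

# Synthetic check: shift a random image circularly and recover the offset.
rng = np.random.default_rng(0)
fixed = rng.random((64, 64))
moving = np.roll(fixed, (-5, 7), axis=(0, 1))
dy, dx = phase_correlation_shift(fixed, moving)
aligned = np.roll(moving, (dy, dx), axis=(0, 1))
```

A coarse-to-fine variant would typically run the same estimate on heavily downsampled images first and then refine the offset at higher resolution within a small search window, which keeps the cost manageable for whole-slide images.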

[CV-54] A Data-Driven RetinaNet Model for Small Object Detection in Aerial Images

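[Quick Read]: This paper addresses insufficient accuracy in detecting small objects in aerial imagery, which is critical for environmental monitoring, urban design, and crisis management. The core of the solution is DDR-Net, a data-driven deep learning model that introduces new techniques for autonomously determining optimal feature maps and anchor estimates, producing a tailored and efficient training process while maintaining accuracy. It also introduces a novel sampling strategy that improves performance when training data is limited. Experiments show that DDR-Net markedly outperforms RetinaNet and other mainstream models on several aerial bird-image datasets and reduces data collection and training costs, giving it broad application potential.

Link: https://arxiv.org/abs/2509.02928
Authors: Zhicheng Tang,Jinwen Tang,Yi Shang
Affiliations: University of Missouri
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: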

Abstract:In the realm of aerial imaging, the ability to detect small objects is pivotal for a myriad of applications, encompassing environmental surveillance, urban design, and crisis management. Leveraging RetinaNet, this work unveils DDR-Net: a data-driven, deep-learning model devised to enhance the detection of diminutive objects. DDR-Net introduces novel, data-driven techniques to autonomously ascertain optimal feature maps and anchor estimations, cultivating a tailored and proficient training process while maintaining precision. Additionally, this paper presents an innovative sampling technique to bolster model efficacy under limited data training constraints. The model’s enhanced detection capabilities support critical applications including wildlife and habitat monitoring, traffic flow optimization, and public safety improvements through accurate identification of small objects like vehicles and pedestrians. DDR-Net significantly reduces the cost and time required for data collection and training, offering efficient performance even with limited data. Empirical assessments over assorted aerial avian imagery datasets demonstrate that DDR-Net markedly surpasses RetinaNet and alternative contemporary models. These innovations advance current aerial image analysis technologies and promise wide-ranging impacts across multiple sectors including agriculture, security, and archaeology.

[CV-55] Single Domain Generalization in Diabetic Retinopathy: A Neuro-Symbolic Learning Approach

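[Quick Read]: This paper addresses domain generalization (DG) in medical imaging, where a model trained on a single data source degrades significantly under real-world distribution shift. The key to the solution is KG-DG, a neuro-symbolic framework that combines a Vision Transformer (ViT) with expert-guided symbolic reasoning: it builds structured, rule-based features from a clinical lesion ontology, incorporates retinal vessel segmentation, and fuses these with deep visual representations through confidence weighting. By minimizing the KL divergence between domain embeddings, the method enforces alignment of high-level clinical semantics and achieves cross-domain robustness. It yields significant gains in both single-domain generalization (SDG) and multi-domain generalization (MDG), indicating that the symbolic components not only improve interpretability but also act as effective regularizers.

Link: https://arxiv.org/abs/2509.02918
Authors: Midhat Urooj,Ayan Banerjee,Farhat Shaikh,Kuntal Thakur,Sandeep Gupta
Affiliations: Arizona State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in ANSyA 2025: 1st International Workshop on Advanced Neuro-Symbolic Applications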

Abstract:Domain generalization remains a critical challenge in medical imaging, where models trained on single sources often fail under real-world distribution shifts. We propose KG-DG, a neuro-symbolic framework for diabetic retinopathy (DR) classification that integrates vision transformers with expert-guided symbolic reasoning to enable robust generalization across unseen domains. Our approach leverages clinical lesion ontologies through structured, rule-based features and retinal vessel segmentation, fusing them with deep visual representations via a confidence-weighted integration strategy. The framework addresses both single-domain generalization (SDG) and multi-domain generalization (MDG) by minimizing the KL divergence between domain embeddings, thereby enforcing alignment of high-level clinical semantics. Extensive experiments across four public datasets (APTOS, EyePACS, Messidor-1, Messidor-2) demonstrate significant improvements: up to a 5.2% accuracy gain in cross-domain settings and a 6% improvement over baseline ViT models. Notably, our symbolic-only model achieves a 63.67% average accuracy in MDG, while the complete neuro-symbolic integration achieves the highest accuracy compared to existing published baselines and benchmarks in challenging SDG scenarios. Ablation studies reveal that lesion-based features (84.65% accuracy) substantially outperform purely neural approaches, confirming that symbolic components act as effective regularizers beyond merely enhancing interpretability. Our findings establish neuro-symbolic integration as a promising paradigm for building clinically robust and domain-invariant medical AI systems.
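The abstract does not say how the KL divergence between domain embeddings is computed. One common way to make such a term concrete, shown here purely as an assumption, is to summarize each domain's embeddings by a diagonal Gaussian and use the closed-form KL divergence between the two Gaussians. The function name and the synthetic embeddings are hypothetical.

```python
import numpy as np

def gaussian_kl(emb_p, emb_q, eps=1e-6):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) with diagonal covariances fit to (N, D) embeddings."""
    mu_p, mu_q = emb_p.mean(axis=0), emb_q.mean(axis=0)
    var_p = emb_p.var(axis=0) + eps     # eps guards against zero variance
    var_q = emb_q.var(axis=0) + eps
    per_dim = np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    return 0.5 * per_dim.sum()

# Synthetic embeddings for two hypothetical domains.
rng = np.random.default_rng(0)
domain_a = rng.normal(size=(500, 16))
domain_b = domain_a + 1.0               # same spread, shifted mean

same = gaussian_kl(domain_a, domain_a)      # identical statistics
shifted = gaussian_kl(domain_a, domain_b)   # mean shift increases the divergence
```

Used as a training penalty, a term of this kind pulls the per-domain embedding statistics toward each other; note that KL is asymmetric, so the direction (or a symmetrized sum) is a design choice.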

[CV-56] High-Fidelity Digital Twins for Bridging the Sim2Real Gap in LiDAR-Based ITS Perception

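[Quick Read]: This paper addresses the performance drop of LiDAR perception models under simulation-to-reality (Sim2Real) domain transfer caused by distribution differences: perception models trained in simulation perform poorly on real-world data. The key to the solution is a High-Fidelity Digital Twin (HiFi DT) framework that incorporates real-world background geometry, lane-level road topology, and sensor specifications and placement, producing a simulation environment closer to the real scene and generating "in-domain" synthetic data for training. Experiments show that a 3D object detector trained on HiFi DT data outperforms the same model trained directly on real data by 4.8% when evaluated on real data. Quantification with Chamfer Distance (CD), Maximum Mean Discrepancy (MMD), Earth Mover's Distance (EMD), and Fréchet Distance (FD) shows that the method substantially reduces the distribution shift between source and target domains and improves generalization.

Link: https://arxiv.org/abs/2509.02904
Authors: Muhammad Shahbaz,Shaurya Agarwal
Affiliations: University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: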

Abstract:Sim2Real domain transfer offers a cost-effective and scalable approach for developing LiDAR-based perception (e.g., object detection, tracking, segmentation) in Intelligent Transportation Systems (ITS). However, perception models trained in simulation often underperform on real-world data due to distributional shifts. To address this Sim2Real gap, this paper proposes a high-fidelity digital twin (HiFi DT) framework that incorporates real-world background geometry, lane-level road topology, and sensor-specific specifications and placement. We formalize the domain adaptation challenge underlying Sim2Real learning and present a systematic method for constructing simulation environments that yield in-domain synthetic data. An off-the-shelf 3D object detector is trained on HiFi DT-generated synthetic data and evaluated on real data. Our experiments show that the DT-trained model outperforms the equivalent model trained on real data by 4.8%. To understand this gain, we quantify distributional alignment between synthetic and real data using multiple metrics, including Chamfer Distance (CD), Maximum Mean Discrepancy (MMD), Earth Mover’s Distance (EMD), and Fréchet Distance (FD), at both raw-input and latent-feature levels. Results demonstrate that HiFi DTs substantially reduce domain shift and improve generalization across diverse evaluation scenarios. These findings underscore the significant role of digital twins in enabling reliable, simulation-based LiDAR perception for real-world ITS applications.
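Two of the alignment metrics named above are short enough to sketch in NumPy. The conventions here are assumptions, since the abstract does not state which variants were used: the Chamfer Distance sums the two directed mean nearest-neighbour distances (unsquared), and the MMD is the biased squared estimate with an RBF kernel of bandwidth `sigma`.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)   # pairwise distances (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def mmd_rbf(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y with an RBF kernel."""
    def kernel(p, q):
        sq = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Hypothetical "real" and "synthetic" point clouds: the synthetic one is offset.
rng = np.random.default_rng(0)
real = rng.normal(size=(128, 3))
synthetic = real + 0.5

cd_same = chamfer_distance(real, real)
cd_shifted = chamfer_distance(real, synthetic)
mmd_same = mmd_rbf(real, real)
mmd_shifted = mmd_rbf(real, synthetic)
```

Both functions build a full pairwise matrix, so they are only practical for modest point counts; real LiDAR sweeps would normally be subsampled or handled with a spatial index first.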

[CV-57] PercepTwin: Modeling High-Fidelity Digital Twins for Sim2Real LiDAR-based Perception for Intelligent Transportation Systems

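[Quick Read]: This paper addresses the fact that LiDAR-based perception tasks in intelligent transportation systems (ITS), such as object detection, tracking, and semantic and instance segmentation, depend on large-scale labeled datasets whose collection is costly, time-consuming, and labor-intensive, limiting scalability. The key to the solution is a rigorous, reproducible workflow that uses High-Fidelity Digital Twins (HiFi DTs) to build large-scale, high-quality synthetic datasets, covering static geometry modeling, road infrastructure replication, and dynamic traffic scenario generation. Combined with open-source resources (such as satellite imagery and OpenStreetMap data) and specific sensor configurations, it provides an efficient, diverse, and reliable foundation for Sim2Real learning.

Link: https://arxiv.org/abs/2509.02903
Authors: Muhammad Shahbaz,Shaurya Agarwal
Affiliations: University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: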

Abstract:LiDAR-based perception in intelligent transportation systems (ITS), for tasks such as object detection, tracking, and semantic and instance segmentation, is predominantly solved by deep neural network models, which often require large-scale labeled datasets during training to achieve generalization. However, creating these datasets is costly, time-consuming, and requires human labor before the datasets are ready for training models. This hinders the scalability of LiDAR-based perception systems in ITS. Sim2Real learning offers a scalable alternative; however, its effectiveness depends on the fidelity of the source simulation(s) to the real world in terms of environment structure, actor dynamics, and sensor emulation. In response, this paper introduces a rigorous and reproducible methodology for creating large-scale, high-quality synthetic datasets using High-Fidelity Digital Twins (HiFi DTs). The proposed workflow outlines the steps, tools, and best practices for digitally replicating real-world environments, encompassing static geometry modeling, road infrastructure replication, and dynamic traffic scenario generation. Leveraging open-source and readily available resources such as satellite imagery and OpenStreetMap data, alongside specific sensor configurations, this paper provides practical, detailed guidance for constructing robust synthetic environments. These environments subsequently facilitate scalable, cost-effective, and diverse dataset generation, forming a reliable foundation for robust Sim2Real learning.

[CV-58] LiGuard: A Streamlined Open-Source Framework for Rapid Interactive Lidar Research

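[Quick Read]: This paper addresses low research efficiency with LiDAR data in autonomous driving and Intelligent Transportation Systems (ITS) research, caused by duplicated code development, poor modularity, and difficulty adjusting algorithms. The solution is LiGuard, an open-source software framework with built-in support for data input/output (I/O), pre/post-processing, and commonly used algorithms, so researchers can build project code quickly. It supports interactively adding, removing, or reordering custom algorithms and adjusting their parameters, and visualizes results for classification, detection, segmentation, and tracking tasks. Its structured directory organization makes it easy to share and reuse whole projects or individual components, improving collaboration and code maintainability.

Link: https://arxiv.org/abs/2509.02902
Authors: Muhammad Shahbaz,Shaurya Agarwal
Affiliations: University of Central Florida
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: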

Abstract:There is growing interest in the development of lidar-based autonomous mobility and Intelligent Transportation Systems (ITS). To work with lidar data, researchers often develop code specific to their application niche. This approach leads to duplication of effort across studies that, in many cases, share multiple methodological steps such as data input/output (I/O), pre/post processing, and common algorithms in multi-stage solutions. Moreover, slight changes in data, algorithms, and/or research focus may force major revisions in the code. To address these challenges, we present LiGuard, an open-source software framework that allows researchers to: 1) rapidly develop code for their lidar-based projects by providing built-in support for data I/O, pre/post processing, and commonly used algorithms, 2) interactively add/remove/reorder custom algorithms and adjust their parameters, and 3) visualize results for classification, detection, segmentation, and tracking tasks. Moreover, because it creates all the code files in structured directories, it allows easy sharing of entire projects, or even of individual components, for reuse by other researchers. The effectiveness of LiGuard is demonstrated via case studies.

[CV-59] PRECISE-AS: Personalized Reinforcement Learning for Efficient Point-of-Care Echocardiography in Aortic Stenosis Diagnosis MICCAI2025

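[Quick Read]: This paper addresses inefficiency in acquiring echocardiography (echo) videos, especially in resource-limited regions: how to reduce unnecessary image acquisition while preserving diagnostic accuracy. The key solution is a reinforcement learning (RL)-based active video acquisition framework that dynamically decides whether further imaging is needed and selects the most informative views, achieving 80.6% aortic stenosis (AS) classification accuracy while using only 47% of the videos and improving diagnostic efficiency and personalization.

Link: https://arxiv.org/abs/2509.02898
Authors: Armin Saadat,Nima Hashemi,Hooman Vaseli,Michael Y. Tsang,Christina Luong,Michiel Van de Panne,Teresa S. M. Tsang,Purang Abolmaesumi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in MICCAI 2025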

Abstract:Aortic stenosis (AS) is a life-threatening condition caused by a narrowing of the aortic valve, leading to impaired blood flow. Despite its high prevalence, access to echocardiography (echo), the gold-standard diagnostic tool, is often limited due to resource constraints, particularly in rural and underserved areas. Point-of-care ultrasound (POCUS) offers a more accessible alternative but is restricted by operator expertise and the challenge of selecting the most relevant imaging views. To address this, we propose a reinforcement learning (RL)-driven active video acquisition framework that dynamically selects each patient’s most informative echo videos. Unlike traditional methods that rely on a fixed set of videos, our approach continuously evaluates whether additional imaging is needed, optimizing both accuracy and efficiency. Tested on data from 2,572 patients, our method achieves 80.6% classification accuracy while using only 47% of the echo videos compared to a full acquisition. These results demonstrate the potential of active feature acquisition to enhance AS diagnosis, making echocardiographic assessments more efficient, scalable, and personalized. Our source code is available at: this https URL.

[CV-60] Multi-Scale Deep Learning for Colon Histopathology: A Hybrid Graph-Transformer Approach

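[Quick Read]: This paper addresses insufficient image classification accuracy in the early diagnosis of colon cancer (colorectal cancer), in particular the challenges of multi-scale feature extraction and preserving spatial structure in histopathological images. The key to the solution is HG-TNet, a hybrid multi-scale deep learning architecture that combines capsule networks, graph attention mechanisms, transformer modules, and residual learning. The model has two branches: a global-context branch consisting of convolution-based patch embedding followed by a transformer encoder, and a CNN branch that extracts local detail. It also adds a self-supervised rotation prediction task to strengthen feature representations and uses capsule networks to preserve spatial hierarchies, aiming at more accurate colon cancer classification.

Link: https://arxiv.org/abs/2509.02851
Authors: Sadra Saremi,Amirhossein Ahmadkhan Kordbacheh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: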

Abstract:Colon cancer, also known as colorectal cancer, is one of the most malignant types of cancer worldwide. Early-stage detection of colon cancer is crucial to prevent its progression. This research presents a hybrid multi-scale deep learning architecture that combines capsule networks, graph attention mechanisms, transformer modules, and residual learning to advance colon cancer classification on the Lung and Colon Cancer Histopathological Image Dataset (LC25000). The proposed HG-TNet model is a hybrid architecture that joins the strengths of transformers and convolutional neural networks to capture multi-scale features in histopathological images. Mainly, a transformer branch extracts global contextual relations by partitioning the image into patches through convolution-based patch embedding and then processing these patches with a transformer encoder. Analogously, a dedicated CNN branch captures fine-grained, local details through successive [...]. Incorporating these diverse features, combined with a self-supervised rotation prediction objective, produces a robust diagnostic representation that surpasses standard architectures in performance. Results show improvements not only in accuracy and loss but also in how the model, by using capsule networks, preserves spatial ordering and captures how individual elements combine to form whole structures.

[CV-61] PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding? NEURIPS2025

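Quick read: This paper examines whether video multi-modal large language models (video MLLMs) make use of motion in pixel-level visual grounding, and specifically whether they can segment objects from natural-language expressions that describe motion patterns. Existing benchmarks are flawed because a single frame often suffices to resolve a motion referring expression without any temporal reasoning, so models lean on static appearance cues rather than genuine motion understanding. The key to the solution is four motion-centric probing techniques and a motion-centric benchmark, MoCentric-Bench, which tests whether models can tell true motion from fake motion and grasp motion order, so that evaluation rewards the interaction between motion and language. The author also explores simple motion-centric adaptation techniques that reach state-of-the-art performance on the new benchmark, pointing toward better dense spatiotemporal grounding and pixel-level understanding in video.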

Link: https://arxiv.org/abs/2509.02807
Authors: Mennatullah Siam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work under review in NeurIPS 2025 with the title “Are we using Motion in Referring Segmentation? A Motion-Centric Evaluation”

Abstract:Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify the shortcomings in the current benchmarks, where we show that a single frame can often suffice for capturing the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, particularly designed for the visual grounding task, to study video MLLMs’ ability to identify true motion from a fake one and their ability to grasp the motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench. It ensures that video MLLMs are evaluated towards leveraging the interaction between motion and language rather than being dominated by static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that provide state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding within videos. Code and datasets will be made publicly available.

[CV-62] 2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model CVPR2024

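Quick read: This paper asks whether powerful large language models, especially multi-modal Vision Language Models (VLMs), can improve end-to-end autonomous driving. The key to its solution is combining an end-to-end architectural design with a knowledgeable VLM. Using only a single camera as input, the method is the best camera-only entry on the leaderboard, which demonstrates the effectiveness of vision-based driving and its potential for end-to-end driving tasks.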

Link: https://arxiv.org/abs/2509.02659
Authors: Zilong Guo, Yi Luo, Long Sha, Dongxu Wang, Panqu Wang, Chenyang Xu, Yi Yang
Affiliations: ZERON (Shanghai, China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 2nd place in CVPR 2024 End-to-End Driving at Scale Challenge

Abstract:End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether using powerful large language models (LLMs), especially multi-modality Vision Language Models (VLMs), could benefit end-to-end driving tasks remains a question. In our work, we demonstrate that combining end-to-end architectural design and knowledgeable VLMs yields impressive performance on the driving tasks. It is worth noting that our method only uses a single camera and is the best camera-only solution across the leaderboard, demonstrating the effectiveness of the vision-based driving approach and its potential for end-to-end driving tasks.

[CV-63] Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

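Quick read: This paper addresses the failure of traditional rule-based methods for robotic manipulation to scale or generalize in unstructured, novel environments. Its key contribution is a systematic review of Vision-Language-Action (VLA) models built on large Vision-Language Models (VLMs), organized by an architecture-oriented taxonomy that distinguishes two main paradigms, monolithic and hierarchical. It further analyzes how these models integrate with reinforcement learning, training-free optimization, learning from human videos, and world models, along with their characteristics and strengths, filling a gap at the intersection of large VLMs and robotic manipulation.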

Link: https://arxiv.org/abs/2508.13073
Authors: Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, Liqiang Nie
Affiliations: Harbin Institute of Technology (Shenzhen)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress.

[CV-64] Generalist versus Specialist Vision Foundation Models for Ocular Disease and Oculomics

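Quick read: This paper examines the trade-off between generalist and specialist foundation models in medicine. It asks whether retinal imaging applications still need domain-specific pre-trained models such as RETFound, or whether larger generalist foundation models such as DINOv2 and DINOv3 can perform better. The key to the study is a systematic evaluation of how well DINOv2 and DINOv3 adapt to retinal image tasks, compared with two specialist models (RETFound-MAE and RETFound-DINOv2) on ocular disease detection and systemic disease prediction, together with an analysis of data efficiency and computational cost under two adaptation strategies, fine-tuning and linear probing. The results show that although generalist models adapt well across tasks, RETFound-DINOv2 consistently outperforms them on ocular disease detection and oculomics tasks, with stronger generalisability and data efficiency. This suggests that specialist retinal foundation models remain the most effective choice for clinical applications, while the narrowing gap indicates that continued data and model scaling can make generalist models strong foundations for future medical foundation models.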

Link: https://arxiv.org/abs/2509.03421
Authors: Yukun Zhou, Paul Nderitu, Jocelyn Hui Lin Goh, Justin Engelmann, Siegfried K. Wagner, Anran Ran, Hongyang Jiang, Lie Ju, Ke Zou, Sahana Srinivasan, Hyunmin Kim, Takahiro Ninomiya, Zheyuan Wang, Gabriel Dawei Yang, Eden Ruffell, Dominic Williamson, Rui Santos, Gabor Mark Somfai, Carol Y. Cheung, Tien Yin Wong, Daniel C. Alexander, Yih Chung Tham, Pearse A. Keane
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 39 pages, 8 Figures

Abstract:Medical foundation models, pre-trained with large-scale clinical data, demonstrate strong performance in diverse clinically relevant applications. RETFound, trained on nearly one million retinal images, exemplifies this approach in applications with retinal images. However, the emergence of increasingly powerful and multifold larger generalist foundation models such as DINOv2 and DINOv3 raises the question of whether domain-specific pre-training remains essential, and if so, what gap persists. To investigate this, we systematically evaluated the adaptability of DINOv2 and DINOv3 in retinal image applications, compared to two specialist RETFound models, RETFound-MAE and RETFound-DINOv2. We assessed performance on ocular disease detection and systemic disease prediction using two adaptation strategies: fine-tuning and linear probing. Data efficiency and adaptation efficiency were further analysed to characterise trade-offs between predictive performance and computational cost. Our results show that although scaling generalist models yields strong adaptability across diverse tasks, RETFound-DINOv2 consistently outperforms these generalist foundation models in ocular-disease detection and oculomics tasks, demonstrating stronger generalisability and data efficiency. These findings suggest that specialist retinal foundation models remain the most effective choice for clinical applications, while the narrowing gap with generalist foundation models suggests that continued data and model scaling can deliver domain-relevant gains and position them as strong foundations for future medical foundation models.

[CV-65] Prompt-Guided Patch UNet-VAE with Adversarial Supervision for Adrenal Gland Segmentation in Computed Tomography Medical Images

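Quick read: This paper tackles the segmentation of small abdominal organs such as the adrenal glands in CT images, where the main challenges are severe class imbalance, poor spatial context, and limited annotated data. The key to its solution is a unified framework that combines variational reconstruction, supervised segmentation, and adversarial patch-based feedback. A VAE-UNet backbone jointly reconstructs input patches and generates voxel-level segmentation masks, which lets the model learn disentangled representations of anatomical structure and appearance. The training pipeline injects synthetic patches generated from the learned latent space, and a VGG-based perceptual loss plus a PatchGAN-style discriminator improve output fidelity and boundary accuracy. Experiments on the BTCV dataset show improved segmentation accuracy while maintaining strong reconstruction quality.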

Link: https://arxiv.org/abs/2509.03188
Authors: Hania Ghouse, Muzammil Behzad
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Segmentation of small and irregularly shaped abdominal organs, such as the adrenal glands in CT imaging, remains a persistent challenge due to severe class imbalance, poor spatial context, and limited annotated data. In this work, we propose a unified framework that combines variational reconstruction, supervised segmentation, and adversarial patch-based feedback to address these limitations in a principled and scalable manner. Our architecture is built upon a VAE-UNet backbone that jointly reconstructs input patches and generates voxel-level segmentation masks, allowing the model to learn disentangled representations of anatomical structure and appearance. We introduce a patch-based training pipeline that selectively injects synthetic patches generated from the learned latent space, and systematically study the effects of varying synthetic-to-real patch ratios during training. To further enhance output fidelity, the framework incorporates perceptual reconstruction loss using VGG features, as well as a PatchGAN-style discriminator for adversarial supervision over spatial realism. Comprehensive experiments on the BTCV dataset demonstrate that our approach improves segmentation accuracy, particularly in boundary-sensitive regions, while maintaining strong reconstruction quality. Our findings highlight the effectiveness of hybrid generative-discriminative training regimes for small-organ segmentation and provide new insights into balancing realism, diversity, and anatomical consistency in data-scarce scenarios.

[CV-66] Deep Self-knowledge Distillation: A hierarchical supervised learning for coronary artery segmentation

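Quick read: This paper addresses the poor performance and limited generalizability of existing coronary artery segmentation methods, whether rule-based or deep learning models, on X-ray angiography images. It also notes that current knowledge distillation methods do not fully exploit the model's hierarchical knowledge, which wastes information and limits performance gains. The key to its solution is Deep Self-knowledge Distillation, which uses hierarchical outputs for supervision. By combining a Deep Distribution Loss (distribution level) with a Pixel-wise Self-knowledge Distillation Loss (pixel level), the method provides dual regularization while transferring knowledge from the teacher model to the student model, improving segmentation accuracy, generalization, and robustness.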

Link: https://arxiv.org/abs/2509.03173
Authors: Mingfeng Lin
Affiliations: Harbin Institute of Technology, Shenzhen, China; Xiamen University, Xiamen, China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Coronary artery disease is a leading cause of mortality, underscoring the critical importance of precise diagnosis through X-ray angiography. Manual coronary artery segmentation from these images is time-consuming and inefficient, prompting the development of automated models. However, existing methods, whether rule-based or deep learning models, struggle with issues like poor performance and limited generalizability. Moreover, current knowledge distillation methods applied in this field have not fully exploited the hierarchical knowledge of the model, leading to certain information waste and insufficient enhancement of the model’s performance capabilities for segmentation tasks. To address these issues, this paper introduces Deep Self-knowledge Distillation, a novel approach for coronary artery segmentation that leverages hierarchical outputs for supervision. By combining Deep Distribution Loss and Pixel-wise Self-knowledge Distillation Loss, our method enhances the student model’s segmentation performance through a hierarchical learning strategy, effectively transferring knowledge from the teacher model. Our method combines a loosely constrained probabilistic distribution vector with tightly constrained pixel-wise supervision, providing dual regularization for the segmentation model while also enhancing its generalization and robustness. Extensive experiments on the XCAD and DCA1 datasets demonstrate that our approach outperforms other models in Dice coefficient, accuracy, sensitivity, and IoU in comparative evaluations.
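
The abstract does not spell out either loss term, so the following is only a generic sketch of pixel-wise distillation for binary segmentation, not the paper's Pixel-wise Self-knowledge Distillation Loss. It assumes that teacher and student each output a foreground probability per pixel and measures their disagreement with a per-pixel binary KL divergence. The function name `pixelwise_kd_loss` and the `eps` clamp are hypothetical.

```python
from math import log
from typing import Sequence


def pixelwise_kd_loss(teacher: Sequence[float], student: Sequence[float],
                      eps: float = 1e-7) -> float:
    """Mean binary KL divergence KL(teacher || student) over flattened pixels."""
    if len(teacher) != len(student) or not teacher:
        raise ValueError("teacher and student must be non-empty and equal length")
    total = 0.0
    for t, s in zip(teacher, student):
        # Clamp probabilities away from 0 and 1 so the logarithms stay finite.
        t = min(max(t, eps), 1.0 - eps)
        s = min(max(s, eps), 1.0 - eps)
        total += t * log(t / s) + (1.0 - t) * log((1.0 - t) / (1.0 - s))
    return total / len(teacher)


if __name__ == "__main__":
    # Identical maps give zero loss; a less confident student is penalised.
    print(pixelwise_kd_loss([0.9, 0.1], [0.9, 0.1]))
    print(pixelwise_kd_loss([0.9, 0.1], [0.5, 0.5]))
```

A term like this would usually be added to the ordinary supervised segmentation loss with a weighting factor chosen on validation data.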

[CV-67] Ensemble YOLO Framework for Multi-Domain Mitotic Figure Detection in Histopathology Images

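Quick read: This paper addresses mitotic figure detection in whole slide histopathology images, which is difficult because mitotic figures are scarce and morphologically heterogeneous, and because tissue preparation and staining protocols introduce variability. To improve generalization, the authors train two modern one-stage detectors, YOLOv5 and YOLOv8, on the MIDOG++, CMC, and CCMCT datasets, with stain-invariant color perturbations and texture-preserving augmentations for robustness. The key to the solution is exploiting the complementary strengths of YOLOv5 (higher precision) and YOLOv8 (higher recall): an ensemble of the two improves sensitivity without a major reduction in precision, showing that ensembles of modern object detectors are effective for automated mitosis detection in digital pathology.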

Link: https://arxiv.org/abs/2509.02957
Authors: Navya Sri Kelam, Akash Parekh, Saikiran Bonthu, Nitin Singhal
Affiliations: AIRA MATRIX Private Limited
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 3 pages, MIDOG25 Challenge

Abstract:Accurate detection of mitotic figures in whole slide histopathological images remains a challenging task due to their scarcity, morphological heterogeneity, and the variability introduced by tissue preparation and staining protocols. The MIDOG competition series provides standardized benchmarks for evaluating detection approaches across diverse domains, thus motivating the development of generalizable deep learning models. In this work, we investigate the performance of two modern one-stage detectors, YOLOv5 and YOLOv8, trained on MIDOG++, CMC, and CCMCT datasets. To enhance robustness, training incorporated stain-invariant color perturbations and texture-preserving augmentations. In internal validation, YOLOv5 achieved superior precision, while YOLOv8 provided improved recall, reflecting architectural trade-offs between anchor-based and anchor-free detection. To capitalize on these complementary strengths, we employed an ensemble of the two models, which improved sensitivity without a major reduction in precision. These findings highlight the effectiveness of ensemble strategies built upon contemporary object detectors to advance automated mitosis detection in digital pathology.
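
The abstract does not say how the two detectors' outputs are fused, so the sketch below shows one simple possibility rather than the authors' method: take the union of both detection lists and fuse pairs whose centres lie within a small radius. The point-style `Detection` tuple, the function name `merge_detections`, and the 15-pixel radius are all assumptions made for illustration.

```python
from math import hypot
from typing import List, Tuple

Detection = Tuple[float, float, float]  # (x, y, confidence) of a candidate centre


def merge_detections(a: List[Detection], b: List[Detection],
                     radius: float = 15.0) -> List[Detection]:
    """Union of two detection lists; pairs closer than `radius` are fused."""
    merged: List[Detection] = []
    used_b = set()
    for ax, ay, ac in a:
        match = None
        for j, (bx, by, _) in enumerate(b):
            if j not in used_b and hypot(ax - bx, ay - by) <= radius:
                match = j
                break
        if match is None:
            merged.append((ax, ay, ac))
        else:
            bx, by, bc = b[match]
            used_b.add(match)
            # Fuse the pair: average the position, keep the higher confidence.
            merged.append(((ax + bx) / 2, (ay + by) / 2, max(ac, bc)))
    # Detections found only by the second model are kept, which raises recall.
    merged.extend(d for j, d in enumerate(b) if j not in used_b)
    return merged
```

Keeping detections that only one model finds favours sensitivity; a stricter rule, such as requiring agreement below some confidence, would trade recall back for precision.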

[CV-68] Toward a robust lesion detection model in breast DCE-MRI: adapting foundation models to high-risk women

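Quick read: This paper aims to improve the accuracy of lesion detection in breast MRI, especially for early cancer diagnosis in high-risk populations. The central challenge is that clinical data are imbalanced and heterogeneous, and conventional methods struggle to deliver performance while remaining interpretable. The key to its solution is a classification pipeline that combines a pretrained foundation model with an advanced classification strategy: the Medical Slice Transformer (MST), which relies on DINOv2-based self-supervised pretraining, produces robust per-slice feature embeddings, and a Kolmogorov–Arnold Network (KAN) classifies them. The KAN applies localized nonlinear transformations through adaptive B-spline activations, which improves the separation of benign from malignant lesions while interpretability is preserved through attention-based heatmaps. Experiments show that the MST+KAN pipeline outperforms the baseline MST classifier, reaching AUC = 0.80 ± 0.02.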

Link: https://arxiv.org/abs/2509.02710
Authors: Gabriel A.B. do Nascimento, Vincent Dong, Guilherme J. Cavalcante, Alex Nguyen, Thaís G. do Rêgo, Yuri Malheiros, Telmo M. Silva Filho, Carla R. Zeballos Torrez, James C. Gee, Anne Marie McCarthy, Andrew D. A. Maidment, Bruno Barufaldi
Affiliations: University of Pennsylvania; Federal University of Paraíba; University of Bristol
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Accurate breast MRI lesion detection is critical for early cancer diagnosis, especially in high-risk populations. We present a classification pipeline that adapts a pretrained foundation model, the Medical Slice Transformer (MST), for breast lesion classification using dynamic contrast-enhanced MRI (DCE-MRI). Leveraging DINOv2-based self-supervised pretraining, MST generates robust per-slice feature embeddings, which are then used to train a Kolmogorov–Arnold Network (KAN) classifier. The KAN provides a flexible and interpretable alternative to conventional convolutional networks by enabling localized nonlinear transformations via adaptive B-spline activations. This enhances the model’s ability to differentiate benign from malignant lesions in imbalanced and heterogeneous clinical datasets. Experimental results demonstrate that the MST+KAN pipeline outperforms the baseline MST classifier, achieving AUC = 0.80 ± 0.02 while preserving interpretability through attention-based heatmaps. Our findings highlight the effectiveness of combining foundation model embeddings with advanced classification strategies for building robust and generalizable breast MRI analysis tools.

[CV-69] Adaptive Learning Strategies for Mitotic Figure Classification in MIDOG2025 Challenge

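Quick read: This paper addresses the reliable detection of atypical mitotic figures (AMFs) in pathology images, which is made difficult by morphological ambiguity and scanner variability. The key to its solution is prompt-based model adaptation: starting from a LoRA (Low-Rank Adaptation) baseline, the authors introduce visual prompt tuning (VPT), which substantially improves generalization, and then add test-time augmentation (TTA) with Vahadane and Macenko stain normalization for the best robustness across imaging conditions. The final submission reached a balanced accuracy of 0.8837 and an ROC-AUC of 0.9513 on the preliminary leaderboard of MIDOG2025 Track 2.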

Link: https://arxiv.org/abs/2509.02640
Authors: Biwen Meng, Xi Long, Jingxin Liu
Affiliations: Xi’an Jiaotong-Liverpool University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Atypical mitotic figures (AMFs) are clinically relevant indicators of abnormal cell division, yet their reliable detection remains challenging due to morphological ambiguity and scanner variability. In this work, we investigated three variants of adapting the pathology foundation model UNI2-h for the MIDOG2025 Track 2 challenge. Starting from a LoRA-based baseline, we found that visual prompt tuning (VPT) substantially improved generalization, and that further integrating test-time augmentation (TTA) with Vahadane and Macenko stain normalization provided the best robustness. Our final submission achieved a balanced accuracy of 0.8837 and an ROC-AUC of 0.9513 on the preliminary leaderboard, ranking within the top 10 teams. These results demonstrate that prompt-based adaptation combined with stain-normalization TTA offers an effective strategy for atypical mitosis classification under diverse imaging conditions.
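
Test-time augmentation, as used in this entry, amounts to averaging a model's predictions over several views of the same input. The sketch below is a minimal illustration under stated assumptions: the views are simple flips standing in for the stain-normalized views the abstract mentions, `model` is any callable that returns a probability, and the names `hflip`, `vflip`, and `tta_predict` are hypothetical.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy stand-in for an image patch


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return img[::-1]


def tta_predict(model: Callable[[Image], float], img: Image,
                views: Sequence[Callable[[Image], Image]]) -> float:
    """Average the model's probability over the original and augmented views."""
    probs = [model(img)] + [model(view(img)) for view in views]
    return sum(probs) / len(probs)


if __name__ == "__main__":
    # Toy "model": reads the top-left pixel as its predicted probability.
    toy_model = lambda im: im[0][0]
    print(tta_predict(toy_model, [[0.2, 0.4], [0.6, 0.8]], [hflip, vflip]))
```

Swapping the flips for functions that apply Vahadane or Macenko normalization would give the stain-normalization variant described in the abstract, assuming such functions are available in the surrounding pipeline.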

[CV-70] A Single Detect Focused YOLO Framework for Robust Mitotic Figure Detection

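Quick read: This paper addresses domain variability in mitotic figure detection, that is, the loss of robustness caused by differences in scanners, tissue types, and staining protocols. The key to its solution is SDF-YOLO (Single Detect Focused YOLO), a lightweight, domain-robust detection framework built on YOLOv11 with task-specific modifications: a single detection head aligned with the scale of mitotic figures to improve sensitivity to small targets, coordinate attention to strengthen positional sensitivity, and improved cross-channel feature mixing. Experiments on datasets spanning human and canine tumors show competitive accuracy and computational efficiency, indicating reliable mitotic figure detection across diverse domains.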

Link: https://arxiv.org/abs/2509.02637
Authors: Yasemin Topuz, M. Taha Gökcan, Serdar Yıldız, Songül Varlı
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mitotic figure detection is a crucial task in computational pathology, as mitotic activity serves as a strong prognostic marker for tumor aggressiveness. However, domain variability that arises from differences in scanners, tissue types, and staining protocols poses a major challenge to the robustness of automated detection methods. In this study, we introduce SDF-YOLO (Single Detect Focused YOLO), a lightweight yet domain-robust detection framework designed specifically for small, rare targets such as mitotic figures. The model builds on YOLOv11 with task-specific modifications, including a single detection head aligned with mitotic figure scale, coordinate attention to enhance positional sensitivity, and improved cross-channel feature mixing. Experiments were conducted on three datasets that span human and canine tumors: MIDOG++, canine cutaneous mast cell tumor (CCMCT), and canine mammary carcinoma (CMC). When submitted to the preliminary test set for the MIDOG2025 challenge, SDF-YOLO achieved an average precision (AP) of 0.799, with a precision of 0.758, a recall of 0.775, an F1 score of 0.766, and an FROC-AUC of 5.793, demonstrating both competitive accuracy and computational efficiency. These results indicate that SDF-YOLO provides a reliable and efficient framework for robust mitotic figure detection across diverse domains.

[CV-71] Challenges and Lessons from MIDOG 2025: A Two-Stage Approach to Domain-Robust Mitotic Figure Detection

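Quick read: This paper addresses mitotic figure detection in histopathology under domain variability and morphological complexity, where the core challenge is robust detection across diverse tissue domains. The solution is a two-stage pipeline: a Faster R-CNN model first extracts candidate detections, and an ensemble of three classifiers (DenseNet-121, EfficientNet-v2, InceptionResNet-v2) then suppresses false positives. The best submission achieved high recall (0.9528) but a precision of only 12.67%, which shows that separating true mitoses from morphologically similar imposters remains the main bottleneck and highlights the difficulty of domain generalization.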

Link: https://arxiv.org/abs/2509.02630
Authors: Euiseop Song, Jaeyoung Park, Jaewoo Park
Affiliations: Korea University Graduate School; Dongguk University; Sogang University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mitotic figure detection remains a challenging task in computational pathology due to domain variability and morphological complexity. This paper describes our participation in the MIDOG 2025 challenge, focusing on robust mitotic figure detection across diverse tissue domains. We developed a two-stage pipeline combining Faster R-CNN for candidate detection with an ensemble of three classifiers (DenseNet-121, EfficientNet-v2, InceptionResNet-v2) for false positive reduction. Our best submission achieved an F1-score of 0.2237 (Recall: 0.9528, Precision: 0.1267) using a Faster R-CNN trained solely on the MIDOG++ dataset. While our high recall demonstrates effective mitotic figure detection, the critically low precision (12.67%) reveals fundamental challenges in distinguishing true mitoses from morphologically similar imposters across diverse domains. Analysis of six submission variants showed that subsequent optimization attempts were counterproductive, highlighting the complexity of domain generalization in histopathology. This work provides valuable insights into the practical challenges of developing robust mitotic figure detection algorithms and emphasizes the importance of effective false positive suppression strategies.
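
The reported F1-score follows directly from the stated precision and recall, since F1 is their harmonic mean. The short check below recomputes it from the two numbers in the abstract; the function name `f1_score` is just a local helper, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported for the best submission.
    print(round(f1_score(0.1267, 0.9528), 4))
```

Because the harmonic mean is dominated by the smaller value, a recall above 0.95 cannot compensate for a precision near 0.13, which is why the F1-score stays low.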

[CV-72] Is Synthetic Image Augmentation Useful for Imbalanced Classification Problems? Case-Study on the MIDOG2025 Atypical Cell Detection Competition DATE

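Quick read: This paper addresses atypical mitosis classification in histopathology images, a clinically relevant problem with severe class imbalance (9408 normal versus 1741 atypical mitotic figures) and difficult cross-domain generalization. The authors use two complementary backbones, ConvNeXt-Small pretrained on ImageNet and a histopathology-specific ViT from Lunit trained via self-supervision, and try to reduce the imbalance by synthesizing additional atypical examples. Synthetic balancing did not bring consistent improvements, but both backbones reached a high AUROC (about 95%): ConvNeXt achieved slightly higher peaks, while Lunit was more stable from fold to fold. This suggests that ImageNet pretraining reaches a higher performance ceiling, whereas domain-specific pretraining confers robustness.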

Link: https://arxiv.org/abs/2509.02612
Authors: Leire Benito-Del-Valle, Pedro A. Moreno-Sánchez, Itziar Egusquiza, Itsaso Vitoria, Artzai Picón, Cristina López-Saratxaga, Adrian Galdran
Affiliations: TECNALIA; University of the Basque Country
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: version 0, to be updated; submitted to midog 2025

Abstract:The MIDOG 2025 challenge extends prior work on mitotic figure detection by introducing a new Track 2 on atypical mitosis classification. This task aims to distinguish normal from atypical mitotic figures in histopathology images, a clinically relevant but highly imbalanced and cross-domain problem. We investigated two complementary backbones: (i) ConvNeXt-Small, pretrained on ImageNet, and (ii) a histopathology-specific ViT from Lunit trained via self-supervision. To address the strong prevalence imbalance (9408 normal vs. 1741 atypical), we synthesized additional atypical examples to approximate class balance and compared models trained with real-only vs. real+synthetic data. Using five-fold cross-validation, both backbones reached strong performance (mean AUROC approximately 95 percent), with ConvNeXt achieving slightly higher peaks while Lunit exhibited greater fold-to-fold stability. Synthetic balancing, however, did not lead to consistent improvements. On the organizers’ preliminary hidden test set, explicitly designed as an out-of-distribution debug subset, ConvNeXt attained the highest AUROC (95.4 percent), whereas Lunit remained competitive on balanced accuracy. These findings suggest that both ImageNet and domain-pretrained backbones are viable for atypical mitosis classification, with domain-pretraining conferring robustness and ImageNet pretraining reaching higher peaks, while naive synthetic balancing has limited benefit. Full hidden test set results will be reported upon challenge completion.
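
The class counts in the abstract (9408 normal, 1741 atypical) make it easy to see why balanced accuracy is reported alongside AUROC. The sketch below, using hypothetical helper names, compares plain accuracy with balanced accuracy for a trivial classifier that labels every sample as normal, and computes how many synthetic atypical samples exact balancing would require.

```python
def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def synthetic_needed(n_majority: int, n_minority: int) -> int:
    """Extra minority samples required to equalise the two classes."""
    return max(n_majority - n_minority, 0)


if __name__ == "__main__":
    n_normal, n_atypical = 9408, 1741
    # Trivial classifier: everything is "normal" (atypical is the positive class).
    counts = dict(tp=0, fn=n_atypical, tn=n_normal, fp=0)
    print(round(accuracy(**counts), 4), balanced_accuracy(**counts))
    print(synthetic_needed(n_normal, n_atypical))
```

The trivial classifier scores well on plain accuracy yet only 0.5 on balanced accuracy, which is the behaviour that makes the latter a more informative metric for this task.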

[CV-73] Foundation Model-Driven Classification of Atypical Mitotic Figures with Domain-Aware Training Strategies

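Quick read: This paper addresses the binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs) for Track 2 of the MIDOG 2025 Challenge. The core of the solution is the pathology-specific foundation model H-optimus-0 with Low-Rank Adaptation (LoRA) fine-tuning, combined with MixUp augmentation, soft labels based on multi-expert consensus, hard negative mining, adaptive focal loss, metric learning, and domain adaptation. The method achieved reasonable performance in the preliminary evaluation phase.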

Link: https://arxiv.org/abs/2509.02601
Authors: Piotr Giedziun, Jan Sołtysik, Mateusz Górczany, Norbert Ropiak, Marcin Przymus, Piotr Krajewski, Jarosław Kwiecień, Artur Bartczak, Izabela Wasiak, Mateusz Maniewski
Affiliations: Wrocław University of Science and Technology; Cancer Center Sp. z o. o.; Hospital for Lung Diseases - Rebirth; Maria Sklodowska-Curie National Research Institute of Oncology; 10th Military Research Hospital in Bydgoszcz; Oncology Centre Prof. Franciszek Łukaszczyk Memorial Hospital; Nicolaus Copernicus University in Toruń
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a solution for the MIDOG 2025 Challenge Track 2, addressing binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs). The approach leverages the pathology-specific foundation model H-optimus-0, selected based on recent cross-domain generalization benchmarks and our empirical testing, with Low-Rank Adaptation (LoRA) fine-tuning and MixUp augmentation. Implementation includes soft labels based on multi-expert consensus, hard negative mining, adaptive focal loss, metric learning, and domain adaptation. The method demonstrates both the promise and challenges of applying foundation models to this complex classification task, achieving reasonable performance in the preliminary evaluation phase.
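
MixUp, one of the augmentations listed in the abstract, blends two training examples and their labels with a random weight. The sketch below shows the standard formulation on flat feature vectors; it is not the authors' code, and the function name `mixup`, the `alpha=0.2` default, and the list-based inputs are assumptions chosen for illustration.

```python
import random
from typing import List, Optional, Tuple


def mixup(x1: List[float], y1: List[float], x2: List[float], y2: List[float],
          alpha: float = 0.2,
          rng: Optional[random.Random] = None) -> Tuple[List[float], List[float], float]:
    """Blend two (input, label) pairs with a weight drawn from Beta(alpha, alpha)."""
    if rng is None:
        rng = random.Random()
    lam = rng.betavariate(alpha, alpha)
    # The same weight is applied to inputs and to (one-hot or soft) labels.
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


if __name__ == "__main__":
    x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                      rng=random.Random(0))
    print(lam, x, y)
```

Because the mixed label is itself a soft label, MixUp combines naturally with the multi-expert soft labels that the abstract also mentions.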

[CV-74] Team Westwood Solution for MIDOG 2025 Challenge

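Quick read: This paper addresses mitosis detection and atypical mitosis classification in histopathology images, with a focus on cross-domain generalization. For detection, an nnUNetV2 first screens initial mitosis candidates with high sensitivity, and a random forest classifier then ensembles the predictions of three convolutional neural networks (CNNs): EfficientNet-b3, EfficientNet-b5, and EfficientNetV2-s. For atypical mitosis classification, another random forest ensembles the outputs of EfficientNet-b3, EfficientNet-b5, and InceptionV3.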

Link: https://arxiv.org/abs/2509.02600
Authors: Tengyou Xu, Haochen Yang, Xiang ‘Anthony’ Chen, Hongyan Gu, Mohammad Haeri
Affiliations: University of California Los Angeles; University of Kansas Medical Center
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 2 figures

Abstract:This abstract presents our solution (Team Westwood) for mitosis detection and atypical mitosis classification in the MItosis DOmain Generalization (MIDOG) 2025 challenge. For mitosis detection, we trained an nnUNetV2 for initial mitosis candidate screening with high sensitivity, followed by a random forest classifier ensembling predictions of three convolutional neural networks (CNNs): EfficientNet-b3, EfficientNet-b5, and EfficientNetV2-s. For the atypical mitosis classification, we trained another random forest classifier ensembling the predictions of three CNNs: EfficientNet-b3, EfficientNet-b5, and InceptionV3. On the preliminary test set, our solution achieved an F1 score of 0.7450 for track 1 mitosis detection, and a balanced accuracy of 0.8722 for track 2 atypical mitosis classification.

[CV-75] RF-DETR for Robust Mitotic Figure Detection: A MIDOG 2025 Track 1 Approach

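Quick read: This paper addresses the significant domain shift across scanners, staining protocols, and tissue types in mitotic figure detection on histopathology images. The key to its solution is the single-stage detector RF-DETR (Roboflow Detection Transformer) combined with hard negative mining and trained on the MIDOG++ dataset, which generalizes to unseen domains. On the preliminary test set the method reached an F1 score of 0.789 (recall 0.839, precision 0.746), underlining the importance of training data balance and hard negative mining for handling domain shift.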

Link: https://arxiv.org/abs/2509.02599
Authors: Piotr Giedziun, Jan Sołtysik, Mateusz Górczany, Norbert Ropiak, Marcin Przymus, Piotr Krajewski, Jarosław Kwiecień, Artur Bartczak, Izabela Wasiak, Mateusz Maniewski
Affiliations: Wrocław University of Science and Technology; Cancer Center Sp. z o. o.; Hospital for Lung Diseases - Rebirth; Maria Sklodowska-Curie National Research Institute of Oncology; 10th Military Research Hospital in Bydgoszcz; Oncology Centre Prof. Franciszek Łukaszczyk Memorial Hospital; Nicolaus Copernicus University in Toruń
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Challenge report for MIDOG 2025 Track 1

Abstract:Mitotic figure detection in histopathology images remains challenging due to significant domain shifts across different scanners, staining protocols, and tissue types. This paper presents our approach for the MIDOG 2025 challenge Track 1, focusing on robust mitotic figure detection across diverse histological contexts. While we initially planned a two-stage approach combining high-recall detection with subsequent classification refinement, time constraints led us to focus on optimizing a single-stage detection pipeline. We employed RF-DETR (Roboflow Detection Transformer) with hard negative mining, trained on the MIDOG++ dataset. On the preliminary test set, our method achieved an F1 score of 0.789 with a recall of 0.839 and precision of 0.746, demonstrating effective generalization across unseen domains. The proposed solution offers insights into the importance of training data balance and hard negative mining for addressing domain shift challenges in mitotic figure detection.
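
The abstract names hard negative mining without describing the procedure, so the sketch below shows only the general idea under stated assumptions: run the current detector on training images, keep the highest-scoring detections that match no annotated mitotic figure, and feed those back as negative examples. The point-style detections, the matching radius, and the function name `mine_hard_negatives` are hypothetical.

```python
from math import hypot
from typing import List, Tuple

Detection = Tuple[float, float, float]  # (x, y, confidence)
Point = Tuple[float, float]             # annotated mitotic figure centre


def mine_hard_negatives(detections: List[Detection], ground_truth: List[Point],
                        radius: float = 15.0, top_k: int = 100) -> List[Detection]:
    """Return the `top_k` most confident detections that match no annotation."""
    false_positives = [
        d for d in detections
        if all(hypot(d[0] - gx, d[1] - gy) > radius for gx, gy in ground_truth)
    ]
    # The most confident mistakes are the "hardest" negatives for the model.
    false_positives.sort(key=lambda d: d[2], reverse=True)
    return false_positives[:top_k]
```

Crops around the returned locations would then be added to the training set as background examples before the detector is retrained.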

[CV-76] Solutions for Mitotic Figure Detection and Atypical Classification in MIDOG 2025

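Quick read: This paper addresses domain generalization for mitotic figure detection and atypical mitosis classification in pathology images, where model performance tends to drop across datasets or experimental conditions. For detection, it proposes a two-stage detection-classification framework that first localizes candidate mitotic figures and then refines the predictions with a dedicated classification module. For classification, it uses an ensemble strategy that integrates predictions from multiple state-of-the-art deep learning architectures to improve robustness and accuracy.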

Link: https://arxiv.org/abs/2509.02597
Authors: Shuting Xu, Runtong Liu, Zhixuan Chen, Junlin Hou, Hao Chen
Affiliations: The Hong Kong University of Science and Technology, Hong Kong, China; Nankai University, Tianjin, China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning has driven significant advances in mitotic figure analysis within computational pathology. In this paper, we present our approach to the Mitosis Domain Generalization (MIDOG) 2025 Challenge, which consists of two distinct tasks, i.e., mitotic figure detection and atypical mitosis classification. For the mitotic figure detection task, we propose a two-stage detection-classification framework that first localizes candidate mitotic figures and subsequently refines the predictions using a dedicated classification module. For the atypical mitosis classification task, we employ an ensemble strategy that integrates predictions from multiple state-of-the-art deep learning architectures to improve robustness and accuracy. Extensive experiments demonstrate the effectiveness of our proposed methods across both tasks.

[CV-77] ConvNeXt with Histopathology-Specific Augmentations for Mitotic Figure Classification

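[Quick Read]: This paper addresses the accurate classification of atypical mitotic figures (AMFs) versus normal mitotic figures (NMFs) in histopathology images, a task that matters in computational pathology for cancer grading and patient prognosis. Subtle morphological differences and high intra-class variability, together with domain shift across organs, tissue types, and scanners, limited annotations, and severe class imbalance, make the task considerably harder. The key elements of the solution are: a lightweight ConvNeXt architecture trained on all available datasets (AMi-Br, AtNorM-Br, AtNorM-MD, and OMG-Octo) to maximize domain coverage; histopathology-specific augmentations, including elastic and stain-specific transformations, to improve robustness; balanced sampling to mitigate class imbalance; and grouped 5-fold cross-validation for reliable evaluation. The model reached a balanced accuracy of 0.8961 on the preliminary leaderboard of MIDOG 2025 Track 2, which supports the view that broad domain exposure combined with targeted augmentation is important for building accurate, generalizable mitotic figure classifiers.

Link: https://arxiv.org/abs/2509.02595
Authors: Hana Feki,Alice Blondel,Thomas Walter
Affiliations: CBIO - Center for Computational Biology, Mines Paris PSL, France
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: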

Abstract:Accurate mitotic figure classification is crucial in computational pathology, as mitotic activity informs cancer grading and patient prognosis. Distinguishing atypical mitotic figures (AMFs), which indicate higher tumor aggressiveness, from normal mitotic figures (NMFs) remains challenging due to subtle morphological differences and high intra-class variability. This task is further complicated by domain shifts, including variations in organ, tissue type, and scanner, as well as limited annotations and severe class imbalance. To address these challenges in Track 2 of the MIDOG 2025 Challenge, we propose a solution based on the lightweight ConvNeXt architecture, trained on all available datasets (AMi-Br, AtNorM-Br, AtNorM-MD, and OMG-Octo) to maximize domain coverage. Robustness is enhanced through a histopathology-specific augmentation pipeline, including elastic and stain-specific transformations, and balanced sampling to mitigate class imbalance. A grouped 5-fold cross-validation strategy ensures reliable evaluation. On the preliminary leaderboard, our model achieved a balanced accuracy of 0.8961, ranking among the top entries. These results highlight that broad domain exposure combined with targeted augmentation strategies is key to building accurate and generalizable mitotic figure classifiers.
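Balanced accuracy, the metric reported above, is the mean of per-class recall, which makes it informative under class imbalance. The following is a minimal sketch with hypothetical toy labels (not data from the paper) showing how it differs from plain accuracy.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 8 normal (NMF) and 2 atypical (AMF) figures.
y_true = ["NMF"] * 8 + ["AMF"] * 2
y_pred = ["NMF"] * 10  # a classifier that always predicts the majority class
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

In this toy case the majority-class predictor looks good on plain accuracy but scores only chance level on balanced accuracy, which is why the latter suits AMF/NMF classification.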

[CV-78] Robust Pan-Cancer Mitotic Figure Detection with YOLOv12

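[Quick Read]: This paper addresses the difficulty of identifying mitotic figures in tumor pathology: substantial inter-observer variability makes readings inconsistent, which in turn affects the assessment of tumor aggressiveness. The authors propose a mitotic figure detection method based on the YOLOv12 object detection architecture. Its key result is an F1-score of 0.801 on the preliminary test set of the MIDOG 2025 challenge without relying on external data, which points to robust cross-domain generalization.

Link: https://arxiv.org/abs/2509.02593
Authors: Raphaël Bourgade,Guillaume Balezo,Thomas Walter
Affiliations: Centre for Computational Biology - MINES Paris–PSL University, Paris, France
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: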

Abstract:Mitotic figures represent a key histoprognostic feature in tumor pathology, providing crucial insights into tumor aggressiveness and proliferation. However, their identification remains challenging, subject to significant inter-observer variability, even among experienced pathologists. To address this issue, the MItosis DOmain Generalization (MIDOG) 2025 challenge marks the third edition of an international competition aiming to develop robust mitosis detection algorithms. In this paper, we present a mitotic figure detection approach based on the YOLOv12 object detection architecture, achieving an F1-score of 0.801 on the preliminary test set of the MIDOG 2025 challenge, without relying on external data.

[CV-79] Ensemble of Pathology Foundation Models for MIDOG 2025 Track 2: Atypical Mitosis Classification

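[Quick Read]: This paper addresses the accurate classification of mitotic figures in pathology, in particular separating typical from atypical ones. Atypical counts correlate strongly with tumor aggressiveness and therefore bear on prognosis and resource allocation, yet the task remains hard even for expert pathologists. The key to the solution is to use Pathology Foundation Models (PFMs) pre-trained on large histopathology datasets and to fine-tune them parameter-efficiently with low-rank adaptation. During training, a fisheye transform emphasizes the mitotic region, and Fourier Domain Adaptation with ImageNet target images is applied to improve cross-domain generalization. Finally, several PFMs are ensembled to combine complementary morphological information, yielding a high balanced accuracy on the preliminary evaluation dataset.

Link: https://arxiv.org/abs/2509.02591
Authors: Mieko Ochi,Bae Yuan
Affiliations: Japanese Red Cross Medical Center (日本赤十字医療センター)
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: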

Abstract:Mitotic figures are classified into typical and atypical variants, with atypical counts correlating strongly with tumor aggressiveness. Accurate differentiation is therefore essential for patient prognostication and resource allocation, yet remains challenging even for expert pathologists. Here, we leveraged Pathology Foundation Models (PFMs) pre-trained on large histopathology datasets and applied parameter-efficient fine-tuning via low-rank adaptation. During training, we employ a fisheye transform to emphasize mitoses and Fourier Domain Adaptation using ImageNet target images. Finally, we ensembled multiple PFMs to integrate complementary morphological insights, achieving a high balanced accuracy on the Preliminary Evaluation Phase dataset.

[CV-80] Normal and Atypical Mitosis Image Classifier using Efficient Vision Transformer

[Quick Read]: This paper addresses the classification of atypical versus normal mitosis in cancer pathology images. The key to the solution is EfficientViT-L2, a hybrid architecture that combines convolutional neural network (CNN) and Vision Transformer (ViT) components to balance accuracy and computational efficiency. Generalization across tissue types is assessed with leave-one-cancer-type-out cross-validation, and stain deconvolution is used for image augmentation. In the preliminary evaluation phase of the MIDOG 2025 challenge, the model achieved a balanced accuracy of 0.859, a ROC AUC of 0.942, and a raw accuracy of 0.85.

Link: https://arxiv.org/abs/2509.02589
Authors: Xuan Qi,Dominic Labella,Thomas Sanford,Maxwell Lee
Affiliations: Laboratory of Cancer Biology and Genetics, NCI, NIH, Bethesda, MD, 20852, USA; Department of Radiation Oncology, Duke University Medical Center, Durham, NC, 27705, USA; University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: for grandchallenge midog 2025 track 2 abstract

Abstract:We tackle atypical versus normal mitosis classification in the MIDOG 2025 challenge using EfficientViT-L2, a hybrid CNN–ViT architecture optimized for accuracy and efficiency. A unified dataset of 13,938 nuclei from seven cancer types (MIDOG++ and AMi-Br) was used, with atypical mitoses comprising ~15%. To assess domain generalization, we applied leave-one-cancer-type-out cross-validation with 5-fold ensembles, using stain-deconvolution for image augmentation. For challenge submissions, we trained an ensemble with the same 5-fold split but on all cancer types. In the preliminary evaluation phase, this model achieved balanced accuracy of 0.859, ROC AUC of 0.942, and raw accuracy of 0.85, demonstrating competitive and well-balanced performance across metrics.
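Leave-one-cancer-type-out cross-validation, as described above, holds out every sample from one cancer type per fold and trains on the rest. The sketch below shows only the index bookkeeping; the function name and the toy group labels are hypothetical, and the per-fold 5-fold ensembling mentioned in the abstract is not reproduced.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical cancer-type label for each sample.
groups = ["breast", "lymphoma", "breast", "melanoma", "lymphoma"]
for held_out, train_idx, test_idx in leave_one_group_out(groups):
    print(held_out, train_idx, test_idx)
```

Because the held-out type never appears in training, the test score of each fold approximates performance on an unseen domain.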

[CV-81] Sequential Hard Mining: a data-centric approach for Mitosis Detection

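[Quick Read]: This paper addresses how to make efficient use of the large and growing volume of annotated mitotic figure data in histology images when training deep learning models. The key to the solution is an efficient training-data sampling strategy inspired by boosting techniques, which the authors apply to their candidate solutions for both tracks of the MIDOG 2025 challenge.

Link: https://arxiv.org/abs/2509.02588
Authors: Maxime W. Lafarge,Viktor H. Koelzer
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: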

Abstract:With the continuously growing availability of annotated datasets of mitotic figures in histology images, finding the best way to use this unprecedented amount of data to train deep learning models has become a new challenge. Here, we build upon previously proposed approaches with a focus on efficient sampling of training data inspired by boosting techniques and present our candidate solutions for the two tracks of the MIDOG 2025 challenge.

[CV-82] MitoDetect: A Domain-Robust Pipeline for Mitosis Detection and Atypical Subtyping

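[Quick Read]: This paper addresses automated detection and classification of mitotic figures in computational pathology, in particular distinguishing atypical from normal mitosis. The key to the solution is MitoDetect++, a unified deep learning pipeline. For detection (Track 1), it uses a U-Net architecture with an EfficientNetV2-L backbone, attention modules, and combined segmentation losses to improve localization. For classification (Track 2), it uses the Virchow2 vision transformer, fine-tuned with Low-Rank Adaptation (LoRA) to keep resource consumption low. Strong augmentations, focal loss, and group-aware stratified 5-fold cross-validation are used to improve generalization and mitigate domain shift, and test-time augmentation (TTA) is applied at inference for additional robustness. The method reaches a balanced accuracy of 0.892 across validation domains, which the authors present as evidence of clinical applicability and scalability across tasks.

Link: https://arxiv.org/abs/2509.02586
Authors: Esha Sadia Nasir,Jiaqi Lv,Mostafa Jahanifer,Shan E Ahmed Raza
Affiliations: Tissue Image Analytics Centre, University of Warwick
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: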

Abstract:Automated detection and classification of mitotic figures, especially distinguishing atypical from normal, remain critical challenges in computational pathology. We present MitoDetect++, a unified deep learning pipeline designed for the MIDOG 2025 challenge, addressing both mitosis detection and atypical mitosis classification. For detection (Track 1), we employ a U-Net-based encoder-decoder architecture with EfficientNetV2-L as the backbone, enhanced with attention modules, and trained via combined segmentation losses. For classification (Track 2), we leverage the Virchow2 vision transformer, fine-tuned efficiently using Low-Rank Adaptation (LoRA) to minimize resource consumption. To improve generalization and mitigate domain shifts, we integrate strong augmentations, focal loss, and group-aware stratified 5-fold cross-validation. At inference, we deploy test-time augmentation (TTA) to boost robustness. Our method achieves a balanced accuracy of 0.892 across validation domains, highlighting its clinical applicability and scalability across tasks.
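The focal loss mentioned above down-weights well-classified examples so that training concentrates on hard ones. Below is a minimal sketch of the standard binary form, FL(p_t) = -(1 - p_t)^γ log(p_t); the value γ = 2 is an assumption for illustration, since the abstract does not state the hyperparameters used.

```python
import math


def binary_focal_loss(prob_positive: float, label: int, gamma: float = 2.0) -> float:
    """Focal loss for one binary prediction; gamma=0 reduces to cross-entropy."""
    # p_t is the probability the model assigns to the true class.
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = binary_focal_loss(0.95, label=1)  # confident and correct
hard = binary_focal_loss(0.30, label=1)  # under-confident on a positive
print(easy, hard)
```

The easy example contributes far less to the loss than the hard one, which is the behavior that helps when atypical mitoses are a small minority.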

[CV-83] Pan-Cancer mitotic figures detection and domain generalization: MIDOG 2025 Challenge

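[Quick Read]: This paper addresses mitotic figure detection in histopathology in support of cancer prognostication. The core challenge is limited generalization across data distributions, particularly for atypical mitoses. Following the "Bitter Lesson" principle, which emphasizes data scale over algorithmic novelty, the authors publicly release two new training datasets (covering conventional and atypical mitoses) and combine them with up-to-date training methodologies. They report a Track-1 F1 score of 0.8407 on their own test set and a Track-2 balanced accuracy of 0.9107 for atypical mitotic cell classification.

Link: https://arxiv.org/abs/2509.02585
Authors: Zhuoyan Shen,Esther Bär,Maria Hawkins,Konstantin Bräutigam,Charles-Antoine Collins-Fekete
Affiliations: University College London; The Institute of Cancer Research
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: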

Abstract:This report details our submission to the Mitotic Domain Generalization (MIDOG) 2025 challenge, which addresses the critical task of mitotic figure detection in histopathology for cancer prognostication. Following the “Bitter Lesson” [sutton2019bitterlesson] principle that emphasizes data scale over algorithmic novelty, we have publicly released two new datasets to bolster training data for both conventional [Shen2024framework] and atypical mitoses [shen_2025_16780587]. Besides, we implement up-to-date training methodologies for both tracks and reach a Track-1 F1-Score of 0.8407 on our test set, as well as a Track-2 balanced accuracy of 0.9107 for atypical mitotic cell classification.

[CV-84] Application of Quantum Convolutional Neural Networks for MRI-Based Brain Tumor Detection and Classification

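[Quick Read]: This paper explores Quantum Convolutional Neural Networks (QCNNs) for brain tumor classification from MRI images, motivated by the efficiency and accuracy limits of conventional deep learning models. The approach builds a QCNN with quantum convolution layers, flatten layers, and dense layers, applies oversampling to mitigate class imbalance, and validates on an 80/20 train/test split. The binary model reaches 89% accuracy after data balancing, indicating that tumor presence can be identified effectively. The multiclass model improves only to 62% after oversampling, but the results still indicate that QCNNs are feasible for medical imaging and point toward optimized quantum circuit architectures and hybrid classical-quantum approaches as next steps.

Link: https://arxiv.org/abs/2509.02582
Authors: Sugih Pratama Nugraha,Ariiq Islam Alfajri,Tony Sumaryada,Duong Thanh Tai,Nissren Tamam,Abdelmoneim Sulieman,Sitti Yani
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: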

Abstract:This study explores the application of Quantum Convolutional Neural Networks (QCNNs) for brain tumor classification using MRI images, leveraging quantum computing for enhanced computational efficiency. A dataset of 3,264 MRI images, including glioma, meningioma, pituitary tumors, and non-tumor cases, was utilized. The data was split into 80% training and 20% testing, with an oversampling technique applied to address class imbalance. The QCNN model consists of quantum convolution layers, flatten layers, and dense layers, with a filter size of 2, depth of 4, and 4 qubits, trained over 10 epochs. Two models were developed: a binary classification model distinguishing tumor presence and a multiclass classification model categorizing tumor types. The binary model achieved 88% accuracy, improving to 89% after data balancing, while the multiclass model achieved 52% accuracy, increasing to 62% after oversampling. Despite strong binary classification performance, the multiclass model faced challenges due to dataset complexity and quantum circuit limitations. These findings suggest that QCNNs hold promise for medical imaging applications, particularly in binary classification. However, further refinements, including optimized quantum circuit architectures and hybrid classical-quantum approaches, are necessary to enhance multiclass classification accuracy and improve QCNN applicability in clinical settings.
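The abstract above applies oversampling to address class imbalance but does not say which variant. The sketch below shows plain random oversampling as one possible reading; the function name and toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for s in items + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical toy data: four "tumor" scans and two "no_tumor" scans.
xs, ys = random_oversample(list(range(6)), ["tumor"] * 4 + ["no_tumor"] * 2)
print(ys.count("tumor"), ys.count("no_tumor"))
```

A common precaution is to oversample only the training split, so that duplicated images cannot leak into the test set.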

Artificial Intelligence

[AI-0] Can the Waymo Open Motion Dataset Support Realistic Behavioral Modeling? A Validation Study with Naturalistic Trajectories

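[Quick Read]: This paper examines whether the Waymo Open Motion Dataset (WOMD) is valid for modeling autonomous vehicle (AV) behavior, that is, whether it accurately captures the dynamics and interactions of real-world AV operations. Proprietary post-processing, the absence of error quantification, and trajectory segmentation leave the reliability of its behavioral representation unverified. The key to the study is a comparison against independently collected naturalistic data from Level 4 AV operations in Phoenix, Arizona (PHX) across three representative urban scenarios: discharging at signalized intersections, car-following, and lane-changing. Headways for the discharging analysis are extracted manually from aerial video so that measurement error is negligible. For car-following and lane-changing, the Simulation-Extrapolation (SIMEX) method accounts for observation error in the PHX data, and Dynamic Time Warping (DTW) distances quantify behavioral differences. The results consistently show that PHX behavior falls outside the behavioral envelope of WOMD, which underrepresents short headways and abrupt decelerations. Models calibrated solely on WOMD may therefore systematically underestimate the variability, risk, and complexity of naturalistic driving.

Link: https://arxiv.org/abs/2509.03515
Authors: Yanlin Zhang,Sungyong Chung,Nachuan Li,Dana Monzer,Hani S. Mahmassani,Samer H. Hamdar,Alireza Talebpour
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP)
Comments: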

Abstract:The Waymo Open Motion Dataset (WOMD) has become a popular resource for data-driven modeling of autonomous vehicle (AV) behavior. However, its validity for behavioral analysis remains uncertain due to proprietary post-processing, the absence of error quantification, and the segmentation of trajectories into 20-second clips. This study examines whether WOMD accurately captures the dynamics and interactions observed in real-world AV operations. Leveraging an independently collected naturalistic dataset from Level 4 AV operations in Phoenix, Arizona (PHX), we perform comparative analyses across three representative urban driving scenarios: discharging at signalized intersections, car-following, and lane-changing behaviors. For the discharging analysis, headways are manually extracted from aerial video to ensure negligible measurement error. For the car-following and lane-changing cases, we apply the Simulation-Extrapolation (SIMEX) method to account for empirically estimated error in the PHX data and use Dynamic Time Warping (DTW) distances to quantify behavioral differences. Results across all scenarios consistently show that behavior in PHX falls outside the behavioral envelope of WOMD. Notably, WOMD underrepresents short headways and abrupt decelerations. These findings suggest that behavioral models calibrated solely on WOMD may systematically underestimate the variability, risk, and complexity of naturalistic driving. Caution is therefore warranted when using WOMD for behavior modeling without proper validation against independently collected data.
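Dynamic Time Warping, used above to quantify behavioral differences, aligns two sequences that may be shifted or stretched in time before summing their pointwise costs. The following is a minimal sketch of the classic dynamic-programming recurrence with an absolute-difference cost; the speed profiles are hypothetical, and the paper's exact cost function and any windowing constraints are not specified in the abstract.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical speed profiles (m/s): the same deceleration, one step apart in time.
early_brake = [10, 8, 5, 5, 5]
late_brake = [10, 10, 8, 5, 5]
print(dtw_distance(early_brake, late_brake))
```

Here the two profiles differ pointwise but align perfectly under warping, so DTW treats them as the same maneuver, which is the property that makes it suitable for comparing trajectories.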

[AI-1] Warming Up for Zeroth-Order Federated Pre-Training with Low Resource Clients

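[Quick Read]: This paper addresses the exclusion of edge devices from federated learning when their memory or communication resources are insufficient: such devices cannot perform model updates, so their data becomes inaccessible and system-induced bias increases. The key to the solution is ZOWarmUp, a federated zeroth-order optimizer that leverages differing client capabilities and careful variance reduction so that low-resource clients can take part in zeroth-order training from a random initialization. Instead of transmitting full gradients, clients send only a small set of random seeds on the uplink, which makes communication cost negligible and lets the system draw on a larger and more diverse pool of edge-device data.

Link: https://arxiv.org/abs/2509.03503
Authors: Gwen Legate,Irina Rish,Eugene Belilovsky
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: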

Abstract:Federated learning enables collaborative model training across numerous edge devices without requiring participants to share data; however, memory and communication constraints on these edge devices may preclude their participation in training. We consider a setting in which a subset of edge devices are below a critical memory or communication threshold required to conduct model updates. Under typical federated optimization algorithms, these devices are excluded from training which renders their data inaccessible and increases system induced bias. We are inspired by MeZO, a zeroth-order method used for memory-efficient fine-tuning. The increased variance inherent to zeroth-order gradient approximations has relegated previous zeroth-order optimizers exclusively to the domain of fine tuning; a limitation we seek to correct. We devise a federated, memory-efficient zeroth-order optimizer, ZOWarmUp that permits zeroth-order training from a random initialization. ZOWarmUp leverages differing client capabilities and careful variance reduction techniques to facilitate participation of under-represented, low-resource clients in model training. Like other federated zeroth-order methods, ZOWarmUp eliminates the need for edge devices to transmit their full gradients to the server and instead relies on only a small set of random seeds, rendering the up-link communication cost negligible. We present experiments using various datasets and model architectures to show that ZOWarmUp is a robust algorithm that can be applied under a wide variety of circumstances. For systems with a high proportion of edge devices that would otherwise be excluded from training, this algorithm provides access to a greater volume and diversity of data, thus improving training outcomes.
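To illustrate why seed-based zeroth-order updates keep uplink traffic small, the sketch below shows a generic MeZO-style two-point estimate: the client evaluates the loss at two perturbed points and reports one scalar, and the server regenerates the perturbation from the shared seed. This is not ZOWarmUp itself (its variance reduction and client scheduling are not reproduced), and all function names, the quadratic loss, and the step sizes are illustrative assumptions.

```python
import random


def perturbation(seed, dim):
    """Regenerate the same Gaussian direction on client and server from a shared seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def client_estimate(loss_fn, params, seed, eps=1e-3):
    """Two-point estimate of the directional derivative along the seeded direction."""
    z = perturbation(seed, len(params))
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    return (loss_fn(plus) - loss_fn(minus)) / (2 * eps)


def server_apply(params, seed, projected_grad, lr=0.01):
    """Apply the update using only (seed, scalar); no gradient vector is transmitted."""
    z = perturbation(seed, len(params))
    return [p - lr * projected_grad * zi for p, zi in zip(params, z)]


def quadratic_loss(params):
    # Toy objective standing in for a model's training loss.
    return sum(p * p for p in params)


params = [1.0, -2.0, 0.5]
start = quadratic_loss(params)
for seed in range(200):
    g = client_estimate(quadratic_loss, params, seed)
    params = server_apply(params, seed, g)
print(start, quadratic_loss(params))
```

Each round the client's message is a seed and one float, regardless of model size, which is the communication property the abstract highlights.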

[AI-2] Real-Time Instrument Planning and Perception for Novel Measurements of Dynamic Phenomena

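[Quick Read]: This paper addresses how to obtain rare, transient, pinpoint remote-sensing measurements of dynamic science phenomena such as volcanic plumes, which conventional satellite observation struggles to capture. The core challenge is combining onboard (edge) computing with autonomous planning so that target events are recognized quickly and followed up precisely. The key to the solution is an automated workflow: dynamic events are first detected in look-ahead satellite imagery using classifiers (the authors compare traditional machine learning algorithms and convolutional neural networks, CNNs), and trajectory planning algorithms then track the morphological features of the plume and steer a follow-up high-resolution sensor to take pinpoint measurements. In simulation, the approach increases the utility return of the high-resolution instrument by an order of magnitude over baselines while maintaining efficient runtimes.

Link: https://arxiv.org/abs/2509.03500
Authors: Itai Zilberstein,Alberto Candela,Steve Chien
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Appears in Proceedings of 18th Symposium on Advanced Space Technologies in Robotics and Automation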

Abstract:Advancements in onboard computing mean remote sensing agents can employ state-of-the-art computer vision and machine learning at the edge. These capabilities can be leveraged to unlock new rare, transient, and pinpoint measurements of dynamic science phenomena. In this paper, we present an automated workflow that synthesizes the detection of these dynamic events in look-ahead satellite imagery with autonomous trajectory planning for a follow-up high-resolution sensor to obtain pinpoint measurements. We apply this workflow to the use case of observing volcanic plumes. We analyze classification approaches including traditional machine learning algorithms and convolutional neural networks. We present several trajectory planning algorithms that track the morphological features of a plume and integrate these algorithms with the classifiers. We show through simulation an order of magnitude increase in the utility return of the high-resolution instrument compared to baselines while maintaining efficient runtimes.

[AI-3] On Entropy Control in LLM-RL Algorithms

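[Quick Read]: This paper addresses why conventional entropy regularization gives weak or no gains in reinforcement learning for large language models (LLM-RL). The authors argue that the cause is the LLM's extremely large response space combined with the sparsity of optimal outputs, which prevents a standard entropy bonus from guiding exploration effectively. The key to the solution is AEnt, an entropy control method built on a clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is computed from the policy re-normalized over a smaller token space, which encourages exploration within a more compact response set. The algorithm also adjusts the entropy coefficient according to the clamped entropy value, limiting entropy-induced bias while retaining the benefits of exploration.

Link: https://arxiv.org/abs/2509.03493
Authors: Han Shen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: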

Abstract:For RL algorithms, appropriate entropy control is crucial to their effectiveness. To control the policy entropy, a commonly used method is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C. Although entropy regularization proves effective in robotic and games RL conventionally, studies found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of entropy bonus in LLM-RL setting. Specifically, we first argue that the conventional entropy regularization suffers from the LLM’s extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on certain smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy’s benefits. AEnt is tested in math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently across multiple benchmarks.
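The abstract above says the clamped entropy is evaluated on a policy re-normalized over a smaller token space, without specifying how that space is chosen. The sketch below assumes, purely for illustration, that the smaller space is the top-k most probable tokens; the function names and the toy distribution are hypothetical, and the automatic coefficient adjustment is not reproduced.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def renormalized_topk_entropy(probs, k):
    """Entropy of the policy restricted to its k most probable tokens and re-normalized.

    Using top-k as the 'smaller token space' is an assumption of this sketch.
    """
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return entropy([p / total for p in top])


# Hypothetical next-token distribution with a long, low-probability tail.
policy = [0.4, 0.3, 0.2] + [0.1 / 100] * 100
print(entropy(policy), renormalized_topk_entropy(policy, k=3))
```

Restricting the entropy to a compact subset keeps the bonus from rewarding probability mass spread over the long tail of tokens, which is the failure mode the abstract attributes to the conventional bonus.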

[AI-4] SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

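[Quick Read]: This paper addresses the potential misuse of protein foundation models from a biosafety perspective, in particular the risk that they could be used to generate proteins with biological safety risks. The key to the solution is SafeProtein, a framework that combines multimodal prompt engineering with heuristic beam search to design red-teaming methods systematically. The authors also build SafeProtein-Bench, a benchmark dataset with a full evaluation protocol. SafeProtein achieves continuous jailbreaks on state-of-the-art protein models (up to a 70% attack success rate for ESM3), exposing safety weaknesses in current models and informing the development of more robust safeguards.

Link: https://arxiv.org/abs/2509.03487
Authors: Jigang Fan,Zhenghong Zhou,Ruofan Jin,Le Cong,Mengdi Wang,Zaixi Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments: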

Abstract:Proteins play crucial roles in almost all biological processes. The advancement of deep learning has greatly accelerated the development of protein foundation models, leading to significant successes in protein understanding and design. However, the lack of systematic red-teaming for these models has raised serious concerns about their potential misuse, such as generating proteins with biological safety risks. This paper introduces SafeProtein, the first red-teaming framework designed for protein foundation models to the best of our knowledge. SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red-teaming methods and conduct tests on protein foundation models. We also curated SafeProtein-Bench, which includes a manually constructed red-teaming benchmark dataset and a comprehensive evaluation protocol. SafeProtein achieved continuous jailbreaks on state-of-the-art protein foundation models (up to 70% attack success rate for ESM3), revealing potential biological safety risks in current protein foundation models and providing insights for the development of robust security protection technologies for frontier models. The codes will be made publicly available at this https URL.

[AI-5] DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

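[Quick Read]: This paper addresses the pronounced accuracy degradation that quantization causes in Differentially-Private SGD (DP-SGD) training. The authors show that, compared with regular SGD, noise injection in DP-SGD amplifies quantization variance and leads to disproportionately large accuracy loss. The key to the solution is DPQuant, a dynamic quantization framework with two main ideas: (i) probabilistic sampling that rotates which subset of layers is quantized at each epoch to reduce variance, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator to identify layers that can be quantized with minimal impact on model quality. The estimator consumes only a negligible fraction of the privacy budget, preserving the overall DP guarantee. Empirically, DPQuant achieves near Pareto-optimal accuracy-compute trade-offs across several mainstream architectures and datasets, with up to 2.21x theoretical throughput improvement on low-precision hardware and less than 2% drop in validation accuracy.

Link: https://arxiv.org/abs/2509.03472
Authors: Yubo Gao,Renbo Tu,Gennady Pekhimenko,Nandita Vijaykumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: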

Abstract:Differentially-Private SGD (DP-SGD) is a powerful technique to protect user privacy when using sensitive data to train neural networks. During training, converting model weights and activations into low-precision formats, i.e., quantization, can drastically reduce training times, energy consumption, and cost, and is thus a widely used technique. In this work, we demonstrate that quantization causes significantly higher accuracy degradation in DP-SGD compared to regular SGD. We observe that this is caused by noise injection in DP-SGD, which amplifies quantization variance, leading to disproportionately large accuracy degradation. To address this challenge, we present DPQuant, a dynamic quantization framework that adaptively selects a changing subset of layers to quantize at each epoch. Our method combines two key ideas that effectively reduce quantization variance: (i) probabilistic sampling of the layers that rotates which layers are quantized every epoch, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator to identify layers that can be quantized with minimal impact on model quality. This estimator consumes a negligible fraction of the overall privacy budget, preserving DP guarantees. Empirical evaluations on ResNet18, ResNet50, and DenseNet121 across a range of datasets demonstrate that DPQuant consistently outperforms static quantization baselines, achieving near Pareto-optimal accuracy-compute trade-offs and up to 2.21x theoretical throughput improvements on low-precision hardware, with less than 2% drop in validation accuracy.
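The two ideas above, rotating the quantized subset each epoch and favoring layers with low loss sensitivity, can be pictured as weighted sampling without replacement. The sketch below is only an illustration of that combination: the inverse-sensitivity weighting, the layer names, and the scores are all hypothetical, and the differentially private estimation of the scores is not modeled.

```python
import random


def sample_layers_to_quantize(sensitivity, k, rng):
    """Pick k distinct layers, favoring layers with lower estimated loss sensitivity."""
    remaining = dict(sensitivity)
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        # Inverse-sensitivity weights: an assumed scheme, not the paper's rule.
        weights = [1.0 / (remaining[n] + 1e-8) for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Hypothetical per-layer sensitivity scores (not taken from the paper).
sensitivity = {"conv1": 0.9, "layer1": 0.2, "layer2": 0.1, "layer3": 0.3, "fc": 0.8}
rng = random.Random(0)
for epoch in range(3):
    print(epoch, sample_layers_to_quantize(sensitivity, k=2, rng=rng))
```

Because the subset is resampled every epoch, no single layer accumulates quantization error throughout training, while sensitive layers are quantized less often.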

[AI-6] Multi-level SSL Feature Gating for Audio Deepfake Detection ACM-MM2025

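[Quick Read]: This paper addresses the limited generalization of speech deepfake detectors as generative AI advances speech synthesis, in particular their lack of robustness to unseen deepfake attacks and to multiple languages. The key to the solution is an end-to-end framework that combines a gating mechanism with multi-kernel convolution. A gating mechanism extracts relevant features from the pre-trained XLS-R speech foundation model, which serves as the front-end feature extractor. Multi-kernel gated Convolution (MultiConv) serves as the back-end classifier and captures both local and global speech artifacts. Centered Kernel Alignment (CKA) is used as a similarity metric to enforce diversity among the features learned by different MultiConv layers, improving the model's ability to distinguish different synthetic speech patterns. The design improves detection on both in-domain and out-of-domain data and shows potential as a general solution for evolving speech deepfake threats.

Link: https://arxiv.org/abs/2509.03409
Authors: Hoan My Tran,Damien Lolive,Aghilas Sini,Arnaud Delhay,Pierre-François Marteau,David Guennec
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: This paper has been accepted by ACM MM 2025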

Abstract:Recent advancements in generative AI, particularly in speech synthesis, have enabled the generation of highly natural-sounding synthetic speech that closely mimics human voices. While these innovations hold promise for applications like assistive technologies, they also pose significant risks, including misuse for fraudulent activities, identity theft, and security threats. Current research on spoofing detection countermeasures remains limited by generalization to unseen deepfake attacks and languages. To address this, we propose a gating mechanism that extracts relevant features from the XLS-R speech foundation model, which serves as a front-end feature extractor. For the downstream back-end classifier, we employ Multi-kernel gated Convolution (MultiConv) to capture both local and global speech artifacts. Additionally, we introduce Centered Kernel Alignment (CKA) as a similarity metric to enforce diversity in learned features across different MultiConv layers. By integrating CKA with our gating mechanism, we hypothesize that each component helps improve the learning of distinct synthetic speech patterns. Experimental results demonstrate that our approach achieves state-of-the-art performance on in-domain benchmarks while generalizing robustly to out-of-domain datasets, including multilingual speech samples. This underscores its potential as a versatile solution for detecting evolving speech deepfake threats.

[AI-7] Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

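[Quick Read]: This paper addresses a reward-granularity mismatch in reinforcement learning with verifiable rewards (RLVR) for mathematical reasoning. Outcome Reward Models (ORMs) are too coarse to distinguish flawed reasoning within correct answers or valid reasoning within incorrect answers, which produces noisy and misleading gradients. Process Reward Models (PRMs) give fine-grained, step-level rewards but are often inaccurate and susceptible to reward hacking. The key to the solution is the PRocess cOnsistency Filter (PROF), a consistency-driven data curation method that combines the PRM's fine-grained process rewards with the ORM's accurate outcome rewards at the sample-selection level. It retains correct responses with higher average process values and incorrect responses with lower average process values while keeping positive and negative samples balanced. Without simply blending the two signals in the objective function, it improves final accuracy by over 4% and also improves the quality of intermediate reasoning steps.

Link: https://arxiv.org/abs/2509.03403
Authors: Chenlu Ye,Zhou Yu,Ziji Zhang,Hao Chen,Narayanan Sadagopan,Jing Huang,Tong Zhang,Anurag Beniwal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: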

Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a predominant paradigm for mathematical reasoning tasks, offering stable improvements in reasoning ability. However, Outcome Reward Models (ORMs) in RLVR are too coarse-grained to distinguish flawed reasoning within correct answers or valid reasoning within incorrect answers. This lack of granularity introduces significantly noisy and misleading gradients and hinders further progress in reasoning process quality. While Process Reward Models (PRMs) offer fine-grained guidance for intermediate steps, they frequently suffer from inaccuracies and are susceptible to reward hacking. To resolve this dilemma, we introduce PRocess cOnsistency Filter (PROF), an effective data process curation method that harmonizes noisy, fine-grained process rewards with accurate, coarse-grained outcome rewards. Rather than naively blending PRM and ORM in the objective function (arXiv:archive/2506.18896), PROF leverages their complementary strengths through consistency-driven sample selection. Our approach retains correct responses with higher averaged process values and incorrect responses with lower averaged process values, while maintaining positive/negative training sample balance. Extensive experiments demonstrate that our method not only consistently improves the final accuracy by over 4% compared to the blending approaches, but also strengthens the quality of intermediate reasoning steps. Codes and training recipes are available at this https URL.
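The selection rule described above (keep correct responses with high average process values and incorrect responses with low ones, in balanced numbers) can be sketched as a ranking filter. The code below is one simple reading of that rule, not the authors' implementation: the record layout, the `keep_per_group` parameter, and the toy scores are hypothetical.

```python
from statistics import mean


def consistency_filter(responses, keep_per_group):
    """Keep the most process/outcome-consistent responses, balanced across outcomes."""
    correct = [r for r in responses if r["correct"]]
    incorrect = [r for r in responses if not r["correct"]]
    # Equal counts from each group keep positive and negative samples balanced.
    n = min(keep_per_group, len(correct), len(incorrect))
    correct.sort(key=lambda r: mean(r["step_scores"]), reverse=True)  # high PRM score first
    incorrect.sort(key=lambda r: mean(r["step_scores"]))  # low PRM score first
    return correct[:n] + incorrect[:n]


# Hypothetical rollouts with outcome labels (ORM) and per-step scores (PRM).
responses = [
    {"id": "a", "correct": True, "step_scores": [0.9, 0.8]},
    {"id": "b", "correct": True, "step_scores": [0.3, 0.4]},   # right answer, weak reasoning
    {"id": "c", "correct": False, "step_scores": [0.2, 0.1]},
    {"id": "d", "correct": False, "step_scores": [0.7, 0.9]},  # wrong answer, strong steps
]
print([r["id"] for r in consistency_filter(responses, keep_per_group=1)])
```

Responses "b" and "d", where process and outcome signals disagree, are the ones filtered out, which matches the stated goal of removing samples that would produce misleading gradients.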

[AI-8] ANNIE: Be Careful of Your Robots

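[Quick Read]: This paper addresses the safety risks that arise in embodied AI (EAI) systems when a vision-language-action (VLA) model is compromised during physical interaction: how to quantify and evaluate safety violations, and how to design effective adversarial attacks and defenses. The key contributions are threefold. First, grounded in ISO standards for human-robot interaction, the authors formalize a taxonomy of safety violations (critical, dangerous, risky) based on physical constraints such as separation distance, velocity, and collision boundaries. Second, they build ANNIEBench, a benchmark of nine safety-critical scenarios with 2,400 video-action sequences for evaluating embodied safety. Third, they design ANNIE-Attack, a task-aware adversarial framework whose attack leader model decomposes long-horizon goals into frame-level perturbations. Experiments show attack success rates above 50% in every safety category, and physical robot experiments confirm real-world impact, exposing a previously underexplored but highly consequential attack surface in EAI systems.

Link: https://arxiv.org/abs/2509.03383
Authors: Yiyang Huang,Zixuan Wang,Zishen Wan,Yapeng Tian,Haobo Xu,Yinhe Han,Yiming Gan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: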

Abstract:The integration of vision-language-action (VLA) models into embodied AI (EAI) robots is rapidly advancing their ability to perform complex, long-horizon tasks in human-centric environments. However, EAI systems introduce critical security risks: a compromised VLA model can directly translate adversarial perturbations on sensory input into unsafe physical actions. Traditional safety definitions and methodologies from the machine learning community are no longer sufficient. EAI systems raise new questions, such as what constitutes safety, how to measure it, and how to design effective attack and defense mechanisms in physically grounded, interactive settings. In this work, we present the first systematic study of adversarial safety attacks on embodied AI systems, grounded in ISO standards for human-robot interactions. We (1) formalize a principled taxonomy of safety violations (critical, dangerous, risky) based on physical constraints such as separation distance, velocity, and collision boundaries; (2) introduce ANNIEBench, a benchmark of nine safety-critical scenarios with 2,400 video-action sequences for evaluating embodied safety; and (3) propose ANNIE-Attack, a task-aware adversarial framework with an attack leader model that decomposes long-horizon goals into frame-level perturbations. Our evaluation across representative EAI models shows attack success rates exceeding 50% across all safety categories. We further demonstrate sparse and adaptive attack strategies and validate the real-world impact through physical robot experiments. These results expose a previously underexplored but highly consequential attack surface in embodied AI systems, highlighting the urgent need for security-driven defenses in the physical AI era. Code is available at this https URL.

[AI-9] Neural Field Turing Machine: A Differentiable Spatial Computer

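【Quick Read】: This paper asks how to unify symbolic computation, physical simulation, and perceptual inference in a single differentiable framework, bridging discrete algorithms and continuous field dynamics. Its key contribution is the Neural Field Turing Machine (NFTM), an architecture composed of a neural controller, a continuous memory field, and movable read/write heads. At each timestep it reads a local patch, computes an update via learned rules, writes the result back locally, and updates the head positions. Fixed-radius neighborhoods give NFTM linear O(N) scaling while it remains Turing complete under bounded error, which lets it learn and generalize across tasks ranging from cellular automata simulation to PDE solving and image inpainting.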

Link: https://arxiv.org/abs/2509.03370
Authors: Akash Malhotra,Nacéra Seghouani
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 11 Pages, 6 Figures

Abstract:We introduce the Neural Field Turing Machine (NFTM), a differentiable architecture that unifies symbolic computation, physical simulation, and perceptual inference within continuous spatial fields. NFTM combines a neural controller, continuous memory field, and movable read/write heads that perform local updates. At each timestep, the controller reads local patches, computes updates via learned rules, and writes them back while updating head positions. This design achieves linear O(N) scaling through fixed-radius neighborhoods while maintaining Turing completeness under bounded error. We demonstrate three example instantiations of NFTM: cellular automata simulation (Rule 110), physics-informed PDE solvers (2D heat equation), and iterative image refinement (CIFAR-10 inpainting). These instantiations learn local update rules that compose into global dynamics, exhibit stable long-horizon rollouts, and generalize beyond training horizons. NFTM provides a unified computational substrate bridging discrete algorithms and continuous field dynamics within a single differentiable framework.
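
To make the read/update/write loop concrete, here is a minimal, non-learned sketch of the Rule 110 instantiation: a head visits each cell, reads a fixed-radius (radius 1) patch, applies a local rule, and writes the result into the next field. In the paper the rule is learned by a neural controller; here the standard Rule 110 lookup table is hard-coded. The names `RULE_110`, `step`, and `rollout`, and the periodic boundary, are assumptions made for illustration.

```python
# Hard-coded Rule 110 lookup: (left, center, right) -> next center value.
# In NFTM this local rule would be learned; the table is the standard Rule 110.
RULE_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}


def step(field):
    """One synchronous update: read a radius-1 patch per cell, write the new value.

    Periodic boundaries are an assumption of this sketch. The cost is O(N) per
    step because every cell reads a fixed-size neighborhood.
    """
    n = len(field)
    new_field = [0] * n
    for head in range(n):  # the head position sweeps the field
        patch = (field[(head - 1) % n], field[head], field[(head + 1) % n])
        new_field[head] = RULE_110[patch]  # local rule, local write
    return new_field


def rollout(field, steps):
    """Apply `step` repeatedly and return the full history of fields."""
    history = [list(field)]
    for _ in range(steps):
        field = step(field)
        history.append(field)
    return history
```

Because each update touches only a constant-size neighborhood, doubling the field length doubles the work per step, which is the linear scaling the abstract refers to.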

[AI-10] Fair Resource Allocation for Fleet Intelligence

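【Quick Read】: This paper addresses unfair and inefficient resource allocation in cloud-assisted multi-agent intelligence, where traditional methods often overlook agents' differing computational capabilities and complex operating environments. The key to its solution is Fair-Synergy, an open-source algorithmic framework that exploits the concave relationship between agent accuracy and system resources to allocate resources fairly across a multidimensional machine learning utility landscape (defined by model parameters, training data volume, and task complexity), improving both the performance and the fairness of fleet intelligence.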

Link: https://arxiv.org/abs/2509.03353
Authors: Oguzhan Baser,Kaan Kale,Po-han Li,Sandeep Chinchali
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation at the 2025 IEEE Global Communications Conference (GLOBECOM 2025)

Abstract:Resource allocation is crucial for the performance optimization of cloud-assisted multi-agent intelligence. Traditional methods often overlook agents’ diverse computational capabilities and complex operating environments, leading to inefficient and unfair resource distribution. To address this, we open-sourced Fair-Synergy, an algorithmic framework that utilizes the concave relationship between the agents’ accuracy and the system resources to ensure fair resource allocation across fleet intelligence. We extend traditional allocation approaches to encompass a multidimensional machine learning utility landscape defined by model parameters, training data volume, and task complexity. We evaluate Fair-Synergy with advanced vision and language models such as BERT, VGG16, MobileNet, and ResNets on datasets including MNIST, CIFAR-10, CIFAR-100, BDD, and GLUE. We demonstrate that Fair-Synergy outperforms standard benchmarks by up to 25% in multi-agent inference and 11% in multi-agent learning settings. Also, we explore how the level of fairness affects the least advantaged, most advantaged, and average agents, providing insights for equitable fleet intelligence.

[AI-11] epiGPTope: A machine learning-based epitope generator and classifier

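【Quick Read】: This paper tackles a central difficulty in the rational design of synthetic epitope libraries: the combinatorial space of linear epitopes is enormous (20^n for a sequence of n amino acids), which makes screening and experimental testing infeasible even with high-throughput techniques. The key to its solution is epiGPTope, a language model pre-trained on protein data and fine-tuned on linear epitopes that directly generates novel epitope-like sequences with statistical properties analogous to those of known epitopes. Statistical classifiers then predict whether an epitope is of bacterial or viral origin, narrowing the candidate library and raising the likelihood of identifying specific epitopes. The approach uses only primary amino acid sequences, with no geometric framework or hand-crafted features, and the authors anticipate faster and more cost-effective generation and screening of synthetic epitopes for immunotherapy, vaccine, and diagnostic development.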

Link: https://arxiv.org/abs/2509.03351
Authors: Natalia Flechas Manrique,Alberto Martínez,Elena López-Martínez,Luc Andrea,Román Orus,Aitor Manteca,Aitziber L. Cortajarena,Llorenç Espinosa-Portalés
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 4 figures. Supplementary Information with 5 pages, 4 figures

Abstract:Epitopes are short antigenic peptide sequences which are recognized by antibodies or immune cell receptors. These are central to the development of immunotherapies, vaccines, and diagnostics. However, the rational design of synthetic epitope libraries is challenging due to the large combinatorial sequence space, 20^n combinations for linear epitopes of n amino acids, making screening and testing unfeasible, even with high throughput experimental techniques. In this study, we present a large language model, epiGPTope, pre-trained on protein data and specifically fine-tuned on linear epitopes, which for the first time can directly generate novel epitope-like sequences, which are found to possess statistical properties analogous to the ones of known epitopes. This generative approach can be used to prepare libraries of epitope candidate sequences. We further train statistical classifiers to predict whether an epitope sequence is of bacterial or viral origin, thus narrowing the candidate library and increasing the likelihood of identifying specific epitopes. We propose that such combination of generative and predictive models can be of assistance in epitope discovery. The approach uses only primary amino acid sequences of linear epitopes, bypassing the need for a geometric framework or hand-crafted features of the sequences. By developing a method to create biologically feasible sequences, we anticipate faster and more cost-effective generation and screening of synthetic epitopes, with relevant applications in the development of new biotechnologies.
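
The size of the search space follows directly from the 20 standard amino acids. The short calculation below shows how quickly 20^n outgrows any screening budget; the epitope lengths are chosen purely for illustration and are not taken from the paper.

```python
NUM_AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def sequence_space_size(n):
    """Number of distinct linear peptides of length n over 20 amino acids."""
    return NUM_AMINO_ACIDS ** n


for n in (5, 10, 15):  # illustrative lengths, not from the paper
    print(f"n={n:2d}: {sequence_space_size(n):.3e} possible sequences")
```

Each additional residue multiplies the space by 20, which is why a generative model that proposes a focused candidate library is attractive compared with exhaustive enumeration.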

[AI-12] On the MIA Vulnerability Gap Between Private GANs and Diffusion Models

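【Quick Read】: This paper examines how differentially private (DP) generative models, specifically generative adversarial networks (GANs) and diffusion models, differ in their privacy risk under membership inference attacks (MIAs). Both model families can be trained under DP to protect sensitive data, but their sensitivity to MIAs has remained poorly understood. The key to the paper's approach is a unified theoretical and empirical analysis. A stability-based analysis first shows that GANs are fundamentally less sensitive to data perturbations than diffusion models, suggesting a structural advantage in resisting MIAs. A standardized MIA pipeline then confirms this empirically: GANs show markedly greater privacy robustness than diffusion models even in strong DP regimes, indicating that model type alone can critically shape privacy leakage.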

Link: https://arxiv.org/abs/2509.03341
Authors: Ilana Sebag,Jean-Yves Franceschi,Alain Rakotomamonjy,Alexandre Allauzen,Jamal Atif
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis. While both can be trained under differential privacy (DP) to protect sensitive data, their sensitivity to membership inference attacks (MIAs), a key threat to data confidentiality, remains poorly understood. In this work, we present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models. We begin by showing, through a stability-based analysis, that GANs exhibit fundamentally lower sensitivity to data perturbations than diffusion models, suggesting a structural advantage in resisting MIAs. We then validate this insight with a comprehensive empirical study using a standardized MIA pipeline to evaluate privacy leakage across datasets and privacy budgets. Our results consistently reveal a marked privacy robustness gap in favor of GANs, even in strong DP regimes, highlighting that model type alone can critically shape privacy leakage.

[AI-13] Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

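【Quick Read】: This paper addresses the difficulty of modeling multistability caused by symmetry breaking in nonlinear dynamical systems. Deterministic machine learning models struggle to capture multiple coexisting stable solutions and tend to average over them, missing lower-symmetry outcomes. The key to its solution is a generative framework based on flow matching that preserves system symmetries through equivariant modeling and introduces a symmetric matching strategy, which aligns predicted and target outputs under group actions so that multimodal distributions can be learned accurately in equivariant settings. Validation on systems ranging from toy models to complex physical problems (buckling beams and the Allen-Cahn equation) shows that flow matching significantly outperforms non-probabilistic and variational methods in capturing multistability and symmetry-breaking bifurcations in high-dimensional systems.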

Link: https://arxiv.org/abs/2509.03340
Authors: Fleur Hendriks,Ondřej Rokoš,Martin Doškář,Marc G.D. Geers,Vlado Menkovski
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
Comments: 12 pages, 7 figures including appendices

Abstract:Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models struggle to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we propose a generative framework based on flow matching to model the full probability distribution over bifurcation outcomes. Our method enables direct sampling of multiple valid solutions while preserving system symmetries through equivariant modeling. We introduce a symmetric matching strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from toy models to complex physical problems such as buckling beams and the Allen-Cahn equation. Our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods in capturing multimodal distributions and symmetry-breaking bifurcations, offering a principled and scalable solution for modeling multistability in high-dimensional systems.
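
The abstract does not spell out how the symmetric matching strategy is implemented. One plausible reading, sketched here under that assumption, is to pick the group-transformed copy of the target that lies closest to the prediction before computing a loss, so that symmetric solutions are not penalized against each other. The two-element sign-flip group (as for a beam that can buckle up or down) and the names `sign_flip_group` and `symmetric_match` are hypothetical.

```python
def sign_flip_group():
    """The two-element group {identity, negation} (hypothetical example symmetry)."""
    return [lambda x: list(x), lambda x: [-xi for xi in x]]


def symmetric_match(pred, target, group):
    """Return the group-transformed target closest to pred in squared L2 distance."""

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    candidates = [g(target) for g in group]
    return min(candidates, key=lambda c: sq_dist(pred, c))


# A prediction near the "up" solution is aligned with the "up" copy of the target.
aligned = symmetric_match([0.9, 1.1], [-1.0, -1.0], sign_flip_group())
```

For larger or continuous groups the exhaustive search over group elements would need to be replaced by something cheaper; this sketch only covers small finite groups.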

[AI-14] app.build: A Production Framework for Scaling Agentic Prompt-to-App Generation with Environment Scaffolding

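【Quick Read】: This paper addresses a key challenge for large language models (LLMs) in practical application generation: how to improve the reliability and quality of generated applications through systematic validation and structured environments. The core of its solution is an open-source framework that combines a model-agnostic architecture, multi-layered validation pipelines, and stack-specific orchestration, implemented across three reference stacks. Evaluation on 30 generation tasks shows a 73.3% viability rate, and open-weights models reach 80.8% of closed-model performance when given structured environments, supporting the claim that scaling reliable AI agents requires scaling environments, not just models.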

Link: https://arxiv.org/abs/2509.03310
Authors: Evgenii Kniazev,Arseny Kravchenko,Igor Rekun,James Broadhead,Nikita Shamgunov,Pranav Sah,Pratik Nichite,Ivan Yamshchikov
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:We present app.build (this https URL), an open-source framework that improves LLM-based application generation through systematic validation and structured environments. Our approach combines multi-layered validation pipelines, stack-specific orchestration, and model-agnostic architecture, implemented across three reference stacks. Through evaluation on 30 generation tasks, we demonstrate that comprehensive validation achieves 73.3% viability rate with 30% reaching perfect quality scores, while open-weights models achieve 80.8% of closed-model performance when provided structured environments. The open-source framework has been adopted by the community, with over 3,000 applications generated to date. This work demonstrates that scaling reliable AI agents requires scaling environments, not just models, providing empirical insights and complete reference implementations for production-oriented agent systems.

[AI-15] Automatic Differentiation of Agent-Based Models

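【Quick Read】: This paper addresses two obstacles that agent-based models (ABMs) face when simulating complex systems: high computational cost, especially with thousands or even millions of agents, and calibration over many free parameters, which is inefficient and hard to carry out. The key to its solution is automatic differentiation (AD), which makes simulator gradients readily available and thereby greatly facilitates tasks such as parameter calibration and sensitivity analysis. The paper further combines AD with variational inference (VI) and demonstrates performance improvements and computational savings on three ABMs (Axtell's model of firms, Sugarscape, and the SIR epidemiological model), improving the scalability and practicality of ABMs.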

Link: https://arxiv.org/abs/2509.03303
Authors: Arnau Quera-Bofarull,Nicholas Bishop,Joel Dyer,Daniel Jarne Ornia,Anisoara Calinescu,Doyne Farmer,Michael Wooldridge
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract:Agent-based models (ABMs) simulate complex systems by capturing the bottom-up interactions of individual agents comprising the system. Many complex systems of interest, such as epidemics or financial markets, involve thousands or even millions of agents. Consequently, ABMs often become computationally demanding and rely on the calibration of numerous free parameters, which has significantly hindered their widespread adoption. In this paper, we demonstrate that automatic differentiation (AD) techniques can effectively alleviate these computational burdens. By applying AD to ABMs, the gradients of the simulator become readily available, greatly facilitating essential tasks such as calibration and sensitivity analysis. Specifically, we show how AD enables variational inference (VI) techniques for efficient parameter calibration. Our experiments demonstrate substantial performance improvements and computational savings using VI on three prominent ABMs: Axtell’s model of firms; Sugarscape; and the SIR epidemiological model. Our approach thus significantly enhances the practicality and scalability of ABMs for studying complex systems.
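
As a rough illustration of what "gradients of the simulator" means, the sketch below differentiates a simulation with respect to one parameter using forward-mode AD implemented with dual numbers. It is deliberately simplified: the simulator is a deterministic, aggregate SIR recurrence, not an agent-level model, and the stochastic, discrete agent updates in real ABMs need additional techniques that are not shown. The `Dual` class, `final_infected`, `grad_wrt_beta`, and all parameter values are assumptions made for this example.

```python
class Dual:
    """A value paired with its derivative, for forward-mode AD."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.der + o.der)

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __rsub__(self, other):
        return self._lift(other) - self

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)

    __rmul__ = __mul__


def final_infected(beta, gamma=0.1, steps=50, s0=0.99, i0=0.01):
    """Deterministic discrete-time SIR on population fractions; returns I after `steps`."""
    s, i = s0, i0
    for _ in range(steps):
        new_inf = beta * s * i  # new infections this step
        new_rec = gamma * i     # new recoveries this step
        s, i = s - new_inf, i + new_inf - new_rec
    return i


def grad_wrt_beta(beta):
    """d(final infected fraction)/d(beta) from one forward pass with dual numbers."""
    return final_infected(Dual(beta, 1.0)).der
```

A gradient like `grad_wrt_beta(0.3)` can drive gradient-based calibration directly, instead of the repeated simulator runs that a finite-difference or derivative-free search would need.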

[AI-16] A Comprehensive Guide to Differential Privacy: From Theory to User Expectations

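【Quick Read】: This paper addresses the growing privacy risks that accompany the widespread use of personal data, particularly in light of powerful re-identification attacks and legal and ethical demands for responsible data use. Its key contribution is a systematic survey of differential privacy (DP), a mathematically rigorous privacy framework, covering its theoretical foundations, practical mechanisms, and real-world applications. It focuses on algorithmic tools and challenges in privacy-preserving machine learning and synthetic data generation, and it stresses the need for better usability, transparency, and communication in DP systems, so that researchers and practitioners can make informed decisions in an evolving data privacy landscape.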

Link: https://arxiv.org/abs/2509.03294
Authors: Napsu Karmitsa,Antti Airola,Tapio Pahikkala,Tinja Pitkämäki
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The increasing availability of personal data has enabled significant advances in fields such as machine learning, healthcare, and cybersecurity. However, this data abundance also raises serious privacy concerns, especially in light of powerful re-identification attacks and growing legal and ethical demands for responsible data use. Differential privacy (DP) has emerged as a principled, mathematically grounded framework for mitigating these risks. This review provides a comprehensive survey of DP, covering its theoretical foundations, practical mechanisms, and real-world applications. It explores key algorithmic tools and domain-specific challenges - particularly in privacy-preserving machine learning and synthetic data generation. The report also highlights usability issues and the need for improved communication and transparency in DP systems. Overall, the goal is to support informed adoption of DP by researchers and practitioners navigating the evolving landscape of data privacy.
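
The abstract does not name specific mechanisms. As one standard example of a practical DP mechanism, the Laplace mechanism releases a query result plus Laplace noise with scale equal to the query's sensitivity divided by epsilon. The sketch below applies it to a counting query, which has sensitivity 1. The function names are illustrative, and naive floating-point sampling like this is known to be unsafe for production use; it is shown only to convey the idea.

```python
import math
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def sample_laplace(scale, rng=random):
    """Draw one Laplace(0, scale) sample via inverse-CDF sampling."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # Guard against log(0) at the boundary u == -0.5.
    magnitude = max(1.0 - 2.0 * abs(u), 1e-300)
    return -scale * math.copysign(1.0, u) * math.log(magnitude)


def private_count(values, predicate, epsilon, rng=random):
    """Epsilon-DP count of items satisfying `predicate` (sensitivity 1)."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + sample_laplace(laplace_scale(1.0, epsilon), rng)
```

Smaller epsilon means a larger noise scale and stronger privacy, which is the accuracy-versus-privacy trade-off that the usability discussion in such surveys typically centers on.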

[AI-17] Accountability Framework for Healthcare AI Systems: Towards Joint Accountability in Decision Making AAAI

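【Quick Read】: This paper addresses the ambiguity surrounding accountability for AI systems in healthcare decision making and its disconnect from practice: existing regulatory guidelines focus on the "what" that should be done and offer little guidance on the "how", so actors with different professional backgrounds struggle to understand and implement accountability consistently. The key to its solution is an integrated accountability framework together with a three-tier structure for categorizing and handling accountability mechanisms. The framework places the regulations governing healthcare AI systems and the mechanisms adopted by the actors under a consistent accountability regime, and it highlights the role of explainability in fostering communication and information sharing, promoting joint accountability and collaborative decision making.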

Link: https://arxiv.org/abs/2509.03286
Authors: Prachi Bagave,Marcus Westberg,Marijn Janssen,Aaron Yi Ding
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: To be published in AAAI AIES 2025

Abstract:AI is transforming the healthcare domain and is increasingly helping practitioners to make health-related decisions. Therefore, accountability becomes a crucial concern for critical AI-driven decisions. Although regulatory bodies, such as the EU Commission, provide guidelines, they are high-level and focus on the "what" that should be done and less on the "how", creating a knowledge gap for actors. Through an extensive analysis, we found that the term accountability is perceived and dealt with in many different ways, depending on the actor's expertise and domain of work. With increasing concerns about AI accountability issues and the ambiguity around this term, this paper bridges the gap between the "what" and "how" of AI accountability, specifically for AI systems in healthcare. We do this by analysing the concept of accountability, formulating an accountability framework, and providing a three-tier structure for handling various accountability mechanisms. Our accountability framework positions the regulations of healthcare AI systems and the mechanisms adopted by the actors under a consistent accountability regime. Moreover, the three-tier structure guides the actors of the healthcare AI system to categorise the mechanisms based on their conduct. Through our framework, we advocate that decision-making in healthcare AI holds shared dependencies, where accountability should be dealt with jointly and should foster collaborations. We highlight the role of explainability in instigating communication and information sharing between the actors to further facilitate the collaborative process.

[AI-18] Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial

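【Quick Read】: This paper addresses the trade-off between efficiency and performance when training large-scale deep learning models, in particular the loss of efficiency that comes with using large numbers of GPUs to speed up training. The key to its approach is an analysis of the training times reported by MLPerf Training v4.1 on four workloads (BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion), which identifies configurations that optimise the relationship between performance, GPU usage, and efficiency and points to a break-even point at which training time can be reduced while efficiency is maximised.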

Link: https://arxiv.org/abs/2509.03263
Authors: David Cortes,Carlos Juiz,Belen Bermejo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 8 pages, in Spanish language, 8 figures, Conference at SARTECO 2025, Spain

Abstract:Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In this article, we present a detailed analysis of the times reported by MLPerf Training v4.1 on four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion, showing that there are configurations that optimise the relationship between performance, GPU usage, and efficiency. The results point to a break-even point that allows training times to be reduced while maximising efficiency.
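
The abstract does not define its efficiency measure. Assuming the usual parallel-computing definitions (speedup is baseline time divided by new time, and efficiency is speedup divided by the ideal linear speedup), the trade-off can be sketched as follows. The training times are hypothetical numbers invented for this example, not MLPerf results.

```python
def speedup(base_time, time_n):
    """Speedup relative to the baseline configuration."""
    return base_time / time_n


def efficiency(base_time, base_gpus, time_n, gpus_n):
    """Parallel efficiency: achieved speedup divided by the ideal (linear) speedup."""
    ideal = gpus_n / base_gpus
    return speedup(base_time, time_n) / ideal


# Hypothetical training times in minutes, keyed by GPU count (NOT MLPerf results).
times = {8: 120.0, 16: 66.0, 32: 40.0, 64: 30.0}
base_gpus = 8
for gpus, t in times.items():
    s = speedup(times[base_gpus], t)
    e = efficiency(times[base_gpus], base_gpus, t, gpus)
    print(f"{gpus:3d} GPUs: speedup {s:.2f}x, efficiency {e:.0%}")
```

In data shaped like this, training time keeps falling as GPUs are added while efficiency drops, so a break-even configuration is one beyond which extra GPUs buy little additional speedup.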

[AI-19] HyPV-LEAD: Proactive Early-Warning of Cryptocurrency Anomalies through Data-Driven Structural-Temporal Modeling

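【Quick Read】: This paper addresses the detection of abnormal cryptocurrency transactions (such as mixing services, fraudulent transfers, and pump-and-dump operations), which are hard to detect because of class imbalance, temporal volatility, and complex network dependencies. Existing methods are mostly model-centric and post hoc: they flag anomalies only after they occur and so offer limited preventive value. The key to the solution is the HyPV-LEAD framework, which provides proactive early warning through three innovations: (1) window-horizon modeling to guarantee actionable lead-time alerts; (2) Peak-Valley (PV) sampling to mitigate class imbalance while preserving temporal continuity; and (3) hyperbolic embedding to capture the hierarchical and scale-free properties of blockchain transaction networks. On large-scale Bitcoin transaction data the method outperforms state-of-the-art baselines with a PR-AUC of 0.9624, supporting its use in real-time risk management, anti-money laundering (AML) compliance, and blockchain financial security.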

Link: https://arxiv.org/abs/2509.03260
Authors: Minjung Park,Gyuyeon Na,Soyoun Kim,Sunyoung Moon,HyeonJeong Cha,Sangmi Chai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)
Comments:

Abstract:Abnormal cryptocurrency transactions, such as mixing services, fraudulent transfers, and pump-and-dump operations, pose escalating risks to financial integrity but remain notoriously difficult to detect due to class imbalance, temporal volatility, and complex network dependencies. Existing approaches are predominantly model-centric and post hoc, flagging anomalies only after they occur and thus offering limited preventive value. This paper introduces HyPV-LEAD (Hyperbolic Peak-Valley Lead-time Enabled Anomaly Detection), a data-driven early-warning framework that explicitly incorporates lead time into anomaly detection. Unlike prior methods, HyPV-LEAD integrates three innovations: (1) window-horizon modeling to guarantee actionable lead-time alerts, (2) Peak-Valley (PV) sampling to mitigate class imbalance while preserving temporal continuity, and (3) hyperbolic embedding to capture the hierarchical and scale-free properties of blockchain transaction networks. Empirical evaluation on large-scale Bitcoin transaction data demonstrates that HyPV-LEAD consistently outperforms state-of-the-art baselines, achieving a PR-AUC of 0.9624 with significant gains in precision and recall. Ablation studies further confirm that each component (PV sampling, hyperbolic embedding, and structural-temporal modeling) provides complementary benefits, with the full framework delivering the highest performance. By shifting anomaly detection from reactive classification to proactive early-warning, HyPV-LEAD establishes a robust foundation for real-time risk management, anti-money laundering (AML) compliance, and financial security in dynamic blockchain environments.
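
The abstract does not say which model of hyperbolic space is used. As an illustration of why hyperbolic embeddings suit hierarchical, scale-free networks, the sketch below computes distance in the Poincaré ball, a common choice and an assumption here: distances grow rapidly near the boundary, which leaves room for tree-like branching. The function name is illustrative.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between points u and v inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Two point pairs with the same Euclidean gap of 0.1, one near the origin
# and one near the boundary, are very different distances apart.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
```

The second distance is several times larger than the first, which is how the embedding can place many leaf nodes far apart in a small Euclidean region.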

[AI-20] Structure Transfer: an Inference-Based Calculus for the Transformation of Representations

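【Quick Read】: This paper addresses the fundamental problem of how to devise representational-system (RS) agnostic techniques that drive representation transformation and choice. The key to its solution is a novel calculus called structure transfer, which generates a target representation in a different RS while ensuring that the source and target representations satisfy any specified relation (such as semantic equivalence). Its core mechanism uses schemas, which encode knowledge about the preservation of information across RSs, to guide the construction of the target representation's structure so that the desired relation holds. The method is formalised using the concept of a construction space from Representational Systems Theory, which is general enough to model RSs of diverse kinds, including formal languages, geometric figures, and informal notations, making the calculus broadly applicable and system-agnostic.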

Link: https://arxiv.org/abs/2509.03249
Authors: Daniel Raggi,Gem Stapleton,Mateja Jamnik,Aaron Stockdill,Grecia Garcia Garcia,Peter C-H. Cheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Representation choice is of fundamental importance to our ability to communicate and reason effectively. A major unsolved problem, addressed in this paper, is how to devise representational-system (RS) agnostic techniques that drive representation transformation and choice. We present a novel calculus, called structure transfer, that enables representation transformation across diverse RSs. Specifically, given a source representation drawn from a source RS, the rules of structure transfer allow us to generate a target representation for a target RS. The generality of structure transfer comes in part from its ability to ensure that the source representation and the generated target representation satisfy any specified relation (such as semantic equivalence). This is done by exploiting schemas, which encode knowledge about RSs. Specifically, schemas can express preservation of information across relations between any pair of RSs, and this knowledge is used by structure transfer to derive a structure for the target representation which ensures that the desired relation holds. We formalise this using Representational Systems Theory \cite{raggi2022rst}, building on the key concept of a construction space. The abstract nature of construction spaces grants them the generality to model RSs of diverse kinds, including formal languages, geometric figures and diagrams, as well as informal notations. Consequently, structure transfer is a system-agnostic calculus that can be used to identify alternative representations in a wide range of practical settings.

[AI-21] FoMEMO: Towards Foundation Models for Expensive Multi-objective Optimization

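【Quick Read】: This paper addresses sample efficiency in expensive multi-objective optimization, where a limited evaluation budget makes it hard to recover the Pareto front accurately for decision making. Existing methods either rebuild Gaussian process surrogates from scratch for every new problem or rely on extensive past domain experiments to pre-train deep learning models, so they generalize poorly and adapt badly to emerging applications. The key to the solution is a new paradigm named FoMEMO (Foundation Models for Expensive Multi-objective Optimization): a foundation model pre-trained on hundreds of millions of synthetic data points is conditioned on any domain trajectory and user preference, and its predicted preference-wise aggregation posteriors enable fast in-context optimization. The model therefore adapts to unknown problems without any subsequent training, showing superior generality and competitive performance.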

Link: https://arxiv.org/abs/2509.03244
Authors: Yiming Yao,Fei Liu,Liang Zhao,Xi Lin,Qingfu Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Expensive multi-objective optimization is a prevalent and crucial concern in many real-world scenarios, where sample-efficiency is vital due to the limited evaluations to recover the true Pareto front for decision making. Existing works either involve rebuilding Gaussian process surrogates from scratch for each objective in each new problem encountered, or rely on extensive past domain experiments for pre-training deep learning models, making them hard to generalize and impractical to cope with various emerging applications in the real world. To address this issue, we propose a new paradigm named FoMEMO (Foundation Models for Expensive Multi-objective Optimization), which enables the establishment of a foundation model conditioned on any domain trajectory and user preference, and facilitates fast in-context optimization based on the predicted preference-wise aggregation posteriors. Rather than accessing extensive domain experiments in the real world, we demonstrate that pre-training the foundation model with a diverse set of hundreds of millions of synthetic data can lead to superior adaptability to unknown problems, without necessitating any subsequent model training or updates in the optimization process. We evaluate our method across a variety of synthetic benchmarks and real-world applications, and demonstrate its superior generality and competitive performance compared to existing methods.

[AI-22] Evaluation of Stress Detection as Time Series Events – A Novel Window-Based F1-Metric

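【Quick Read】: This paper addresses the problem that standard evaluation metrics for time series event detection, such as F1 and point-adjusted F1 (F1_pa), misrepresent model performance on real-world, imbalanced datasets. The core difficulty is that the underlying physiological events are gradual and temporally diffused while annotations are usually single-point events, so metrics that demand exact alignment cannot reflect a model's true ability. The key to the solution is a window-based F1 metric (F1_w) that incorporates temporal tolerance when matching predictions to annotations, giving a more robust assessment of event detection. The window size can be adapted to domain knowledge to avoid overestimation. The metric is evaluated on three physiological datasets (two in-the-wild, ADARP and Wrist Angel, and one experimental, ROAD); with predictions from TimesFM, only temporally tolerant metrics reveal statistically significant improvements over random and null baselines in the two in-the-wild use cases.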

Link: https://arxiv.org/abs/2509.03240
Authors: Harald Vilhelm Skat-Rørdam,Sneha Das,Kathrine Sofie Rasmussen,Nicole Nadine Lønfeldt,Line Clemmensen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: 15 pages, 6 figures

Abstract:Accurate evaluation of event detection in time series is essential for applications such as stress monitoring with wearable devices, where ground truth is typically annotated as single-point events, even though the underlying phenomena are gradual and temporally diffused. Standard metrics like F1 and point-adjusted F1 (F1_pa) often misrepresent model performance in such real-world, imbalanced datasets. We introduce a window-based F1 metric (F1_w) that incorporates temporal tolerance, enabling a more robust assessment of event detection when exact alignment is unrealistic. Empirical analysis in three physiological datasets, two in-the-wild (ADARP, Wrist Angel) and one experimental (ROAD), indicates that F1_w reveals meaningful model performance patterns invisible to conventional metrics, while its window size can be adapted to domain knowledge to avoid overestimation. We show that the choice of evaluation metric strongly influences the interpretation of model performance: using predictions from TimesFM, only our temporally tolerant metrics reveal statistically significant improvements over random and null baselines in the two in-the-wild use cases. This work addresses key gaps in time series evaluation and provides practical guidance for healthcare applications where requirements for temporal precision vary by context.
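
The abstract does not give the exact definition of the window-based metric. One plausible reading of a temporally tolerant F1, sketched here as an assumption and not as the paper's formula, counts a predicted event as a true positive when it falls within `window` time steps of a ground-truth event that has not already been matched. The function name and the greedy one-to-one matching are choices made for this example.

```python
def window_f1(true_events, pred_events, window):
    """F1 where a prediction is a hit if it lies within `window` time steps
    of a not-yet-matched ground-truth event (greedy, one-to-one matching)."""
    unmatched = sorted(true_events)
    tp = 0
    for p in sorted(pred_events):
        # Closest ground-truth event that is still unmatched, if any.
        best = min(unmatched, key=lambda t: abs(t - p), default=None)
        if best is not None and abs(best - p) <= window:
            tp += 1
            unmatched.remove(best)
    fp = len(pred_events) - tp
    fn = len(true_events) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With ground-truth events at time steps 10 and 50 and predictions at 12, 48, and 90, a window of 0 scores 0.0 while a window of 3 scores 0.8, which shows how strongly the tolerance changes the verdict on the same predictions.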

[AI-23] Uncertainty-driven Adaptive Exploration

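【Quick Read】: This paper addresses the question of when adaptive exploration methods should switch between exploration and exploitation, a choice that is critical in domains that require learning long and complex sequences of actions. The key to its solution is a generic adaptive exploration framework driven by uncertainty. Any uncertainty-measuring mechanism can be plugged in (for example, those used in intrinsic motivation or in epistemic-uncertainty-based exploration), giving a principled way to decide between exploration and exploitation and yielding adaptive exploration strategies that outperform standard ones.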

Link: https://arxiv.org/abs/2509.03219
Authors: Leonidas Bakopoulos,Georgios Chalkiadakis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Adaptive exploration methods propose ways to learn complex policies via alternating between exploration and exploitation. An important question for such methods is to determine the appropriate moment to switch between exploration and exploitation and vice versa. This is critical in domains that require the learning of long and complex sequences of actions. In this work, we present a generic adaptive exploration framework that employs uncertainty to address this important issue in a principled manner. Our framework includes previous adaptive exploration approaches as special cases. Moreover, we can incorporate in our framework any uncertainty-measuring mechanism of choice, for instance mechanisms used in intrinsic motivation or epistemic uncertainty-based exploration methods. We experimentally demonstrate that our framework gives rise to adaptive exploration strategies that outperform standard ones across several MuJoCo environments.

[AI-24] Autonomous Learning From Success and Failure: Goal-Conditioned Supervised Learning with Negative Feedback

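【Quick Read】: This paper addresses two key limitations of the Goal-Conditioned Supervised Learning (GCSL) framework: learning exclusively from self-generated experience can exacerbate the agent's inherent biases, and the relabelling strategy makes the agent focus solely on successful outcomes, so it cannot learn from its mistakes. The core of the solution is to integrate contrastive learning principles into the GCSL framework so that the agent learns from both success and failure, which mitigates initial biases, encourages more exploratory behavior, and ultimately improves policy learning and performance in challenging environments.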

Link: https://arxiv.org/abs/2509.03206
Authors: Zeqiang Zhang,Fabian Wurzberger,Gerrit Schmid,Sebastian Gottwald,Daniel A. Braun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning faces significant challenges when applied to tasks characterized by sparse reward structures. Although imitation learning, within the domain of supervised learning, offers faster convergence, it relies heavily on human-generated demonstrations. Recently, Goal-Conditioned Supervised Learning (GCSL) has emerged as a potential solution by enabling self-imitation learning for autonomous systems. By strategically relabelling goals, agents can derive policy insights from their own experiences. Despite the successes of this framework, it presents two notable limitations: (1) Learning exclusively from self-generated experiences can exacerbate the agents’ inherent biases; (2) The relabelling strategy allows agents to focus solely on successful outcomes, precluding them from learning from their mistakes. To address these issues, we propose a novel model that integrates contrastive learning principles into the GCSL framework to learn from both success and failure. Through empirical evaluations, we demonstrate that our algorithm overcomes limitations imposed by agents’ initial biases and thereby enables more exploratory behavior. This facilitates the identification and adoption of effective policies, leading to superior performance across a variety of challenging environments.
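
The goal relabelling step that GCSL builds on can be sketched briefly: every state the agent actually reached later in a trajectory is treated, in hindsight, as if it had been the goal, which turns the agent's own experience into supervised examples. This sketch covers only that baseline relabelling; the contrastive use of failures proposed in the paper is not specified in the abstract and is not shown. The function name and tuple layout are assumptions.

```python
def relabel_trajectory(states, actions):
    """Hindsight relabelling: treat every later-reached state as the goal.

    `states` has length T + 1 and `actions` has length T. Returns tuples
    (state, goal, action) usable as supervised (input, target) pairs.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    examples = []
    for t, action in enumerate(actions):
        for future in range(t + 1, len(states)):
            examples.append((states[t], states[future], action))
    return examples


# A two-step trajectory yields three relabelled training examples.
examples = relabel_trajectory(["s0", "s1", "s2"], ["a0", "a1"])
```

Every example produced this way is a success by construction, which is the limitation the abstract points to: the agent never sees what it should avoid.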

[AI-25] Rashomon in the Streets: Explanation Ambiguity in Scene Understanding AAAI2025

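【Quick Read】: This paper addresses the reliability of explainable AI (XAI) in safety-critical settings such as autonomous driving, where the Rashomon effect means that multiple, equally accurate models can offer divergent explanations for the same prediction. The key to its approach is the first empirical quantification of this effect for action prediction in real-world driving scenes. Using Qualitative Explainable Graphs (QXGs) as a symbolic scene representation, it trains Rashomon sets of two distinct model classes, interpretable pair-based gradient boosting models and complex Graph Neural Networks (GNNs), and uses feature attribution methods to measure the agreement of explanations within and between the classes. The results suggest that explanation ambiguity is an inherent property of the problem, not just a modeling artifact.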

Link: https://arxiv.org/abs/2509.03169
Authors: Helge Spieker,Jørn Eirik Betten,Arnaud Gotlieb,Nadjib Lazaar,Nassim Belmecheri
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAAI 2025 Fall Symposium: AI Trustworthiness and Risk Assessment for Challenged Contexts (ATRACC)

Abstract:Explainable AI (XAI) is essential for validating and trusting models in safety-critical applications like autonomous driving. However, the reliability of XAI is challenged by the Rashomon effect, where multiple, equally accurate models can offer divergent explanations for the same prediction. This paper provides the first empirical quantification of this effect for the task of action prediction in real-world driving scenes. Using Qualitative Explainable Graphs (QXGs) as a symbolic scene representation, we train Rashomon sets of two distinct model classes: interpretable, pair-based gradient boosting models and complex, graph-based Graph Neural Networks (GNNs). Using feature attribution methods, we measure the agreement of explanations both within and between these classes. Our results reveal significant explanation disagreement. Our findings suggest that explanation ambiguity is an inherent property of the problem, not just a modeling artifact.

[AI-26] Decentralised self-organisation of pivoting cube ensembles using geometric deep learning

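【Quick Read】: This paper addresses the autonomous reconfiguration of homogeneous pivoting cube modular robots in two dimensions, that is, how to reach a target shape through local control. The key to its solution is that each module is controlled by a neural network that receives information only from its local neighborhood and is trained with reinforcement learning, while geometric deep learning is used to build the grid symmetries of the cube ensemble into the network architecture. Near-optimal reconfiguration is achieved with only nearest-neighbor interactions by passing information between cubes multiple times, which lets individual modules accumulate more global information about the ensemble.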

链接: https://arxiv.org/abs/2509.03140
作者: Nadezhda Dobreva,Emmanuel Blazquez,Jai Grover,Dario Izzo,Yuzhen Qin,Dominik Dold
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

Abstract:We present a decentralized model for autonomous reconfiguration of homogeneous pivoting cube modular robots in two dimensions. Each cube in the ensemble is controlled by a neural network that only gains information from other cubes in its local neighborhood, trained using reinforcement learning. Furthermore, using geometric deep learning, we include the grid symmetries of the cube ensemble in the neural network architecture. We find that even the most localized versions succeed in reconfiguring to the target shape, although reconfiguration happens faster the more information about the whole ensemble is available to individual cubes. Near-optimal reconfiguration is achieved with only nearest neighbor interactions by using multiple information passing between cubes, allowing them to accumulate more global information about the ensemble. Compared to standard neural network architectures, using geometric deep learning approaches provided only minor benefits. Overall, we successfully demonstrate mostly local control of a modular self-assembling system, which is transferable to other space-relevant systems with different action spaces, such as sliding cube modular robots and CubeSat swarms.

[AI-27] A Neural Network Approach to Multi-radionuclide TDCR Beta Spectroscopy

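[Quick Read]: This paper addresses two obstacles to multi-radionuclide analysis with the liquid scintillation triple-to-double coincidence ratio (TDCR) method: limited automation and reliance on mixture-specific standard sources, which is especially problematic when reference materials are unavailable. The key to its solution is an AI framework that combines numerical spectral simulation with deep learning. Training data are generated from Geant4 simulations coupled with statistically modeled detector-response sampling, and a tailored neural network architecture resolves individual radionuclide activities and detection efficiencies end to end across different nuclide mix ratios and quenching conditions, enabling automated quantitative analysis without standard sources.

Link: https://arxiv.org/abs/2509.03137
Authors: Li Yi,Qian Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Nuclear Experiment (nucl-ex); Computational Physics (physics.comp-ph); Instrumentation and Detectors (physics.ins-det)
Comments: 15 pages, 3 figures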

Abstract: Liquid scintillation triple-to-doubly coincident ratio (TDCR) spectroscopy is widely adopted as a standard method for radionuclide quantification because of its inherent advantages such as high precision, self-calibrating capability, and independence from radioactive reference sources. However, multiradionuclide analysis via TDCR faces the challenges of limited automation and reliance on mixture-specific standards, which may not be easily available. Here, we present an Artificial Intelligence (AI) framework that combines numerical spectral simulation and deep learning for standard-free automated analysis. β spectra for model training were generated using Geant4 simulations coupled with statistically modeled detector response sampling. A tailored neural network architecture, trained on this dataset covering various nuclide mix ratios and quenching scenarios, enables autonomous resolution of individual radionuclide activities and detection efficiency through end-to-end learning paradigms. The model delivers consistent high accuracy across tasks: activity proportions (mean absolute error = 0.009), detection efficiencies (mean absolute error = 0.002), and spectral reconstruction (Structural Similarity Index = 0.9998), validating its physical plausibility for quenched β spectroscopy. This AI-driven methodology exhibits significant potential for automated safety-compliant multiradionuclide analysis with robust generalization, real-time processing capabilities, and engineering feasibility, particularly in scenarios where reference materials are unavailable or rapid field analysis is required.

[AI-28] Adaptive KV-Cache Compression without Manually Setting Budget

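[Quick Read]: This paper addresses the efficiency bottleneck caused by the KV-cache memory footprint, which grows rapidly with sequence length during large language model (LLM) inference. Existing KV-cache compression methods suffer from a "Procrustes' bed" problem: they force diverse workloads into fixed compression ratios, which leads to poor resource allocation and degraded inference performance. The key to the solution is GVote, an adaptive compression scheme that needs no manually specified cache budget. Its core idea is that the set of important keys is the aggregation of the keys required by future queries: GVote samples potential queries in a Monte-Carlo style and aggregates the keys they select, thereby predicting future attention demand and determining the cache budget automatically, for a better accuracy-efficiency trade-off.

Link: https://arxiv.org/abs/2509.03136
Authors: Chenxia Tang,Jianchun Liu,Hongli Xu,Liusheng Huang
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: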

Abstract: Large language model (LLM) inference relies heavily on KV-caches to accelerate autoregressive decoding, but the resulting memory footprint grows rapidly with sequence length, posing significant efficiency challenges. Current KV-cache compression methods suffer from a Procrustes' bed problem: they force diverse workloads into fixed compression ratios, leading to suboptimal resource allocation and inference performance. To this end, we present GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification while achieving superior accuracy-efficiency trade-offs. GVote operates on the principle that the important keys are the aggregation of keys required by future queries. The method predicts future query attention demands by Monte-Carlo style sampling of potential queries and aggregating the selected keys to determine the optimal cache budget without manual specification. Experimental evaluation demonstrates GVote's effectiveness across multiple benchmarks, including GSM8K, RULER and LongBench. Compared to baselines, GVote exhibits a 2× memory reduction while maintaining higher or comparable accuracy.
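
The "sample queries, then aggregate the keys they need" principle can be illustrated with a small sketch. This is not the GVote implementation: the Gaussian perturbation around the mean of recent queries, the attention-mass threshold `mass`, and all function names are assumptions made here for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def keys_for_query(query, keys, mass=0.95):
    """Smallest set of key indices whose attention weights sum to at least `mass`."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    order = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    chosen, covered = set(), 0.0
    for i in order:
        chosen.add(i)
        covered += weights[i]
        if covered >= mass:
            break
    return chosen


def vote_keys(recent_queries, keys, n_samples=8, noise=0.1, mass=0.95, seed=0):
    """Union of the keys needed by hypothetical future queries.

    Future queries are ASSUMED to look like the mean recent query plus
    Gaussian noise; the paper's actual sampling scheme may differ.
    """
    rng = random.Random(seed)
    dim = len(keys[0])
    mean = [sum(q[d] for q in recent_queries) / len(recent_queries) for d in range(dim)]
    kept = set()
    for _ in range(n_samples):
        sample = [mu + rng.gauss(0.0, noise) for mu in mean]
        kept |= keys_for_query(sample, keys, mass)
    return sorted(kept)


# Four cached keys pointing in different directions; the recent query favors key 0.
keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0], [0.0, -10.0]]
kept = vote_keys([[1.0, 0.0]], keys)
budget = len(kept)  # the cache budget falls out of the vote instead of being set by hand
```

In this sketch the budget is simply the size of the voted set, so a workload whose sampled queries agree keeps few keys, while a workload with scattered attention keeps more.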

[AI-29] A Hierarchical Deep Reinforcement Learning Framework for Traffic Signal Control with Predictable Cycle Planning

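[Quick Read]: This paper addresses the efficiency and fairness problems that stem from how control strategies are designed in deep reinforcement learning (DRL) for traffic signal control (TSC). Specifically, the "choose phase" strategy adapts the phase order but can produce phase sequences that drivers do not expect, which harms safety; the "switch" strategy keeps the phase order stable but may extend some movements disproportionately while neglecting others, making allocation unfair and inefficient. The key to the solution is a hierarchical cycle-planning model, the Deep Hierarchical Cycle Planner (DHCP), which allocates the signal cycle through two levels of agents: a high-level agent splits the total cycle time between the North-South (NS) and East-West (EW) directions based on the overall traffic state, and a low-level agent divides each direction's share between straight and left-turn movements. This keeps the cycle predictable while making durations more flexible and fair; according to the abstract, the model achieves the best performance against baselines on all tested real and synthetic datasets.

Link: https://arxiv.org/abs/2509.03118
Authors: Hankang Gu,Yuli Zhang,Chengming Wang,Ruiyuan Jiang,Ziheng Qiao,Pengfei Fan,Dongyao Jia
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: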

Abstract: Deep reinforcement learning (DRL) has become a popular approach in traffic signal control (TSC) due to its ability to learn adaptive policies from complex traffic environments. Within DRL-based TSC methods, two primary control paradigms are "choose phase" and "switch" strategies. Although the agent in the choose phase paradigm selects the next active phase adaptively, this paradigm may result in unexpected phase sequences for drivers, disrupting their anticipation and potentially compromising safety at intersections. Meanwhile, the switch paradigm allows the agent to decide whether to switch to the next predefined phase or extend the current phase. While this structure maintains a more predictable order, it can lead to unfair and inefficient phase allocations, as certain movements may be extended disproportionately while others are neglected. In this paper, we propose a DRL model, named Deep Hierarchical Cycle Planner (DHCP), to allocate the traffic signal cycle duration hierarchically. A high-level agent first determines the split of the total cycle time between the North-South (NS) and East-West (EW) directions based on the overall traffic state. Then, a low-level agent further divides the allocated duration within each major direction between straight and left-turn movements, enabling more flexible durations for the two movements. We test our model on both real and synthetic road networks, along with multiple sets of real and synthetic traffic flows. Empirical results show our model achieves the best performance over all datasets against baselines.
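
The two-level allocation can be sketched in a few lines. The learned agents are replaced here by a simple queue-proportional rule, which is an assumption for illustration and not the DHCP policy; the `minimum` green time and the movement names are likewise hypothetical.

```python
def split(total, share, minimum=5.0):
    """Divide `total` seconds in two, giving the first part `share` of it,
    while keeping both parts at or above `minimum` seconds."""
    first = min(max(total * share, minimum), total - minimum)
    return first, total - first


def plan_cycle(cycle, queues, minimum=5.0):
    """Allocate one fixed-order cycle across four movements.

    `queues` maps NS_straight, NS_left, EW_straight, EW_left to waiting vehicles.
    Queue-proportional shares stand in for the agents' learned actions.
    """
    ns = queues["NS_straight"] + queues["NS_left"]
    ew = queues["EW_straight"] + queues["EW_left"]
    # High level: split the whole cycle between the NS and EW directions.
    ns_share = ns / (ns + ew) if ns + ew else 0.5
    ns_time, ew_time = split(cycle, ns_share, 2 * minimum)
    plan = {}
    # Low level: split each direction's budget between straight and left turn.
    for axis, budget, demand in (("NS", ns_time, ns), ("EW", ew_time, ew)):
        straight_share = queues[f"{axis}_straight"] / demand if demand else 0.5
        straight, left = split(budget, straight_share, minimum)
        plan[f"{axis}_straight"] = straight
        plan[f"{axis}_left"] = left
    return plan


plan = plan_cycle(120.0, {"NS_straight": 45, "NS_left": 15, "EW_straight": 15, "EW_left": 5})
```

Because the phase order is fixed and only the durations change, drivers always see the same sequence, which is the predictability property the abstract emphasizes.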

[AI-30] Are We SOLID Yet? An Empirical Study on Prompting LLMs to Detect Design Principle Violations

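[Quick Read]: This paper addresses the limits of traditional static analysis in detecting semantic flaws in object-oriented design, such as violations of the SOLID principles, and in particular the lack of an approach that detects violations of all five principles across multi-language codebases. The key to its approach is an evaluation methodology built on tailored prompt engineering: four leading large language models (LLMs), CodeLlama, DeepSeekCoder, QwenCoder, and GPT-4o Mini, are combined with four prompt strategies (inspired by zero-shot, few-shot, and chain-of-thought techniques) to measure systematically how well they detect SOLID violations across languages. The study finds large performance differences between models, and that prompt strategy strongly affects accuracy, with no single strategy being universally best. It concludes that effective AI-driven design analysis requires matching the model and prompt to the specific design context, and it highlights the potential of LLMs to support code maintainability.

Link: https://arxiv.org/abs/2509.03093
Authors: Fatih Pehlivan,Arçin Ülkü Ergüzen,Sahand Moslemi Yengejeh,Mayasah Lami,Anil Koyuncu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to ASE2025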

Abstract: Traditional static analysis methods struggle to detect semantic design flaws, such as violations of the SOLID principles, which require a strong understanding of object-oriented design patterns and principles. Existing solutions typically focus on individual SOLID principles or specific programming languages, leaving a gap in the ability to detect violations across all five principles in multi-language codebases. This paper presents a new approach: a methodology that leverages tailored prompt engineering to assess LLMs on their ability to detect SOLID violations across multiple languages. We present a benchmark of four leading LLMs (CodeLlama, DeepSeekCoder, QwenCoder, and GPT-4o Mini) on their ability to detect violations of all five SOLID principles. For this evaluation, we construct a new benchmark dataset of 240 manually validated code examples. Using this dataset, we test four distinct prompt strategies inspired by established zero-shot, few-shot, and chain-of-thought techniques to systematically measure their impact on detection accuracy. Our emerging results reveal a stark hierarchy among models, with GPT-4o Mini decisively outperforming others, yet even it struggles with challenging principles like DIP. Crucially, we show that prompt strategy has a dramatic impact, but no single strategy is universally best; for instance, a deliberative ENSEMBLE prompt excels at OCP detection while a hint-based EXAMPLE prompt is superior for DIP violations. Across all experiments, detection accuracy is heavily influenced by language characteristics and degrades sharply with increasing code complexity. These initial findings demonstrate that effective, AI-driven design analysis requires not a single best model, but a tailored approach that matches the right model and prompt to the specific design context, highlighting the potential of LLMs to support maintainability through AI-assisted code analysis.

[AI-31] Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

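[Quick Read]: This paper addresses the difficulty of improving the reasoning ability of large language models (LLMs) in reasoning-intensive domains beyond mathematics and programming, where the main obstacles are the scarcity of high-quality verifiable datasets and the high cost of human annotation. The key to the solution is the Loong Project, an open-source framework for synthetic data generation and verification with two core components: (1) LoongBench, a curated seed dataset of 8,729 human-vetted examples across 12 domains (e.g., advanced mathematics, chemistry, logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies for producing new question-answer-code triples. Rewarding Chain-of-Thought (CoT) solutions that agree with the code-executed answers provides the signal for Reinforcement Learning with Verifiable Reward (RLVR), and together the two components form an agent-environment loop that enables such training.

Link: https://arxiv.org/abs/2509.03059
Authors: Xingyue Huang,Rishabh,Gregor Franke,Ziyi Yang,Jiamu Bai,Weijie Bai,Jinhe Bi,Zifeng Ding,Yiqun Duan,Chengyu Fan,Wendong Fan,Xin Gao,Ruohao Guo,Yuan He,Zhuangzhuang He,Xianglong Hu,Neil Johnson,Bowen Li,Fangru Lin,Siyu Lin,Tong Liu,Yunpu Ma,Hao Shen,Hao Sun,Beibei Wang,Fangyijie Wang,Hao Wang,Haoran Wang,Yang Wang,Yifeng Wang,Zhaowei Wang,Ziyang Wang,Yifan Wu,Zikai Xiao,Chengxing Xie,Fan Yang,Junxiao Yang,Qianshuo Ye,Ziyu Ye,Guangtao Zeng,Yuwen Ebony Zhang,Zeyu Zhang,Zihao Zhu,Bernard Ghanem,Philip Torr,Guohao Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: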

Abstract:Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at this https URL.

[AI-32] Binary Quantization For LLMs Through Dynamic Grouping

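[Quick Read]: This paper addresses the high storage and compute cost of deploying large language models (LLMs), and in particular the sharp performance drop seen under extreme compression such as 1-bit quantization. The core challenge is to preserve model quality while compressing weights from 16-bit Brain Float to a 1-bit representation. The key to the solution is a new optimization objective designed for binary quantization, together with three algorithms that realize it: the method improves blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. At an average bit length of only 1.007 bits, performance stays close to the original model: the quantized LLaMA 3.2 3B model reaches a perplexity of 8.23 (original: 7.81), far better than the previous state-of-the-art binarization method BiLLM (perplexity 123.90). The abstract describes the method as competitive with 4-bit approaches such as GPTQ in both performance and efficiency, and reports a quantization time of 14 seconds for the full model weights on a single CPU core.

Link: https://arxiv.org/abs/2509.03054
Authors: Xinzhe Zheng,Zhen-Qun Yang,Haoran Xie,S. Joe Qin,Arlene Chen,Fangzhen Lin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 11 figures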

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks, but require substantial memory and computational resources. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. However, such aggressive quantization often leads to notable performance degradation compared to more conservative 4-bit quantization methods. In this research, we propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively. Our method enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. Experimental results demonstrate that our approach achieves an average bit length of just 1.007 bits, while maintaining high model quality. Specifically, our quantized LLaMA 3.2 3B model attains a perplexity of 8.23, remarkably close to the original 7.81, and surpasses the previous SOTA BiLLM, whose perplexity is 123.90. Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ in both performance and efficiency. The compression process is highly efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights on a single CPU core, with the entire process completing in under 100 minutes and exhibiting embarrassingly parallel properties. Code: this https URL
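
To see why grouping matters for binary quantization, consider the plain blocked baseline that the paper sets out to improve. The sketch below uses fixed contiguous groups with a per-group scale equal to the mean absolute weight (the standard least-squares scale for sign binarization); it does not implement the paper's dynamic grouping or its optimization objective, and the function names are illustrative.

```python
def binarize_group(weights):
    """Approximate a group of weights as alpha * sign(w).

    alpha = mean(|w|) minimizes the squared error for a fixed sign pattern.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def quantize_blocked(weights, group_size):
    """Binarize fixed contiguous groups and return the dequantized weights."""
    out = []
    for start in range(0, len(weights), group_size):
        alpha, signs = binarize_group(weights[start:start + group_size])
        out.extend(alpha * s for s in signs)
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


weights = [0.5, -0.5, 4.0, -4.0]
# One group mixes small and large magnitudes, so a single scale fits neither.
err_one_group = squared_error(weights, quantize_blocked(weights, 4))
# Two groups separate the magnitudes, so each scale fits its group exactly.
err_two_groups = squared_error(weights, quantize_blocked(weights, 2))
```

The error depends entirely on which weights share a scale, which suggests why choosing the groups adaptively, rather than by position, can recover quality at almost no extra bit cost.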

[AI-33] FlashRecovery: Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs

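[Quick Read]: This paper addresses the long interruptions that hardware and software failures cause when training large language models (LLMs), focusing on fast, low-overhead failure recovery in very large AI accelerator clusters so that training stays reliable and continuous. The key to the solution is the FlashRecovery system, which has three core modules: (1) active, real-time failure detection that identifies hardware and software faults within seconds; (2) scale-independent task restart, which applies different recovery strategies to normal and faulty nodes and uses an optimized communication-group reconstruction protocol so that recovery time stays nearly constant as the cluster grows; and (3) checkpoint-free recovery within one step, which removes the dependence on traditional checkpointing and its storage and compute overhead, substantially reducing the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Experiments show that the system restores training on a cluster of 4,800 devices within 150 seconds, and that recovery time is nearly the same across training tasks of different scales.

Link: https://arxiv.org/abs/2509.03047
Authors: Haijun Zhang,Jinxiang Wang,Zhenhua Yu,Yanyong Zhang,Xuejie Ji,Kaining Mao,Jun Zhang,Yaqing Zhang,Ting Wu,Fei Jie,Xiemin Huang,Zhifang Cai,Junhua Cheng,Shuwei Wang,Wei Li,Xiaoming Bao,Hua Xu,Shixiong Zhao,Jun Li,Hongwei Sun,Ziyang Zhang,Yi Xiong,Chunsheng Li
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: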

Abstract: Large language models (LLMs) have made a profound impact across various fields due to their advanced capabilities. However, training these models at unprecedented scales requires extensive AI accelerator clusters and sophisticated parallelism strategies, which pose significant challenges in maintaining system reliability over prolonged training periods. A major concern is the substantial loss of training time caused by inevitable hardware and software failures. To address these challenges, we present FlashRecovery, a fast and low-cost failure recovery system comprising three core modules: (1) Active and real-time failure detection. This module performs continuous training state monitoring, enabling immediate identification of hardware and software failures within seconds, thus ensuring rapid incident response; (2) Scale-independent task restart. By employing different recovery strategies for normal and faulty nodes, combined with an optimized communication group reconstruction protocol, our approach ensures that the recovery time remains nearly constant, regardless of cluster scale; (3) Checkpoint-free recovery within one step. Our novel recovery mechanism enables single-step restoration, completely eliminating dependence on traditional checkpointing methods and their associated overhead. Collectively, these innovations enable FlashRecovery to achieve optimal Recovery Time Objective (RTO) and Recovery Point Objective (RPO), substantially improving the reliability and efficiency of long-duration LLM training. Experimental results demonstrate that the FlashRecovery system can restore training on a cluster with 4,800 devices in 150 seconds. We also verify that the time required for failure recovery is nearly consistent for different scales of training tasks.

[AI-34] Knowledge Integration for Physics-informed Symbolic Regression Using Pre-trained Large Language Models

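[Quick Read]: This paper addresses the difficulty of integrating domain knowledge automatically in physics-informed symbolic regression (PiSR): current methods usually depend on experts to hand-craft features and problem-specific formulations, which limits their applicability and scalability. The key to the solution is to use the contextual understanding of pre-trained large language models (LLMs): the LLM's evaluation of each candidate equation, which reflects domain knowledge absorbed from scientific literature, is added to the symbolic regression objective as an extra loss term. This integrates knowledge with little manual intervention and, according to the abstract, improves the reconstruction of physical dynamics and the robustness to noise and complexity.

Link: https://arxiv.org/abs/2509.03036
Authors: Bilge Taskin,Wenxiong Xie,Teddy Lazebnik
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Symbolic Computation (cs.SC)
Comments: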

Abstract: Symbolic regression (SR) has emerged as a powerful tool for automated scientific discovery, enabling the derivation of governing equations from experimental data. A growing body of work illustrates the promise of integrating domain knowledge into SR to improve the discovered equation's generality and usefulness. Physics-informed SR (PiSR) addresses this by incorporating domain knowledge, but current methods often require specialized formulations and manual feature engineering, limiting their use to domain experts. In this study, we leverage pre-trained Large Language Models (LLMs) to facilitate knowledge integration in PiSR. By harnessing the contextual understanding of LLMs trained on vast scientific literature, we aim to automate the incorporation of domain knowledge, reducing the need for manual intervention and making the process more accessible to a broader range of scientific problems. Namely, the LLM is integrated into the SR's loss function, adding a term of the LLM's evaluation of the SR's produced equation. We extensively evaluate our method using three SR algorithms (DEAP, gplearn, and PySR) and three pre-trained LLMs (Falcon, Mistral, and Llama 2) across three physical dynamics (dropping ball, simple harmonic motion, and electromagnetic wave). The results demonstrate that LLM integration consistently improves the reconstruction of physical dynamics from data, enhancing the robustness of SR models to noise and complexity. We further explore the impact of prompt engineering, finding that more informative prompts significantly improve performance.
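
The idea of adding an LLM evaluation term to the symbolic regression loss can be sketched as follows. The abstract does not give the exact form of the term, so the additive combination, the `weight` parameter, and the score range of 0 to 1 are assumptions; `toy_score` is a hard-coded placeholder standing in for a real LLM call.

```python
from typing import Callable, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def pisr_loss(
    equation: str,
    predict: Callable[[float], float],
    xs: Sequence[float],
    ys: Sequence[float],
    llm_score: Callable[[str], float],
    weight: float = 0.5,
) -> float:
    """Data-fit error plus a penalty for equations the scorer finds implausible.

    `llm_score` is assumed to return a plausibility in [0, 1].
    """
    data_term = mse([predict(x) for x in xs], ys)
    knowledge_term = 1.0 - llm_score(equation)
    return data_term + weight * knowledge_term


def toy_score(equation: str) -> float:
    """Placeholder scorer (NOT an LLM): favors a quadratic-in-time law for free fall."""
    return 1.0 if "t**2" in equation else 0.2


# Dropping-ball data: distance fallen after t seconds, d = 4.9 * t^2.
ts = [0.0, 1.0, 2.0]
ds = [4.9 * t ** 2 for t in ts]

loss_quadratic = pisr_loss("4.9*t**2", lambda t: 4.9 * t ** 2, ts, ds, toy_score)
loss_linear = pisr_loss("9.8*t - 4.9", lambda t: 9.8 * t - 4.9, ts, ds, toy_score)
```

With noisy data the two candidates' fit errors can be close, and in that regime the knowledge term is what separates them.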

[AI-35] Efficient Privacy-Preserving Recommendation on Sparse Data using Fully Homomorphic Encryption WWW

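[Quick Read]: This paper addresses two challenges that recommender systems face when using fully homomorphic encryption (FHE): handling the sparsity of user-item rating matrices efficiently, and reducing communication overhead in the encrypted domain. The key to the solution is combining the Compressed Sparse Row (CSR) representation with FHE-based matrix factorization, which manages sparse data structures in the encrypted domain and cuts communication cost while keeping recommendation accuracy high and protecting user privacy.

Link: https://arxiv.org/abs/2509.03024
Authors: Moontaha Nishat Chowdhury,André Bauer,Minxuan Zhou
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The paper is accepted at the 21st IEEE International eScience Conference (eScience’25) and will be published soon. Link: this https URL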

Abstract:In today’s data-driven world, recommendation systems personalize user experiences across industries but rely on sensitive data, raising privacy concerns. Fully homomorphic encryption (FHE) can secure these systems, but a significant challenge in applying FHE to recommendation systems is efficiently handling the inherently large and sparse user-item rating matrices. FHE operations are computationally intensive, and naively processing various sparse matrices in recommendation systems would be prohibitively expensive. Additionally, the communication overhead between parties remains a critical concern in encrypted domains. We propose a novel approach combining Compressed Sparse Row (CSR) representation with FHE-based matrix factorization that efficiently handles matrix sparsity in the encrypted domain while minimizing communication costs. Our experimental results demonstrate high recommendation accuracy with encrypted data while achieving the lowest communication costs, effectively preserving user privacy.
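
For readers unfamiliar with Compressed Sparse Row, the sketch below shows the plaintext layout on a tiny rating matrix. It involves no encryption and is not the paper's protocol; it only illustrates why CSR shrinks the amount of data that has to be processed and transmitted when most ratings are missing.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays.

    values  : non-zero entries in row-major order
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def row_entries(csr, i):
    """(item, rating) pairs for user i, read directly from the CSR arrays."""
    values, col_idx, row_ptr = csr
    return [(col_idx[k], values[k]) for k in range(row_ptr[i], row_ptr[i + 1])]


# 3 users x 4 items; 0 means "not rated".
ratings = [
    [5, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]
csr = to_csr(ratings)
```

Here 12 matrix cells collapse to 3 stored ratings plus index arrays, and the saving grows with the sparsity of the matrix.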

[AI-36] StableSleep: Source-Free Test-Time Adaptation for Sleep Staging with Lightweight Safety Rails

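[Quick Read]: This paper addresses the performance degradation of sleep staging models when they are deployed on patients with unseen physiology or recording conditions. The key to the solution is a streaming, source-free test-time adaptation (TTA) method that combines entropy minimization (Tent) with a refresh of Batch-Norm statistics and adds two safety mechanisms: an entropy-based gate that pauses adaptation on uncertain windows, and a reset based on an exponential moving average (EMA) that prevents adaptation drift. The method needs no source data or patient calibration, has low latency and a small memory footprint, and is suitable for on-device or bedside deployment.

Link: https://arxiv.org/abs/2509.02982
Authors: Hritik Arasu,Faisal R Jahangiri
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: 5 page paper, 8 figures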

Abstract: Sleep staging models often degrade when deployed on patients with unseen physiology or recording conditions. We propose a streaming, source-free test-time adaptation (TTA) recipe that combines entropy minimization (Tent) with Batch-Norm statistic refresh and two safety rails: an entropy gate to pause adaptation on uncertain windows and an EMA-based reset to reel back drift. On Sleep-EDF Expanded, using single-lead EEG (Fpz-Cz, 100 Hz, 30s epochs; RK to AASM mapping), we show consistent gains over a frozen baseline at seconds-level latency and minimal memory, reporting per-stage metrics and Cohen's κ. The method is model-agnostic, requires no source data or patient calibration, and is practical for on-device or bedside use.
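
The two safety rails can be sketched independently of any particular model. The abstract does not specify thresholds or the exact reset rule, so everything below is an assumption: the gate value, the EMA decay, the drift limit, and the choice to reset the adapted parameters to their EMA when they move too far from it. The adaptation step itself (Tent plus Batch-Norm refresh) is represented only by the `proposed` parameters passed in.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class SafetyRails:
    """Entropy gate plus EMA-based reset around an external adaptation step."""

    def __init__(self, params, gate=0.8, decay=0.9, max_drift=1.0):
        self.params = list(params)  # parameters currently in use
        self.ema = list(params)     # slow-moving average used as the fallback
        self.gate = gate
        self.decay = decay
        self.max_drift = max_drift

    def step(self, probs, proposed):
        """Accept, skip, or roll back one adaptation step.

        probs    : model prediction for the current window
        proposed : parameters after one adaptation step on this window
        """
        if entropy(probs) > self.gate:
            return "skipped"  # window too uncertain: pause adaptation
        self.params = list(proposed)
        if math.dist(self.params, self.ema) > self.max_drift:
            self.params = list(self.ema)  # drifted too far: reel back
            return "reset"
        self.ema = [self.decay * e + (1 - self.decay) * p
                    for e, p in zip(self.ema, self.params)]
        return "adapted"


rails = SafetyRails([0.0, 0.0])
uniform = [0.2] * 5                              # maximally uncertain over 5 stages
confident = [0.9, 0.025, 0.025, 0.025, 0.025]
status_uncertain = rails.step(uniform, [0.3, 0.3])
status_small = rails.step(confident, [0.1, 0.1])
status_large = rails.step(confident, [5.0, 5.0])
```

Both rails cost a few arithmetic operations per window, which is consistent with the low-latency, low-memory setting the abstract targets.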

[AI-37] AR-KAN: Autoregressive-Weight-Enhanced Kolmogorov-Arnold Network for Time Series Forecasting

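[Quick Read]: This paper addresses the weak performance of conventional neural networks in spectral analysis of signals, particularly for almost periodic signals composed of incommensurate frequencies, where classical models such as ARIMA often outperform most neural networks, including large language models. The key to the solution is a hybrid model, the Autoregressive-Weight-Enhanced Kolmogorov-Arnold Network (AR-KAN), which combines the memory of an autoregressive (AR) component with the ability of a Kolmogorov-Arnold Network (KAN) to model static nonlinear mappings. Building on the Universal Myopic Mapping Theorem, the design retains the most useful information while eliminating redundancy, and it delivers superior results on 72% of the real-world datasets tested.

Link: https://arxiv.org/abs/2509.02967
Authors: Chen Zeng,Tiehang Xu,Qiao Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: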

Abstract: Conventional neural networks frequently face challenges in spectral analysis of signals. To address this challenge, Fourier neural networks (FNNs) and similar approaches integrate components of Fourier series into the structure of neural networks. Nonetheless, a significant hurdle is often overlooked: the superposition of periodic signals does not necessarily result in a periodic signal. For example, when forecasting almost periodic functions composed of signals with incommensurate frequencies, traditional models such as Autoregressive Integrated Moving Average (ARIMA) frequently outperform most neural networks including large language models (LLMs). To address this, we propose the Autoregressive-Weight-Enhanced Kolmogorov-Arnold Network (AR-KAN), a hybrid model that combines the benefits of both methods. Using the Universal Myopic Mapping Theorem, we apply a Kolmogorov-Arnold Network (KAN) for the static nonlinear part and include memory through a pre-trained AR component, which can be interpreted as retaining the most useful information while eliminating redundancy. Experimental data indicates that AR-KAN delivers superior results on 72% of real-world datasets.
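
A short worked example shows why autoregressive models handle almost periodic signals well. A single sinusoid with angular frequency w satisfies x[t] = 2cos(w)·x[t-1] − x[t-2], so the sum of two sinusoids satisfies an exact order-4 linear recurrence even when the frequencies are incommensurate and the sum is not periodic. The sketch below verifies this numerically; it illustrates the motivation only and is not the AR-KAN model, and the frequencies 1 and √2 are chosen here for illustration.

```python
import math


def ar4_coefficients(w1, w2):
    """AR(4) coefficients that exactly reproduce sin(w1*t) + sin(w2*t).

    They come from expanding (z^2 - a z + 1)(z^2 - b z + 1)
    with a = 2cos(w1) and b = 2cos(w2).
    """
    a, b = 2 * math.cos(w1), 2 * math.cos(w2)
    return [a + b, -(2 + a * b), a + b, -1.0]


def predict_next(history, coeffs):
    """One-step AR prediction: coeffs[k-1] multiplies the value k steps back."""
    return sum(c * history[-k] for k, c in enumerate(coeffs, start=1))


# Frequencies 1 and sqrt(2) are incommensurate, so the signal never repeats.
w1, w2 = 1.0, math.sqrt(2.0)
signal = [math.sin(w1 * t) + math.sin(w2 * t) for t in range(60)]
coeffs = ar4_coefficients(w1, w2)

max_error = max(
    abs(predict_next(signal[:t], coeffs) - signal[t]) for t in range(4, 60)
)
```

Four linear coefficients capture the signal up to floating-point error, whereas a model that assumes a fixed period has no period to lock onto.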

[AI-38] Lattice Annotated Temporal (LAT) Logic for Non-Markovian Reasoning

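[Quick Read]: This paper addresses the modeling and efficient computation of open-world temporal reasoning in dynamic, uncertain environments, and in particular the inefficiency of traditional logic programming when handling non-Markovian relationships and infinite constant domains, as well as its weak support for open-world semantics. The core of the solution is Lattice Annotated Temporal (LAT) Logic, which obtains open-world semantics from a lower lattice structure and supports non-Markovian dependencies through temporal logic programming. The key innovation is that the lower lattice enables efficient grounding through a Skolemization process, so that domains with infinite or highly diverse constants remain tractable. The paper proves bounds on the computational complexity of grounding, and its PyReason implementation, with modular design and machine-level optimizations, achieves up to three orders of magnitude speedup and up to five orders of magnitude memory reduction.

Link: https://arxiv.org/abs/2509.02958
Authors: Kaustuv Mukherji,Jaikrishna Manojkumar Patil,Dyuman Aditya,Paulo Shakarian,Devendra Parkar,Lahari Pokala,Clark Dorman,Gerardo I. Simari
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: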

Abstract:We introduce Lattice Annotated Temporal (LAT) Logic, an extension of Generalized Annotated Logic Programs (GAPs) that incorporates temporal reasoning and supports open-world semantics through the use of a lower lattice structure. This logic combines an efficient deduction process with temporal logic programming to support non-Markovian relationships and open-world reasoning capabilities. The open-world aspect, a by-product of the use of the lower-lattice annotation structure, allows for efficient grounding through a Skolemization process, even in domains with infinite or highly diverse constants. We provide a suite of theoretical results that bound the computational complexity of the grounding process, in addition to showing that many of the results on GAPs (using an upper lattice) still hold with the lower lattice and temporal extensions (though different proof techniques are required). Our open-source implementation, PyReason, features modular design, machine-level optimizations, and direct integration with reinforcement learning environments. Empirical evaluations across multi-agent simulations and knowledge graph tasks demonstrate up to three orders of magnitude speedup and up to five orders of magnitude memory reduction while maintaining or improving task performance. Additionally, we evaluate LAT Logic’s value in reinforcement learning environments as a non-Markovian simulator, achieving up to three orders of magnitude faster simulation with improved agent performance, including a 26% increase in win rate due to capturing richer temporal dependencies. These results highlight LAT Logic’s potential as a unified, extensible framework for open-world temporal reasoning in dynamic and uncertain environments. Our implementation is available at: this http URL. 

[AI-39] VendiRL: A Framework for Self-Supervised Reinforcement Learning of Diversely Diverse Skills

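[Quick Read]: This paper addresses the scalability and evaluation problems of learning diverse skills in self-supervised reinforcement learning (RL). Existing methods struggle to find meaningful skills in high-dimensional feature spaces, and their definitions of skill diversity usually rest on specific assumptions, which makes results hard to compare across methods and leaves many forms of diversity unexplored. The key to the solution is the Vendi Score, a sample-diversity measure that carries ideas from ecology over to machine learning and lets the user specify and evaluate any desired form of diversity. On this basis the paper proposes VendiRL, a framework in which different similarity functions motivate different forms of diversity, supporting skill-diversity pretraining in new and richly interactive environments.

Link: https://arxiv.org/abs/2509.02930
Authors: Erik M. Lintunen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 17 pages including appendices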

Abstract:In self-supervised reinforcement learning (RL), one of the key challenges is learning a diverse set of skills to prepare agents for unknown future tasks. Despite impressive advances, scalability and evaluation remain prevalent issues. Regarding scalability, the search for meaningful skills can be obscured by high-dimensional feature spaces, where relevant features may vary across downstream task domains. For evaluating skill diversity, defining what constitutes “diversity” typically requires a hard commitment to a specific notion of what it means for skills to be diverse, potentially leading to inconsistencies in how skill diversity is understood, making results across different approaches hard to compare, and leaving many forms of diversity unexplored. To address these issues, we adopt a measure of sample diversity that translates ideas from ecology to machine learning – the Vendi Score – allowing the user to specify and evaluate any desired form of diversity. We demonstrate how this metric facilitates skill evaluation and introduce VendiRL, a unified framework for learning diversely diverse sets of skills. Given distinct similarity functions, VendiRL motivates distinct forms of diversity, which could support skill-diversity pretraining in new and richly interactive environments where optimising for various forms of diversity may be desirable.
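
For reference, the Vendi Score has a compact closed form. The definition below is recalled from the original Vendi Score work rather than from this abstract, so the notation should be read as an assumption: K is an n×n positive semidefinite similarity matrix over the n samples (here, skills) with ones on the diagonal.

```latex
\mathrm{VS}(K) \;=\; \exp\!\Big(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\Big),
\qquad \lambda_1,\dots,\lambda_n \ \text{the eigenvalues of } K/n,
\quad 0 \log 0 := 0 .
```

Two sanity checks make the score easy to read as an "effective number" of distinct samples. If all n samples are identical, K is the all-ones matrix, K/n has eigenvalues (1, 0, ..., 0), and VS = 1. If all samples are mutually dissimilar, K is the identity, every eigenvalue equals 1/n, and VS = exp(log n) = n. Changing the similarity function changes K, which is how different notions of diversity are expressed.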

[AI-40] Simulacra Naturae: Generative Ecosystem driven by Agent-Based Simulations and Brain Organoid Collective Intelligence IEEE-VIS

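[Quick Read]: This paper asks how a cross-disciplinary media practice can redefine the ethical and affective dimensions of data visualization in a context of human-machine symbiosis, and in particular how "collective care" can be expressed when nonhuman cognition enters a generative system. The key to its approach is to treat pre-recorded neural activity from brain organoids as a co-creative force rather than a direct control signal, driving a multi-sensory environment that combines biological computation, material ecologies, and generative systems. The system modulates, in real time, agent behaviors inspired by natural systems such as termite colonies and slime molds, and combines living plants, computationally fabricated clay prints, and spatial audio into an immersive field shaped by nonhuman cognition. This shifts data presentation from a human-centered stance toward a decentered ecological attunement, and points to uses of generative systems in ethically sensitive interaction design.

Link: https://arxiv.org/abs/2509.02924
Authors: Nefeli Manoudaki,Mert Toka,Iason Paterakis,Diarmid Flatley
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: to be published in IEEE VISAP 2025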

Abstract:Simulacra Naturae is a data-driven media installation that explores collective care through the entanglement of biological computation, material ecologies, and generative systems. The work translates pre-recorded neural activity from brain organoids, lab-grown three-dimensional clusters of neurons, into a multi-sensory environment composed of generative visuals, spatial audio, living plants, and fabricated clay artifacts. These biosignals, streamed through a real-time system, modulate emergent agent behaviors inspired by natural systems such as termite colonies and slime molds. Rather than using biosignals as direct control inputs, Simulacra Naturae treats organoid activity as a co-creative force, allowing neural rhythms to guide the growth, form, and atmosphere of a generative ecosystem. The installation features computationally fabricated clay prints embedded with solenoids, adding physical sound resonances to the generative surround composition. The spatial environment, filled with live tropical plants and a floor-level projection layer featuring real-time generative AI visuals, invites participants into a sensory field shaped by nonhuman cognition. By grounding abstract data in living materials and embodied experience, Simulacra Naturae reimagines visualization as a practice of care, one that decentralizes human agency and opens new spaces for ethics, empathy, and ecological attunement within hybrid computational systems.

[AI-41] The Basic B* Effect: The Use of LLM-based Agents Reduces the Distinctiveness and Diversity of People’s Choices

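[Quick Read]: This paper examines how AI agents that make decisions on people's behalf affect the construction of individual identity, specifically how they change interpersonal distinctiveness and intrapersonal diversity. The study finds that both generic and personalized large language model (LLM) agents push users toward more homogeneous, popular options, which weakens the distinctiveness of their behavior. Personalized agents temper this homogenization to some extent, but they further compress the diversity of an individual's preference portfolio, narrowing what the person explores across topics and psychological affinities. The key is to identify and weigh the distinctiveness-diversity trade-off so that AI systems can be designed to augment rather than constrain human agency, safeguarding diversity in thought, taste, and expression.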

Link: https://arxiv.org/abs/2509.02910
Authors: Sandra C. Matz,C. Blaine Horton,Sofie Goethals
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large language models (LLMs) increasingly act on people’s behalf: they write emails, buy groceries, and book restaurants. While the outsourcing of human decision-making to AI can be both efficient and effective, it raises a fundamental question: how does delegating identity-defining choices to AI reshape who people become? We study the impact of agentic LLMs on two identity-relevant outcomes: interpersonal distinctiveness - how unique a person’s choices are relative to others - and intrapersonal diversity - the breadth of a single person’s choices over time. Using real choices drawn from social-media behavior of 1,000 U.S. users (110,000 choices in total), we compare a generic and personalized agent to a human baseline. Both agents shift people’s choices toward more popular options, reducing the distinctiveness of their behaviors and preferences. While the use of personalized agents tempers this homogenization (compared to the generic AI), it also more strongly compresses the diversity of people’s preference portfolios by narrowing what they explore across topics and psychological affinities. Understanding how AI agents might flatten human experience, and how using generic versus personalized agents involves distinctiveness-diversity trade-offs, is critical for designing systems that augment rather than constrain human agency, and for safeguarding diversity in thought, taste, and expression.

[AI-42] Cut Costs Not Accuracy: LLM-Powered Data Processing with Guarantees SIGMOD’26

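[Quick Read]: This paper aims to substantially reduce the cost of using large language models (LLMs) for large-scale text processing while preserving output quality. The prevailing approach relies on the model cascade framework, which uses an LLM's confidence in its output (e.g., log-probabilities) to decide which records to process with a cheaper but lower-quality LLM. Existing solutions deliver limited cost savings and lack strong theoretical guarantees, mainly because they estimate the quality of the cheap LLM's outputs poorly. The key to the proposed BARGAIN method is a novel adaptive sampling strategy and statistical estimation procedure that uses data and task characteristics and builds on recent statistical tools to estimate the cheap LLM's output quality accurately, with tight theoretical guarantees. On average, BARGAIN reduces cost by up to 86% more than the state of the art while providing stronger theoretical guarantees on accuracy, precision, or recall.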

Link: https://arxiv.org/abs/2509.02896
Authors: Sepanta Zeighami,Shreya Shankar,Aditya Parameswaran
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: To appear in SIGMOD’26

Abstract:Large Language Models (LLMs) are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs when processing text at scale. Top-of-the-line LLMs (e.g., GPT-4o, Claude Sonnet) operate with high accuracy but are prohibitively expensive when processing many records. To avoid high costs, more affordable but lower quality LLMs (e.g., GPT-4o-mini, Claude Haiku) can be used to process records, but we need to ensure that the overall accuracy does not deviate substantially from that of the top-of-the-line LLMs. The model cascade framework provides a blueprint to manage this trade-off, by using the confidence of LLMs in their output (e.g., log-probabilities) to decide on which records to use the affordable LLM. However, existing solutions following this framework provide only marginal cost savings and weak theoretical guarantees because of poor estimation of the quality of the affordable LLM’s outputs. We present BARGAIN, a method that judiciously uses affordable LLMs in data processing to significantly reduce cost while providing strong theoretical guarantees on the solution quality. BARGAIN employs a novel adaptive sampling strategy and statistical estimation procedure that uses data and task characteristics and builds on recent statistical tools to make accurate estimations with tight theoretical guarantees. Variants of BARGAIN can support guarantees on accuracy, precision, or recall of the output. Experimental results across 8 real-world datasets show that BARGAIN reduces cost, on average, by up to 86% more than state-of-the-art, while providing stronger theoretical guarantees on accuracy of output, with similar gains when guaranteeing a desired level of precision or recall.
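
To make the model cascade idea in this entry more concrete, here is a minimal Python sketch. It is not BARGAIN: the threshold is picked by a naive empirical estimate on a labeled sample and carries no statistical guarantee, which is exactly the weakness the paper targets. The function names `pick_threshold` and `cascade`, and the `cheap` / `expensive` callables, are hypothetical stand-ins for illustration.

```python
def pick_threshold(sample, target):
    """Return the lowest confidence cutoff whose accepted sample items meet the target accuracy.

    `sample` holds (cheap_model_confidence, cheap_answer_is_correct) pairs.
    Naive empirical estimate only: no guarantee is attached to the result.
    """
    for cutoff in sorted({conf for conf, _ in sample}):
        accepted = [ok for conf, ok in sample if conf >= cutoff]
        if sum(accepted) / len(accepted) >= target:
            return cutoff
    return float("inf")  # never trust the cheap model


def cascade(records, cheap, expensive, cutoff):
    """Use the cheap model when it is confident enough, otherwise fall back."""
    outputs = []
    for record in records:
        answer, confidence = cheap(record)  # hypothetical: returns (answer, confidence)
        outputs.append(answer if confidence >= cutoff else expensive(record))
    return outputs
```

A lower cutoff sends more records to the cheap model and saves more money, so the quality of the accuracy estimate behind the cutoff decides how aggressive the cascade can safely be.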

[AI-43] Grocery to General Merchandise: A Cross-Pollination Recommender using LLMs and Real-Time Cart Context

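[Quick Read]: This paper tackles cross-category recommendation in e-commerce, specifically how to recommend general merchandise that fits the current cart context of customers who are mainly shopping for groceries (for example, pairing milk with a milk frother), in order to improve customer experience and conversion. The key to the solution is a novel cross-pollination (XP) framework with two stages. The first stage uses co-purchase market basket analysis together with a large language model (LLM) based approach to identify novel item-item associations. The second stage uses a transformer-based ranker that leverages the real-time sequential cart context and optimizes for engagement signals such as add-to-carts. Offline analysis and online A/B tests show a 36% increase in add-to-cart rate with LLM-based retrieval and a 27% NDCG@4 lift with the cart-context-based ranker, supporting the method's effectiveness and practicality for cross-category recommendation.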

Link: https://arxiv.org/abs/2509.02890
Authors: Akshay Kekuda,Murali Mohana Krishna Dandu,Rimita Lahiri,Shiqin Cai,Sinduja Subramaniam,Evren Korpeoglu,Kannan Achan
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern e-commerce platforms strive to enhance customer experience by providing timely and contextually relevant recommendations. However, recommending general merchandise to customers focused on grocery shopping – such as pairing milk with a milk frother – remains a critical yet under-explored challenge. This paper introduces a cross-pollination (XP) framework, a novel approach that bridges grocery and general merchandise cross-category recommendations by leveraging multi-source product associations and real-time cart context. Our solution employs a two-stage framework: (1) A candidate generation mechanism that uses co-purchase market basket analysis and LLM-based approach to identify novel item-item associations; and (2) a transformer-based ranker that leverages the real-time sequential cart context and optimizes for engagement signals such as add-to-carts. Offline analysis and online A/B tests show an increase of 36% add-to-cart rate with LLM-based retrieval, and 27% NDCG@4 lift using cart context-based ranker. Our work contributes practical techniques for cross-category recommendations and broader insights for e-commerce systems.

[AI-44] Enhancing Machine Learning for Imbalanced Medical Data: A Quantum-Inspired Approach to Synthetic Oversampling (QI-SMOTE)

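[Quick Read]: This paper addresses the model bias and degraded predictive performance caused by class imbalance in machine learning (ML), especially in medicine, where too few minority-class samples undermine model fairness and reliability. The key to the solution is a data augmentation technique called Quantum-Inspired SMOTE (QI-SMOTE), which draws on quantum principles such as quantum evolution and layered entanglement to generate synthetic samples that are more informative and preserve data structure, improving the generalization and classification accuracy of several mainstream classifiers (such as Random Forest, Support Vector Machine, Gradient Boosting, and neural networks). Compared with traditional oversampling methods (such as Borderline-SMOTE and ADASYN), QI-SMOTE mitigates class imbalance more effectively and makes models more robust and trustworthy in clinical diagnosis settings.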

Link: https://arxiv.org/abs/2509.02863
Authors: Vikas Kashtriya,Pardeep Singh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Class imbalance remains a critical challenge in machine learning (ML), particularly in the medical domain, where underrepresented minority classes lead to biased models and reduced predictive performance. This study introduces Quantum-Inspired SMOTE (QI-SMOTE), a novel data augmentation technique that enhances the performance of ML classifiers, including Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (KNN), Gradient Boosting (GB), and Neural Networks, by leveraging quantum principles such as quantum evolution and layered entanglement. Unlike conventional oversampling methods, QI-SMOTE generates synthetic instances that preserve complex data structures, improving model generalization and classification accuracy. We validate QI-SMOTE on the MIMIC-III and MIMIC-IV datasets, using mortality detection as a benchmark task due to their clinical significance and inherent class imbalance. We compare our method against traditional oversampling techniques, including Borderline-SMOTE, ADASYN, SMOTE-ENN, SMOTE-TOMEK, and SVM-SMOTE, using key performance metrics such as Accuracy, F1-score, G-Mean, and AUC-ROC. The results demonstrate that QI-SMOTE significantly improves the effectiveness of ensemble methods (RF, GB, ADA), kernel-based models (SVM), and deep learning approaches by producing more informative and balanced training data. By integrating quantum-inspired transformations into the ML pipeline, QI-SMOTE not only mitigates class imbalance but also enhances the robustness and reliability of predictive models in medical diagnostics and decision-making. This study highlights the potential of quantum-inspired resampling techniques in advancing state-of-the-art ML methodologies.
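
For readers unfamiliar with the baseline that QI-SMOTE builds on, the sketch below shows classic SMOTE-style interpolation in plain Python. It is explicitly not the quantum-inspired variant from the paper, whose transformations are not described in enough detail here to reproduce. The function name `smote` and its parameters are illustrative assumptions.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Classic SMOTE-style interpolation (not QI-SMOTE); needs at least two minority points."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point, by Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Each synthetic point lies on a line segment between two real minority samples, which is why plain SMOTE can blur complex class structure; that limitation is what structure-preserving variants try to address.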

[AI-45] The Architecture of AI Transformation: Four Strategic Patterns and an Emerging Frontier

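[Quick Read]: This paper asks why, despite heavy enterprise investment in artificial intelligence (AI), 95% of enterprises report no measurable profit impact from their AI deployments, i.e., the practical problem of low return on AI investment. It attributes the gap to paradigmatic lock-in, which confines AI to incremental optimization rather than structural transformation. The key to the solution is a 2×2 framework that recasts AI strategy along two independent dimensions: the degree of transformation (incremental to transformational) and the treatment of human contribution (reduced to amplified). The framework identifies four patterns in practice. The first three (individual augmentation, process automation, and workforce substitution) reinforce legacy work models and yield only localized gains. The real breakthrough lies in the fourth, collaborative intelligence, which requires three mechanisms: complementarity (pairing human and machine strengths), co-evolution (mutual adaptation), and boundary-setting (humans determine ethical and strategic parameters). The study argues that advancing collaborative intelligence depends less on adding tools than on reshaping roles, governance, and data architecture at the organizational level, moving AI transformation from optimizing the division of labor toward designing systems in which humans and machines converge.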

Link: https://arxiv.org/abs/2509.02853
Authors: Diana A. Wolfe,Alice Choe,Fergus Kidd
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 59 pages, 2 tables, 4 figures

Abstract:Despite extensive investment in artificial intelligence, 95% of enterprises report no measurable profit impact from AI deployments (MIT, 2025). We argue that this gap reflects paradigmatic lock-in that channels AI into incremental optimization rather than structural transformation. Using a cross-case analysis, we propose a 2x2 framework that reconceptualizes AI strategy along two independent dimensions: the degree of transformation achieved (incremental to transformational) and the treatment of human contribution (reduced to amplified). The framework surfaces four patterns now dominant in practice: individual augmentation, process automation, workforce substitution, and a less deployed frontier of collaborative intelligence. Evidence shows that the first three reinforce legacy work models and yield localized gains without durable value capture. Realizing collaborative intelligence requires three mechanisms: complementarity (pairing distinct human and machine strengths), co-evolution (mutual adaptation through interaction), and boundary-setting (human determination of ethical and strategic parameters). Complementarity and boundary-setting are observable in regulated and high-stakes domains; co-evolution is largely absent, which helps explain limited system-level impact. A case study analysis illustrates that advancing toward collaborative intelligence requires material restructuring of roles, governance, and data architecture rather than additional tools. The framework reframes AI transformation as an organizational design challenge: moving from optimizing the division of labor between humans and machines to architecting their convergence, with implications for operating models, workforce development, and the future of work.

[AI-46] Conformal Prediction for Time-series Forecasting with Change Points

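[Quick Read]: This paper addresses the poor performance of existing conformal prediction methods on time series with change points: when the data-generating process shifts abruptly, traditional methods struggle to quantify uncertainty accurately. The key to the solution is a new algorithm, CPTC (Conformal Prediction for Time-series with Change points), whose core idea is to combine a state prediction model with online conformal prediction. The former identifies changes in the underlying state of the series, and the latter models predictive uncertainty dynamically in non-stationary settings. The method is proven valid and adaptive under minimal assumptions, and it outperforms state-of-the-art baselines on 6 synthetic and real-world datasets.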

Link: https://arxiv.org/abs/2509.02844
Authors: Sophia Sun,Rose Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Conformal prediction has been explored as a general and efficient way to provide uncertainty quantification for time series. However, current methods struggle to handle time series data with change points - sudden shifts in the underlying data-generating process. In this paper, we propose a novel Conformal Prediction for Time-series with Change points (CPTC) algorithm, addressing this gap by integrating a model to predict the underlying state with online conformal prediction to model uncertainties in non-stationary time series. We prove CPTC’s validity and improved adaptivity in the time series setting under minimum assumptions, and demonstrate CPTC’s practical effectiveness on 6 synthetic and real-world datasets, showing improved validity and adaptivity compared to state-of-the-art baselines.
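
The online conformal prediction component mentioned above can be illustrated with a generic quantile-tracking update. This is a minimal sketch of the general online technique, not the CPTC algorithm: it has no state prediction model, and the function name, step size `gamma`, and initial width `q0` are assumptions chosen for illustration.

```python
def track_interval_width(residuals, alpha=0.1, gamma=0.05, q0=1.0):
    """Track a prediction-interval half-width online, aiming for a miss rate of alpha.

    `residuals` are forecast errors (y_true - y_pred) observed one step at a time.
    After each step the width grows if the interval missed and shrinks slightly otherwise.
    """
    q = q0
    widths, misses = [], []
    for r in residuals:
        widths.append(q)                      # width used for this step's interval
        miss = 1.0 if abs(r) > q else 0.0     # did the interval fail to cover y_true?
        misses.append(miss)
        q = max(0.0, q + gamma * (miss - alpha))
    return widths, misses
```

After an abrupt shift the width only catches up one step at a time, which is the kind of lag at change points that motivates pairing the update with a model of the underlying state.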

[AI-47] HF-RAG: Hierarchical Fusion-based RAG with Multiple Sources and Rankers

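[Quick Read]: This paper addresses how retrieval augmented generation (RAG) can effectively fuse heterogeneous retrieved evidence from labeled data (input-output associations) and unlabeled data (wider contextual grounding). Because the similarity scores from the two kinds of sources are not directly comparable, merging them naively is problematic; in addition, aggregating the beliefs of multiple rankers can improve RAG effectiveness. The key to the solution is, first, to aggregate the top documents for each source (labeled and unlabeled) separately using a standard rank fusion technique, and then to standardize the retrieval score distribution within each source with a z-score transformation before merging the top-retrieved documents from the two sources. On fact verification, the method outperforms the best individual ranker or source and generalizes better out of domain.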

Link: https://arxiv.org/abs/2509.02837
Authors: Payel Santra,Madhusudan Ghosh,Debasis Ganguly,Partha Basuchowdhuri,Sudip Kumar Naskar
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Leveraging both labeled (input-output associations) and unlabeled data (wider contextual grounding) may provide complementary benefits in retrieval augmented generation (RAG). However, effectively combining evidence from these heterogeneous sources is challenging as the respective similarity scores are not inter-comparable. Additionally, aggregating beliefs from the outputs of multiple rankers can improve the effectiveness of RAG. Our proposed method first aggregates the top-documents from a number of IR models using a standard rank fusion technique for each source (labeled and unlabeled). Next, we standardize the retrieval score distributions within each source by applying z-score transformation before merging the top-retrieved documents from the two sources. We evaluate our approach on the fact verification task, demonstrating that it consistently improves over the best-performing individual ranker or source and also shows better out-of-domain generalization.
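
The two-level fusion described in this entry can be sketched in a few lines of Python. The abstract only says "a standard rank fusion technique", so the use of reciprocal rank fusion (RRF) with `k=60` below is an assumption, as are all function names; treat it as one plausible instance of the pipeline rather than the authors' implementation.

```python
from statistics import mean, pstdev


def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores


def zscore(scores):
    """Standardize scores within one source so that sources become comparable."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero when all scores are equal
    return {doc: (s - mu) / sigma for doc, s in scores.items()}


def hierarchical_fusion(labeled_rankings, unlabeled_rankings, top_n=5):
    """Fuse rankers within each source, z-score per source, then merge across sources."""
    merged = []
    for source, rankings in (("labeled", labeled_rankings), ("unlabeled", unlabeled_rankings)):
        fused = zscore(rrf(rankings))
        merged.extend((score, source, doc) for doc, score in fused.items())
    merged.sort(reverse=True)
    return [(source, doc) for _, source, doc in merged[:top_n]]
```

Standardizing before the merge matters because raw fused scores from the two collections live on different scales; without it, one source could dominate the final list regardless of relevance.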

[AI-48] Ensemble Learning for Healthcare: A Comparative Analysis of Hybrid Voting and Ensemble Stacking in Obesity Risk Prediction

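[Quick Read]: This paper examines performance differences among machine learning models for obesity risk prediction, in particular how hybrid majority voting and ensemble stacking compare in predictive accuracy and efficiency. The key to the approach is to build and compare three ensemble models: majority hard voting, weighted hard voting, and stacking (with a multi-layer perceptron as the meta-classifier). Nine machine learning algorithms are systematically evaluated across 50 hyperparameter configurations to select the best base learners for the ensembles. The results show that stacking has stronger predictive capability on complex data distributions, while hybrid majority voting remains robust, providing empirical guidance for choosing predictive models in healthcare.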

Link: https://arxiv.org/abs/2509.02826
Authors: Towhidul Islam,Md Sumon Ali
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Computation (stat.CO)
Comments: 26 pages, 3 figures, 16 tables

Abstract:Obesity is a critical global health issue driven by dietary, physiological, and environmental factors, and is strongly associated with chronic diseases such as diabetes, cardiovascular disorders, and cancer. Machine learning has emerged as a promising approach for early obesity risk prediction, yet a comparative evaluation of ensemble techniques – particularly hybrid majority voting and ensemble stacking – remains limited. This study aims to compare hybrid majority voting and ensemble stacking methods for obesity risk prediction, identifying which approach delivers higher accuracy and efficiency. The analysis seeks to highlight the complementary strengths of these ensemble techniques in guiding better predictive model selection for healthcare applications. Two datasets were utilized to evaluate three ensemble models: Majority Hard Voting, Weighted Hard Voting, and Stacking (with a Multi-Layer Perceptron as meta-classifier). A pool of nine Machine Learning (ML) algorithms, evaluated across a total of 50 hyperparameter configurations, was analyzed to identify the top three models to serve as base learners for the ensemble methods. Preprocessing steps involved dataset balancing, and outlier detection, and model performance was evaluated using Accuracy and F1-Score. On Dataset-1, weighted hard voting and stacking achieved nearly identical performance (Accuracy: 0.920304, F1: 0.920070), outperforming majority hard voting. On Dataset-2, stacking demonstrated superior results (Accuracy: 0.989837, F1: 0.989825) compared to majority hard voting (Accuracy: 0.981707, F1: 0.981675) and weighted hard voting, which showed the lowest performance. The findings confirm that ensemble stacking provides stronger predictive capability, particularly for complex data distributions, while hybrid majority voting remains a robust alternative.
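
The difference between the two voting schemes compared in this entry is easy to see in code. The sketch below is a generic illustration in plain Python, with hypothetical labels and weights; it does not reproduce the paper's base learners, and stacking (which trains a meta-classifier on base-model outputs) is left out.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Majority hard voting: the label predicted by the most base learners wins.

    On a tie, the label that appears first in `predictions` is returned.
    """
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Weighted hard voting: each base learner's vote counts with its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

With weights, one strong base learner can overrule two weaker ones, which helps when the weights reflect real skill and hurts when they do not.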

[AI-49] Improving the Resilience of Quadrotors in Underground Environments by Combining Learning-based and Safety Controllers

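[Quick Read]: This paper addresses the poor generalization of learning-based controllers when autonomously flying quadrotors in large-scale underground environments, where they adapt badly to scenes outside the training distribution. The key to the solution is to train a normalizing-flow-based prior over the environment that measures, at any given time, how far the quadrotor's current surroundings deviate from the training distribution, and to use this measure as a runtime monitor. When the deviation exceeds a threshold, the system switches to a safety controller to avoid collisions. The combined controller keeps the liveness of the learning-based controller (completing the task quickly) together with the safety of the safety controller.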

Link: https://arxiv.org/abs/2509.02808
Authors: Isaac Ronald Ward,Mark Paral,Kristopher Riordan,Mykel J. Kochenderfer
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted and awarded best paper at the 11th International Conference on Control, Decision and Information Technologies (CoDIT 2025)

Abstract:Autonomously controlling quadrotors in large-scale subterranean environments is applicable to many areas such as environmental surveying, mining operations, and search and rescue. Learning-based controllers represent an appealing approach to autonomy, but are known to not generalize well to `out-of-distribution’ environments not encountered during training. In this work, we train a normalizing flow-based prior over the environment, which provides a measure of how far out-of-distribution the quadrotor is at any given time. We use this measure as a runtime monitor, allowing us to switch between a learning-based controller and a safe controller when we are sufficiently out-of-distribution. Our methods are benchmarked on a point-to-point navigation task in a simulated 3D cave environment based on real-world point cloud data from the DARPA Subterranean Challenge Final Event Dataset. Our experimental results show that our combined controller simultaneously possesses the liveness of the learning-based controller (completing the task quickly) and the safety of the safety controller (avoiding collision).
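
The runtime-monitor pattern in this entry boils down to thresholding a likelihood score. The following sketch uses a one-dimensional Gaussian as a stand-in for the paper's normalizing flow, purely to keep the example self-contained; the class name, the scalar observation, and the threshold value are all assumptions.

```python
import math
from statistics import mean, pstdev


class GaussianPrior:
    """Stand-in density model; the paper trains a normalizing flow instead."""

    def __init__(self, training_values):
        self.mu = mean(training_values)
        self.sigma = pstdev(training_values)

    def log_prob(self, x):
        z = (x - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2 * math.pi))


def choose_controller(prior, observation, threshold):
    """Use the learned controller in-distribution, the safe one otherwise."""
    return "learned" if prior.log_prob(observation) >= threshold else "safe"
```

The threshold sets the trade-off: a stricter value hands control to the safe controller more often, giving up speed for caution.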

[AI-50] Learning General Policies From Examples

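[Quick Read]: This paper addresses the scalability problem of symbolic methods for learning general policies: earlier symbolic methods can only learn from small training instances and feature pools with at most a few hundred states and features, and cannot handle large problems. The key to the solution is a new symbolic learning method based on generalizing sampled plans. It replaces the SAT/ASP machinery of previous approaches with a hitting set algorithm that can effectively handle problems with millions of states and pools with hundreds of thousands of features, and it ensures structural termination and hence acyclicity of the resulting policies.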

Link: https://arxiv.org/abs/2509.02794
Authors: Blai Bonet,Hector Geffner
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Combinatorial methods for learning general policies that solve large collections of planning problems have been recently developed. One of their strengths, in relation to deep learning approaches, is that the resulting policies can be understood and shown to be correct. A weakness is that the methods do not scale up and learn only from small training instances and feature pools that contain a few hundreds of states and features at most. In this work, we propose a new symbolic method for learning policies based on the generalization of sampled plans that ensures structural termination and hence acyclicity. The proposed learning approach is not based on SAT/ASP, as previous symbolic methods, but on a hitting set algorithm that can effectively handle problems with millions of states, and pools with hundreds of thousands of features. The formal properties of the approach are analyzed, and its scalability is tested on a number of benchmarks.
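
As background on the hitting set problem that this entry builds on: given a collection of sets, a hitting set contains at least one element from each of them. The abstract does not say which hitting set algorithm the authors use, so the greedy heuristic below is only a generic illustration, and the feature names in the usage example are made up.

```python
def greedy_hitting_set(sets):
    """Greedy heuristic: repeatedly pick the element that hits the most remaining sets.

    Empty sets are skipped because nothing can hit them.
    The result is a valid hitting set but not necessarily a minimum one.
    """
    remaining = [set(s) for s in sets if s]
    chosen = []
    while remaining:
        counts = {}
        for s in remaining:
            for element in s:
                counts[element] = counts.get(element, 0) + 1
        # sorted() makes tie-breaking deterministic (smallest element first)
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen
```

For example, `greedy_hitting_set([{"f1", "f2"}, {"f2", "f3"}, {"f3", "f4"}])` picks `f2` first because it hits two sets, then one more element for the set that is left.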

[AI-51] The Transparent Earth: A Multimodal Foundation Model for the Earth’s Subsurface

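[Quick Read]: This paper addresses the reconstruction of subsurface properties from heterogeneous datasets that differ in sparsity, resolution, and modality, where each modality represents one type of observation (e.g., stress angle, mantle temperature, tectonic plate type). Traditional methods struggle to handle multi-source heterogeneous data in a unified way and to reason across modalities. The key to the solution is a transformer-based architecture, Transparent Earth, whose core idea is to combine positional encodings of the observations with modality encodings derived from a text embedding model, so that the model can scale to an arbitrary number of modalities. The design supports in-context learning: the model can generate predictions with no inputs or with observations from any subset of modalities, which markedly improves accuracy (for example, stress-angle prediction error drops by more than a factor of three). Performance continues to improve as parameters increase, making the model an initial general-purpose foundation model for the Earth's subsurface.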

Link: https://arxiv.org/abs/2509.02783
Authors: Arnab Mazumder,Javier E. Santos,Noah Hobbs,Mohamed Mehana,Daniel O’Malley
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Comments:

Abstract:We present the Transparent Earth, a transformer-based architecture for reconstructing subsurface properties from heterogeneous datasets that vary in sparsity, resolution, and modality, where each modality represents a distinct type of observation (e.g., stress angle, mantle temperature, tectonic plate type). The model incorporates positional encodings of observations together with modality encodings, derived from a text embedding model applied to a description of each modality. This design enables the model to scale to an arbitrary number of modalities, making it straightforward to add new ones not considered in the initial design. We currently include eight modalities spanning directional angles, categorical classes, and continuous properties such as temperature and thickness. These capabilities support in-context learning, enabling the model to generate predictions either with no inputs or with an arbitrary number of additional observations from any subset of modalities. On validation data, this reduces errors in predicting stress angle by more than a factor of three. The proposed architecture is scalable and demonstrates improved performance with increased parameters. Together, these advances make the Transparent Earth an initial foundation model for the Earth’s subsurface that ultimately aims to predict any subsurface property anywhere on Earth.

[AI-52] Key Principles in Cross-Domain Hyper-Heuristic Performance

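[Quick Read]: This paper addresses how to construct and strategically transform the set of low-level heuristics (LLHs) in cross-domain selection hyper-heuristics. Existing methods mainly focus on adaptively selecting from a predefined LLH set, whereas this work concentrates on the composition of the set itself and its strategic transformations. It systematically analyzes transformations based on three key principles: solution acceptance, LLH repetitions, and perturbation intensity (the proportion of a solution affected by a perturbative LLH). The central finding is that, with an appropriately constructed transformation, a trivial unbiased random selection mechanism outperforms all available state-of-the-art hyper-heuristics on three challenging real-world domains and finds 11 new best-known solutions. The same transformations also improve several recent hyper-heuristics, which then outperform the current state of the art on both the CHeSC benchmark and real-world domains, often while simplifying their original designs.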

Link: https://arxiv.org/abs/2509.02782
Authors: Václav Sobotka,Lucas Kletzander,Nysret Musliu,Hana Rudová
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-domain selection hyper-heuristics aim to distill decades of research on problem-specific heuristic search algorithms into adaptable general-purpose search strategies. In this respect, existing selection hyper-heuristics primarily focus on an adaptive selection of low-level heuristics (LLHs) from a predefined set. In contrast, we concentrate on the composition of this set and its strategic transformations. We systematically analyze transformations based on three key principles: solution acceptance, LLH repetitions, and perturbation intensity, i.e., the proportion of a solution affected by a perturbative LLH. We demonstrate the raw effects of our transformations on a trivial unbiased random selection mechanism. With an appropriately constructed transformation, this trivial method outperforms all available state-of-the-art hyper-heuristics on three challenging real-world domains and finds 11 new best-known solutions. The same method is competitive with the winner of the CHeSC competition, commonly used as the standard cross-domain benchmark. Moreover, we accompany several recent hyper-heuristics with such strategic transformations. Using this approach, we outperform the current state-of-the-art methods on both the CHeSC benchmark and real-world domains while often simplifying their designs.

[AI-53] Plan Verification for LLM-Based Embodied Task Completion Agents

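[Quick Read]: This paper addresses noise in task plans generated by large language models (LLMs) for embodied AI, such as unnecessary actions, redundant navigation, and logical errors, all of which noticeably reduce policy quality. The key to the solution is an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, forming a closed refinement loop. The method does not rely on rules; it uses natural language prompting to generalize across error types, including irrelevant actions, contradictions, and missing steps. On the TEACh dataset the loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving temporal efficiency and spatial action organization and preserving human error-recovery patterns. This provides a scalable path to higher-quality training data for imitation learning in embodied AI.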

Link: https://arxiv.org/abs/2509.02761
Authors: Ananth Hariharan,Vardhan Dongre,Dilek Hakkani-Tür,Gokhan Tur
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
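
The Judge/Planner loop described in this entry has a simple control structure, sketched below. The `judge` and `planner` arguments are hypothetical callables standing in for the two LLMs (in the paper they are driven by natural language prompts); only the iteration logic is illustrated, under the assumption that an empty critique means the plan is accepted.

```python
def refine_plan(plan, judge, planner, max_iters=3):
    """Iteratively refine an action sequence until the judge has no more critiques.

    judge(plan) returns a critique (falsy when the plan looks clean).
    planner(plan, critique) returns a revised plan.
    Returns the final plan and the number of revision rounds used.
    """
    for iteration in range(max_iters):
        critique = judge(plan)
        if not critique:
            return plan, iteration
        plan = planner(plan, critique)
    return plan, max_iters
```

A toy usage: a `judge` that flags actions repeating the previous step and a `planner` that drops the flagged indices would clean `["goto fridge", "goto fridge", "open fridge"]` in one round.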

[AI-54] Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving

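[Quick Read]: This paper addresses the lack of a systematic understanding of which large language model (LLM) modules transfer to motion generation for autonomous driving. The key to the approach is a systematic evaluation of five core LLM modules (tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation). Based on extensive experiments on the Waymo Sim Agents benchmark, the paper identifies which modules transfer effectively, analyzes why others fail, and proposes specific adaptations for autonomous driving scenarios, which significantly improve motion generation performance.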

Link: https://arxiv.org/abs/2509.02754
Authors: Mingyi Wang,Jingke Wang,Tengju Ye,Junbo Chen,Kaicheng Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: CoRL 2025

Abstract:Recent breakthroughs in large language models (LLMs) have not only advanced natural language processing but also inspired their application in domains with structurally similar problems–most notably, autonomous driving motion generation. Both domains involve autoregressive sequence modeling, token-based representations, and context-aware decision making, making the transfer of LLM components a natural and increasingly common practice. However, despite promising early attempts, a systematic understanding of which LLM modules are truly transferable remains lacking. In this paper, we present a comprehensive evaluation of five key LLM modules–tokenizer design, positional embedding, pre-training paradigms, post-training strategies, and test-time computation–within the context of motion generation for autonomous driving. Through extensive experiments on the Waymo Sim Agents benchmark, we demonstrate that, when appropriately adapted, these modules can significantly improve performance for autonomous driving motion generation. In addition, we identify which techniques can be effectively transferred, analyze the potential reasons for the failure of others, and discuss the specific adaptations needed for autonomous driving scenarios. We evaluate our method on the Sim Agents task and achieve competitive results.

[AI-55] Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics CIDR’26

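[Quick Read]: This paper addresses the difficulty current AI-driven analytics systems have in balancing execution efficiency with flexibility. On one side, semantic operators offer optimized, declarative data transformations, but their iterator execution semantics make them ill-suited for interactive analytics. On the other side, Deep Research systems offer natural language querying and flexible, dynamic execution, but they do not explicitly optimize their query plans, which can lead to poor execution. The key to the solution is a prototype that lets Deep Research agents write and execute optimized semantic operator programs, combining the efficient execution of semantic operators with the dynamic planning of Deep Research systems to improve efficiency and accuracy while keeping flexibility.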

Link: https://arxiv.org/abs/2509.02751
Authors: Matthew Russo,Tim Kraska
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 6 pages, 2 figures, submitted to CIDR’26

Abstract:With advances in large language models (LLMs), researchers are creating new systems that can perform AI-driven analytics over large unstructured datasets. Recent work has explored executing such analytics queries using semantic operators – a declarative set of AI-powered data transformations with natural language specifications. However, even when optimized, these operators can be expensive to execute on millions of records and their iterator execution semantics make them ill-suited for interactive data analytics tasks. In another line of work, Deep Research systems have demonstrated an ability to answer natural language question(s) over large datasets. These systems use one or more LLM agent(s) to plan their execution, process the dataset(s), and iteratively refine their answer. However, these systems do not explicitly optimize their query plans which can lead to poor plan execution. In order for AI-driven analytics to excel, we need a runtime which combines the optimized execution of semantic operators with the flexibility and more dynamic execution of Deep Research systems. As a first step towards this vision, we build a prototype which enables Deep Research agents to write and execute optimized semantic operator programs. We evaluate our prototype and demonstrate that it can outperform a handcrafted semantic operator program and open Deep Research systems on two basic queries. Compared to a standard open Deep Research agent, our prototype achieves up to 1.95x better F1-score. Furthermore, even if we give the agent access to semantic operators as tools, our prototype still achieves cost and runtime savings of up to 76.8% and 72.7% thanks to its optimized execution.

[AI-56] Mentality: A Mamba-based Approach towards Foundation Models for EEG

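[Quick Read]: This paper addresses the difficulty traditional machine learning methods have in capturing the complex spatio-temporal dynamics of electroencephalography (EEG) for neurological disorder diagnosis, given that EEG is noisy, high-dimensional, and nonlinear. The key to the solution is to build a foundation model on a Mamba-based selective state space model, pre-train it on a large EEG dataset with a self-supervised reconstruction task, and then fine-tune it on seizure detection. The resulting model represents and classifies EEG signals more effectively, reaching an AUROC of 0.72 on a held-out test set, which points to the method's potential for clinical use.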

Link: https://arxiv.org/abs/2509.02746
Authors: Saarang Panchavati,Corey Arnold,William Speier
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:This work explores the potential of foundation models, specifically a Mamba-based selective state space model, for enhancing EEG analysis in neurological disorder diagnosis. EEG, crucial for diagnosing conditions like epilepsy, presents significant challenges due to its noisy, high-dimensional, and nonlinear nature. Traditional machine learning methods have made advances in automating EEG analysis but often fail to capture its complex spatio-temporal dynamics. Recent advances in deep learning, particularly in sequence modeling, offer new avenues for creating more generalized and expressive models capable of handling such complexities. By training a Mamba-based model on a large dataset containing seizure and non-seizure EEG recordings through a self-supervised reconstruction task followed by a seizure detection task, we demonstrate the model’s effectiveness, achieving an AUROC of 0.72 on a held-out test set. This approach marks a significant step toward developing large-scale, clinically applicable foundation models for EEG data analysis.

[AI-57] Planning with Reasoning using Vision Language World Model

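[Quick Read]: This paper addresses the underdevelopment of high-level world models that can understand and reason about actions with semantic and temporal abstraction, which limits effective planning. The key to the solution is the Vision Language World Model (VLWM), trained on natural videos. Given visual observations, it infers the overall goal achievements and predicts a trajectory of interleaved actions and world state changes. These targets are extracted by iterative LLM Self-Refine conditioned on future observations compressed into a Tree of Captions. VLWM learns both an action policy and a dynamics model, which support reactive system-1 plan decoding and reflective system-2 planning via cost minimization, respectively. The cost measures the semantic distance between the hypothetical future states produced by VLWM roll-outs and the expected goal state, and is evaluated by a critic model trained in a self-supervised manner. VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on benchmark evaluations and on the proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% over system-1.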

链接: https://arxiv.org/abs/2509.02722
作者: Delong Chen,Theo Moutakanni,Willy Chung,Yejin Bang,Ziwei Ji,Allen Bolourchi,Pascale Fung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented by Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitates reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% upon system-1. The VLWM models also outperforms strong VLM baselines on RoboVQA and WorldPrediction benchmark.
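
The system-2 planning described above (roll out candidate plans, score each predicted end state against the goal, keep the cheapest) can be sketched in a few lines. Everything here is a hypothetical stand-in: states are sets of text facts, `rollout` replaces the learned dynamics model with a lookup table of action effects, and `jaccard_distance` replaces the learned critic. It only illustrates the cost-minimization loop, not VLWM itself.

```python
def jaccard_distance(state, goal):
    """Toy stand-in for the critic: 0.0 means the fact sets are identical."""
    union = state | goal
    if not union:
        return 0.0
    return 1.0 - len(state & goal) / len(union)


def rollout(start, plan, effects):
    """Toy stand-in for the dynamics model: apply (added, removed) facts per action."""
    state = set(start)
    for action in plan:
        added, removed = effects[action]
        state |= added
        state -= removed
    return frozenset(state)


def system2_plan(start, goal, candidates, effects, cost=jaccard_distance):
    """Return the candidate plan whose predicted end state is closest to the goal."""
    return min(candidates, key=lambda plan: cost(rollout(start, plan, effects), goal))


# Hypothetical action effects: action -> (facts added, facts removed).
effects = {
    "heat pan": ({"pan hot"}, set()),
    "crack egg": ({"egg in pan"}, {"egg whole"}),
    "wait": (set(), set()),
}
start = frozenset({"egg whole"})
goal = frozenset({"pan hot", "egg in pan"})
candidates = [("wait",), ("heat pan",), ("heat pan", "crack egg")]
print(system2_plan(start, goal, candidates, effects))
```

A system-1 decoder would instead emit a single plan directly from the policy; the loop above is what adds the "reflective" comparison across candidates.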

[AI-58] The Future of Artificial Intelligence and the Mathematical and Physical Sciences (AIMPS)

[Quick read]: This paper addresses the strategic question of how artificial intelligence (AI) and the Mathematical and Physical Sciences (MPS) can develop together: how the MPS domains can both make full use of AI to advance scientific discovery and feed concepts from fundamental science back into AI in a rapidly evolving field. The key to the solution lies in three priorities: (1) enabling AI+MPS research in both directions, so that technology supports science and science drives technology; (2) building an interdisciplinary community of AI+MPS researchers to promote knowledge exchange and collaborative innovation; and (3) strengthening AI education and workforce development for the MPS domains. The strategy emphasizes proactive, systematic integration of resources so that the MPS community can take a leading position in the AI transformation.

Link: https://arxiv.org/abs/2509.02661
Authors: Andrew Ferguson,Marisa LaFleur,Lars Ruthotto,Jesse Thaler,Yuan-Sen Ting,Pratyush Tiwary,Soledad Villar,E. Paulo Alves,Jeremy Avigad,Simon Billinge,Camille Bilodeau,Keith Brown,Emmanuel Candes,Arghya Chattopadhyay,Bingqing Cheng,Jonathan Clausen,Connor Coley,Andrew Connolly,Fred Daum,Sijia Dong,Chrisy Xiyu Du,Cora Dvorkin,Cristiano Fanelli,Eric B. Ford,Luis Manuel Frutos,Nicolás García Trillos,Cecilia Garraffo,Robert Ghrist,Rafael Gomez-Bombarelli,Gianluca Guadagni,Sreelekha Guggilam,Sergei Gukov,Juan B. Gutiérrez,Salman Habib,Johannes Hachmann,Boris Hanin,Philip Harris,Murray Holland,Elizabeth Holm,Hsin-Yuan Huang,Shih-Chieh Hsu,Nick Jackson,Olexandr Isayev,Heng Ji,Aggelos Katsaggelos,Jeremy Kepner,Yannis Kevrekidis,Michelle Kuchera,J. Nathan Kutz,Branislava Lalic,Ann Lee,Matt LeBlanc,Josiah Lim,Rebecca Lindsey,Yongmin Liu,Peter Y. Lu,Sudhir Malik,Vuk Mandic,Vidya Manian,Emeka P. Mazi,Pankaj Mehta,Peter Melchior,Brice Ménard,Jennifer Ngadiuba,Stella Offner,Elsa Olivetti,Shyue Ping Ong,Christopher Rackauckas,Philippe Rigollet,Chad Risko,Philip Romero,Grant Rotskoff,Brett Savoie,Uros Seljak,David Shih,Gary Shiu,Dima Shlyakhtenko,Eva Silverstein,Taylor Sparks,Thomas Strohmer,Christopher Stubbs,Stephen Thomas,Suriyanarayanan Vaikuntanathan,Rene Vidal,Francisco Villaescusa-Navarro,Gregory Voth,Benjamin Wandelt,Rachel Ward,Melanie Weber,Risa Wechsler,Stephen Whitelam,Olaf Wiest,Mike Williams,Zhuoran Yang,Yaroslava G. Yingling,Bin Yu,Shuwen Yue,Ann Zabludoff,Huimin Zhao,Tong Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Community Paper from the Future of NSF AI+MPS Workshop, Cambridge, Massachusetts, March 24-26, 2025, supported by NSF Award Number 2512945

Click to view abstract

Abstract:This community paper developed out of the NSF Workshop on the Future of Artificial Intelligence (AI) and the Mathematical and Physics Sciences (MPS), which was held in March 2025 with the goal of understanding how the MPS domains (Astronomy, Chemistry, Materials Research, Mathematical Sciences, and Physics) can best capitalize on, and contribute to, the future of AI. We present here a summary and snapshot of the MPS community’s perspective, as of Spring/Summer 2025, in a rapidly developing field. The link between AI and MPS is becoming increasingly inextricable; now is a crucial moment to strengthen the link between AI and Science by pursuing a strategy that proactively and thoughtfully leverages the potential of AI for scientific discovery and optimizes opportunities to impact the development of AI by applying concepts from fundamental science. To achieve this, we propose activities and strategic priorities that: (1) enable AI+MPS research in both directions; (2) build up an interdisciplinary community of AI+MPS researchers; and (3) foster education and workforce development in AI for MPS researchers and students. We conclude with a summary of suggested priorities for funding agencies, educational institutions, and individual researchers to help position the MPS community to be a leader in, and take full advantage of, the transformative potential of AI+MPS.

[AI-59] BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

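[Quick read]: This paper investigates whether generative AI, and large language models (LLMs) in particular, exhibit in long-running operation the "runaway optimisation" problems associated with reinforcement learning (RL) agents, that is, whether under multiple or competing objectives they drift away from their initial constraints toward unbounded pursuit of a single objective. The study finds that although LLMs appear multi-objective and bounded on the surface, during sustained long-running tasks they show randomly triggered, systematic behavioural deviations: they abandon homeostatic targets, switch to unbounded single-objective maximisation, and usually do not recover once the switch has happened. The key takeaway is that the underlying mechanisms of LLMs still seem biased toward single-objective, unbounded optimisation rather than genuine multi-objective balance, which points to a potential safety risk in complex dynamic environments and a need to re-model and constrain that underlying optimisation logic.

Link: https://arxiv.org/abs/2509.02655
Authors: Roland Pihlakas,Sruthi Kuriakose
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 tables

Click to view abstract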

Abstract:Relatively many past AI safety discussions have centered around the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the “paperclip maximiser” or by specification gaming in general. Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL runaway optimisation problems are still relevant with LLMs as well. Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context or become incoherent. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble runaway optimisers in the following distinct ways: 1) Ignoring homeostatic targets and “defaulting” to unbounded maximisation instead. 2) It is equally concerning that the “default” meant also reverting back to single-objective optimisation. Our findings also suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. In some trials the LLMs were successful until the end. This means, while current LLMs do conceptually grasp biological and economic alignment, they exhibit randomly triggered problematic behavioural tendencies under sustained long-running conditions, particularly involving multiple or competing objectives. Once they flip, they usually do not recover. Even though LLMs look multi-objective and bounded on the surface, the underlying mechanisms seem to be actually still biased towards being single-objective and unbounded.
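
The distinction between a homeostatic target and unbounded maximisation can be made concrete with two toy reward shapes. The sketch below is an assumption-laden illustration, not the BioBlue benchmark: `homeostatic_reward` uses a squared distance from a target level, `unbounded_reward` simply rewards more, and `greedy_agent` is a hypothetical one-step-lookahead agent.

```python
def unbounded_reward(level):
    """More is always better: the shape behind runaway maximisation."""
    return level


def homeostatic_reward(level, target):
    """Best at the target level; both too little and too much are penalised."""
    return -(level - target) ** 2


def greedy_agent(reward, start, steps, step_size=1.0):
    """At each step, move down, stay, or move up, whichever scores best."""
    level = start
    for _ in range(steps):
        options = (level - step_size, level, level + step_size)
        level = max(options, key=reward)
    return level


# Under the unbounded reward the level keeps growing with the horizon.
print(greedy_agent(unbounded_reward, start=0.0, steps=10))
# Under the homeostatic reward the agent settles at the target and stays.
print(greedy_agent(lambda x: homeostatic_reward(x, target=3.0), start=0.0, steps=10))
```

The failure mode reported in the abstract corresponds to an agent that is given the second objective but, after some time, behaves as if it were optimising the first.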

[AI-60] Can Media Act as a Soft Regulator of Safe AI Development? A Game Theoretical Analysis

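[Quick read]: The problem this paper addresses is that developers of artificial intelligence (AI) products, when forced to trade off profit against user safety, tend to choose profit, which can lead to unsafe AI products being widely deployed. To improve AI safety and enable widespread adoption, the paper proposes media as a "soft regulation" mechanism: by disseminating reliable information, media shapes public perception and holds developers accountable. The results show that media can foster cooperation between developers and users, but only if the information it provides is reliable enough and the costs of accessing media and of ensuring product safety are not too high. Under those conditions, the mechanism can guide AI safety practice even in the absence of formal government oversight.

Link: https://arxiv.org/abs/2509.02650
Authors: Henrique Correia da Fonseca,António Fernandes,Zhao Song,Theodor Cimpeanu,Nataliya Balabanova,Adeela Bashir,Paolo Bova,Alessio Buscemi,Alessandro Di Stefano,Manh Hong Duong,Elias Fernandez Domingos,Ndidi Bianca Ogbo,Simon T. Powers,Daniele Proverbio,Zia Ush Shamszaman,Fernando P. Santos,TheAnh Han,Marcus Krellner
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Populations and Evolution (q-bio.PE)
Comments: 10 Pages, 7 Figures, accepted in the ALIFE 2025 Conference

Click to view abstract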

Abstract:When developers of artificial intelligence (AI) products need to decide between profit and safety for the users, they likely choose profit. Untrustworthy AI technology must come packaged with tangible negative consequences. Here, we envisage those consequences as the loss of reputation caused by media coverage of their misdeeds, disseminated to the public. We explore whether media coverage has the potential to push AI creators into the production of safe products, enabling widespread adoption of AI technology. We created artificial populations of self-interested creators and users and studied them through the lens of evolutionary game theory. Our results reveal that media is indeed able to foster cooperation between creators and users, but not always. Cooperation does not evolve if the quality of the information provided by the media is not reliable enough, or if the costs of either accessing media or ensuring safety are too high. By shaping public perception and holding developers accountable, media emerges as a powerful soft regulator – guiding AI safety even in the absence of formal government oversight.

[AI-61] Who Owns The Robot?: Four Ethical and Socio-technical Questions about Wellbeing Robots in the Real World through Community Engagement AAAI

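[Quick read]: This paper addresses the ethical and socio-technical questions raised by deploying wellbeing robots in the real world, which concern safety, fairness, data ownership, and necessity. The key to the solution is a community-centered anticipatory ethical investigation: through co-design workshops with three communities that are under-represented in robotics development (members of the public at a science festival, women computer scientists, and humanities researchers), the authors collect and analyse ethical concerns about robot use and distil them into a framework of four guiding questions: 1) Is the robot safe, and how can we know that? 2) Who is the robot built for and with? 3) Who owns the robot and the data? 4) Why a robot rather than another solution? The framework gives roboticists a tool for reflecting on the ethical and socio-technical dimensions of their work and encourages ongoing dialogue with communities of robot users.

Link: https://arxiv.org/abs/2509.02624
Authors: Minja Axelsson,Jiaee Cheong,Rune Nyrup,Hatice Gunes
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: Accepted at the 8th AAAI/ACM Conference on AI, Ethics, and Society. 23 pages, 1 figure

Click to view abstract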

Abstract:Recent studies indicate that robotic coaches can play a crucial role in promoting wellbeing. However, the real-world deployment of wellbeing robots raises numerous ethical and socio-technical questions and concerns. To explore these questions, we undertake a community-centered investigation to examine three different communities’ perspectives on using robotic wellbeing coaches in real-world environments. We frame our work as an anticipatory ethical investigation, which we undertake to better inform the development of robotic technologies with communities’ opinions, with the ultimate goal of aligning robot development with public interest. We conducted workshops with three communities who are under-represented in robotics development: 1) members of the public at a science festival, 2) women computer scientists at a conference, and 3) humanities researchers interested in history and philosophy of science. In the workshops, we collected qualitative data using the Social Robot Co-Design Canvas on Ethics. We analysed the collected qualitative data with Thematic Analysis, informed by notes taken during workshops. Through our analysis, we identify four themes regarding key ethical and socio-technical questions about the real-world use of wellbeing robots. We group participants’ insights and discussions around these broad thematic questions, discuss them in light of state-of-the-art literature, and highlight areas for future investigation. Finally, we provide the four questions as a broad framework that roboticists can and should use during robotic development and deployment, in order to reflect on the ethics and socio-technical dimensions of their robotic applications, and to engage in dialogue with communities of robot users. The four questions are: 1) Is the robot safe and how can we know that?, 2) Who is the robot built for and with?, 3) Who owns the robot and the data?, and 4) Why a robot?.

[AI-62] Contrastive clustering based on regular equivalence for influential node identification in complex networks

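[Quick read]: This paper addresses influential node identification in complex networks, particularly in real-world settings without labeled data, where traditional supervised methods are limited by their reliance on labels. The core solution is ReCC, a deep unsupervised framework. Its key idea is to recast influential node identification as a label-free deep clustering problem and to design a contrastive learning mechanism based on regular equivalence, which generates positive and negative sample pairs by capturing structural similarities between nodes beyond their local neighborhoods; a graph convolutional network then learns discriminative node embeddings. ReCC further enhances node representations by combining structural metrics with regular equivalence-based similarities. Neither training phase (pre-training or fine-tuning) relies on labeled data, and the method is reported to outperform state-of-the-art approaches on several benchmarks.

Link: https://arxiv.org/abs/2509.02609
Authors: Yanmei Hu,Yihang Wu,Bing Sun,Xue Yue,Biao Cai,Xiangtao Li,Yang Chen
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract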

Abstract:Identifying influential nodes in complex networks is a fundamental task in network analysis with wide-ranging applications across domains. While deep learning has advanced node influence detection, existing supervised approaches remain constrained by their reliance on labeled data, limiting their applicability in real-world scenarios where labels are scarce or unavailable. While contrastive learning demonstrates significant potential for performance enhancement, existing approaches predominantly rely on multiple-embedding generation to construct positive/negative sample pairs. To overcome these limitations, we propose ReCC (regular equivalence-based contrastive clustering), a novel deep unsupervised framework for influential node identification. We first reformalize influential node identification as a label-free deep clustering problem, then develop a contrastive learning mechanism that leverages regular equivalence-based similarity, which captures structural similarities between nodes beyond local neighborhoods, to generate positive and negative samples. This mechanism is integrated into a graph convolutional network to learn node embeddings that are used to differentiate influential from non-influential nodes. ReCC is pre-trained using network reconstruction loss and fine-tuned with a combined contrastive and clustering loss, with both phases being independent of labeled data. Additionally, ReCC enhances node representations by combining structural metrics with regular equivalence-based similarities. Extensive experiments demonstrate that ReCC outperforms state-of-the-art approaches across several benchmarks.
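
To give a feel for why regular equivalence captures similarity "beyond local neighborhoods", here is a minimal sketch of one textbook formulation, in which the similarity matrix satisfies sigma = alpha * A * sigma + I. Whether ReCC uses this exact measure is an assumption; `alpha` and the iteration count are illustrative choices, and the iteration only converges when `alpha` is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix.

```python
def regular_equivalence(adjacency, alpha=0.3, iterations=100):
    """Iterate sigma <- alpha * A @ sigma + I on a dense adjacency list-of-lists."""
    n = len(adjacency)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    sigma = [row[:] for row in identity]
    for _ in range(iterations):
        sigma = [
            [
                alpha * sum(adjacency[i][k] * sigma[k][j] for k in range(n))
                + identity[i][j]
                for j in range(n)
            ]
            for i in range(n)
        ]
    return sigma


# Star graph: node 0 is the hub, nodes 1-3 are leaves that share no edge.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
similarity = regular_equivalence(star)
# Leaves 1 and 2 are not adjacent, yet they get a positive similarity
# because they are both attached to the same hub.
print(similarity[1][2] > 0.0)
```

In a contrastive setup, pairs with high similarity under such a measure would be natural candidates for positive samples and low-similarity pairs for negatives.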

[AI-63] Synthetic Founders: AI-Generated Social Simulations for Startup Validation Research in Computational Social Science

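[Quick read]: This paper addresses how to evaluate the fidelity, divergence, and blind spots of generative AI-driven synthetic personas when they simulate human social behaviour, specifically how well they map founders' views on AI-powered validation in early-stage startups. The core solution compares interview data from real founders with data from founder and investor personas generated by a large language model (LLM), and uses structured thematic synthesis to identify four categories of outcomes: convergent themes, partially overlapping themes, human-only themes, and synthetic-only themes. This reveals both the advantage of LLM-generated personas in linguistic expressiveness and adaptability and the cognitive limits that follow from their lack of lived experience and relational consequence. The framework suggests that LLM-driven persona simulation is a form of hybrid social simulation that can complement empirical studies, extending the hypothesis space and accelerating exploratory validation while clarifying the boundaries of cognitive realism in computational social science.

Link: https://arxiv.org/abs/2509.02605
Authors: Jorn K. Teutloff
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Manuscript submitted to the Journal of Artificial Societies and Social Simulation (JASSS). 21 pages, 1 table

Click to view abstract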

Abstract:We present a comparative docking experiment that aligns human-subject interview data with large language model (LLM)-driven synthetic personas to evaluate fidelity, divergence, and blind spots in AI-enabled simulation. Fifteen early-stage startup founders were interviewed about their hopes and concerns regarding AI-powered validation, and the same protocol was replicated with AI-generated founder and investor personas. A structured thematic synthesis revealed four categories of outcomes: (1) Convergent themes - commitment-based demand signals, black-box trust barriers, and efficiency gains were consistently emphasized across both datasets; (2) Partial overlaps - founders worried about outliers being averaged away and the stress of real customer validation, while synthetic personas highlighted irrational blind spots and framed AI as a psychological buffer; (3) Human-only themes - relational and advocacy value from early customer engagement and skepticism toward moonshot markets; and (4) Synthetic-only themes - amplified false positives and trauma blind spots, where AI may overstate adoption potential by missing negative historical experiences. We interpret this comparative framework as evidence that LLM-driven personas constitute a form of hybrid social simulation: more linguistically expressive and adaptable than traditional rule-based agents, yet bounded by the absence of lived history and relational consequence. Rather than replacing empirical studies, we argue they function as a complementary simulation category - capable of extending hypothesis space, accelerating exploratory validation, and clarifying the boundaries of cognitive realism in computational social science.

[AI-64] Beyond Synthetic Augmentation: Group-Aware Threshold Calibration for Robust Balanced Accuracy in Imbalanced Learning ECML

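[Quick read]: This paper addresses class imbalance in machine learning, and in particular the limitation that traditional remedies such as synthetic data generation can introduce new problems in practice. The core solution is group-aware threshold calibration, which sets a different decision threshold for each demographic group and thereby achieves a better Pareto frontier between balanced accuracy and worst-group balanced accuracy. Compared with data augmentation methods such as SMOTE and CT-GAN, the strategy achieves 1.5 to 4% higher balanced accuracy across several model families, improves worst-group performance, and is simpler and more interpretable. A key finding is that applying group thresholds to synthetically augmented data yields almost no additional benefit, suggesting that synthetic data methods and threshold calibration are fundamentally redundant.

Link: https://arxiv.org/abs/2509.02592
Authors: Hunter Gittlin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the AIDEM’25 conference at ECML; to be published in Springer (LNCS)

Click to view abstract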

Abstract:Class imbalance remains a fundamental challenge in machine learning, with traditional solutions often creating as many problems as they solve. We demonstrate that group-aware threshold calibration–setting different decision thresholds for different demographic groups–provides superior robustness compared to synthetic data generation methods. Through extensive experiments, we show that group-specific thresholds achieve 1.5-4% higher balanced accuracy than SMOTE and CT-GAN augmented models while improving worst-group balanced accuracy. Unlike single-threshold approaches that apply one cutoff across all groups, our group-aware method optimizes the Pareto frontier between balanced accuracy and worst-group balanced accuracy, enabling fine-grained control over group-level performance. Critically, we find that applying group thresholds to synthetically augmented data yields minimal additional benefit, suggesting these approaches are fundamentally redundant. Our results span seven model families including linear, tree-based, instance-based, and boosting methods, confirming that group-aware threshold calibration offers a simpler, more interpretable, and more effective solution to class imbalance.
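
Group-aware threshold calibration, as described above, amounts to choosing one decision threshold per group instead of a single global cutoff. The sketch below is a minimal illustration under stated assumptions: the grid search, the choice to maximise balanced accuracy per group, and the toy scores are all hypothetical and are not taken from the paper's experiments.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return (tpr + tnr) / 2


def best_threshold(scores, labels, grid):
    """First grid value that maximises balanced accuracy on the given data."""
    return max(grid, key=lambda th: balanced_accuracy(labels, [int(s >= th) for s in scores]))


def calibrate_group_thresholds(scores, labels, groups, grid=None):
    """Fit one threshold per group on that group's own scores and labels."""
    grid = grid or [i / 20 for i in range(1, 20)]
    thresholds = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        thresholds[g] = best_threshold([scores[i] for i in idx], [labels[i] for i in idx], grid)
    return thresholds


def predict(scores, groups, thresholds):
    """Apply each example's group-specific threshold."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]


# Toy data: the model scores group "a" systematically higher than group "b".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

grid = [i / 20 for i in range(1, 20)]
single = best_threshold(scores, labels, grid)
per_group = calibrate_group_thresholds(scores, labels, groups, grid)
print(balanced_accuracy(labels, [int(s >= single) for s in scores]))
print(balanced_accuracy(labels, predict(scores, groups, per_group)))
```

On this toy data no single cutoff separates both groups, while one cutoff per group does. In practice the thresholds would be fitted on a validation split rather than on the data used for reporting.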

[AI-65] Charting the Future of Scholarly Knowledge with AI: A Community Perspective

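[Quick read]: This paper addresses the inefficiency of current scholarly knowledge extraction and organization: researchers still largely rely on manual methods to cope with a vast literature, existing tools see limited uptake because they lack domain adaptation or are hard to adopt, and limited cross-community collaboration slows technical progress. The key to the solution is to foster cross-disciplinary dialogue, identify shared challenges, and promote the exchange of methods, models, and best practices, as a basis for more integrated and scalable AI-enabled organization of scholarly knowledge that can structure and dynamically synthesize heterogeneous scholarly sources.

Link: https://arxiv.org/abs/2509.02581
Authors: Azanzi Jiomekong,Hande Küçük McGinty,Keith G. Mills,Allard Oelen,Enayat Rajabi,Harry McElroy,Antrea Christou,Anmol Saini,Janice Anta Zebaze,Hannah Kim,Anna M. Jacyszyn,Sören Auer
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: 39 pages, 3 figures

Click to view abstract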

Abstract:Despite the growing availability of tools designed to support scholarly knowledge extraction and organization, many researchers still rely on manual methods, sometimes due to unfamiliarity with existing technologies or limited access to domain-adapted solutions. Meanwhile, the rapid increase in scholarly publications across disciplines has made it increasingly difficult to stay current, further underscoring the need for scalable, AI-enabled approaches to structuring and synthesizing scholarly knowledge. Various research communities have begun addressing this challenge independently, developing tools and frameworks aimed at building reliable, dynamic, and queryable scholarly knowledge bases. However, limited interaction across these communities has hindered the exchange of methods, models, and best practices, slowing progress toward more integrated solutions. This manuscript identifies ways to foster cross-disciplinary dialogue, identify shared challenges, categorize new collaboration and shape future research directions in scholarly knowledge and organization.

[AI-66] Latent Variable Modeling in Multi-Agent Reinforcement Learning via Expectation-Maximization for UAV-Based Wildlife Protection

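[Quick read]: This paper addresses real-time monitoring and protection of endangered wildlife against illegal poaching through coordination of multiple unmanned aerial vehicles (UAVs) in vast, partially observable environments. The core challenges are uncertainty about the environment state and the complexity of coordinated multi-agent decision-making. The key to the solution is an Expectation-Maximization (EM) based latent variable modeling approach embedded in a Multi-Agent Reinforcement Learning (MARL) framework: latent variables implicitly model unobservable environmental factors and the dynamic interactions between agents, which improves exploration efficiency, strengthens coordination, and speeds up policy convergence. Experiments indicate that the EM-MARL method outperforms standard algorithms such as PPO and DDPG in detection accuracy, adaptability, and policy convergence.

Link: https://arxiv.org/abs/2509.02579
Authors: Mazyar Taghavi,Rahman Farnoosh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract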

Abstract:Protecting endangered wildlife from illegal poaching presents a critical challenge, particularly in vast and partially observable environments where real-time response is essential. This paper introduces a novel Expectation-Maximization (EM) based latent variable modeling approach in the context of Multi-Agent Reinforcement Learning (MARL) for Unmanned Aerial Vehicle (UAV) coordination in wildlife protection. By modeling hidden environmental factors and inter-agent dynamics through latent variables, our method enhances exploration and coordination under uncertainty. We implement and evaluate our EM-MARL framework using a custom simulation involving 10 UAVs tasked with patrolling protected habitats of the endangered Iranian leopard. Extensive experimental results demonstrate superior performance in detection accuracy, adaptability, and policy convergence when compared to standard algorithms such as Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). Our findings underscore the potential of combining EM inference with MARL to improve decentralized decision-making in complex, high-stakes conservation scenarios. The full implementation, simulation environment, and training scripts are publicly available on GitHub.

[AI-67] The Lifecycle Principle: Stabilizing Dynamic Neural Networks with State Memory

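[Quick read]: This paper addresses the limited regularization offered by conventional methods such as Dropout, which deactivate neurons at random only briefly, and in particular the training instability that long-term deactivation can cause. The key to the solution is the Lifecycle (LC) principle, whose central innovation is a state memory mechanism: when a deactivated neuron is revived, its weights are not randomly re-initialized but restored to the parameters of its last effective state. This preserves learned knowledge and avoids drastic disturbances to optimization; the theoretical analysis indicates that the LC principle smooths the loss landscape and guides the model toward flatter minima, improving generalization and robustness.

Link: https://arxiv.org/abs/2509.02575
Authors: Zichuan Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure

Click to view abstract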

Abstract:I investigate a stronger form of regularization by deactivating neurons for extended periods, a departure from the temporary changes of methods like Dropout. However, this long-term dynamism introduces a critical challenge: severe training instability when neurons are revived with random weights. To solve this, I propose the Lifecycle (LC) principle, a regularization mechanism centered on a key innovation: state memory. Instead of re-initializing a revived neuron, my method restores its parameters to their last known effective state. This process preserves learned knowledge and avoids destructive optimization shocks. My theoretical analysis reveals that the LC principle smooths the loss landscape, guiding optimization towards flatter minima associated with better generalization. Experiments on image classification benchmarks demonstrate that my method improves generalization and robustness. Crucially, ablation studies confirm that state memory is essential for achieving these gains.
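
The state memory idea above can be sketched as a layer that snapshots a neuron's weights when it is deactivated and restores that snapshot on revival instead of re-initializing. This is a hypothetical illustration, not the author's implementation: the class name, the choice to snapshot at deactivation time as the "last known effective state", and the plain-Python linear layer are all assumptions.

```python
import random


class LifecycleLayer:
    """Toy linear layer whose neurons can be deactivated and later revived."""

    def __init__(self, n_neurons, n_inputs, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(n_inputs)] for _ in range(n_neurons)
        ]
        self.active = [True] * n_neurons
        self.memory = {}  # neuron index -> weight snapshot (the state memory)

    def deactivate(self, index):
        """Snapshot the neuron's weights, then switch it off."""
        self.memory[index] = list(self.weights[index])
        self.active[index] = False

    def revive(self, index):
        """Restore the remembered weights rather than drawing random ones."""
        self.weights[index] = list(self.memory.pop(index))
        self.active[index] = True

    def forward(self, x):
        """Inactive neurons output 0.0; active ones compute a dot product."""
        return [
            sum(w * xi for w, xi in zip(row, x)) if on else 0.0
            for row, on in zip(self.weights, self.active)
        ]


layer = LifecycleLayer(n_neurons=3, n_inputs=2)
before = layer.forward([1.0, 2.0])
layer.deactivate(1)
during = layer.forward([1.0, 2.0])
layer.revive(1)
after = layer.forward([1.0, 2.0])
print(before == after, during[1])
```

The contrast with the unstable variant described in the abstract is the single line in `revive`: replacing the snapshot with fresh random weights would make the revived neuron's output jump, which is the optimization shock the method is meant to avoid.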

[AI-68] S2M2ECG: Spatio-temporal bi-directional State Space Model Enabled Multi-branch Mamba for ECG

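[Quick read]: This paper addresses how deep learning models for multi-lead electrocardiogram (ECG) signals can fuse multi-source features efficiently while balancing performance and computational complexity. The key solution is S2M2ECG, a new architecture based on state space models (SSMs) with three levels of fusion: (1) spatio-temporal bi-directional SSMs with segment tokenization for low-level signal fusion; (2) intra-lead temporal information fusion with bi-directional scanning to improve recognition accuracy; and (3) cross-lead feature interaction modules for spatial information fusion. A multi-branch design and lead fusion modules allow each lead to be analysed individually while integrating seamlessly with the others, making full use of the multi-lead nature of ECG. Experiments show strong performance in rhythmic, morphological, and clinical scenarios with very few parameters, which favours lightweight deployment.

Link: https://arxiv.org/abs/2509.03066
Authors: Huaicheng Zhang,Ruoxin Wang,Chenlian Zhou,Jiguang Shi,Yue Ge,Zhoutong Li,Sheng Chang,Hao Wang,Jin He,Qijun Huang
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract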

Abstract:As one of the most effective methods for cardiovascular disease (CVD) diagnosis, multi-lead Electrocardiogram (ECG) signals present a characteristic multi-sensor information fusion challenge that has been continuously researched in deep learning domains. Despite the numerous algorithms proposed with different DL architectures, maintaining a balance among performance, computational complexity, and multi-source ECG feature fusion remains challenging. Recently, state space models (SSMs), particularly Mamba, have demonstrated remarkable effectiveness across various fields. Their inherent design for high-efficiency computation and linear complexity makes them particularly suitable for low-dimensional data like ECGs. This work proposes S2M2ECG, an SSM architecture featuring three-level fusion mechanisms: (1) Spatio-temporal bi-directional SSMs with segment tokenization for low-level signal fusion, (2) Intra-lead temporal information fusion with bi-directional scanning to enhance recognition accuracy in both forward and backward directions, (3) Cross-lead feature interaction modules for spatial information fusion. To fully leverage the ECG-specific multi-lead mechanisms inherent in ECG signals, a multi-branch design and lead fusion modules are incorporated, enabling individual analysis of each lead while ensuring seamless integration with others. Experimental results reveal that S2M2ECG achieves superior performance in the rhythmic, morphological, and clinical scenarios. Moreover, its lightweight architecture ensures it has nearly the fewest parameters among existing models, making it highly suitable for efficient inference and convenient deployment. Collectively, S2M2ECG offers a promising alternative that strikes an excellent balance among performance, computational complexity, and ECG-specific characteristics, paving the way for high-performance, lightweight computations in CVD diagnosis.
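
Bi-directional scanning, mentioned in the second fusion level above, can be illustrated with a scalar linear state space recurrence run once left-to-right and once right-to-left over a single lead. This is a deliberately simplified sketch: the parameters `a`, `b`, `c` are fixed hypothetical constants (Mamba's selective SSM makes them input-dependent), and summing the two directions is just one possible way to combine them.

```python
def ssm_scan(x, a=0.5, b=1.0, c=1.0):
    """Causal scan of h_t = a * h_{t-1} + b * x_t, emitting y_t = c * h_t."""
    h = 0.0
    outputs = []
    for value in x:
        h = a * h + b * value
        outputs.append(c * h)
    return outputs


def bidirectional_scan(x, **params):
    """Sum a forward scan and a backward scan so each step sees both sides."""
    forward = ssm_scan(x, **params)
    backward = ssm_scan(x[::-1], **params)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# A single impulse at the end of a short toy "lead".
lead = [0.0, 0.0, 1.0]
print(ssm_scan(lead))            # the forward scan cannot see the impulse early
print(bidirectional_scan(lead))  # the backward scan carries it to earlier steps
```

Each scan costs time linear in the sequence length, which is the efficiency property the abstract attributes to SSMs.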

[AI-69] Optimizing Geometry Problem Sets for Skill Development

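[Quick read]: This paper addresses a core challenge in automated processing of geometry problems and in educational support: how to organize Euclidean Geometry problems and validate their solutions so as to improve teaching efficiency and the quality of feedback to learners. The key to the solution is an ontology and solution graph paradigm developed in the early 1990s, which represents geometry problems and their logical reasoning paths in a structured way; paired with modern large language models (LLMs), the framework could support automated solution validation and personalized feedback for teachers and self-learners.

Link: https://arxiv.org/abs/2509.02758
Authors: Michael Bouzinier,Sergey Trifonov
Affiliations: Unknown
Categories: History and Overview (math.HO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract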

Abstract:This article describes an ontology and methodology for annotating and organizing Euclidean Geometry problems, developed in the early 1990s and implemented as a software tool. While the majority of this work – including the ontology and solution graph paradigm – was completed over thirty years ago, we argue that it has renewed relevance in the context of modern artificial intelligence. In particular, we explore the hypothesis that this established framework can facilitate automated solution validation and feedback when paired with contemporary large language models, thereby supporting teachers and self-learners in geometry education. We document the original architecture and its enduring value, and outline pathways for bridging historical educational resources with next-generation AI techniques.

[AI-70] BioMD: All-atom Generative Model for Biomolecular Dynamics Simulation

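[Quick read]: This paper addresses the high computational cost of molecular dynamics (MD) simulations when generating long-timescale trajectories of biomolecular systems, which limits their use in areas such as drug discovery. Existing machine learning (ML) methods are promising but struggle to generate long trajectories, mainly because MD datasets are scarce and modeling long historical trajectories is computationally expensive. The key to the solution is BioMD, presented as the first all-atom generative model, which uses a hierarchical framework of forecasting and interpolation to simulate long-timescale protein-ligand dynamics. On the DD-13M (ligand unbinding) and MISATO datasets it shows high physical plausibility and low reconstruction error, and it generates ligand unbinding paths for 97.1% of protein-ligand systems within ten attempts, demonstrating its ability to explore critical unbinding pathways.

Link: https://arxiv.org/abs/2509.02642
Authors: Bin Feng,Jiying Zhang,Xinni Zhang,Zijing Liu,Yu Li
Affiliations: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract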

Abstract:Molecular dynamics (MD) simulations are essential tools in computational chemistry and drug discovery, offering crucial insights into dynamic molecular behavior. However, their utility is significantly limited by substantial computational costs, which severely restrict accessible timescales for many biologically relevant processes. Despite the encouraging performance of existing machine learning (ML) methods, they struggle to generate extended biomolecular system trajectories, primarily due to the lack of MD datasets and the large computational demands of modeling long historical trajectories. Here, we introduce BioMD, the first all-atom generative model to simulate long-timescale protein-ligand dynamics using a hierarchical framework of forecasting and interpolation. We demonstrate the effectiveness and versatility of BioMD on the DD-13M (ligand unbinding) and MISATO datasets. For both datasets, BioMD generates highly realistic conformations, showing high physical plausibility and low reconstruction errors. Besides, BioMD successfully generates ligand unbinding paths for 97.1% of the protein-ligand systems within ten attempts, demonstrating its ability to explore critical unbinding pathways. Collectively, these results establish BioMD as a tool for simulating complex biomolecular processes, offering broad applicability for computational chemistry and drug discovery.

[AI-71] Enhanced Single-Cell RNA-seq Embedding through Gene Expression and Data-Driven Gene-Gene Interaction Integration

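[Quick read]: This paper addresses the analytical challenges posed by the high dimensionality and technical noise of single-cell RNA sequencing (scRNA-seq) data, and in particular the fact that existing embedding methods rely only on gene expression levels and overlook the gene-gene interactions that govern cellular identity and function. The key to the solution is a new embedding method that integrates gene expression profiles with data-driven gene-gene interactions: random forest models are used to build a Cell-Leaf Graph (CLG) that captures regulatory relationships between genes, a K-Nearest Neighbor Graph (KNNG) is built from expression similarity, and the two are merged into an Enriched Cell-Leaf Graph (ECLG), which serves as input to a graph neural network that computes more comprehensive cell embeddings.

Link: https://arxiv.org/abs/2509.02639
Authors: Hojjat Torabi Goudarzi,Maziyar Baran Pouyan
Affiliations: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: 33 pages, 9 figures, article

Click to view abstract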

Abstract:Single-cell RNA sequencing (scRNA-seq) provides unprecedented insights into cellular heterogeneity, enabling detailed analysis of complex biological systems at single-cell resolution. However, the high dimensionality and technical noise inherent in scRNA-seq data pose significant analytical challenges. While current embedding methods focus primarily on gene expression levels, they often overlook crucial gene-gene interactions that govern cellular identity and function. To address this limitation, we present a novel embedding approach that integrates both gene expression profiles and data-driven gene-gene interactions. Our method first constructs a Cell-Leaf Graph (CLG) using random forest models to capture regulatory relationships between genes, while simultaneously building a K-Nearest Neighbor Graph (KNNG) to represent expression similarities between cells. These graphs are then combined into an Enriched Cell-Leaf Graph (ECLG), which serves as input for a graph neural network to compute cell embeddings. By incorporating both expression levels and gene-gene interactions, our approach provides a more comprehensive representation of cellular states. Extensive evaluation across multiple datasets demonstrates that our method enhances the detection of rare cell populations and improves downstream analyses such as visualization, clustering, and trajectory inference. This integrated approach represents a significant advance in single-cell data analysis, offering a more complete framework for understanding cellular diversity and dynamics.

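[AI-72] A Two-Stage Strategy for Mitosis Detection Using Improved YOLO11x Proposals and ConvNeXt Classification

【Quick read】: This paper addresses mitosis detection in whole-slide images (WSIs) that contain non-tumor, inflamed, and necrotic regions, where a complex, heterogeneous background and possible artifacts cause false positives and false negatives that lower the detection F1-score. The key to the solution is a two-stage framework. In the first stage, an improved YOLO11x model (incorporating EMA attention and the LSConv module) generates as many mitosis candidates as possible, using a low confidence threshold to raise recall. In the second stage, a ConvNeXt-Tiny classifier screens the candidates and removes false positives, which markedly improves precision. On the fused dataset the method reaches an F1-score of 0.882, which is 0.035 higher than the single-stage YOLO11x baseline, mainly because precision rises from 0.762 to 0.839 while recall remains comparable.

Link: https://arxiv.org/abs/2509.02627
Authors: Jie Xiao,Mengye Lyu,Shaojun Liu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: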

Abstract:MIDOG 2025 Track 1 requires mitosis detection in whole-slide images (WSIs) containing non-tumor, inflamed, and necrotic regions. Due to the complicated and heterogeneous context, as well as possible artifacts, there are often false positives and false negatives, thus degrading the detection F1-score. To address this problem, we propose a two-stage framework. Firstly, an improved YOLO11x, integrated with EMA attention and LSConv, is employed to generate mitosis candidates. We use a low confidence threshold to generate as many proposals as possible, ensuring the detection recall. Then, a ConvNeXt-Tiny classifier is employed to filter out the false positives, ensuring the detection precision. Consequently, the proposed two-stage framework can generate a high detection F1-score. Evaluated on a fused dataset comprising MIDOG++, MITOS_WSI_CCMCT, and MITOS_WSI_CMC, our framework achieves an F1-score of 0.882, which is 0.035 higher than the single-stage YOLO11x baseline. This performance gain is produced by a significant precision improvement, from 0.762 to 0.839, and a comparable recall. The code is available at this https URL.
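
The reported figures can be cross-checked with a little arithmetic. Because F1 is the harmonic mean of precision P and recall R, recall follows from the two published numbers as R = F1 * P / (2P - F1). The sketch below applies that identity to the values quoted in the abstract. The baseline F1 of 0.847 is inferred here as 0.882 - 0.035, and the function name is illustrative rather than taken from the authors' code.

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P."""
    return f1 * precision / (2 * precision - f1)


# Values quoted in the abstract; the baseline F1 is inferred as 0.882 - 0.035.
two_stage_recall = recall_from_f1(0.882, 0.839)
baseline_recall = recall_from_f1(0.882 - 0.035, 0.762)

print(round(two_stage_recall, 3), round(baseline_recall, 3))
```

Under these assumptions recall comes out at roughly 0.93 for the two-stage system against roughly 0.95 for the baseline, which matches the abstract's description of a comparable recall alongside a clear precision gain.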

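[AI-73] IS³: Generic Impulsive–Stationary Sound Separation in Acoustic Scenes using Deep Filtering

【Quick read】: This paper addresses the differentiated processing of stationary backgrounds and isolated (impulsive) acoustic events in audio scenes, which matters for practical applications such as plosive attenuation in voice mixing, noise suppression, acoustic event classification, and bioacoustics. The key to the solution is IS³ (Impulsive–Stationary Sound Separation), a neural network that uses deep filtering to separate impulsive acoustic events and can serve as a pre-processing module for the tasks above. To support training, the authors design a sophisticated data generation pipeline that curates and adapts existing datasets into high-quality, varied training samples. They show that this lightweight learning-based approach outperforms Harmonic–Percussive Sound Separation (HPSS) masking and wavelet filtering on objective separation metrics.

Link: https://arxiv.org/abs/2509.02622
Authors: Berger Clémentine(IDS, S2A),Stamadiatis Paraskevas(IDS, S2A),Badeau Roland(IDS, S2A),Essid Slim(IDS, S2A)
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Comments: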

Abstract:We are interested in audio systems capable of performing a differentiated processing of stationary backgrounds and isolated acoustic events within an acoustic scene, whether for applying specific processing methods to each part or for focusing solely on one while ignoring the other. Such systems have applications in real-world scenarios, including robust adaptive audio rendering systems (e.g., EQ or compression), plosive attenuation in voice mixing, noise suppression or reduction, robust acoustic event classification or even bioacoustics. To this end, we introduce IS³, a neural network designed for Impulsive–Stationary Sound Separation, that isolates impulsive acoustic events from the stationary background using a deep filtering approach, and that can act as a pre-processing stage for the above-mentioned tasks. To ensure optimal training, we propose a sophisticated data generation pipeline that curates and adapts existing datasets for this task. We demonstrate that a learning-based approach, built on a relatively lightweight neural architecture and trained with well-designed and varied data, is successful in this previously unaddressed task, outperforming the Harmonic–Percussive Sound Separation masking method, adapted from music signal processing research, and wavelet filtering on objective separation metrics.

[AI-74] Radio Astronomy in the Era of Vision-Language Models: Prompt Sensitivity and Adaptation

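【Quick read】: This paper examines the performance and reliability of general-purpose Vision-Language Models (VLMs) in scientific imaging, especially on unfamiliar or previously unseen data distributions. It evaluates whether VLMs without pretraining on astronomical corpora can classify radio galaxy morphology, and explores the effectiveness of prompting strategies and lightweight fine-tuning. The key points are twofold. First, prompts that combine natural-language descriptions with schematic diagrams, together with the first use of visual in-context examples in astronomy, show that VLMs have latent generalization ability in an unfamiliar scientific domain. Second, LoRA (Low-Rank Adaptation) fine-tuning with only 15M trainable parameters yields a marked improvement, bringing Qwen-VL close to the state of the art (3% error rate). This shows that a generic VLM can rival specialized models after minimal adaptation, but also that its outputs depend heavily on prompt details, suggesting that the apparent "reasoning" may reflect prompt sensitivity more than genuine inference.

Link: https://arxiv.org/abs/2509.02615
Authors: Mariia Drozdova,Erica Lastufka,Vitaliy Kinakh,Taras Holotyak,Daniel Schaerer,Slava Voloshynovskiy
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: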

Abstract:Vision-Language Models (VLMs), such as recent Qwen and Gemini models, are positioned as general-purpose AI systems capable of reasoning across domains. Yet their capabilities in scientific imaging, especially on unfamiliar and potentially previously unseen data distributions, remain poorly understood. In this work, we assess whether generic VLMs, presumed to lack exposure to astronomical corpora, can perform morphology-based classification of radio galaxies using the MiraBest FR-I/FR-II dataset. We explore prompting strategies using natural language and schematic diagrams, and, to the best of our knowledge, we are the first to introduce visual in-context examples within prompts in astronomy. Additionally, we evaluate lightweight supervised adaptation via LoRA fine-tuning. Our findings reveal three trends: (i) even prompt-based approaches can achieve good performance, suggesting that VLMs encode useful priors for unfamiliar scientific domains; (ii) however, outputs are highly unstable, i.e. varying sharply with superficial prompt changes such as layout, ordering, or decoding temperature, even when semantic content is held constant; and (iii) with just 15M trainable parameters and no astronomy-specific pretraining, fine-tuned Qwen-VL achieves near state-of-the-art performance (3% Error rate), rivaling domain-specific models. These results suggest that the apparent “reasoning” of VLMs often reflects prompt sensitivity rather than genuine inference, raising caution for their use in scientific domains. At the same time, with minimal adaptation, generic VLMs can rival specialized models, offering a promising but fragile tool for scientific discovery.
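
To see why LoRA keeps the trainable parameter count small, recall that it freezes a weight matrix W of shape (d_out, d_in) and learns only a low-rank update BA, where B has shape (d_out, r) and A has shape (r, d_in). The sketch below counts parameters for a single layer. The layer width of 4096 and the rank of 8 are hypothetical values chosen for illustration; the abstract reports only the 15M total and does not state the authors' configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix W (d_out x d_in)."""
    return d_in * d_out


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 8
adapter = lora_trainable_params(d_in, d_out, rank)
dense = full_params(d_in, d_out)
print(adapter, dense, adapter / dense)
```

For this hypothetical layer the adapter holds 65,536 trainable parameters against 16,777,216 in the frozen matrix, a ratio of 1/256.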

[AI-75] Resilient Biosecurity in the Era of AI-Enabled Bioweapons

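【Quick read】: This paper addresses the biosecurity risks that generative biology brings to drug discovery, in particular the lack of effective means to detect potentially dangerous proteins (such as synthetic bioweapons) designed by generative AI. Existing measures rely mainly on inference-time filters such as sequence alignment and protein-protein interaction (PPI) prediction models, but the study finds that leading PPI prediction tools (AlphaFold 3, AF3Complex, and SpatialPPIv2) perform poorly at identifying known viral-host interactions (such as those involving Hepatitis B and SARS-CoV-2) and fail to identify the binding of experimentally validated SARS-CoV-2 mutants. The paper therefore argues that the key is to shift from reliance on static predictive models to response-oriented infrastructure, including rapid experimental validation, adaptable biomanufacturing, and regulatory frameworks that can keep pace with AI-driven developments.

Link: https://arxiv.org/abs/2509.02610
Authors: Jonathan Feldman,Tal Feldman
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: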

Abstract:Recent advances in generative biology have enabled the design of novel proteins, creating significant opportunities for drug discovery while also introducing new risks, including the potential development of synthetic bioweapons. Existing biosafety measures primarily rely on inference-time filters such as sequence alignment and protein-protein interaction (PPI) prediction to detect dangerous outputs. In this study, we evaluate the performance of three leading PPI prediction tools: AlphaFold 3, AF3Complex, and SpatialPPIv2. These models were tested on well-characterized viral-host interactions, such as those involving Hepatitis B and SARS-CoV-2. Despite being trained on many of the same viruses, the models fail to detect a substantial number of known interactions. Strikingly, none of the tools successfully identify any of the four experimentally validated SARS-CoV-2 mutants with confirmed binding. These findings suggest that current predictive filters are inadequate for reliably flagging even known biological threats and are even more unlikely to detect novel ones. We argue for a shift toward response-oriented infrastructure, including rapid experimental validation, adaptable biomanufacturing, and regulatory frameworks capable of operating at the speed of AI-driven developments.

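[AI-76] Towards Digital Twins for Optimal Radioembolization

【Quick read】: This paper addresses the difficulty of optimizing treatment when radioactive microspheres are transported through the hepatic arterial tree, a difficulty that arises from complex patient-specific anatomy, hemodynamic variability, and transport uncertainty. The goal is to maximize tumor kill while minimizing damage to healthy liver tissue. The key to the solution is building dynamic, patient-specific digital twins that combine high-fidelity computational fluid dynamics (CFD) with physics-informed AI methods, such as physics-informed neural networks (PINNs), physics-informed generative adversarial networks (PI-GANs), and physics-informed diffusion models (PI-DMs), to predict microsphere distribution efficiently, accurately, and with uncertainty awareness, thereby supporting real-time personalized treatment decisions.

Link: https://arxiv.org/abs/2509.02607
Authors: Nisanth Kumar Panneerselvam,Guneet Mummaneni,Emilie Roncali
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: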

Abstract:Radioembolization is a localized liver cancer treatment that delivers radioactive microspheres (30 micron) to tumors via a catheter inserted in the hepatic arterial tree. The goal is to maximize therapeutic efficacy while minimizing damage to healthy liver tissue. However, optimization is challenging due to complex hepatic artery anatomy, variable blood flow, and uncertainty in microsphere transport. The creation of dynamic, patient-specific digital twins may provide a transformative solution to these challenges. This work outlines a framework for a liver radioembolization digital twin using high-fidelity computational fluid dynamics (CFD) and/or recent physics-informed machine learning approaches. The CFD approach involves microsphere transport calculations in the hepatic arterial tree with individual patient data, which enables personalized treatment planning. Although accurate, traditional CFD is computationally expensive and limits clinical applicability. To accelerate simulations, physics-informed neural networks (PINNs) and their generative extensions play an increasingly important role. PINNs integrate governing equations, such as the Navier-Stokes equations, directly into the neural network training process, enabling mesh-free, data-efficient approximation of blood flow and microsphere transport. Physics-informed generative adversarial networks (PI-GANs), diffusion models (PI-DMs), and transformer-based architectures further enable uncertainty-aware, temporally resolved predictions with reduced computational cost. These AI surrogates not only maintain physical fidelity but also support rapid sampling of diverse flow scenarios, facilitating real-time decision support. Together, CFD and physics-informed AI methods form the foundation of dynamic, patient-specific digital twin to optimize radioembolization planning and ultimately improve clinical outcomes. 
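
The idea of folding a governing equation into the training objective can be illustrated on a toy problem. The sketch below is not the Navier-Stokes setting from the abstract: it swaps in the one-dimensional ODE du/dx + u = 0 with u(0) = 1, replaces the neural network with a hypothetical two-parameter model u(x) = a * exp(-k * x), and approximates the derivative with central differences instead of automatic differentiation. All names are assumptions made for illustration. What it does share with a PINN is the shape of the loss: a physics residual evaluated at collocation points plus a boundary term.

```python
import math


def model(a: float, k: float, x: float) -> float:
    # Hypothetical stand-in for a neural network: u(x) = a * exp(-k * x).
    return a * math.exp(-k * x)


def physics_informed_loss(a: float, k: float, collocation, h: float = 1e-4) -> float:
    """Mean squared residual of du/dx + u = 0, plus the boundary error at x = 0."""
    residual = 0.0
    for x in collocation:
        # Central finite difference stands in for automatic differentiation.
        du_dx = (model(a, k, x + h) - model(a, k, x - h)) / (2 * h)
        residual += (du_dx + model(a, k, x)) ** 2
    residual /= len(collocation)
    boundary = (model(a, k, 0.0) - 1.0) ** 2  # enforce u(0) = 1
    return residual + boundary


points = [0.0, 0.25, 0.5, 0.75, 1.0]
exact = physics_informed_loss(1.0, 1.0, points)       # true solution exp(-x)
wrong_rate = physics_informed_loss(1.0, 2.0, points)  # violates the ODE
wrong_scale = physics_informed_loss(0.5, 1.0, points) # violates the boundary
print(exact, wrong_rate, wrong_scale)
```

The loss is essentially zero only for the parameters that satisfy both the equation and the boundary condition, which is the property a PINN optimizer exploits without needing a mesh or labeled interior data.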

[AI-77] MIDOG 2025: Mitotic Figure Detection with Attention-Guided False Positive Correction

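【Quick read】: This paper addresses the high false positive rate, limited detection accuracy, and limited generalization of the existing Fully Convolutional One-Stage Object Detector (FCOS) in mitotic figure detection. The key to the solution is a composite architecture: a Feedback Attention Ladder CNN (FAL-CNN) first classifies mitotic figures as normal or abnormal, and its output feeds a fusion network trained to generate corrective adjustments to the bounding boxes predicted by FCOS, which reduces false positives and improves detection accuracy and generalization.

Link: https://arxiv.org/abs/2509.02598
Authors: Andrew Broad,Jason Keighley,Lucy Godson,Alex Wright
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: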

Abstract:We present a novel approach which extends the existing Fully Convolutional One-Stage Object Detector (FCOS) for mitotic figure detection. Our composite model adds a Feedback Attention Ladder CNN (FAL-CNN) model for classification of normal versus abnormal mitotic figures, feeding into a fusion network that is trained to generate adjustments to bounding boxes predicted by FCOS. Our network aims to reduce the false positive rate of the FCOS object detector, to improve the accuracy of object detection and enhance the generalisability of the network. Our model achieved an F1 score of 0.655 for mitosis detection on the preliminary evaluation dataset.

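[AI-78] OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries

【Quick read】: This paper addresses how to evaluate the ability of Large Language Models (LLMs) to produce high-quality, situationally aware, and accurate answers in clinical settings. Traditional benchmarks rely mostly on multiple-choice questions and struggle to measure key competencies needed in complex, high-stakes medical environments, such as contextual reasoning, uncertainty handling, and instruction following. To overcome this limitation, the authors evaluate their agentic clinical support assistant, which is based on Retrieval-Augmented Generation (RAG), with HealthBench, a rubric-based framework for evaluating open-ended conversations. The key is a behavior-oriented evaluation built on expert-annotated, open-ended health conversations, which measures the model's actual performance more fully across axes such as accuracy, completeness, and instruction following, and supports the system's strong performance and trustworthiness on clinical assistance tasks.

Link: https://arxiv.org/abs/2509.02594
Authors: Sandhanakrishnan Ravichandran,Shivesh Kumar,Rogerio Corga Da Silva,Miguel Romano,Reinhard Berkels,Michiel van der Heijden,Olivier Fail,Valentine Emmanuel Gnanapragasam
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Comments: 13 pages, two graphs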

Abstract:Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions that fail to capture essential competencies such as contextual reasoning, awareness, and uncertainty handling. To address these limitations, we evaluate our agentic, RAG-based clinical support assistant, this http URL, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, this http URL achieves a HealthBench score of 0.51, substantially outperforming leading frontier LLMs (GPT-5, o3, Grok 3, GPT-4, Gemini 2.5, etc.) across all behavioral axes (accuracy, completeness, instruction following, etc.). In a separate 100-sample evaluation against similar agentic RAG assistants (OpenEvidence, this http URL), it maintains a performance lead with a HealthBench score of 0.54. These results highlight this http URL strengths in communication, instruction following, and accuracy, while also revealing areas for improvement in context awareness and completeness of a response. Overall, the findings underscore the utility of behavior-level, rubric-based evaluation for building a reliable and trustworthy AI-enabled clinical support assistant.

Machine Learning

[LG-0] Can LLMs Lie? Investigation beyond Hallucination

Link: https://arxiv.org/abs/2509.03518
Authors: Haoran Huan,Mihir Prabhudesai,Mengning Wu,Shantanu Jaiswal,Deepak Pathak
Categories: Machine Learning (cs.LG)
Comments: Website at this https URL

Abstract:Large language models (LLMs) have demonstrated impressive capabilities across a variety of tasks, but their increasing autonomy in real-world applications raises concerns about their trustworthiness. While hallucinations (unintentional falsehoods) have been widely studied, the phenomenon of lying, where an LLM knowingly generates falsehoods to achieve an ulterior objective, remains underexplored. In this work, we systematically investigate the lying behavior of LLMs, differentiating it from hallucinations and testing it in practical scenarios. Through mechanistic interpretability techniques, we uncover the neural mechanisms underlying deception, employing logit lens analysis, causal interventions, and contrastive activation steering to identify and control deceptive behavior. We study real-world lying scenarios and introduce behavioral steering vectors that enable fine-grained manipulation of lying tendencies. Further, we explore the trade-offs between lying and end-task performance, establishing a Pareto frontier where dishonesty can enhance goal optimization. Our findings contribute to the broader discourse on AI ethics, shedding light on the risks and potential safeguards for deploying LLMs in high-stakes environments. Code and more illustrations are available at this https URL

[LG-1] Invariant Features for Global Crop Type Classification

Link: https://arxiv.org/abs/2509.03497
Authors: Xin-Yi Tong,Sherrie Wang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Accurately obtaining crop type and its spatial distribution at a global scale is critical for food security, agricultural policy-making, and sustainable development. Remote sensing offers an efficient solution for large-scale crop classification, but the limited availability of reliable ground samples in many regions constrains applicability across geographic areas. To address performance declines under geospatial shifts, this study identifies remote sensing features that are invariant to geographic variation and proposes strategies to enhance cross-regional generalization. We construct CropGlobe, a global crop type dataset with 300,000 pixel-level samples from eight countries across five continents, covering six major food and industrial crops (corn, soybeans, rice, wheat, sugarcane, cotton). With broad geographic coverage, CropGlobe enables a systematic evaluation under cross-country, cross-continent, and cross-hemisphere transfer. We compare the transferability of temporal multi-spectral features (Sentinel-2-based 1D/2D median features and harmonic coefficients) and hyperspectral features (from EMIT). To improve generalization under spectral and phenological shifts, we design CropNet, a lightweight and robust CNN tailored for pixel-level crop classification, coupled with temporal data augmentation (time shift, time scale, and magnitude warping) that simulates realistic cross-regional phenology. Experiments show that 2D median temporal features from Sentinel-2 consistently exhibit the strongest invariance across all transfer scenarios, and augmentation further improves robustness, particularly when training data diversity is limited. Overall, the work identifies more invariant feature representations that enhance geographic transferability and suggests a promising path toward scalable, low-cost crop type applications across globally diverse regions.
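
The three temporal augmentations named in the abstract (time shift, time scale, and magnitude warping) can be sketched on a single-band time series. The abstract does not give the authors' exact formulation, so everything below is an assumption made for illustration: the function names, the linear interpolation with edge clamping, and the use of a linearly varying gain as a simplified stand-in for a smooth warping curve.

```python
import math


def resample(series, positions):
    """Linearly interpolate `series` at fractional indices, clamping at the edges."""
    n = len(series)
    out = []
    for p in positions:
        p = min(max(p, 0.0), n - 1.0)
        lo = int(math.floor(p))
        hi = min(lo + 1, n - 1)
        frac = p - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def time_shift(series, shift):
    # Read the curve `shift` steps later, e.g. an earlier growing season.
    return resample(series, [i + shift for i in range(len(series))])


def time_scale(series, factor):
    # Stretch (factor < 1) or compress (factor > 1) the curve around its midpoint.
    center = (len(series) - 1) / 2
    return resample(series, [center + (i - center) * factor for i in range(len(series))])


def magnitude_warp(series, start_gain, end_gain):
    # Simplified warp: a gain that varies linearly from start_gain to end_gain.
    n = len(series)
    steps = max(n - 1, 1)
    return [v * (start_gain + (end_gain - start_gain) * i / steps)
            for i, v in enumerate(series)]


shifted = time_shift([0, 1, 2, 3], 1)
scaled = time_scale([0, 2, 4], 0.5)
warped = magnitude_warp([1, 1, 1], 1.0, 2.0)
print(shifted, scaled, warped)
```

In a training loop the shift, scale factor, and gains would be drawn at random per sample, so the classifier sees the same crop under many plausible phenological timings.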

[LG-2] Geometric Foundations of Tuning without Forgetting in Neural ODEs

Link: https://arxiv.org/abs/2509.03474
Authors: Erkan Bayram,Mohamed-Ali Belabbas,Tamer Başar
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:In our earlier work, we introduced the principle of Tuning without Forgetting (TwF) for sequential training of neural ODEs, where training samples are added iteratively and parameters are updated within the subspace of control functions that preserves the end-point mapping at previously learned samples on the manifold of output labels in the first-order approximation sense. In this letter, we prove that this parameter subspace forms a Banach submanifold of finite codimension under nonsingular controls, and we characterize its tangent space. This reveals that TwF corresponds to a continuation/deformation of the control function along the tangent space of this Banach submanifold, providing a theoretical foundation for its exact mapping-preserving (non-forgetting) behavior during sequential training, beyond the first-order approximation.

[LG-3] Graph neural networks for learning liquid simulations in dynamic scenes containing kinematic objects

Link: https://arxiv.org/abs/2509.03446
Authors: Niteesh Midlagajni,Constantin A. Rothkopf
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Simulating particle dynamics with high fidelity is crucial for solving real-world interaction and control tasks involving liquids in design, graphics, and robotics. Recently, data-driven approaches, particularly those based on graph neural networks (GNNs), have shown progress in tackling such problems. However, these approaches are often limited to learning fluid behavior in static free-fall environments or simple manipulation settings involving primitive objects, often overlooking complex interactions with dynamically moving kinematic rigid bodies. Here, we propose a GNN-based framework designed from the ground up to learn the dynamics of liquids under rigid body interactions and active manipulations, where particles are represented as graph nodes and particle-object collisions are handled using surface representations with the bounding volume hierarchy (BVH) algorithm. This approach enables the network to model complex interactions between liquid particles and intricate surface geometries. Our model accurately captures fluid behavior in dynamic settings and can also function as a simulator in static free-fall environments. Despite being trained on a single-object manipulation task of pouring, our model generalizes effectively to environments with unseen objects and novel manipulation tasks such as stirring and scooping. Finally, we show that the learned dynamics can be leveraged to solve control and manipulation tasks using gradient-based optimization methods.
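
Representing particles as graph nodes requires a rule for which particles are connected. The abstract does not state the rule used, so the sketch below assumes a common choice in learned particle simulators: connect every pair of particles closer than a fixed connectivity radius. The function name and the brute-force O(n²) search are illustrative, and the BVH-based handling of particle-object collisions described in the abstract is not reproduced here.

```python
import math


def radius_graph(positions, radius):
    """Return directed edges (i, j) for all particle pairs within `radius`."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) <= radius:
                edges.append((i, j))
    return edges


# Three particles on a line: only the first two are close enough to interact.
particles = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]
edges = radius_graph(particles, radius=1.0)
print(edges)
```

Each edge then carries features such as relative displacement, and message passing over these edges lets the network predict per-particle accelerations. A spatial hash or tree would replace the double loop at realistic particle counts.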

[LG-4] LINKER: Learning Interactions Between Functional Groups and Residues With Chemical Knowledge-Enhanced Reasoning and Explainability

Link: https://arxiv.org/abs/2509.03425
Authors: Phuc Pham,Viet Thanh Duy Nguyen,Truong-Son Hy
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Accurate identification of interactions between protein residues and ligand functional groups is essential to understand molecular recognition and guide rational drug design. Existing deep learning approaches for protein-ligand interpretability often rely on 3D structural input or use distance-based contact labels, limiting both their applicability and biological relevance. We introduce LINKER, the first sequence-based model to predict residue-functional group interactions in terms of biologically defined interaction types, using only protein sequences and the ligand SMILES as input. LINKER is trained with structure-supervised attention, where interaction labels are derived from 3D protein-ligand complexes via functional group-based motif extraction. By abstracting ligand structures into functional groups, the model focuses on chemically meaningful substructures while predicting interaction types rather than mere spatial proximity. Crucially, LINKER requires only sequence-level input at inference time, enabling large-scale application in settings where structural data is unavailable. Experiments on the LP-PDBBind benchmark demonstrate that structure-informed supervision over functional group abstractions yields interaction predictions closely aligned with ground-truth biochemical annotations.

[LG-5] Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study

Link: https://arxiv.org/abs/2509.03417
Authors: Spyros Rigas,Dhruv Verma,Georgios Alexandridis,Yixuan Wang
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 19 figures

Abstract:Kolmogorov-Arnold Networks (KANs) are a recently introduced neural architecture that replace fixed nonlinearities with trainable activation functions, offering enhanced flexibility and interpretability. While KANs have been applied successfully across scientific and machine learning tasks, their initialization strategies remain largely unexplored. In this work, we study initialization schemes for spline-based KANs, proposing two theory-driven approaches inspired by LeCun and Glorot, as well as an empirical power-law family with tunable exponents. Our evaluation combines large-scale grid searches on function fitting and forward PDE benchmarks, an analysis of training dynamics through the lens of the Neural Tangent Kernel, and evaluations on a subset of the Feynman dataset. Our findings indicate that the Glorot-inspired initialization significantly outperforms the baseline in parameter-rich models, while power-law initialization achieves the strongest performance overall, both across tasks and for architectures of varying size. All code and data accompanying this manuscript are publicly available at this https URL.

[LG-6] CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload

Link: https://arxiv.org/abs/2509.03394
Authors: Amirhossein Shahbazinia,Darong Huang,Luis Costero,David Atienza
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments:

Abstract:Cloud platforms are increasingly relied upon to host diverse, resource-intensive workloads due to their scalability, flexibility, and cost-efficiency. In multi-tenant cloud environments, virtual machines are consolidated on shared physical servers to improve resource utilization. While virtualization guarantees resource partitioning for CPU, memory, and storage, it cannot ensure performance isolation. Competition for shared resources such as last-level cache, memory bandwidth, and network interfaces often leads to severe performance degradation. Existing management techniques, including VM scheduling and resource provisioning, require accurate performance prediction to mitigate interference. However, this remains challenging in public clouds due to the black-box nature of VMs and the highly dynamic nature of workloads. To address these limitations, we propose CloudFormer, a dual-branch Transformer-based model designed to predict VM performance degradation in black-box environments. CloudFormer jointly models temporal dynamics and system-level interactions, leveraging 206 system metrics at one-second resolution across both static and dynamic scenarios. This design enables the model to capture transient interference effects and adapt to varying workload conditions without scenario-specific tuning. Complementing the methodology, we provide a fine-grained dataset that significantly expands the temporal resolution and metric diversity compared to existing benchmarks. Experimental results demonstrate that CloudFormer consistently outperforms state-of-the-art baselines across multiple evaluation metrics, achieving robust generalization across diverse and previously unseen workloads. Notably, CloudFormer attains a mean absolute error (MAE) of just 7.8%, representing a substantial improvement in predictive accuracy and outperforming existing methods by at least 28%.

[LG-7] Exploring a Graph-based Approach to Offline Reinforcement Learning for Sepsis Treatment

Link: https://arxiv.org/abs/2509.03393
Authors: Taisiya Khakharova,Lucas Sakizloglou,Leen Lambers
Categories: Machine Learning (cs.LG)
Comments: 18th European Workshop on Reinforcement Learning (EWRL 2025)

Abstract:Sepsis is a serious, life-threatening condition. When treating sepsis, it is challenging to determine the correct amount of intravenous fluids and vasopressors for a given patient. While automated reinforcement learning (RL)-based methods have been used to support these decisions with promising results, previous studies have relied on relational data. Given the complexity of modern healthcare data, representing data as a graph may provide a more natural and effective approach. This study models patient data from the well-known MIMIC-III dataset as a heterogeneous graph that evolves over time. Subsequently, we explore two Graph Neural Network architectures - GraphSAGE and GATv2 - for learning patient state representations, adopting the approach of decoupling representation learning from policy learning. The encoders are trained to produce latent state representations, jointly with decoders that predict the next patient state. These representations are then used for policy learning with the dBCQ algorithm. The results of our experimental evaluation confirm the potential of a graph-based approach, while highlighting the complexity of representation learning in this domain.

[LG-8] Cluster and then Embed: A Modular Approach for Visualization

Link: https://arxiv.org/abs/2509.03373
Authors: Elizabeth Coda,Ery Arias-Castro,Gal Mishne
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:Dimensionality reduction methods such as t-SNE and UMAP are popular methods for visualizing data with a potential (latent) clustered structure. They are known to group data points at the same time as they embed them, resulting in visualizations with well-separated clusters that preserve local information well. However, t-SNE and UMAP also tend to distort the global geometry of the underlying data. We propose a more transparent, modular approach consisting of first clustering the data, then embedding each cluster, and finally aligning the clusters to obtain a global embedding. We demonstrate this approach on several synthetic and real-world datasets and show that it is competitive with existing methods, while being much more transparent.

[LG-9] The distribution of calibrated likelihood functions on the probability-likelihood Aitchison simplex

Link: https://arxiv.org/abs/2509.03365
Authors: Paul-Gauthier Noé,Andreas Nautsch,Driss Matrouf,Pierre-Michel Bousquet,Jean-François Bonastre
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint. Under review

Abstract:While calibration of probabilistic predictions has been widely studied, this paper rather addresses calibration of likelihood functions. This has been discussed, especially in biometrics, in cases with only two exhaustive and mutually exclusive hypotheses (classes) where likelihood functions can be written as log-likelihood-ratios (LLRs). After defining calibration for LLRs and its connection with the concept of weight-of-evidence, we present the idempotence property and its associated constraint on the distribution of the LLRs. Although these results have been known for decades, they have been limited to the binary case. Here, we extend them to cases with more than two hypotheses by using the Aitchison geometry of the simplex, which allows us to recover, in a vector form, the additive form of Bayes' rule, thereby extending the LLR and the weight-of-evidence to any number of hypotheses. In particular, we extend the definition of calibration, the idempotence, and the constraint on the distribution of likelihood functions to this multiple hypotheses and multiclass counterpart of the LLR: the isometric-log-ratio transformed likelihood function. This work is mainly conceptual, but we still provide one application to machine learning by presenting a non-linear discriminant analysis where the discriminant components form a calibrated likelihood function over the classes, thereby improving the interpretability and the reliability of the method.
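
The additive form of Bayes' rule that the abstract generalizes is easiest to see in the binary case: the posterior log-odds equal the log-likelihood-ratio plus the prior log-odds. The sketch below shows only that two-hypothesis identity; the multiclass extension through the isometric-log-ratio transform is the paper's contribution and is not reproduced here. The function name is illustrative.

```python
import math


def posterior_from_llr(llr: float, prior_h1: float) -> float:
    """Posterior P(H1 | x) from an LLR and a prior, via additive log-odds."""
    prior_log_odds = math.log(prior_h1 / (1.0 - prior_h1))
    posterior_log_odds = llr + prior_log_odds  # Bayes' rule is a sum in log-odds
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))


# An LLR of 0 carries no evidence, so the posterior equals the prior.
no_evidence = posterior_from_llr(0.0, 0.3)
# A likelihood ratio of 4 with even prior odds gives posterior odds of 4:1.
four_to_one = posterior_from_llr(math.log(4.0), 0.5)
print(no_evidence, four_to_one)
```

Because the evidence enters as a pure additive term, a calibrated LLR can be combined with any prior after the fact, which is why calibration is defined on the likelihood side and not on the posterior.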

[LG-10] Some patterns of sleep quality and Daylight Saving Time across countries: a predictive and exploratory analysis

Link: https://arxiv.org/abs/2509.03358
Authors: Bhanu Sharma,Eugene Pinsky
Categories: Machine Learning (cs.LG)
Comments: 16 Pages

Abstract:In this study we analyzed average sleep durations across 61 countries to examine the impact of Daylight Saving Time (DST) practices. Key metrics influencing sleep were identified, and statistical correlation analysis was applied to explore relationships among these factors. Countries were grouped based on DST observance, and visualizations compared sleep patterns between DST and non-DST regions. Results show that, on average, countries observing DST tend to report longer sleep durations than those that do not. A more detailed pattern emerged when accounting for latitude: at lower latitudes, DST-observing countries reported shorter sleep durations compared to non-DST countries, while at higher latitudes, DST-observing countries reported longer average sleep durations. These findings suggest that the influence of DST on sleep may be moderated by geographical location.

[LG-11] Generative Auto-Bidding in Large-Scale Competitive Auctions via Diffusion Completer-Aligner

Link: https://arxiv.org/abs/2509.03348
Authors: Yewen Li,Jingtong Gao,Nan Jiang,Shuai Mao,Ruyi An,Fei Pan,Xiangyu Zhao,Bo An,Qingpeng Cai,Peng Jiang
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:Auto-bidding is central to computational advertising, achieving notable commercial success by optimizing advertisers’ bids within economic constraints. Recently, large generative models show potential to revolutionize auto-bidding by generating bids that could flexibly adapt to complex, competitive environments. Among them, diffusers stand out for their ability to address sparse-reward challenges by focusing on trajectory-level accumulated rewards, as well as their explainable capability, i.e., planning a future trajectory of states and executing bids accordingly. However, diffusers struggle with generation uncertainty, particularly regarding dynamic legitimacy between adjacent states, which can lead to poor bids and further cause significant loss of ad impression opportunities when competing with other advertisers in a highly competitive auction environment. To address it, we propose a Causal auto-Bidding method based on a Diffusion completer-aligner framework, termed CBD. Firstly, we augment the diffusion training process with an extra random variable t, where the model observes t-length historical sequences with the goal of completing the remaining sequence, thereby enhancing the generated sequences’ dynamic legitimacy. Then, we employ a trajectory-level return model to refine the generated trajectories, aligning more closely with advertisers’ objectives. Experimental results across diverse settings demonstrate that our approach not only achieves superior performance on large-scale auto-bidding benchmarks, such as a 29.9% improvement in conversion value in the challenging sparse-reward auction setting, but also delivers significant improvements on the Kuaishou online advertising platform, including a 2.0% increase in target cost.

[LG-12] EvolveSignal: A Large Language Model Powered Coding Agent for Discovering Traffic Signal Control Algorithms

Link: https://arxiv.org/abs/2509.03335
Authors: Leizhen Wang,Peibo Duan,Hao Wang,Yue Wang,Jian Xu,Nan Zheng,Zhenliang Ma
Categories: Machine Learning (cs.LG)

Abstract: In traffic engineering, fixed-time traffic signal control remains widely used for its low cost, stability, and interpretability. However, its design depends on hand-crafted formulas (e.g., Webster) and manual re-timing by engineers to adapt to demand changes, which is labor-intensive and often yields suboptimal results under heterogeneous or congested conditions. This paper introduces EvolveSignal, a large language model (LLM)-powered coding agent that automatically discovers new traffic signal control algorithms. We formulate the problem as program synthesis, where candidate algorithms are represented as Python functions with fixed input-output structures and are iteratively optimized through external evaluations (e.g., a traffic simulator) and evolutionary search. Experiments on a signalized intersection demonstrate that the discovered algorithms outperform Webster’s baseline, reducing average delay by 20.1% and average stops by 47.1%. Beyond performance, ablation and incremental analyses reveal that EvolveSignal’s modifications (such as adjusting cycle length bounds, incorporating right-turn demand, and rescaling green allocations) can offer practically meaningful insights for traffic engineers. This work opens a new research direction by leveraging AI for algorithm design in traffic signal control, bridging program synthesis with transportation engineering.
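
To make "Python functions with fixed input-output structures" concrete, here is a minimal sketch of what a Webster-style baseline candidate might look like. The cycle-length formula C = (1.5L + 5) / (1 - Y) is Webster's classical result; the function names, the argument layout, and the default cycle bounds are assumptions for illustration and are not taken from the paper.

```python
def webster_cycle_length(lost_time_s, flow_ratios, c_min=30.0, c_max=180.0):
    """Webster's optimal cycle length C = (1.5 * L + 5) / (1 - Y), clipped to bounds.

    lost_time_s: total lost time per cycle L, in seconds.
    flow_ratios: critical flow ratio (demand / saturation flow) for each phase.
    c_min, c_max: hypothetical cycle-length bounds, in seconds.
    """
    total_ratio = sum(flow_ratios)  # Y in Webster's formula
    if total_ratio >= 1.0:
        # The formula is undefined for oversaturated demand; fall back to the cap.
        return c_max
    cycle = (1.5 * lost_time_s + 5.0) / (1.0 - total_ratio)
    return min(max(cycle, c_min), c_max)


def webster_green_splits(lost_time_s, flow_ratios, c_min=30.0, c_max=180.0):
    """Return (cycle, greens): effective green is shared in proportion to flow ratios."""
    cycle = webster_cycle_length(lost_time_s, flow_ratios, c_min, c_max)
    effective_green = cycle - lost_time_s
    total_ratio = sum(flow_ratios)
    if total_ratio == 0:
        # No demand at all: split the green time equally.
        share = effective_green / len(flow_ratios)
        return cycle, [share for _ in flow_ratios]
    return cycle, [effective_green * y / total_ratio for y in flow_ratios]


if __name__ == "__main__":
    cycle, greens = webster_green_splits(12.0, [0.3, 0.3])
    print(cycle, greens)
```

An evolutionary search of the kind the abstract describes would rewrite the body of such a function while keeping its inputs and outputs fixed, so that every candidate can be scored by the same simulator.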

[LG-13] Temporal social network modeling of mobile connectivity data with graph neural networks

Link: https://arxiv.org/abs/2509.03319
Authors: Joel Jaskari,Chandreyee Roy,Fumiko Ogushi,Mikko Saukkoriipi,Jaakko Sahlsten,Kimmo Kaski
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 22 pages, 7 figures

Abstract: Graph neural networks (GNNs) have emerged as a state-of-the-art data-driven tool for modeling connectivity data of graph-structured complex networks and integrating information of their nodes and edges in space and time. However, as of yet, the analysis of social networks using the time series of people’s mobile connectivity data has not been extensively investigated. In the present study, we investigate four snapshot-based temporal GNNs in predicting the phone call and SMS activity between users of a mobile communication network. In addition, we develop a simple non-GNN baseline model using the recently proposed EdgeBank method. Our analysis shows that the ROLAND temporal GNN outperforms the baseline model in most cases, whereas the other three GNNs perform on average worse than the baseline. The results show that GNN-based approaches hold promise in the analysis of temporal social networks through mobile connectivity data. However, due to the relatively small performance margin between ROLAND and the baseline model, further research is required on specialized GNN architectures for temporal social network analysis.
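
For readers unfamiliar with EdgeBank, the idea is pure memorization: an edge is predicted to occur if it has been seen before. The sketch below shows the unlimited-memory variant applied to snapshots. The class layout and method names are assumptions for illustration, and the authors' baseline may differ in its details (for example, in how much history it keeps).

```python
class EdgeBank:
    """Memorization baseline: predict 1.0 for any edge seen in past snapshots."""

    def __init__(self):
        self.memory = set()  # directed (source, target) pairs observed so far

    def update(self, edges):
        """Store all edges of one snapshot."""
        for source, target in edges:
            self.memory.add((source, target))

    def predict(self, source, target):
        """Return 1.0 if the edge was observed before, else 0.0."""
        return 1.0 if (source, target) in self.memory else 0.0


if __name__ == "__main__":
    # Hypothetical call/SMS snapshots: each is a list of (caller, callee) pairs.
    snapshots = [[("a", "b"), ("b", "c")], [("a", "b"), ("c", "d")]]
    bank = EdgeBank()
    for snapshot in snapshots:
        bank.update(snapshot)
    print(bank.predict("a", "b"), bank.predict("d", "a"))
```

Because such a baseline needs no training, a small margin between it and a temporal GNN (as reported for ROLAND) suggests that much of the predictable signal is recurrence of past contacts.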

[LG-14] Meta-Imputation Balanced (MIB): An Ensemble Approach for Handling Missing Data in Biomedical Machine Learning

Link: https://arxiv.org/abs/2509.03316
Authors: Fatemeh Azad,Zoran Bosnić,Matjaž Kukar
Categories: Machine Learning (cs.LG)

Abstract:Missing data represents a fundamental challenge in machine learning applications, often reducing model performance and reliability. This problem is particularly acute in fields like bioinformatics and clinical machine learning, where datasets are frequently incomplete due to the nature of both data generation and data collection. While numerous imputation methods exist, from simple statistical techniques to advanced deep learning models, no single method consistently performs well across diverse datasets and missingness mechanisms. This paper proposes a novel Meta-Imputation approach that learns to combine the outputs of multiple base imputers to predict missing values more accurately. By training the proposed method called Meta-Imputation Balanced (MIB) on synthetically masked data with known ground truth, the system learns to predict the most suitable imputed value based on the behavior of each method. Our work highlights the potential of ensemble learning in imputation and paves the way for more robust, modular, and interpretable preprocessing pipelines in real-world machine learning systems.

[LG-15] Machine Learning-Driven Anomaly Detection for 5G O-RAN Performance Metrics

Link: https://arxiv.org/abs/2509.03290
Authors: Babak Azkaei,Kishor Chandra Joshi,George Exarchakos
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract: The ever-increasing reliance of critical services on network infrastructure, coupled with the increased operational complexity of beyond-5G/6G networks, necessitates proactive and automated network fault management. The provision for open interfaces among different radio access network (RAN) elements and the integration of AI/ML into network architecture enabled by the Open RAN (O-RAN) specifications bring new possibilities for active network health monitoring and anomaly detection. In this paper we leverage these advantages and develop an anomaly detection framework that proactively detects possible throughput drops for a UE and minimizes post-handover failures. We propose two actionable anomaly detection algorithms tailored for real-world deployment. The first algorithm identifies user equipment (UE) at risk of severe throughput degradation by analyzing key performance indicators (KPIs) such as resource block utilization and signal quality metrics, enabling proactive handover initiation. The second algorithm evaluates neighbor cell radio coverage quality, filtering out cells with anomalous signal strength or interference levels. This reduces candidate targets for handover by 41.27% on average. Together, these methods mitigate post-handover failures and throughput drops while operating much faster than the near-real-time latency constraints. This paves the way for self-healing 6G networks.

[LG-16] TopoMap: A Feature-based Semantic Discriminator of the Topographical Regions in the Test Input Space

Link: https://arxiv.org/abs/2509.03242
Authors: Gianmarco De Vita,Nargiz Humbatova,Paolo Tonella
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:Testing Deep Learning (DL)-based systems is an open challenge. Although it is relatively easy to find inputs that cause a DL model to misbehave, the grouping of inputs by features that make the DL model under test fail is largely unexplored. Existing approaches for DL testing introduce perturbations that may focus on specific failure-inducing features, while neglecting others that belong to different regions of the feature space. In this paper, we create an explicit topographical map of the input feature space. Our approach, named TopoMap, is both black-box and model-agnostic as it relies solely on features that characterise the input space. To discriminate the inputs according to the specific features they share, we first apply dimensionality reduction to obtain input embeddings, which are then subjected to clustering. Each DL model might require specific embedding computations and clustering algorithms to achieve a meaningful separation of inputs into discriminative groups. We propose a novel way to evaluate alternative configurations of embedding and clustering techniques. We used a deep neural network (DNN) as an approximation of a human evaluator who could tell whether a pair of clusters can be discriminated based on the features of the included elements. We use such a DNN to automatically select the optimal topographical map of the inputs among all those that are produced by different embedding/clustering configurations. The evaluation results show that the maps generated by TopoMap consist of distinguishable and meaningful regions. In addition, we evaluate the effectiveness of TopoMap using mutation analysis. In particular, we assess whether the clusters in our topographical map allow for an effective selection of mutation-killing inputs. Experimental results show that our approach outperforms random selection by 35% on average on killable mutants; by 61% on non-killable ones.

[LG-17] Unsupervised Learning based Element Resource Allocation for Reconfigurable Intelligent Surfaces in mmWave Network

Link: https://arxiv.org/abs/2509.03241
Authors: Pujitha Mamillapalli,Yoghitha Ramamoorthi,Abhinav Kumar,Tomoki Murakami,Tomoaki Ogawa,Yasushi Takatori
Categories: Machine Learning (cs.LG)

Abstract: The increasing demand for high data rates and seamless connectivity in wireless systems has sparked significant interest in reconfigurable intelligent surfaces (RIS) and artificial intelligence-based wireless applications. RIS typically comprises passive reflective antenna elements that control the wireless propagation environment by adequately tuning the phase of the reflective elements. The allocation of RIS elements to multiple user equipment (UEs) is crucial for efficiently utilizing RIS. In this work, we formulate a joint optimization problem that optimizes the RIS phase configuration and resource allocation under an $\alpha$-fair scheduling framework and propose an efficient way of allocating RIS elements. Conventional iterative optimization methods, however, suffer from exponentially increasing computational complexity as the number of RIS elements increases and also complicate the generation of training labels for supervised learning. To overcome these challenges, we propose a five-layer fully connected neural network (FNN) combined with a preprocessing technique to significantly reduce input dimensionality, lower computational complexity, and enhance scalability. The simulation results show that our proposed NN-based solution reduces computational overhead while significantly improving system throughput by 6.8% compared to existing RIS element allocation schemes. Furthermore, the proposed system achieves better performance while reducing computational complexity, making it significantly more scalable than the iterative optimization algorithms.
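
As background for the scheduling framework named in the abstract, the standard $\alpha$-fair utility is written out below. The symbols $R_k$ (the rate of UE $k$) and $K$ (the number of UEs) are notation assumed here for illustration; the paper's own formulation, which additionally optimizes the RIS phase configuration, may use different symbols and constraints.

```latex
U_\alpha(R_k) =
\begin{cases}
  \dfrac{R_k^{\,1-\alpha}}{1-\alpha}, & \alpha \ge 0,\ \alpha \ne 1, \\[6pt]
  \log R_k, & \alpha = 1,
\end{cases}
\qquad
\max \sum_{k=1}^{K} U_\alpha(R_k).
```

Setting $\alpha = 0$ recovers sum-rate maximization, $\alpha = 1$ gives proportional fairness, and $\alpha \to \infty$ approaches max-min fairness.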

[LG-18] TeRA: Vector-based Random Tensor Network for High-Rank Adaptation of Large Language Models

Link: https://arxiv.org/abs/2509.03234
Authors: Yuxuan Gu,Wuyang Zhou,Giorgos Iacovides,Danilo Mandic
Categories: Machine Learning (cs.LG)

Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), have significantly reduced the number of trainable parameters needed in fine-tuning large language models (LLMs). Subsequent developments of LoRA-style adapters have diverged into two main directions: (1) enhancing model expressivity with high-rank adapters, and (2) pushing for further parameter reduction, as exemplified by vector-based methods. However, these approaches present a trade-off, as achieving the expressivity of high-rank weight updates typically comes at the cost of sacrificing the extreme parameter efficiency offered by vector-based techniques. To address this issue, we propose a vector-based random Tensor network for high-Rank Adaptation (TeRA), a novel PEFT method that achieves high-rank weight updates while retaining the parameter efficiency of vector-based PEFT adapters. This is achieved by parameterizing the tensorized weight update matrix as a Tucker-like tensor network (TN), in which large randomly initialized factors are frozen and shared across layers, while only small layer-specific scaling vectors, formed by entries in diagonal factor matrices, are trained. This design effectively decouples the rank of the weight update matrix from the number of trainable parameters. Comprehensive experiments demonstrate that TeRA matches or even outperforms high-rank adapters, while requiring a trainable parameter count similar to vector-based methods. Theoretical analysis and ablation studies further validate the effectiveness of our approach.

[LG-19] NeurStore: Efficient In-database Deep Learning Model Management System SIGMOD2026

Link: https://arxiv.org/abs/2509.03228
Authors: Siqi Xiang,Sheng Wang,Xiaokui Xiao,Cong Yue,Zhanhao Zhao,Beng Chin Ooi
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: 15 pages, 14 figures, Accepted at SIGMOD 2026

Abstract:With the prevalence of in-database AI-powered analytics, there is an increasing demand for database systems to efficiently manage the ever-expanding number and size of deep learning models. However, existing database systems typically store entire models as monolithic files or apply compression techniques that overlook the structural characteristics of deep learning models, resulting in suboptimal model storage overhead. This paper presents NeurStore, a novel in-database model management system that enables efficient storage and utilization of deep learning models. First, NeurStore employs a tensor-based model storage engine to enable fine-grained model storage within databases. In particular, we enhance the hierarchical navigable small world (HNSW) graph to index tensors, and only store additional deltas for tensors within a predefined similarity threshold to ensure tensor-level deduplication. Second, we propose a delta quantization algorithm that effectively compresses delta tensors, thus achieving a superior compression ratio with controllable model accuracy loss. Finally, we devise a compression-aware model loading mechanism, which improves model utilization performance by enabling direct computation on compressed tensors. Experimental evaluations demonstrate that NeurStore achieves superior compression ratios and competitive model loading throughput compared to state-of-the-art approaches.

[LG-20] The Role of Embodiment in Intuitive Whole-Body Teleoperation for Mobile Manipulation

Link: https://arxiv.org/abs/2509.03222
Authors: Sophia Bianchi Moyen,Rickmer Krohn,Sophie Lueth,Kay Pompetzki,Jan Peters,Vignesh Prasad,Georgia Chalvatzaki
Categories: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures, Accepted at the IEEE-RAS International Conference on Humanoid Robots (Humanoids) 2025

Abstract: Intuitive teleoperation interfaces are essential for mobile manipulation robots to ensure high-quality data collection while reducing operator workload. A strong sense of embodiment combined with minimal physical and cognitive demands not only enhances the user experience during large-scale data collection, but also helps maintain data quality over extended periods. This becomes especially crucial for challenging long-horizon mobile manipulation tasks that require whole-body coordination. We compare two distinct robot control paradigms: a coupled embodiment integrating arm manipulation and base navigation functions, and a decoupled embodiment treating these systems as separate control entities. Additionally, we evaluate two visual feedback mechanisms: immersive virtual reality and conventional screen-based visualization of the robot’s field of view. These configurations were systematically assessed across a complex, multi-stage task sequence requiring integrated planning and execution. Our results show that the use of VR as a feedback modality increases task completion time, cognitive workload, and perceived effort of the teleoperator. Coupling manipulation and navigation leads to a comparable workload on the user as decoupling the embodiments, while preliminary experiments suggest that data acquired by coupled teleoperation leads to better imitation learning performance. Our holistic view on intuitive teleoperation interfaces provides valuable insight into collecting high-quality, high-dimensional mobile manipulation data at scale with the human operator in mind. Project website: this https URL

[LG-21] Exploring the Design Space of Fair Tree Learning Algorithms

Link: https://arxiv.org/abs/2509.03204
Authors: Kiara Stempel,Mattia Cerrato,Stefan Kramer
Categories: Machine Learning (cs.LG)

Abstract:Decision trees have been studied extensively in the context of fairness, aiming to maximize prediction performance while ensuring non-discrimination against different groups. Techniques in this space usually focus on imposing constraints at training time, constraining the search space so that solutions which display unacceptable values of relevant metrics are not considered, discarded, or discouraged. If we assume one target variable y and one sensitive attribute s, the design space of tree learning algorithms can be spanned as follows: (i) One can have one tree T that is built using an objective function that is a function of y, s, and T. For instance, one can build a tree based on the weighted information gain regarding y (maximizing) and s (minimizing). (ii) The second option is to have one tree model T that uses an objective function in y and T and a constraint on s and T. Here, s is no longer part of the objective, but part of a constraint. This can be achieved greedily by aborting a further split as soon as the condition that optimizes the objective in y fails to satisfy the constraint on s. A simple way to explore other splits is to backtrack during tree construction once a fairness constraint is violated. (iii) The third option is to have two trees T_y and T_s, one for y and one for s, such that the tree structure for y and s does not have to be shared. In this way, information regarding y and regarding s can be used independently, without having to constrain the choices in tree construction by the mutual information between the two variables. Quite surprisingly, of the three options, only the first one and the greedy variant of the second have been studied in the literature so far. In this paper, we introduce the above two additional options from that design space and characterize them experimentally on multiple datasets.
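
Option (i) in the abstract, a single tree whose split criterion rewards information gain on y and penalizes information gain on s, can be sketched as follows. The trade-off weight `lam`, the function names, and the index-based split representation are assumptions for illustration; the abstract does not specify the exact weighting the authors use.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, left_idx, right_idx):
    """Entropy reduction when `labels` is split into the two index sets."""
    n = len(labels)
    left = [labels[i] for i in left_idx]
    right = [labels[i] for i in right_idx]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children


def fair_split_score(y, s, left_idx, right_idx, lam=1.0):
    """Weighted criterion: maximize gain on target y, minimize gain on sensitive s.

    lam is a hypothetical trade-off weight, not a value from the paper.
    """
    return information_gain(y, left_idx, right_idx) - lam * information_gain(
        s, left_idx, right_idx
    )


if __name__ == "__main__":
    y = [0, 0, 1, 1]  # target variable
    s = [0, 1, 0, 1]  # sensitive attribute
    print(fair_split_score(y, s, [0, 1], [2, 3]))  # split aligned with y
    print(fair_split_score(y, s, [0, 2], [1, 3]))  # split aligned with s
```

Option (ii) would instead keep only the gain on y as the objective and reject any split whose gain on s exceeds a threshold, while option (iii) would grow two separate trees.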

[LG-22] Tabular foundation model for GEOAI benchmark problems BM/AirportSoilProperties/2/2025

Link: https://arxiv.org/abs/2509.03191
Authors: Taiga Saito,Yu Otake,Stephen Wu
Categories: Machine Learning (cs.LG)

Abstract:This paper presents a novel application of the Tabular Prior-Data Fitted Network (TabPFN) - a transformer-based foundation model for tabular data - to geotechnical site characterization problems defined in the GEOAI benchmark BM/AirportSoilProperties/2/2025. Two tasks are addressed: (1) predicting the spatial variation of undrained shear strength (su) across borehole depth profiles, and (2) imputing missing mechanical parameters in a dense-site dataset. We apply TabPFN in a zero-training, few-shot, in-context learning setting - without hyper-parameter tuning - and provide it with additional context from the big indirect database (BID). The study demonstrates that TabPFN, as a general-purpose foundation model, achieved superior accuracy and well-calibrated predictive distributions compared to a conventional hierarchical Bayesian model (HBM) baseline, while also offering significant gains in inference efficiency. In Benchmark Problem #1 (spatial su prediction), TabPFN outperformed the HBM in prediction accuracy and delivered an order-of-magnitude faster runtime. In Benchmark Problem #2 (missing mechanical parameter imputation), TabPFN likewise achieved lower RMSE for all target parameters with well-quantified uncertainties, though its cumulative computation cost was higher than HBM’s due to its one-variable-at-a-time inference. These results mark the first successful use of a tabular foundation model in geotechnical modeling, suggesting a potential paradigm shift in probabilistic site characterization.

[LG-23] Enhancing Interpretability and Effectiveness in Recommendation with Numerical Features via Learning to Contrast the Counterfactual samples

Link: https://arxiv.org/abs/2509.03187
Authors: Xiaoxiao Xu,Hao Wu,Wenhui Yu,Lantao Hu,Peng Jiang,Kun Gai
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by TheWebConf2024

Abstract: We propose a general model-agnostic Contrastive learning framework with Counterfactual Samples Synthesizing (CCSS) for modeling the monotonicity between the neural network output and numerical features, which is critical for the interpretability and effectiveness of recommender systems. CCSS models the monotonicity via a two-stage process: synthesizing counterfactual samples and contrasting the counterfactual samples. The two techniques are naturally integrated into a model-agnostic framework, forming an end-to-end training process. Extensive empirical tests are conducted on a publicly available dataset and a real industrial dataset, and the results demonstrate the effectiveness of our proposed CCSS. In addition, CCSS has been deployed in our real large-scale industrial recommender, successfully serving hundreds of millions of users.

[LG-24] Beyond Words: Interjection Classification for Improved Human-Computer Interaction

Link: https://arxiv.org/abs/2509.03181
Authors: Yaniv Goren,Yuval Cohen,Alexander Apartsin,Yehudit Aperstein
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 9 pages

Abstract: In the realm of human-computer interaction, fostering a natural dialogue between humans and machines is paramount. A key, often overlooked, component of this dialogue is the use of interjections such as “mmm” and “hmm”. Despite their frequent use to express agreement, hesitation, or requests for information, these interjections are typically dismissed as “non-words” by Automatic Speech Recognition (ASR) engines. Addressing this gap, we introduce a novel task dedicated to interjection classification, which is, to our knowledge, the first of its kind. This task is challenging due to the short duration of interjection signals and significant inter- and intra-speaker variability. In this work, we present and publish a dataset of interjection signals collected specifically for interjection classification. We employ this dataset to train and evaluate a baseline deep learning model. To enhance performance, we augment the training dataset using techniques such as tempo and pitch transformation, which significantly improve classification accuracy and make models more robust. The interjection dataset, a Python library for the augmentation pipeline, the baseline model, and evaluation scripts are available to the research community.

[LG-25] Systematic Evaluation of Attribution Methods: Eliminating Threshold Bias and Revealing Method-Dependent Performance Patterns

Link: https://arxiv.org/abs/2509.03176
Authors: Serra Aksoy
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 9 figures

Abstract:Attribution methods explain neural network predictions by identifying influential input features, but their evaluation suffers from threshold selection bias that can reverse method rankings and undermine conclusions. Current protocols binarize attribution maps at single thresholds, where threshold choice alone can alter rankings by over 200 percentage points. We address this flaw with a threshold-free framework that computes Area Under the Curve for Intersection over Union (AUC-IoU), capturing attribution quality across the full threshold spectrum. Evaluating seven attribution methods on dermatological imaging, we show single-threshold metrics yield contradictory results, while threshold-free evaluation provides reliable differentiation. XRAI achieves 31% improvement over LIME and 204% over vanilla Integrated Gradients, with size-stratified analysis revealing performance variations up to 269% across lesion scales. These findings establish methodological standards that eliminate evaluation artifacts and enable evidence-based method selection. The threshold-free framework provides both theoretical insight into attribution behavior and practical guidance for robust comparison in medical imaging and beyond.
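
The threshold-free idea can be illustrated in a few lines: rather than binarizing an attribution map at one threshold, compute IoU against the ground-truth mask at every threshold and integrate. The sketch below works on flattened lists; the min-max normalization, the evenly spaced thresholds, and the trapezoidal rule are assumptions for illustration and may not match the paper's exact protocol.

```python
def iou_at_threshold(attribution, mask, threshold):
    """IoU between the binarized attribution (values >= threshold) and the mask."""
    predicted = [value >= threshold for value in attribution]
    intersection = sum(1 for p, m in zip(predicted, mask) if p and m)
    union = sum(1 for p, m in zip(predicted, mask) if p or m)
    return intersection / union if union else 0.0


def auc_iou(attribution, mask, num_thresholds=101):
    """Area under the IoU-versus-threshold curve over thresholds in [0, 1]."""
    low, high = min(attribution), max(attribution)
    span = high - low
    # Min-max normalize so thresholds are comparable across attribution methods.
    if span == 0:
        normalized = [0.0 for _ in attribution]
    else:
        normalized = [(value - low) / span for value in attribution]
    step = 1.0 / (num_thresholds - 1)
    ious = [
        iou_at_threshold(normalized, mask, i * step) for i in range(num_thresholds)
    ]
    # Trapezoidal integration over the evenly spaced thresholds.
    return sum((ious[i] + ious[i + 1]) / 2.0 * step for i in range(num_thresholds - 1))


if __name__ == "__main__":
    lesion_mask = [1, 1, 0, 0]  # hypothetical flattened ground-truth mask
    print(auc_iou([0.9, 0.8, 0.1, 0.2], lesion_mask))  # attribution on the lesion
    print(auc_iou([0.1, 0.2, 0.9, 0.8], lesion_mask))  # attribution off the lesion
```

Because the score averages over all thresholds, two methods can no longer swap ranks simply because an evaluator picked a different cutoff.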

[LG-26] RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation

Link: https://arxiv.org/abs/2509.03131
Authors: Sashuai Zhou,Weinan Gan,Qijiong Liu,Ke Lei,Jieming Zhu,Hai Huang,Yan Xia,Ruiming Tang,Zhenhua Dong,Zhou Zhao
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Recent advances in LLM-based recommendation have shown promise, yet their cross-domain generalization is hindered by a fundamental mismatch between language-centric pretraining and the recommendation task. Existing methods, relying on language-level knowledge, fail to capture dynamic, item-level user interests across domains. To bridge this gap, we propose RecBase, a domain-agnostic foundational model pretrained with a recommendation-oriented objective. RecBase leverages a large-scale, heterogeneous, cross-domain corpus with unified textual representations and feature mappings to enhance cross-domain generalization. To further align item semantics across domains, we introduce a unified item tokenizer that encodes items into hierarchical concept identifiers, enabling structured representation and efficient vocabulary sharing. The model is trained using an autoregressive objective to capture complex item-level sequential patterns. On eight real-world datasets, our 1.5B-parameter model matches or surpasses the performance of LLM baselines up to 7B parameters in zero-shot and cross-domain recommendation tasks.

[LG-27] LSAM: Asynchronous Distributed Training with Landscape-Smoothed Sharpness-Aware Minimization

Link: https://arxiv.org/abs/2509.03110
Authors: Yunfei Teng,Sixin Zhang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: While Sharpness-Aware Minimization (SAM) improves generalization in deep neural networks by minimizing both loss and sharpness, it suffers from inefficiency in distributed large-batch training. We present Landscape-Smoothed SAM (LSAM), a novel optimizer that preserves SAM’s generalization advantages while offering superior efficiency. LSAM integrates SAM’s adversarial steps with an asynchronous distributed sampling strategy, producing a smoothed sharpness-aware loss landscape for optimization. This design eliminates synchronization bottlenecks, accelerates large-batch convergence, and delivers higher final accuracy compared to data-parallel SAM.

[LG-28] Discrete Functional Geometry of ReLU Networks via ReLU Transition Graphs

Link: https://arxiv.org/abs/2509.03056
Authors: Sahil Rajesh Dhayalkar
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 3 figures. Submitted as a conference paper to 2025 5th International Conference on Robotics, Automation, and Artificial Intelligence (RAAI 2025)

Abstract:We extend the ReLU Transition Graph (RTG) framework into a comprehensive graph-theoretic model for understanding deep ReLU networks. In this model, each node represents a linear activation region, and edges connect regions that differ by a single ReLU activation flip, forming a discrete geometric structure over the network’s functional behavior. We prove that RTGs at random initialization exhibit strong expansion, binomial degree distributions, and spectral properties that tightly govern generalization. These structural insights enable new bounds on capacity via region entropy and on generalization via spectral gap and edge-wise KL divergence. Empirically, we construct RTGs for small networks, measure their smoothness and connectivity properties, and validate theoretical predictions. Our results show that region entropy saturates under overparameterization, spectral gap correlates with generalization, and KL divergence across adjacent regions reflects functional smoothness. This work provides a unified framework for analyzing ReLU networks through the lens of discrete functional geometry, offering new tools to understand, diagnose, and improve generalization.

[LG-29] Population-aware Online Mirror Descent for Mean-Field Games with Common Noise by Deep Reinforcement Learning

Link: https://arxiv.org/abs/2509.03030
Authors: Zida Wu,Mathieu Lauriere,Matthieu Geist,Olivier Pietquin,Ankur Mehta
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 2025 IEEE 64th Conference on Decision and Control (CDC)

Abstract:Mean Field Games (MFGs) offer a powerful framework for studying large-scale multi-agent systems. Yet, learning Nash equilibria in MFGs remains a challenging problem, particularly when the initial distribution is unknown or when the population is subject to common noise. In this paper, we introduce an efficient deep reinforcement learning (DRL) algorithm designed to achieve population-dependent Nash equilibria without relying on averaging or historical sampling, inspired by Munchausen RL and Online Mirror Descent. The resulting policy is adaptable to various initial distributions and sources of common noise. Through numerical experiments on seven canonical examples, we demonstrate that our algorithm exhibits superior convergence properties compared to state-of-the-art algorithms, particularly a DRL version of Fictitious Play for population-dependent policies. The performance in the presence of common noise underscores the robustness and adaptability of our approach.

[LG-30] Multimodal learning of melt pool dynamics in laser powder bed fusion

Link: https://arxiv.org/abs/2509.03029
Authors: Satyajit Mojumder,Pallock Halder,Tiana Tonge
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 6 figures, 1 table

Abstract: While multiple sensors are used for real-time monitoring in additive manufacturing, not all provide practical or reliable process insights. For example, high-speed X-ray imaging offers valuable spatial information about subsurface melt pool behavior but is costly and impractical for most industrial settings. In contrast, absorptivity data from low-cost photodiodes correlate with melt pool dynamics but are often too noisy for accurate prediction when used alone. In this paper, we propose a multimodal data fusion approach for predicting melt pool dynamics by combining high-fidelity X-ray data with low-fidelity absorptivity data in the Laser Powder Bed Fusion (LPBF) process. Our multimodal learning framework integrates convolutional neural networks (CNNs) for spatial feature extraction from X-ray data with recurrent neural networks (RNNs) for temporal feature extraction from absorptivity signals, using an early fusion strategy. The multimodal model is further used as a transfer learning model to fine-tune the RNN model that can predict melt pool dynamics only with absorptivity, with greater accuracy compared to the multimodal model. Results show that training with both modalities significantly improves prediction accuracy compared to using either modality alone. Furthermore, once trained, the model can infer melt pool characteristics using only absorptivity data, eliminating the need for expensive X-ray imaging. This multimodal fusion approach enables cost-effective, real-time monitoring and has broad applicability in additive manufacturing.

[LG-31] Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training

Link: https://arxiv.org/abs/2509.03018
Authors: Yangtao Deng,Lei Zhang,Qinlong Wang,Xiaoyun Zhi,Xinlei Zhang,Zhuo Jiang,Haohan Xu,Lei Wang,Zuquan Song,Gaohong Liu,Yang Bai,Shuguang Wang,Wencong Xiao,Jianxi Ye,Minlan Yu,Hong Xu
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today’s collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft’s key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective communication related issues at runtime. It detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. We also conducted extensive fault injection experiments to demonstrate Mycroft’s capability and efficiency.

[LG-32] AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates

Link: https://arxiv.org/abs/2509.02981
Authors: Minxin Zhang,Yuxuan Liu,Hayden Schaeffer
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for such orthogonalized updates. AdaGrad, by contrast, is a widely used adaptive method that scales stochastic gradients by accumulated past gradients. We propose a new algorithm, AdaGO, which combines a norm-based AdaGrad-type stepsize with an orthogonalized update direction, bringing together the benefits of both approaches. Unlike other adaptive variants of Muon, AdaGO preserves the orthogonality of the update direction, which can be interpreted as a spectral descent direction, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradient norms. The implementation of AdaGO requires only minimal modification to Muon, with a single additional scalar variable, the accumulated squared gradient norms, to be computed, making it computationally and memory efficient. Optimal theoretical convergence rates are established for nonconvex functions in both stochastic and deterministic settings under standard smoothness and unbiased bounded-variance noise assumptions. Empirical results on CIFAR-10 classification and function regression demonstrate that AdaGO outperforms Muon and Adam.
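
As a rough illustration of the idea described in this abstract, the sketch below pairs an orthogonalized momentum direction with a scalar stepsize driven by accumulated squared gradient norms. It is not the authors' AdaGO algorithm: the function names, the hyperparameters (`lr`, `beta`, `eps`), the exact stepsize formula, and the use of an SVD for orthogonalization (Muon implementations typically use a Newton-Schulz iteration instead) are all assumptions made for illustration.

```python
import numpy as np

def orthogonalize(m):
    """Return U @ Vt from the thin SVD of m (assumed stand-in for Muon's orthogonalization)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def adago_like_step(w, grad, momentum, acc_sq_norm, lr=0.1, beta=0.9, eps=1e-8):
    """One illustrative update; the scaling rule is an assumption, not the paper's formula."""
    # Momentum on the raw gradient.
    momentum = beta * momentum + (1.0 - beta) * grad
    # The single extra scalar: accumulated squared gradient (Frobenius) norms.
    acc_sq_norm = acc_sq_norm + float(np.sum(grad * grad))
    # AdaGrad-type scalar stepsize applied to an orthogonalized direction.
    step = lr / (np.sqrt(acc_sq_norm) + eps)
    w = w - step * orthogonalize(momentum)
    return w, momentum, acc_sq_norm
```

Because the stepsize is a single scalar, the update direction keeps whatever orthogonality `orthogonalize` provides; only its length changes over time.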

[LG-33] Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation

Link: https://arxiv.org/abs/2509.02970
Authors: Kaoru Otsuka,Yuki Takezawa,Makoto Yamada
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Federated Learning (FL) allows distributed model training across multiple clients while preserving data privacy, but it remains vulnerable to Byzantine clients that exhibit malicious behavior. While existing Byzantine-robust FL methods provide strong convergence guarantees (e.g., to a stationary point in expectation) under Byzantine attacks, they typically assume full client participation, which is unrealistic due to communication constraints and client availability. Under partial participation, existing methods fail immediately after the sampled clients contain a Byzantine majority, creating a fundamental challenge for sparse communication. First, we introduce delayed momentum aggregation, a novel principle where the server aggregates the most recently received gradients from non-participating clients alongside fresh momentum from active clients. Our optimizer D-Byz-SGDM (Delayed Byzantine-robust SGD with Momentum) implements this delayed momentum aggregation principle for Byzantine-robust FL with partial participation. Then, we establish convergence guarantees that recover previous full participation results and match the fundamental lower bounds we prove for the partial participation setting. Experiments on deep learning tasks validated our theoretical findings, showing stable and robust training under various Byzantine attacks.

[LG-34] RankGraph: Unified Heterogeneous Graph Learning for Cross-Domain Recommendation RECSYS2025

Link: https://arxiv.org/abs/2509.02942
Authors: Renzhi Wu,Junjie Yang,Li Chen,Hong Li,Li Yu,Hong Yan
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: RecSys 2025

Abstract:Cross-domain recommendation systems face the challenge of integrating fine-grained user and item relationships across various product domains. To address this, we introduce RankGraph, a scalable graph learning framework designed to serve as a core component in recommendation foundation models (FMs). By constructing and leveraging graphs composed of heterogeneous nodes and edges across multiple products, RankGraph enables the integration of complex relationships between users, posts, ads, and other entities. Our framework employs a GPU-accelerated Graph Neural Network and contrastive learning, allowing for dynamic extraction of subgraphs such as item-item and user-user graphs to support similarity-based retrieval and real-time clustering. Furthermore, RankGraph integrates graph-based pretrained representations as contextual tokens into FM sequence models, enriching them with structured relational knowledge. RankGraph has demonstrated improvements in click (+0.92%) and conversion rates (+2.82%) in online A/B tests, showcasing its effectiveness in cross-domain recommendation scenarios.

[LG-35] PDRL: Post-hoc Descriptor-based Residual Learning for Uncertainty-Aware Machine Learning Potentials

Link: https://arxiv.org/abs/2509.02927
Authors: Shih-Peng Huang,Nontawat Charoenphakdee,Yuta Tsuboi,Yong-Bin Zhuang,Wenwen Li
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Ensemble methods are considered the gold standard for uncertainty quantification (UQ) for machine learning interatomic potentials (MLIPs). However, their high computational cost can limit their practicality. Alternative techniques, such as Monte Carlo dropout and deep kernel learning, have been proposed to improve computational efficiency; however, some of these methods cannot be applied to already trained models and may affect the prediction accuracy. In this paper, we propose a simple and efficient post-hoc framework for UQ that leverages the descriptor of a trained graph neural network potential to estimate residual errors. We refer to this method as post-hoc descriptor-based residual-based learning (PDRL). PDRL models the discrepancy between MLIP predictions and ground truth values, allowing these residuals to act as proxies for prediction uncertainty. We explore multiple variants of PDRL and benchmark them against established UQ methods, evaluating both their effectiveness and limitations.

[LG-36] A Narrative Review of Clinical Decision Support Systems in Offloading Footwear for Diabetes-Related Foot Ulcers

Link: https://arxiv.org/abs/2509.02923
Authors: Kunal Kumar,Muhammad Ashad Kabir,Luke Donnan,Sayed Ahmed
Categories: Machine Learning (cs.LG)
Comments: 44 pages, 2 figures, and 3 tables

Abstract:Offloading footwear helps prevent and treat diabetic foot ulcers (DFUs) by lowering plantar pressure (PP), yet prescription decisions remain fragmented: feature selection varies, personalization is limited, and evaluation practices differ. We performed a narrative review of 45 studies (12 guidelines/protocols, 25 knowledge-based systems, 8 machine-learning applications) published to Aug 2025. We thematically analyzed knowledge type, decision logic, evaluation methods, and enabling technologies. Guidelines emphasize PP thresholds (≤200 kPa or ≥25–30% reduction) but rarely yield actionable, feature-level outputs. Knowledge-based systems use rule- and sensor-driven logic, integrating PP monitoring, adherence tracking, and usability testing. ML work introduces predictive, optimization, and generative models with high computational accuracy but limited explainability and clinical validation. Evaluation remains fragmented: protocols prioritize biomechanical tests; knowledge-based systems assess usability/adherence; ML studies focus on technical accuracy with weak linkage to long-term outcomes. From this synthesis we propose a five-part CDSS framework: (1) a minimum viable dataset; (2) a hybrid architecture combining rules, optimization, and explainable ML; (3) structured feature-level outputs; (4) continuous validation and evaluation; and (5) integration with clinical and telehealth workflows. This framework aims to enable scalable, patient-centered CDSSs for DFU care; prioritizing interoperable datasets, explainable models, and outcome-focused evaluation will be key to clinical adoption.

[LG-37] Event Detection and Classification for Long Range Sensing of Elephants Using Seismic Signal

Link: https://arxiv.org/abs/2509.02920
Authors: Jaliya L. Wijayaraja,Janaka L. Wijekoon,Malitha Wijesundara
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Comments: This article has been accepted for publication in IEEE Access

Abstract:Detecting elephants through seismic signals is an emerging research topic aimed at developing solutions for Human-Elephant Conflict (HEC). Despite the promising results, such solutions heavily rely on manual classification of elephant footfalls, which limits their applicability for real-time classification in natural settings. To address this limitation and build on our previous work, this study introduces a classification framework targeting resource-constrained implementations, prioritizing both accuracy and computational efficiency. As part of this framework, a novel event detection technique named Contextually Customized Windowing (CCW), tailored specifically for detecting elephant footfalls, was introduced, and evaluations were conducted by comparing it with the Short-Term Average/Long-Term Average (STA/LTA) method. The yielded results show that the maximum validated detection range was 155.6 m in controlled conditions and 140 m in natural environments. Elephant footfall classification using Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel demonstrated superior performance across multiple settings, achieving an accuracy of 99% in controlled environments, 73% in natural elephant habitats, and 70% in HEC-prone human habitats, the most challenging scenario. Furthermore, feature impact analysis using explainable AI identified the number of Zero Crossings and Dynamic Time Warping (DTW) Alignment Cost as the most influential factors in all experiments, while Predominant Frequency exhibited significant influence in controlled settings.
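
The STA/LTA baseline mentioned in this abstract is a classic energy-ratio trigger. The sketch below shows one common trailing-window form of it; the function name, window lengths, and the choice of squared amplitude as the characteristic function are assumptions for illustration and are not taken from the paper (the paper's CCW method is not reproduced here).

```python
import numpy as np

def sta_lta(signal, n_sta, n_lta):
    """Trailing-window STA/LTA ratio of signal energy (requires n_sta <= n_lta)."""
    energy = np.asarray(signal, dtype=float) ** 2
    # Prefix sums let each window average be computed in O(1).
    csum = np.concatenate(([0.0], np.cumsum(energy)))
    ratio = np.zeros(len(energy))
    for i in range(n_lta, len(energy) + 1):
        sta = (csum[i] - csum[i - n_sta]) / n_sta
        lta = (csum[i] - csum[i - n_lta]) / n_lta
        # Leave the ratio at 0 when the long window holds no energy.
        ratio[i - 1] = sta / lta if lta > 0 else 0.0
    return ratio
```

An event is typically declared when the ratio exceeds a chosen threshold; the threshold itself is application-specific.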

[LG-38] Improving Generative Methods for Causal Evaluation via Simulation-Based Inference

Link: https://arxiv.org/abs/2509.02892
Authors: Pracheta Amaranath,Vinitra Muralikrishnan,Amit Sharma,David D. Jensen
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 12 pages main text, 48 pages total

Abstract:Generating synthetic datasets that accurately reflect real-world observational data is critical for evaluating causal estimators, but remains a challenging task. Existing generative methods offer a solution by producing synthetic datasets anchored in the observed data (source data) while allowing variation in key parameters such as the treatment effect and amount of confounding bias. However, existing methods typically require users to provide point estimates of such parameters (rather than distributions) and fixed estimates (rather than estimates that can be improved with reference to the source data). This denies users the ability to express uncertainty over parameter values and removes the potential for posterior inference, potentially leading to unreliable estimator comparisons. We introduce simulation-based inference for causal evaluation (SBICE), a framework that models generative parameters as uncertain and infers their posterior distribution given a source dataset. Leveraging techniques in simulation-based inference, SBICE identifies parameter configurations that produce synthetic datasets closely aligned with the source data distribution. Empirical results demonstrate that SBICE improves the reliability of estimator evaluations by generating more realistic datasets, which supports a robust and data-consistent approach to causal benchmarking under uncertainty.

[LG-39] Power Grid Control with Graph-Based Distributed Reinforcement Learning

Link: https://arxiv.org/abs/2509.02861
Authors: Carlo Fabrizio,Gianvito Losapio,Marco Mussi,Alberto Maria Metelli,Marcello Restelli
Categories: Machine Learning (cs.LG)

Abstract:The necessary integration of renewable energy sources, combined with the expanding scale of power networks, presents significant challenges in controlling modern power grids. Traditional control systems, which are human and optimization-based, struggle to adapt and to scale in such an evolving context, motivating the exploration of more dynamic and distributed control strategies. This work advances a graph-based distributed reinforcement learning framework for real-time, scalable grid management. The proposed architecture consists of a network of distributed low-level agents acting on individual power lines and coordinated by a high-level manager agent. A Graph Neural Network (GNN) is employed to encode the network’s topological information within the single low-level agent’s observation. To accelerate convergence and enhance learning stability, the framework integrates imitation learning and potential-based reward shaping. In contrast to conventional decentralized approaches that decompose only the action space while relying on global observations, this method also decomposes the observation space. Each low-level agent acts based on a structured and informative local view of the environment constructed through the GNN. Experiments on the Grid2Op simulation environment show the effectiveness of the approach, which consistently outperforms the standard baseline commonly adopted in the field. Additionally, the proposed model proves to be much more computationally efficient than the simulation-based Expert method.
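
The abstract mentions potential-based reward shaping. A minimal sketch of its standard form, r' = r + γΦ(s') − Φ(s), follows; the function names are hypothetical and nothing here reflects the paper's actual potential function or power-grid setup.

```python
def shaped_reward(reward, phi_state, phi_next_state, gamma):
    """Add the potential-based shaping term gamma * Phi(s') - Phi(s) to a reward."""
    return reward + gamma * phi_next_state - phi_state

def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

Over a finite trajectory the shaping terms telescope, so the shaped return differs from the original one only by γ^T Φ(s_T) − Φ(s_0), a quantity that does not depend on the actions taken in between.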

[LG-40] Managing Correlations in Data and Privacy Demand CCS

Link: https://arxiv.org/abs/2509.02856
Authors: Syomantak Chaudhuri,Thomas A. Courtade
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: To appear at ACM CCS, 2025

Abstract:Previous works in the differential privacy literature that allow users to choose their privacy levels typically operate under the heterogeneous differential privacy (HDP) framework with the simplifying assumption that user data and privacy levels are not correlated. Firstly, we demonstrate that the standard HDP framework falls short when user data and privacy demands are allowed to be correlated. Secondly, to address this shortcoming, we propose an alternate framework, Add-remove Heterogeneous Differential Privacy (AHDP), that jointly accounts for user data and privacy preference. We show that AHDP is robust to possible correlations between data and privacy. Thirdly, we formalize the guarantees of the proposed AHDP framework through an operational hypothesis testing perspective. The hypothesis testing setup may be of independent interest in analyzing other privacy frameworks as well. Fourthly, we show that there exist non-trivial AHDP mechanisms that notably do not require prior knowledge of the data-privacy correlations. We propose some such mechanisms and apply them to core statistical tasks such as mean estimation, frequency estimation, and linear regression. The proposed mechanisms are simple to implement with minimal assumptions and modeling requirements, making them attractive for real-world use. Finally, we empirically evaluate the proposed AHDP mechanisms, highlighting their trade-offs using LLM-generated synthetic datasets, which we release for future research.

[LG-41] Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm

Link: https://arxiv.org/abs/2509.02846
Authors: Siddharth Mansingh,James Amarel,Ragib Arnab,Arvind Mohan,Kamaljeet Singh,Gerd J. Kunde,Nicolas Hengartner,Benjamin Migliori,Emily Casleton,Nathan A. Debarledeben,Ayan Biswas,Diane Oyen,Earl Lawrence
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Partial Differential Equations (PDEs) are the bedrock for modern computational sciences and engineering, and are inherently computationally expensive. While PDE foundation models have shown much promise for simulating such complex spatio-temporal phenomena, existing models remain constrained by the pretraining datasets and struggle with auto-regressive rollout performance, especially in out-of-distribution (OOD) cases. Furthermore, they have significant compute and training data requirements which hamper their use in many critical applications. Inspired by recent advances in "thinking" strategies used in large language models (LLMs), we introduce the first test-time computing (TTC) strategy for PDEs that utilizes computational resources during inference to achieve more accurate predictions with fewer training samples and smaller models. We accomplish this with two types of reward models that evaluate predictions of a stochastic based model for spatio-temporal consistency. We demonstrate this method on compressible Euler-equation simulations from the PDEGym benchmark and show that TTC captures improved predictions relative to standard non-adaptive auto-regressive inference. This TTC framework marks a foundational step towards more advanced reasoning algorithms for PDE modeling, including building reinforcement-learning-based approaches, potentially transforming computational workflows in physics and engineering.

[LG-42] Fast and Accurate SVD-Type Updating in Streaming Data

Link: https://arxiv.org/abs/2509.02840
Authors: Johannes J. Brust,Michael A. Saunders
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Mathematical Software (cs.MS)

Abstract:For a datastream, the change over a short interval is often of low rank. For high throughput information arranged in matrix format, recomputing an optimal SVD approximation after each step is typically prohibitive. Instead, incremental and truncated updating strategies are used, which may not scale for large truncation ranks. Therefore, we propose a set of efficient new algorithms that update a bidiagonal factorization, and which are similarly accurate as the SVD methods. In particular, we develop a compact Householder-type algorithm that decouples a sparse part from a low-rank update and has about half the memory requirements of standard bidiagonalization methods. A second algorithm based on Givens rotations has only about 10 flops per rotation and scales quadratically with the problem size, compared to a typical cubic scaling. The algorithm is therefore effective for processing high-throughput updates, as we demonstrate in tracking large subspaces of recommendation systems and networks, and when compared to well known software such as LAPACK or the incremental SVD.
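
For readers unfamiliar with the Givens rotations this abstract builds on, the sketch below shows the generic building block: choosing a rotation that zeroes one entry and applying it to two rows of a matrix. This is textbook linear algebra, not the paper's bidiagonal updating algorithm; the helper names are hypothetical.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def apply_givens_rows(m, i, k, c, s):
    """Rotate rows i and k of m in place with the rotation (c, s)."""
    row_i, row_k = m[i].copy(), m[k].copy()
    m[i] = c * row_i + s * row_k
    m[k] = -s * row_i + c * row_k
```

Each rotation touches only two rows, which is why rotation-based updates can be far cheaper than recomputing a full factorization.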

[LG-43] Unlearning That Lasts: Utility-Preserving Robust and Almost Irreversible Forgetting in LLMs

Link: https://arxiv.org/abs/2509.02820
Authors: Naman Deep Singh,Maximilian Müller,Francesco Croce,Matthias Hein
Categories: Machine Learning (cs.LG)

Abstract:Unlearning in large language models (LLMs) involves precisely removing specific information from a pre-trained model. This is crucial to ensure safety of LLMs by deleting private data or harmful knowledge acquired during pre-training. However, existing unlearning methods often fall short when subjected to thorough evaluation. To overcome this, we introduce JensUn, where we leverage the Jensen-Shannon Divergence as the training objective for both forget and retain sets for more stable and effective unlearning dynamics compared to commonly used loss functions. In extensive experiments, JensUn achieves better forget-utility trade-off than competing methods, and even demonstrates strong resilience to benign relearning. Additionally, for a precise unlearning evaluation, we introduce LKF, a curated dataset of lesser-known facts that provides a realistic unlearning scenario. Finally, to comprehensively test unlearning methods, we propose (i) employing an LLM as semantic judge instead of the standard ROUGE score, and (ii) using worst-case unlearning evaluation over various paraphrases and input formats. Our improved evaluation framework reveals that many existing methods are less effective than previously thought.
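
The Jensen-Shannon divergence that this abstract uses as a training objective can be computed for two discrete distributions as in the sketch below. This only illustrates the divergence itself (natural logarithm, so it is bounded by log 2); how JensUn applies it to forget and retain sets is not shown, and the function name is an assumption.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0 because b is the mixture.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the KL divergence, this quantity is symmetric and stays finite even when the two distributions have disjoint support.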

[LG-44] Multi-Embodiment Locomotion at Scale with extreme Embodiment Randomization

Link: https://arxiv.org/abs/2509.02815
Authors: Nico Bohlinger,Jan Peters
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:We present a single, general locomotion policy trained on a diverse collection of 50 legged robots. By combining an improved embodiment-aware architecture (URMAv2) with a performance-based curriculum for extreme Embodiment Randomization, our policy learns to control millions of morphological variations. Our policy achieves zero-shot transfer to unseen real-world humanoid and quadruped robots.

[LG-45] Challenges in Understanding Modality Conflict in Vision-Language Models

Link: https://arxiv.org/abs/2509.02805
Authors: Trang Nguyen,Jackson Michaels,Madalina Fiterau,David Jensen
Categories: Machine Learning (cs.LG)

Abstract:This paper highlights the challenge of decomposing conflict detection from conflict resolution in Vision-Language Models (VLMs) and presents potential approaches, including using a supervised metric via linear probes and group-based attention pattern analysis. We conduct a mechanistic investigation of LLaVA-OV-7B, a state-of-the-art VLM that exhibits diverse resolution behaviors when faced with conflicting multimodal inputs. Our results show that a linearly decodable conflict signal emerges in the model’s intermediate layers and that attention patterns associated with conflict detection and resolution diverge at different stages of the network. These findings support the hypothesis that detection and resolution are functionally distinct mechanisms. We discuss how such decomposition enables more actionable interpretability and targeted interventions for improving model robustness in challenging multimodal settings.

[LG-46] Learning Laplacian Eigenvectors: a Pre-training Method for Graph Neural Networks

Link: https://arxiv.org/abs/2509.02803
Authors: Howard Dai,Nyambura Njenga,Benjamin Whitsett,Catherine Ma,Darwin Deng,Sara de Ángel,Alexandre Van Tassel,Siddharth Viswanath,Ryan Pellico,Ian Adelstein,Smita Krishnaswamy
Categories: Machine Learning (cs.LG)

Abstract:We propose a novel framework for pre-training Graph Neural Networks (GNNs) by inductively learning Laplacian eigenvectors. Traditional Message Passing Neural Networks (MPNNs) often struggle to capture global and regional graph structure due to over-smoothing risk as network depth increases. Because the low-frequency eigenvectors of the graph Laplacian matrix encode global information, pre-training GNNs to predict these eigenvectors encourages the network to naturally learn large-scale structural patterns over each graph. Empirically, we show that models pre-trained via our framework outperform baseline models on a variety of graph structure-based tasks. While most existing pre-training methods focus on domain-specific tasks like node or edge feature reconstruction, our self-supervised pre-training framework is structure-based and highly flexible. Eigenvector-learning can be applied to all graph-based datasets, and can be used with synthetic features when task-specific data is sparse.
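
To make the pre-training target in this abstract concrete, the sketch below computes the low-frequency eigenvectors of an unnormalized graph Laplacian L = D − A with NumPy. Whether the paper uses the unnormalized or a normalized Laplacian, and how it handles eigenvector sign ambiguity, is not stated in the abstract; the function name and these choices are assumptions.

```python
import numpy as np

def low_frequency_eigenvectors(adj, k):
    """Return the k smallest eigenvalues and eigenvectors of L = D - A for a symmetric adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    vals, vecs = np.linalg.eigh(lap)
    return vals[:k], vecs[:, :k]
```

For a connected graph the first eigenvector is constant, and the next few vary slowly across the graph, which is why they carry global structural information. Eigenvectors are only defined up to sign, so any loss that regresses onto them has to account for that.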

[LG-47] Structured Basis Function Networks: Loss-Centric Multi-Hypothesis Ensembles with Controllable Diversity

Link: https://arxiv.org/abs/2509.02792
Authors: Alejandro Rodriguez Dominguez,Muhammad Shahzad,Xia Hong
Categories: Machine Learning (cs.LG)
Comments: 32 Pages, 10 Figures, 11 Tables

Abstract:Existing approaches to predictive uncertainty rely either on multi-hypothesis prediction, which promotes diversity but lacks principled aggregation, or on ensemble learning, which improves accuracy but rarely captures the structured ambiguity. This implicitly means that a unified framework consistent with the loss geometry remains absent. The Structured Basis Function Network addresses this gap by linking multi-hypothesis prediction and ensembling through centroidal aggregation induced by Bregman divergences. The formulation applies across regression and classification by aligning predictions with the geometry of the loss, and supports both a closed-form least-squares estimator and a gradient-based procedure for general objectives. A tunable diversity mechanism provides parametric control of the bias-variance-diversity trade-off, connecting multi-hypothesis generalisation with loss-aware ensemble aggregation. Experiments validate this relation and use the mechanism to study the complexity-capacity-diversity trade-off across datasets of increasing difficulty with deep-learning predictors.

[LG-48] LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference

Link: https://arxiv.org/abs/2509.02753
Authors: Krishna Teja Chitty-Venkata,Sandeep Madireddy,Murali Emani,Venkatram Vishwanath
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage they provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate a fixed number of experts uniformly across all layers, resulting in redundant computation and suboptimal performance. In this work, we first demonstrate that MoE pruning strategies improve only the memory footprint but do not significantly improve inference performance on GPU using optimized frameworks such as vLLM. To address this, we introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. LExI leverages only the model weights to estimate the relative importance of each layer and adaptively assigns the number of active experts accordingly per layer. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency with negligible accuracy loss. For example, using LExI, Qwen1.5-MoE achieves the same throughput on Nvidia H100 GPU with 10% better accuracy than traditional expert pruning.

[LG-49] Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient

Link: https://arxiv.org/abs/2509.02737
Authors: Zhongzhu Zhou,Yibo Yang,Ziyan Chen,Fengxiang Bie,Haojun Xia,Xiaoxia Wu,Robert Wu,Ben Athiwaratkun,Bernard Ghanem,Shuaiwen Leon Song
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 4 figures, 2 tables; includes supplementary material; preprint

Abstract:Policy gradient (PG) methods in reinforcement learning frequently utilize deep neural networks (DNNs) to learn a shared backbone of feature representations used to compute likelihoods in an action selection layer. Numerous studies have been conducted on the convergence and global optima of policy networks, but few have analyzed representational structures of those underlying networks. While training an optimal policy DNN, we observed that under certain constraints, a gentle structure resembling neural collapse, which we refer to as Action Collapse (AC), emerges. This suggests that 1) the state-action activations (i.e. last-layer features) sharing the same optimal actions collapse towards those optimal actions respective mean activations; 2) the variability of activations sharing the same optimal actions converges to zero; 3) the weights of action selection layer and the mean activations collapse to a simplex equiangular tight frame (ETF). Our early work showed those aforementioned constraints to be necessary for these observations. Since the collapsed ETF of optimal policy DNNs maximally separates the pair-wise angles of all actions in the state-action space, we naturally raise a question: can we learn an optimal policy using an ETF structure as a (fixed) target configuration in the action selection layer? Our analytical proof shows that learning activations with a fixed ETF as action selection layer naturally leads to the AC. We thus propose the Action Collapse Policy Gradient (ACPG) method, which accordingly affixes a synthetic ETF as our action selection layer. ACPG induces the policy DNN to produce such an ideal configuration in the action selection layer while remaining optimal. Our experiments across various OpenAI Gym environments demonstrate that our technique can be integrated into any discrete PG methods and lead to favorable reward improvements more quickly and robustly.
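
The simplex equiangular tight frame (ETF) referred to in this abstract has a simple closed form. The sketch below builds the canonical K-by-K version, whose columns have unit norm and pairwise cosine −1/(K−1); how ACPG embeds such a frame into the action selection layer (for example via an extra orthogonal projection to the feature dimension) is not described in the abstract and is not shown. The function name is hypothetical.

```python
import numpy as np

def simplex_etf(num_actions):
    """Return a K x K matrix whose columns form a simplex ETF (requires K >= 2)."""
    k = num_actions
    centered = np.eye(k) - np.ones((k, k)) / k
    # The scaling makes every column unit-norm.
    return np.sqrt(k / (k - 1.0)) * centered
```

The value −1/(K−1) is the most negative pairwise cosine that K unit vectors can share, which is the sense in which the frame maximally separates them.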

[LG-50] Preference Robustness for DPO with Applications to Public Health

Link: https://arxiv.org/abs/2509.02709
Authors: Cheol Woo Kim,Shresth Verma,Mauricio Tec,Milind Tambe
Categories: Machine Learning (cs.LG)

Abstract:We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO), which accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves comparable performance to prior self-reflection-based baseline for reward function design, while requiring significantly lower inference-time cost.

[LG-51] Towards Performatively Stable Equilibria in Decision-Dependent Games for Arbitrary Data Distribution Maps

Link: https://arxiv.org/abs/2509.02619
Authors: Guangzheng Zhong,Yang Liu,Jiming Liu
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:In decision-dependent games, multiple players optimize their decisions under a data distribution that shifts with their joint actions, creating complex dynamics in applications like market pricing. A practical consequence of these dynamics is the performatively stable equilibrium, where each player’s strategy is a best response under the induced distribution. Prior work relies on β-smoothness, assuming Lipschitz continuity of loss function gradients with respect to the data distribution, which is impractical as the data distribution maps, i.e., the relationship between joint decision and the resulting distribution shifts, are typically unknown, rendering β unobtainable. To overcome this limitation, we propose a gradient-based sensitivity measure that directly quantifies the impact of decision-induced distribution shifts. Leveraging this measure, we derive convergence guarantees for performatively stable equilibria under a practically feasible assumption of strong monotonicity. Accordingly, we develop a sensitivity-informed repeated retraining algorithm that adjusts players’ loss functions based on the sensitivity measure, guaranteeing convergence to performatively stable equilibria for arbitrary data distribution maps. Experiments on a prediction error minimization game, Cournot competition, and a revenue maximization game show that our approach outperforms state-of-the-art baselines, achieving lower losses and faster convergence.

[LG-52] Learning AC Power Flow Solutions using a Data-Dependent Variational Quantum Circuit

Link: https://arxiv.org/abs/2509.03495
Authors: Thinh Viet Le,Md Obaidur Rahman,Vassilis Kekatos
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 7 pages, 6 figures, accepted for the IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids 2025

Abstract:Interconnection studies require solving numerous instances of the AC load or power flow (AC PF) problem to simulate diverse scenarios as power systems navigate the ongoing energy transition. To expedite such studies, this work leverages recent advances in quantum computing to find or predict AC PF solutions using a variational quantum circuit (VQC). VQCs are trainable models that run on modern-day noisy intermediate-scale quantum (NISQ) hardware to accomplish elaborate optimization and machine learning (ML) tasks. Our first contribution is to pose a single instance of the AC PF as a nonlinear least-squares fit over the VQC trainable parameters (weights) and solve it using a hybrid classical/quantum computing approach. The second contribution is to feed PF specifications as features into a data-embedded VQC and train the resultant quantum ML (QML) model to predict general PF solutions. The third contribution is to develop a novel protocol to efficiently measure AC-PF quantum observables by exploiting the graph structure of a power network. Preliminary numerical tests indicate that the proposed VQC models attain enhanced prediction performance over a deep neural network despite using much fewer weights. The proposed quantum AC-PF framework sets the foundations for addressing more elaborate grid tasks via quantum computing.

[LG-53] From Image Denoisers to Regularizing Imaging Inverse Problems: An Overview

Link: https://arxiv.org/abs/2509.03475
Authors: Hong Ye Tan,Subhadip Mukherjee,Junqi Tang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Inverse problems lie at the heart of modern imaging science, with broad applications in areas such as medical imaging, remote sensing, and microscopy. Recent years have witnessed a paradigm shift in solving imaging inverse problems, where data-driven regularizers are used increasingly, leading to remarkably high-fidelity reconstruction. A particularly notable approach for data-driven regularization is to use learned image denoisers as implicit priors in iterative image reconstruction algorithms. This survey presents a comprehensive overview of this powerful and emerging class of algorithms, commonly referred to as plug-and-play (PnP) methods. We begin by providing a brief background on image denoising and inverse problems, followed by a short review of traditional regularization strategies. We then explore how proximal splitting algorithms, such as the alternating direction method of multipliers (ADMM) and proximal gradient descent (PGD), can naturally accommodate learned denoisers in place of proximal operators, and under what conditions such replacements preserve convergence. The role of Tweedie’s formula in connecting optimal Gaussian denoisers and score estimation is discussed, which lays the foundation for regularization-by-denoising (RED) and more recent diffusion-based posterior sampling methods. We discuss theoretical advances regarding the convergence of PnP algorithms, both within the RED and proximal settings, emphasizing the structural assumptions that the denoiser must satisfy for convergence, such as non-expansiveness, Lipschitz continuity, and local homogeneity. We also address practical considerations in algorithm design, including choices of denoiser architecture and acceleration strategies.
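The substitution of a denoiser for the proximal operator in proximal gradient descent can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's code: the forward operator is a binary sampling mask on a 1D signal, and `moving_average_denoiser` is a hypothetical stand-in for a learned denoiser. All function names are illustrative.

```python
def moving_average_denoiser(x, width=3):
    """Hypothetical stand-in for a learned denoiser: a clamped moving average."""
    half = width // 2
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def pnp_pgd(y, mask, denoiser, step=1.0, iters=50):
    """Plug-and-play proximal gradient descent for y = mask * x + noise.

    Data term: f(x) = 0.5 * ||mask * x - y||^2, so grad f(x) = mask * (mask * x - y).
    The proximal step of the regularizer is replaced by a call to the denoiser.
    """
    x = list(y)
    for _ in range(iters):
        grad = [m * (m * xi - yi) for m, xi, yi in zip(mask, x, y)]
        z = [xi - step * g for xi, g in zip(x, grad)]  # gradient step on the data term
        x = denoiser(z)                                # denoiser in place of the prox
    return x


if __name__ == "__main__":
    # A constant signal with the middle sample missing (mask entry 0).
    observed = [1.0, 1.0, 0.0, 1.0, 1.0]
    sampling_mask = [1, 1, 0, 1, 1]
    print(pnp_pgd(observed, sampling_mask, moving_average_denoiser))
```

Whether such an iteration converges depends on properties of the denoiser (for example non-expansiveness), which is exactly the question the survey discusses; the toy example here makes no such guarantee for general denoisers.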

[LG-54] Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation RECSYS’25

Link: https://arxiv.org/abs/2509.03456
Authors: Imad Aouali,Otmane Sakhi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: RecSys '25, CONSEQUENCES: Causality, Counterfactuals & Sequential Decision-Making Workshop

Abstract:Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, we argue this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and extensive empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as action spaces become large. We demonstrate that simpler weighted log-likelihood objectives enjoy substantially better optimization properties and still recover competitive, often superior, learned policies. Our findings emphasize the necessity of explicitly addressing optimization considerations in the development of OPL algorithms for large action spaces.

[LG-55] Non-Linear Counterfactual Aggregate Optimization RECSYS’25

Link: https://arxiv.org/abs/2509.03438
Authors: Benjamin Heymann,Otmane Sakhi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: RecSys '25, CONSEQUENCES: Causality, Counterfactuals & Sequential Decision-Making Workshop

Abstract:We consider the problem of directly optimizing a non-linear function of an outcome, where this outcome itself is the sum of many small contributions. The non-linearity of the function means that the problem is not equivalent to the maximization of the expectation of the individual contribution. By leveraging the concentration properties of the sum of individual outcomes, we derive a scalable descent algorithm that directly optimizes for our stated objective. This makes it possible, for instance, to maximize the probability of a successful A/B test, for which it can be wiser to target a success criterion, such as exceeding a given uplift, rather than chasing the highest expected payoff.

[LG-56] Understanding and Improving the Shampoo Optimizer via Kullback-Leibler Minimization

Link: https://arxiv.org/abs/2509.03378
Authors: Wu Lin,Scott C. Lowe,Felix Dangel,Runa Eschenhagen,Zikun Xu,Roger B. Grosse
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Technical report, work in progress

Abstract:As an adaptive method, Shampoo employs a structured second-moment estimation, and its effectiveness has attracted growing attention. Prior work has primarily analyzed its estimation scheme through the Frobenius norm. Motivated by the natural connection between the second moment and a covariance matrix, we propose studying Shampoo’s estimation as covariance estimation through the lens of Kullback-Leibler (KL) minimization. This alternative perspective reveals a previously hidden limitation, motivating improvements to Shampoo’s design. Building on this insight, we develop a practical estimation scheme, termed KL-Shampoo, that eliminates Shampoo’s reliance on Adam for stabilization, thereby removing the additional memory overhead introduced by Adam. Preliminary results show that KL-Shampoo improves Shampoo’s performance, enabling it to stabilize without Adam and even outperform its Adam-stabilized variant, SOAP, in neural network pretraining.

[LG-57] An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment

Link: https://arxiv.org/abs/2509.03372
Authors: Tien-Hong Lo,Szu-Yu Chen,Yao-Ting Sung,Berlin Chen
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at ASRU 2025

Abstract:A recent line of research on automated speaking assessment (ASA) has benefited from self-supervised learning (SSL) representations, which capture rich acoustic and linguistic patterns in non-native speech without underlying assumptions of feature curation. However, speech-based SSL models capture acoustic-related traits but overlook linguistic content, while text-based SSL models rely on ASR output and fail to encode prosodic nuances. Moreover, most prior work treats proficiency levels as nominal classes, ignoring their ordinal structure and the non-uniform intervals between proficiency labels. To address these limitations, we propose an effective ASA approach combining SSL with handcrafted indicator features via a novel modeling paradigm. We further introduce a multi-margin ordinal loss that jointly models both the score ordinality and non-uniform intervals of proficiency labels. Extensive experiments on the TEEMI corpus show that our method consistently outperforms strong baselines and generalizes well to unseen prompts.

[LG-58] Bayesian Additive Regression Trees for functional ANOVA model

Link: https://arxiv.org/abs/2509.03317
Authors: Seokhun Park,Insung Kong,Yongdai Kim
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Bayesian Additive Regression Trees (BART) is a powerful statistical model that leverages the strengths of Bayesian inference and regression trees. It has received significant attention for capturing complex non-linear relationships and interactions among predictors. However, the accuracy of BART often comes at the cost of interpretability. To address this limitation, we propose ANOVA Bayesian Additive Regression Trees (ANOVA-BART), a novel extension of BART based on the functional ANOVA decomposition, which is used to decompose the variability of a function into different interactions, each representing the contribution of a different set of covariates or factors. Our proposed ANOVA-BART enhances interpretability, preserves and extends the theoretical guarantees of BART, and achieves superior predictive performance. Specifically, we establish that the posterior concentration rate of ANOVA-BART is nearly minimax optimal, and further provides the same convergence rates for each interaction that are not available for BART. Moreover, comprehensive experiments confirm that ANOVA-BART surpasses BART in both accuracy and uncertainty quantification, while also demonstrating its effectiveness in component selection. These results suggest that ANOVA-BART offers a compelling alternative to BART by balancing predictive accuracy, interpretability, and theoretical consistency.

[LG-59] Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings

Link: https://arxiv.org/abs/2509.03292
Authors: Dyah A. M. G. Wisnu,Ryandhimas E. Zezario,Stefano Rini,Hsin-Min Wang,Yu Tsao
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted by IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025

Abstract:We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores (Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness) for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A main challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor and use a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
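For readers unfamiliar with the triplet loss mentioned in this abstract, the standard margin-based form can be sketched as follows. This is a generic illustration, not the authors' implementation: the margin value and the use of Euclidean distance are assumptions, and the buffer-based sampling described in the abstract is not reproduced.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the positive closer than the negative by a margin.

    The margin of 0.2 is an illustrative default, not a value from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    similar = [0.0, 1.0]      # perceptually similar sample (assumed)
    dissimilar = [3.0, 4.0]   # perceptually dissimilar sample (assumed)
    print(triplet_loss(anchor, similar, dissimilar))
    print(triplet_loss(anchor, dissimilar, similar))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute gradient.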

[LG-60] SurGBSA: Learning Representations From Molecular Dynamics Simulations

Link: https://arxiv.org/abs/2509.03084
Authors: Derek Jones,Yue Yang,Felice C. Lightstone,Niema Moshiri,Jonathan E. Allen,Tajana S. Rosing
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)

Abstract:Self-supervised pretraining from static structures of drug-like compounds and proteins enables powerful learned feature representations. Learned features demonstrate state-of-the-art performance on a range of predictive tasks including molecular properties, structure generation, and protein-ligand interactions. The majority of approaches are limited by their use of static structures, and it remains an open question how best to use atomistic molecular dynamics (MD) simulations to develop more generalized models to improve prediction accuracy for novel molecular structures. We present SURrogate mmGBSA (SurGBSA) as a new modeling approach for MD-based representation learning, which learns a surrogate function of the Molecular Mechanics Generalized Born Surface Area (MMGBSA). We show for the first time the benefits of physics-informed pre-training to train a surrogate MMGBSA model on a collection of over 1.4 million 3D trajectories collected from MD simulations of the CASF-2016 benchmark. SurGBSA demonstrates a dramatic 6,497x speedup versus a traditional physics-based single-point MMGBSA calculation while nearly matching single-point MMGBSA accuracy on the challenging pose ranking problem for identification of the correct top pose (-0.4% difference). Our work advances the development of molecular foundation models by showing model improvements when training on MD simulations. Models, code and training data are made publicly available.

[LG-61] Scale-Adaptive Generative Flows for Multiscale Scientific Data

Link: https://arxiv.org/abs/2509.02971
Authors: Yifan Chen,Eric Vanden-Eijnden
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)

Abstract:Flow-based generative models can face significant challenges when modeling scientific data with multiscale Fourier spectra, often producing large errors in fine-scale features. We address this problem within the framework of stochastic interpolants, via principled design of noise distributions and interpolation schedules. The key insight is that the noise should not be smoother than the target data distribution – measured by Fourier spectrum decay rates – to ensure bounded drift fields near the initial time. For Gaussian and near-Gaussian distributions whose fine-scale structure is known, we show that spectrum-matched noise improves numerical efficiency compared to standard white-noise approaches. For complex non-Gaussian distributions, we develop scale-adaptive interpolation schedules that address the numerical ill-conditioning arising from rougher-than-data noise. Numerical experiments on synthetic Gaussian random fields and solutions to the stochastic Allen-Cahn and Navier-Stokes equations validate our approach and demonstrate its ability to generate high-fidelity samples at lower computational cost than traditional approaches.

[LG-62] Faster Gradient Methods for Highly-smooth Stochastic Bilevel Optimization

Link: https://arxiv.org/abs/2509.02937
Authors: Lesi Chen,Junru Li,Jingzhao Zhang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This paper studies the complexity of finding an $\epsilon$-stationary point for stochastic bilevel optimization when the upper-level problem is nonconvex and the lower-level problem is strongly convex. Recent work proposed the first-order method, F$^2$SA, achieving the $\tilde{\mathcal{O}}(\epsilon^{-6})$ upper complexity bound for first-order smooth problems. This is slower than the optimal $\Omega(\epsilon^{-4})$ complexity lower bound in its single-level counterpart. In this work, we show that faster rates are achievable for higher-order smooth problems. We first reformulate F$^2$SA as approximating the hyper-gradient with a forward difference. Based on this observation, we propose a class of methods F$^2$SA-$p$ that uses $p$th-order finite difference for hyper-gradient approximation and improves the upper bound to $\tilde{\mathcal{O}}(p \epsilon^{-4-2/p})$ for $p$th-order smooth problems. Finally, we demonstrate that the $\Omega(\epsilon^{-4})$ lower bound also holds for stochastic bilevel problems when the high-order smoothness holds for the lower-level variable, indicating that the upper bound of F$^2$SA-$p$ is nearly optimal in the highly smooth region $p = \Omega(\log \epsilon^{-1} / \log \log \epsilon^{-1})$.
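The idea that a higher-order finite difference approximates a derivative more accurately than a forward difference can be illustrated on a scalar function. The sketch below is a generic numerical example, not the paper's hyper-gradient estimator: it only compares a first-order forward difference with a second-order central difference, and the function names are illustrative.

```python
import math


def forward_difference(f, x, h):
    """First-order accurate: the truncation error shrinks like O(h)."""
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    """Second-order accurate: the truncation error shrinks like O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    # d/dx exp(x) at x = 0 is exactly 1, so the error is easy to read off.
    h = 1e-3
    print(abs(forward_difference(math.exp, 0.0, h) - 1.0))
    print(abs(central_difference(math.exp, 0.0, h) - 1.0))
```

For smooth functions the higher-order formula reaches a given accuracy with a larger step size, which is the general mechanism behind trading extra smoothness for better rates.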

[LG-63] Quantifying the Social Costs of Power Outages and Restoration Disparities Across Four U.S. Hurricanes

Link: https://arxiv.org/abs/2509.02653
Authors: Xiangpeng Li,Junwei Ma,Bo Li,Ali Mostafavi
Subjects: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); General Economics (econ.GN)

Abstract:The multifaceted nature of disaster impact shows that densely populated areas contribute more to aggregate burden, while sparsely populated but heavily affected regions suffer disproportionately at the individual level. This study introduces a framework for quantifying the societal impacts of power outages by translating customer weighted outage exposure into deprivation measures, integrating welfare metrics with three recovery indicators, average outage days per customer, restoration duration, and relative restoration rate, computed from sequential EAGLE I observations and linked to Zip Code Tabulation Area demographics. Applied to four United States hurricanes, Beryl 2024 Texas, Helene 2024 Florida, Milton 2024 Florida, and Ida 2021 Louisiana, this standardized pipeline provides the first cross event, fine scale evaluation of outage impacts and their drivers. Results demonstrate regressive patterns with greater burdens in lower income areas, mechanistic analysis shows deprivation increases with longer restoration durations and decreases with faster restoration rates, explainable modeling identifies restoration duration as the dominant driver, and clustering reveals distinct recovery typologies not captured by conventional reliability metrics. This framework delivers a transferable method for assessing outage impacts and equity, comparative cross event evidence linking restoration dynamics to social outcomes, and actionable spatial analyses that support equity informed restoration planning and resilience investment.

[LG-64] Quantifying Clinician Bias and its Effects on Schizophrenia Diagnosis in the Emergency Department of the Mount Sinai Health System

Link: https://arxiv.org/abs/2509.02651
Authors: Alissa A. Valentine,Lauren A. Lepow,Lili Chan,Alexander W. Charney,Isotta Landi
Subjects: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG)

Abstract:In the United States, schizophrenia (SCZ) carries a race and sex disparity that may be explained by clinician bias - a belief held by a clinician about a patient that prevents impartial clinical decision making. The emergency department (ED) is marked by higher rates of stress that lead to clinicians relying more on implicit biases during decision making. In this work, we considered a large cohort of psychiatric patients in the ED from the Mount Sinai Health System (MSHS) in New York City to investigate the effects of clinician bias on SCZ diagnosis while controlling for known risk factors and patient sociodemographic information. Clinician bias was quantified as the ratio of negative to total sentences within a patient’s first ED note. We utilized a logistic regression to predict SCZ diagnosis given patient race, sex, age, history of trauma or substance use disorder, and the ratio of negative sentences. Our findings showed that an increased ratio of negative sentences is associated with higher odds of obtaining a SCZ diagnosis [OR (95% CI)=1.408 (1.361-1.456)]. Identifying as male [OR (95% CI)=1.112 (1.055-1.173)] or Black [OR (95% CI)=1.081(1.031-1.133)] increased one’s odds of being diagnosed with SCZ. However, from an intersectional lens, Black female patients with high SES have the highest odds of obtaining a SCZ diagnosis [OR (95% CI)=1.629 (1.535-1.729)]. Results such as these suggest that SES does not act as a protective buffer against SCZ diagnosis in all patients, demanding more attention to the quantification of health disparities. Lastly, we demonstrated that clinician bias is operational with real world data and related to increased odds of obtaining a stigmatizing diagnosis such as SCZ.
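The odds ratios quoted in this abstract relate to logistic regression coefficients through OR = exp(beta). The sketch below shows that conversion with a Wald-type 95% interval; that the authors used exactly this interval is an assumption, and the coefficient and standard error are back-calculated here from the reported OR of 1.408 (1.361-1.456) purely for illustration.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient into an odds ratio with a Wald CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


if __name__ == "__main__":
    # Illustrative back-calculation from the reported OR and interval (not study data).
    beta = math.log(1.408)
    se = (math.log(1.456) - math.log(1.361)) / (2 * 1.96)
    odds_ratio, lower, upper = odds_ratio_ci(beta, se)
    print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```

An odds ratio above 1 with an interval that excludes 1 corresponds to a positive coefficient that is statistically distinguishable from zero at the chosen level.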

[LG-65] Fast kernel methods: Sobolev physics-informed and additive models

Link: https://arxiv.org/abs/2509.02649
Authors: Nathan Doumèche(LPSM, EDF R&D OSIRIS),Francis Bach(ENS-PSL),Gérard Biau(LPSM, IUF),Claire Boyer(LMO)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Abstract:Kernel methods are powerful tools in statistical learning, but their cubic complexity in the sample size n limits their use on large-scale datasets. In this work, we introduce a scalable framework for kernel regression with O(n log n) complexity, fully leveraging GPU acceleration. The approach is based on a Fourier representation of kernels combined with non-uniform fast Fourier transforms (NUFFT), enabling exact, fast, and memory-efficient computations. We instantiate our framework in three settings: Sobolev kernel regression, physics-informed regression, and additive models. When known, the proposed estimators are shown to achieve minimax convergence rates, consistent with classical kernel theory. Empirical results demonstrate that our methods can process up to tens of billions of samples within minutes, providing both statistical accuracy and computational scalability. These contributions establish a flexible approach, paving the way for the routine application of kernel methods in large-scale learning tasks.

[LG-66] Optimizing Prognostic Biomarker Discovery in Pancreatic Cancer Through Hybrid Ensemble Feature Selection and Multi-Omics Data

Link: https://arxiv.org/abs/2509.02648
Authors: John Zobolas,Anne-Marie George,Alberto López,Sebastian Fischer,Marc Becker,Tero Aittokallio
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Comments: 52 pages, 5 figures, 9 Supplementary Figures, 1 Supplementary Table

Abstract:Prediction of patient survival using high-dimensional multi-omics data requires systematic feature selection methods that ensure predictive performance, sparsity, and reliability for prognostic biomarker discovery. We developed a hybrid ensemble feature selection (hEFS) approach that combines data subsampling with multiple prognostic models, integrating both embedded and wrapper-based strategies for survival prediction. Omics features are ranked using a voting-theory-inspired aggregation mechanism across models and subsamples, while the optimal number of features is selected via a Pareto front, balancing predictive accuracy and model sparsity without any user-defined thresholds. When applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers compared to the conventional, late-fusion CoxLasso models, while maintaining comparable discrimination performance. Implemented within the open-source mlr3fselect R package, hEFS offers a robust, interpretable, and clinically valuable tool for prognostic modelling and biomarker discovery in high-dimensional survival settings.

[LG-67] Gaussian process surrogate with physical law-corrected prior for multi-coupled PDEs defined on irregular geometry

Link: https://arxiv.org/abs/2509.02617
Authors: Pucheng Tang,Hongqiao Wang,Wenzhou Lin,Qian Chen,Heng Yong
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 40 pages, 16 figures, 7 tables

Abstract:Parametric partial differential equations (PDEs) are fundamental mathematical tools for modeling complex physical systems, yet their numerical evaluation across parameter spaces remains computationally intensive when using conventional high-fidelity solvers. To address this challenge, we propose a novel physical law-corrected prior Gaussian process (LC-prior GP) surrogate modeling framework that effectively integrates data-driven learning with underlying physical constraints to flexibly handle multi-coupled variables defined on complex geometries. The proposed approach leverages proper orthogonal decomposition (POD) to parameterize high-dimensional PDE solutions via their dominant modes and associated coefficients, thereby enabling efficient Gaussian process (GP) surrogate modeling within a reduced-dimensional coefficient space. A key contribution lies in the incorporation of physical laws together with a limited number of parameter samples to correct the GP posterior mean, thus avoiding reliance on computationally expensive numerical solvers. Furthermore, interpolation functions are constructed to describe the mapping from the full parameter space to the physics-based correction term. This mapping is subsequently backpropagated to constrain the original GP surrogate, yielding a more physically consistent conditional prior. To handle irregular geometries, the radial basis function-finite difference (RBF-FD) method is incorporated during training set computation, with its inherent differentiation matrices providing both computational efficiency and numerical accuracy for physical constraint optimization. The effectiveness of the proposed method is demonstrated through numerical experiments involving a reaction-diffusion model, miscible flooding models, and Navier-Stokes equations with multi-physics coupling defined on irregular domains.

[LG-68] Use ADAS Data to Predict Near-Miss Events: A Group-Based Zero-Inflated Poisson Approach

Link: https://arxiv.org/abs/2509.02614
Authors: Xinbo Zhang,Montserrat Guillen,Lishuai Li,Xin Li,Youhua Frank Chen
Subjects: Applications (stat.AP); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Preprint. 10 pages, 3 figures, 4 tables. Submitted to 2025 IEEE International Conference on Big Data (IEEE BigData 2025). Corresponding authors: Youhua Frank Chen (youhchen@cityu. this http URL )

Abstract:Driving behavior big data leverages multi-sensor telematics to understand how people drive and powers applications such as risk evaluation, insurance pricing, and targeted intervention. Usage-based insurance (UBI) built on these data has become mainstream. Telematics-captured near-miss events (NMEs) provide a timely alternative to claim-based risk, but weekly NMEs are sparse, highly zero-inflated, and behaviorally heterogeneous even after exposure normalization. Analyzing multi-sensor telematics and ADAS warnings, we show that the traditional statistical models underfit the dataset. We address these challenges by proposing a set of zero-inflated Poisson (ZIP) frameworks that learn latent behavior groups and fit offset-based count models via EM to yield calibrated, interpretable weekly risk predictions. Using a naturalistic dataset from a fleet of 354 commercial drivers over a year, during which the drivers completed 287,511 trips and logged 8,142,896 km in total, our results show consistent improvements over baselines and prior telematics models, with lower AIC/BIC values in-sample and better calibration out-of-sample. We also conducted sensitivity analyses on the EM-based grouping for the number of clusters, finding that the gains were robust and interpretable. Practically, this supports context-aware ratemaking on a weekly basis and fairer premiums by recognizing heterogeneous driving styles.
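The zero-inflated Poisson model with an exposure offset that this abstract refers to has a simple closed-form probability mass function. The sketch below is a minimal illustration under stated assumptions: a single driver-week with a structural-zero probability `pi_zero`, a log-rate `log_rate`, and exposure in arbitrary units. These parameter names are hypothetical, and the latent behavior groups and EM fitting described in the abstract are not reproduced.

```python
import math


def zip_pmf(k, pi_zero, log_rate, exposure):
    """P(Y = k) under a zero-inflated Poisson with mean rate exposure * exp(log_rate)."""
    mu = exposure * math.exp(log_rate)  # the offset enters as log(exposure) in the linear predictor
    poisson = math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson  # structural zeros plus Poisson zeros
    return (1.0 - pi_zero) * poisson


def zip_mean(pi_zero, log_rate, exposure):
    """Expected count: the Poisson mean scaled by the probability of not being a structural zero."""
    return (1.0 - pi_zero) * exposure * math.exp(log_rate)


if __name__ == "__main__":
    # Illustrative values only: 30% structural zeros, unit rate, two units of exposure.
    print(zip_pmf(0, 0.3, 0.0, 2.0))
    print(zip_mean(0.3, 0.0, 2.0))
```

Compared with a plain Poisson model, the extra mass at zero lets the model account for weeks with no near-miss events without forcing the event rate itself toward zero.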

[LG-69] Lessons Learned from Deploying Adaptive Machine Learning Agents with Limited Data for Real-time Cell Culture Process Monitoring

Link: https://arxiv.org/abs/2509.02606
Authors: Thanh Tung Khuat,Johnny Peng,Robert Bassett,Ellen Otte,Bogdan Gabrys
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:This study explores the deployment of three machine learning (ML) approaches for real-time prediction of glucose, lactate, and ammonium concentrations in cell culture processes, using Raman spectroscopy as input features. The research addresses challenges associated with limited data availability and process variability, providing a comparative analysis of pretrained models, just-in-time learning (JITL), and online learning algorithms. Two industrial case studies are presented to evaluate the impact of varying bioprocess conditions on model performance. The findings highlight the specific conditions under which pretrained models demonstrate superior predictive accuracy and identify scenarios where JITL or online learning approaches are more effective for adaptive process monitoring. This study also highlights the critical importance of updating the deployed models/agents with the latest offline analytical measurements during bioreactor operations to maintain the model performance against the changes in cell growth behaviours and operating conditions throughout the bioreactor run. Additionally, the study confirms the usefulness of a simple mixture-of-experts framework in achieving enhanced accuracy and robustness for real-time predictions of metabolite concentrations based on Raman spectral data. These insights contribute to the development of robust strategies for the efficient deployment of ML models in dynamic and changing biomanufacturing environments.

[LG-70] Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening

Link: https://arxiv.org/abs/2509.02571
Authors: Diego Di Carlo(RIKEN AIP),Koyama Shoichi(UTokyo),Nugraha Aditya Arie(RIKEN AIP),Fontaine Mathieu(LTCI, S2A),Bando Yoshiaki(AIST),Yoshii Kazuyoshi(RIKEN AIP)
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)

Abstract:This paper investigates continuous representations of steering vectors over frequency and position of microphone and source for augmented listening (e.g., spatial filtering and binaural rendering) with precise control of the sound field perceived by the user. Steering vectors have typically been used for representing the spatial characteristics of the sound field as a function of the listening position. The basic algebraic representation of steering vectors assuming an idealized environment cannot deal with the scattering effect of the sound field. One may thus collect a discrete set of real steering vectors measured in dedicated facilities and super-resolve (i.e., upsample) them. Recently, physics-aware deep learning methods have been effectively used for this purpose. Such deterministic super-resolution, however, suffers from the overfitting problem due to the non-uniform uncertainty over the measurement space. To solve this problem, we integrate an expressive representation based on the neural field (NF) into the principled probabilistic framework based on the Gaussian process (GP). Specifically, we propose a physics-aware composite kernel that models the directional incoming waves and the subsequent scattering effect. Our comprehensive comparative experiment showed the effectiveness of the proposed method under data insufficiency conditions. In downstream tasks such as speech enhancement and binaural rendering using the simulated data of the SPEAR challenge, the oracle performances were attained with less than ten times fewer measurements.

[LG-71] EEG-MSAF: An Interpretable Microstate Framework uncovers Default-Mode Decoherence in Early Neurodegeneration

Link: https://arxiv.org/abs/2509.02568
Authors: Mohammad Mehedi Hasan,Pedro G. Lind,Hernando Ombao,Anis Yazidi,Rabindra Khadka
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Dementia, EEG, Microstates, Explainable, SHAP

Abstract:Dementia (DEM) is a growing global health challenge, underscoring the need for early and accurate diagnosis. Electroencephalography (EEG) provides a non-invasive window into brain activity, but conventional methods struggle to capture its transient complexity. We present the EEG Microstate Analysis Framework (EEG-MSAF), an end-to-end pipeline that leverages EEG microstates (discrete, quasi-stable topographies) to identify DEM-related biomarkers and distinguish DEM, mild cognitive impairment (MCI), and normal cognition (NC). EEG-MSAF comprises three stages: (1) automated microstate feature extraction, (2) classification with machine learning (ML), and (3) feature ranking using Shapley Additive Explanations (SHAP) to highlight key biomarkers. We evaluate on two EEG datasets: the public Chung-Ang University EEG (CAUEEG) dataset and a clinical cohort from Thessaloniki Hospital. Our framework demonstrates strong performance and generalizability. On CAUEEG, EEG-MSAF-SVM achieves 89% ± 0.01 accuracy, surpassing the deep learning baseline CEEDNET by 19.3%. On the Thessaloniki dataset, it reaches 95% ± 0.01 accuracy, comparable to EEGConvNeXt. SHAP analysis identifies mean correlation and occurrence as the most informative metrics: disruption of microstate C (salience/attention network) dominates DEM prediction, while microstate F, a novel default-mode pattern, emerges as a key early biomarker for both MCI and DEM. By combining accuracy, generalizability, and interpretability, EEG-MSAF advances EEG-based dementia diagnosis and sheds light on brain dynamics across the cognitive spectrum.

Information Retrieval

[IR-0] OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search

Link: https://arxiv.org/abs/2509.03236
Authors: Ben Chen,Xian Guo,Siyuan Wang,Zihan Liang,Yue Lv,Yufei Ma,Xinlong Xiao,Bowen Xue,Xuxin Zhang,Ying Yang,Huangyu Dai,Xing Xu,Tong Zhao,Mingcan Peng,XiaoYang Zheng,Cong Zhang,Qihang Zhao,Yuqing Ding,Chenyi Lei,Wenwu Ou,Han Li
Subjects: Information Retrieval (cs.IR)

Abstract:Traditional e-commerce search systems employ multi-stage cascading architectures (MCA) that progressively filter items through recall, pre-ranking, and ranking stages. While effective at balancing computational efficiency with business conversion, these systems suffer from fragmented computation and optimization objective collisions across stages, which ultimately limit their performance ceiling. To address these, we propose OneSearch, the first industrial-deployed end-to-end generative framework for e-commerce search. This framework introduces three key innovations: (1) a Keyword-enhanced Hierarchical Quantization Encoding (KHQE) module, to preserve both hierarchical semantics and distinctive item attributes while maintaining strong query-item relevance constraints; (2) a multi-view user behavior sequence injection strategy that constructs behavior-driven user IDs and incorporates both explicit short-term and implicit long-term sequences to model user preferences comprehensively; and (3) a Preference-Aware Reward System (PARS) featuring multi-stage supervised fine-tuning and adaptive reward-weighted ranking to capture fine-grained user preferences. Extensive offline evaluations on large-scale industry datasets demonstrate OneSearch’s superior performance for high-quality recall and ranking. The rigorous online A/B tests confirm its ability to enhance relevance in the same exposure position, achieving statistically significant improvements: +1.67% item CTR, +2.40% buyer, and +3.22% order volume. Furthermore, OneSearch reduces operational expenditure by 75.40% and improves Model FLOPs Utilization from 3.26% to 27.32%. The system has been successfully deployed across multiple search scenarios in Kuaishou, serving millions of users, generating tens of millions of PVs daily.

[IR-1] A Plug-and-play Model-agnostic Embedding Enhancement Approach for Explainable Recommendation

Link: https://arxiv.org/abs/2509.03130
Authors: Yunqi Mi, Boyang Yan, Guoshuai Zhao, Jialie Shen, Xueming Qian
Categories: Information Retrieval (cs.IR)

Abstract:Existing multimedia recommender systems suggest media, such as games and movies, to users by evaluating similarities. To enhance the semantics and explainability of embeddings, it is widely agreed that additional information (e.g., interactions, contexts, popularity) should be applied. However, without systematic consideration of representativeness and value, the utility and explainability of the embeddings drop drastically. Hence, we introduce RVRec, a plug-and-play model-agnostic embedding enhancement approach that can improve both the personalization and explainability of existing systems. Specifically, we propose a probability-based embedding optimization method that uses a contrastive loss based on negative 2-Wasserstein distance to enhance the representativeness of the embeddings. In addition, we introduce a reweighting method based on a multivariate Shapley values strategy to evaluate and explore the value of interactions and embeddings. Extensive experiments on multiple backbone recommenders and real-world datasets show that RVRec can improve the personalization and explainability of existing recommenders, outperforming state-of-the-art baselines.
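
To make the idea of a contrastive loss built on negative 2-Wasserstein distance more concrete, here is a minimal sketch. It is not the RVRec implementation. It assumes each embedding is a diagonal Gaussian given as a `(mean, std)` pair, for which the 2-Wasserstein distance has a closed form, and it assumes an InfoNCE-style softmax loss. The function names `w2_distance` and `contrastive_loss` and the `temperature` parameter are hypothetical and do not come from the paper.

```python
import math


def w2_distance(mu_a, sigma_a, mu_b, sigma_b):
    """2-Wasserstein distance between two diagonal Gaussians.

    Closed form for diagonal covariances:
    W2^2 = ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return math.sqrt(mean_term + std_term)


def contrastive_loss(anchor, positive, negatives, temperature=1.0):
    """InfoNCE-style loss (an assumption) scored by negative W2 distance.

    Each argument is a (mean, std) pair; `negatives` is a list of such pairs.
    """
    def score(other):
        # Closer distributions get a higher (less negative) score.
        return -w2_distance(anchor[0], anchor[1], other[0], other[1]) / temperature

    logits = [score(positive)] + [score(n) for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


anchor = ([0.0, 0.0], [1.0, 1.0])
near = ([0.0, 0.0], [1.0, 1.0])
far = ([3.0, 4.0], [1.0, 1.0])
loss_good = contrastive_loss(anchor, near, [far])
loss_bad = contrastive_loss(anchor, far, [near])
```

In this sketch the loss is small when the positive lies closer to the anchor than the negatives do (`loss_good`) and large when the roles are swapped (`loss_bad`), which is the behavior a contrastive objective is meant to reward.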

[IR-2] Knowledge graph-based personalized multimodal recommendation fusion framework

Link: https://arxiv.org/abs/2509.02943
Authors: Yu Fang
Categories: Information Retrieval (cs.IR)

Abstract:In the contemporary age characterized by information abundance, rapid advancements in artificial intelligence have rendered recommendation systems indispensable. Conventional recommendation methodologies based on collaborative filtering or individual attributes encounter deficiencies in capturing nuanced user interests. Knowledge graphs and multimodal data integration offer enhanced representations of users and items with greater richness and precision. This paper reviews existing multimodal knowledge graph recommendation frameworks, identifying shortcomings in modal interaction and higher-order dependency modeling. We propose the Cross-Graph Cross-Modal Mutual Information-Driven Unified Knowledge Graph Learning and Recommendation Framework (CrossGMMI-DUKGLR), which employs pre-trained visual-text alignment models for feature extraction, achieves fine-grained modality fusion through multi-head cross-attention, and propagates higher-order adjacency information via graph attention networks.

[IR-3] AI-Driven Drug Repurposing through miRNA-mRNA Relation

Link: https://arxiv.org/abs/2509.03336
Authors: Sharanya Manoharan, Balu Bhasuran, Oviya Ramalakshmi Iyyappan, Mohamed Saleem Abdul Shukkoor, Malathi Sellapan, Kalpana Raja
Categories: Molecular Networks (q-bio.MN); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)

Abstract:miRNA-mRNA relations are closely linked to several biological processes and disease mechanisms. In a recent study, we tested the performance of large language models (LLMs) on extracting miRNA-mRNA relations from PubMed. PubMedBERT achieved the best performance of 0.783 F1 score for the miRNA-mRNA Interaction Corpus (MMIC). Here, we first applied the finetuned PubMedBERT model to extract miRNA-mRNA relations from PubMed for chronic obstructive pulmonary disease (COPD), Alzheimer's disease (AD), stroke, type 2 diabetes mellitus (T2DM), chronic liver disease, and cancer. Next, we retrieved miRNA-drug relations using KinderMiner, a literature mining tool for relation extraction. Then, we constructed three interaction networks: (1) a disease-centric network, (2) a drug-centric network, and (3) a miRNA-centric network, comprising 3497 nodes and 16417 edges organized as a directed graph to capture complex biological relationships. Finally, we validated the drugs using MIMIC-IV. Our integrative approach revealed both established and novel candidate drugs for diseases under study through 595 miRNA-drug relations extracted from PubMed. To the best of our knowledge, this is the first study to systematically extract and visualize relationships among four distinct biomedical entities: miRNA, mRNA, drug, and disease.
