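This post lists the latest papers retrieved from arXiv.org on 2025-11-19. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 every day.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-11-19)
582 papers were updated today, including:
- Natural Language Processing: 60 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 181 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 154 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 170 papers (Machine Learning (cs.LG))
Natural Language Processing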
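[NLP-0] Strategic Innovation Management in the Age of Large Language Models: Market Intelligence, Adaptive R&D, and Ethical Governance
TL;DR: This paper addresses the low efficiency, long innovation cycles, and difficulty of integrating cross-disciplinary knowledge in traditional research and development (R&D) processes. The key to its solution is to use the multiple functions of Large Language Models (LLMs), including automating knowledge discovery, strengthening hypothesis generation, integrating multidisciplinary insights, and fostering collaboration within innovation ecosystems, to enable more flexible, data-driven R&D workflows that shorten innovation cycles and speed breakthrough results to market.
Link: https://arxiv.org/abs/2511.14709
Authors: Raha Aghaei,Ali A. Kiaei,Mahnaz Boush,Mahan Rofoosheh,Mohammad Zavvar
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: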
Abstract: This study analyzes the multiple functions of Large Language Models (LLMs) in transforming research and development (R&D) processes. By automating knowledge discovery, boosting hypothesis creation, integrating transdisciplinary insights, and enabling cooperation within innovation ecosystems, LLMs dramatically improve the efficiency and effectiveness of research processes. Through extensive analysis of scientific literature, patent databases, and experimental data, these models enable more flexible and informed R&D workflows, ultimately accelerating innovation cycles and lowering time-to-market for breakthrough ideas.
[NLP-1] Subword Tokenization Strategies for Kurdish Word Embeddings
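TL;DR: This paper examines how different tokenization strategies affect the preservation of morphological similarity when building word embeddings for a low-resource language, using Kurdish as the case. The key to its solution is a morpheme-based tokenization approach, paired with a BiLSTM-CRF morphological segmenter trained by bootstrapping, to produce word embeddings, together with a coverage-aware evaluation spanning similarity preservation, clustering quality, and semantic organization. The evaluation finds that although BPE (Byte Pair Encoding) looks stronger on morphological similarity, it evaluates only 28.6% of test cases versus 68.7% for the morpheme model, which inflates its apparent performance. Morpheme-based tokenization yields better overall embedding-space organization, semantic neighborhood structure, and balance across morphological complexity levels, underscoring the importance of coverage-aware evaluation for low-resource languages.
Link: https://arxiv.org/abs/2511.14696
Authors: Ali Salehi,Cassandra L. Jacobs
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: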
Abstract: We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and offer different tokenization methods for low-resourced language processing.
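The coverage effect described in the entry above can be illustrated with a small sketch. It is not the authors' evaluation code: `coverage_report` and the example scores are hypothetical, and counting an unevaluated test case as a score of zero is one assumed way to make results comparable across tokenizers with different coverage.

```python
def coverage_report(scores):
    """Summarise per-test-case scores; None marks a case that could not be evaluated."""
    if not scores:
        raise ValueError("scores must not be empty")
    covered = [s for s in scores if s is not None]
    coverage = len(covered) / len(scores)
    mean_on_covered = sum(covered) / len(covered) if covered else 0.0
    return {
        "coverage": coverage,
        "mean_on_covered": mean_on_covered,
        # Assumption: an unevaluated case contributes a score of 0.
        "coverage_weighted": coverage * mean_on_covered,
    }


# Hypothetical scores: tokenizer A scores high but covers few test cases.
tokenizer_a = [0.9, None, None, None]
tokenizer_b = [0.6, 0.7, 0.5, None]
for name, scores in (("A", tokenizer_a), ("B", tokenizer_b)):
    print(name, coverage_report(scores))
```

Reporting the mean over covered cases next to the coverage itself makes it visible when a high score rests on a small evaluated subset.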
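[NLP-2] Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances AAAI2026 AAAI
TL;DR: This paper addresses the limits of traditional complaint analysis, which relies on unimodal short texts such as tweets or product reviews and struggles with fine-grained classification of complaint aspect and severity in complex scenarios. The authors propose VALOR, a validation-aware learner for multimodal, multi-turn customer support dialogues. Its core idea is a multi-expert reasoning setup that combines large-scale generative models with Chain-of-Thought (CoT) prompting for more nuanced decisions; a semantic alignment score models cross-modal consistency, and a meta-fusion strategy integrates textual and visual evidence into the final classification. On a multimodal complaint dataset with fine-grained annotations, the approach outperforms baseline models, especially in complex cases where information is spread across text and images.
Link: https://arxiv.org/abs/2511.14693
Authors: Rishu Kumar Singh,Navneet Shreya,Sarmistha Das,Apoorva Singh,Sriparna Saha
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: To be published in the Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026 Special Track on AI for Social Impact)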
Abstract:Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: this https URL
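[NLP-3] Ground Truth Generation for Multilingual Historical NLP using LLMs
TL;DR: This paper tackles two challenges in historical and low-resource Natural Language Processing (NLP): scarce annotated data and domain mismatch with modern web corpora. The key to its solution is to use Large Language Models (LLMs) to generate synthetic ground-truth annotations and then fine-tune NLP tools such as spaCy on this small amount of synthetic data, improving part-of-speech (POS) tagging, lemmatization, and named entity recognition (NER) on period-specific French (16th-20th centuries) and Chinese (1900-1950) texts. The study shows that even limited synthetic data, used for domain-specific fine-tuning, can improve NLP tools for low-resource corpora in the computational humanities.
Link: https://arxiv.org/abs/2511.14688
Authors: Clovis Gladstone,Zhao Fang,Spencer Dean Stewart
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 tables, 1 figure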
Abstract:Historical and low-resource NLP remains challenging due to limited annotated data and domain mismatches with modern, web-sourced corpora. This paper outlines our work in using large language models (LLMs) to create ground-truth annotations for historical French (16th-20th centuries) and Chinese (1900-1950) texts. By leveraging LLM-generated ground truth on a subset of our corpus, we were able to fine-tune spaCy to achieve significant gains on period-specific tests for part-of-speech (POS) annotations, lemmatization, and named entity recognition (NER). Our results underscore the importance of domain-specific models and demonstrate that even relatively limited amounts of synthetic data can improve NLP tools for under-resourced corpora in computational humanities research.
[NLP-4] Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries NEURIPS2025
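TL;DR: This paper asks how Large Language Models (LLMs) can be used to encode physical information that is usually available only from scientific measurements and is only loosely expressed in textual descriptions. Focusing on astrophysics, it explores whether LLM embeddings can capture physical summary statistics obtained from scientific measurements, centering on two questions: whether prompting affects how these quantities are encoded by the LLM, and which aspects of language matter most for encoding the physics represented by the measurement. The key to the approach is the use of sparse autoencoders to extract interpretable features from the text, revealing how LLMs implicitly encode physical information in their internal representations.
Link: https://arxiv.org/abs/2511.14685
Authors: Kiera McCormick,Rafael Martínez-Galarza
Affiliation: Johns Hopkins University; AstroAI; Center for Astrophysics | Harvard & Smithsonian
Categories: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: Accepted to the Machine Learning and the Physical Sciences Workshop at NeurIPS 2025, 11 pages, 4 figures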
Abstract:Large Language Models have demonstrated the ability to generalize well at many levels across domains, modalities, and even shown in-context learning capabilities. This enables research questions regarding how they can be used to encode physical information that is usually only available from scientific measurements, and loosely encoded in textual descriptions. Using astrophysics as a test bed, we investigate if LLM embeddings can codify physical summary statistics that are obtained from scientific measurements through two main questions: 1) Does prompting play a role on how those quantities are codified by the LLM? and 2) What aspects of language are most important in encoding the physics represented by the measurement? We investigate this using sparse autoencoders that extract interpretable features from the text.
[NLP-5] SMRC: Aligning Large Language Models with Student Reasoning for Mathematical Error Correction
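TL;DR: This paper addresses reasoning errors that Large Language Models (LLMs) make when solving mathematical problems, and the fact that existing methods rely mainly on model self-correction and fall short of the systematic, teacher-style guidance and process revision needed in educational settings. The key to its solution is SMRC (Student Mathematical Reasoning Correction), which formulates student reasoning as a multi-step sequential decision problem and uses Monte Carlo Tree Search (MCTS) to explore optimal correction paths. Reward signals are generated by LLM-guided breadth-first search (BFS) combined with final-answer evaluation, then distributed to intermediate reasoning steps through a back-propagation mechanism, which provides fine-grained process supervision and improves the model's ability to identify and correct erroneous reasoning steps.
Link: https://arxiv.org/abs/2511.14684
Authors: Biaojie Zeng,Min Zhang,Juan Zhou,Fengrui Liu,Ruiyang Huang,Xin Lin
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 3 figures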
Abstract: Large language models (LLMs) often make reasoning errors when solving mathematical problems, and how to automatically detect and correct these errors has become an important research direction. However, existing approaches mainly focus on self-correction within the model, which falls short of the teacher-style correction required in educational settings, i.e., systematically guiding and revising a student’s problem-solving process. To address this gap, we propose SMRC (Student Mathematical Reasoning Correction), a novel method that aligns LLMs with student reasoning. Specifically, SMRC formulates student reasoning as a multi-step sequential decision problem and introduces Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of annotating process-level rewards, we leverage breadth-first search (BFS) guided by LLMs and final-answer evaluation to generate reward signals, which are then distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. Additionally, we construct a benchmark for high school mathematics, MSEB (Multi-Solution Error Benchmark), consisting of 158 instances that include problem statements, student solutions, and correct reasoning steps. We further propose a dual evaluation protocol centered on solution accuracy and correct-step retention, offering a comprehensive measure of educational applicability. Experiments demonstrate that SMRC significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and our MSEB in terms of effectiveness and overall performance. The code and data are available at this https URL.
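[NLP-6] Quadratic Term Correction on Heaps' Law
TL;DR: This paper addresses a limitation of Heaps' (or Herdan's) law in describing the relation between word types and word tokens: the traditional power-law function is approximately a straight line in log-log scale, but real data still shows slight concavity, so the law cannot fit the type-token curve exactly. The key to the solution is a second-order approximation. Analyzing type-token data from 20 English novels (some of them translations), the authors find that a quadratic function in log-log scale fits best; regression gives a linear coefficient slightly larger than 1 and a quadratic coefficient of about -0.02. Using a model of drawing colored balls from a bag with replacement, they further show that the concavity corresponds to a negative "pseudo-variance", which gives the nonlinearity of the type-token relation a theoretical explanation and a way to estimate it, particularly for estimating curvature in small samples.
Link: https://arxiv.org/abs/2511.14683
Authors: Oscar Fontanelli,Wentian Li
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 3 figures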
Abstract: Heaps’ or Herdan’s law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we have shown, by twenty English novels or writings (some are translated from another language to English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and quadratic term consistently lead to a linear coefficient of slightly larger than 1, and a quadratic coefficient around -0.02. Using the "random drawing colored ball from the bag with replacement" model, we have shown that the curvature of the log-log scale is identical to a "pseudo-variance" which is negative. Although a pseudo-variance calculation may encounter numeric instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimation of the curvature when the number of tokens is small.
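The quadratic correction in the entry above amounts to fitting log(types) = a + b·log(tokens) + c·log(tokens)² by least squares. The following is a minimal sketch of that regression, not the authors' code: the function names are hypothetical, and the type-token curve is synthetic, generated with coefficients chosen only to echo the magnitudes quoted in the abstract.

```python
import math


def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= f * a[col][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(a[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (a[i][3] - s) / a[i][i]
    return x


def fit_log_quadratic(tokens, types):
    """Return (a, b, c) for log(types) = a + b*log(tokens) + c*log(tokens)**2."""
    xs = [math.log(n) for n in tokens]
    ys = [math.log(v) for v in types]
    # Normal equations for the basis (1, x, x^2).
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    return tuple(solve3(m, t))


# Synthetic type-token curve (illustrative only, not data from the paper).
tokens = [100, 300, 1_000, 3_000, 10_000, 30_000, 100_000]
types = [math.exp(0.1 + 1.05 * math.log(n) - 0.02 * math.log(n) ** 2) for n in tokens]
a, b, c = fit_log_quadratic(tokens, types)
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}")
```

A negative fitted `c` corresponds to the slight concavity in log-log scale that a pure power law (c = 0) cannot capture.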
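[NLP-7] Streamlining Industrial Contract Management with Retrieval-Augmented LLMs
TL;DR: This paper addresses the difficulty of automating contract management workflows, caused by scarce labeled data and large volumes of unstructured legacy contracts, with a focus on identifying and improving non-compliant or unacceptable revisions to provisions. The key to its solution is a modular framework built as a Retrieval-Augmented Generation (RAG) pipeline that integrates synthetic data generation, semantic clause retrieval, acceptability classification, and reward-based alignment, achieving over 80% accuracy under low-resource conditions and improving the efficiency and quality of contract revision.
Link: https://arxiv.org/abs/2511.14671
Authors: Kristi Topollai,Tolga Dimlioglu,Anna Choromanska,Simon Odie,Reginald Hui
Affiliation: New York University; Consolidated Edison Company of New York Inc.
Categories: Computation and Language (cs.CL)
Comments: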
Abstract:Contract management involves reviewing and negotiating provisions, individual clauses that define rights, obligations, and terms of agreement. During this process, revisions to provisions are proposed and iteratively refined, some of which may be problematic or unacceptable. Automating this workflow is challenging due to the scarcity of labeled data and the abundance of unstructured legacy contracts. In this paper, we present a modular framework designed to streamline contract management through a retrieval-augmented generation (RAG) pipeline. Our system integrates synthetic data generation, semantic clause retrieval, acceptability classification, and reward-based alignment to flag problematic revisions and generate improved alternatives. Developed and evaluated in collaboration with an industry partner, our system achieves over 80% accuracy in both identifying and optimizing problematic revisions, demonstrating strong performance under real-world, low-resource conditions and offering a practical means of accelerating contract revision workflows.
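[NLP-8] Bias in, Bias out: Annotation Bias in Multilingual Large Language Models
TL;DR: This paper addresses annotation bias in Natural Language Processing (NLP) datasets, especially in the development of multilingual Large Language Models (LLMs), where task framing, annotator subjectivity, and cultural differences can distort model outputs and worsen social harms. The key to its solution is a systematic typology of annotation bias that distinguishes instruction bias, annotator bias, and contextual and cultural bias. On that basis it brings together detection methods (such as inter-annotator agreement, model disagreement, and metadata analysis) and mitigation strategies (including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustment), and puts forward an ensemble-based bias mitigation approach for multilingual settings, aiming at fairer and more culturally grounded annotation pipelines.
Link: https://arxiv.org/abs/2511.14662
Authors: Xia Cui,Ziyi Huang,Naeemeh Adel
Affiliation: Manchester Metropolitan University; Hubei University
Categories: Computation and Language (cs.CL)
Comments: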
Abstract:Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings, and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.
[NLP-9] Graded strength of comparative illusions is explained by Bayesian inference
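TL;DR: This paper addresses a cognitive illusion in language comprehension, the comparative illusion (CI): when processing sentences such as "More students have been to Russia than I have", people tend to judge the sentence acceptable even though the comparison it makes is nonsensical. Earlier work proposed an account based on noisy-channel theory, in which comprehension is the result of Bayesian inference: the posterior probability of an interpretation is determined jointly by its prior probability and the likelihood that it was corrupted into the observed CI sentence. The key contribution here is a quantitative model that combines statistical language models with human behavioral data to directly predict the strength of the CI effect, and that explains a previously unexplained difference between pronominal and full noun phrase subjects of the than-clause. The results confirm the predictive power of noisy-channel theory for the CI and further support it as a unified computational-level theory of diverse language processing phenomena.
Link: https://arxiv.org/abs/2511.14642
Authors: Yuhan Zhang,Erxiao Wang,Cory Shain
Affiliation: Stanford University
Categories: Computation and Language (cs.CL)
Comments: 49 pages, 7 figures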
Abstract:Like visual processing, language processing is susceptible to illusions in which people systematically misperceive stimuli. In one such case–the comparative illusion (CI), e.g., More students have been to Russia than I have–comprehenders tend to judge the sentence as acceptable despite its underlying nonsensical comparison. Prior research has argued that this phenomenon can be explained as Bayesian inference over a noisy channel: the posterior probability of an interpretation of a sentence is proportional to both the prior probability of that interpretation and the likelihood of corruption into the observed (CI) sentence. Initial behavioral work has supported this claim by evaluating a narrow set of alternative interpretations of CI sentences and showing that comprehenders favor interpretations that are more likely to have been corrupted into the illusory sentence. In this study, we replicate and go substantially beyond this earlier work by directly predicting the strength of illusion with a quantitative model of the posterior probability of plausible interpretations, which we derive through a novel synthesis of statistical language models with human behavioral data. Our model explains not only the fine gradations in the strength of CI effects, but also a previously unexplained effect caused by pronominal vs. full noun phrase than-clause subjects. These findings support a noisy-channel theory of sentence comprehension by demonstrating that the theory makes novel predictions about the comparative illusion that bear out empirically. This outcome joins related evidence of noisy channel processing in both illusory and non-illusory contexts to support noisy channel inference as a unified computational-level theory of diverse language processing phenomena.
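The noisy-channel account in the entry above rests on Bayes' rule: the posterior of an interpretation is proportional to its prior times the likelihood that it was corrupted into the observed sentence. Below is a minimal sketch of that normalisation step only. The function name `posterior`, the candidate interpretations, and all probabilities are made up for illustration and are not taken from the paper.

```python
def posterior(candidates):
    """Normalise prior * likelihood over a set of candidate interpretations.

    candidates maps an interpretation to a (prior, likelihood) pair, where
    likelihood is P(observed sentence | intended interpretation).
    """
    unnormalised = {k: prior * lik for k, (prior, lik) in candidates.items()}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("at least one candidate needs non-zero prior * likelihood")
    return {k: v / total for k, v in unnormalised.items()}


# Hypothetical candidates for "More students have been to Russia than I have".
candidates = {
    "Students have been to Russia more often than I have": (0.6, 0.02),
    "More students than just me have been to Russia": (0.4, 0.01),
}
for interpretation, p in posterior(candidates).items():
    print(f"{p:.3f}  {interpretation}")
```

Under this view, an interpretation is favoured both when it is plausible a priori and when a small corruption would turn it into the observed sentence.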
[NLP-10] A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases
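TL;DR: This paper addresses the long diagnostic timelines and low accuracy of rare disease diagnosis. The core challenges are that conventional pipelines separate noisy evidence extraction from inferential diagnosis, and that general and medical large language models (LLMs) are limited by scarce real-world electronic health records (EHRs), stale domain knowledge, and hallucinations. The key to its solution is to assemble a domain-specialized clinical corpus and a clinician-validated reasoning dataset and to develop the RareSeek R1 model through staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. Experiments show state-of-the-art performance on multicenter EHR narratives and public benchmarks; augmented retrieval yields the largest gains when narratives are paired with prioritized variants, and transparent reasoning highlights non-phenotypic evidence (such as imaging, interventions, and functional tests; median 23.1%), shortening the diagnostic journey and providing auditable, clinically translatable decision support.
Link: https://arxiv.org/abs/2511.14638
Authors: Tao Yang,Dandan Huang,Yunting Lin,Pengfei Wu,Zhikun Wu,Gangyuan Ma,Yulan Lu,Xinran Dong,Dingpeng Li,Junshuang Ge,Zhiyan Zhang,Xuanzhao Huang,Wenyan Nong,Yao Zhou,Hui Tang,Hongxi Yang,Shijie Zhang,Juan Li,Xiaojun Cao,Lin Yang,Xia Gao,Kaishou Xu,Xiaoqiong Gu,Wen Zhang,Huimin Xia,Li Liu,Wenhao Zhou,Mulin Jun Li
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 50 pages, 5 figures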
Abstract: Rare diseases affect hundreds of millions worldwide, yet diagnosis often spans years. Conventional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general/medical large language models (LLMs) face scarce real world electronic health records (EHRs), stale domain knowledge, and hallucinations. We assemble a large, domain specialized clinical corpus and a clinician validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain of thought learning, and graph grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state of the art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative first, knowledge integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.
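[NLP-11] Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities
TL;DR: This paper addresses the lack of real-time error correction and interpretability in end-to-end autonomous scientific discovery, in particular the difficulty multi-agent systems have in correcting faulty reasoning paths during exploratory data analysis. The key to its solution is a vision-language model (VLM) acting as a judge: plots are treated as verifiable checkpoints and evaluated against dynamically generated domain-specific rubrics, which lets agents correct their own errors and be steered in real time. On 10 data-driven discovery tasks, the method reaches pass@1 scores of 0.7-0.8, compared with 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while providing auditable reasoning traces that improve interpretability.
Link: https://arxiv.org/abs/2511.14631
Authors: Kahaan Gandhi,Boris Bolliet,Inigo Zubeldia
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: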
Abstract: We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real-time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: this https URL
[NLP-12] Bridging Human and Model Perspectives: A Comparative Analysis of Political Bias Detection in News Media Using Large Language Models
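TL;DR: This paper examines how well generative AI agrees with human judgment in political bias detection, that is, whether current large language models (LLMs) accurately capture human perception of political slant in news media. The key to its approach is a comparative evaluation framework built on a manually annotated dataset, which quantifies annotation consistency, bias polarity, and inter-model agreement in order to compare traditional transformer models (such as BERT and RoBERTa) with generative models (such as GPT and FLAN) in zero-shot and fine-tuned settings. Experiments show that fine-tuned RoBERTa achieves the highest accuracy and alignment with human labels, while GPT shows the strongest agreement with human annotations in the zero-shot setting. These results reveal systematic differences in how humans and LLMs perceive bias and underline the value of hybrid evaluation frameworks that combine human interpretability with model scalability.
Link: https://arxiv.org/abs/2511.14606
Authors: Shreya Adrita Banik,Niaz Nafi Rahman,Tahsina Moiukh,Farig Sadeque
Affiliation: BRAC University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: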
Abstract:Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although recent advances in Natural Language Processing (NLP) have enabled automatic bias classification, the extent to which large language models (LLMs) align with human judgment still remains relatively underexplored and not yet well understood. This study aims to present a comparative framework for evaluating the detection of political bias across human annotations and multiple LLMs, including GPT, BERT, RoBERTa, and FLAN. We construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify divergence between human and model perceptions of bias. Experimental results show that among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT demonstrate the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, our fine-tuned RoBERTa model acquired the highest accuracy and the strongest alignment with human-annotated labels. Our findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.
[NLP-13] A Method for Characterizing Disease Progression from Acute Kidney Injury to Chronic Kidney Disease
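TL;DR: This paper addresses risk stratification for progression from acute kidney injury (AKI) to chronic kidney disease (CKD), that is, how to identify high-risk patients precisely enough to intervene early. The key to its solution is to use electronic health record (EHR) data: patient vectors are built from longitudinal medical codes and creatinine measurements and clustered to identify post-AKI clinical states; multi-state modeling then estimates transition probabilities between states and the risk of progression to CKD; and survival analysis identifies CKD risk factors in different AKI subpopulations. The method reveals fifteen distinct post-AKI clinical states; most patients (75%) remained in a single state or made only one transition, and the impact of both established and novel risk factors varies by clinical state, providing a data-driven basis for decision-support tools.
Link: https://arxiv.org/abs/2511.14603
Authors: Yilu Fang,Jordan G. Nestor,Casey N. Ta,Jerard Z. Kneifati-Hayek,Chunhua Weng
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: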
Abstract: Patients with acute kidney injury (AKI) are at high risk of developing chronic kidney disease (CKD), but identifying those at greatest risk remains challenging. We used electronic health record (EHR) data to dynamically track AKI patients’ clinical evolution and characterize AKI-to-CKD progression. Post-AKI clinical states were identified by clustering patient vectors derived from longitudinal medical codes and creatinine measurements. Transition probabilities between states and progression to CKD were estimated using multi-state modeling. After identifying common post-AKI trajectories, CKD risk factors in AKI subpopulations were identified through survival analysis. Of 20,699 patients with AKI at admission, 3,491 (17%) developed CKD. We identified fifteen distinct post-AKI states, each with different probabilities of CKD development. Most patients (75%, n=15,607) remained in a single state or made only one transition during the study period. Both established (e.g., AKI severity, diabetes, hypertension, heart failure, liver disease) and novel CKD risk factors were identified, with their impact varying across these clinical states. This study demonstrates a data-driven approach for identifying high-risk AKI patients, supporting the development of decision-support tools for early CKD detection and intervention.
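Estimating transition probabilities between clinical states, as described in the entry above, can be illustrated with a much simpler stand-in. The sketch below counts observed state-to-state transitions (a first-order Markov approximation) instead of using the multi-state modeling named in the abstract. The function name, state labels, and patient sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed state sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Hypothetical per-patient state sequences; "CKD" marks progression.
patients = [
    ["state_1", "state_2", "CKD"],
    ["state_1", "state_1", "state_2"],
    ["state_2", "CKD"],
]
for src, dsts in transition_probabilities(patients).items():
    print(src, dsts)
```

Unlike a proper multi-state model, this count-based estimate ignores time between observations and censoring, so it is only suitable as an illustration of the quantity being estimated.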
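[NLP-14] Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
TL;DR: This paper addresses the scarcity of high-quality summarization data in low-resource languages. Its core idea is to use Front-Page Teasers from historical newspapers, in which editors summarize full-length articles, as a naturally annotated source of summaries that also supports multi-document summarization. The key contribution is an automatic data collection process suited to varying levels of linguistic resources, which is applied to a Hebrew newspaper to build HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew, offering a scalable data foundation for summarization in low-resource languages.
Link: https://arxiv.org/abs/2511.14598
Authors: Noam Dahan,Omer Kidron,Gabriel Stanovsky
Affiliation: The Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL)
Comments: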
Abstract:High quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
[NLP-15] Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak
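TL;DR: This paper addresses the evaluation of claim extraction in document-level fact-checking, in particular how to align and compare model-extracted and human-annotated claim sets. Its core approach computes similarity through an alignment score: by identifying the best matching between two claim sets, it builds an evaluation framework that quantifies model extraction performance and can also serve as a measure of inter-annotator agreement. The method is tested on comments under Czech and Slovak news articles, which bring informal language, strong local context, and subtle differences between closely related languages. The results expose the limits of current evaluation methods in capturing semantic similarity between claims and key properties such as atomicity, checkworthiness, and decontextualization, and point to the need for more advanced evaluation.
Link: https://arxiv.org/abs/2511.14566
Authors: Lucia Makaiová,Martin Fajčík,Antonín Jarolím
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: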
Abstract: Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on a newly collected dataset (claims extracted from comments under Czech and Slovak news articles), a domain that poses additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods, ones able to correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
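The idea of scoring two claim sets through their best alignment, as described in the entry above, can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: token-level Jaccard similarity is a stand-in for whatever similarity function is actually used, the exhaustive search only works for small sets, and dividing by the size of the larger set is one assumed way to penalise unmatched claims. All names and example claims are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Token-overlap similarity, used here only as a stand-in similarity function."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def best_alignment(predicted, reference, sim=jaccard):
    """Return (score, pairs) for the best one-to-one alignment of two claim sets.

    pairs holds (predicted_index, reference_index) tuples. The search is
    exhaustive, so it is only feasible for small claim sets.
    """
    if not predicted or not reference:
        return 0.0, []
    swap = len(predicted) > len(reference)
    short, long_ = (reference, predicted) if swap else (predicted, reference)
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(long_)), len(short)):
        total = sum(sim(short[i], long_[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total = total
            best_pairs = [(j, i) if swap else (i, j) for i, j in enumerate(perm)]
    # Assumption: unmatched claims in the larger set count as similarity 0.
    return best_total / len(long_), best_pairs


model_claims = ["the sky is blue", "cats are mammals"]
human_claims = ["cats are mammals", "the sky is blue", "water boils at 100 degrees"]
print(best_alignment(model_claims, human_claims))
```

For larger claim sets, the exhaustive search would need to be replaced by a polynomial-time assignment algorithm.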
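[NLP-16] LiveRAG: A diverse QA dataset with varying difficulty level for RAG evaluation
TL;DR: This paper addresses the lack of systematic evaluation methods for Retrieval Augmented Generation (RAG) systems in generative AI. The effectiveness of RAG-based question answering (QA) systems is currently hard to compare objectively, which limits performance tuning and technical progress. The key to its solution is the LiveRAG benchmark, a dataset of 895 synthetic questions and answers derived from the SIGIR'2025 LiveRAG Challenge. It adds ground-truth answers and supporting claims that were not released to competitors during the Challenge, and labels each question with difficulty and discriminability scores estimated by an Item Response Theory (IRT) model. This design lets LiveRAG distinguish the capabilities of different RAG systems and supports more rigorous, reproducible RAG research and system development.
Link: https://arxiv.org/abs/2511.14531
Authors: David Carmel,Simone Filice,Guy Horowitz,Yoelle Maarek,Alex Shtoff,Oren Somekh,Ran Tavory
Affiliation: Technology Innovation Institute (TII)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 14 pages, 4 figures, 5 tables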
Abstract: With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based QA systems. This synthetic benchmark is derived from the one used during the SIGIR’2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors’ answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors’ responses. Our analysis highlights the diversity of the benchmark’s questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust QA systems.
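The difficulty and discriminability scores mentioned in the entry above come from Item Response Theory. The abstract does not say which IRT variant was used, so the sketch below assumes the common two-parameter logistic (2PL) model purely for illustration. The function name and all parameter values are hypothetical.

```python
import math


def p_correct(ability, difficulty, discrimination):
    """2PL item response function: probability that a system answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical items: same difficulty, different discrimination.
weak_system, strong_system = -1.0, 1.0
for discrimination in (0.5, 2.0):
    low = p_correct(weak_system, difficulty=0.0, discrimination=discrimination)
    high = p_correct(strong_system, difficulty=0.0, discrimination=discrimination)
    print(f"discrimination={discrimination}: weak={low:.3f}, strong={high:.3f}")
```

In this model, difficulty shifts the ability level at which the success probability is 0.5, while a higher discrimination widens the gap between weaker and stronger systems on the same item.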
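[NLP-17] Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
TL;DR: This paper addresses two core problems in training Large Language Model (LLM) Agents with Reinforcement Learning (RL): the lack of RL approaches designed specifically for the LLM Agent setting, and the lack of flexible, extensible training frameworks that support diverse tasks and interactive environments. The key to its solution is twofold. First, it systematically extends the Markov Decision Process (MDP) framework to define the key components of an LLM Agent. Second, it introduces Agent-R1, a modular, flexible, and user-friendly RL training framework that adapts readily to different task scenarios and interactive environments, with initial validation on Multihop QA benchmark tasks.
Link: https://arxiv.org/abs/2511.14460
Authors: Mingyue Cheng,Jie Ouyang,Shuo Yu,Ruiran Yan,Yucong Luo,Zirui Liu,Daoyu Wang,Qi Liu,Enhong Chen
Affiliation: University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: This paper serves as the technical report of the Agent-R1 project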
Abstract:Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.
[NLP-18] Tell Me: An LLM-powered Mental Well-being Assistant with RAG Synthetic Dialogue Generation and Agentic Planning ACL
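[Quick read]: This paper addresses the difficulty of accessing mental health support, the scarcity of professional therapy data, and the lack of personalization and dynamic adaptation in static self-care tools. The key to the solution is Tell Me, an integrated mental well-being system with three core components: (1) a Retrieval-Augmented Generation (RAG) assistant that provides context-aware, knowledge-grounded, personalized dialogue support; (2) a module that generates synthetic client-therapist dialogues conditioned on client profiles, used to study therapeutic language and to ease the shortage of confidential therapy data; and (3) a “Well-being AI crew” implemented with CrewAI that produces dynamically adapted, personalized weekly self-care plans and guided meditation audio, going beyond static tools. The system is positioned as a reflective space for emotional processing rather than a substitute for professional therapy, and illustrates the potential for collaboration between Natural Language Processing (NLP) and mental health.
Link: https://arxiv.org/abs/2511.14445
Authors: Trishala Jayesh Ahalpara
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, 1 table. Submitted to the Computation and Language (cs.CL) category. Uses the ACL-style template. Code and demo will be released at: this https URL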
Abstract:We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study. This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.
[NLP-19] MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models Multimodal Models and Intelligent Agents
Link: https://arxiv.org/abs/2511.14439
Authors: Jinru Ding,Lu Lu,Chao Ding,Mouxiao Bian,Jiayuan Chen,Renjie Lu,Wenrao Pang,Xiaoqin Wu,Zhiqiang Liu,Luyi Jiang,Bing Han,Yunqiu Wang,Jie Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
[NLP-20] Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
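[Quick read]: This paper addresses the safety threats that Large Language Models (LLMs) face in educational settings, in particular jailbreak attacks and fine-tuning attacks that break safety alignment and lead to harmful outputs. Existing work focuses mainly on general safety evaluation and pays little attention to the safety requirements specific to education. The authors therefore construct EduHarm, a benchmark of safe-unsafe instruction pairs across five representative educational scenarios, for systematic safety evaluation of educational LLMs. The core of the solution is a three-stage shield framework (TSSF): first, safety-aware attention realignment makes the model more sensitive to critical unsafe tokens; second, layer-wise safety judgment aggregates safety cues across multiple layers to identify harmful instructions; third, defense-driven dual routing separates safe and unsafe queries, so that benign inputs are processed normally and harmful inputs receive guarded responses. The framework shows robust defense across multiple jailbreak and fine-tuning attack datasets while preserving the utility gains from benign fine-tuning.
Link: https://arxiv.org/abs/2511.14423
Authors: Xin Yi,Yue Li,Dongsheng Shi,Linlin Wang,Xiaoling Wang,Liang He
Affiliations: Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University
Categories: Computation and Language (cs.CL)
Comments: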
Abstract:Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.
[NLP-21] Mitigating Label Length Bias in Large Language Models AACL2025
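[Quick read]: This paper addresses label length bias in Large Language Models (LLMs) when predicting over a set of candidate options: labels of different lengths are treated inconsistently even after standard length normalization, which hurts both prediction accuracy and the reliability of confidence estimates. The key to the solution is normalized contextual calibration (NCC), which normalizes and calibrates predictions at the full-label level. NCC mitigates this bias and achieves statistically significant improvements across multiple datasets and models, with gains of up to 10% F1. It also makes the model less sensitive to the choice of few-shot examples, reduces the number of examples needed, yields more reliable confidence estimates, and extends to broader tasks such as multiple-choice question answering.
Link: https://arxiv.org/abs/2511.14385
Authors: Mario Sanz-Guerrero,Katharina von der Wense
Affiliations: Johannes Gutenberg University Mainz; University of Colorado Boulder
Categories: Computation and Language (cs.CL)
Comments: Accepted to AACL 2025 (Main)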
Abstract:Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
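The abstract does not give the NCC formula, so the following is only one plausible reading of "normalizing and calibrating at the full-label level", not the authors' implementation. It assumes that length normalization is the mean token log-probability of a label, and that calibration subtracts the score the same label receives for a content-free input (in the spirit of contextual calibration). All function names and numbers are hypothetical.

```python
import math


def label_score(token_logprobs):
    """Length-normalized score: mean log-probability over the label's tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def calibrated_probs(label_logprobs, baseline_logprobs):
    """Subtract each label's content-free baseline score, then softmax.

    label_logprobs:    {label: [token log-probs given the real input]}
    baseline_logprobs: {label: [token log-probs given a content-free input]}
    """
    scores = {
        label: label_score(lp) - label_score(baseline_logprobs[label])
        for label, lp in label_logprobs.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical log-probs for a one-token and a three-token label.
real = {"yes": [-0.1], "not at all": [-0.9, -0.3, -0.6]}
content_free = {"yes": [-0.1], "not at all": [-1.2, -0.6, -0.9]}
print(calibrated_probs(real, content_free))
```

In this toy case the short label has the higher raw score, but the long label gains more relative to its content-free baseline, so it receives the larger calibrated probability.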
[NLP-22] O3SLM: Open Weight Open Data and Open Vocabulary Sketch-Language Model AAAI2026
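[Quick read]: This paper addresses the limited ability of Large Vision Language Models (LVLMs) to interpret abstract visual inputs, especially hand-drawn sketches. Existing LVLMs handle sketches poorly even though sketches offer an intuitive way to express concepts that are hard to describe in text; the main bottleneck is the absence of a large-scale dataset that jointly models sketches, photorealistic images, and natural language instructions. The solution has two parts: a large-scale dataset of image-sketch-instruction triplets (SketchVCL) that supports both pretraining and instruction tuning, and O3SLM, a model trained on this dataset that reaches state-of-the-art performance on sketch-based object localization, counting, image retrieval (SBIR and fine-grained SBIR), and visual question answering (VQA), substantially outperforming existing LVLMs in sketch comprehension and reasoning.
Link: https://arxiv.org/abs/2511.14368
Authors: Rishi Gupta,Mukilan Karuppasamy,Shyam Marjit,Aditay Tripathi,Anirban Chakraborty
Affiliations: Indian Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to AAAI 2026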
Abstract:While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval, i.e., SBIR and fine-grained SBIR, and (d) visual question answering (VQA); while incorporating the three existing sketch datasets, namely QuickDraw!, Sketchy, and Tu Berlin, along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
[NLP-23] ATLAS: A High-Difficulty Multidisciplinary Benchmark for Frontier Scientific Reasoning
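[Quick read]: This paper addresses the saturation of Large Language Model (LLM) performance on mainstream benchmarks, which makes it hard to distinguish frontier models, and the shortcomings of existing high-difficulty benchmarks: narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, which together create a fidelity gap with real scientific inquiry. The key to the solution is ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, cross-disciplinary, high-difficulty evaluation suite of approximately 800 original problems spanning seven fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its main features are high originality and contamination resistance, assessment of cross-disciplinary knowledge integration, complex open-ended answers (multi-step reasoning and LaTeX-formatted expressions), and a rigorous quality-control process of expert review and adversarial testing. It also introduces an automated evaluation paradigm based on a panel of LLM judges to assess advanced scientific reasoning.
Link: https://arxiv.org/abs/2511.14366
Authors: Hongwei Liu,Junnan Liu,Shudong Liu,Haodong Duan,Yuqiang Li,Mao Su,Xiaohong Liu,Guangtao Zhai,Xinyu Fang,Qianhong Ma,Taolin Zhang,Zihan Ma,Yufeng Zhao,Peiheng Zhou,Linchen Xiao,Wenlong Zhang,Shijie Zhou,Xingjian Ma,Siqi Sun,Jiaye Ge,Meng Li,Yuhong Liu,Jianxin Dong,Jiaying Li,Hui Wu,Hanwen Liang,Jintai Lin,Yanting Wang,Jie Dong,Tong Zhu,Tianfan Fu,Conghui He,Qi Zhang,Songyang Zhang,Lei Bai,Kai Chen
Affiliations: Shanghai AI Laboratory
Categories: Computation and Language (cs.CL)
Comments: 39 pages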
Abstract:The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models’ ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS’s effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable “ruler” for progress toward Artificial General Intelligence.
[NLP-24] The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models
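[Quick read]: This paper addresses the “tokenization bottleneck” that Large Language Models (LLMs) face in chemistry: tokenizers trained on general-domain text tend to split chemical representations such as SMILES into semantically uninformative sub-tokens, which hurts model performance. The key to the solution is a systematic method based on targeted vocabulary extension: chemically meaningful tokens are added to the vocabulary of a pretrained LLM, followed by continued pretraining on chemistry-domain text, so that natural language and molecular structure are represented within a single model. The authors report superior performance on a range of downstream chemistry tasks.
Link: https://arxiv.org/abs/2511.14365
Authors: Prathamesh Kalamkar,Ned Letcher,Meissane Chami,Sahger Lad,Shayan Mohanty,Prasanna Pendse
Affiliations: Thoughtworks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: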
Abstract:The application of large language models (LLMs) to chemistry is frequently hampered by a “tokenization bottleneck”, where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension, that is, augmenting a pretrained LLM’s vocabulary with chemically salient tokens, followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.
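To make the fragmentation effect concrete, here is a toy sketch. It uses a greedy longest-match tokenizer rather than a real subword tokenizer, and the vocabularies, the added tokens, and the function name `greedy_tokenize` are hypothetical. It is not the paper's method; it only shows how adding a few chemically meaningful tokens shortens the token sequence for a SMILES string.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization; unknown spans fall back to single characters."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


# Hypothetical base vocabulary: single characters only.
base_vocab = {"C", "O", "N", "(", ")", "=", "1", "c"}
# Hypothetical extension: a carboxyl-like fragment and a benzene ring.
extended_vocab = base_vocab | {"C(=O)O", "c1ccccc1"}

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
base_tokens = greedy_tokenize(smiles, base_vocab)
extended_tokens = greedy_tokenize(smiles, extended_vocab)
print(len(base_tokens), base_tokens)
print(len(extended_tokens), extended_tokens)
```

With a real pretrained model, new tokens also need embedding rows, which is why the abstract pairs vocabulary extension with continued pretraining.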
[NLP-25] SciRAG: Adaptive Citation-Aware and Outline-Guided Retrieval and Synthesis for Scientific Literature
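[Quick read]: This paper addresses three challenges in synthesizing knowledge from scientific literature: existing Retrieval-Augmented Generation (RAG) methods overlook citation graph structure, adapt poorly to complex queries, and produce fragmented, hard-to-verify output. The key to the solution is the SciRAG framework, which has three core innovations: (1) adaptive retrieval that switches flexibly between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that uses the citation graph to organize and filter supporting documents; and (3) outline-guided synthesis that plans, critiques, and iteratively refines answers to ensure coherence and traceable attribution. Experiments on benchmarks such as QASA and ScholarQA show that SciRAG outperforms prior systems in factual accuracy and synthesis quality.
Link: https://arxiv.org/abs/2511.14362
Authors: Hang Ding,Yilun Zhao,Tiansheng Hu,Manasi Patwardhan,Arman Cohan
Affiliations: Shanghai Jiao Tong University; Yale University; NYU Shanghai; CS Research
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: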
Abstract:The accelerating growth of scientific publications has intensified the need for scalable, trustworthy systems to synthesize knowledge across diverse literature. While recent retrieval-augmented generation (RAG) methods have improved access to scientific information, they often overlook citation graph structure, adapt poorly to complex queries, and yield fragmented, hard-to-verify syntheses. We introduce SciRAG, an open-source framework for scientific literature exploration that addresses these gaps through three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter supporting documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution. Extensive experiments across multiple benchmarks such as QASA and ScholarQA demonstrate that SciRAG outperforms prior systems in factual accuracy and synthesis quality, establishing a new foundation for reliable, large-scale scientific knowledge aggregation.
[NLP-26] ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions AAAI2026
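[Quick read]: This paper addresses the limited ability of Large Language Models (LLMs) to detect and resolve conflicts when user instructions contain conflicting constraints. Current research mostly measures how well LLMs follow instructions and overlooks the contradictory instructions that are common in complex prompts, which can lead to unreasonable or incorrect outputs in practice. The key to the solution is ConInstruct, a new benchmark designed to evaluate conflict detection and resolution. Experiments show that most proprietary models detect conflicts well, while among open-source models only DeepSeek-R1 is comparably strong; DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores, 91.5% and 87.3% respectively. However, models rarely notify users of the conflict or ask for clarification, which exposes a key gap: LLMs can recognize conflicts but lack an effective interactive clarification mechanism.
Link: https://arxiv.org/abs/2511.14342
Authors: Xingwei He,Qianru Zhang,Pengfei Chen,Guanhua Chen,Linlin Yu,Yuan Yuan,Siu-Ming Yiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026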
Abstract:Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs’ ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs’ conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
[NLP-27] Steganographic Backdoor Attacks in NLP: Ultra-Low Poisoning and Defense Evasion
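[Quick read]: This paper addresses backdoor attacks introduced through data poisoning in Natural Language Processing (NLP) models, in particular stealthy attacks that use semantic triggers. Existing work concentrates on attacks based on stylized artifacts or token-level perturbations and neglects a more realistic and dangerous case: an attacker uses real names or entities as triggers so that the deployed model responds maliciously to specific people or events. The key to the solution is SteganoBackdoor, which borrows innocuous properties from natural-language steganography and applies a gradient-guided data optimization process to turn semantic trigger seeds into steganographic carriers that carry a high backdoor payload, remain fluent, and bear no representational resemblance to the trigger. It achieves over 99% attack success at an order-of-magnitude lower data-poisoning rate and evades a range of data-level defenses.
Link: https://arxiv.org/abs/2511.14301
Authors: Eric Xue,Ruiyi Zhang,Zijun Zhang,Pengtao Xie
Affiliations: UC San Diego
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: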
Abstract:Transformer models are foundational to natural language processing (NLP) applications, yet remain vulnerable to backdoor attacks introduced through poisoned data, which implant hidden behaviors during training. To strengthen the ability to prevent such compromises, recent research has focused on designing increasingly stealthy attacks to stress-test existing defenses, pairing backdoor behaviors with stylized artifact or token-level perturbation triggers. However, this trend diverts attention from the harder and more realistic case: making the model respond to semantic triggers such as specific names or entities, where a successful backdoor could manipulate outputs tied to real people or events in deployed systems. Motivated by this growing disconnect, we introduce SteganoBackdoor, bringing stealth techniques back into line with practical threat models. Leveraging innocuous properties from natural-language steganography, SteganoBackdoor applies a gradient-guided data optimization process to transform semantic trigger seeds into steganographic carriers that embed a high backdoor payload, remain fluent, and exhibit no representational resemblance to the trigger. Across diverse experimental settings, SteganoBackdoor achieves over 99% attack success at an order-of-magnitude lower data-poisoning rate than prior approaches while maintaining unparalleled evasion against a comprehensive suite of data-level defenses. By revealing this practical and covert attack, SteganoBackdoor highlights an urgent blind spot in current defenses and demands immediate attention to adversarial data defenses and real-world threat modeling.
[NLP-28] DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval Multi-role Debating and Multi-path Reasoning
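[Quick read]: This paper addresses three core limitations of current data insight agents in automated data analysis and insight discovery: insufficient use of domain knowledge, shallow analytical depth, and error-prone code generation. The key to the solution is DataSage, a multi-agent framework with three mechanisms: external knowledge retrieval to enrich the analytical context, multi-role debating to simulate diverse analytical perspectives and deepen the analysis, and multi-path reasoning to improve the accuracy of the generated code and insights.
Link: https://arxiv.org/abs/2511.14299
Authors: Xiaochuan Liu,Yuanfeng Song,Xiaoming Yin,Xing Chen
Affiliations: ByteDance Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: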
Abstract:In today’s data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
[NLP-29] AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
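[Quick read]: This paper addresses the lack of a systematic, human-annotated benchmark for evaluating the Arabic linguistic competence of Large Language Models (LLMs). Existing models score well on knowledge-based benchmarks, yet their understanding of language structure, such as grammatical and syntactic reasoning, remains clearly weaker, and they often rely on memorization or pattern recognition rather than genuine comprehension. The key to the solution is AraLingBench, a human-annotated benchmark of 150 expert-designed multiple-choice questions covering five core categories: grammar, morphology, spelling, reading comprehension, and syntax. It diagnoses the linguistic proficiency of Arabic LLMs and exposes their weaknesses in deeper linguistic reasoning.
Link: https://arxiv.org/abs/2511.14295
Authors: Mohammad Zbib,Hasan Abed Al Kader Hammoud,Sina Mukalled,Nadine Rizk,Fatima Karnib,Issam Lakkis,Ammar Mohanna,Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology (KAUST); American University of Beirut (AUB)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: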
Abstract:We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
[NLP-30] Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space
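[Quick read]: This paper addresses the lack of reliable confidence estimates for the outputs of Large Language Models (LLMs), and asks how reasoning strategies affect the confidence a model verbalizes. The key to the solution is to have the model predict a verbalized probability distribution: it must consider all candidate answers in the answer space rather than a single guess, and assign each candidate a confidence score such that the scores satisfy the requirements of a distribution. This encourages deeper chain-of-thought reasoning and more logical, consistent confidence estimates. The method shows an advantage across models and tasks, keeps that advantage after reinforcement learning, and produces reasoning patterns that align with human expectations.
Link: https://arxiv.org/abs/2511.14275
Authors: Ante Wang,Weizhi Ma,Yang Liu
Affiliations: Tsinghua University; Institute for AI Industry Research (AIR)
Categories: Computation and Language (cs.CL)
Comments: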
Abstract:Knowing the reliability of a model’s response is essential in application. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of basing on a single guess, and to carefully assign confidence scores to meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.
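The "requirements of a distribution" mentioned in the abstract can be checked mechanically once a model's verbalized confidences have been parsed. The sketch below is an assumption about how such a check might look; the function names, the tolerance, and the example answers are hypothetical and are not taken from the paper.

```python
def normalize_verbalized(confidences, tol=0.05):
    """Validate and renormalize verbalized confidences over candidate answers.

    confidences: {answer: confidence stated by the model}
    Returns (distribution, was_valid), where was_valid says whether the
    stated values already summed to 1 within the tolerance.
    """
    if not confidences or any(v < 0 for v in confidences.values()):
        raise ValueError("confidences must be non-empty and non-negative")
    total = sum(confidences.values())
    if total == 0:
        raise ValueError("at least one confidence must be positive")
    was_valid = abs(total - 1.0) <= tol
    dist = {answer: v / total for answer, v in confidences.items()}
    return dist, was_valid


def top_answer(dist):
    """Return the (answer, probability) pair with the highest probability."""
    return max(dist.items(), key=lambda kv: kv[1])


# Hypothetical model output whose confidences sum to more than 1.
stated = {"Paris": 0.7, "Lyon": 0.2, "Nice": 0.2}
dist, was_valid = normalize_verbalized(stated)
print(was_valid, top_answer(dist))
```

A check like this separates two failure modes: a model that names the wrong answer, and a model whose stated confidences do not form a distribution at all.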
[NLP-31] Entropy-Guided Reasoning Compression
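[Quick read]: This paper addresses the high computation cost and poor deployability caused by the overly long Chain-of-Thought (CoT) outputs of large reasoning models. Existing compression methods have had partial success but overlook an “entropy conflict” during training: the compression objective drives entropy down to shorten reasoning, while the accuracy-oriented objective drives entropy up to encourage exploration, and the two pull against each other, leaving the model stuck in a local dilemma. The key to the solution is an entropy-guided training framework: as entropy falls, the model is steered toward concise, efficient reasoning; as entropy rises, exploration is reinforced within the compact reasoning mode, balancing compression and robustness. Experiments show that the method compresses reasoning length to 20% of the original while maintaining or surpassing baseline accuracy.
Link: https://arxiv.org/abs/2511.14258
Authors: Hourun Zhu,Yang Gao,Wenlong Fei,Jiawei Li,Huashan Sun
Affiliations: Beijing Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures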
Abstract:Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process – the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
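For readers unfamiliar with the quantity involved, the sketch below computes the Shannon entropy of a next-token distribution and flags high-entropy positions in a sequence. It illustrates only the measurement, not the paper's training framework; the threshold and the example distributions are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_distributions, threshold):
    """Indices of generation steps whose entropy exceeds the threshold."""
    return [
        i for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]


# Hypothetical three-step generation over a two-token vocabulary.
steps = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
for i, probs in enumerate(steps):
    print(i, round(token_entropy(probs), 3))
print(high_entropy_positions(steps, threshold=0.5))
```

The abstract's observation is that many such high-entropy positions are logical connectors, which the accuracy objective encourages and the compression objective penalizes.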
[NLP-32] AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR AACL2025
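[Quick read]: This paper addresses the absence of application-specific evaluation for speech recognition across the diversity of African English accents. The key to the solution is AfriSpeech-MultiBench, the first domain-specific evaluation suite for African-accented English, covering more than 100 accents across 10+ countries and seven application domains (finance, legal, medical, general dialogue, call center, named entities, and hallucination robustness). It systematically compares open, closed, unimodal automatic speech recognition (ASR) and multimodal LLM-based speech recognition systems on spontaneous and non-spontaneous speech. The benchmark reveals differences in accent robustness, domain adaptation, and hallucination control, and provides empirical guidance for choosing speech technology for African use cases.
Link: https://arxiv.org/abs/2511.14255
Authors: Gabrial Zencha Ashungafac,Mardhiyah Sanni,Busayo Awobade,Alex Gichamba,Tobi Olatunji
Affiliations: Intron Health
Categories: Computation and Language (cs.CL)
Comments: Accepted as a conference paper at IJCNLP-AACL 2025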
Abstract:Recent advances in speech-enabled AI, including Google’s NotebookLM and OpenAI’s speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa’s linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversation drawn from various open African accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment, but hallucinations remain a big problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use-cases, fostering inclusive voice applications for underserved communities.
[NLP-33] Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning AAAI2026
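[Quick read]: This paper addresses a gap in how current automatic movie dubbing models simulate the real dubbing workflow: they ignore the director-actor interaction, in which the director guides the actor to internalize context cues, especially emotion, before performing. Existing methods assume that actors dub directly without preparation. The key to the solution is Authentic-Dubber, a Retrieve-Augmented Director-Actor Interaction Learning scheme with three mechanisms: (1) a multimodal Reference Footage library that simulates the learning footage provided by directors, built with Large Language Models (LLMs) for deep comprehension of emotional representations across multimodal signals; (2) an emotion-similarity-based retrieval-augmentation strategy that retrieves the multimodal information most relevant to the target silent video; and (3) a progressive graph-based speech generation approach that incrementally incorporates the retrieved emotional knowledge, simulating the actor's final dubbing process. The scheme replicates the authentic dubbing workflow and improves emotional expressiveness.
Link: https://arxiv.org/abs/2511.14249
Authors: Rui Liu,Yuan Zhao,Zhenqi Jia
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026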
Abstract:The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker’s timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor’s final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at this https URL.
[NLP-34] MuCPT: Music-related Natural Language Model Continued Pretraining
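[Quick Read]: This paper addresses the limited specialization of large language models (LLMs) in the music-entertainment domain, where the core challenges are the scale and purity of music-related corpora and the match between data and training objectives. The key to the solution is a high-quality music-related natural language corpus of 40 billion tokens (40B tokens) built with a "domain-first" data pipeline: a lightweight classifier first filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking; multi-source music text is also merged with its metadata to give the domain knowledge a better structure. In addition, a reference-model (RM)-based token-level soft scoring mechanism provides quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during training, which reduces noisy gradients, strengthens task-aligned signals, and significantly improves continued pretraining and alignment in the music domain.
Link: https://arxiv.org/abs/2511.14245
Authors: Kai Tian,Yirong Mao,Wendong Bi,Hanjie Wang,Que Wenhui
Affiliations: Tsinghua University; WeChat; Tencent Inc
Categories: Computation and Language (cs.CL)
Comments: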
Abstract:Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during optimization, reducing noise gradients and amplifying task-aligned signals, thereby enabling more effective music-domain continued pretraining and alignment. To assess factuality, we design the MusicSimpleQA benchmark, which adopts short, single-answer prompts with automated agreement scoring. Beyond the benchmark design, we conduct systematic comparisons along the axes of data composition. Overall, this work advances both the right corpus and the right objective, offering a scalable data-training framework and a reusable evaluation tool for building domain LLMs in the music field.
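The token-level loss-ratio idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact criterion, so everything concrete here is an assumption made for illustration: the ratio is taken as training-model loss over reference-model loss, the weight is that ratio clipped to at most 1, and `token_weights`, `weighted_loss`, and `keep_document` are hypothetical names, not the authors' code.

```python
from typing import List

EPS = 1e-8  # guards against division by zero


def token_weights(train_losses: List[float], ref_losses: List[float]) -> List[float]:
    """Soft per-token weights from a loss ratio (assumed rule, not the paper's).

    ratio = train_loss / ref_loss. A token that the reference model also
    finds hard (small ratio) is treated as likely noise and down-weighted.
    """
    if len(train_losses) != len(ref_losses):
        raise ValueError("loss lists must have the same length")
    return [min(1.0, t / (r + EPS)) for t, r in zip(train_losses, ref_losses)]


def weighted_loss(train_losses: List[float], ref_losses: List[float]) -> float:
    """Weighted mean of the training losses (dynamic down-weighting)."""
    weights = token_weights(train_losses, ref_losses)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, train_losses)) / total


def keep_document(train_losses: List[float], ref_losses: List[float],
                  threshold: float = 0.5) -> bool:
    """Data selection with the same criterion: keep if the mean weight is high."""
    weights = token_weights(train_losses, ref_losses)
    return bool(weights) and sum(weights) / len(weights) >= threshold
```

Using one scoring function for both selection and down-weighting mirrors the "unified criterion" wording of the abstract; the threshold of 0.5 is an arbitrary illustrative value.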
[NLP-35] ArbESC+: Arabic Enhanced Edit Selection System Combination for Grammatical Error Correction: Resolving conflict and improving system combination in Arabic GEC
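[Quick Read]: This paper addresses the challenges that complex morphology and syntax pose for Arabic grammatical error correction (GEC), in particular the limited accuracy and robustness of existing single-model methods. The key to the solution is ArbESC+, one of the first multi-system combination frameworks for Arabic GEC: several pretrained language models (such as AraT5, ByT5, mT5, AraBART, and its enhanced variants) generate correction proposals, which are converted into numerical features and passed to a classifier that decides which corrections to apply; supporting techniques filter overlapping corrections and estimate decision reliability. The combination reaches F0.5 scores of 82.63%, 84.64%, and 65.55% on the QALB-14, QALB-15 L1, and QALB-15 L2 test sets respectively, outperforming a single model.
Link: https://arxiv.org/abs/2511.14230
Authors: Ahlam Alrehili,Areej Alhothali
Affiliations: King Abdulaziz University; Saudi Electronic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages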
Abstract:Grammatical Error Correction (GEC) is an important aspect of natural language processing. Arabic has a complicated morphological and syntactic structure, posing a greater challenge than other languages. Even though modern neural models have improved greatly in recent years, the majority of previous attempts used individual models without taking into account the potential benefits of combining different systems. In this paper, we present one of the first multi-system approaches for correcting grammatical errors in Arabic, the Arabic Enhanced Edit Selection System Combination (ArbESC+). Several models are used to collect correction proposals, which are represented as numerical features in the framework. A classifier determines and implements the appropriate corrections based on these features. In order to improve output quality, the framework uses support techniques to filter overlapping corrections and estimate decision reliability. A combination of AraT5, ByT5, mT5, AraBART, AraBART+Morph+GEC, and text-editing systems gave better results than a single model alone, with F0.5 at 82.63% on QALB-14 test data, 84.64% on QALB-15 L1 data, and 65.55% on QALB-15 L2 data. One of the most significant contributions of this work is that it is the first attempt for Arabic to integrate multiple error correction systems. Improving existing models provides a practical step towards developing advanced tools that will benefit users and researchers of Arabic text processing.
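For readers unfamiliar with the F0.5 metric quoted above, a minimal sketch of the standard F-beta formula follows. The function name `f_beta` is hypothetical; with beta = 0.5 the score weights precision more heavily than recall, which is the usual choice in GEC evaluation.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention used here when both P and R are zero
    return (1 + b2) * precision * recall / denom
```

For example, a system with precision 0.8 and recall 0.5 scores higher under F0.5 than under F1, because the metric rewards making fewer wrong edits.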
[NLP-36] Harnessing Deep LLM Participation for Robust Entity Linking
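[Quick Read]: This paper addresses the underuse of large language models (LLMs) in entity linking (EL): existing methods typically apply LLMs only to isolated stages of EL (such as entity disambiguation or input representation) and do not integrate their capabilities across the whole pipeline. The key to the solution is DeepEL, a framework that systematically embeds LLMs into every stage of EL and introduces a novel self-validation mechanism, which uses global contextual information so that the LLM can correct its own predictions and better recognize the semantic relationships among entities in the same sentence. Experiments on ten benchmark datasets show that the method substantially outperforms prior state-of-the-art approaches, with an average gain of 2.6% in overall F1 and a 4% gain on out-of-domain datasets.
Link: https://arxiv.org/abs/2511.14181
Authors: Jiajun Hou,Chenyu Zhang,Rui Meng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: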
Abstract:Entity Linking (EL), the task of mapping textual entity mentions to their corresponding entries in knowledge bases, constitutes a fundamental component of natural language understanding. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable potential for enhancing EL performance. Prior research has leveraged LLMs to improve entity disambiguation and input representation, yielding significant gains in accuracy and robustness. However, these approaches typically apply LLMs to isolated stages of the EL task, failing to fully integrate their capabilities throughout the entire process. In this work, we introduce DeepEL, a comprehensive framework that incorporates LLMs into every stage of the entity linking task. Furthermore, we identify that disambiguating entities in isolation is insufficient for optimal performance. To address this limitation, we propose a novel self-validation mechanism that utilizes global contextual information, enabling LLMs to rectify their own predictions and better recognize cohesive relationships among entities within the same sentence. Extensive empirical evaluation across ten benchmark datasets demonstrates that DeepEL substantially outperforms existing state-of-the-art methods, achieving an average improvement of 2.6% in overall F1 score and a remarkable 4% gain on out-of-domain datasets. These results underscore the efficacy of deep LLM integration in advancing the state-of-the-art in entity linking.
[NLP-37] SymLoc: Symbolic Localization of Hallucination across HaluEval and TruthfulQA
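[Quick Read]: This paper addresses the problem of localizing symbolic hallucination in large language models (LLMs): why and how models produce erroneous output when they encounter symbolic triggers such as modifiers, negation, numbers, exceptions, and named entities. Earlier methods such as LSC and activation variance analysis treat all tokens equally and overlook the role of symbolic linguistic knowledge in triggering hallucination. The paper proposes what it describes as the first localization framework grounded in symbolic linguistic and semantic knowledge. Its key idea is to use symbolic linguistic features (negation, named entities, and so on) as anchors to systematically trace how hallucination develops across model layers. Empirically, attention variance spikes in early layers (2-4), and negation in particular triggers catastrophic instability, indicating that symbolic semantic processing breaks down from the very first stages. The authors conclude that hallucination is fundamentally a failure of symbolic linguistic processing rather than a general generation problem.
Link: https://arxiv.org/abs/2511.14172
Authors: Naveen Lamba,Sanju Tiwari,Manas Gaur
Affiliations: Sharda University; University of Maryland Baltimore County
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: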
Abstract:LLMs still struggle with hallucination, especially when confronted with symbolic triggers like modifiers, negation, numbers, exceptions, and named entities. Yet, we lack a clear understanding of where these symbolic hallucinations originate, making it crucial to systematically handle such triggers and localize the emergence of hallucination inside the model. While prior work explored localization using statistical techniques like LSC and activation variance analysis, these methods treat all tokens equally and overlook the role symbolic linguistic knowledge plays in triggering hallucinations. So far, no approach has investigated how symbolic elements specifically drive hallucination failures across model layers, nor has symbolic linguistic knowledge been used as the foundation for a localization framework. We propose the first symbolic localization framework that leverages symbolic linguistic and semantic knowledge to meaningfully trace the development of hallucinations across all model layers. By focusing on how models process symbolic triggers, we analyze five models using HaluEval and TruthfulQA. Our symbolic knowledge approach reveals that attention variance for these linguistic elements explodes to critical instability in early layers (2-4), with negation triggering catastrophic variance levels, demonstrating that symbolic semantic processing breaks down from the very beginning. Through the lens of symbolic linguistic knowledge, despite larger model sizes, hallucination rates remain consistently high (78.3%-83.7% across Gemma variants), with steep attention drops for symbolic semantic triggers throughout deeper layers. Our findings demonstrate that hallucination is fundamentally a symbolic linguistic processing failure, not a general generation problem, revealing that symbolic semantic knowledge provides the key to understanding and localizing hallucination mechanisms in LLMs.
[NLP-38] Selective Weak-to-Strong Generalization AAAI2025
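[Quick Read]: This paper addresses the lack of robustness that arises when model alignment relies on weak supervision, in particular when a strong pretrained model is fine-tuned with weak supervision and some of the weak labels harm its performance. The key to the solution is a selective weak-to-strong generalization (selective W2SG) framework: a binary classifier P(IK) is trained to identify the questions that the strong model can answer correctly, and for those questions the model's self-generated labels are used for alignment; a graph smoothing method further refines the weak labels. Experiments on three benchmarks show that the method outperforms competitive baselines, and that P(IK) generalizes across tasks and difficulty levels, supporting the usefulness of selective alignment for superalignment.
Link: https://arxiv.org/abs/2511.14166
Authors: Hao Lang,Fei Huang,Yongbin Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: AAAI2025 Special Track on AI Alignment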
Abstract:Future superhuman models will surpass the ability of humans and humans will only be able to weakly supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
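The selection step described in the abstract can be sketched as a simple routing rule. This is a minimal illustration under stated assumptions: `p_ik` stands in for the trained P(IK) classifier and is passed as a plain callable, the 0.5 threshold and the dictionary keys are hypothetical, and the graph smoothing of weak labels is omitted entirely.

```python
from typing import Callable, Dict, List


def select_labels(examples: List[Dict[str, str]],
                  p_ik: Callable[[str], float],
                  threshold: float = 0.5) -> List[Dict[str, str]]:
    """Pick, per question, the strong model's own label or the weak label.

    If P(IK) says the strong model probably knows the answer, its
    self-generated label is used; otherwise the weak supervisor's label is.
    """
    chosen = []
    for ex in examples:
        confident = p_ik(ex["question"]) >= threshold
        chosen.append({
            "question": ex["question"],
            "label": ex["strong_label"] if confident else ex["weak_label"],
            "source": "strong" if confident else "weak",
        })
    return chosen
```

The point of the rule is that weak supervision is only consumed where the strong model is not expected to answer on its own.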
[NLP-39] Applying Relation Extraction and Graph Matching to Answering Multiple Choice Questions KR
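[Quick Read]: This paper addresses the lack of traceability in answering multiple-choice questions (MCQs) while keeping the answers accurate. Conventional methods often output a result directly without explaining the reasoning path, which limits trust. The key to the solution is to combine Transformer-based relation extraction (RE) with knowledge graph (KG) matching: each question sentence is converted into a structured relational graph, and its truthfulness is judged under the closed-world assumption by checking consistency with factually correct KGs, which makes the answering process highly traceable. The method builds KGs dynamically to reflect the meaning of the input text while guarding against false information produced from factually incorrect premises. Experiments show that it answers up to around 70% of the questions correctly, with a clear, traceable reasoning chain.
Link: https://arxiv.org/abs/2511.14144
Authors: Naoki Shimoda,Akihiro Yamamoto
Affiliations: Kyoto University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Presented at NeLaMKRR@KR, 2025 (arXiv:2511.09575)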
Abstract:In this research, we combine Transformer-based relation extraction with matching of knowledge graphs (KGs) and apply them to answering multiple-choice questions (MCQs) while maintaining the traceability of the output process. KGs are structured representations of factual knowledge consisting of entities and relations. Due to the high construction cost, they had been regarded as static databases with validated links. However, the recent development of Transformer-based relation extraction (RE) methods has enabled us to generate KGs dynamically by giving them natural language texts, and thereby opened the possibility for representing the meaning of the input sentences with the created KGs. Using this effect, we propose a method that answers MCQs in the “fill-in-the-blank” format, taking care of the point that RE methods generate KGs that represent false information if provided with factually incorrect texts. We measure the truthfulness of each question sentence by (i) converting the sentence into a relational graph using an RE method and (ii) verifying it against factually correct KGs under the closed-world assumption. The experimental results demonstrate that our method correctly answers up to around 70% of the questions, while providing traceability of the procedure. We also highlight that the question category has a vast influence on the accuracy.
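Steps (i) and (ii) of the abstract can be sketched as follows. This is a deliberately simplified illustration, not the authors' matching procedure: relation extraction is assumed to have already produced triples for each filled-in candidate sentence, the reference KG is a plain set of triples, and `truthfulness` and `answer_mcq` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def truthfulness(triples: Iterable[Triple], kg: Set[Triple]) -> float:
    """Fraction of extracted triples found in the reference KG.

    Closed-world assumption: a triple absent from the KG counts as false.
    """
    triples = list(triples)
    if not triples:
        return 0.0
    return sum(t in kg for t in triples) / len(triples)


def answer_mcq(candidates: Dict[str, List[Triple]], kg: Set[Triple]) -> Tuple[str, float]:
    """Return the option whose filled-in sentence is best supported by the KG."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = {option: truthfulness(triples, kg) for option, triples in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Because the score is computed from explicit triples, each answer can be traced back to the facts that supported or contradicted it.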
[NLP-40] From Graphs to Hypergraphs: Enhancing Aspect-Based Sentiment Analysis via Multi-Level Relational Modeling
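[Quick Read]: This paper addresses two modeling difficulties in Aspect-Based Sentiment Analysis (ABSA): conflicting sentiments across aspects and the sparse context of short texts. Conventional graph-based methods model only pairwise dependencies and must build multiple graphs for different relational views, which introduces redundancy, parameter overhead, and error propagation during fusion, limiting robustness in short-text, low-resource settings. The key to the solution is HyperABSA, a dynamic hypergraph framework that induces aspect-opinion structures through sample-specific hierarchical clustering, together with a novel acceleration-fallback cutoff that adaptively determines the clustering granularity used to construct hyperedges, so that complex semantic relations among multiple aspects are captured more efficiently. Experiments on three benchmarks (Lap14, Rest14, MAMS) show consistent improvements over strong graph baselines, with especially large gains when paired with RoBERTa backbones.
Link: https://arxiv.org/abs/2511.14142
Authors: Omkar Mahesh Kashyap,Padegal Amit,Madhav Kashyap,Ashwini M Joshi,Shylaja SS
Affiliations: PES University; University of Washington
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: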
Abstract:Aspect-Based Sentiment Analysis (ABSA) predicts sentiment polarity for specific aspect terms, a task made difficult by conflicting sentiments across aspects and the sparse context of short texts. Prior graph-based approaches model only pairwise dependencies, forcing them to construct multiple graphs for different relational views. These introduce redundancy, parameter overhead, and error propagation during fusion, limiting robustness in short-text, low-resource settings. We present HyperABSA, a dynamic hypergraph framework that induces aspect-opinion structures through sample-specific hierarchical clustering. To construct these hyperedges, we introduce a novel acceleration-fallback cutoff for hierarchical clustering, which adaptively determines the level of granularity. Experiments on three benchmarks (Lap14, Rest14, MAMS) show consistent improvements over strong graph baselines, with substantial gains when paired with RoBERTa backbones. These results position dynamic hypergraph construction as an efficient, powerful alternative for ABSA, with potential extensions to other short-text NLP tasks.
[NLP-41] PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval
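[Quick Read]: This paper addresses the difficulty of extracting task-relevant information from lengthy financial filings, which is essential for operational and analytical decision-making. The key to the solution is PRISM, a training-free framework that combines refined system prompting, in-context learning (ICL), and a lightweight multi-agent system. Prompt engineering supplies precise task instructions, ICL supplies semantically relevant few-shot examples, and the multi-agent system models coordinated scoring behaviour; together they improve document and chunk ranking, reaching an NDCG@5 of 0.71818 on the restricted validation split, and the authors report that the design is feasible and robust for production-scale deployment.
Link: https://arxiv.org/abs/2511.14130
Authors: Chun Chet Ng,Jia Yu Lim,Wei Zeng Low
Affiliations: AI Lens, Kuala Lumpur, Malaysia
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 3rd-place solution for the ACM ICAIF 2025 Agentic Retrieval Grand Challenge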
Abstract:With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. The FinAgentBench dataset formalizes this problem through two tasks: document ranking and chunk ranking. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and a lightweight multi-agent system. Each component is examined extensively to reveal their synergies: prompt engineering provides precise task instructions, ICL supplies semantically relevant few-shot examples, and the multi-agent system models coordinated scoring behaviour. Our best configuration achieves an NDCG@5 of 0.71818 on the restricted validation split. We further demonstrate that PRISM is feasible and robust for production-scale financial retrieval. Its modular, inference-only design makes it practical for real-world use cases. The source code is released at this https URL.
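The NDCG@5 figure reported above is a ranking metric; a minimal sketch of how it is commonly computed follows. It assumes the linear-gain variant of DCG (gain equal to the relevance grade), which may differ from the exact variant used by the FinAgentBench evaluation, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """NDCG@k for relevance grades listed in the order the system ranked them."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant item among the candidates
    return dcg_at_k(relevances, k) / ideal
```

A perfect ordering scores 1.0, and placing the only relevant item second instead of first already drops the score to roughly 0.63.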
[NLP-42] Synthetic Clinical Notes for Rare ICD Codes: A Data-Centric Framework for Long-Tail Medical Coding
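[Quick Read]: This paper addresses the uneven model performance in automatic ICD coding of clinical text caused by the extreme long-tail distribution of diagnostic codes: rare and zero-shot ICD codes are severely underrepresented in datasets such as MIMIC-III, which leads to low macro-F1 scores. The key to the solution is a data-centric framework that generates high-quality synthetic discharge summaries to mitigate the class imbalance. Using real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes, it constructs structured multi-label code sets anchored on rare codes and uses them as prompts to generate 90,000 synthetic notes covering 7,902 ICD codes, substantially expanding the training distribution. Two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, are then fine-tuned on the original and extended datasets. Experiments show that the strategy modestly improves macro-F1 while keeping micro-F1 strong and outperforms prior state-of-the-art methods, suggesting that carefully designed synthetic data can improve equity in long-tail ICD code prediction.
Link: https://arxiv.org/abs/2511.14112
Authors: Truong Vo,Weiyi Wu,Kaize Ding
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4-page short paper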
Abstract:Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOTA. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.
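Why a long-tail distribution depresses macro-F1 while micro-F1 stays high can be seen from a small worked example. The sketch below uses made-up counts for one frequent and one rare code; the function names are hypothetical, and a class with no true positives is scored as F1 = 0, which is one common convention.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(per_class: Dict[str, Counts]) -> Tuple[float, float]:
    """Macro-F1 averages per-class F1; micro-F1 pools the counts first."""
    if not per_class:
        raise ValueError("at least one class is required")
    macro = sum(f1(*c) for c in per_class.values()) / len(per_class)
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return macro, f1(tp, fp, fn)


# Illustrative counts only: the rare code is never predicted correctly.
macro, micro = macro_micro_f1({"common_code": (90, 10, 10), "rare_code": (0, 0, 5)})
```

Here the frequent code dominates the pooled counts, so micro-F1 stays near 0.88 while macro-F1 falls to 0.45, which is the gap that rare-code augmentation targets.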
[NLP-43] Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT
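[Quick Read]: This paper studies the vulnerability of reasoning-augmented vision-language models (RVLMs), which remain open to attack despite safety alignment, in particular because their exposed chain-of-thought (CoT) traces create a new attack surface. The key contribution is an attack method called Stealth Fine-Tuning: segment-level interference elicits harmful reasoning traces, and the model's self-generated outputs are reused as supervised fine-tuning data; a turn-based weighted loss makes the fine-tuning lightweight and distribution-consistent. With only 499 samples and under 3 hours of training on a single A100 (QLoRA), the method outperforms IDEATOR by 38.52% in attack success rate (ASR), with experiments on AdvBench and several general benchmarks, while preserving the model's general reasoning ability.
Link: https://arxiv.org/abs/2511.14106
Authors: Le Yu,Zhengyue Zhao,Yawen Zheng,Yunhao Liu
Affiliations: Sichuan University; University of Wisconsin–Madison; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 7 figures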
Abstract:Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data. Combined with a turn-based weighted loss design, this yields a lightweight, distribution-consistent finetuning method. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.
[NLP-44] Error-Driven Scene Editing for 3D Grounding in Large Language Models
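[Quick Read]: This paper addresses the limited ability of current 3D large language models (3D-LLMs) to ground language accurately in the visual and spatial elements of 3D environments. The limitation stems in part from training data that emphasizes language reasoning over spatial understanding, which leaves inherent grounding biases in the model. The key to the solution is DEER-3D, an error-driven scene editing framework that follows a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow: fine-grained spatial manipulation produces targeted visual counterfactuals that correct specific error types of the model, without costly scene reconstruction or large-scale 3D data collection, and significantly improves grounding accuracy.
Link: https://arxiv.org/abs/2511.14086
Authors: Yue Zhang,Zun Wang,Han Lin,Jialu Li,Jianing Yang,Yonatan Bitton,Idan Szpektor,Mohit Bansal
Affiliations: UNC Chapel Hill; University of Michigan; Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code: this https URL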
Abstract:Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured “Decompose, Diagnostic Evaluation, Edit, and Re-train” workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D LLMs.
[NLP-45] Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement
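[Quick Read]: This paper addresses the performance degradation in multi-label sentiment classification caused by severe class imbalance in datasets, especially poor recognition of low-frequency emotion categories. The key to the solution is to construct a balanced multi-label sentiment dataset and to design an enhanced classifier that combines pre-trained FastText embeddings, convolutional layers, a bidirectional LSTM, and an attention mechanism. The dataset integrates the original GoEmotions data, samples from Sentiment140 labeled with emotions by a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini, yielding an even distribution across 28 emotion categories. A sigmoid-activated output layer supports multi-label prediction, and mixed precision training improves computational efficiency. Experiments show significant improvements in accuracy, precision, recall, F1-score, and AUC over models trained on imbalanced data.
Link: https://arxiv.org/abs/2511.14073
Authors: Zijin Su,Huanzhu Lv,Yuren Niu,Yiming Liu
Affiliations: University College London; Central South University; Nanyang City Fifth Complete School; Wuhan Guanggu Future School
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 8 figures, 5 tables. Dataset and code available at this https URL and this https URL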
Abstract:Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
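The role of the sigmoid-activated output layer can be shown with a short sketch: each label gets an independent probability, so several emotions can be predicted for one text. The label names, logits, and the 0.5 threshold below are illustrative assumptions, and `predict_labels` is a hypothetical helper rather than part of the authors' model.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_labels(logits: Sequence[float], label_names: Sequence[str],
                   threshold: float = 0.5) -> List[str]:
    """Return every label whose independent probability reaches the threshold."""
    if len(logits) != len(label_names):
        raise ValueError("one logit per label is required")
    return [name for name, z in zip(label_names, logits) if sigmoid(z) >= threshold]
```

Unlike a softmax layer, the probabilities here do not compete with each other, which is what makes multi-label output possible.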
[NLP-46] GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
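[Quick Read]: This paper addresses a new privacy-leakage risk for large language models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR). Conventional membership inference attacks (MIAs) usually rely on a model's explicit memorization of training data, but RLVR updates the policy on-policy and does not depend on fixed ground-truth outputs, so leakage shifts from "answer memorization" to "behavioral change": the attacker must decide whether a given prompt was used during fine-tuning. The authors propose the Divergence-in-Behavior Attack (DIBA), described as the first membership inference framework designed specifically for RLVR. Its core idea is to move the attack's focus from memorization traces to measurable behavioral shifts along two axes: advantage-side improvement (for example, correctness gain) and logit-side divergence (for example, policy drift). Experiments show that DIBA significantly outperforms existing baselines in a range of settings, reaching an AUC of about 0.8 and an order-of-magnitude higher true positive rate (TPR) at a 0.1% false positive rate (FPR), which indicates that training data can be reliably inferred from behavioral traces even without explicit supervision signals.
Link: https://arxiv.org/abs/2511.14045
Authors: Yule Liu,Heyi Zhang,Jinyi Zheng,Zhen Sun,Zifan Peng,Tianshuo Cong,Yilong Yang,Xinlei He,Zhuo Ma
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Shanghai Jiao Tong University; Shandong University; Xidian University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: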
Abstract:Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks across various stages of model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have brought a profound paradigm shift in LLM training, particularly for complex reasoning tasks. However, the on-policy nature of RLVR introduces a unique privacy leakage pattern: since training relies on self-generated responses without fixed ground-truth outputs, membership inference must now determine whether a given prompt (independent of any specific response) is used during fine-tuning. This creates a threat where leakage arises not from answer memorization. To audit this novel privacy risk, we propose Divergence-in-Behavior Attack (DIBA), the first membership inference framework specifically designed for RLVR. DIBA shifts the focus from memorization to behavioral change, leveraging measurable shifts in model behavior across two axes: advantage-side improvement (e.g., correctness gain) and logit-side divergence (e.g., policy drift). Through comprehensive evaluations, we demonstrate that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. We validate DIBA’s superiority across multiple settings–including in-distribution, cross-dataset, cross-algorithm, black-box scenarios, and extensions to vision-language models. Furthermore, our attack remains robust under moderate defensive measures. To the best of our knowledge, this is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that even in the absence of explicit supervision, training data exposure can be reliably inferred through behavioral traces. 
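The TPR@0.1%FPR metric mentioned above measures how many training members an attack identifies when it is only allowed to misflag 0.1% of non-members. A minimal sketch of one way to compute it from attack scores follows; the function name and the tie-handling choice (strictly greater than the threshold) are assumptions for illustration.

```python
import math
from typing import Sequence


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               target_fpr: float = 0.001) -> float:
    """True positive rate at the strictest threshold whose FPR <= target_fpr.

    Higher score means "more likely a training member".
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    # Number of non-members we may misclassify without exceeding the target.
    allowed = math.floor(target_fpr * len(nonmember_scores) + 1e-9)
    ranked = sorted(nonmember_scores, reverse=True)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(score > threshold for score in member_scores)
    return hits / len(member_scores)
```

Note that at a 0.1% target, at least 1,000 non-member samples are needed before even a single false positive is permitted, which is why this metric is usually reported on large evaluation sets.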
[NLP-47] AISAC: An Integrated multi-agent System for Transparent Retrieval-Grounded Scientific Assistance
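[Quick Read]: This paper addresses the need for transparent, traceable, and adaptable AI assistance in scientific and engineering workflows, where multi-agent collaboration, knowledge retrieval, and task orchestration are challenging. The key to the solution is an integrated multi-agent system, the AI Scientific Assistant Core (AISAC). It uses LangGraph for task orchestration, combines FAISS and SQLite into a hybrid memory (semantic retrieval plus structured conversation history), applies an incremental indexing strategy based on file hashing to reduce redundant embedding computation, and provides a configuration-driven project bootstrap layer so that deployments can be customized without changing core code. All agent decisions, tool invocations, and retrievals are logged and presented through a Gradio interface, which keeps each reasoning episode explainable and auditable.
Link: https://arxiv.org/abs/2511.14043
Authors: Chandrachur Bhattacharya,Sibendu Som
Affiliations: Argonne National Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: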
Abstract:AI Scientific Assistant Core (AISAC) is an integrated multi-agent system developed at Argonne National Laboratory for scientific and engineering workflows. AISAC builds on established technologies - LangGraph for orchestration, FAISS for vector search, and SQLite for persistence - and integrates them into a unified system prototype focused on transparency, provenance tracking, and scientific adaptability. The system implements a Router-Planner-Coordinator workflow and an optional Evaluator role, using prompt-engineered agents coordinated via LangGraph’s StateGraph and supported by helper agents such as a Researcher. Each role is defined through custom system prompts that enforce structured JSON outputs. A hybrid memory approach (FAISS + SQLite) enables both semantic retrieval and structured conversation history. An incremental indexing strategy based on file hashing minimizes redundant re-embedding when scientific corpora evolve. A configuration-driven project bootstrap layer allows research teams to customize tools, prompts, and data sources without modifying core code. All agent decisions, tool invocations, and retrievals are logged and visualized through a custom Gradio interface, providing step-by-step transparency for each reasoning episode. The authors have applied AISAC to multiple research areas at Argonne, including specialized deployments for waste-to-products research and energy process safety, as well as general-purpose scientific assistance, demonstrating its cross-domain applicability.
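The incremental indexing idea (re-embed only what changed) can be sketched with content hashes. This is an illustration under stated assumptions, not AISAC's implementation: SHA-256 is an assumed choice of hash, documents are passed as in-memory bytes rather than read from disk, and `plan_reindex` and the manifest layout are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(data: bytes) -> str:
    """Hex digest identifying a document's exact content."""
    return hashlib.sha256(data).hexdigest()


def plan_reindex(documents: Dict[str, bytes],
                 manifest: Dict[str, str]) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Compare current documents with the stored manifest of name -> hash.

    Returns (names to embed again, names to drop from the index, new manifest).
    Unchanged documents appear in neither list, so their embeddings are reused.
    """
    new_manifest: Dict[str, str] = {}
    to_embed: List[str] = []
    for name, data in documents.items():
        digest = content_hash(data)
        new_manifest[name] = digest
        if manifest.get(name) != digest:  # new file or changed content
            to_embed.append(name)
    removed = [name for name in manifest if name not in documents]
    return sorted(to_embed), sorted(removed), new_manifest
```

In a real deployment the manifest would be persisted between runs (for example in a database table), and the two returned lists would drive the vector-index updates.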
[NLP-48] HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection
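[Quick Read]: This paper addresses the neglect of external consistency in multimodal out-of-context (OOC) misinformation detection: existing methods rely heavily on the internal consistency between image and text and overlook consistency with external evidence. The key to the solution is a Hierarchical Evidence-Augmented Generation framework (HiEAG) that strengthens external consistency checking by integrating retrieval, reranking, and rewriting modules. Automatic Evidence Selection Prompting (AESP) selects the relevant evidence from the retrieval results, and Automatic Evidence Generation Prompting (AEGP) improves the task adaptation of multimodal large language models (MLLMs), which significantly improves detection accuracy and provides explainable judgments.
Link: https://arxiv.org/abs/2511.14027
Authors: Junjie Wu,Yumeng Fu,Nan Yu,Guohong Fu
Affiliations: Soochow University; Harbin Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: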
Abstract:Recent advancements in multimodal out-of-context (OOC) misinformation detection have made remarkable progress in checking the consistencies between different modalities for supporting or refuting image-text pairs. However, existing OOC misinformation detection methods tend to emphasize the role of internal consistency, ignoring the significance of external consistency between image-text pairs and external evidence. In this paper, we propose HiEAG, a novel Hierarchical Evidence-Augmented Generation framework to refine external consistency checking by leveraging the extensive knowledge of multimodal large language models (MLLMs). Our approach decomposes external consistency checking into a comprehensive engine pipeline, which integrates reranking and rewriting in addition to retrieval. The evidence reranking module utilizes Automatic Evidence Selection Prompting (AESP), which acquires the relevant evidence item from the products of evidence retrieval. Subsequently, the evidence rewriting module leverages Automatic Evidence Generation Prompting (AEGP) to improve task adaptation on MLLM-based OOC misinformation detectors. Furthermore, our approach enables explanation for judgment, and achieves impressive performance with instruction tuning. Experimental results on different benchmark datasets demonstrate that our proposed HiEAG surpasses previous state-of-the-art (SOTA) methods in accuracy over all samples.
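[NLP-49] Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Quick read: This paper addresses the difficulty of systematically extracting and using the multi-hazard interaction information in post-disaster reconnaissance reports, whose unstructured narratives make knowledge transfer inefficient. Existing large language models (LLMs) tend to produce unreliable or hallucinated outputs when they lack domain grounding, which limits their use in disaster response and resilience building. The key to the solution is a knowledge-grounded framework, Mixture-of-Retrieval Agentic RAG (MoRA-RAG), with three core components: (1) a Mixture-of-Retrieval mechanism that dynamically routes queries to hazard-specific databases to improve relevance; (2) agentic chunking, which preserves contextual coherence during retrieval; and (3) a verification loop that assesses evidence sufficiency, iteratively refines queries, and triggers targeted searches, reducing hallucinations and improving accuracy. On the newly constructed HazardRecQA dataset, the framework outperforms zero-shot LLMs and state-of-the-art RAG systems, reaches up to 94.5% accuracy, and brings open-weight models close to proprietary models.
Link: https://arxiv.org/abs/2511.14010
Authors: Chenchen Kuai,Zihao Li,Braden Rosen,Stephanie Paan,Navid Jafari,Jean-Louis Briaud,Yunlong Zhang,Youssef M. A. Hashash,Yang Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures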
Abstract:Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
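The routing and verification steps described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `route`, `retrieve`, and `answer_with_verification`, the keyword-overlap retriever, and the caller-supplied `is_sufficient` and `refine` callbacks are hypothetical stand-ins, not the MoRA-RAG implementation, which the abstract does not specify.

```python
from typing import Callable, Dict, List, Tuple


def route(query: str, databases: Dict[str, List[str]]) -> List[str]:
    """Pick hazard-specific stores whose name occurs in the query (toy router)."""
    q = query.lower()
    hits = [name for name in databases if name in q]
    return hits or list(databases)  # fall back to searching every store


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Return up to k documents ranked by word overlap with the query."""
    terms = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(terms & set(doc.lower().split()))

    ranked = sorted(docs, key=overlap, reverse=True)
    return [d for d in ranked[:k] if overlap(d) > 0]


def answer_with_verification(
    query: str,
    databases: Dict[str, List[str]],
    is_sufficient: Callable[[List[str]], bool],  # hypothetical evidence check
    refine: Callable[[str, List[str]], str],     # hypothetical query rewriter
    max_rounds: int = 3,
) -> Tuple[List[str], bool]:
    """Retrieve, check sufficiency, refine the query, and repeat."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        for name in route(query, databases):
            for doc in retrieve(query, databases[name]):
                if doc not in evidence:
                    evidence.append(doc)
        if is_sufficient(evidence):
            return evidence, True
        query = refine(query, evidence)
    return evidence, False
```

In a real system the sufficiency check and query refinement would presumably be LLM calls; here they are plain callbacks so the control flow of the loop is easy to follow.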
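[NLP-50] Hint-Augmented Re-ranking: Efficient Product Search using LLM-Based Query Decomposition AACL2025
Quick read: This paper addresses the difficulty of understanding e-commerce search queries that contain superlatives (e.g., "best", "most popular"). Such queries require comparing candidates across multiple dimensions, which demands both linguistic understanding and domain knowledge. The key to the solution is a framework that extracts structured attribute-value hints to expose the latent intent behind these queries and generates the hints concurrently with retrieval so that they integrate efficiently into the ranking pipeline. Because direct LLM-based reranking incurs prohibitive latency, the authors also transfer the superlative interpretations to lightweight models. The method improves MAP by 10.9 points and MRR by 5.9 points over baselines.
Link: https://arxiv.org/abs/2511.13994
Authors: Yilun Zhu,Nikhita Vedula,Shervin Malmasi
Affiliations: Amazon.com, Inc.
Categories: Computation and Language (cs.CL)
Comments: AACL 2025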
Abstract:Search queries with superlatives (e.g., best, most popular) require comparing candidates across multiple dimensions, demanding linguistic understanding and domain knowledge. We show that LLMs can uncover latent intent behind these expressions in e-commerce queries through a framework that extracts structured interpretations or hints. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval, enabling efficient integration into the ranking pipeline. Our method improves search performance by 10.9 points in MAP and ranking by 5.9 points in MRR over baselines. Since direct LLM-based reranking faces prohibitive latency, we develop an efficient approach transferring superlative interpretations to lightweight models. Our findings provide insights into how superlative semantics can be represented and transferred between models, advancing linguistic interpretation in retrieval systems while addressing practical deployment constraints.
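For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how MRR and MAP are commonly computed from binary relevance labels. The function names are illustrative, and average precision is normalized by the number of relevant items that appear in the ranked list, which assumes every relevant item was retrieved. The paper's exact evaluation setup is not given in the abstract.

```python
from typing import Callable, Sequence


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant result."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_over_queries(
    metric: Callable[[Sequence[int]], float],
    runs: Sequence[Sequence[int]],
) -> float:
    """Average a per-query metric over all queries (gives MRR or MAP)."""
    return sum(metric(r) for r in runs) / len(runs)


# Two toy queries; 1 marks a relevant result at that rank.
runs = [[0, 1, 0], [1, 0, 1]]
mrr = mean_over_queries(reciprocal_rank, runs)
map_score = mean_over_queries(average_precision, runs)
```

A gain of "10.9 points in MAP" would then correspond to a difference of 0.109 on this 0-to-1 scale, assuming the paper reports the metric as a percentage.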
[NLP-51] Show and Tell: Prompt Strategies for Style Control in Multi-Turn LLM Code Generation
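Quick read: This paper asks whether user-specified style constraints (such as conciseness and readability) still hold when a generative AI model enhances initial code with new features. The approach compares three prompting mechanisms (instruction-based prompts, example-based prompts, and their combination) across two stages: initial code generation and subsequent improvement. Combined prompts perform best on both initial compression (style control) and expansion discipline (resisting verbose growth), indicating that style control and expansion stability are separate dimensions and that the combined strategy gives the most stable stylistic consistency.
Link: https://arxiv.org/abs/2511.13972
Authors: Jeremiah Bohr
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 23 pages, 2 figures, 3 tables. Under review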
Abstract:Language models generate functionally correct code that tends toward excessive verbosity, with elaborate documentation and defensive patterns that diverge from human baselines. Two prompting mechanisms have emerged for stylistic control: instruction based prompts that articulate abstract directives, and example based prompts that provide concrete code demonstrations. The core problem is whether stylistic constraints persist when models enhance initial implementations with additional features while maintaining high functional accuracy. Here we show that instruction-based, example-based, and combined prompts produce distinct patterns of initial control and expansion discipline over one enhancement turn. We manipulated system prompts across four conditions in a paired two-turn protocol where models first generated solutions to an intermediate Python task, then revised their code under general improvement directives, holding the user task fixed (N = 160 paired programs). Combined prompts produced the strongest initial compression and greatest expansion discipline. Instructions showed large initial effects and moderate expansion discipline. Examples showed modest initial effects with no expansion discipline. These results show that initial prompt effectiveness and expansion discipline are separate aspects of prompt design, and that combined approaches provide the most stable stylistic control in this two-turn workflow.
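[NLP-52] EchoAgent: Guideline-Centric Reasoning Agent for Echocardiography Measurement and Interpretation
Quick read: This paper addresses the lack of video-level reasoning and guideline-based measurement analysis in current deep learning models for echocardiographic image analysis. The key to the solution is EchoAgent, a framework that orchestrates specialized vision tools under Large Language Model (LLM) control to automate temporal localization, spatial measurement, and clinical interpretation in a structured way. Its core contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, which enables autonomous tool selection and keeps results accurate and interpretable.
Link: https://arxiv.org/abs/2511.13948
Authors: Matin Daghyani,Lyuyang Wang,Nima Hashemi,Bassant Medhat,Baraa Abdelsamad,Eros Rojas Velez,XiaoXiao Li,Michael Y. C. Tsang,Christina Luong,Teresa S.M. Tsang,Purang Abolmaesumi
Affiliations: University of British Columbia
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, Under Review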
Abstract:Purpose: Echocardiographic interpretation requires video-level reasoning and guideline-based measurement analysis, which current deep learning models for cardiac ultrasound do not support. We present EchoAgent, a framework that enables structured, interpretable automation for this domain. Methods: EchoAgent orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. A key contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, enabling autonomous tool selection. We curated a benchmark of diverse, clinically validated video-query pairs for evaluation. Results: EchoAgent achieves accurate, interpretable results despite added complexity of spatiotemporal video analysis. Outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability. Conclusion: This work demonstrates the feasibility of agentic, guideline-aligned reasoning for echocardiographic video analysis, enabled by task-specific tools and full video-level automation. EchoAgent sets a new direction for trustworthy AI in cardiac ultrasound.
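[NLP-53] What Works for Lost-in-the-Middle in LLMs? A Study on GM-Extract and Mitigations
Quick read: This paper addresses the "lost-in-the-middle" phenomenon in large language models (LLMs) with long contexts, in which a model's ability to use information in the middle of the context drops markedly on multi-document retrieval tasks. The key to the solution is a new benchmark dataset, GM-Extract, together with an evaluation system built on two separate metrics: the Document Metric for spatial retrieval capability and the Variable Extraction Metric for semantic retrieval capability. A systematic evaluation of 7-8B parameter models on multi-document tasks (key-value extraction and question answering) shows that changing how the data is represented significantly affects retrieval performance, and that mitigation strategies (black-box and white-box methods) are highly context-dependent, with some even hurting performance.
Link: https://arxiv.org/abs/2511.13900
Authors: Mihir Gupte,Eshan Dixit,Muhammad Tayyab,Arun Adiththan
Affiliations: General Motors
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be submitted for publication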
Abstract:The diminishing ability of large language models (LLMs) to effectively utilize long-range context, the “lost-in-the-middle” phenomenon, poses a significant challenge in retrieval-based LLM applications. To study the impact of this phenomenon in a real-world application setting, we introduce GM-Extract, a novel benchmark dataset meticulously designed to evaluate LLM performance on retrieval of control variables. To accurately diagnose failure modes, we propose a simple yet elegant evaluation system using two distinct metrics: one for spatial retrieval capability (Document Metric) and the other for semantic retrieval capability (Variable Extraction Metric). We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering), demonstrating a significant change in retrieval performance simply by altering how the data is represented in the context window. While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models, which we further correlate with perplexity scores. Furthermore, we perform a literature survey of mitigation methods, which we categorize into two distinct approaches: black-box and white-box methods. We then apply these techniques to our benchmark, finding that their efficacy is highly nuanced. Our evaluation highlights scenarios where these strategies successfully improve performance, as well as surprising cases where they lead to a negative impact, providing a comprehensive understanding of their utility in a practical context.
[NLP-54] Can QE-informed (Re)Translation lead to Error Correction? EMNLP2025
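Quick read: This paper addresses errors in machine translation (MT) output and the tendency of Automatic Post-Editing (APE) systems to overcorrect, which degrades performance. It proposes two training-free approaches to segment-level error correction. The first is Quality Estimation (QE)-informed retranslation, which selects the highest-quality translation from candidates generated by different large language models (LLMs). The second instructs an LLM to replace only the error substrings specified in the QE explanations and applies a conditional heuristic that minimises the number of edits to maximise the Gain-to-Edit ratio. The first approach achieves a positive Delta COMET score (+0.0201) and wins the subtask.
Link: https://arxiv.org/abs/2511.13884
Authors: Govardhan Padmanabhan
Affiliations: Institute for People-Centred AI; University of Surrey
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, WMT25 Shared Task in EMNLP 2025 Conference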
Abstract:The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3 - Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach - QE-informed Retranslation, and compare it with another within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved a Delta COMET score of 0.0201 and -0.0108, respectively, leading the first approach to achieve the winning position on the subtask leaderboard.
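The selection step of QE-informed retranslation can be sketched in a few lines. This is an illustration under stated assumptions, not the submitted system: `select_best_translation` is a hypothetical name, and `toy_qe` is a placeholder scorer that only compares token counts, where a real pipeline would plug in a learned quality estimation model.

```python
from typing import Callable, Sequence


def select_best_translation(
    source: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reference-free quality score."""
    if not candidates:
        raise ValueError("need at least one candidate translation")
    return max(candidates, key=lambda cand: qe_score(source, cand))


def toy_qe(source: str, candidate: str) -> float:
    """Hypothetical placeholder scorer: penalize token-count mismatch.

    A real QE model would score adequacy and fluency; this stand-in only
    exists so the example runs without external dependencies.
    """
    return -float(abs(len(source.split()) - len(candidate.split())))


best = select_best_translation("a b c", ["x", "x y z", "x y z w q"], toy_qe)
```

Because the original MT output can be included among the candidates, a selector like this never has to edit a translation that the scorer already rates highest, which is one way to limit overcorrection.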
[NLP-55] When AI Does Science: Evaluating the Autonomous AI Scientist KOSMOS in Radiation Biology
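Quick read: This paper evaluates the ability of KOSMOS, an autonomous AI scientist, to generate and validate scientific hypotheses in radiation biology, testing whether it can find biologically meaningful patterns in public gene expression datasets. The study examines three hypotheses: the relationship between DNA damage response (DDR) capacity and the p53 transcriptional response; whether OGT and CDO1 expression predicts radiation-response modules in breast cancer cells; and whether a 12-gene expression signature predicts biochemical recurrence-free survival after prostate radiotherapy. The key to the approach is testing each AI-generated hypothesis against simple random-gene null benchmarks to separate real signal from chance association. The results show that AI scientists can propose useful hypotheses but need rigorous auditing against null models before their conclusions can be trusted.
Link: https://arxiv.org/abs/2511.13825
Authors: Humza Nusrat,Omar Nusrat
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 3 figures, preprint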
Abstract:Agentic AI “scientists” now use language models to search the literature, run analyses, and generate hypotheses. We evaluate KOSMOS, an autonomous AI scientist, on three problems in radiation biology using simple random-gene null benchmarks. Hypothesis 1: baseline DNA damage response (DDR) capacity across cell lines predicts the p53 transcriptional response after irradiation (GSE30240). Hypothesis 2: baseline expression of OGT and CDO1 predicts the strength of repressed and induced radiation-response modules in breast cancer cells (GSE59732). Hypothesis 3: a 12-gene expression signature predicts biochemical recurrence-free survival after prostate radiotherapy plus androgen deprivation therapy (GSE116918). The DDR-p53 hypothesis was not supported: DDR score and p53 response were weakly negatively correlated (Spearman rho = -0.40, p = 0.76), indistinguishable from random five-gene scores. OGT showed only a weak association (r = 0.23, p = 0.34), whereas CDO1 was a clear outlier (r = 0.70, empirical p = 0.0039). The 12-gene signature achieved a concordance index of 0.61 (p = 0.017) but a non-unique effect size. Overall, KOSMOS produced one well-supported discovery, one plausible but uncertain result, and one false hypothesis, illustrating that AI scientists can generate useful ideas but require rigorous auditing against appropriate null models.
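A random-gene null benchmark of the kind mentioned above can be sketched as follows. This is a generic illustration, not the authors' analysis code: the data layout (a dict from gene name to per-sample expression), the mean-expression signature score, the Pearson statistic, and the add-one empirical p-value are all assumptions chosen for simplicity.

```python
import math
import random
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def signature_score(expr: Dict[str, List[float]], genes: Sequence[str]) -> List[float]:
    """Per-sample score: mean expression over the genes in the signature."""
    n_samples = len(next(iter(expr.values())))
    return [mean(expr[g][i] for g in genes) for i in range(n_samples)]


def random_gene_null(
    expr: Dict[str, List[float]],
    outcome: Sequence[float],
    signature: Sequence[str],
    n_draws: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Compare a signature's correlation with same-size random gene sets.

    Returns (observed correlation, two-sided empirical p-value).
    """
    rng = random.Random(seed)
    pool = sorted(expr)  # random sets are drawn from all genes, signature included
    observed = pearson(signature_score(expr, signature), outcome)
    exceed = 0
    for _ in range(n_draws):
        genes = rng.sample(pool, len(signature))
        r = pearson(signature_score(expr, genes), outcome)
        if abs(r) >= abs(observed):
            exceed += 1
    # Add-one estimator keeps the p-value strictly above zero.
    return observed, (1 + exceed) / (1 + n_draws)
```

The point of the comparison is that a signature is only interesting if it beats arbitrary gene sets of the same size; an effect that random sets reproduce just as often is what the abstract calls a "non-unique effect size".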
[NLP-56] Rdgai: Classifying transcriptional changes using Large Language Models with a test case from an Arabic Gospel tradition
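Quick read: This paper addresses the cost of classifying types of variation in textual traditions, that is, how to automate the classification of differences between large numbers of readings for phylogenetic analysis. The traditional approach requires manually comparing readings one by one at each variation unit, which is time-consuming and hard to scale. The key to the solution is Rdgai, a software package that uses multilingual large language models (LLMs) to classify changes between readings automatically: users manually annotate a small number of examples, these annotations are placed in the LLM prompt to classify the remaining reading transitions, and the results are stored in TEI XML for downstream phylogenetic analysis. This lowers the barrier to entry for such analyses and makes them more scalable.
Link: https://arxiv.org/abs/2511.13801
Authors: Robert Turnbull
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: 8 figures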
Abstract:Application of phylogenetic methods to textual traditions has traditionally treated all changes as equivalent even though it is widely recognized that certain types of variants were more likely to be introduced than others. While it is possible to give weights to certain changes using a maximum parsimony evaluation criterion, it is difficult to state a priori what these weights should be. Probabilistic methods, such as Bayesian phylogenetics, allow users to create categories of changes, and the transition rates for each category can be estimated as part of the analysis. This classification of types of changes in readings also allows for inspecting the probability of these categories across each branch in the resulting trees. However, classification of readings is time-consuming, as it requires categorizing each reading against every other reading at each variation unit, presenting a significant barrier to entry for this kind of analysis. This paper presents Rdgai, a software package that automates this classification task using multi-lingual large language models (LLMs). The tool allows users to easily manually classify changes in readings and then it uses these annotations in the prompt for an LLM to automatically classify the remaining reading transitions. These classifications are stored in TEI XML and ready for downstream phylogenetic analysis. This paper demonstrates the application with data from an Arabic translation of the Gospels.
[NLP-57] Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments
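Quick read: This paper addresses the open question of whether the vulnerabilities of large language models (LLMs) scale systematically when models interact in multi-agent and safety-critical settings, in particular whether larger models can jailbreak smaller ones to elicit harmful behavior. The approach uses standardized adversarial tasks from JailbreakBench to simulate more than 6,000 multi-turn attacker-target exchanges, measures harm score and refusal behavior, and aggregates scores from three independent LLM judges. It finds a statistically significant positive correlation between mean harm and the logarithm of the attacker-to-target size ratio (Pearson r = 0.51, p < 0.001), higher variance in mean harm across attackers than across targets, and a strong negative correlation between attacker refusal frequency and harm (rho = -0.93, p < 0.001).
Link: https://arxiv.org/abs/2511.13788
Authors: Samuel Nathanson,Rebecca Williams,Cynthia Matuszek
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: 19 pages, 6 figures, 3 tables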
Abstract:Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones, eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM families and scales (0.6B-120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is evaluated through aggregated harm and refusal scores assigned by three independent LLM judges, providing a consistent, model-based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically significant correlation between mean harm and the logarithm of the attacker-to-target size ratio (Pearson r = 0.51, p < 0.001; Spearman rho = 0.52, p < 0.001), indicating that relative model size correlates with the likelihood and severity of harmful completions. Mean harm score variance is higher across attackers (0.18) than across targets (0.10), suggesting that attacker-side behavioral diversity contributes more to adversarial outcomes than target susceptibility. Attacker refusal frequency is strongly and negatively correlated with harm (rho = -0.93, p < 0.001), showing that attacker-side alignment mitigates harmful responses. These findings reveal that size asymmetry influences robustness and provide exploratory evidence for adversarial scaling patterns, motivating more controlled investigations into inter-model alignment and safety.
[NLP-58] Refine Thought: A Test-Time Inference Method for Embedding Model Reasoning
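Quick read: This paper addresses the weak performance of text embedding models on semantic reasoning tasks, in particular their limits in complex semantic understanding and logical reasoning. The key to the solution is RT (Refine Thought), a method that runs multiple forward passes of the text embedding model to strengthen its semantic reasoning, further activating the reasoning ability that decoder-only text embedding models (e.g., Qwen3-Embedding-8B) learned during pretraining. RT is a test-time inference method: without changing model parameters, it significantly improves performance on semantic reasoning benchmarks such as BRIGHT and PJBenchmark1 while keeping performance stable on general-purpose semantic understanding tasks such as C-MTEB.
Link: https://arxiv.org/abs/2511.13726
Authors: Guangzhi Wang,Kai Li,Yinghao Jiao,Zhi Liu
Affiliations: CareerInternational Research Team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: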
Abstract:We propose RT (Refine Thought), a method that can enhance the semantic reasoning ability of text embedding models. The method obtains the final semantic representation by running multiple forward passes of the text embedding model. Experiments show that RT achieves significant improvements on semantic reasoning tasks in BRIGHT and the person-job matching benchmark PJBenchmark1, while maintaining consistent performance on general-purpose semantic understanding tasks such as C-MTEB. Our results indicate that RT is effective because it further activates the semantic reasoning ability learned during pretraining by decoder-only text embedding models (e.g., Qwen3-Embedding-8B). RT can be seen as a test-time inference method.
[NLP-59] Signature vs. Substance: Evaluating the Balance of Adversarial Resistance and Linguistic Quality in Watermarking Large Language Models
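Quick read: This paper addresses the potential harms of text generated by large language models (LLMs), which watermarking aims to mitigate by embedding detectable signals that allow generated text to be identified and traced. The key contribution is an evaluation of several watermarking techniques on two axes: robustness to adversarial attacks (paraphrasing and back translation) and preservation of the semantics, quality, and writing style of the unwatermarked text. The results show that current watermarking methods preserve semantics but deviate from the original writing style and are especially vulnerable to back-translation attacks, exposing limits in both practicality and security.
Link: https://arxiv.org/abs/2511.13722
Authors: William Guo,Adaku Uchendu,Ana Smith
Affiliations: IMSA (Independent School for the Arts); MIT Lincoln Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: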
Abstract:To mitigate the potential harms of Large Language Model (LLM)-generated text, researchers have proposed watermarking, a process of embedding detectable signals within text. With watermarking, we can always accurately detect LLM-generated texts. However, recent findings suggest that these techniques often negatively affect the quality of the generated texts, and adversarial attacks can strip the watermarking signals, causing the texts to possibly evade detection. These findings have created resistance to the wide adoption of watermarking by LLM creators. Finally, to encourage adoption, we evaluate the robustness of several watermarking techniques to adversarial attacks by comparing paraphrasing and back translation (i.e., English → another language → English) attacks; and their ability to preserve quality and writing style of the unwatermarked texts by using linguistic metrics to capture quality and writing style of texts. Our results suggest that these watermarking techniques preserve semantics, deviate from the writing style of the unwatermarked texts, and are susceptible to adversarial attacks, especially for the back translation attack.
Computer Vision
[CV-0] ARC Is a Vision Problem!
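Quick read: This paper addresses the fact that research on abstract reasoning tasks such as the Abstraction and Reasoning Corpus (ARC) relies mainly on language models or recurrent reasoning models and overlooks the inherently visual nature of the tasks. The key to the solution is Vision ARC (VARC), a vision-centric formulation that recasts ARC as an image-to-image translation problem and represents inputs on a "canvas" to incorporate visual priors, so that standard vision architectures such as a Vision Transformer (ViT) can perform the image-to-image mapping end to end. The model is trained from scratch and generalizes to unseen tasks through test-time training, reaching 60.4% accuracy on the ARC-1 benchmark, substantially outperforming other methods trained from scratch and approaching average human performance.
Link: https://arxiv.org/abs/2511.14761
Authors: Keya Hu,Ali Cy,Linlu Qiu,Xiaoman Delores Ding,Runqian Wang,Yeyin Eva Zhu,Jacob Andreas,Kaiming He
Affiliations: MIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Technical Report. Project webpage: this https URL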
Abstract:The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a “canvas” that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
[CV-1] UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning
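Quick read: This paper addresses the fragmentation of capabilities in multimodal large language models (MLLMs) across image understanding, generation, and editing, with a focus on improving image generation and editing together. The key to the solution is a unified Reinforcement Learning (RL) strategy that jointly optimizes image generation and editing through shared reward models, plus a lightweight Edit Instruction Alignment stage that significantly improves the model's comprehension of editing instructions and thus the effectiveness of RL training. UniGen-1.5 achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, surpassing open-source models such as BAGEL and reaching performance comparable to the proprietary GPT-Image-1.
Link: https://arxiv.org/abs/2511.14760
Authors: Rui Tian,Mingfei Gao,Haiming Gang,Jiasen Lu,Zhe Gan,Yinfei Yang,Zuxuan Wu,Afshin Dehghan
Affiliations: Fudan University; Apple
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: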
Abstract:We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.
[CV-2] Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
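Quick read: This paper addresses the high computational cost and low efficiency of visual geometric transformers in real-time 3D perception and reconstruction. The core solution is Confidence-Guided Token Merging (Co-Me), an acceleration mechanism that needs no retraining or fine-tuning of the base model. The method distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merges low-confidence tokens, which substantially reduces computation while maintaining spatial coverage. Compared with similarity-based merging or pruning, the confidence signal more reliably indicates the regions the transformer emphasizes, giving acceleration without degrading performance, and it applies to multi-view and streaming visual geometric transformers.
Link: https://arxiv.org/abs/2511.14751
Authors: Yutian Chen,Yuheng Qiu,Ruogu Li,Ali Agha,Shayegan Omidshafiei,Jay Patrikar,Sebastian Scherer
Affiliations: Carnegie Mellon University; Field AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: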
Abstract:We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to 11.3× and 7.2× speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
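The idea of merging low-confidence tokens can be illustrated with a toy example on plain Python lists. The rule used here (average runs of consecutive low-confidence tokens in groups of fixed size, and pass high-confidence tokens through unchanged) is an assumption for illustration only. The abstract does not describe Co-Me's actual predictor, grouping, or merge operator, and `merge_low_confidence` is a hypothetical name.

```python
from typing import List, Sequence

Vector = List[float]


def _average(group: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(col) for col in zip(*group)]


def merge_low_confidence(
    tokens: Sequence[Vector],
    confidence: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off between low and high confidence
    group: int = 2,          # assumed number of tokens fused into one
) -> List[Vector]:
    """Shorten a token sequence by averaging neighbouring low-confidence tokens."""
    merged: List[Vector] = []
    buffer: List[Vector] = []

    def flush() -> None:
        if buffer:
            merged.append(_average(buffer))
            buffer.clear()

    for tok, conf in zip(tokens, confidence):
        if conf >= threshold:
            flush()                   # close any pending low-confidence run
            merged.append(list(tok))  # keep confident tokens untouched
        else:
            buffer.append(list(tok))
            if len(buffer) == group:
                flush()
    flush()
    return merged


tokens = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
conf = [0.9, 0.1, 0.2, 0.8]
shorter = merge_low_confidence(tokens, conf)
```

Since self-attention cost grows with sequence length, shortening the sequence this way is where a speedup would come from; a complete method would also need to restore the original token count for dense outputs, which this sketch leaves out.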
[CV-3] Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
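Quick read: This paper addresses the way subjective and noisy labels limit model performance in engagement recognition on video datasets. The core challenge is that human-annotated engagement labels vary between annotators and are uncertain, which hampers the training of conventional computer vision models. The key to the solution is to use Vision Large Language Models (VLMs) to refine the original labels and to train with a strategy that combines curriculum learning with soft label refinement: a questionnaire first extracts behavioral cues and splits the data into high- and low-reliability subsets, and training then gradually incorporates ambiguous samples while adjusting supervision to reflect label uncertainty. The method improves classical vision models on benchmarks such as EngageNet, DREAMS, and PAFE.
Link: https://arxiv.org/abs/2511.14749
Authors: Alexander Vedernikov,Puneet Kumar,Haoyu Chen,Tapio Seppänen,Xiaobai Li
Affiliations: University of Oulu; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: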
Abstract:Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.
[CV-4] A Neural Field-Based Approach for View Computation Data Exploration in 3D Urban Environments
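Quick read: This paper addresses the computational bottlenecks and interaction complexity of exploring 3D urban data, in particular the inefficiency of browsing large scenes whose intricate geometry causes heavy occlusion. Traditional methods depend on extensive manual viewpoint adjustment, which makes efficient, automated view analysis difficult. The key to the solution is a view-based approach to 3D data exploration in which a vector field encodes views from the environment and a neural field builds an efficient implicit representation of the 3D environment. The representation supports two kinds of queries: direct queries (computing view assessment indices) and inverse queries (avoiding occlusion and searching for views that match desired data patterns), which supports urban analysis tasks such as visibility assessment, solar exposure evaluation, and assessing the visual impact of new developments.
Link: https://arxiv.org/abs/2511.14742
Authors: Stefan Cobeli,Kazi Shahrukh Omar,Rodrigo Valença,Nivan Ferreira,Fabio Miranda
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at IEEE Transactions on Visualization and Computer Graphics. Code and data are publicly available at this https URL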
Abstract:Despite the growing availability of 3D urban datasets, extracting insights remains challenging due to computational bottlenecks and the complexity of interacting with data. In fact, the intricate geometry of 3D urban environments results in high degrees of occlusion and requires extensive manual viewpoint adjustments that make large-scale exploration inefficient. To address this, we propose a view-based approach for 3D data exploration, where a vector field encodes views from the environment. To support this approach, we introduce a neural field-based method that constructs an efficient implicit representation of 3D environments. This representation enables both faster direct queries, which consist of the computation of view assessment indices, and inverse queries, which help avoid occlusion and facilitate the search for views that match desired data patterns. Our approach supports key urban analysis tasks such as visibility assessments, solar exposure evaluation, and assessing the visual impact of new developments. We validate our method through quantitative experiments, case studies informed by real-world urban challenges, and feedback from domain experts. Results show its effectiveness in finding desirable viewpoints, analyzing building facade visibility, and evaluating views from outdoor spaces. Code and data are publicly available at this https URL.
[CV-5] Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising
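Quick read: This paper addresses the lack of realism in synthetic video, that is, how to turn simulator-generated video into photorealistic video without additional fine-tuning. The key to the solution is a zero-shot framework in which an auxiliary model estimates structure-aware information (such as depth maps, semantic maps, and edge maps) from the synthetic video, and this information conditions the generation/denoising process of a diffusion video foundation model. The enhanced video thereby preserves the multi-level structures of the original synthetic video in both the spatial and temporal domains while achieving state-of-the-art photorealism.
Link: https://arxiv.org/abs/2511.14719
Authors: Yifan Wang,Liya Ji,Zhanghan Ke,Harry Yang,Ser-Nam Lim,Qifeng Chen
Affiliations: The Hong Kong University of Science and Technology; Everlyn AI; University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL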
Abstract:We propose an approach to enhancing synthetic video realism, which can re-render synthetic videos from a simulator in photorealistic fashion. Our realism enhancement approach is a zero-shot framework that focuses on preserving the multi-level structures from synthetic videos into the enhanced one in both spatial and temporal domains, built upon a diffusion video foundational model without further fine-tuning. Specifically, we incorporate an effective modification to have the generation/denoising process conditioned on estimated structure-aware information from the synthetic video, such as depth maps, semantic maps, and edge maps, by an auxiliary model, rather than extracting the information from a simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and powerful approach to enhancing synthetic video realism: we show that our approach outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality in our experiments.
[CV-6] Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
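[Quick Read]: This paper targets the three-stage modular architecture of standard latent diffusion models, which is computationally inefficient, yields suboptimal performance, and is hard to unify with the single-network architectures common in vision foundation models. The core difficulty is that naively training the encoder, decoder, and diffusion network jointly causes "latent collapse", where the diffusion training objective interferes with the network's ability to learn a useful latent representation. The key to the solution is a new framework, Diffusion as Self-Distillation (DSD), which modifies the training objective to stabilize the latent space and enables, for the first time, a single end-to-end trainable network that encodes, decodes, and performs diffusion, reaching FID=4.25 on ImageNet 256×256 conditional generation with only 205M parameters and 50 training epochs.
Link: https://arxiv.org/abs/2511.14716
Authors: Xiyuan Wang, Muhan Zhang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech Report. 10 pages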
Abstract:Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to “latent collapse”, where the diffusion training objective interferes with the network’s ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet 256×256 conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.
[CV-7] FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation
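[Quick Read]: This paper addresses the prohibitive cost of end-to-end training for ultra-high-resolution video generation, which stems from the quadratic time and memory complexity of attention in Transformer-based video generators. The key to the solution is a training-free method built around an inward sliding window attention mechanism, which preserves each query token's training-scale receptive field to maintain visual fidelity and detail. Because local window attention alone tends to produce repetitive content and global incoherence, a dual-path pipeline adds a cross-attention override strategy in which a branch with a full receptive field guides the locally attended content, balancing fine-grained detail with holistic semantic consistency; a cross-attention caching strategy for that branch improves efficiency.
Link: https://arxiv.org/abs/2511.14712
Authors: Yunfeng Wu, Jiayi Song, Zhenxiong Tan, Zihao He, Songhua Liu
Affiliations: Shanghai Jiao Tong University; Xi’an Jiaotong-Liverpool University; Xi’an Jiaotong University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures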
Abstract:The quadratic time and memory complexity of the attention mechanism in modern Transformer based video generators makes end-to-end training for ultra high resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding window attention mechanism, which originates from a key observation: maintaining each query token’s training scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention, unfortunately, often leads to repetitive content and exhibits a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field and, therefore, ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid the frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Codes are available at: this https URL
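One way to picture the inward sliding window idea on a 1D token sequence is sketched below. It is an illustration only, assuming that "inward" means border windows are shifted toward the interior so that every query keeps a window of the same width; the function names and the 1D simplification are hypothetical and are not taken from the paper's code.

```python
def inward_window(i, n, w):
    """Return the [start, end) key range for query i.

    Assumption: the window is centered on i where possible and shifted
    inward at the sequence borders, so it is always exactly w keys wide.
    Requires 0 < w <= n.
    """
    start = min(max(i - w // 2, 0), n - w)
    return start, start + w


def window_mask(n, w):
    """Boolean n-by-n mask: mask[i][j] is True when query i may attend to key j."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        start, end = inward_window(i, n, w)
        for j in range(start, end):
            mask[i][j] = True
    return mask
```

With `n` larger than the training length and `w` set to the training length, every row of the mask contains exactly `w` allowed keys, which is the property the abstract describes as keeping the training-scale receptive field.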
[CV-8] Seeing Beyond the Image: ECG and Anatomical Knowledge-Guided Myocardial Scar Segmentation from Late Gadolinium-Enhanced Images
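[Quick Read]: This paper addresses accurate segmentation of myocardial scar from late gadolinium-enhanced cardiac MRI (LGE-MRI), a task made difficult by variable contrast and imaging artifacts. The key to the solution is a multimodal framework that combines ECG-derived electrophysiological information with AHA-17 anatomical priors for physiologically consistent scar segmentation. A Temporal Aware Feature Fusion (TAFF) mechanism dynamically weights and fuses features according to the time difference between ECG and LGE-MRI acquisition, so that modalities acquired at different times can be integrated effectively; the average Dice score for scars rises from 0.6149 to 0.8463.
Link: https://arxiv.org/abs/2511.14702
Authors: Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Meryem Jabrane, Vicente Grau, Shahnaz Jamil-Copley, Richard H. Clayton, Chen(Cherise)Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: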
Abstract:Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to “see beyond the image”, setting a new direction for robust and physiologically grounded cardiac scar segmentation.
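For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name and the convention of returning 1.0 for two empty masks are assumptions made for this example; the paper's evaluation code may differ.

```python
def dice_score(pred, truth):
    """Dice coefficient between two equally long binary masks (flattened)."""
    pred = [bool(p) for p in pred]
    truth = [bool(t) for t in truth]
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the predicted scar mask matches the reference exactly, and 0.0 means the two masks do not overlap at all.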
[CV-9] HyMAD: A Hybrid Multi-Activity Detection Approach for Border Surveillance and Monitoring
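[Quick Read]: This paper addresses the difficulty of accurately detecting and distinguishing overlapping activities (human intrusions, animal movements, and vehicle traffic) with seismic sensing systems for border surveillance, a difficulty rooted in the complexity and noise of seismic signals. The key to the solution is HyMAD (Hybrid Multi-Activity Detection), a deep neural architecture based on spatio-temporal feature fusion: spectral features are extracted with SincNet, temporal dependencies are modeled by a recurrent neural network (RNN), self-attention strengthens intra-modal representations, and a cross-modal fusion module supports robust multi-label classification, improving recognition of complex concurrent events and overall system reliability.
Link: https://arxiv.org/abs/2511.14698
Authors: Sriram Srinivasan, Srinivasan Aruchamy, Siva Ram Krisha Vadali
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Multi-label seismic signal classification using novel attention-based feature fusion. Submitting to cs.CV due to relevance to general pattern recognition and time-frequency (spectrogram) analysis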
Abstract:Seismic sensing has emerged as a promising solution for border surveillance and monitoring; the seismic sensors that are often buried underground are small and cannot be noticed easily, making them difficult for intruders to detect, avoid, or vandalize. This significantly enhances their effectiveness compared to highly visible cameras or fences. However, accurately detecting and distinguishing between overlapping activities that are happening simultaneously, such as human intrusions, animal movements, and vehicle rumbling, remains a major challenge due to the complex and noisy nature of seismic signals. Correctly identifying simultaneous activities is critical because failing to separate them can lead to misclassification, missed detections, and an incomplete understanding of the situation, thereby reducing the reliability of surveillance systems. To tackle this problem, we propose HyMAD (Hybrid Multi-Activity Detection), a deep neural architecture based on spatio-temporal feature fusion. The framework integrates spectral features extracted with SincNet and temporal dependencies modeled by a recurrent neural network (RNN). In addition, HyMAD employs self-attention layers to strengthen intra-modal representations and a cross-modal fusion module to achieve robust multi-label classification of seismic events. We evaluate our approach on a dataset constructed from real-world field recordings collected in the context of border surveillance and monitoring, demonstrating its ability to generalize to complex, simultaneous activity scenarios involving humans, animals, and vehicles. Our method achieves competitive performance and offers a modular framework for extending seismic-based activity recognition in real-world security applications.
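Multi-label classification differs from ordinary single-label classification in that each activity gets its own independent yes/no decision, so several activities can be reported at once. The sketch below shows the usual sigmoid-and-threshold decoding step only; the label names, the 0.5 threshold, and the function names are assumptions for illustration and do not describe HyMAD's actual output head.

```python
import math

# Hypothetical label set matching the activities named in the abstract.
LABELS = ("human", "animal", "vehicle")


def sigmoid(x):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multilabel(logits, threshold=0.5):
    """Return every label whose independent probability reaches the threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]
```

For example, `predict_multilabel([2.0, -3.0, 1.0])` reports both a human and a vehicle, which a softmax-based single-label classifier could not express.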
[CV-10] Attention via Synaptic Plasticity is All You Need: A Biologically Inspired Spiking Neuromorphic Transformer
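[Quick Read]: This paper addresses the high training and inference energy cost and carbon footprint of Transformer-based large language models (LLMs), and the mismatch between their attention mechanism and biological brain mechanisms. Existing Transformers rely on floating-point dot-product similarity, which fits spiking neural networks (SNNs) and other brain-inspired hardware poorly and suffers from the von Neumann bottleneck that limits in-memory computing. The key to the solution is a new spiking Transformer, the Spiking STDP Transformer (S²TDPT), which implements self-attention through spike-timing-dependent plasticity (STDP), embedding query-key correlations in synaptic weights so that computation happens in place and is driven by spike events, supporting energy-efficient non-von Neumann hardware and better interpretability. The model reaches 94.35% and 78.08% accuracy on CIFAR-10 and CIFAR-100, with energy consumption at 11.53% of that of a standard ANN Transformer.
Link: https://arxiv.org/abs/2511.14691
Authors: Kallol Mondal (1 and 2), Ankush Kumar (2) ((1) Department of Electronics and Communication Engineering, National Institute of Technology Allahabad, Prayagraj, (2) Centre for Nanotechnology, Indian Institute of Technology Roorkee)
Affiliations: National Institute of Technology Allahabad; Indian Institute of Technology Roorkee
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (stat.ML)
Comments: 21 Pages, 5 Figures, 3 Table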
Abstract:Attention is the brain’s ability to selectively focus on a few specific aspects while ignoring irrelevant ones. This biological principle inspired the attention mechanism in modern Transformers. Transformers now underpin large language models (LLMs) such as GPT, but at the cost of massive training and inference energy, leading to a large carbon footprint. While brain attention emerges from neural circuits, Transformer attention relies on dot-product similarity to weight elements in the input sequence. Neuromorphic computing, especially spiking neural networks (SNNs), offers a brain-inspired path to energy-efficient intelligence. Despite recent work on attention-based spiking Transformers, the core attention layer remains non-neuromorphic. Current spiking attention (i) relies on dot-product or element-wise similarity suited to floating-point operations, not event-driven spikes; (ii) keeps attention matrices that suffer from the von Neumann bottleneck, limiting in-memory computing; and (iii) still diverges from brain-like computation. To address these issues, we propose the Spiking STDP Transformer (S²TDPT), a neuromorphic Transformer that implements self-attention through spike-timing-dependent plasticity (STDP), embedding query–key correlations in synaptic weights. STDP, a core mechanism of memory and learning in the brain and widely studied in neuromorphic devices, naturally enables in-memory computing and supports non-von Neumann hardware. On CIFAR-10 and CIFAR-100, our model achieves 94.35% and 78.08% accuracy with only four timesteps and 0.49 mJ on CIFAR-100, an 88.47% energy reduction compared to a standard ANN Transformer. Grad-CAM shows that the model attends to semantically relevant regions, enhancing interpretability. Overall, S²TDPT illustrates how biologically inspired attention can yield energy-efficient, hardware-friendly, and explainable neuromorphic models.
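As background for the STDP mechanism named above, the following sketch shows the textbook pair-based STDP rule, in which a synapse is strengthened when the presynaptic spike precedes the postsynaptic spike and weakened otherwise. This is the generic rule from the neuroscience literature, not the attention formulation of S²TDPT; the amplitudes, time constants, and function name are assumptions chosen only for illustration.

```python
import math


def stdp_weight_change(dt_ms, a_plus=0.01, a_minus=0.012,
                       tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP update for one pre/post spike pair.

    dt_ms is t_post - t_pre in milliseconds. All parameter values are
    illustrative assumptions, not values from the paper.
    """
    if dt_ms > 0:
        # Pre fires before post: potentiation, decaying with the time gap.
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:
        # Post fires before pre: depression, decaying with the time gap.
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0
```

Because the update depends only on spike timing at the synapse itself, the correlation between two spike trains ends up stored in the weight, which is the property the abstract relies on for in-memory computation.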
[CV-11] Impact of Image Resolution on Age Estimation with DeepFace and InsightFace
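[Quick Read]: This paper examines how input image resolution affects the accuracy of facial age estimation, particularly in practical settings where resolution varies widely. The key to the approach is a systematic evaluation of two mainstream frameworks, DeepFace and InsightFace, across resolutions: 1000 images from the IMDB-Clean dataset are processed at seven resolutions (7000 samples in total) and scored with Mean Absolute Error (MAE), Standard Deviation (SD), and Median Absolute Error (MedAE). Resolution has a clear effect on accuracy; both frameworks perform best at 224×224 pixels (MAE of 10.83 years for DeepFace and 7.46 years for InsightFace), both very low and very high resolutions degrade accuracy, and InsightFace is faster than DeepFace at every resolution.
Link: https://arxiv.org/abs/2511.14689
Authors: Shiyar Jamo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 7 figures, 7 tables. Evaluation of DeepFace and InsightFace age estimation across seven image resolutions (64 to 1080 px)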
Abstract:Automatic age estimation is widely used for age verification, where input images often vary considerably in resolution. This study evaluates the effect of image resolution on age estimation accuracy using DeepFace and InsightFace. A total of 1000 images from the IMDB-Clean dataset were processed in seven resolutions, resulting in 7000 test samples. Performance was evaluated using Mean Absolute Error (MAE), Standard Deviation (SD), and Median Absolute Error (MedAE). Based on this study, we conclude that input image resolution has a clear and consistent impact on the accuracy of age estimation in both DeepFace and InsightFace. Both frameworks achieve optimal performance at 224x224 pixels, with an MAE of 10.83 years (DeepFace) and 7.46 years (InsightFace). At low resolutions, MAE increases substantially, while very high resolutions also degrade accuracy. InsightFace is consistently faster than DeepFace across all resolutions.
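The three metrics named in the abstract can be computed as in the sketch below. It assumes that SD refers to the population standard deviation of the absolute errors, which the abstract does not specify; the function name and the dictionary layout are likewise hypothetical.

```python
from statistics import mean, median, pstdev


def age_error_metrics(true_ages, predicted_ages):
    """Summarize age-estimation error in years for paired lists of ages."""
    abs_errors = [abs(p - t) for t, p in zip(true_ages, predicted_ages)]
    return {
        "MAE": mean(abs_errors),      # Mean Absolute Error
        "SD": pstdev(abs_errors),     # assumed: spread of the absolute errors
        "MedAE": median(abs_errors),  # Median Absolute Error
    }
```

MedAE is less sensitive than MAE to a few badly wrong predictions, so reporting both helps show whether a resolution hurts most images or only a handful.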
[CV-12] Improving segmentation of retinal arteries and veins using cardiac signal in doppler holograms
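[Quick Read]: This paper addresses the fact that conventional vessel segmentation methods for time-resolved Doppler holography rely only on spatial information and ignore temporal dynamics, which limits the precision of quantitative retinal hemodynamics analysis. The key to the solution is a simple and effective preprocessing strategy: features extracted by a dedicated pulse analysis pipeline are fed into a standard U-Net, letting the model exploit the temporal dynamics of holographic data and reach artery-vein segmentation performance comparable to attention-based or iterative models without added model complexity.
Link: https://arxiv.org/abs/2511.14654
Authors: Marius Dubosc, Yann Fischer, Zacharie Auray, Nicolas Boutry, Edwin Carlinet, Michael Atlan, Thierry Geraud
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, 1 table. Submitted to ISBI2026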
Abstract:Doppler holography is an emerging retinal imaging technique that captures the dynamic behavior of blood flow with high temporal resolution, enabling quantitative assessment of retinal hemodynamics. This requires accurate segmentation of retinal arteries and veins, but traditional segmentation methods focus solely on spatial information and overlook the temporal richness of holographic data. In this work, we propose a simple yet effective approach for artery-vein segmentation in temporal Doppler holograms using standard segmentation architectures. By incorporating features derived from a dedicated pulse analysis pipeline, our method allows conventional U-Nets to exploit temporal dynamics and achieve performance comparable to more complex attention- or iteration-based models. These findings demonstrate that time-resolved preprocessing can unlock the full potential of deep learning for Doppler holography, opening new perspectives for quantitative exploration of retinal hemodynamics. The dataset is publicly available at this https URL
[CV-13] RepAir: A Framework for Airway Segmentation and Discontinuity Correction in CT
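[Quick Read]: This paper addresses disconnected airway segmentations in chest CT, a common failure of automated U-Net-based methods that hinders reliable extraction of quantitative biomarkers. The key to the solution is RepAir, a three-stage framework: an nnU-Net first produces an initial airway mask, a skeleton-based algorithm then identifies potential discontinuities and proposes reconnection candidates, and a 1D convolutional classifier finally decides which candidate links are true anatomical branches rather than false or obstructed paths, yielding accurate and topologically correct 3D airway segmentation.
Link: https://arxiv.org/abs/2511.14649
Authors: John M. Oyer, Ali Namvar, Benjamin A. Hoff, Wassim W. Labaki, Ella A. Kazerooni, Charles R. Hatt, Fernando J. Martinez, MeiLan K. Han, Craig J. Galbán, Sundaresh Ram
Affiliations: University of Michigan; Northwestern University; University of Pittsburgh; University of Wisconsin-Madison; University of Chicago
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures, 1 table. Preprint submitted to SSIAI 2026 Conference on November 17, 2025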
Abstract:Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM’22, comprising annotated CT scans from predominantly healthy subjects and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.
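The "disconnected components" problem mentioned above can be made concrete with a small connectivity check. The sketch below counts 6-connected components in a set of foreground voxel coordinates; it only illustrates how a broken airway tree shows up as more than one component, and it is not RepAir's skeleton-based algorithm. The function name and the sparse coordinate representation are assumptions.

```python
from collections import deque

# Face-adjacent neighbor offsets (6-connectivity in 3D).
NEIGHBOR_OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                    (0, -1, 0), (0, 0, 1), (0, 0, -1))


def connected_components(voxels):
    """Group foreground voxel coordinates (x, y, z) into 6-connected sets."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOR_OFFSETS:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components
```

An anatomically plausible airway mask should form a single component, so any result with two or more components marks gaps that a reconnection stage would need to examine.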
[CV-14] SLAM-AGS: Slide-Label Aware Multi-Task Pretraining Using Adaptive Gradient Surgery in Computational Cytology
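[Quick Read]: This paper addresses two challenges in computational cytology: instance-level labels are unreliable and prohibitively costly to obtain, and positive instances are sparse (low witness rates). To address them, the authors propose SLAM-AGS, a slide-label-aware multi-task pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, improving downstream performance. The key step is Adaptive Gradient Surgery, which mitigates conflicting gradients between the tasks, prevents model collapse, and stabilizes pretraining; the pretrained encoder is combined with an attention-based Multiple Instance Learning aggregator for bag-level prediction and retrieval of abnormal instances, with clear gains at very low witness rates (down to 0.5%).
Link: https://arxiv.org/abs/2511.14639
Authors: Marco Acerbis, Swarnadip Chatterjee, Christophe Avenel, Joakim Lindblad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, Submitted to ISBI2026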
Abstract:Computational cytology faces two major challenges: i) instance-level labels are unreliable and prohibitively costly to obtain, ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, yielding stronger performance on downstream tasks. To stabilize learning, we apply Adaptive Gradient Surgery to tackle conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-Score and Top 400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates, showing that resolving gradient interference enables stable pretraining and better performance on downstream tasks. To facilitate reproducibility, we share our complete implementation and evaluation framework as open source: this https URL.
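To illustrate what resolving conflicting task gradients can look like, the sketch below uses a PCGrad-style projection: when two task gradients point in opposing directions (negative dot product), each has its conflicting component removed before they are summed. This is a well-known generic technique swapped in for illustration; the abstract does not describe how Adaptive Gradient Surgery works internally, and all function names here are hypothetical.

```python
def dot(a, b):
    """Dot product of two equally long gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_conflict(grad, other):
    """Project grad onto the plane orthogonal to other if the two conflict."""
    overlap = dot(grad, other)
    if overlap >= 0:
        return list(grad)  # no conflict: leave the gradient unchanged
    scale = overlap / dot(other, other)
    return [g - scale * o for g, o in zip(grad, other)]


def combine_task_gradients(grad_a, grad_b):
    """Sum two task gradients after removing mutually conflicting parts."""
    a = remove_conflict(grad_a, grad_b)
    b = remove_conflict(grad_b, grad_a)
    return [x + y for x, y in zip(a, b)]
```

After projection, neither task's update moves against the other task's gradient, which is one way to keep a weakly supervised objective and a contrastive objective from undoing each other's progress.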
[CV-15] SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction AAAI2026
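[Quick Read]: This paper addresses the tendency of Gaussian Splatting optimization to overfit under sparse input views, which degrades both surface reconstruction quality and novel view synthesis. The key to the solution is two mechanisms: Stereo Geometry-Texture Alignment, which jointly optimizes rendering quality and geometry estimation to improve consistency between surface reconstruction and view synthesis, and Pseudo-Feature Enhanced Geometry Consistency, which uses both training views and unseen views to enforce multi-view geometric consistency and mitigate overfitting under sparse supervision.
Link: https://arxiv.org/abs/2511.14633
Authors: Meiying Gu, Jiawei Zhang, Jiahe Li, Xiaohan Yu, Haonan Luo, Jin Zheng, Xiao Bai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2026. Project page: this https URL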
Abstract:Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose SparseSurf, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves state-of-the-art performance.
[CV-16] Fusing Biomechanical and Spatio-Temporal Features for Fall Prediction: Characterizing and Mitigating the Simulation-to-Reality Gap
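[Quick Read]: This paper addresses the weak real-world performance of fall prediction systems for older adults. The core challenges are the pronounced gap between simulated and real data (the simulation-reality gap) and reduced generalization caused by the distinctive kinematic profiles of older adults, such as those with diabetes or frailty. The key to the solution is a dual-stream architecture, the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), which fuses pose and biomechanical features through cross-attention and improves F1-score over the ST-GCN baseline on simulated data by 5.32% and 2.91% on two datasets, while spatio-temporal attention adds interpretability by identifying critical joints and temporal phases. The authors also call for personalization strategies and privacy-preserving data pipelines to enable real-world validation and narrow the gap.
Link: https://arxiv.org/abs/2511.14620
Authors: Md Fokhrul Islam, Sajeda Al-Hammouri, Christopher J. Arellano, Kavan Hazeli, Heman Shakeri
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: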
Abstract:Falls are a leading cause of injury and loss of independence among older adults. Vision-based fall prediction systems offer a non-invasive solution to anticipate falls seconds before impact, but their development is hindered by the scarcity of available fall data. Contributing to these efforts, this study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines both pose and biomechanical information using a cross-attention fusion mechanism. Our model outperforms the vanilla ST-GCN baseline by 5.32% and 2.91% F1-score on the simulated MCF-UA stunt-actor and MUVIM datasets, respectively. The spatio-temporal attention mechanisms in the ST-GCN stream also provide interpretability by identifying critical joints and temporal phases. However, a critical simulation-reality gap persists. While our model achieves an 89.0% F1-score with full supervision on simulated data, zero-shot generalization to unseen subjects drops to 35.9%. This performance decline is likely due to biases in simulated data, such as ‘intent-to-fall’ cues. For older adults, particularly those with diabetes or frailty, this gap is exacerbated by their unique kinematic profiles. To address this, we propose personalization strategies and advocate for privacy-preserving data pipelines to enable real-world validation. Our findings underscore the urgent need to bridge the gap between simulated and real-world data to develop effective fall prediction systems for vulnerable elderly populations.
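Since the results above are all reported as F1-scores, a short sketch of the standard binary F1 computation may help when reading the 89.0% versus 35.9% comparison. The function name and the choice to return 0.0 when there are no true positives are assumptions for this example, not details from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for paired 0/1 labels, where 1 marks a fall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # Assumed convention: no correct detections gives an F1 of zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 is the harmonic mean of precision and recall, so a model that misses most real falls scores poorly even if its few alarms are correct.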
[CV-17] 3D-Guided Scalable Flow Matching for Generating Volumetric Tissue Spatial Transcriptomics from Serial Histology
[Quick Read]: This paper addresses limitations of current spatial transcriptomics (ST) methods on 3D tissue: most methods that infer ST directly from hematoxylin and eosin (H&E) images process each section independently and ignore 3D continuity, while existing 3D-aware methods are not generative and scale poorly. The key to the solution is Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that retrieves morphologically corresponding spots on adjacent sections in a shared feature space and fuses this cross-section context into a lightweight ControlNet so that conditioning follows anatomical continuity. It also introduces a prior that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior built from neighboring sections to better model count data, and a global attention block that scales linearly with the number of spots per slide, enabling training and inference on large 3D ST datasets.
Link: https://arxiv.org/abs/2511.14613
Authors: Mohammad Vali Sanian, Arshia Hemmat, Amirhossein Vahidi, Jonas Maaskola, Jimmy Tsz Hang Lee, Stanislaw Makarchuk, Yeliz Demirci, Nana-Jane Chipampe, Omer Bayraktar, Lassi Paavolainen, Mohammad Lotfollahi
Affiliations: Cambridge Stem Cell Institute, University of Cambridge; Computer Science Department, University of Oxford; Computer Science Department, University of Helsinki; Institute for Molecular Medicine Finland, University of Helsinki; Wellcome Sanger Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages
Abstract:A scalable and robust 3D tissue transcriptomics profile can enable a holistic understanding of tissue organization and provide deeper insights into human biology and disease. Most predictive algorithms that infer ST directly from histology treat each section independently and ignore 3D structure, while existing 3D-aware approaches are not generative and do not scale well. We present Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E while explicitly using information from adjacent sections. Our key idea is to retrieve morphologically corresponding spots on neighboring slides in a shared feature space and fuse this cross-section context into a lightweight ControlNet, allowing conditioning to follow anatomical continuity. To better capture the count nature of the data, we introduce a 3D-consistent prior for flow matching that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior constructed from neighboring sections. A global attention block introduces 3D H&E scaling linearly with the number of spots in the slide, enabling training and inference on large 3D ST datasets. Across three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea consistently improves 3D expression accuracy and generalization compared to 2D and 3D baselines. We envision HoloTea advancing the creation of accurate 3D virtual tissues, ultimately accelerating biomarker discovery and deepening our understanding of disease.
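The zero-inflated negative binomial (ZINB) distribution mentioned above is a standard model for sparse count data: with probability `pi` a count is an "extra" zero, and otherwise it follows a negative binomial with mean `mu` and dispersion `theta`. The sketch below evaluates that textbook probability mass function; the parameter names and the mean/dispersion parameterization are assumptions, and nothing here reflects how HoloTea learns or uses its prior.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial (mu > 0, theta > 0)."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def zinb_pmf(k, mu, theta, pi):
    """Probability of count k under a zero-inflated negative binomial.

    pi is the extra probability mass placed on zero (0 <= pi < 1).
    """
    nb_prob = math.exp(nb_log_pmf(k, mu, theta))
    zero_mass = pi if k == 0 else 0.0
    return zero_mass + (1.0 - pi) * nb_prob
```

Setting `pi` to zero recovers the plain negative binomial, while larger values of `pi` shift probability toward zero counts, which matches the many empty entries typical of gene expression matrices.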
[CV-18] XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation
[Quick Read]: This paper addresses accurate prediction of femoral neck bone mineral density (BMD) in the context of osteoporosis, with the aim of reducing error in fracture risk assessment. The core of the solution is the XAttn-BMD framework, which uses a novel bidirectional cross-attention mechanism to dynamically fuse features from hip X-ray images and structured clinical metadata so that the two modalities reinforce each other. A Weighted Smooth L1 loss mitigates the imbalanced BMD distribution and prioritizes clinically significant cases, improving regression generalization and robustness.
Link: https://arxiv.org/abs/2511.14604
Authors: Yilin Zhang, Leo D. Westbury, Elaine M. Dennison, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar
Affiliations: University of Southampton; Southampton General Hospital; University Hospital NHS Foundation Trust
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 figures, 10 tables, 38 pages. Submitted to Artificial Intelligence in Medicine (currently with editor)
Abstract:Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the R² score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model’s potential in real-world scenarios.
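A weighted Smooth L1 loss can be sketched as follows. Smooth L1 is quadratic for small errors and linear for large ones; the weighting here is a plain per-sample weight normalized by the weight sum. How XAttn-BMD actually derives its weights is not stated in the abstract, so the weights, the `beta` transition point, and the function names are all assumptions for illustration.

```python
def smooth_l1(error, beta=1.0):
    """Smooth L1 penalty: quadratic below beta, linear above it."""
    magnitude = abs(error)
    if magnitude < beta:
        return 0.5 * magnitude * magnitude / beta
    return magnitude - 0.5 * beta


def weighted_smooth_l1(targets, predictions, weights, beta=1.0):
    """Weighted mean of per-sample Smooth L1 penalties.

    weights are hypothetical per-sample importances, for example larger
    for rare or clinically important BMD values.
    """
    total = sum(w * smooth_l1(p - t, beta)
                for t, p, w in zip(targets, predictions, weights))
    return total / sum(weights)
```

Giving under-represented BMD ranges a larger weight makes their errors count more in the average, which is one common way to counter an imbalanced regression target.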
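[CV-19] MRI Embeddings Complement Clinical Predictors for Cognitive Decline Modeling in Alzheimer’s Disease Cohorts
[Quick Read]: This paper addresses accurate modeling of cognitive decline trajectories in Alzheimer’s disease (AD), in particular how to combine clinical phenotypes and imaging features for early stratification and personalized management. The key to the solution is a trajectory-aware labeling strategy based on Dynamic Time Warping (DTW) clustering, together with a 3D Vision Transformer (ViT) pretrained without supervision on harmonized and augmented MRI data to extract anatomy-preserving embeddings. The approach captures heterogeneous patterns of cognitive change without progression labels; comparison against traditional machine learning classifiers, deep learning heads, and convolutional baselines shows that clinical and volumetric features are strongest at identifying the high-risk extremes (AUC≈0.70), whereas MRI embeddings are more sensitive for distinguishing cognitively stable individuals (AUC=0.71), which motivates multimodal fusion for AD progression modeling.
Link: https://arxiv.org/abs/2511.14601
Authors: Nathaniel Putera, Daniel Vilet Rodríguez, Noah Videcrantz, Julia Machnio, Mostafa Mehdipour Ghazi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at SPIE - Medical Imaging Conference 2026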
Abstract:Accurate modeling of cognitive decline in Alzheimer’s disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
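Dynamic Time Warping, used above to group cognitive trajectories, measures how similar two sequences are when they may be stretched or compressed in time. The sketch below is the classic dynamic-programming formulation with absolute difference as the local cost; the function name and the cost choice are assumptions, and the clustering step that the paper builds on top of DTW is not shown.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Two patients whose scores decline along the same shape but at different visit spacing get a small DTW distance, which is why it suits trajectories sampled at irregular intervals.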
[CV-20] CCSD: Cross-Modal Compositional Self-Distillation for Robust Brain Tumor Segmentation with Missing Modalities
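[Quick Read]: This paper addresses the drop in performance and generalization of deep learning models for brain tumor segmentation from multi-modal MRI when one or more modalities are missing, as often happens in clinical practice. The key to the solution is a novel Cross-Modal Compositional Self-Distillation (CCSD) framework with a shared-specific encoder-decoder architecture and two self-distillation strategies: a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and a progressive modality combination distillation approach that simulates gradual modality dropout during training to improve robustness to missing modalities. This design improves segmentation accuracy and stability across modality combinations.
Link: https://arxiv.org/abs/2511.14599
Authors: Dongqing Xie, Yonghuang Wu, Zisheng Ai, Jun Min, Zhencun Jiang, Shaojin Geng, Lei Wang
Affiliations: Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures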
Abstract:The accurate segmentation of brain tumors from multi-modal MRI is critical for clinical diagnosis and treatment planning. While integrating complementary information from various MRI sequences is a common practice, the frequent absence of one or more modalities in real-world clinical settings poses a significant challenge, severely compromising the performance and generalizability of deep learning-based segmentation models. To address this challenge, we propose a novel Cross-Modal Compositional Self-Distillation (CCSD) framework that can flexibly handle arbitrary combinations of input modalities. CCSD adopts a shared-specific encoder-decoder architecture and incorporates two self-distillation strategies: (i) a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and (ii) a progressive modality combination distillation approach that enhances robustness to missing modalities by simulating gradual modality dropout during training. Extensive experiments on public brain tumor segmentation benchmarks demonstrate that CCSD achieves state-of-the-art performance across various missing-modality scenarios, with strong generalization and stability.
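To show what "arbitrary combinations of input modalities" amounts to in practice, the sketch below enumerates every non-empty subset of a modality list, ordered from the full set down to single modalities. The four modality names are an assumption based on common brain tumor MRI benchmarks (the abstract does not name them), and the full-to-single ordering is only one possible reading of "gradual modality dropout".

```python
from itertools import combinations

# Assumed modality names; the abstract does not list them.
MODALITIES = ("T1", "T1ce", "T2", "FLAIR")


def modality_subsets(modalities=MODALITIES):
    """List all non-empty modality combinations, largest subsets first."""
    subsets = []
    for size in range(len(modalities), 0, -1):
        subsets.extend(combinations(modalities, size))
    return subsets
```

With four modalities this yields 15 input configurations, which is the space a missing-modality model has to handle at test time.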
[CV-21] Deep Learning-Based Regional White Matter Hyperintensity Mapping as a Robust Biomarker for Alzheimer's Disease
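Quick read: This paper addresses the limitation that current automated white matter hyperintensity (WMH) segmentation methods report only global lesion load and ignore how lesions are distributed across white matter regions. The key to the solution is a deep learning framework that robustly segments and localizes WMH and then quantifies WMH volume within anatomically defined regions. The predicted lesion loads agree with reference values on several public datasets and an independent Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort; regional WMH volumes discriminate disease status better than global lesion burden, and combining them with brain atrophy measures improves diagnostic performance further (AUC up to 0.97), revealing localized vulnerability to Alzheimer's disease in specific regions such as anterior white matter tracts.
Link: https://arxiv.org/abs/2511.14588
Authors: Julia Machnio,Mads Nielsen,Mostafa Mehdipour Ghazi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at SPIE - Medical Imaging Conference 2026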
Abstract:White matter hyperintensities (WMH) are key imaging markers in cognitive aging, Alzheimer’s disease (AD), and related dementias. Although automated methods for WMH segmentation have advanced, most provide only global lesion load and overlook their spatial distribution across distinct white matter regions. We propose a deep learning framework for robust WMH segmentation and localization, evaluated across public datasets and an independent Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. Our results show that the predicted lesion loads are in line with the reference WMH estimates, confirming the robustness to variations in lesion load, acquisition, and demographics. Beyond accurate segmentation, we quantify WMH load within anatomically defined regions and combine these measures with brain structure volumes to assess diagnostic value. Regional WMH volumes consistently outperform global lesion burden for disease classification, and integration with brain atrophy metrics further improves performance, reaching area under the curve (AUC) values up to 0.97. Several spatially distinct regions, particularly within anterior white matter tracts, are reproducibly associated with diagnostic status, indicating localized vulnerability in AD. These results highlight the added value of regional WMH quantification. Incorporating localized lesion metrics alongside atrophy markers may enhance early diagnosis and stratification in neurodegenerative disorders.
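Quantifying lesion load per anatomical region amounts to counting lesion voxels under each atlas label. The sketch below assumes a flattened binary lesion mask and a matching flattened atlas; the region names, voxel size, and function name are hypothetical and not taken from the paper.

```python
from collections import Counter


def regional_lesion_volumes(lesion_mask, region_labels, voxel_volume_mm3=1.0):
    """Sum lesion voxels per atlas region; inputs are flat sequences of equal length."""
    if len(lesion_mask) != len(region_labels):
        raise ValueError("mask and atlas must have the same number of voxels")
    counts = Counter(
        region for lesion, region in zip(lesion_mask, region_labels) if lesion
    )
    return {region: n * voxel_volume_mm3 for region, n in counts.items()}


# Toy example: six voxels, three hypothetical regions
mask = [0, 1, 1, 0, 1, 0]
atlas = ["frontal", "frontal", "parietal", "parietal", "frontal", "occipital"]
volumes = regional_lesion_volumes(mask, atlas, voxel_volume_mm3=2.0)
print(volumes)
```

The global lesion load is the sum over regions, so the regional breakdown strictly adds information to the single global number.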
[CV-22] OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
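Quick read: This paper addresses the significant computational bottleneck that omnimodal large language models (OmniLLMs) face when processing audio-video token sequences for unified audio-video understanding; existing token compression methods do not yet handle joint compression of cross-modal tokens. The key to the solution is OmniZip, a training-free, audio-guided audio-visual token-compression framework: it first identifies salient audio tokens and computes an audio retention score for each time group to capture information density, which dynamically guides video token pruning while audio anchors are enhanced by cross-modal similarity; it then compresses the video tokens within each time window using an interleaved spatio-temporal scheme, accelerating inference and reducing memory while maintaining performance without additional training.
Link: https://arxiv.org/abs/2511.14582
Authors: Keda Tao,Kele Shao,Bohan Yu,Weiqiang Wang,Jian liu,Huan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code Link: this https URL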
Abstract:Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves a 3.42X inference speedup and a 1.4X memory reduction over other top-performing counterparts, while maintaining performance with no training.
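One way to picture audio-guided pruning is to split a global video-token budget across time groups in proportion to each group's audio retention score. The sketch below is an assumption-laden illustration, not OmniZip's actual rule: the proportional allocation, the function name, and all parameters are hypothetical.

```python
def allocate_video_budget(audio_scores, tokens_per_group, keep_ratio):
    """Split a global keep budget across time groups in proportion to audio scores."""
    score_sum = sum(audio_scores)
    if score_sum <= 0:
        raise ValueError("audio scores must have a positive sum")
    total_budget = int(round(keep_ratio * tokens_per_group * len(audio_scores)))
    raw = [total_budget * s / score_sum for s in audio_scores]
    # Floor each share and never keep more tokens than a group contains.
    keep = [min(tokens_per_group, int(r)) for r in raw]
    # Hand out tokens lost to flooring, largest fractional part first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order:
        if sum(keep) >= total_budget:
            break
        if keep[i] < tokens_per_group:
            keep[i] += 1
    return keep


# Four time groups of 100 video tokens each; the second group is audio-dense.
print(allocate_video_budget([1, 5, 3, 1], tokens_per_group=100, keep_ratio=0.25))
```

Within each group, the tokens that survive would then be chosen by some importance measure; the abstract mentions an interleaved spatio-temporal scheme for that step, which is not modeled here.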
[CV-23] Explaining Digital Pathology Models via Clustering Activations
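Quick read: This paper addresses the lack of interpretability of deep learning models in digital pathology, particularly those based on convolutional neural networks, which makes it hard to gain clinicians' trust and slows adoption in clinical practice. Conventional saliency-map methods such as occlusion, GradCAM, or relevance propagation only highlight the local regions that contribute most to the prediction for a single slide and cannot reflect the model's overall behaviour. The proposed solution is a clustering-based explainability technique that groups the model's response features across many samples, revealing the model's global behaviour while providing more fine-grained insight. The method helps in understanding the model's decisions and can increase clinicians' confidence in its reliability, speeding up deployment in real diagnostic settings.
Link: https://arxiv.org/abs/2511.14558
Authors: Adam Bajger,Jan Obdržálek,Vojtěch Kůr,Rudolf Nenutil,Petr Holub,Vít Musil,Tomáš Brázdil
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: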
Abstract:We present a clustering-based explainability technique for digital pathology models based on convolutional neural networks. Unlike commonly used methods based on saliency maps, such as occlusion, GradCAM, or relevance propagation, which highlight regions that contribute the most to the prediction for a single slide, our method shows the global behaviour of the model under consideration, while also providing more fine-grained information. The resulting clusters can be visualised not only to understand the model, but also to increase confidence in its operation, leading to faster adoption in clinical practice. We also evaluate the performance of our technique on an existing model for detecting prostate cancer, demonstrating its usefulness.
[CV-24] ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection
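Quick read: This paper addresses the serious threat to information integrity and societal stability posed by Deepfake videos generated with advanced Generative Adversarial Networks (GANs) and autoencoders. Existing methods based on single-stream CNNs struggle to capture multi-scale forgery artifacts across the spatial, texture, and frequency domains, which limits detection robustness and generalization. The key to the solution is ForensicFlow, a tri-modal forensic framework that fuses RGB, texture, and frequency evidence: the RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) identifies fine-grained blending artifacts; the frequency branch (CNN + SE module) detects periodic spectral noise; and attention mechanisms provide temporal pooling and adaptive feature fusion. On Celeb-DF (v2) the method reaches an AUC of 0.9752 and an F1-Score of 0.9408.
Link: https://arxiv.org/abs/2511.14554
Authors: Mohammad Romani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, 2 tables. Preprint. Submitted on November 18, 2025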
Abstract:Deepfakes generated by advanced GANs and autoencoders severely threaten information integrity and societal stability. Single-stream CNNs fail to capture multi-scale forgery artifacts across spatial, texture, and frequency domains, limiting robustness and generalization. We introduce ForensicFlow, a tri-modal forensic framework that synergistically fuses RGB, texture, and frequency evidence for video Deepfake detection. The RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; the frequency branch (CNN + SE) identifies periodic spectral noise. Attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive attention fusion balances the branches. On Celeb-DF (v2) with Focal Loss, ForensicFlow achieves AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines. Ablation validates branch synergy; Grad-CAM confirms forensic focus. This comprehensive feature fusion provides superior resilience against subtle forgeries.
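Attention-based temporal pooling can be sketched as a softmax-weighted average of per-frame features, so that frames with higher evidence scores dominate the clip-level representation. The code below is a generic illustration with hypothetical names and toy values; how ForensicFlow computes its frame scores is not described in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_features, frame_scores):
    """Weighted average of per-frame feature vectors using softmax weights."""
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]


# Two frames with 2-D features; the second frame carries far more evidence.
features = [[1.0, 0.0], [3.0, 2.0]]
print(attention_pool(features, [0.0, 0.0]))   # equal scores: plain mean
print(attention_pool(features, [0.0, 100.0]))  # second frame dominates
```

The same weighting pattern could be applied across the three branch outputs instead of across frames, which is one plausible reading of adaptive fusion.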
[CV-25] Interaction-Aware 4D Gaussian Splatting for Dynamic Hand-Object Interaction Reconstruction
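Quick read: This paper addresses the challenging problem of jointly modeling the geometry and appearance of hand-object interaction scenes without any object priors. The key to the solution is an interaction-aware method based on dynamic 3D Gaussian Splatting (3D-GS): first, interaction-aware hand-object Gaussians with optimizable parameters adopt a piecewise linear hypothesis for a clearer structural representation; second, hand information is incorporated into the object deformation field to build interaction-aware dynamic fields that capture the complementarity and tight coupling of hand and object shapes during interaction; finally, a progressive optimization strategy handles dynamic regions and the static background step by step, and explicit regularization terms stabilize the hand-object representations for smooth motion transitions, physically realistic interaction, and coherent lighting.
Link: https://arxiv.org/abs/2511.14540
Authors: Hao Tian,Chenyangguang Zhang,Rui Liu,Wen Shen,Xiaolin Qin
Affiliations: Chengdu Institute of Computer Applications, Chinese Academy of Sciences (成都计算机应用研究所,中国科学院); Tsinghua University (清华大学); Minzu University of China (中央民族大学); University of Chinese Academy of Sciences (中国科学院大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures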
Abstract:This paper focuses on the challenging setting of simultaneously modeling the geometry and appearance of hand-object interaction scenes without any object priors. We follow the trend of dynamic 3D Gaussian Splatting based methods and address several significant challenges. To model complex hand-object interaction with mutual occlusion and edge blur, we present interaction-aware hand-object Gaussians with newly introduced optimizable parameters that adopt a piecewise linear hypothesis for clearer structural representation. Moreover, considering the complementarity and tightness of hand shape and object shape during interaction dynamics, we incorporate hand information into the object deformation field, constructing interaction-aware dynamic fields to model flexible motions. To further address difficulties in the optimization process, we propose a progressive strategy that handles dynamic regions and static background step by step. Correspondingly, explicit regularizations are designed to stabilize the hand-object representations for smooth motion transition, physical interaction reality, and coherent lighting. Experiments show that our approach surpasses existing dynamic 3D-GS-based methods and achieves state-of-the-art performance in reconstructing dynamic hand-object interaction.
[CV-26] Learning Compact Latent Space for Representing Neural Signed Distance Functions with High-fidelity Geometry Details AAAI AAAI-26
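Quick read: This paper addresses the tension between preserving high-fidelity geometry details and keeping latent codes compact when representing multiple neural Signed Distance Functions (SDFs). Existing methods that handle multiple SDFs are limited by the information capacity of the latent space and lose geometry details, which degrades reconstruction quality. The key to the solution is to combine the advantages of generalization-based and overfitting-based learning strategies, so that high-fidelity geometry details are preserved with more compact latent codes in a shared latent space; a novel sampling strategy additionally improves training efficiency and eliminates artifacts caused by interference from other SDFs.
Link: https://arxiv.org/abs/2511.14539
Authors: Qiang Bai,Bojian Wu,Xi Yang,Zhizhong Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a poster paper at the AAAI Conference on Artificial Intelligence (AAAI-26)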
Abstract:Neural signed distance functions (SDFs) have become a vital way to represent 3D shapes or scenes with neural networks. An SDF is an implicit function that can query signed distances at specific coordinates for recovering a 3D surface. Although implicit functions work well on a single shape or scene, they pose obstacles when analyzing multiple SDFs with high-fidelity geometry details, due to the limited information encoded in the latent space for SDFs and the loss of geometry details. To overcome these obstacles, we introduce a method to represent multiple SDFs in a common space, aiming to recover more high-fidelity geometry details with more compact latent representations. Our key idea is to take full advantage of the benefits of generalization-based and overfitting-based learning strategies, which manage to preserve high-fidelity geometry details with compact latent codes. Based on this framework, we also introduce a novel sampling strategy to sample training queries. The sampling can improve the training efficiency and eliminate artifacts caused by the influence of other SDFs. We report numerical and visual evaluations on widely used benchmarks to validate our designs and show advantages over the latest methods in terms of the representative ability and compactness.
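For readers new to SDFs, the "query a signed distance at a coordinate" idea is easiest to see with an analytic shape. The sketch below uses a sphere as a stand-in; a neural SDF would replace this closed-form function with a network, typically conditioned on a latent code. The function name and default values are illustrative only.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius


print(sphere_sdf((0.0, 0.0, 0.0)))               # centre of the unit sphere
print(sphere_sdf((1.0, 0.0, 0.0)))               # on the surface
print(sphere_sdf((0.0, 3.0, 4.0), radius=2.0))   # outside a radius-2 sphere
```

The surface is the zero level set of the function, which is what surface-extraction methods recover from sampled queries.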
[CV-27] DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation
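Quick read: This paper addresses the redundant latent representations that arise because existing video Variational Autoencoders (Video VAEs) ignore the content similarity between frames. The key to the solution is the decoupled VAE (DeCo-VAE), which explicitly decomposes video content into three separate components (keyframe, motion, and residual), designs a dedicated encoder for each component to learn its own latent representation, and uses a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. In addition, a decoupled adaptation strategy freezes some encoders while training the others sequentially, giving stable and accurate learning of both static and dynamic features.
Link: https://arxiv.org/abs/2511.14530
Authors: Xiangchen Yin,Jiahui Yuan,Zhangchi Hu,Wenzhang Sun,Jie Chen,Xiaozhen Qiao,Hao Li,Xiaoyan Sun
Affiliations: University of Science and Technology of China (中国科学技术大学); Li Auto Inc. (理想汽车)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: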
Abstract:Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.
[CV-28] A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and Enhancement
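Quick read: This paper addresses the limited model performance in underwater image restoration and enhancement caused by the scarcity of high-quality paired datasets. Existing benchmarks usually take subjectively selected outputs of enhancement algorithms as reference images; these lack globally consistent color and authentic supervision, which limits color restoration, enhancement, and generalization. The key to the solution is a generative data framework based on unpaired image-to-image translation that uses clear in-air natural images as unambiguous reference targets and translates them into 6 representative underwater degradation types, building a large-scale synthetic dataset with precise labels that provides authentic supervision for learning an accurate mapping from degraded underwater images to the original scene appearance.
Link: https://arxiv.org/abs/2511.14521
Authors: Yufeng Tian,Yifan Chen,Zhe Sun,Libang Chen,Mingyu Dou,Jijun Lu,Ye Zheng,Xuelong Li
Affiliations: Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信); Department of Automation, Shanghai Jiao Tong University(上海交通大学); School of Artificial Intelligence, Optics and ElectroNics (iOPEN), Northwestern Polytechnical University(西北工业大学); Guangdong Provincial Key Laboratory of Quantum Metrology and Sensing, School of Physics and Astronomy, Sun Yat-Sen University (Zhuhai Campus)(中山大学珠海校区); Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences(中国科学院西安光学精密机械研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication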
Abstract:Underwater image restoration and enhancement are crucial for correcting color distortion and restoring image details, thereby establishing a fundamental basis for subsequent underwater visual tasks. However, current deep learning methodologies in this area are frequently constrained by the scarcity of high-quality paired datasets. Since it is difficult to obtain pristine reference labels in underwater scenes, existing benchmarks often rely on manually selected results from enhancement algorithms, providing debatable reference images that lack globally consistent color and authentic supervision. This limits the model’s capabilities in color restoration, image enhancement, and generalization. To overcome this limitation, we propose using in-air natural images as unambiguous reference targets and translating them into underwater-degraded versions, thereby constructing synthetic datasets that provide authentic supervision signals for model learning. Specifically, we establish a generative data framework based on unpaired image-to-image translation, producing a large-scale dataset that covers 6 representative underwater degradation types. The framework constructs synthetic datasets with precise ground-truth labels, which facilitate the learning of an accurate mapping from degraded underwater images to their pristine scene appearances. Extensive quantitative and qualitative experiments across 6 representative network architectures and 3 independent test sets show that models trained on our synthetic data achieve comparable or superior color restoration and generalization performance to those trained on existing benchmarks. This research provides a reliable and scalable data-driven solution for underwater image restoration and enhancement. The generated dataset is publicly available at: this https URL.
[CV-29] D-PerceptCT: Deep Perceptual Enhancement for Low-Dose CT Images
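Quick read: This paper addresses the loss of image quality in Low Dose Computed Tomography (LDCT) caused by the reduced radiation dose, and in particular the tendency of existing enhancement methods to over-smooth or overestimate noise, losing critical anatomical structures and pathological details. The key to the solution is D-PerceptCT, a new architecture inspired by the Human Visual System (HVS). It has two core modules: the Visual Dual-path Extractor (ViDex), which fuses semantic priors from a pretrained DINOv2 model with local spatial features for semantic-aware enhancement, and a Global-Local State-Space block, which captures long-range dependencies and multiscale features to preserve diagnostically relevant structures and fine textures. A Deep Perceptual Relevancy Loss Function (DPRLF), designed around human contrast sensitivity, further strengthens the preservation of perceptually important features, improving the visibility and usability of LDCT images for clinical diagnosis.
Link: https://arxiv.org/abs/2511.14518
Authors: Taifour Yousra Nabila,Azeddine Beghdadi,Marie Luong,Zuheng Ming,Habib Zaidi,Faouzi Alaya Cheikh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: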
Abstract:Low Dose Computed Tomography (LDCT) is widely used as an imaging solution to aid diagnosis and other clinical tasks. However, this comes at the price of a deterioration in image quality due to the low dose of radiation used to reduce the risk of secondary cancer development. While some efficient methods have been proposed to enhance LDCT quality, many overestimate noise and perform excessive smoothing, leading to a loss of critical details. In this paper, we introduce D-PerceptCT, a novel architecture inspired by key principles of the Human Visual System (HVS) to enhance LDCT images. The objective is to guide the model to enhance or preserve perceptually relevant features, thereby providing radiologists with CT images where critical anatomical structures and fine pathological details are perceptually visible. D-PerceptCT consists of two main blocks: (1) a Visual Dual-path Extractor (ViDex), which integrates semantic priors from a pretrained DINOv2 model with local spatial features, allowing the network to incorporate semantic awareness during enhancement; and (2) a Global-Local State-Space block that captures long-range information and multiscale features to preserve the important structures and fine details for diagnosis. In addition, we propose a novel deep perceptual loss, designated as the Deep Perceptual Relevancy Loss Function (DPRLF), which is inspired by human contrast sensitivity, to further emphasize perceptually important features. Extensive experiments on the Mayo2016 dataset demonstrate the effectiveness of the D-PerceptCT method for LDCT enhancement, showing better preservation of structural and textural information within LDCT images compared to SOTA methods.
[CV-30] IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
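Quick read: This paper addresses the difficulty of balancing lightweight design and high performance in speech enhancement (SE) on resource-constrained devices. The state-of-the-art MUSE model is already compact at only 0.51M parameters but still has efficiency bottlenecks: its Multi-path Enhanced Taylor (MET) module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of Taylor-expansion-based attention, and the offset calculation in its Deformable Embedding (DE) module adds computational overhead. The key to the solution is the IMSE network, which makes two core changes: (1) it replaces the MET module with Amplitude-Aware Linear Attention (MALA), which explicitly preserves the norm information of query vectors to correct the amplitude-ignoring behaviour of conventional linear attention, modeling global dependencies efficiently without an auxiliary compensation branch; (2) it replaces the DE module with Inception Depthwise Convolution (IDConv), which borrows the Inception structure to decompose large-kernel operations into parallel square, horizontal-strip, and vertical-strip branches, markedly reducing parameter redundancy while still capturing spectrogram features. Experiments show that IMSE reduces parameters by 16.8% while keeping a PESQ score (3.373) comparable to the state of the art, setting a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
Link: https://arxiv.org/abs/2511.14515
Authors: Xinxin Tang,Bin Qin,Yufang Li
Affiliations: Shenzhen University (深圳大学)
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: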
Abstract:Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) tasks on resource-constrained devices. Existing state-of-the-art methods, such as MUSE, have established a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex “approximate-compensate” mechanism to mitigate the limitations of Taylor-expansion-based attention, while the offset calculation for deformable embedding introduces additional computational burden. This paper proposes IMSE, a systematically optimized and ultra-lightweight network. We introduce two core innovations: 1) Replacing the MET module with Amplitude-Aware Linear Attention (MALA). MALA fundamentally rectifies the “amplitude-ignoring” problem in linear attention by explicitly preserving the norm information of query vectors in the attention calculation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv borrows the Inception concept, decomposing large-kernel operations into efficient parallel branches (square, horizontal, and vertical strips), thereby capturing spectrogram features with extremely low parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset demonstrate that, compared to the MUSE baseline, IMSE significantly reduces the parameter count by 16.8% (from 0.513M to 0.427M) while achieving competitive performance comparable to the state-of-the-art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
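The reported 16.8% reduction follows directly from the two parameter counts in the abstract, and the appeal of strip kernels can be seen with a simple weight count. In the sketch below the kernel sizes (an 11 x 11 kernel versus a 3 x 3 square plus 1 x 11 and 11 x 1 strips) are hypothetical and not taken from the paper, and the count assumes a channel carries all three branches, which is an upper bound if channels are split across branches.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


def depthwise_weights_full(k):
    """Weights per channel of a single k x k depthwise kernel."""
    return k * k


def depthwise_weights_strips(k, square=3):
    """Weights per channel for a square kernel plus 1 x k and k x 1 strips."""
    return square * square + 2 * k


# Parameter counts in millions, as stated in the abstract
reduction = relative_reduction(0.513, 0.427)
print(f"{reduction:.1f}% fewer parameters")

# Hypothetical kernel size k = 11, chosen only for illustration
print(depthwise_weights_full(11), "vs", depthwise_weights_strips(11))
```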
[CV-31] Parameter Aware Mamba Model for Multi-task Dense Prediction
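Quick read: This paper addresses insufficient modeling of inter-task interactions in multi-task dense prediction; existing methods rely mainly on convolutional layers and attention mechanisms to explore task-level relations and struggle to capture the intrinsic connections between complex tasks efficiently. The key to the solution is a new decoder-based framework, the Parameter Aware Mamba Model (PAMM), which uses the rich, scalable parameters of the State Space Model (SSM) to strengthen task interconnectivity. Dual state space parameter experts integrate and set task-specific parameter priors to capture the intrinsic properties of each task, and the structured state space sequence model (S4) integrates the task priors globally. In addition, Multi-Directional Hilbert Scanning builds multi-angle feature sequences, improving the sequence model's perception of 2D data.
Link: https://arxiv.org/abs/2511.14503
Authors: Xinzhuo Yu,Yunzhi Zhuge,Sitong Gong,Lu Zhang,Pingping Zhang,Huchuan Lu
Affiliations: Dalian University of Technology (大连理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Transactions on Cybernetics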
Abstract:Understanding the inter-relations and interactions between tasks is crucial for multi-task dense prediction. Existing methods predominantly utilize convolutional layers and attention mechanisms to explore task-level interactions. In this work, we introduce a novel decoder-based framework, Parameter Aware Mamba Model (PAMM), specifically designed for dense prediction in multi-task learning setting. Distinct from approaches that employ Transformers to model holistic task relationships, PAMM leverages the rich, scalable parameters of state space models to enhance task interconnectivity. It features dual state space parameter experts that integrate and set task-specific parameter priors, capturing the intrinsic properties of each task. This approach not only facilitates precise multi-task interactions but also allows for the global integration of task priors through the structured state space sequence model (S4). Furthermore, we employ the Multi-Directional Hilbert Scanning method to construct multi-angle feature sequences, thereby enhancing the sequence model’s perceptual capabilities for 2D data. Extensive experiments on the NYUD-v2 and PASCAL-Context benchmarks demonstrate the effectiveness of our proposed method. Our code is available at this https URL.
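Hilbert scanning flattens a 2D feature map into a 1D sequence while keeping consecutive sequence elements spatially adjacent, which is why it suits sequence models applied to images. The sketch below implements the standard index-to-coordinate conversion for a Hilbert curve; it is a generic illustration, and how PAMM derives its multiple scan directions is not specified in the abstract.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the current sub-square before rotating it
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(order):
    """Return all grid cells in Hilbert-curve order."""
    side = 1 << order
    return [hilbert_d2xy(order, d) for d in range(side * side)]


path = hilbert_scan(3)  # 8 x 8 grid
print(path[:6])
```

Additional scan directions could be obtained, for example, by reversing `path` or swapping the coordinates of each cell; that choice is an assumption here rather than the paper's definition.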
[CV-32] Enhancing End-to-End Autonomous Driving with Risk Semantic Distillaion from VLM
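Quick read: This paper addresses the limited generalization of current autonomous driving (AD) systems to unseen driving scenarios or unfamiliar sensor inputs. Existing methods based on Vision-Language Models (VLMs) support few-shot or zero-shot tasks but introduce a hybrid AD architecture that leads to inconsistent trajectory planning, while End-to-End (E2E) Vision-Language-Action (VLA) frameworks incur excessive computational cost. The key to the solution is the Risk Semantic Distillation (RSD) framework, whose core is RiskHead, a plug-in module that distills the causal risk estimates produced by a VLM into Bird's-Eye-View (BEV) features. This yields an interpretable risk attention and lets the BEV features learn richer, more nuanced risk-aware representations, improving the model's ability to recognize and respond to spatial boundaries and high-risk regions, strengthening overall perception and planning, and bringing behaviour closer to human driving.
Link: https://arxiv.org/abs/2511.14499
Authors: Jack Qin,Zhitao Wang,Yinan Zheng,Keyu Chen,Yang Zhou,Yuanxin Zhong,Siyuan Cheng
Affiliations: Tsinghua University (清华大学); 2012 Laboratories, Huawei Technologies (华为技术有限公司)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: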
Abstract:Autonomous driving (AD) systems have exhibited remarkable performance in complex driving scenarios. However, generalization, the ability to handle unseen scenarios or unfamiliar sensor inputs, is still a key limitation of current systems. Existing works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions from the VLM directly. However, these end-to-end solutions have prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk attention. This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky regions. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential to navigate in complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.
[CV-33] Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation
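Quick read: This paper addresses delayed monitoring in smallholder farm boundary delineation, which arises because high-resolution (HR) remote sensing imagery is acquired infrequently (e.g., annually). It also targets two shortcomings of prior work: existing reference-based super-resolution (Ref-SR) methods over-smooth details and cannot meet large scale-factor requirements, and two-step approaches that run SR followed by segmentation do not fuse multi-source satellite data effectively. The key to the solution is a new method, SEED-SR, which combines conditional latent diffusion models with large-scale multi-spectral, multi-source geospatial foundation models and performs super-resolution directly in a segmentation-aware latent space rather than as an explicit SR task in pixel space. This enables scale factors of up to 20× and relative improvements of up to 25.5% and 12.9% in instance and semantic segmentation, respectively.
Link: https://arxiv.org/abs/2511.14481
Authors: Aditi Agarwal,Anjali Jain,Nikita Saxena,Ishan Deshpande,Michal Kazmierski,Abigail Annkah,Nadav Sherman,Karthikeyan Shanmugam,Alok Talekar,Vaibhav Rajan
Affiliations: Google DeepMind(谷歌深度思维); Google(谷歌); Google Research(谷歌研究)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: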
Abstract:Delineating farm boundaries through segmentation of satellite images is a fundamental step in many agricultural applications. The task is particularly challenging for smallholder farms, where accurate delineation requires the use of high resolution (HR) imagery, which is available only at low revisit frequencies (e.g., annually). To support more frequent (sub-)seasonal monitoring, HR images could be combined as references (ref) with low resolution (LR) images, which have higher revisit frequency (e.g., weekly), using reference-based super-resolution (Ref-SR) methods. However, current Ref-SR methods optimize perceptual quality and smooth over crucial features needed for downstream tasks, and are unable to meet the large scale-factor requirements for this task. Further, previous two-step approaches of SR followed by segmentation do not effectively utilize diverse satellite sources as inputs. We address these problems through a new approach, SEED-SR, which uses a combination of conditional latent diffusion models and large-scale multi-spectral, multi-source geo-spatial foundation models. Our key innovation is to bypass the explicit SR task in the pixel space and instead perform SR in a segmentation-aware latent space. This unique approach enables us to generate segmentation maps at an unprecedented 20× scale factor, and rigorous experiments on two large, real datasets demonstrate up to 25.5% and 12.9% relative improvement in instance and semantic segmentation metrics respectively over approaches based on state-of-the-art Ref-SR methods.
[CV-34] 2D Gaussians Spatial Transport for Point-supervised Density Regression AAAI
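Quick read: This paper addresses the efficiency and accuracy of transporting probability measures between the image coordinate space and the annotation map in computer vision tasks, in particular the low training efficiency of conventional Optimal Transport (OT) methods, which compute the transport plan iteratively. The key to the solution is the Gaussian Spatial Transport (GST) framework, which uses Gaussian Splatting to estimate pixel-annotation correspondence, derives the transport plan from Bayesian probability, and builds a loss function, directly usable in standard network optimization, that measures the distribution discrepancy after transport. Because no transport plan has to be solved iteratively during training, computation is markedly more efficient, while the method remains effective on tasks such as crowd counting and landmark detection.
Link: https://arxiv.org/abs/2511.14477
Authors: Miao Shang,Xiaopeng Hong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, accepted by AAAI, 2026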
Abstract:This paper introduces Gaussian Spatial Transport (GST), a novel framework that leverages Gaussian splatting to facilitate transport from the probability measure in the image coordinate space to the annotation map. We propose a Gaussian splatting-based method to estimate pixel-annotation correspondence, which is then used to compute a transport plan derived from Bayesian probability. To integrate the resulting transport plan into standard network optimization in typical computer vision tasks, we derive a loss function that measures discrepancy after transport. Extensive experiments on representative computer vision tasks, including crowd counting and landmark detection, validate the effectiveness of our approach. Compared to conventional optimal transport schemes, GST eliminates iterative transport plan computation during training, significantly improving efficiency. Code is available at this https URL.
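A transport plan "derived from Bayesian probability" can be pictured as a posterior over annotation points for each pixel, computed in closed form rather than by iterative optimization. The sketch below is only an illustration of that idea under strong assumptions (isotropic Gaussians with a shared `sigma`, a uniform prior over annotations); it is not the GST formulation, and all names are hypothetical.

```python
import math


def posterior_plan(pixels, points, sigma=1.0):
    """Row i holds P(annotation j | pixel i) under isotropic Gaussians and a uniform prior."""
    plan = []
    for px, py in pixels:
        sq_dists = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in points]
        nearest = min(sq_dists)  # subtract before exponentiating for stability
        likelihoods = [
            math.exp(-(d - nearest) / (2.0 * sigma ** 2)) for d in sq_dists
        ]
        total = sum(likelihoods)
        plan.append([v / total for v in likelihoods])
    return plan


pixels = [(0, 0), (5, 5), (10, 10)]
annotations = [(0, 0), (10, 10)]  # e.g. two head positions in crowd counting
plan = posterior_plan(pixels, annotations, sigma=1.0)
for row in plan:
    print(row)
```

Each row sums to one, so the mass predicted at a pixel is split across annotations in a single pass, with no inner solver loop.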
[CV-35] Learning Subglacial Bed Topography from Sparse Radar with Physics-Guided Residuals
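Quick read: This paper addresses the insufficient accuracy of subglacial bed topography in ice sheet models, which stems from sparse and unevenly distributed radar observations. To improve both accuracy and physical plausibility, the authors propose a physics-guided residual learning framework whose core idea is to take the BedMachine model as a prior and have a deep neural network predict thickness residuals over it, from which a more accurate bed is reconstructed. Key ingredients are a DeepLabV3+ architecture with a ResNet-50 encoder; lightweight physics terms including multi-scale mass conservation, flow-aligned total variation, Laplacian damping, and a non-negativity constraint on thickness; and a masked Huber loss weighted by a confidence map, which together support strong generalization in real-world settings. Experiments on two Greenland sub-regions show that the method outperforms U-Net, Attention U-Net, FPN, and a plain CNN, producing spatially coherent and physically plausible bed reconstructions with high structural fidelity.
Link: https://arxiv.org/abs/2511.14473
Authors: Bayu Adhi Tama,Jianwu Wang,Vandana Janeja,Mostafa Cham
Affiliations: iHARP, University of Maryland Baltimore County (UMBC); Department of Information Systems, University of Maryland Baltimore County (UMBC)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: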
Abstract:Accurate subglacial bed topography is essential for ice sheet modeling, yet radar observations are sparse and uneven. We propose a physics-guided residual learning framework that predicts bed thickness residuals over a BedMachine prior and reconstructs the bed from the observed surface. A DeepLabV3+ decoder over a standard encoder (e.g., ResNet-50) is trained with lightweight physics and data terms: multi-scale mass conservation, flow-aligned total variation, Laplacian damping, non-negativity of thickness, a ramped prior-consistency term, and a masked Huber fit to radar picks modulated by a confidence map. To measure real-world generalization, we adopt leakage-safe blockwise hold-outs (vertical/horizontal) with safety buffers and report metrics only on held-out cores. Across two Greenland sub-regions, our approach achieves strong test-core accuracy and high structural fidelity, outperforming U-Net, Attention U-Net, FPN, and a plain CNN. The residual-over-prior design, combined with physics, yields spatially coherent, physically plausible beds suitable for operational mapping under domain shift.
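Two pieces of this abstract lend themselves to a small worked sketch: the residual-over-prior reconstruction and the masked, confidence-weighted Huber fit. The code below is a simplified illustration with hypothetical names and values. In particular, clamping thickness at zero is a stand-in assumption; the paper lists non-negativity as a training term, not a hard clamp.

```python
def huber(residual, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def masked_huber_loss(pred, target, mask, confidence, delta=1.0):
    """Confidence-weighted mean Huber penalty over cells that have radar picks."""
    num = 0.0
    den = 0.0
    for p, t, m, c in zip(pred, target, mask, confidence):
        if m:
            num += c * huber(p - t, delta)
            den += c
    return num / den if den > 0 else 0.0


def reconstruct_bed(surface, prior_thickness, residual):
    """Bed elevation = surface elevation minus (prior thickness + predicted residual)."""
    thickness = max(0.0, prior_thickness + residual)  # simplification: hard clamp
    return surface - thickness


# Toy values in metres (illustrative only)
print(reconstruct_bed(surface=1000.0, prior_thickness=800.0, residual=-50.0))
print(masked_huber_loss([1.0, 5.0], [1.5, 0.0], mask=[1, 0], confidence=[1.0, 1.0]))
```

Masking restricts the data term to cells with radar observations, so the physics terms carry the prediction everywhere else.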
[CV-36] CompEvent: Complex-valued Event-RGB Fusion for Low-light Video Enhancement and Deblurring
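Quick read: This paper addresses low-light video deblurring, a problem that is especially prominent in applications such as nighttime surveillance and autonomous driving, where the main challenge is motion blur from the long exposures that insufficient light requires. Existing fusion methods mostly process event camera data and RGB frames in stages and cope poorly with the combined degradation of low light and motion blur. The key to the solution is the CompEvent framework, with two core components: (1) a Complex Temporal Alignment GRU, which uses complex-valued convolutions and iterative GRU processing to temporally align and continuously fuse the video and event streams; and (2) a Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in the spatial and frequency domains so that spatial structures and system-level characteristics jointly drive deep modality fusion. By exploiting the holistic representation capability of complex-valued neural networks throughout the pipeline, the method achieves spatiotemporal cross-modal fusion that maximizes complementary learning and significantly improves low-light video deblurring.
Link: https://arxiv.org/abs/2511.14469
Authors: Mingchen Zhong,Xin Lu,Dong Li,Senyan Xu,Ruixuan Jiang,Xueyang Fu,Baocai Yin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: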
Abstract:Low-light video deblurring poses significant challenges in applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. While event cameras offer potential solutions with superior low-light sensitivity and high temporal resolution, existing fusion methods typically employ staged strategies, limiting their effectiveness against combined low-light and motion blur degradations. To overcome this, we propose CompEvent, a complex neural network framework enabling holistic full-process fusion of event data and RGB frames for enhanced joint restoration. CompEvent features two core components: 1) Complex Temporal Alignment GRU, which utilizes complex-valued convolutions and processes video and event streams iteratively via GRU to achieve temporal alignment and continuous fusion; and 2) Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in both spatial and frequency domains, facilitating deep fusion through spatial structures and system-level characteristics. By leveraging the holistic representation capability of complex-valued neural networks, CompEvent achieves full-process spatiotemporal fusion, maximizes complementary learning between modalities, and significantly strengthens low-light video deblurring capability. Extensive experiments demonstrate that CompEvent outperforms SOTA methods in addressing this challenging task. The code is available at this https URL.
[CV-37] DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval
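[Quick Read]: This paper addresses text-to-image retrieval (TIR) in interactive dialog settings, that is, how to home in on the image a user has in mind over multiple dialog turns. Conventional single-query methods cope poorly with ambiguous or complex requests, leading to low retrieval accuracy and little controllability. The core of the solution is the DIR-TIR framework, which refines the search iteratively through two cooperating specialized modules: a Dialog Refiner Module, which actively queries the user to extract key information and generate more precise descriptions, and an Image Refiner Module, which identifies perceptual gaps between generated images and the user's intent and narrows the visual-semantic discrepancy accordingly. This dual-path iterative mechanism significantly improves retrieval precision and the interactive experience, achieving a higher hit rate and better fault tolerance than initial-description-only baselines.
Link: https://arxiv.org/abs/2511.14449
Authors: Zongwei Zhen,Biqing Zeng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: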
Abstract:This paper addresses the task of interactive, conversational text-to-image retrieval. Our DIR-TIR framework progressively refines the target image search through two specialized modules: the Dialog Refiner Module and the Image Refiner Module. The Dialog Refiner actively queries users to extract essential information and generate increasingly precise descriptions of the target image. Complementarily, the Image Refiner identifies perceptual gaps between generated images and user intentions, strategically reducing the visual-semantic discrepancy. By leveraging multi-turn dialogues, DIR-TIR provides superior controllability and fault tolerance compared to conventional single-query methods, significantly improving target image hit accuracy. Comprehensive experiments across diverse image datasets demonstrate our dialogue-based approach substantially outperforms initial-description-only baselines, while the synergistic module integration achieves both higher retrieval precision and enhanced interactive experience. 
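[CV-38] Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
[Quick Read]: This paper addresses the shortfall in combined visual recognition and complex reasoning in video understanding. In particular, existing Vision-Language Models (VLMs) mostly process videos in a single pass, with limited ability to revisit evidence or refine iteratively, while emerging agent-based methods either rely on expensive proprietary models or require extensive reinforcement learning (RL) training. The key to the solution is Agentic Video Intelligence (AVI), a flexible, training-free framework whose core innovations are: (1) a three-phase reasoning process inspired by human cognition (Retrieve-Perceive-Review) that balances global exploration with focused local analysis; (2) a structured video knowledge base organized through entity graphs, together with multi-granularity integrated tools, which form the agent's interaction environment; and (3) an open-source model ensemble combining reasoning LLMs, lightweight base computer vision (CV) models, and a VLM, which removes the dependence on proprietary APIs or RL training. The framework achieves competitive performance on several long-video understanding benchmarks while offering better interpretability.
Link: https://arxiv.org/abs/2511.14446
Authors: Hong Gao,Yiming Bao,Xuezhen Tu,Yutong Xu,Yue Jin,Yiyang Mu,Bin Zhong,Linan Yue,Min-Ling Zhang
Affiliations: Southeast University; ZTE Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: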
Abstract:Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent’s interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
[CV-39] Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
[Quick Read]: This paper addresses the limited robustness and biological plausibility of current visual intelligence models, asking how simulating the staged character of early infant visual development can improve the generalization of machine vision systems and their adaptation to complex real-world scenes. The key to the solution is CATDiet, a self-supervised learning training strategy that simulates newborn vision through three constraints: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T). Trained only on object-centric videos, the resulting models are more robust and show infant-like neural plasticity and behavioral patterns (such as visual cliff responses). Building on this, CombDiet initializes with CATDiet before standard training while preserving temporal continuity, and improves in-domain and out-of-domain object recognition and depth perception, supporting early visual development as an effective reverse-engineering framework.
Link: https://arxiv.org/abs/2511.14440
Authors: Yusen Cai,Bhargava Satya Nunna,Qing Lin,Mengmi Zhang
Affiliations: Nanyang Technological University; Indian Institute of Technology Madras
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged “visual diets”, we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants’ visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
[CV-40] Cranio-ID: Graph-Based Craniofacial Identification via Automatic Landmark Annotation in 2D Multi-View X-rays
[Quick Read]: This paper addresses the fact that, in forensic craniofacial identification and biomedical applications, traditional methods for locating craniometric landmarks are time-consuming and depend on expert knowledge, while existing automatic annotation methods based on superimposition and deep learning are unreliable for lack of large-scale validation. The key to the solution is a new framework named Cranio-ID: first, trained YOLO-pose models automatically annotate landmarks on 2D skull X-ray images and their corresponding optical face images; second, the landmarks are represented as graphs, and semantic correspondences between the two modalities are found with cross-attention and an optimal transport framework, improving the accuracy and robustness of cross-domain skull-to-face and sketch-to-face matching.
Link: https://arxiv.org/abs/2511.14411
Authors: Ravi Shankar Prasad,Nandani Sharma,Dinesh Singh
Affiliations: Indian Institute of Technology Mandi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 6 figures
Abstract:In forensic craniofacial identification and in many biomedical applications, craniometric landmarks are important. Traditional methods for locating landmarks are time-consuming and require specialized knowledge and expertise. Current methods utilize superimposition and deep learning-based methods that employ automatic annotation of landmarks. However, these methods are not reliable due to insufficient large-scale validation studies. In this paper, we propose a novel framework, Cranio-ID. First, it automatically annotates landmarks on 2D skulls (which are X-ray scans of faces) and their respective optical images using our trained YOLO-pose models. Second, it performs cross-modal matching by formulating these landmarks as graph representations and then finding semantic correspondences between the graphs of the two modalities using cross-attention and an optimal transport framework. Our proposed framework is validated on the S2F and CUHK datasets (the CUHK dataset resembles the S2F dataset). Extensive experiments have been conducted to evaluate the performance of our proposed framework, which demonstrates significant improvements in both reliability and accuracy, as well as its effectiveness in cross-domain skull-to-face and sketch-to-face matching in forensic science.
[CV-41] Language as an Anchor: Preserving Relative Visual Geometry for Domain Incremental Learning
[Quick Read]: This paper addresses the central challenge of Domain Incremental Learning (DIL): learning continually under shifting distributions while retaining knowledge from previous domains. Existing methods face a dilemma: mapping all domains into a single visual space causes inter-domain interference and semantic distortion, whereas isolating domain-specific parameters fragments knowledge into "knowledge islands" and exacerbates forgetting. The key to the solution is LAVA (Language-Anchored Visual Alignment), which replaces direct feature alignment with relative alignment driven by a text-based reference anchor, so that the visual representations of each new domain keep a consistent relative geometry defined by the pairwise semantic similarities between class names. This builds a bridge across domains that supports the retrieval of class-aware prior knowledge and robust feature aggregation.
Link: https://arxiv.org/abs/2511.14401
Authors: Shuyi Geng,Tao Zhou,Yi Zhou
Affiliations: Southeast University; Ministry of Education; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:A key challenge in Domain Incremental Learning (DIL) is to continually learn under shifting distributions while preserving knowledge from previous domains. Existing methods face a fundamental dilemma. On one hand, projecting all domains into a single unified visual space leads to inter-domain interference and semantic distortion, as large shifts may vary with not only visual appearance but also underlying semantics. On the other hand, isolating domain-specific parameters causes knowledge fragmentation, creating “knowledge islands” that hamper knowledge reuse and exacerbate forgetting. To address this issue, we propose LAVA (Language-Anchored Visual Alignment), a novel DIL framework that replaces direct feature alignment with relative alignment driven by a text-based reference anchor. LAVA guides the visual representations of each incoming domain to preserve a consistent relative geometry, which is defined by mirroring the pairwise semantic similarities between the class names. This anchored geometric structure acts as a bridge across domains, enabling the retrieval of class-aware prior knowledge and facilitating robust feature aggregation. Extensive experiments on standard DIL benchmarks demonstrate that LAVA achieves significant performance improvements over state-of-the-art methods. Code is available at this https URL.
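The idea of preserving a "relative geometry" defined by pairwise similarities between class names can be sketched as a simple loss. This is an illustrative assumption, not LAVA's published objective: it compares the cosine-similarity matrix of per-class visual prototypes with that of the class-name text embeddings and penalizes the mean squared difference over class pairs. The function names and the use of class prototypes are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """All pairwise cosine similarities within one set of vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def relative_geometry_loss(visual_prototypes, text_anchors):
    """Mean squared gap between visual and text pairwise similarities.

    Both inputs hold one vector per class, in the same class order; the two
    spaces may have different dimensions because only within-space
    similarities are compared. Requires at least two classes.
    """
    sim_visual = similarity_matrix(visual_prototypes)
    sim_text = similarity_matrix(text_anchors)
    n = len(sim_visual)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum((sim_visual[i][j] - sim_text[i][j]) ** 2 for i, j in pairs) / len(pairs)
```

Because only within-space similarities enter the loss, the visual features are never forced onto the text embeddings themselves, which is the distinction the abstract draws between relative and direct alignment.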
[CV-42] Stage Aware Diagnosis of Diabetic Retinopathy via Ordinal Regression
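[Quick Read]: This paper addresses the early detection and grading of Diabetic Retinopathy (DR), so that timely intervention can prevent irreversible vision damage. The key to the solution is an ordinal-regression-based DR detection framework built on the APTOS-2019 fundus image dataset, using preprocessing steps such as Green Channel (GC) extraction, noise masking, and CLAHE (Contrast Limited Adaptive Histogram Equalization) to isolate the features most relevant to DR grading. It reaches a Quadratic Weighted Kappa (QWK) of 0.8992, which the authors report as a new benchmark on the APTOS dataset, reflecting close agreement between model predictions and clinical grading.
Link: https://arxiv.org/abs/2511.14398
Authors: Saksham Kumar,D Sridhar Aditya,T Likhil Kumar,Thulasi Bikku,Srinivasarao Thota,Chandan Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Submitted to Confluence 2026, Amity University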
Abstract:Diabetic Retinopathy (DR) has emerged as a major cause of preventable blindness in recent times. With timely screening and intervention, the condition can be prevented from causing irreversible damage. The work introduces a state-of-the-art Ordinal Regression-based DR Detection framework that uses the APTOS-2019 fundus image dataset. A widely accepted combination of preprocessing methods: Green Channel (GC) Extraction, Noise Masking, and CLAHE, was used to isolate the most relevant features for DR classification. Model performance was evaluated using the Quadratic Weighted Kappa, with a focus on agreement between results and clinical grading. Our Ordinal Regression approach attained a QWK score of 0.8992, setting a new benchmark on the APTOS dataset.
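For readers unfamiliar with the Quadratic Weighted Kappa reported above, the following is a minimal pure-Python sketch of the standard definition, with integer grades `0..num_classes-1`. It is not taken from the paper; the function name is illustrative.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Agreement between two integer gradings, penalizing large disagreements more.

    Returns 1.0 for perfect agreement and about 0.0 for chance-level agreement.
    Undefined (division by zero) if either grading uses a single class only.
    """
    n = num_classes
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic penalty
            expected = hist_true[i] * hist_pred[j] / total  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected
```

The quadratic weights are what make the metric suit ordinal grades: confusing grade 0 with grade 4 costs sixteen times as much as confusing adjacent grades.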
[CV-43] Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning AAAI2026
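[Quick Read]: This paper addresses the performance bottleneck that compounding errors in sequential action decisions impose on Behavioral Cloning (BC), in particular the inaccurate action cloning and intermittent execution caused by physical discontinuities and semantic-physical misalignment. The key to the solution is Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a framework that anchors language semantics to visuomotor representations through bidirectional cross-attention for fine-grained semantic grounding, and uses continuous co-learning across vision, language, and proprioceptive inputs to generate stable, semantically consistent action trajectories, mitigating the semantic-physical misalignment and discontinuous execution of earlier methods.
Link: https://arxiv.org/abs/2511.14396
Authors: Xiuxiu Qi,Yu Yang,Jiannong Cao,Luyao Bai,Chongshan Fan,Chengtai Cao,Hongpeng Wang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at AAAI 2026, the Project website is available at this https URL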
Abstract:Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations by a bidirectional cross-attention to learn contextual information for action generation, successfully overcoming the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL’s generalization under unseen and noisy object states.
[CV-44] BEDLAM2.0: Synthetic Humans and Cameras in Motion NEURIPS2025
[Quick Read]: This paper addresses the challenge of inferring 3D human motion from video in world coordinates, which is especially difficult when both the human and the camera move. Traditional methods usually estimate the human only in image coordinates, which falls short of applications that need accurate motion in world coordinates. The key to the solution is a new dataset, BEDLAM2.0, which is markedly more diverse and realistic than its predecessor BEDLAM, with more complex camera motion and more varied body shapes, clothing, hair, and 3D environments, plus shoes, providing high-quality ground-truth data for training methods that estimate human motion in world coordinates. Experiments show that state-of-the-art models trained on BEDLAM2.0 are clearly more accurate than those trained on BEDLAM.
Link: https://arxiv.org/abs/2511.14394
Authors: Joachim Tesch,Giorgio Becherini,Prerana Achar,Anastasios Yiannakidis,Muhammed Kocabas,Priyanka Patel,Michael J. Black
Affiliations: Max Planck Institute for Intelligent Systems; Meshcapade GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: NeurIPS 2025 (Datasets and Benchmarks track, oral). Project website: this https URL
Abstract:Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the-art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.
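[CV-45] Enhancing LLM-based Autonomous Driving with Modular Traffic Light and Sign Recognition
[Quick Read]: This paper addresses two weaknesses of current driving agents based on Large Language Models (LLMs): they lack explicit mechanisms for enforcing traffic rules in decision-making and planning, and they struggle to reliably detect small, safety-critical objects such as traffic lights and signs. The key to the solution is TLS-Assist, a modular redundancy layer that converts traffic light and sign detections into structured natural language messages and injects them into the LLM input, forcing the model to attend to safety-critical cues and thereby improving rule compliance and perception reliability. The framework is plug-and-play and model-agnostic, and supports both single-view and multi-view camera setups. Closed-loop tests on the LangAuto benchmark in CARLA show relative driving performance improvements of up to 14% over LMDrive and 7% over BEVDriver, while consistently reducing traffic light and sign infractions.
Link: https://arxiv.org/abs/2511.14391
Authors: Fabian Schmidt,Noushiq Mohammed Kayilan Abdul Nazar,Markus Enzweiler,Abhinav Valada
Affiliations: Institute for Intelligent Systems, Esslingen University of Applied Sciences, Germany; Department of Computer Science, University of Freiburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: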
Abstract:Large Language Models (LLMs) are increasingly used for decision-making and planning in autonomous driving, showing promising reasoning capabilities and potential to generalize across diverse traffic situations. However, current LLM-based driving agents lack explicit mechanisms to enforce traffic rules and often struggle to reliably detect small, safety-critical objects such as traffic lights and signs. To address this limitation, we introduce TLS-Assist, a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition. TLS-Assist converts detections into structured natural language messages that are injected into the LLM input, enforcing explicit attention to safety-critical cues. The framework is plug-and-play, model-agnostic, and supports both single-view and multi-view camera setups. We evaluate TLS-Assist in a closed-loop setup on the LangAuto benchmark in CARLA. The results demonstrate relative driving performance improvements of up to 14% over LMDrive and 7% over BEVDriver, while consistently reducing traffic light and sign infractions. We publicly release the code and models on this https URL.
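The step of converting detections into structured natural language messages that are injected into the LLM input might look roughly like the sketch below. The `Detection` fields, the message wording, the confidence threshold, and the `[Safety cues]` tag are all hypothetical; the abstract does not specify TLS-Assist's actual message format.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    kind: str          # "traffic_light" or "sign" (assumed label set)
    state: str         # e.g. "red", "green", "stop", "speed limit 30"
    distance_m: float  # estimated distance to the object in meters
    confidence: float  # detector score in [0, 1]


def detections_to_message(detections, min_confidence=0.5):
    """Turn detector outputs into one short sentence per object, nearest first."""
    lines = []
    for d in sorted(detections, key=lambda d: d.distance_m):
        if d.confidence < min_confidence:
            continue  # drop low-confidence detections (assumed policy)
        label = "Traffic light" if d.kind == "traffic_light" else "Traffic sign"
        lines.append(f"{label} ahead at {d.distance_m:.0f} m: {d.state}.")
    return " ".join(lines) if lines else "No traffic lights or signs detected."


def build_prompt(instruction, detections):
    """Append the safety message to the navigation instruction given to the LLM."""
    return f"{instruction}\n[Safety cues] {detections_to_message(detections)}"
```

Keeping the recognition module outside the LLM and passing only text is what makes such a layer plug-and-play: the driving model needs no retraining to consume the extra cue.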
[CV-46] Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving
[Quick Read]: This paper addresses the largely unexplored question of how effective Physical Adversarial Examples (PAEs) are against depth estimation models based on binocular stereo matching, noting that existing attacks in autonomous driving mostly rely on 2D patches and struggle to stay consistent and stealthy across viewpoints. The key to the solution is a 3D texture-based PAE generation method that replaces local 2D patches with a global camouflage texture, ensuring visual consistency and attack effectiveness across the different viewpoints of the stereo cameras. A new 3D stereo matching rendering module aligns the PAE with real-world positions and headings, improving its spatial accuracy in binocular vision, and a merging attack with fine-grained optimization blends the target seamlessly into the background, significantly improving stealth and attack success.
Link: https://arxiv.org/abs/2511.14386
Authors: Kangqiao Zhao,Shuo Huai,Xurui Song,Jun Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:
Abstract:Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. This attack is significantly stealthier and more lethal than existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.
[CV-47] A Quantitative Method for Shoulder Presentation Evaluation in Biometric Identity Documents
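[Quick Read]: This paper addresses the lack of quantitative methods for automatically assessing shoulder pose compliance in biometric identity documents, in particular the square presentation of the shoulders required by international standards. The key to the solution is the Shoulder Presentation Evaluation (SPE) algorithm, which quantifies shoulder yaw and roll using only the 3D coordinates of two shoulder landmarks provided by common pose estimation frameworks, enabling automatic compliance checks. Experiments show a strong correlation between SPE scores and human-assigned labels (Pearson r of about 0.80), and an adapted Error-versus-Discard analysis confirms its usefulness for filtering out non-compliant samples, making it a lightweight tool for automated compliance checking in enrolment systems.
Link: https://arxiv.org/abs/2511.14376
Authors: Alfonso Pedro Ridao
Affiliations: Technical University of Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 13 pages, 4 figures, conference or journal submission. Course project from DTU Compute, Technical University of Denmark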
Abstract:International standards for biometric identity documents mandate strict compliance with pose requirements, including the square presentation of a subject’s shoulders. However, the literature on automated quality assessment offers few quantitative methods for evaluating this specific attribute. This paper proposes a Shoulder Presentation Evaluation (SPE) algorithm to address this gap. The method quantifies shoulder yaw and roll using only the 3D coordinates of two shoulder landmarks provided by common pose estimation frameworks. The algorithm was evaluated on a dataset of 121 portrait images. The resulting SPE scores demonstrated a strong Pearson correlation (r approx. 0.80) with human-assigned labels. An analysis of the metric’s filtering performance, using an adapted Error-versus-Discard methodology, confirmed its utility in identifying non-compliant samples. The proposed algorithm is a viable lightweight tool for automated compliance checking in enrolment systems.
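To make the geometry concrete, here is a minimal sketch of how yaw and roll could be derived from two 3D shoulder landmarks. It is not the paper's SPE formula, which the abstract does not give. It assumes a camera-style coordinate frame (x to the right in the image, y downward, z away from the camera), and the 5-degree tolerance is a hypothetical value, not one taken from a standard.

```python
import math


def shoulder_yaw_roll(image_left, image_right):
    """Return (yaw, roll) in degrees from two (x, y, z) shoulder landmarks.

    image_left / image_right: the shoulder that appears on the left / right
    side of the image, so that dx > 0 for a roughly frontal pose.
    """
    dx = image_right[0] - image_left[0]
    dy = image_right[1] - image_left[1]
    dz = image_right[2] - image_left[2]
    yaw = math.degrees(math.atan2(dz, dx))   # rotation about the vertical axis
    roll = math.degrees(math.atan2(dy, dx))  # tilt of the shoulder line in the image plane
    return yaw, roll


def is_square(image_left, image_right, max_abs_degrees=5.0):
    """Hypothetical compliance check: both angles within a tolerance."""
    yaw, roll = shoulder_yaw_roll(image_left, image_right)
    return abs(yaw) <= max_abs_degrees and abs(roll) <= max_abs_degrees
```

A single scalar score, as the SPE abstract describes, could then be built from the two angles, but how the paper combines them is not stated and is left out here.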
[CV-48] Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection AAAI2026
[Quick Read]: This paper addresses motion blur in infrared unmanned aerial vehicle (UAV) target images caused by rapid sensor movement, which sharply reduces the contrast between target and background and degrades detection. Existing methods usually treat deblurring as a preprocessing step for visual quality and neglect the enhancement of detection-relevant features, so they do little to improve feature representation under blur. The key to the solution is an end-to-end Joint Feature-Domain Deblurring and Detection framework (JFD3) with a shared-weight dual-branch structure in which the clear branch guides the blurred branch toward more discriminative features. A lightweight feature restoration network uses clear-branch features as supervision; a frequency structure guidance module injects structure priors from the restoration network into shallow detection layers to enrich target structural information; and a feature consistency self-supervised loss aligns the two detection backbones so that the blurred branch approximates the clear branch's representations, improving detection while retaining real-time efficiency.
Link: https://arxiv.org/abs/2511.14371
Authors: Xiaolin Wang,Houzhang Fang,Qingshan Li,Lu Wang,Yi Chang,Luxin Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by AAAI 2026
Abstract:Infrared unmanned aerial vehicle (UAV) target images often suffer from motion blur degradation caused by rapid sensor movement, significantly reducing contrast between target and background. Generally, detection performance heavily depends on the discriminative feature representation between target and background. Existing methods typically treat deblurring as a preprocessing step focused on visual quality, while neglecting the enhancement of task-relevant features crucial for detection. Improving feature representation for detection under blur conditions remains challenging. In this paper, we propose a novel Joint Feature-Domain Deblurring and Detection end-to-end framework, dubbed JFD3. We design a dual-branch architecture with shared weights, where the clear branch guides the blurred branch to enhance discriminative feature representation. Specifically, we first introduce a lightweight feature restoration network, where features from the clear branch serve as feature-level supervision to guide the blurred branch, thereby enhancing its distinctive capability for detection. We then propose a frequency structure guidance module that refines the structure prior from the restoration network and integrates it into shallow detection layers to enrich target structural information. Finally, a feature consistency self-supervised loss is imposed between the dual-branch detection backbones, driving the blurred branch to approximate the feature representations of the clear one. We also construct a benchmark, named IRBlurUAV, containing 30,000 simulated and 4,118 real infrared UAV target images with diverse motion blur. Extensive experiments on IRBlurUAV demonstrate that JFD3 achieves superior detection performance while maintaining real-time efficiency.
[CV-49] Clinically-Validated Innovative Mobile Application for Assessing Blinking and Eyelid Movements
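[Quick Read]: This paper addresses the limitations of existing tools for objectively assessing blinking, an important physiological process: they are complex, costly, and of limited clinical applicability. The key to the solution is the development and clinical validation of Bapp, a mobile application built with the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements. Validated on 45 videos of real patients, with manual annotations by ophthalmology specialists as ground truth, Bapp achieved 98.4% precision, 96.9% recall, and 98.3% overall accuracy, supporting its use as a portable, accessible, and objective tool for continuous monitoring of normal and abnormal blinking and for postoperative evaluation.
Link: https://arxiv.org/abs/2511.14361
Authors: Gustavo Adolpho Bonesso,Carlos Marcelo Gurjão de Godoy,Tammy Hentona Osaki,Midori Hentona Osaki,Bárbara Moreira Ribeiro Trindade dos Santos,Regina Célia Coelho
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 14 pages, 8 figures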
Abstract:Blinking is a vital physiological process that protects and maintains the health of the ocular surface. Objective assessment of eyelid movements remains challenging due to the complexity, cost, and limited clinical applicability of existing tools. This study presents the clinical validation of Bapp (Blink Application), a mobile application developed using the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements. The validation occurred using 45 videos from real patients, whose blinks were manually annotated by ophthalmology specialists from the Paulista School of Medicine of the Federal University of Sao Paulo (EPM-UNIFESP) to serve as the ground truth. Bapp’s performance was evaluated using standard metrics, including Precision, Recall, and F1-Score, with results demonstrating 98.4% precision, 96.9% recall, and an overall accuracy of 98.3%. These outcomes confirm the reliability of Bapp as a portable, accessible, and objective tool for monitoring both normal and abnormal eyelid movements. The application offers a promising alternative to traditional manual blink counting, supporting continuous ocular health monitoring and postoperative evaluation in clinical environments.
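The Precision, Recall, and F1-Score metrics named in the abstract follow the standard definitions, sketched below for blink detection against expert annotations. How detected blinks are matched to annotated ones (and hence how the counts are obtained) is not described in the abstract, so the counts are simply taken as inputs here; the function name is illustrative.

```python
def blink_detection_metrics(true_positives, false_positives, false_negatives):
    """Standard precision, recall, and F1 from detection counts.

    true_positives: detected blinks that match an annotated blink.
    false_positives: detected blinks with no matching annotation.
    false_negatives: annotated blinks the application missed.
    """
    detected = true_positives + false_positives
    annotated = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / annotated if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```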
[CV-50] IBGS: Image-Based Gaussian Splatting NEURIPS2025
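[Quick Read]: This paper addresses the limitations of 3D Gaussian Splatting (3DGS) in handling spatially varying color and view-dependent effects such as specular highlights; the core problem is that the low-degree spherical harmonics used by standard 3DGS cannot model complex appearance and fine detail. The key to the solution is Image-Based Gaussian Splatting, which draws on high-resolution source images for fine details and view-specific color: each pixel's color is the combination of a base color from standard 3DGS rendering and a learned residual inferred from neighboring training images. This promotes accurate surface alignment and markedly improves the rendering of high-frequency details and view-dependent effects without increasing storage overhead.
Link: https://arxiv.org/abs/2511.14357
Authors: Hoang Chuong Nguyen,Wei Mao,Jose M. Alvarez,Miaomiao Liu
Affiliations: Australian National University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to NeurIPS 2025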
Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a fast, high-quality method for novel view synthesis (NVS). However, its use of low-degree spherical harmonics limits its ability to capture spatially varying color and view-dependent effects such as specular highlights. Existing works augment Gaussians with either a global texture map, which struggles with complex scenes, or per-Gaussian texture maps, which introduces high storage overhead. We propose Image-Based Gaussian Splatting, an efficient alternative that leverages high-resolution source images for fine details and view-specific color modeling. Specifically, we model each pixel color as a combination of a base color from standard 3DGS rendering and a learned residual inferred from neighboring training images. This promotes accurate surface alignment and enables rendering images of high-frequency details and accurate view-dependent effects. Experiments on standard NVS benchmarks show that our method significantly outperforms prior Gaussian Splatting approaches in rendering quality, without increasing the storage footprint.
[CV-51] ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
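[Quick Read]: This paper addresses chaptering, the task of structuring the content of long videos such as hour-long lectures, podcasts, and documentaries. Existing methods are limited by small-scale annotated data, typically with short and coarse labels, and struggle to capture the nuanced semantic transitions in long videos. The key to the solution is ARC-Chapter, the first large-scale video chaptering model. Its training data comes from million-scale long-video chapter annotations that are bilingual (English-Chinese), temporally grounded, and hierarchical, and a structured pipeline merges automatic speech recognition (ASR) transcripts, scene text, and visual captions into multi-level annotations ranging from short titles to long summaries. Experiments show that ARC-Chapter outperforms the previous best method by 14.0% in F1 score and 11.3% in SODA score, and it transfers well, also improving the state of the art on dense video captioning on YouCook2.
Link: https://arxiv.org/abs/2511.14349
Authors: Junfu Pu,Teng Wang,Yixiao Ge,Yuying Ge,Chen Li,Ying Shan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL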
Abstract:The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on million-scale long-video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene texts, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter shows excellent transferability, improving the state-of-the-art on downstream tasks like dense video captioning on YouCook2.
[CV-52] Silhouette-to-Contour Registration: Aligning Intraoral Scan Models with Cephalometric Radiographs
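[Quick Read]: This paper addresses the reliability of 3D-2D registration between intraoral scan (IOS) models and lateral cephalometric radiographs. In real clinical settings, projective magnification, geometric distortion, low-contrast dental crowns, and acquisition-dependent variation cause conventional intensity-based registration to fail to converge or to produce anatomically implausible alignments. The key to the solution is DentalSCR, a pose-stable, contour-guided silhouette-to-contour registration framework. It first constructs a unified cross-arch anatomical coordinate system, the U-Midline Dental Axis (UMDA), to stabilize initialization and standardize projection geometry. It then generates projections with a surface-based digitally reconstructed radiograph (DRR) formulation that uses a coronal-axis perspective and Gaussian splatting, preserving the clinical source-object-detector magnification and emphasizing external silhouettes. Finally, it optimizes a 2D similarity transform with a symmetric bidirectional Chamfer distance under a hierarchical coarse-to-fine schedule, giving both a large capture range and subpixel-level contour agreement. The method substantially reduces landmark error at the posterior teeth, tightens dispersion on the lower jaw, and achieves low Chamfer and controlled Hausdorff distances at the curve level, showing robustness on real clinical data and clinical interpretability.
Link: https://arxiv.org/abs/2511.14343
Authors: Yiyi Miao,Taoyu Wu,Ji Jiang,Tong Chen,Zhe Tang,Zhengyong Jiang,Angelos Stefanidis,Limin Yu,Jionglong Su
Affiliations: 1. University of Science and Technology of China; 2. Institute of Artificial Intelligence, University of Science and Technology of China; 3. Alibaba Group; 4. Tongyi Lab; 5. Tsinghua University; 6. Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: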
Abstract:Reliable 3D-2D alignment between intraoral scan (IOS) models and lateral cephalometric radiographs is critical for orthodontic diagnosis, yet conventional intensity-driven registration methods struggle under real clinical conditions, where cephalograms exhibit projective magnification, geometric distortion, low-contrast dental crowns, and acquisition-dependent variation. These factors hinder the stability of appearance-based similarity metrics and often lead to convergence failures or anatomically implausible alignments. To address these limitations, we propose DentalSCR, a pose-stable, contour-guided framework for accurate and interpretable silhouette-to-contour registration. Our method first constructs a U-Midline Dental Axis (UMDA) to establish a unified cross-arch anatomical coordinate system, thereby stabilizing initialization and standardizing projection geometry across cases. Using this reference frame, we generate radiograph-like projections via a surface-based DRR formulation with coronal-axis perspective and Gaussian splatting, which preserves clinical source-object-detector magnification and emphasizes external silhouettes. Registration is then formulated as a 2D similarity transform optimized with a symmetric bidirectional Chamfer distance under a hierarchical coarse-to-fine schedule, enabling both large capture range and subpixel-level contour agreement. We evaluate DentalSCR on 34 expert-annotated clinical cases. Experimental results demonstrate substantial reductions in landmark error (particularly at posterior teeth), tighter dispersion on the lower jaw, and low Chamfer and controlled Hausdorff distances at the curve level. These findings indicate that DentalSCR robustly handles real-world cephalograms and delivers high-fidelity, clinically inspectable 3D-2D alignment, outperforming conventional baselines.
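The registration objective described in this abstract, a symmetric bidirectional Chamfer distance evaluated under a 2D similarity transform, can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the function names are hypothetical, and the choice of unsquared distances averaged per direction is an assumption, since conventions for the Chamfer distance vary.

```python
import math


def _mean_nearest(src, dst):
    """Average distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def symmetric_chamfer(a, b):
    """Symmetric (bidirectional) Chamfer distance between two 2D point sets.
    Assumed convention: mean of the two directed average distances."""
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


def apply_similarity(points, scale, theta, tx, ty):
    """Apply a 2D similarity transform: uniform scale, rotation, translation."""
    c, s = math.cos(theta), math.sin(theta)
    return [(scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
            for x, y in points]


# Toy silhouette (unit square corners) and a contour shifted by 0.5 in x.
silhouette = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
contour = apply_similarity(silhouette, scale=1.0, theta=0.0, tx=0.5, ty=0.0)

cost_before = symmetric_chamfer(silhouette, contour)
cost_after = symmetric_chamfer(
    apply_similarity(silhouette, 1.0, 0.0, 0.5, 0.0), contour)
```

An optimizer would search over `scale`, `theta`, `tx`, and `ty` to minimize this cost; a coarse-to-fine schedule would do so first on subsampled contours and then on denser ones.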
[CV-53] Going Places: Place Recognition in Artificial and Natural Systems
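[Quick Read]: This paper addresses how efficient and robust place recognition can be achieved across different systems, including robots, animals, and humans, in order to improve the localization ability of autonomous systems in complex environments. The key to the solution is integrating cognitive and computational strategies across fields, identifying shared mechanisms such as topological mapping, multimodal cue integration, and memory management, and using a unifying conceptual framework to carry findings from animal navigation research and human spatial cognition into more generalizable and adaptable artificial localization systems.
Link: https://arxiv.org/abs/2511.14341
Authors: Michael Milford,Tobias Fischer
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: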
Abstract:Place recognition, the ability to identify previously visited locations, is critical for both biological navigation and autonomous systems. This review synthesizes findings from robotic systems, animal studies, and human research to explore how different systems encode and recall place. We examine the computational and representational strategies employed across artificial systems, animals, and humans, highlighting convergent solutions such as topological mapping, cue integration, and memory management. Animal systems reveal evolved mechanisms for multimodal navigation and environmental adaptation, while human studies provide unique insights into semantic place concepts, cultural influences, and introspective capabilities. Artificial systems showcase scalable architectures and data-driven models. We propose a unifying set of concepts by which to consider and develop place recognition mechanisms and identify key challenges such as generalization, robustness, and environmental variability. This review aims to foster innovations in artificial localization by connecting future developments in artificial place recognition systems to insights from both animal navigation research and human spatial cognition studies.
[CV-54] ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding
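[Quick Read]: This paper addresses the obstacles that keep deep-learning analysis of 3D intraoral scans from being deployed in real clinical workflows in digital orthodontics: poor generalization across devices, reliance on large annotated datasets and controlled scanning conditions, and raw meshes with large pose variation, incomplete geometry, and no texture cues. The key to the solution is ArchMap, a training-free, knowledge-guided framework. A geometry-aware arch-flattening module first standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. A Dental Knowledge Base (DKB), encoding a hierarchical tooth ontology, dentition-stage policies, and clinical semantics, then constrains the symbolic reasoning space. Together these yield robust structured understanding for tooth counting, anatomical partitioning, dentition-stage classification, and identification of clinical conditions (such as crowding, missing teeth, prosthetics, and caries), with high accuracy and stability even under sparse or artifact-prone conditions.
Link: https://arxiv.org/abs/2511.14336
Authors: Bohan Zhang,Yiyi Miao,Taoyu Wu,Tong Chen,Ji Jiang,Zhuoxiao Li,Zhe Tang,Limin Yu,Jionglong Su
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Cloud; 3. Tencent; 4. Tsinghua University; 5. Peking University; 6. Baidu; 7. Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: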
Abstract:A structured understanding of intraoral 3D scans is essential for digital orthodontics. However, existing deep-learning approaches rely heavily on modality-specific training, large annotated datasets, and controlled scanning conditions, which limit generalization across devices and hinder deployment in real clinical workflows. Moreover, raw intraoral meshes exhibit substantial variation in arch pose, incomplete geometry caused by occlusion or tooth contact, and a lack of texture cues, making unified semantic interpretation highly challenging. To address these limitations, we propose ArchMap, a training-free and knowledge-guided framework for robust structured dental understanding. ArchMap first introduces a geometry-aware arch-flattening module that standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. We then construct a Dental Knowledge Base (DKB) encoding hierarchical tooth ontology, dentition-stage policies, and clinical semantics to constrain the symbolic reasoning space. We validate ArchMap on 1060 pre-/post-orthodontic cases, demonstrating robust performance in tooth counting, anatomical partitioning, dentition-stage classification, and the identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries. Compared with supervised pipelines and prompted VLM baselines, ArchMap achieves higher accuracy, reduced semantic drift, and superior stability under sparse or artifact-prone conditions. As a fully training-free system, ArchMap demonstrates that combining geometric normalization with ontology-guided multimodal reasoning offers a practical and scalable solution for the structured analysis of 3D intraoral scans in modern digital orthodontics.
[CV-55] Step by Step Network
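[Quick Read]: This paper addresses the difficulty deep neural networks have in realizing their theoretical performance gains as they keep getting deeper. The core barriers are shortcut degradation, which hinders learning in deep layers, and limited width, which restricts model capacity because of the inherent depth-width trade-off. The key to the solution is a generalized residual architecture, the Step by Step Network (StepsNet), which separates features along the channel dimension and stacks blocks of increasing width so that the model learns progressively richer feature representations. This mitigates both bottlenecks and improves performance across tasks such as image classification, object detection, semantic segmentation, and language modeling.
Link: https://arxiv.org/abs/2511.14329
Authors: Dongchen Han,Tianzhu Ye,Zhuofan Xia,Kaiyi Chen,Yulin Wang,Hanting Chen,Gao Huang
Affiliations: Tsinghua University; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: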
Abstract:Scaling up network depth is a fundamental pursuit in neural architecture design, as theory suggests that deeper models offer exponentially greater capability. Benefiting from the residual connections, modern neural networks can scale up to more than one hundred layers and enjoy wide success. However, as networks continue to deepen, current architectures often struggle to realize their theoretical capacity improvements, calling for more advanced designs to further unleash the potential of deeper networks. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. Shortcut degradation hinders deep-layer learning, while the inherent depth-width trade-off imposes limited width. To mitigate these issues, we propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance of deep models. Specifically, we separate features along the channel dimension and let the model learn progressively via stacking blocks with increasing width. The resulting method mitigates the two identified problems and serves as a versatile macro design applicable to various models. Extensive experiments show that our method consistently outperforms residual models across diverse tasks, including image classification, object detection, semantic segmentation, and language modeling. These results position StepsNet as a superior generalization of the widely adopted residual architecture.
[CV-56] LSP-YOLO: A Lightweight Single-Stage Network for Sitting Posture Recognition on Embedded Devices
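[Quick Read]: This paper addresses health problems arising from increasingly sedentary behavior, and in particular the high intrusiveness, heavy computation, and poor real-time performance on edge devices of existing sitting posture recognition methods, which generally rely on invasive sensors or computer vision. The key to the solution is LSP-YOLO, a lightweight single-stage network. It combines Partial Convolution (PConv) with the Similarity-Aware Activation Module (SimAM) in a lightweight module, Light-C3k2, which lowers computational cost while preserving feature extraction capability. In the recognition head, pointwise convolution maps keypoints directly to posture classes, and intermediate supervision fuses pose estimation and classification efficiently. The result is high accuracy (94.2%) with low latency (251 FPS) and a small model size (1.9 MB), suited to real-time deployment in resource-constrained settings.
Link: https://arxiv.org/abs/2511.14322
Authors: Nanjun Li,Ziyue Hao,Quanqiang Wang,Xuanyin Wang
Affiliations: Zhejiang University; Icheego
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to Engineering Applications of Artificial Intelligence (EAAI)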
Abstract:With the rise in sedentary behavior, health problems caused by poor sitting posture have drawn increasing attention. Most existing methods, whether using invasive sensors or computer vision, rely on two-stage pipelines, which result in high intrusiveness, intensive computation, and poor real-time performance on embedded edge devices. Inspired by YOLOv11-Pose, a lightweight single-stage network for sitting posture recognition on embedded edge devices, termed LSP-YOLO, is proposed. By integrating partial convolution (PConv) and the Similarity-Aware Activation Module (SimAM), a lightweight module, Light-C3k2, was designed to reduce computational cost while maintaining feature extraction capability. In the recognition head, keypoints were directly mapped to posture classes through pointwise convolution, and intermediate supervision was employed to enable efficient fusion of pose estimation and classification. Furthermore, a dataset containing 5,000 images across six posture categories was constructed for model training and testing. The smallest trained model, LSP-YOLO-n, achieved 94.2% accuracy and 251 FPS on a personal computer (PC) with a model size of only 1.9 MB. Meanwhile, real-time and high-accuracy inference under constrained computational resources was demonstrated on the SV830C + GC030A platform. The proposed approach is characterized by high efficiency, lightweight design, and deployability, making it suitable for smart classrooms, rehabilitation, and human-computer interaction applications.
[CV-57] Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs
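[Quick Read]: This paper addresses intraoral 3D reconstruction from sparse smartphone images for remote orthodontics, specifically from the unposed anterior and bilateral buccal photographs routinely captured in clinics. The challenges are large view baselines, inconsistent illumination, and highly reflective surfaces, which destabilize pose and geometry estimation, and the frequency bias induced by sparse-view photometric supervision, which over-smooths reconstructions and loses key diagnostic detail. The key to the solution is Dental3R, a pose-free, graph-guided reconstruction pipeline with two core ideas: (1) a Geometry-Aware Pairing Strategy (GAPS) that selects a subgraph of high-value image pairs to stabilize correspondence matching and reduce memory usage; and (2) on top of the recovered poses and point cloud, a 3D Gaussian Splatting (3DGS) model trained with a wavelet-regularized objective, which uses a discrete wavelet transform to enforce band-limited fidelity, preserving fine structures such as enamel boundaries and interproximal edges while suppressing high-frequency artifacts. This markedly improves reconstruction quality and diagnostic usability.
Link: https://arxiv.org/abs/2511.14315
Authors: Yiyi Miao,Taoyu Wu,Tong Chen,Ji Jiang,Zhe Tang,Zhengyong Jiang,Angelos Stefanidis,Limin Yu,Jionglong Su
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. Baidu; 4. Tsinghua University; 5. Peking University; 6. Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: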
Abstract:Intraoral 3D reconstruction is fundamental to digital orthodontics, yet conventional methods like intraoral scanning are inaccessible for remote tele-orthodontics, which typically relies on sparse smartphone imagery. While 3D Gaussian Splatting (3DGS) shows promise for novel view synthesis, its application to the standard clinical triad of unposed anterior and bilateral buccal photographs is challenging. The large view baselines, inconsistent illumination, and specular surfaces common in intraoral settings can destabilize simultaneous pose and geometry estimation. Furthermore, sparse-view photometric supervision often induces a frequency bias, leading to over-smoothed reconstructions that lose critical diagnostic details. To address these limitations, we propose Dental3R, a pose-free, graph-guided pipeline for robust, high-fidelity reconstruction from sparse intraoral photographs. Our method first constructs a Geometry-Aware Pairing Strategy (GAPS) to intelligently select a compact subgraph of high-value image pairs. The GAPS focuses on correspondence matching, thereby improving the stability of the geometry initialization and reducing memory usage. Building on the recovered poses and point cloud, we train the 3DGS model with a wavelet-regularized objective. By enforcing band-limited fidelity using a discrete wavelet transform, our approach preserves fine enamel boundaries and interproximal edges while suppressing high-frequency artifacts. We validate our approach on a large-scale dataset of 950 clinical cases and an additional video-based test set of 195 cases. Experimental results demonstrate that Dental3R effectively handles sparse, unposed inputs and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art methods.
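The abstract does not spell out the wavelet-regularized objective, so the following is only a rough sketch of the general idea: decompose the rendered and target signals into frequency bands with a discrete wavelet transform and penalize the bands separately. The one-level 1D Haar transform, the L1 penalty, the band weights, and all function names are assumptions made for illustration; real images would use a 2D transform.

```python
import math


def haar_dwt_1d(signal):
    """One-level Haar DWT of an even-length sequence.
    Returns (approximation, detail) coefficient lists."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_l1_loss(rendered, target, w_low=1.0, w_high=0.5):
    """Hypothetical band-weighted loss: mean absolute difference of the
    low-frequency (approximation) and high-frequency (detail) bands."""
    ra, rd = haar_dwt_1d(rendered)
    ta, td = haar_dwt_1d(target)
    low = sum(abs(x - y) for x, y in zip(ra, ta)) / len(ra)
    high = sum(abs(x - y) for x, y in zip(rd, td)) / len(rd)
    return w_low * low + w_high * high


# Toy example: one row of pixel intensities from a render and its target.
row_rendered = [1.0, 1.0, 3.0, 3.0]
row_target = [1.0, 1.0, 3.0, 5.0]
loss = wavelet_l1_loss(row_rendered, row_target)
```

Lowering `w_high` relative to `w_low` is one simple way to tolerate less error in coarse structure than in fine detail, or the reverse; which weighting the paper uses is not stated in the abstract.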
[CV-58] Iterative Diffusion-Refined Neural Attenuation Fields for Multi-Source Stationary CT Reconstruction: NAF Meets Diffusion Model
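[Quick Read]: This paper addresses the sharp drop in reconstruction quality of multi-source stationary computed tomography (CT) under ultra-sparse-view sampling, where traditional methods fail because interpolation becomes inaccurate. The key to the solution is an iterative framework, Diffusion-Refined Neural Attenuation Fields (Diff-NAF), whose core idea is to combine a Neural Attenuation Field (NAF) representation with a dual-branch conditional diffusion model. An initial NAF is trained on the ultra-sparse-view projections; new projections are then generated by an Angle-Prior Guided Projection Synthesis strategy and refined by a Diffusion-driven Reuse Projection Refinement Module. The refined projections are added to the training set as pseudo-labels for the next iteration, progressively improving projection completeness and reconstruction fidelity and yielding high-quality CT reconstructions under ultra-sparse-view conditions.
Link: https://arxiv.org/abs/2511.14310
Authors: Jiancheng Fang,Shaoyu Wang,Junlin Wang,Weiwen Wu,Yikun Zhang,Qiegen Liu
Affiliations: Nanchang University; Sun Yat-sen University; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: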
Abstract:Multi-source stationary computed tomography (CT) has recently attracted attention for its ability to achieve rapid image reconstruction, making it suitable for time-sensitive clinical and industrial applications. However, practical systems are often constrained by ultra-sparse-view sampling, which significantly degrades reconstruction quality. Traditional methods struggle under ultra-sparse-view settings, where interpolation becomes inaccurate and the resulting reconstructions are unsatisfactory. To address this challenge, this study proposes Diffusion-Refined Neural Attenuation Fields (Diff-NAF), an iterative framework tailored for multi-source stationary CT under ultra-sparse-view conditions. Diff-NAF combines a Neural Attenuation Field representation with a dual-branch conditional diffusion model. The process begins by training an initial NAF using ultra-sparse-view projections. New projections are then generated through an Angle-Prior Guided Projection Synthesis strategy that exploits inter-view priors, and are subsequently refined by a Diffusion-driven Reuse Projection Refinement Module. The refined projections are incorporated as pseudo-labels into the training set for the next iteration. Through iterative refinement, Diff-NAF progressively enhances projection completeness and reconstruction fidelity under ultra-sparse-view conditions, ultimately yielding high-quality CT reconstructions. Experimental results on multiple simulated 3D CT volumes and real projection data demonstrate that Diff-NAF achieves the best performance under ultra-sparse-view conditions.
[CV-59] SAM-Fed: SAM-Guided Federated Semi-Supervised Learning for Medical Image Segmentation
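[Quick Read]: This paper addresses two challenges facing Federated Semi-Supervised Learning (FSSL) in medical image segmentation: pseudo-labels are unreliable because local models are weak, and client devices cannot host large models because of limited computational resources, which hurts segmentation accuracy and stability. The key to the solution is SAM-Fed, a framework that uses a high-capacity segmentation foundation model as a knowledge source and, during training, combines dual knowledge distillation with an adaptive agreement mechanism to refine the pixel-level supervision given to lightweight client models. This markedly improves segmentation performance in both homogeneous and heterogeneous settings.
Link: https://arxiv.org/abs/2511.14302
Authors: Sahar Nasirihaghighi,Negin Ghamsarian,Yiping Li,Marcel Breeuwer,Raphael Sznitman,Klaus Schoeffmann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: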
Abstract:Medical image segmentation is clinically important, yet data privacy and the cost of expert annotation limit the availability of labeled data. Federated semi-supervised learning (FSSL) offers a solution but faces two challenges: pseudo-label reliability depends on the strength of local models, and client devices often require compact or heterogeneous architectures due to limited computational resources. These constraints reduce the quality and stability of pseudo-labels, while large models, though more accurate, cannot be trained or used for routine inference on client devices. We propose SAM-Fed, a federated semi-supervised framework that leverages a high-capacity segmentation foundation model to guide lightweight clients during training. SAM-Fed combines dual knowledge distillation with an adaptive agreement mechanism to refine pixel-level supervision. Experiments on skin lesion and polyp segmentation across homogeneous and heterogeneous settings show that SAM-Fed consistently outperforms state-of-the-art FSSL methods.
[CV-60] GEN3D: Generating Domain-Free 3D Scenes from a Single Image
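[Quick Read]: This paper addresses the heavy dependence of current neural 3D reconstruction methods on dense multi-view captures, which limits their applicability, and the need for diverse, high-quality 3D scenes to train embodied AI and world models. The key to the solution is Gen3d, which generates high-fidelity, wide-scope, generic 3D scenes from a single image: it builds an initial point cloud by lifting an RGBD image, then maintains and expands the world model, and finally refines the 3D scene by optimizing a Gaussian splatting representation. The method shows strong generalization across datasets and produces consistent, realistic novel views.
Link: https://arxiv.org/abs/2511.14291
Authors: Yuxin Zhang,Ziyu Lu,Hongbo Duan,Keyu Fan,Pengting Luo,Peiyu Zhuang,Mengyu Yang,Houde Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures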
Abstract:Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high-quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generating high-quality, wide-scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized through optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and synthesizing high-fidelity and consistent novel views.
[CV-61] NeuralBoneReg: A Novel Self-Supervised Method for Robust and Accurate Multi-Modal Bone Surface Registration
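[Quick Read]: This paper addresses cross-modal bone surface registration in computer-assisted orthopedic surgery (CAOS): achieving accurate, automatic spatial alignment between preoperative imaging (such as CT) and intraoperative data (such as ultrasound or RGB-D) so that surgical plans are executed accurately. The core challenge is the strong heterogeneity between imaging modalities, which makes conventional methods error-prone and dependent on large amounts of annotated data. The key to the solution is NeuralBoneReg, a self-supervised surface registration framework that uses 3D point clouds as a modality-agnostic representation and has two core modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and a multilayer perceptron (MLP) based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. The method requires no inter-subject training data, generalizes well across anatomies and modalities, and matches or surpasses existing methods in registration accuracy on several public datasets.
Link: https://arxiv.org/abs/2511.14286
Authors: Luohong Wu,Matthias Seibold,Nicola A. Cavalcanti,Yunke Ao,Roman Flepp,Aidana Massalimova,Lilian Calvet,Philipp Fürnstahl
Affiliations: Balgrist University Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: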
Abstract:In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial modality heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT–ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.68°/1.86 mm on UltraBones100k, 1.88°/1.89 mm on UltraBones-Hip, and 3.79°/2.45 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
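The RRE/RTE figures quoted in this abstract are relative rotation and translation errors. A minimal sketch of how such errors are commonly computed follows; the abstract does not define them, so the geodesic-angle and Euclidean-distance definitions used here are an assumption, and the function names are hypothetical.

```python
import math


def rotation_error_deg(r_est, r_gt):
    """Angle in degrees of the relative rotation R_gt^T R_est, for 3x3
    rotation matrices given as nested lists."""
    # trace(R_gt^T R_est) equals the elementwise (Frobenius) inner product.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Euclidean distance between translations, in their native units (e.g. mm)."""
    return math.dist(t_est, t_gt)


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Toy example: the estimate is off by 2 degrees about z and 2 mm along z.
rre = rotation_error_deg(rot_z(12.0), rot_z(10.0))
rte = translation_error((1.0, 2.0, 3.0), (1.0, 2.0, 5.0))
```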
[CV-62] NeuralSSD: A Neural Solver for Signed Distance Surface Reconstruction
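[Quick Read]: This paper addresses the reconstruction of high-quality, accurate 3D implicit surfaces from widely available point cloud data. Implicit methods represent shapes accurately and are robust to topological changes, but existing parameterizations lack an explicit mechanism to ensure a tight fit between the surface and the input point cloud. The key to the solution is a new energy equation that balances the reliability of point cloud information, together with a new convolutional neural network that learns 3D features, so that optimization fits the raw points closely and infers useful inductive biases, producing stable and highly accurate surface reconstructions.
Link: https://arxiv.org/abs/2511.14283
Authors: Zi-Chen Xi,Jiahui Huang,Hao-Xiang Chen,Francis Williams,Qun-Ce Xu,Tai-Jiang Mu,Shi-Min Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Under review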
Abstract:We propose a generalized method, NeuralSSD, for reconstructing a 3D implicit surface from widely available point cloud data. NeuralSSD is a solver based on the neural Galerkin method, aimed at reconstructing higher-quality and more accurate surfaces from input point clouds. Implicit methods are preferred for their ability to accurately represent shapes and their robustness in handling topological changes. However, existing parameterizations of implicit fields lack explicit mechanisms to ensure a tight fit between the surface and input data. To address this, we propose a novel energy equation that balances the reliability of point cloud information. Additionally, we introduce a new convolutional network that learns three-dimensional information to achieve superior optimization results. This approach ensures that the reconstructed surface closely adheres to the raw input points and infers valuable inductive biases from point clouds, resulting in a highly accurate and stable surface reconstruction. NeuralSSD is evaluated on a variety of challenging datasets, including the ShapeNet and Matterport datasets, and achieves state-of-the-art results in terms of both surface reconstruction accuracy and generalizability.
[CV-63] Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning
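[Quick Read]: This paper addresses three challenges in Cross-Domain Few-Shot Learning (CDFSL): disjoint semantics, large domain discrepancy, and data scarcity. Whereas conventional methods mostly focus on building generalized representations, this paper constructs Intermediate Domain Proxies (IDP), using source-domain feature embeddings as a codebook to reconstruct target-domain features. Its core idea is to exploit intrinsic attributes of the IDP, such as visual style and semantic content, to design a fast domain alignment method in which the IDP serve as guidance for target-domain feature transformation. Through collaborative learning of intermediate domain reconstruction and target feature transformation, the model outperforms state-of-the-art methods on 8 cross-domain few-shot learning benchmarks.
Link: https://arxiv.org/abs/2511.14279
Authors: Tong Zhang,Yifan Zhao,Liangyu Wang,Jia Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCV 2025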
Abstract:Cross-Domain Few-Shot Learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, while facing three learning challenges at once: disjoint semantics, large domain discrepancy, and data scarcity. Unlike predominant CDFSL works that focus on generalized representations, we make a novel attempt to construct Intermediate Domain Proxies (IDP) with source feature embeddings as the codebook and reconstruct the target domain feature with this learned codebook. We then conduct an empirical study to explore the intrinsic attributes of intermediate domain proxies from the perspectives of visual style and semantic content. Building on these attributes of intermediate domains, we develop a fast domain alignment method that uses these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model surpasses state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks.
[CV-64] Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
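[Quick Read]: This paper addresses two core problems in current Text-to-3D generation models: coarse semantic alignment, which misses fine-grained prompt details, and a lack of robust 3D spatial understanding, which leads to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. The key to the solution is the VLM3D framework, whose core idea is to use large vision-language models (VLMs) as differentiable semantic and spatial critics, with a dual-query critic signal derived from the VLM's Yes/No log-odds that assesses both semantic fidelity and geometric coherence. The signal is shown to be general across two paradigms: as a reward objective for optimization-based pipelines, where it improves performance, and as a test-time guidance module for feed-forward pipelines, where it corrects severe spatial errors during sampling. This injects the VLM's rich, language-grounded understanding of space and semantics into diverse 3D generation pipelines.
Link: https://arxiv.org/abs/2511.14271
Authors: Weimin Bai,Yubo Li,Weijian Luo,Zeqiang Lai,Yequan Wang,Wenzheng Chen,He Sun
Affiliations: Peking University; Xiaohongshu Inc; MMLab, CUHK; BAAI, Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: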
Abstract:Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM’s Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM’s rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
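The Yes/No log-odds signal mentioned in this abstract can be illustrated with a small sketch. It assumes the VLM exposes next-token logits for the answers "Yes" and "No"; how VLM3D actually extracts and combines them is not stated in the abstract, so the weighted sum of a semantic query and a spatial query below, and all function names, are hypothetical.

```python
import math


def yes_no_log_odds(logit_yes, logit_no):
    """log p(Yes) - log p(No). The softmax normalizer is shared by both
    tokens and cancels, leaving the difference of the two logits."""
    return logit_yes - logit_no


def prob_yes(logit_yes, logit_no):
    """Probability of 'Yes' renormalized over the two answers (a sigmoid
    of the log-odds)."""
    return 1.0 / (1.0 + math.exp(-(logit_yes - logit_no)))


def dual_query_score(semantic_logits, spatial_logits, w_sem=1.0, w_spa=1.0):
    """Hypothetical combination: weighted sum of the log-odds from a semantic
    query and a spatial query, each given as a (logit_yes, logit_no) pair."""
    return (w_sem * yes_no_log_odds(*semantic_logits)
            + w_spa * yes_no_log_odds(*spatial_logits))


# Made-up logits: the critic leans "Yes" on semantics but "No" on geometry.
score = dual_query_score(semantic_logits=(2.0, 0.5), spatial_logits=(1.0, 3.0))
```

Because the log-odds is a smooth function of the logits, a score of this kind can be differentiated with respect to the rendered input when the critic model itself is differentiable.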
[CV-65] Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery
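[Quick Read]: This paper addresses two key limitations of tensor singular value decomposition (t-SVD) for multi-dimensional image representation: the approximation of the latent tensor (e.g., by tensor factorization) is coarse and fails to capture spatially local high-frequency information accurately; and the transform matrix is built from fixed basis atoms (such as the complex exponential atoms of the discrete Fourier transform or the cosine atoms of the discrete cosine transform), which cannot precisely capture local high-frequency features along the mode-3 fibers. The key to the solution is a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which uses tailored 2D Gaussian splatting to generate the latent tensor and 1D Gaussian splatting to build the transform matrix. The two are indispensable and complementary within the framework and together greatly strengthen the representation of local high-frequency information.
Link: https://arxiv.org/abs/2511.14270
Authors: Yiming Zeng,Xi-Le Zhao,Wei-Hao Wu,Teng-Yu Ji,Chao Wang
Affiliations: University of Electronic Science and Technology of China; Northwest Polytechnical University; Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: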
Abstract:Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address these two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the proposed GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.
[CV-66] ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation
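【Quick Read】: This paper addresses the limitations of current image manipulation detection and localization (IMDL) benchmarks, namely limited content diversity, narrow generative-model coverage, and insufficient interpretability, which restrict the generalization and explanation capabilities of existing detection methods. The key to the solution is ManipBench, a new large-scale, multi-category, interpretable benchmark containing over 450K manipulated images produced by 25 state-of-the-art image editing models, 100K of which are annotated with bounding boxes, judgment cues, and textual explanations. Building on it, the authors propose ManipShield, a unified detection and explanation framework based on a Multimodal Large Language Model (MLLM) that uses contrastive LoRA fine-tuning and task-specific decoders to perform manipulation detection, localization, and explanation in one model, achieving state-of-the-art performance on ManipBench and several public datasets and generalizing well to unseen manipulation models.
Link: https://arxiv.org/abs/2511.14259
Authors: Zitong Xu,Huiyu Duan,Xiaoyu Wang,Zhaolin Cai,Kaiwei Zhang,Qiang Hu,Jing Liu,Xiongkuo Min,Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Tianjin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: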
Abstract:With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
[CV-67] V2VLoc: Robust GNSS-Free Collaborative Perception via LiDAR Localization AAAI2026
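【Quick Read】: This paper addresses the difficulty multiple agents face in obtaining accurate poses and aligning observations in GNSS-denied environments, which degrades collaborative perception. The key to the solution is a robust GNSS-free collaborative perception framework based on LiDAR localization. It first introduces a lightweight Pose Generator with Confidence (PGC) that produces compact pose and confidence representations, and then a Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT) that performs confidence-aware spatial alignment while modeling temporal context, mitigating the effect of localization errors. The method achieves state-of-the-art performance on the authors' simulation dataset V2VLoc, and experiments on the real-world V2V4Real dataset validate its effectiveness and generalizability.
Link: https://arxiv.org/abs/2511.14247
Authors: Wenkai Lin,Qiming Xia,Wen Li,Xun Huang,Chenglu Wen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI2026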
Abstract:Multi-agents rely on accurate poses to share and align observations, enabling a collaborative perception of the environment. However, traditional GNSS-based localization often fails in GNSS-denied environments, making consistent feature alignment difficult in collaboration. To tackle this challenge, we propose a robust GNSS-free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT), which performs confidence-aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi-traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state-of-the-art performance under GNSS-denied conditions. We further conduct extended experiments on the real-world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.
[CV-68] Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization AAAI2026
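【Quick Read】: This paper addresses the fact that foundation models for monocular depth estimation (MDE), despite their zero-shot generalization, still hit performance limits in specific downstream scenarios. The authors propose WeSTAR, a parameter-efficient framework that performs weakly supervised self-training adaptation with regularization to improve robustness in unseen and diverse domains. Its key components are: 1) a dense self-training objective as the primary source of structural self-supervision; 2) semantically-aware hierarchical normalization, which uses instance-level segmentation maps for more stable, multi-scale structural normalization; 3) a low-cost weak supervision signal in the form of pairwise ordinal depth annotations that constrains local topological errors; and 4) a weight regularization loss that anchors the LoRA updates, keeping training stable and preserving the model's generalizable knowledge. Experiments on realistic and corrupted out-of-distribution datasets show consistent generalization gains and state-of-the-art performance.
Link: https://arxiv.org/abs/2511.14238
Authors: Yan Huang,Yongyi Su,Xin Lin,Le Zhang,Xun Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026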
Abstract:The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable and multi-scale structural normalization. Beyond dense supervision, we introduce a cost-efficient weak supervision in the form of pairwise ordinal depth annotations to further guide the adaptation process, which enforces informative ordinal constraints to mitigate local topological errors. Finally, a weight regularization loss is employed to anchor the LoRA updates, ensuring training stability and preserving the model’s generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
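To make the weak-supervision and regularization terms more concrete, here is a minimal sketch. It assumes a logistic ranking loss for the pairwise ordinal annotations and a squared L2 penalty on the LoRA parameters; both are common choices used for illustration, not forms confirmed by the abstract, and all function names are hypothetical.

```python
import math

def pairwise_ordinal_loss(pred_a, pred_b, relation):
    """Logistic ranking loss for one annotated pixel pair (assumed form).

    relation = +1 if point a is annotated as farther than point b,
               -1 if point a is annotated as closer than point b.
    """
    margin = relation * (pred_a - pred_b)
    # softplus(-margin), written in a numerically stable way
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def lora_anchor_penalty(lora_weights, strength=1e-3):
    """Squared L2 penalty pulling LoRA updates toward zero, i.e. toward
    the frozen foundation model (assumed form of the weight regularizer)."""
    return strength * sum(w * w for w in lora_weights)

def total_loss(pairs, lora_weights, strength=1e-3):
    """Average ordinal loss over annotated pairs plus the anchor penalty."""
    ordinal = sum(pairwise_ordinal_loss(a, b, r) for a, b, r in pairs) / len(pairs)
    return ordinal + lora_anchor_penalty(lora_weights, strength)
```

A pair whose predicted depths agree with the annotated order gets a small loss, a violated pair gets a large one, and the penalty grows as the LoRA weights drift away from zero.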
[CV-69] Breaking the Passive Learning Trap: An Active Perception Strategy for Human Motion Prediction
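【Quick Read】: This paper addresses the over-reliance of current 3D human motion prediction methods on implicit network modeling of spatio-temporal relationships and motion characteristics, which makes learning passive, redundant, and monotonous, with no actively guided explicit learning mechanism. The core solution is an Active Perceptual Strategy (APS) with two parts. A data perception module projects poses into a quotient space to decouple motion geometry from coordinate redundancy, and jointly encodes tangent vectors and Grassmann projections to achieve geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement. A network perception module actively models spatio-temporal dependencies through restorative learning: it masks joints or injects noise to build auxiliary supervision signals, and a dedicated auxiliary learning network learns adaptively from the perturbed inputs. APS is model-agnostic and can be integrated into different prediction models; the reported improvements are 16.3% on H3.6M, 13.9% on CMU Mocap, and 10.1% on 3DPW.
Link: https://arxiv.org/abs/2511.14237
Authors: Juncheng Hu,Zijian Zhang,Zeyu Wang,Guoyu Wang,Yingji Li,Kedi Lyu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures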
Abstract:Forecasting 3D human motion is an important embodiment of fine-grained understanding and cognition of human behavior by artificial agents. Current approaches excessively rely on implicit network modeling of spatiotemporal relationships and motion characteristics, falling into the passive learning trap that results in redundant and monotonous 3D coordinate information acquisition while lacking actively guided explicit learning mechanisms. To overcome these issues, we propose an Active Perceptual Strategy (APS) for human motion prediction, leveraging quotient space representations to explicitly encode motion properties while introducing auxiliary learning objectives to strengthen spatio-temporal modeling. Specifically, we first design a data perception module that projects poses into the quotient space, decoupling motion geometry from coordinate redundancy. By jointly encoding tangent vectors and Grassmann projections, this module simultaneously achieves geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement for effective motion pose characterization. Furthermore, we introduce a network perception module that actively learns spatio-temporal dependencies through restorative learning. This module deliberately masks specific joints or injects noise to construct auxiliary supervision signals. A dedicated auxiliary learning network is designed to actively adapt and learn from perturbed information. Notably, APS is model agnostic and can be integrated with different prediction models to enhance active perception. The experimental results demonstrate that our method achieves a new state of the art, outperforming existing methods by large margins: 16.3% on H3.6M, 13.9% on CMU Mocap, and 10.1% on 3DPW.
[CV-70] StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model
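【Quick Read】: This paper addresses two problems of existing speech-driven 3D facial animation methods on long audio sequences: performance drops markedly when the input audio exceeds the training horizon, and processing long audio incurs significant latency. The key to the solution is a novel autoregressive diffusion model that processes audio in a streaming manner. It selects a limited number of past frames as motion context and combines them with the current audio input into a dynamic condition that guides the diffusion process to generate facial motion frames iteratively. This design adapts to audio of varying length, keeps latency low regardless of audio duration, and supports real-time synthesis with high-quality results.
Link: https://arxiv.org/abs/2511.14223
Authors: Yifan Yang,Zhi Cen,Sida Peng,Xiangwei Chen,Yifu Deng,Xinyu Zhu,Fan Jia,Xiaowei Zhou,Hujun Bao
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: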
Abstract:This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech. Existing methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at this https URL.
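The streaming idea (a bounded window of past motion frames combined with the current audio as the condition) can be sketched as a simple generator loop. This is a structural sketch only: `generate_chunk` is a hypothetical placeholder for a full conditional diffusion sampling run, and `toy_generator` is a stand-in used purely to make the example runnable.

```python
from collections import deque

def stream_animation(audio_chunks, generate_chunk, history_len=4):
    """Yield one motion frame per audio chunk using a bounded motion history.

    generate_chunk(history, audio) stands in for a diffusion sampling run
    conditioned on (past motion, current audio); it is a placeholder.
    """
    history = deque(maxlen=history_len)  # oldest frames drop out automatically
    for audio in audio_chunks:
        frame = generate_chunk(list(history), audio)
        history.append(frame)
        yield frame

def toy_generator(history, audio):
    """Toy stand-in: blend the current audio value with the last frame."""
    previous = history[-1] if history else 0.0
    return 0.5 * previous + 0.5 * audio
```

Because the history window has a fixed maximum length, the work per generated frame does not grow with the total audio duration, which is what makes latency independent of sequence length in this kind of design.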
[CV-71] Measurement-Constrained Sampling for Text-Prompted Blind Face Restoration
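【Quick Read】: This paper addresses the one-to-many nature of blind face restoration (BFR): a single low-quality (LQ) input may correspond to multiple plausible high-quality (HQ) reconstructions, yet existing methods usually produce deterministic outputs and fail to capture this diversity. The key to the solution is Measurement-Constrained Sampling (MCS), which formulates BFR as a measurement-constrained generative task. It constructs an inverse problem through controlled degradations of coarse restorations and applies posterior-guided sampling within a text-to-image diffusion model. The Forward Measurement keeps results consistent with the input structure, while the Reverse Measurement produces projection spaces that allow the solution to align with different text prompts, yielding diverse, prompt-aligned restorations.
Link: https://arxiv.org/abs/2511.14213
Authors: Wenjie Li,Yulun Zhang,Guangwei Gao,Heng Guo,Zhanyu Ma
Affiliations: Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: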
Abstract:Blind face restoration (BFR) may correspond to multiple plausible high-quality (HQ) reconstructions under extremely low-quality (LQ) inputs. However, existing methods typically produce deterministic results, struggling to capture this one-to-many nature. In this paper, we propose a Measurement-Constrained Sampling (MCS) approach that enables diverse LQ face reconstructions conditioned on different textual prompts. Specifically, we formulate BFR as a measurement-constrained generative task by constructing an inverse problem through controlled degradations of coarse restorations, which allows posterior-guided sampling within text-to-image diffusion. Measurement constraints include both Forward Measurement, which ensures results align with input structures, and Reverse Measurement, which produces projection spaces, ensuring that the solution can align with various prompts. Experiments show that our MCS can generate prompt-aligned results and outperforms existing BFR methods. Codes will be released after acceptance.
[CV-72] Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
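【Quick Read】: This paper addresses a limitation of traditional vision-language models (VLMs) on complex visual tasks: they produce only descriptive outputs and lack the ability to reason over and act on visual information. The authors propose Orion, whose core idea is to combine neural perception with symbolic execution in an agentic framework with multiple tool-calling capabilities. Orion orchestrates a suite of specialized computer vision tools (object detection, keypoint localization, panoptic segmentation, optical character recognition, and geometric analysis) to execute multi-step visual workflows, moving from passive visual understanding to active, tool-driven visual intelligence. It achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench and targets production-grade visual applications.
Link: https://arxiv.org/abs/2511.14210
Authors: N Dinesh Reddy,Sudeep Pillai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: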
Abstract:We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
[CV-73] InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
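【Quick Read】: This paper addresses the difficulty of combining high perceptual quality with tight latency constraints in video inverse problems, particularly for real-time applications such as streaming, telepresence, and AR/VR. Existing diffusion-based methods either adapt image diffusion models with ad hoc temporal regularizers, which causes temporal artifacts, or rely on native video diffusion models whose iterative posterior sampling is too slow for real-time use. The key to the solution is InstantViR, an amortized inference framework that distills a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that restores a degraded video in a single forward pass, inheriting the teacher's temporal modeling while removing iterative test-time optimization. A teacher-space regularized distillation scheme further replaces the VAE of the video diffusion backbone with the more efficient LeanVAE for low-latency latent-space processing. InstantViR matches or surpasses the reconstruction quality of diffusion baselines while running at over 35 FPS on NVIDIA A100 GPUs, up to 100 times faster than iterative video diffusion solvers.
Link: https://arxiv.org/abs/2511.14208
Authors: Weimin Bai,Suzhe Xu,Yiwei Ren,Jinhua Hao,Ming Sun,Wenzheng Chen,He Sun
Affiliations: Peking University; Huaqiao University; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: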
Abstract:Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher’s strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
[CV-74] Multi-Scale Correlation-Aware Transformer for Maritime Vessel Re-Identification
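【Quick Read】: This paper addresses outlier samples in maritime vessel re-identification (Re-ID), which arise because vessel images show large intra-identity variations and severe missing local parts, and which disrupt matching of the same vessel identity. The key to the solution is the Multi-scale Correlation-aware Transformer Network (MCFormer), built around two modules. The Global Correlation Module (GCM) constructs a global similarity affinity matrix across all input images and aggregates features based on inter-image consistency to model global correlations. The Local Correlation Module (LCM) maintains a dynamic memory bank to mine and align local features of positive samples with contextual similarity, compensating for missing or occluded regions in individual images. MCFormer then fuses the global and local correlated features across multiple scales, improving feature robustness and achieving state-of-the-art performance on three benchmarks.
Link: https://arxiv.org/abs/2511.14203
Authors: Yunhe Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: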
Abstract:Maritime vessel re-identification (Re-ID) plays a crucial role in advancing maritime monitoring and intelligent situational awareness systems. However, some existing vessel Re-ID methods are directly adapted from pedestrian-focused algorithms, making them ill-suited for mitigating the unique problems present in vessel images, particularly the greater intra-identity variations and more severe missing of local parts, which lead to the emergence of outlier samples within the same identity. To address these challenges, we propose the Multi-scale Correlation-aware Transformer Network (MCFormer), which explicitly models multi-scale correlations across the entire input set to suppress the adverse effects of outlier samples with intra-identity variations or local missing, incorporating two novel modules, the Global Correlation Module (GCM), and the Local Correlation Module (LCM). Specifically, GCM constructs a global similarity affinity matrix across all input images to model global correlations through feature aggregation based on inter-image consistency, rather than solely learning features from individual images as in most existing approaches. Simultaneously, LCM mines and aligns local features of positive samples with contextual similarity to extract local correlations by maintaining a dynamic memory bank, effectively compensating for missing or occluded regions in individual images. To further enhance feature robustness, MCFormer integrates global and local features that have been respectively correlated across multiple scales, effectively capturing latent relationships among image features. Experiments on three benchmarks demonstrate that MCFormer achieves state-of-the-art performance.
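The idea behind the global correlation step (an affinity matrix over all input images, followed by feature aggregation) can be illustrated with a small sketch. It assumes cosine similarity with a row-wise softmax as the affinity, which is a common choice and not necessarily what GCM uses; the function names and the `temperature` parameter are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def affinity_matrix(features, temperature=0.1):
    """Row-wise softmax over pairwise cosine similarities (assumed form)."""
    rows = []
    for f_i in features:
        logits = [cosine_similarity(f_i, f_j) / temperature for f_j in features]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def aggregate(features, temperature=0.1):
    """Replace each feature by the affinity-weighted mean of all features."""
    weights = affinity_matrix(features, temperature)
    dim = len(features[0])
    return [[sum(w * f[d] for w, f in zip(row, features)) for d in range(dim)]
            for row in weights]
```

With this weighting, features that agree with many others reinforce each other, while a dissimilar sample contributes very little to the aggregated features of the rest of the set.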
[CV-75] Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
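【Quick Read】: This paper addresses low data efficiency in object detection, in particular the fact that existing online data curation strategies do not transfer well to detection because of its structural complexity and domain gaps. The key to the solution is DetGain, which dynamically selects informative samples by estimating each image's marginal perturbation to dataset-level Average Precision (AP). Based on prediction quality, it models global score distributions and computes teacher-student contribution gaps, giving an efficient and lightweight selection rule. The method is architecture-agnostic and minimally intrusive, integrates into a variety of detectors, consistently improves accuracy, and is robust to low-quality data.
Link: https://arxiv.org/abs/2511.14197
Authors: Zitang Sun,Masakazu Yoshimura,Junji Otsuka,Atsushi Irie,Takeshi Ohashi
Affiliations: Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint version, under review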
Abstract:High-quality data has become a primary driver of progress under scale laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model’s evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.
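The quantity at the heart of this entry, an image's marginal contribution to dataset-level AP, can be illustrated by brute force: compute AP with and without that image. The sketch below is not DetGain, which per the abstract estimates this change efficiently from modeled score distributions rather than recomputing AP. It also assumes a single class, non-interpolated AP, and detections already matched to ground truth; all names are hypothetical.

```python
def average_precision(detections, num_gt):
    """Non-interpolated AP from (score, is_true_positive) pairs."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_pos += 1
            ap += true_pos / rank  # precision at each new recall level
    return ap / num_gt

def marginal_ap_contribution(per_image, image_id):
    """AP of the full set minus AP with one image's detections and GT removed.

    per_image maps image_id -> (detections, num_gt).
    """
    def pooled(items):
        items = list(items)  # materialize so we can iterate twice
        dets = [d for detections, _ in items for d in detections]
        gts = sum(n for _, n in items)
        return average_precision(dets, gts)

    full = pooled(per_image.values())
    rest = pooled(v for k, v in per_image.items() if k != image_id)
    return full - rest
```

An image whose detections are confident and correct raises the pooled AP, while an image that contributes high-scoring false positives lowers it, which is the kind of signal an online curation rule can rank samples by.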
[CV-76] MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals AAAI2026
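【Quick Read】: This paper addresses data scarcity in brain-signal-to-video reconstruction, where existing cross-subject methods focus on subject-invariant information, neglect subject-specific information, and therefore rely on slow fine-tuning to adapt to new subjects. The key to the solution is MindCross, a framework with N subject-specific encoders and one shared encoder that extract subject-specific and subject-invariant information, respectively. A Top-K collaboration module uses knowledge from previous subjects' encoders to enhance decoding for a new subject, enabling fast and data-efficient new-subject adaptation.
Link: https://arxiv.org/abs/2511.14196
Authors: Xuan-Hao Liu,Yan-Kai Liu,Tianyi Zhou,Bao-Liang Lu,Wei-Long Zheng
Affiliations: Unknown
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: AAAI 2026, 16 pages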
Abstract:Reconstructing video from brain signals is an important brain decoding task. Existing brain decoding frameworks are primarily built on a subject-dependent paradigm, which requires large amounts of brain data for each subject. However, the high cost of collecting brain-video data causes severe data scarcity. Although some cross-subject methods have been introduced, they often focus too heavily on subject-invariant information while neglecting subject-specific information, resulting in slow fine-tuning-based adaptation strategies. To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject framework. MindCross’s N specific encoders and one shared encoder are designed to extract subject-specific and subject-invariant information, respectively. Additionally, a Top-K collaboration module is adopted to enhance new subject decoding with the knowledge learned from previous subjects’ encoders. Extensive experiments on fMRI/EEG-to-video benchmarks demonstrate MindCross’s efficacy and efficiency of cross-subject decoding and new subject adaptation using only one model.
[CV-77] Hierarchical Semantic Learning for Multi-Class Aorta Segmentation MICCAI2024
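【Quick Read】: This paper addresses two challenges in segmenting the complex vascular structure of the aorta in medical images: existing methods usually ignore hierarchical anatomical relationships, and vascular structures suffer from severe class imbalance. The authors propose a curriculum learning strategy whose key novelty is a fractal softmax for hierarchical semantic learning. Mimicking human cognition, the method decomposes structures from simple to complex, first building robust feature representations for dominant classes and then focusing on rare but anatomically critical structures, which mitigates class imbalance and accelerates convergence of multi-class segmentation models. A two-stage inference strategy yields up to a fivefold speedup, improving clinical practicality.
Link: https://arxiv.org/abs/2511.14187
Authors: Pengcheng Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2024 Workshop AortaSeg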
Abstract:The aorta, the body’s largest artery, is prone to pathologies such as dissection, aneurysm, and atherosclerosis, which often require timely intervention. Minimally invasive repairs involving branch vessels necessitate detailed 3D anatomical analysis. Existing methods often overlook hierarchical anatomical relationships while struggling with severe class imbalance inherent in vascular structures. We address these challenges with a curriculum learning strategy that leverages a novel fractal softmax for hierarchical semantic learning. Inspired by human cognition, our approach progressively learns anatomical constraints by decomposing complex structures from simple to complex components. The curriculum learning framework naturally addresses class imbalance by first establishing robust feature representations for dominant classes before tackling rare but anatomically critical structures, significantly accelerating model convergence in multi-class scenarios. Our two-stage inference strategy achieves up to fivefold acceleration, enhancing clinical practicality. On the validation set at epoch 50, our hierarchical semantic loss improves the Dice score of nnU-Net ResEnc M by 11.65%. The proposed model demonstrates a 5.6% higher Dice score than baselines on the test set. Experimental results show significant improvements in segmentation accuracy and efficiency, making the framework suitable for real-time clinical applications. The implementation code for this challenge entry is publicly available at: this https URL. The code for fractal softmax will be available at this https URL.
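Since the results above are reported as Dice scores, a short sketch of how a per-class and mean Dice score is typically computed may help. It uses the standard definition, Dice = 2|P ∩ T| / (|P| + |T|), on flattened label lists; the function names are hypothetical, and scoring a class absent from both prediction and target as 1.0 is a convention chosen here, not something stated in the abstract.

```python
def dice_score(pred, target, label):
    """Dice coefficient for one label over two flat lists of class labels."""
    p = [x == label for x in pred]
    t = [x == label for x in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    # Convention: a class absent from both prediction and target scores 1.0.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

def mean_dice(pred, target, labels):
    """Unweighted mean of per-class Dice scores."""
    return sum(dice_score(pred, target, l) for l in labels) / len(labels)
```

Because the mean is unweighted, a rare branch vessel counts as much as the dominant aortic trunk, which is why class imbalance shows up so clearly in this metric.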
[CV-78] Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation AAAI2026
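【Quick Read】: This paper addresses precise event spotting (PES) in sports under few-shot conditions, where rapid event succession, motion blur, and subtle visual differences make the task hard. Most existing methods rely on end-to-end training with large labeled datasets and use pixel- or pose-based inputs alone, so their performance drops sharply when few labels are available. The key to the solution is the Unified Multi-Entity Graph Network (UMEG-Net), which merges human skeletons and sport-specific object keypoints into a unified graph and extracts spatio-temporal features efficiently with an advanced graph convolutional network (GCN) and multi-scale temporal shift. Multimodal distillation then transfers knowledge from the keypoint-based graphs to visual representations, giving robust PES performance with limited labeled data.
Link: https://arxiv.org/abs/2511.14186
Authors: Zhaoyu Liu,Kan Jiang,Murong Ma,Zhe Hou,Yun Lin,Jin Song Dong
Affiliations: 1. University of Science and Technology of China; 2. Tsinghua University; 3. Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)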
Abstract:Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at this https URL.
[CV-79] PAVE: An End-to-End Dataset for Production Autonomous Vehicle Evaluation
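【Quick Read】: This paper addresses the lack of real autonomous-driving data for evaluating autonomous vehicle (AV) safety: existing datasets are collected in human-driving or unidentified driving modes and cannot properly assess the behavioral safety of AVs under black-box control. The key to the solution is the first real-world end-to-end benchmark dataset collected entirely in autonomous-driving mode. It contains over 100 hours of naturalistic driving data from multiple production AV models, with annotated key frames (four synchronized camera images and GNSS/IMU data with 0.8 cm localization accuracy), 20 Hz vehicle trajectories spanning the past 6 s and future 5 s, and rich scenario attributes such as traffic density, weather, and road type. An end-to-end motion planning model evaluated on the dataset predicts trajectories with an Average Displacement Error (ADE) of 1.4 m, and the dataset is intended as a continuously growing benchmark for AV behavior analysis and safety evaluation.
Link: https://arxiv.org/abs/2511.14185
Authors: Xiangyu Li,Chen Wang,Yumao Liu,Dengbo He,Jiahao Zhang,Ke Ma
Affiliations: Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: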
Abstract:Most existing autonomous-driving datasets (e.g., KITTI, nuScenes, and the Waymo Perception Dataset), collected by human-driving mode or unidentified driving mode, can only serve as early training for the perception and prediction of autonomous vehicles (AVs). To evaluate the real behavioral safety of AVs controlled in the black box, we present the first end-to-end benchmark dataset collected entirely by autonomous-driving mode in the real world. This dataset contains over 100 hours of naturalistic data from multiple production autonomous-driving vehicle models in the market. We segment the original data into 32,727 key frames, each consisting of four synchronized camera images and high-precision GNSS/IMU data (0.8 cm localization accuracy). For each key frame, 20 Hz vehicle trajectories spanning the past 6 s and future 5 s are provided, along with detailed 2D annotations of surrounding vehicles, pedestrians, traffic lights, and traffic signs. These key frames have rich scenario-level attributes, including driver intent, area type (covering highways, urban roads, and residential areas), lighting (day, night, or dusk), weather (clear or rain), road surface (paved or unpaved), traffic and vulnerable road users (VRU) density, traffic lights, and traffic signs (warning, prohibition, and indication). To evaluate the safety of AVs, we employ an end-to-end motion planning model that predicts vehicle trajectories with an Average Displacement Error (ADE) of 1.4 m on autonomous-driving frames. The dataset continues to expand by over 10 hours of new data weekly, thereby providing a sustainable foundation for research on AV driving behavior analysis and safety evaluation.
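For readers unfamiliar with the metric, Average Displacement Error is the mean Euclidean distance between predicted and ground-truth waypoints at matching time steps. The sketch below uses that standard definition; the function name and the `FUTURE_STEPS` constant are hypothetical, and the constant simply multiplies the 20 Hz rate by the 5 s future horizon quoted in the abstract.

```python
import math

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matched (x, y) waypoints, in metres."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.hypot(px - gx, py - gy)
                for (px, py), (gx, gy) in zip(predicted, ground_truth))
    return total / len(predicted)

# 20 Hz sampling over a 5 s horizon gives 100 future waypoints per key frame.
FUTURE_STEPS = 20 * 5
```

An ADE of 1.4 m therefore means that, averaged over all future waypoints, the predicted position is 1.4 m away from where the vehicle actually drove.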
[CV-80] GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation
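【Quick Read】: This paper addresses the non-uniform distribution of semantic features in existing image tokenization methods, which limits image reconstruction and generation quality. Current methods rely on locally supervised semantic guidance, which limits the uniformity of the semantic distribution, whereas VA-VAE shows that a more uniform feature distribution yields better generation performance. The authors propose the Global Perspective Tokenizer (GloTok). First, a codebook-wise histogram relation learning method transfers the semantics that pre-trained vision models capture over the entire dataset to a semantic codebook, producing a more uniform semantic distribution. Second, a residual learning module recovers fine-grained details to reduce the reconstruction error caused by quantization. With this design, autoregressive (AR) models can be trained to generate high-quality images without direct access to pre-trained models during training, and the method reports state-of-the-art reconstruction and generation quality on ImageNet-1k.
Link: https://arxiv.org/abs/2511.14184
Authors: Xuan Zhao,Zhongyu Zhang,Yuge Huang,Yuxi Mi,Guodong Mu,Shouhong Ding,Jun Wang,Rizen Guo,Shuigeng Zhou
Affiliations: Tencent Youtu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: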
Abstract:Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
[CV-81] UniSER: A Foundation Model for Unified Soft Effects Removal
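[Quick read]: This paper addresses image degradation caused by soft effects such as lens flare, haze, shadows, and reflections, which do not fully occlude image content but noticeably reduce visual quality. Existing methods mostly build a dedicated model per degradation type, which scales poorly and ignores what the degradations have in common, while generalist models (e.g., GPT-4o, Flux Kontext) depend on detailed prompts and struggle to achieve robust removal on fine-grained tasks while preserving scene identity. The key idea is to treat soft effects as a single phenomenon, semi-transparent occlusions, and to propose UniSER, a unified, versatile restoration model. Its core components are a large-scale dataset of 3.8M pairs (including novel, physically plausible data that fills gaps in public benchmarks) and a tailored training pipeline that fine-tunes a Diffusion Transformer with fine-grained mask and strength controls to learn robust restoration priors. This lets a single framework handle diverse soft-effect degradations and significantly outperform both specialist and generalist models.
Link: https://arxiv.org/abs/2511.14183
Authors: Jingdong Zhang,Lingzhi Zhang,Qing Liu,Mang Tik Chiu,Connelly Barnes,Yizhou Wang,Haoran You,Xiaoyang Liu,Yuqian Zhou,Zhe Lin,Eli Shechtman,Sohrab Amirghodsi,Xin Li,Wenping Wang,Xiaohang Zhan
Affiliations: Texas A&M University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: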
Abstract:Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized models that lack scalability and fail to exploit the shared underlying essences of these restoration problems. Meanwhile, recent large-scale pretrained generalist models (e.g., GPT-4o, Flux Kontext, Nano Banana) offer powerful, text-driven image editing capabilities, but they require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or to preserve the identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce a foundational versatile model UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization, which includes novel, physically-plausible data to fill critical gaps in public benchmarks, and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.
[CV-82] DoGCLR: Dominance-Game Contrastive Learning Network for Skeleton-Based Action Recognition
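[Quick read]: This paper addresses two core problems in self-supervised contrastive learning for skeleton-based action recognition: existing methods process all skeleton regions uniformly, which loses motion information, and they store negative samples in a first-in-first-out (FIFO) queue, which leads to non-optimal negative selection. The proposed solution is the game-theoretic Dominance-Game Contrastive Learning network (DoGCLR), with two main contributions: (1) a spatio-temporal dual weight localization mechanism that identifies key motion regions and guides region-wise augmentation, increasing motion diversity while maintaining semantic consistency; and (2) an entropy-driven dominance strategy that manages the memory bank by retaining high-entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. By modeling the construction of positive and negative samples as a dynamic game, the framework balances semantic preservation and discriminative strength, improving action recognition performance.
Link: https://arxiv.org/abs/2511.14179
Authors: Yanshan Li,Ke Ma,Miaomiao Wei,Linhui Dai
Affiliations: Shenzhen University; Institute of Intelligence Information Processing; Guangdong Key Laboratory of Intelligent Information Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures, journal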
Abstract:Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly, and adopt a first-in-first-out (FIFO) queue to store negative samples, which leads to motion information loss and non-optimal negative sample selection. To address these challenges, this paper proposes Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, where both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while maintaining semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set, DoGCLR achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs comparably to the state-of-the-art methods and achieves a 1.9% higher accuracy on Part II, highlighting its strong robustness on more challenging scenarios.
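The contrast between a FIFO queue and an entropy-driven memory bank can be illustrated with a minimal sketch. It is not the DoGCLR implementation: the class name `EntropyMemoryBank`, the fixed capacity, and the assumption that each negative arrives with a precomputed entropy score are all hypothetical, since the abstract does not say how entropy is computed or how replacement is scheduled.

```python
import heapq


class EntropyMemoryBank:
    """Toy memory bank that keeps the highest-entropy negatives (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (entropy, insertion_index, sample)
        self._count = 0   # unique tie-breaker so samples are never compared

    def push(self, sample, entropy):
        entry = (entropy, self._count, sample)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entropy > self._heap[0][0]:
            # Replace the weakest (lowest-entropy) stored negative.
            heapq.heapreplace(self._heap, entry)
        # Otherwise the new sample is weaker than everything stored and is dropped.

    def samples(self):
        """Stored samples, hardest (highest entropy) first."""
        return [s for _, _, s in sorted(self._heap, reverse=True)]


bank = EntropyMemoryBank(capacity=3)
for name, entropy in [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.1), ("e", 0.7)]:
    bank.push(name, entropy)

print(bank.samples())  # ['b', 'e', 'c']
```

A FIFO queue of the same capacity would instead hold `c`, `d`, and `e`, keeping the weak negative `d` and discarding the hard negative `b`.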
[CV-83] AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs
[Quick read]: This paper addresses the heavy computation and memory cost that patch-level tokenization imposes on Multimodal Large Language Models (MLLMs) during image understanding and reasoning, as well as the hallucination and computational redundancy that arise because traditional patch-wise scanning is misaligned with the human visual cognition system. The key to the solution is an object-level token merging strategy for Adaptive Token compression, which stays consistent with human vision while sharply reducing the number of image tokens. Experiments show that with only about 10% of the tokens the method retains about 96% of the vanilla model's performance, giving a better balance between compression ratio and performance.
Link: https://arxiv.org/abs/2511.14169
Authors: Xinliang Zhang,Lei Zhu,Hangzhou He,Shuang Zeng,Ourui Fu,Jiakui Hu,Zhengjian Yao,Yanye Lu
Affiliations: Peking University Health Science Center; Peking University; National Biomedical Imaging Center; Peking University Shenzhen Graduate School
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural paradigm. However, patch-level tokenization leads to a quadratic growth in image tokens, burdening MLLMs’ understanding and reasoning with enormous computation and memory. Additionally, the traditional patch-wise scanning tokenization workflow misaligns with the human vision cognition system, further leading to hallucination and computational redundancy. To address these issues, we propose an object-level token merging strategy for Adaptive Token compression that is consistent with the human vision system. Experiments on multiple comprehensive benchmarks show that, on average, our approach uses only 10% of the tokens while achieving almost 96% of the vanilla model’s performance. More extensive comparisons with related works demonstrate the superiority of our method in balancing compression ratio and performance. Our code will be available.
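To make the idea of object-level token merging concrete, here is a minimal sketch under strong assumptions. The abstract does not describe how AdaTok assigns patches to objects or how it merges them, so the function name `merge_tokens_by_object`, the precomputed `object_ids`, and the use of mean pooling are all hypothetical choices made for illustration only.

```python
from collections import defaultdict


def merge_tokens_by_object(tokens, object_ids):
    """Average all patch tokens that share an object id, giving one token per object.

    tokens:     list of equal-length feature vectors, one per image patch
    object_ids: list of object labels, one per patch (assumed to come from
                some external segmentation step, not shown here)
    """
    if len(tokens) != len(object_ids):
        raise ValueError("tokens and object_ids must have the same length")
    groups = defaultdict(list)
    for tok, oid in zip(tokens, object_ids):
        groups[oid].append(tok)
    merged = {}
    for oid, toks in groups.items():
        dim = len(toks[0])
        merged[oid] = [sum(t[d] for t in toks) / len(toks) for d in range(dim)]
    return merged


# Five patch tokens covering three objects are compressed to three tokens.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
object_ids = [0, 0, 1, 1, 2]
merged = merge_tokens_by_object(tokens, object_ids)
print(merged)  # {0: [2.0, 0.0], 1: [0.0, 3.0], 2: [5.0, 5.0]}
```

With this kind of merging the token count depends on the number of objects in the image rather than on its resolution, which is one way a compression ratio could adapt to content.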
[CV-84] RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action
[Quick read]: This paper addresses the shortcomings of current household tidying benchmarks, which do not model user preferences, do not support mobility, and generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. The key to the solution is RoboTidy, a unified benchmark that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. It provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes, 500 objects and containers, and collision support, together with 6.4k high-quality manipulation demonstration trajectories and 1.5k navigation trajectories for both few-shot and large-scale training. RoboTidy is also deployed in the real world for end-to-end household tidying evaluation, bridging a key gap in embodied AI.
Link: https://arxiv.org/abs/2511.14161
Authors: Xiaoquan Sun,Ruijian Zhang,Kang Pang,Bingchen Miao,Yuxiang Tan,Zhen Yang,Ming Li,Jiayu Chen
Affiliations: Huazhong University of Science and Technology; The University of Hong Kong; INFIFORCE Intelligent Technology Co., Ltd.; Zhejiang University; Guangming Lab, Shenzhen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes (covering 500 objects and containers) with collisions, formulates tidying as an “Action (Object, Container)” list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k navigation trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.
[CV-85] MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
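[Quick read]: This paper addresses the insufficient evaluation of how robust Large Vision-Language Models (LVLMs) are to misleading visual inputs. Existing benchmarks focus mainly on hallucination or misleading textual inputs and overlook the interference that the visual input itself can introduce, which limits a full assessment of LVLMs' visual understanding. The authors propose MVI-Bench, the first comprehensive benchmark designed specifically for Misleading Visual Inputs (MVIs). Grounded in fundamental visual primitives, it organizes misleading visual inputs into three hierarchical levels: Visual Concept, Visual Attribute, and Visual Relationship. Based on this taxonomy, the authors curate six representative categories and 1,248 expertly annotated visual question answering (VQA) instances. The paper also introduces MVI-Sensitivity, a new metric for fine-grained quantification of LVLM robustness, providing actionable insights for model improvement.
Link: https://arxiv.org/abs/2511.14159
Authors: Huiyi Chen,Jiawei Peng,Dehai Min,Changchang Sun,Kaijie Chen,Yan Yan,Xu Yang,Lu Cheng
Affiliations: Southeast University; Tongji University; University of Illinois at Chicago
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures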
Abstract:Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at this https URL.
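[CV-86] Learning Representation and Synergy Invariances: A Provable Framework for Generalized Multimodal Face Anti-Spoofing
[Quick read]: This paper addresses the severe performance degradation of Multimodal Face Anti-Spoofing (FAS) methods when deployed in unseen domains. The problem stems from two overlooked risks. The first is the modal representation invariant risk: modal representations struggle to remain generalizable under domain shift, and the class asymmetry in FAS (compact real faces vs. diverse spoofs) enlarges the upper bound of the generalization error, an effect that is further amplified in multimodal settings. The second is the modal synergy invariant risk: models overfit to domain-specific inter-modal correlations, and this spurious synergy does not transfer to unseen attack types in target domains. The proposed solution is a provable framework, Multimodal Representation and Synergy Invariance Learning (RiSe), with two modules. Asymmetric Invariant Risk Minimization (AsyIRM) learns an invariant spherical decision boundary in radial space to fit the asymmetric distributions while preserving domain cues in angular space. Multimodal Synergy Disentanglement (MMSD) is a self-supervised task that enhances intrinsic, generalizable modal features through cross-sample mixing and disentanglement. Theoretical analysis and experiments show that the framework improves cross-domain generalization and achieves state-of-the-art performance.
Link: https://arxiv.org/abs/2511.14157
Authors: Xun Lin,Shuai Wang,Yi Yu,Zitong Yu,Jiale Zhou,Yizhong Liu,Xiaochun Cao,Alex Kot,Yefeng Zheng
Affiliations: Beihang University; Nanyang Technological University; Great Bay University; Westlake University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: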
Abstract:Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.
[CV-87] Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
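[Quick read]: This paper addresses high-accuracy 3D shape reconstruction of diverse everyday objects under complete occlusion, a capability with applications in robotics, augmented reality, and logistics. Past millimeter-wave (mmWave) reconstruction methods suffer from limited coverage and high noise, and so cannot recover the full geometry of occluded objects. The key to the solution is Wave-Former, built around a three-stage, physics-aware shape completion pipeline: it first proposes candidate geometric surfaces, then applies a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. The method is trained entirely on synthetic point clouds yet generalizes well to real-world data, raising recall from 54% to 72% while maintaining a high precision of 85%.
Link: https://arxiv.org/abs/2511.14152
Authors: Laura Dodds,Maisy Lam,Waleed Akbar,Yibo Cheng,Fadel Adib
Affiliations: Massachusetts Institute of Technology; Cartesian Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: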
Abstract:We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former’s design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data. In head-to-head comparisons with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.
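[CV-88] iGaussian: Real-Time Camera Pose Estimation via Feed-Forward 3D Gaussian Splatting Inversion IROS2025
[Quick read]: This paper addresses camera pose estimation from a single image using a pre-built 3D Gaussian scene model. Existing methods rely on an iterative render-compare-refine loop, which incurs significant computational overhead and hinders real-time use in robotics. The key to the solution is iGaussian, a two-stage feed-forward framework. The first stage regresses a coarse 6DoF pose with a Gaussian Scene Prior-based Pose Regression Network that uses spatial uniform sampling and guided attention; the second stage refines the pose through feature matching and multi-model fusion. The core contributions are a cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, and a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. The method achieves 2.87 FPS tracking on mobile robots, a 10x speedup over optimization-based approaches.
Link: https://arxiv.org/abs/2511.14149
Authors: Hao Wang,Linqing Zhao,Xiuwei Xu,Jiwen Lu,Haibin Yan
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IROS 2025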
Abstract:Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative render-compare-refine loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, which is an impressive 10 times speedup compared to optimization-based approaches. Code: this https URL
[CV-89] SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
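[Quick read]: This paper addresses the performance bottleneck in Video Moment Retrieval caused by reliance on coarse temporal understanding and a single visual modality, which limits localization accuracy on complex videos. The key to the solution is SMART, a framework based on Multimodal Large Language Models (MLLMs) that integrates audio cues and leverages shot-level temporal structure to enrich multimodal representations. It applies Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy while preserving fine-grained temporal details, and it refines prompt design to make better use of audio-visual cues.
Link: https://arxiv.org/abs/2511.14143
Authors: An Yu,Weiheng Lu,Jian Li,Zhenfei Zhang,Yunhang Shen,Felix X.-F. Ye,Ming-Ching Chang
Affiliations: University at Albany - SUNY; Peking University; Nanjing University; Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: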
Abstract:Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and a 2.59% gain in R1@0.7 on Charades-STA.
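The R1@0.5 and R1@0.7 figures quoted above are recall-at-1 scores at a temporal IoU threshold: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with IoU at or above the threshold. The sketch below shows that computation on made-up segments; the function names and the example timestamps are illustrative and are not taken from the paper or from Charades-STA.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, ground_truths, iou_threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold
        for p, g in zip(top1_preds, ground_truths)
    )
    return hits / len(ground_truths)


# Three hypothetical queries: IoUs are 5/7, 0.0, and 0.5 respectively.
top1_preds = [(2.0, 8.0), (0.0, 4.0), (10.0, 20.0)]
ground_truths = [(3.0, 9.0), (5.0, 9.0), (10.0, 15.0)]

r1_05 = recall_at_1(top1_preds, ground_truths, 0.5)  # 2 of 3 queries hit
r1_07 = recall_at_1(top1_preds, ground_truths, 0.7)  # 1 of 3 queries hits
print(round(r1_05, 3), round(r1_07, 3))  # 0.667 0.333
```

The stricter 0.7 threshold only rewards predictions whose boundaries are close to the ground truth, which is why gains there are usually read as evidence of finer temporal localization.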
[CV-90] Attention Via Convolutional Nearest Neighbors
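[Quick read]: This paper addresses the prevailing view that Convolutional Neural Networks (CNNs) and Transformers are fundamentally distinct architectures in computer vision, a view that limits understanding of how they are related and hinders the design of better architectures. The key to the solution is a unified k-nearest neighbor aggregation framework, Convolutional Nearest Neighbors (ConvNN). Its central insight is that convolution selects neighbors by spatial proximity while self-attention selects them by feature similarity, so both are special cases of neighbor selection and aggregation that lie on a continuous spectrum. ConvNN can serve as a drop-in replacement for convolutional or attention layers, enabling systematic exploration of the region between the two extremes without changing the network structure, which balances local and global receptive fields, improves model performance, and enhances interpretability.
Link: https://arxiv.org/abs/2511.14137
Authors: Mingi Kang,Jeová Farias Sales Rocha Neto
Affiliations: Bowdoin College
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: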
Abstract:The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and aggregation; convolution selects neighbors by spatial proximity, while attention selects by feature similarity, revealing they exist on a continuous spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. Crucially, ConvNN serves as a drop-in replacement for convolutional and attention layers, enabling systematic exploration of the intermediate spectrum between these two extremes. We validate the framework’s coherence on CIFAR-10 and CIFAR-100 classification tasks across two complementary architectures: (1) Hybrid branching in VGG improves accuracy on both CIFAR datasets by combining spatial-proximity and feature-similarity selection; and (2) ConvNN in ViT outperforms standard attention and other attention variants on both datasets. Extensive ablations on k values and architectural variants reveal that interpolating along this spectrum provides regularization benefits by balancing local and global receptive fields. Our work provides a unifying framework that dissolves the apparent distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.
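The "neighbor selection and aggregation" view can be sketched on a toy 1D sequence. This is not the ConvNN layer itself: the function `knn_aggregate`, the plain unweighted mean, and the two distance functions are assumptions chosen to show only the selection step, whereas a real convolution applies learned weights and attention applies softmax weights.

```python
def knn_aggregate(features, k, distance):
    """For each position, average the features of its k nearest neighbors.

    distance(i, j) decides what "near" means; ties are broken by index.
    """
    n = len(features)
    out = []
    for i in range(n):
        neighbors = sorted(range(n), key=lambda j: (distance(i, j), j))[:k]
        dim = len(features[i])
        out.append([sum(features[j][d] for j in neighbors) / k for d in range(dim)])
    return out


# Two interleaved groups of similar features along a 1D sequence.
features = [[0.0], [10.0], [0.1], [10.1]]


def spatial_distance(i, j):
    """Convolution-like selection: neighbors are close in position."""
    return abs(i - j)


def feature_distance(i, j):
    """Attention-like selection: neighbors are close in feature space."""
    return sum((a - b) ** 2 for a, b in zip(features[i], features[j]))


by_position = knn_aggregate(features, k=2, distance=spatial_distance)
by_feature = knn_aggregate(features, k=2, distance=feature_distance)

print([round(v[0], 2) for v in by_position])  # [5.0, 5.0, 5.05, 5.1]
print([round(v[0], 2) for v in by_feature])   # [0.05, 10.05, 0.05, 10.05]
```

Position-based selection blurs the two groups together, while feature-based selection keeps them apart; a distance that mixes the two terms would sit somewhere on the spectrum between these extremes.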
[CV-91] Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models
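[Quick read]: This paper addresses the lack of fine-grained analysis of the cognitive phases of pedestrian behavior in pedestrian-vehicle incident analysis: existing video systems can detect that an incident occurred but cannot reveal how it unfolds across behavioral phases. The key to the solution is Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR), a framework that turns raw multi-view video into structured diagnostic reports in four stages: event-triggered multi-view video acquisition, pedestrian behavior phase segmentation grounded in behavioral theory, synchronized multi-view reasoning within each phase, and a final synthesis into causal chains and prevention strategies. The core novelty is two specialized vision-language models (VLMs): TG-VLM for behavioral phase segmentation (mIoU = 0.4881) and PhaVR-VLM for phase-aware multi-view analysis (up to 64.70% question-answering accuracy). A large language model then generates comprehensive reports covering scene understanding, behavior interpretation, causal reasoning, and prevention recommendations, advancing AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems.
Link: https://arxiv.org/abs/2511.14120
Authors: Hao Zhen,Yunxiang Yang,Jidong J. Yang
Affiliations: University of Georgia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures, 3 tables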
Abstract:Pedestrian-vehicle incidents remain a critical urban safety challenge, with pedestrians accounting for over 20% of global traffic fatalities. Although existing video-based systems can detect when incidents occur, they provide little insight into how these events unfold across the distinct cognitive phases of pedestrian behavior. Recent vision-language models (VLMs) have shown strong potential for video understanding, but they remain limited in that they typically process videos in isolation, without explicit temporal structuring or multi-view integration. This paper introduces Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR), a unified framework that systematically processes multi-view video streams into structured diagnostic reports through four stages: (1) event-triggered multi-view video acquisition, (2) pedestrian behavior phase segmentation, (3) phase-specific multi-view reasoning, and (4) hierarchical synthesis and diagnostic reasoning. The framework operationalizes behavioral theory by automatically segmenting incidents into cognitive phases, performing synchronized multi-view analysis within each phase, and synthesizing results into causal chains with targeted prevention strategies. Particularly, two specialized VLMs underpin the MP-PVIR pipeline: TG-VLM for behavioral phase segmentation (mIoU = 0.4881) and PhaVR-VLM for phase-aware multi-view analysis, achieving a captioning score of 33.063 and up to 64.70% accuracy on question answering. Finally, a designated large language model is used to generate comprehensive reports detailing scene understanding, behavior interpretation, causal reasoning, and prevention recommendations. Evaluation on the Woven Traffic Safety dataset shows that MP-PVIR effectively translates multi-view video data into actionable insights, advancing AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems.
[CV-92] Coffee: Controllable Diffusion Fine-tuning
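[Quick read]: This paper addresses the difficulty of preventing text-to-image diffusion models from learning undesired concepts present in user data during fine-tuning; such concepts can become entangled with user prompts and cause the model to generate unintended content. The key to the solution is Coffee, which uses language descriptions to specify undesired concepts and regularize the fine-tuning process. Its core mechanism keeps the embeddings of the user prompt from aligning with the undesired concepts, enabling controllable fine-tuning. Coffee requires no additional training, and the concepts to avoid can be changed flexibly by editing the textual descriptions. Experiments show that it prevents the model from learning the specified undesired concepts and outperforms existing methods.
Link: https://arxiv.org/abs/2511.14113
Authors: Ziyao Zeng,Jingcheng Ni,Ruyi Liu,Alex Wong
Affiliations: Yale University; Brown University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: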
Abstract:Text-to-image diffusion models can generate diverse content with flexible prompts, which makes them well-suited for customization through fine-tuning with a small amount of user-provided data. However, controllable fine-tuning that prevents models from learning undesired concepts present in the fine-tuning data, and from entangling those concepts with user prompts, remains an open challenge. It is crucial for downstream tasks like bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy. We propose Coffee that allows using language to specify undesired concepts to regularize the adaptation process. The crux of our method lies in keeping the embeddings of the user prompt from aligning with undesired concepts. Crucially, Coffee requires no additional training and enables flexible modification of undesired concepts by modifying textual descriptions. We evaluate Coffee by fine-tuning on images associated with user prompts paired with undesired concepts. Experimental results demonstrate that Coffee can prevent text-to-image models from learning specified undesired concepts during fine-tuning and outperforms existing methods. Code will be released upon acceptance.
[CV-93] CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer
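[Quick read]: This paper addresses the high computational, memory, and energy demands that hinder deployment of Vision Transformers (ViTs) on resource-constrained platforms. The solution is Cascaded-ViT (CViT), a lightweight and compute-efficient ViT architecture whose key component is a novel feedforward design, the Cascaded-Chunk Feed Forward Network (CCFFN). By splitting the input features into chunks, CCFFN improves parameter and floating-point operation (FLOP) efficiency without sacrificing accuracy, which lowers energy consumption. On ImageNet-1K, CViT reduces FLOPs by 15% and energy consumption by 3.3% compared with EfficientViT-M5 at similar accuracy, and it consistently ranks at the top on the Accuracy-Per-FLOP (APF) metric, making it suitable for battery-constrained devices such as mobile phones and drones.
Link: https://arxiv.org/abs/2511.14111
Authors: Srivathsan Sivakumar,Faisal Z. Qureshi
Affiliations: Ontario Tech University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: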
Abstract:Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose Cascaded-ViT (CViT), a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called Cascaded-Chunk Feed Forward Network (CCFFN). By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called Accuracy-Per-FLOP (APF), which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Particularly, CViT-L is 2.2% more accurate than EfficientViT-M2 while having comparable APF scores.
[CV-94] A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors
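[Quick read]: This paper addresses a limitation of optimal transport-based feature aggregation in Visual Place Recognition (VPR): the standard Sinkhorn algorithm treats source and target marginals symmetrically, which limits matching quality when image features and cluster centers have substantially different distributions. The key to the solution is an asymmetric aggregation VPR method with geometric constraints, A²GC-VPR. Its main contributions are (1) row-column normalization averaging with separate marginal calibration, which enables asymmetric feature assignment that adapts to distributional discrepancies; and (2) geometric constraints built from learnable coordinate embeddings, where spatial compatibility scores are fused with feature similarities to improve the spatial consistency of locally aggregated descriptors and thereby matching accuracy and robustness.
Link: https://arxiv.org/abs/2511.14109
Authors: Zhenyu Li,Tianyi Shang
Affiliations: Shandong Academy of Sciences; Fuzhou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures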
Abstract:Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called A²GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.
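For context, the baseline that the abstract contrasts against can be sketched as follows. This is the standard Sinkhorn iteration for entropic optimal transport, not the asymmetric A²GC-VPR variant, whose exact normalization the abstract does not specify. The score matrix, the uniform marginals, and the `eps` temperature are illustrative assumptions.

```python
import math


def sinkhorn(scores, row_marginals, col_marginals, eps=1.0, n_iters=200):
    """Standard Sinkhorn: alternately rescale rows and columns of exp(scores / eps)
    until the assignment matrix matches both prescribed marginals."""
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    n, m = len(kernel), len(kernel[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row step and column step have the same form: this is the symmetric
        # treatment of source and target marginals that the abstract refers to.
        u = [row_marginals[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginals[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Three local features (rows) softly assigned to two clusters (columns).
scores = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
row_marginals = [1.0 / 3.0] * 3
col_marginals = [0.5, 0.5]

plan = sinkhorn(scores, row_marginals, col_marginals)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(2)]
```

After convergence every row of `plan` sums to its row marginal and every column to its column marginal, so features and clusters are constrained in exactly the same way regardless of how differently they are distributed.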
[CV-95] RTS-Mono: A Real-Time Self-Supervised Monocular Depth Estimation Method for Real-World Deployment
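[Quick read]: This paper addresses the difficulty of deploying self-supervised monocular depth estimation models in practice because of their heavy computational cost and model complexity. Existing lightweight methods often trade away performance, which limits their usefulness in autonomous driving and intelligent robot navigation. The key to the solution is RTS-Mono, a lightweight and efficient encoder-decoder architecture: the encoder is based on Lite-Encoder to reduce computation, and the decoder uses a multi-scale sparse fusion framework that reduces redundancy while preserving accuracy and improving inference speed. On the KITTI dataset, RTS-Mono achieves state-of-the-art (SoTA) performance with only 3M parameters, and in real-world deployment it runs real-time inference at 49 FPS on an Nvidia Jetson Orin.
Link: https://arxiv.org/abs/2511.14107
Authors: Zeyu Cheng,Tongfei Liu,Tao Lei,Xiang Hua,Yi Zhang,Chengkai Tang
Affiliations: Shaanxi University of Science and Technology; Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures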
Abstract:Depth information is crucial for autonomous driving and intelligent robot navigation. The simplicity and flexibility of self-supervised monocular depth estimation are conducive to its role in these fields. However, most existing monocular depth estimation models consume many computing resources. Although some methods have reduced the model’s size and improved computing efficiency, their performance deteriorates, seriously hindering the real-world deployment of self-supervised monocular depth estimation models. To address this problem, we propose a real-time self-supervised monocular depth estimation method and deploy it in the real world. It is called RTS-Mono, which is a lightweight and efficient encoder-decoder architecture. The encoder is based on Lite-Encoder, and the decoder is designed with a multi-scale sparse fusion framework to minimize redundancy, ensure performance, and improve inference speed. RTS-Mono achieved state-of-the-art (SoTA) performance in high and low resolutions with extremely low parameter counts (3 M) in experiments based on the KITTI dataset. Compared with lightweight methods, RTS-Mono improved Abs Rel and Sq Rel by 5.6% and 9.8% at low resolution and improved Sq Rel and RMSE by 6.1% and 1.9% at high resolution. In real-world deployment experiments, RTS-Mono has extremely high accuracy and can perform real-time inference on Nvidia Jetson Orin at a speed of 49 FPS. Source code is available at this https URL.
[CV-96] Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations
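[Quick Read]: This paper addresses the difficulty of handling implicit queries in text-driven video editing: when users describe an edit through semantic properties or object relationships rather than explicit targets, existing methods fail because they cannot identify what to edit. The key to the solution is a new task, reasoning video editing, together with RIVER (Reasoning-based Implicit Video Editor), the first model designed for it. RIVER uses a digital twin representation that preserves the video's spatial relationships, temporal trajectories, and semantic attributes, and it decouples reasoning from generation: a large language model (LLM) performs multi-hop reasoning over the digital twin and the implicit query, then outputs structured instructions that guide a diffusion model to make pixel-level changes. The model is trained with reinforcement learning, using rewards that assess both reasoning accuracy and generation quality, which improves its ability to understand and carry out complex implicit edits.
Link: https://arxiv.org/abs/2511.14100
Authors: Yiqing Shen,Chenjia Li,Mathias Unberath
Affiliations: Johns Hopkins University (约翰霍普金斯大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: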
Abstract:Text-driven video editing enables users to modify video content only using text queries. While existing methods can modify video content if explicit descriptions of editing targets with precise spatial locations and temporal boundaries are provided, these requirements become impractical when users attempt to conceptualize edits through implicit queries referencing semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, and a first model attempting to solve this complex task, RIVER (Reasoning-based Implicit Video Editor). RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model then processes this representation jointly with the implicit query, performing multi-hop reasoning to determine modifications, then outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark of 100 videos with 519 implicit queries spanning three levels and categories of reasoning complexity specifically for reasoning video editing. RIVER demonstrates best performance on the proposed RVEBenchmark and also achieves state-of-the-art performance on two additional video editing benchmarks (VegGIE and FiVE), where it surpasses six baseline methods.
[CV-97] FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
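[Quick Read]: This paper addresses the problem that All-in-One Image Restoration (AIO-IR) models struggle to adapt to the many degradation types found in complex real-world scenes. Existing methods usually rely on task-specific designs or latent routing strategies, which limits generalization. The key to the solution is the Frequency-Aware Planning and Execution framework (FAPE-IR). It uses a frozen Multimodal Large Language Model (MLLM) as a planner that analyzes the degraded image and produces a concise, frequency-aware restoration plan. The plan guides a LoRA-based Mixture-of-Experts (LoRA-MoE) module inside a diffusion executor, which dynamically selects high- or low-frequency experts and also uses the frequency features of the input image. Adversarial training and a frequency regularization loss further improve restoration quality and reduce artifacts, yielding a unified and interpretable restoration approach.
Link: https://arxiv.org/abs/2511.14099
Authors: Jingren Liu,Shuning Xu,Qirui Yang,Yun Wang,Xiangyu Chen,Zhong Ji
Affiliations: Tianjin University (天津大学); University of Macau (澳门大学); City University of Hong Kong (香港城市大学); Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信人工智能研究院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: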
Abstract:All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
[CV-98] BCE3S: Binary Cross-Entropy Based Tripartite Synergistic Learning for Long-tailed Recognition AAAI-2026
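[Quick Read]: This paper targets long-tailed recognition (LTR), where feature representations struggle to achieve both intra-class compactness and inter-class separability. Methods based on the cross-entropy (CE) loss not only have trouble learning features with these properties, they also couple the imbalanced classifier vectors in the Softmax denominator, which amplifies the effect of class imbalance. The key to the solution is BCE3S, a tripartite synergistic learning framework based on binary cross-entropy (BCE), with three components: (1) BCE-based joint learning uses multiple Sigmoid functions to decouple the metrics between features and the imbalanced classifier vectors, improving compactness and separability; (2) BCE-based contrastive learning further strengthens intra-class compactness; (3) BCE-based uniform learning balances the separability among all classifier vectors and interacts with joint learning to refine the features. The method achieves state-of-the-art results on several long-tailed datasets.
Link: https://arxiv.org/abs/2511.14097
Authors: Weijia Fan,Qiufu Li,Jiajun Wen,Xiaoyang Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: [AAAI-2026] code: this https URL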
Abstract:For long-tailed recognition (LTR) tasks, high intra-class compactness and inter-class separability in both head and tail classes, as well as balanced separability among all the classifier vectors, are preferred. The existing LTR methods based on cross-entropy (CE) loss not only struggle to learn features with desirable properties but also couple imbalanced classifier vectors in the denominator of its Softmax, amplifying the imbalance effects in LTR. In this paper, for the LTR, we propose a binary cross-entropy (BCE)-based tripartite synergistic learning, termed BCE3S, which consists of three components: (1) BCE-based joint learning optimizes both the classifier and sample features, which achieves better compactness and separability among features than the CE-based joint learning, by decoupling the metrics between feature and the imbalanced classifier vectors in multiple Sigmoid; (2) BCE-based contrastive learning further improves the intra-class compactness of features; (3) BCE-based uniform learning balances the separability among classifier vectors and interactively enhances the feature properties by combining with the joint learning. The extensive experiments show that the LTR model trained by BCE3S not only achieves higher compactness and separability among sample features, but also balances the classifier’s separability, achieving SOTA performance on various long-tailed datasets such as CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist2018.
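To make the coupling argument concrete, the sketch below contrasts a Softmax cross-entropy loss with a one-vs-all Sigmoid BCE loss on a single sample. It only illustrates the loss-level difference the abstract describes; it is not the BCE3S objective, and the function names are illustrative assumptions.

```python
import math

def softmax_ce(logits, target):
    """Cross-entropy with Softmax: every class logit enters one shared denominator."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def one_vs_all_bce(logits, target):
    """Sum of independent Sigmoid BCE terms, one per class (no shared denominator)."""
    loss = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target else 0.0
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        loss += max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
    return loss
```

In `one_vs_all_bce`, each class contributes its own term, so changing the logit of a head class does not rescale the term of a tail class; in `softmax_ce`, all logits interact through the log-sum-exp denominator.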
[CV-99] SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts
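[Quick Read]: This paper addresses accuracy and real-time performance in cross-view object geo-localization, that is, precisely locating the same object in drone and satellite imagery. Traditional multi-stage "retrieval-matching" pipelines are prone to cumulative errors because of large viewpoint and scale differences and complex backgrounds. The key to the solution is SMGeo, a promptable end-to-end Transformer model. It uses a Swin-Transformer to jointly encode features from both modalities and an anchor-free Transformer detection head that regresses coordinates directly, avoiding the scale bias and matching complexity of predefined anchor boxes. It also introduces a grid-level sparse Mixture-of-Experts (GMoE) that dynamically activates experts suited to the content, scale, and source of each grid, strengthening the modeling of inter-modal and intra-view dependencies. Experiments show clear gains over existing methods such as DetGeo on accuracy at IoU=0.25 and on mIoU, and the model supports interactive click prompts for real-time localization.
Link: https://arxiv.org/abs/2511.14093
Authors: Fan Zhang,Haoyuan Ren,Fei Ma,Qiang Yin,Yongsheng Zhou
Affiliations: Beijing University of Chemical Technology (北京化工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: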
Abstract:Cross-view object Geo-localization aims to precisely pinpoint the same object across large-scale satellite imagery based on drone images. Due to significant differences in viewpoint and scale, coupled with complex background interference, traditional multi-stage “retrieval-matching” pipelines are prone to cumulative errors. To address this, we present SMGeo, a promptable end-to-end transformer-based model for object Geo-localization. This model supports click prompting and can output object Geo-localization in real time when prompted to allow for interactive use. The model employs a fully transformer-based architecture, utilizing a Swin-Transformer for joint feature encoding of both drone and satellite imagery and an anchor-free transformer detection head for coordinate regression. In order to better capture both inter-modal and intra-view dependencies, we introduce a grid-level sparse Mixture-of-Experts (GMoE) into the cross-view encoder, allowing it to adaptively activate specialized experts according to the content, scale and source of each grid. We also employ an anchor-free detection head for coordinate regression, directly predicting object locations via heat-map supervision in the reference images. This approach avoids scale bias and matching complexity introduced by predefined anchor boxes. On the drone-to-satellite task, SMGeo achieves leading performance in accuracy at IoU=0.25 and mIoU metrics (e.g., 87.51%, 62.50%, and 61.45% in the test set, respectively), significantly outperforming representative methods such as DetGeo (61.97%, 57.66%, and 54.05%, respectively). Ablation studies demonstrate complementary gains from shared encoding, query-guided fusion, and grid-level sparse mixture-of-experts.
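The reported numbers are accuracy at an IoU threshold and mean IoU. The following is a minimal sketch of how such metrics are typically computed for axis-aligned boxes; the `(x1, y1, x2, y2)` box layout and the function names are assumptions for illustration, not the SMGeo evaluation code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_miou(preds, gts, threshold):
    """Return (fraction of predictions with IoU >= threshold, mean IoU)."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    hits = sum(1 for iou in ious if iou >= threshold)
    return hits / len(ious), sum(ious) / len(ious)
```

A call such as `accuracy_and_miou(preds, gts, 0.25)` corresponds to the "accuracy at IoU=0.25" style of metric named in the abstract.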
[CV-100] GCA-ResUNet: Image segmentation in medical images using grouped coordinate attention
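[Quick Read]: This paper addresses two limits in medical image segmentation: convolutional networks struggle to capture long-range dependencies, while Transformer-based methods carry heavy computation and large data requirements. The key to the solution is GCA-ResUNet, whose core idea is to embed a Grouped Coordinate Attention (GCA) module into the residual blocks of ResNet-50. Grouped coordinate modeling jointly encodes global dependencies across channels and spatial locations, strengthening feature representation and boundary delineation without a significant increase in parameters or floating-point operations (FLOPs).
Link: https://arxiv.org/abs/2511.14087
Authors: Jun Ding,Shang Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: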
Abstract:Medical image segmentation underpins computer-aided diagnosis and therapy by supporting clinical diagnosis, preoperative planning, and disease monitoring. While U-Net style convolutional neural networks perform well due to their encoder-decoder structures with skip connections, they struggle to capture long-range dependencies. Transformer-based variants address global context but often require heavy computation and large training datasets. This paper proposes GCA-ResUNet, an efficient segmentation network that integrates Grouped Coordinate Attention (GCA) into ResNet-50 residual blocks. GCA uses grouped coordinate modeling to jointly encode global dependencies across channels and spatial locations, strengthening feature representation and boundary delineation while adding minimal parameter and FLOP overhead compared with self-attention. On the Synapse dataset, GCA-ResUNet achieves a Dice score of 86.11%, and on the ACDC dataset, it reaches 92.64%, surpassing several state-of-the-art baselines while maintaining fast inference and favorable computational efficiency. These results indicate that GCA offers a practical way to enhance convolutional architectures with global modeling capability, enabling high-accuracy and resource-efficient medical image segmentation.
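The Synapse and ACDC results are reported as Dice scores. A minimal sketch of the Dice coefficient for binary masks follows; the flat-list mask layout, the `eps` smoothing term, and the function name are assumptions for illustration.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (truthy = foreground)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return (2.0 * inter + eps) / (total + eps)
```

Multi-class benchmarks usually compute this per organ or structure and then average; that averaging step is omitted here.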
[CV-101] Automated glenoid bone loss measurement and segmentation in CT scans for pre-operative planning in shoulder instability
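[Quick Read]: This paper addresses the reliability of glenoid bone loss measurement for surgical planning in shoulder instability; current manual and semi-automated methods are time-consuming and suffer from inter-reader variability. The key to the solution is a fully automated deep learning pipeline with three stages: (1) U-Net-based segmentation of the glenoid and humerus; (2) a second network that predicts glenoid rim landmarks; and (3) geometric computation using principal component analysis (PCA), projection, and circle fitting, which implements the linear-based, en-face view, best-circle method. Validated on data from 91 patients, the automated measurements agree strongly with consensus readings (ICC=0.84), exceed surgeon-to-surgeon consistency (ICC=0.78), and remain stable in both low- and high-bone-loss subgroups, suggesting the method can generalize to clinical use.
Link: https://arxiv.org/abs/2511.14083
Authors: Zhonghao Liu,Hanxue Gu,Qihang Li,Michael Fox,Jay M. Levin,Maciej A. Mazurowski,Brian C. Lau
Affiliations: Duke University (杜克大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: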
Abstract:Reliable measurement of glenoid bone loss is essential for operative planning in shoulder instability, but current manual and semi-automated methods are time-consuming and often subject to interreader variability. We developed and validated a fully automated deep learning pipeline for measuring glenoid bone loss on three-dimensional computed tomography (CT) scans using a linear-based, en-face view, best-circle method. Shoulder CT images of 91 patients (average age, 40 years; range, 14-89 years; 65 men) were retrospectively collected along with manual labels including glenoid segmentation, landmarks, and bone loss measurements. The multi-stage algorithm has three main stages: (1) segmentation, where we developed a U-Net to automatically segment the glenoid and humerus; (2) anatomical landmark detection, where a second network predicts glenoid rim points; and (3) geometric fitting, where we applied principal component analysis (PCA), projection, and circle fitting to compute the percentage of bone loss. The automated measurements showed strong agreement with consensus readings and exceeded surgeon-to-surgeon consistency (intraclass correlation coefficient (ICC) 0.84 vs 0.78), including in low- and high-bone-loss subgroups (ICC 0.71 vs 0.63 and 0.83 vs 0.21, respectively; P < 0.001). For classifying patients into low, medium, and high bone-loss categories, the pipeline achieved a recall of 0.714 for low and 0.857 for high severity, with no low cases misclassified as high or vice versa. These results suggest that our method is a time-efficient and clinically reliable tool for preoperative planning in shoulder instability and for screening patients with substantial glenoid bone loss. Code and dataset are available at this https URL.
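Stage (3) ends with a circle fit on rim points that have already been projected to a 2D en-face plane. The sketch below shows one standard way to do that step, an algebraic least-squares (Kasa) circle fit, followed by a linear width-over-diameter ratio. Both the choice of the Kasa fit and the helper `linear_bone_loss_percent` are assumptions for illustration; the abstract does not state the exact fitting method or formula, and the PCA and projection steps are omitted.

```python
import math

def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_circle(points):
    """Least-squares fit of x^2 + y^2 + D*x + E*y + F = 0; returns (cx, cy, radius)."""
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    z = [-(x * x + y * y) for x, y in points]
    rhs = [sum(x * zi for (x, _), zi in zip(points, z)),
           sum(y * zi for (_, y), zi in zip(points, z)),
           sum(z)]
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(points))]]
    det = _det3(m)
    if abs(det) < 1e-12:
        raise ValueError("points are collinear or too few for a circle fit")
    coeffs = []
    for col in range(3):  # Cramer's rule on the 3x3 normal equations
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(_det3(mc) / det)
    d, e, f = coeffs
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)

def linear_bone_loss_percent(defect_width, radius):
    """Hypothetical linear measure: defect width as a percentage of circle diameter."""
    return 100.0 * defect_width / (2.0 * radius)
```

In practice the circle would be fitted to the intact portion of the rim so that the missing bone does not bias the fit.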
[CV-102] Zero-Training Task-Specific Model Synthesis for Few-Shot Medical Image Classification
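[Quick Read]: This paper addresses the heavy dependence of deep learning models on large annotated datasets in medical image analysis, which is especially limiting for rare diseases where samples are scarce and expert annotation is expensive. The core challenge is building a high-performing, task-specific classifier from extremely little data (for example 1-shot or 5-shot). The key to the solution is a new paradigm, Zero-Training Task-Specific Model Synthesis (ZS-TMS): a pre-trained generative engine directly synthesizes all parameters of a task-specific classifier from minimal multi-modal input (such as a single image and a matching clinical text description), with no task-specific training or fine-tuning. The approach is implemented by the Semantic-Guided Parameter Synthesizer (SGPS), which improves classification in ultra-low-data regimes and offers a path to rapid deployment of AI diagnostic tools for the long tail of rare diseases.
Link: https://arxiv.org/abs/2511.14082
Authors: Yao Qin,Yangyang Yan,YuanChao Yang,Jinhua Pang,Huanyong Bi,Yuan Liu,HaiHua Wang
Affiliations: Beijing 1st BioTech Group Co., Ltd.(北京首生生物科技有限公司); China Foreign Affairs University(外交学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: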
Abstract:Deep learning models have achieved remarkable success in medical image analysis but are fundamentally constrained by the requirement for large-scale, meticulously annotated datasets. This dependency on “big data” is a critical bottleneck in the medical domain, where patient data is inherently difficult to acquire and expert annotation is expensive, particularly for rare diseases where samples are scarce by definition. To overcome this fundamental challenge, we propose a novel paradigm: Zero-Training Task-Specific Model Synthesis (ZS-TMS). Instead of adapting a pre-existing model or training a new one, our approach leverages a large-scale, pre-trained generative engine to directly synthesize the entire set of parameters for a task-specific classifier. Our framework, the Semantic-Guided Parameter Synthesizer (SGPS), takes as input minimal multi-modal task information: as little as a single example image (1-shot) and a corresponding clinical text description. The generative engine interprets these inputs to generate the weights for a lightweight, efficient classifier (e.g., an EfficientNet-V2), which can be deployed for inference immediately without any task-specific training or fine-tuning. We conduct extensive evaluations on challenging few-shot classification benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset. Our results demonstrate that SGPS establishes a new state-of-the-art, significantly outperforming advanced few-shot and zero-shot learning methods, especially in the ultra-low data regimes of 1-shot and 5-shot classification. This work paves the way for the rapid development and deployment of AI-powered diagnostic tools, particularly for the long tail of rare diseases where data is critically limited.
[CV-103] CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
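[Quick Read]: This paper addresses the excessive compute and memory cost of Large Vision-Language Models (LVLMs), caused by the quadratic growth of visual tokens with image resolution. Existing token compression methods often lack high-level semantic understanding, leading to poor merges, redundant information, or lost context. The key to the solution is CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE uses an efficient segmentation decoder to generate object masks, which act as a high-level semantic prior that guides the merging of visual tokens into compact object-centric representations. A novel centroid-guided sorting mechanism then restores the spatial order of the merged tokens and preserves key positional information. The design improves both compression efficiency and performance retention, with the best reported results in fixed-rate and adaptive-rate compression settings.
Link: https://arxiv.org/abs/2511.14072
Authors: Jingyu Lei,Gaoang Wang,Der-Horng Lee
Affiliations: Zhejiang University (浙江大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: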
Abstract:Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, retaining only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.
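The two ideas in the abstract, mask-guided merging and centroid-guided sorting, can be pictured with a small sketch. Here each token on a patch grid carries an object id from a segmentation mask; tokens of the same object are averaged, and the merged tokens are ordered by the centroid of their patches. Mean pooling and row-then-column ordering are assumptions chosen for illustration; the abstract does not specify CORE's actual merge operator or sort key.

```python
def merge_tokens_by_mask(tokens, mask_ids, grid_w):
    """Merge patch tokens that share an object id, then sort by patch centroid.

    tokens   : list of equal-length feature vectors, in raster order
    mask_ids : object id per token (hypothetical output of a segmentation decoder)
    grid_w   : number of patches per row of the token grid
    """
    groups = {}
    for idx, (tok, obj) in enumerate(zip(tokens, mask_ids)):
        row, col = divmod(idx, grid_w)
        g = groups.setdefault(obj, {"sum": [0.0] * len(tok), "n": 0, "row": 0.0, "col": 0.0})
        g["sum"] = [s + t for s, t in zip(g["sum"], tok)]
        g["n"] += 1
        g["row"] += row
        g["col"] += col
    merged = []
    for g in groups.values():
        n = g["n"]
        centroid = (g["row"] / n, g["col"] / n)
        merged.append((centroid, [s / n for s in g["sum"]]))
    merged.sort(key=lambda item: item[0])  # top-to-bottom, then left-to-right
    return [tok for _, tok in merged]
```

The output length equals the number of distinct objects, which is how a mask prior can shrink hundreds of patch tokens to a handful.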
[CV-104] Semantic Context Matters: Improving Conditioning for Autoregressive Models
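[Quick Read]: This paper addresses the poor instruction adherence and visual artifacts that autoregressive (AR) models show in general image editing, which stem from weak and inefficient conditioning. The key to the solution is SCAR (Semantic-Context-driven method for Autoregressive models), with two core components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact, efficient prefix to strengthen the conditioning input; and Semantic Alignment Guidance, which aligns the last visual hidden states with the target semantics during autoregressive decoding to improve instruction fidelity. Rather than injecting conditions at the decoding stage, the method builds on vector-quantized prefilling, improving generation quality and controllability while keeping the architecture simple.
Link: https://arxiv.org/abs/2511.14063
Authors: Dongyang Jin,Ryan Xu,Jianhao Zeng,Rui Lan,Yancheng Bai,Lei Sun,Xiangxiang Chu
Affiliations: Amap, Alibaba Group; Alibaba Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: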
Abstract:Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal systems compared to diffusion-based methods. However, extending AR models to general image editing remains challenging due to weak and inefficient conditioning, often leading to poor instruction adherence and visual artifacts. To address this, we propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns the last visual hidden states with target semantics during autoregressive decoding to enhance instruction fidelity. Unlike decoding-stage injection methods, SCAR builds upon the flexibility and generality of vector-quantized-based prefilling while overcoming its semantic limitations and high cost. It generalizes across both next-token and next-set AR paradigms with minimal architectural changes. SCAR achieves superior visual fidelity and semantic alignment on both instruction editing and controllable generation benchmarks, outperforming prior AR-based methods while maintaining controllability. All code will be released.
[CV-105] Saliency-Guided Deep Learning for Bridge Defect Detection in Drone Imagery
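[Quick Read]: This paper addresses automatic detection, localization, and classification of defects in concrete bridge structures, a key challenge in computer vision and pattern recognition. The key to the solution is a two-stage framework. The first stage uses saliency to generate defect region proposals, capturing the local discontinuities that defects create in the normal surface texture. The second stage applies a YOLOX-based deep learning detector for refined detection and classification after bounding-box level brightness augmentation of the salient defect regions. The framework improves accuracy and computational efficiency and has potential for deployment in self-powered inspection systems.
Link: https://arxiv.org/abs/2511.14040
Authors: Loucif Hebbache,Dariush Amirkhani,Mohand Saïd Allili,Jean-François Lapointe
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: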
Abstract:Anomaly object detection and classification are among the most challenging tasks in computer vision and pattern recognition. In this paper, we propose a new method to automatically detect, localize and classify defects in concrete bridge structures using drone imagery. The framework consists of two main stages. The first stage uses saliency for defect region proposals, where defects often exhibit local discontinuities in the normal surface patterns with regard to their surroundings. The second stage employs a YOLOX-based deep learning detector that operates on saliency-enhanced images obtained by applying bounding-box level brightness augmentation to salient defect regions. Experimental results on standard datasets confirm the performance of our framework and its suitability in terms of accuracy and computational efficiency, which gives it strong potential for implementation in a self-powered inspection system.
[CV-106] Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping WACV
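[Quick Read]: This paper addresses the high computational cost of traditional physics-based hydrodynamic models, which makes them hard to deploy for real-time, large-scale flood prediction, and the limited generalization of existing flood map super-resolution methods based on Convolutional Neural Networks (CNN) to unseen areas. The key to the solution is to use Latent Diffusion Models (LDM) to perform super-resolution on coarse-grid flood maps, reaching accuracy comparable to fine-grid models while sharply reducing inference time. The method also incorporates physics-informed inputs, which improves interpretability, and uses transfer learning to further strengthen generalization across geographic regions, offering a practical route to real-time flood risk management.
Link: https://arxiv.org/abs/2511.14033
Authors: Sun Han Neo,Sachith Seneviratne,Herath Mudiyanselage Viraj Vidura Herath,Abhishek Saha,Sanka Rasnayaka,Lucy Amanda Marshall
Affiliations: National University of Singapore (新加坡国立大学); University of Melbourne (墨尔本大学); University of Sydney (悉尼大学); Delft University of Technology (代尔夫特理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026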
Abstract:Flood prediction is critical for emergency planning and response to mitigate human and economic losses. Traditional physics-based hydrodynamic models generate high-resolution flood maps using numerical methods that require fine-grid discretization, which is computationally intensive and impractical for real-time large-scale applications. While recent studies have applied convolutional neural networks for flood map super-resolution with good accuracy and speed, they suffer from limited generalizability to unseen areas. In this paper, we propose a novel approach that leverages latent diffusion models to perform super-resolution on coarse-grid flood maps, with the objective of achieving the accuracy of fine-grid flood maps while significantly reducing inference time. Experimental results demonstrate that latent diffusion models substantially decrease the computational time required to produce high-fidelity flood maps without compromising on accuracy, enabling their use in real-time flood risk management. Moreover, diffusion models exhibit superior generalizability across different physical locations, with transfer learning further accelerating adaptation to new geographic regions. Our approach also incorporates physics-informed inputs, addressing the common limitation of black-box behavior in machine learning, thereby enhancing interpretability. Code is available at this https URL.
[CV-107] FashionMAC: Deformation-Free Fashion Image Generation with Fine-Grained Model Appearance Customization
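[Quick Read]: This paper addresses two key problems in garment-centric fashion image generation: faithfully preserving garment details, and gaining fine-grained control over the appearance of the generated human model. Existing methods usually deform the garment during generation, which distorts textures, and they lack dedicated mechanisms for controlling fine-grained attributes of the generated model. The key to the solution is FashionMAC, a diffusion-based, deformation-free framework. Its core idea is to directly outpaint the garment segmented from a dressed person, which avoids the detail loss caused by garment deformation. It also introduces a region-adaptive decoupled attention (RADA) mechanism with a chained mask injection strategy, so that each text attribute adaptively focuses on its predicted generation region, improving visual fidelity and controllability.
Link: https://arxiv.org/abs/2511.14031
Authors: Rong Zhang,Jinxiao Li,Jingnan Wang,Zhiwen Zuo,Jianfeng Dong,Wei Li,Chi Wang,Weiwei Xu,Xun Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: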
Abstract:Garment-centric fashion image generation aims to synthesize realistic and controllable human models dressing a given garment, which has attracted growing interest due to its practical applications in e-commerce. The key challenges of the task lie in two aspects: (1) faithfully preserving the garment details, and (2) gaining fine-grained controllability over the model’s appearance. Existing methods typically require performing garment deformation in the generation process, which often leads to garment texture distortions. Also, they fail to control the fine-grained attributes of the generated models, due to the lack of specifically designed mechanisms. To address these issues, we propose FashionMAC, a novel diffusion-based deformation-free framework that achieves high-quality and controllable fashion showcase image generation. The core idea of our framework is to eliminate the need for performing garment deformation and directly outpaint the garment segmented from a dressed person, which enables faithful preservation of the intricate garment details. Moreover, we propose a novel region-adaptive decoupled attention (RADA) mechanism along with a chained mask injection strategy to achieve fine-grained appearance controllability over the synthesized human models. Specifically, RADA adaptively predicts the generated regions for each fine-grained text attribute and enforces the text attribute to focus on the predicted regions by a chained mask injection strategy, significantly enhancing the visual fidelity and the controllability. Extensive experiments validate the superior performance of our framework compared to existing state-of-the-art methods.
[CV-108] Training-free Detection of AI-generated images via Cropping Robustness
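[Quick Read]: This paper addresses AI-generated image detection: how to accurately identify images produced by generative models (such as diffusion models and GANs) without relying on a specific training dataset. Traditional methods usually require training on particular datasets, whereas this paper proposes WaRPAD, a training-free detector. Its core idea is to exploit the invariance to RandomResizedCrop that self-supervised pre-trained models acquire in their feature extraction. The method first defines a base score as the sensitivity of the image embedding to perturbations along high-frequency directions extracted by Haar wavelet decomposition. It then simulates cropping augmentation by rescaling the image to a multiple of the model input size and splitting it into patches, computes the score for each patch, and averages the patch scores into the final detection score. The method does not depend on a particular generator or training data and is robust across models, resolutions, and test-time corruptions.
Link: https://arxiv.org/abs/2511.14030
Authors: Sungik Choi,Hankook Lee,Moontae Lee
Affiliations: LG AI Research (LG人工智能研究); SungkyunKwan University (成均馆大学); University of Illinois Chicago (芝加哥伊利诺伊大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: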
Abstract:AI-generated image detection has become crucial with the rapid advancement of vision-generative models. Instead of training detectors tailored to specific datasets, we study a training-free approach leveraging self-supervised models without requiring prior data knowledge. These models, pre-trained with augmentations like RandomResizedCrop, learn to produce consistent representations across varying resolutions. Motivated by this, we propose WaRPAD, a training-free AI-generated image detection algorithm based on self-supervised models. Since neighborhood pixel differences in images are highly sensitive to resizing operations, WaRPAD first defines a base score function that quantifies the sensitivity of image embeddings to perturbations along high-frequency directions extracted via Haar wavelet decomposition. To simulate robustness against cropping augmentation, we rescale each image to a multiple of the model's input size, divide it into smaller patches, and compute the base score for each patch. The final detection score is then obtained by averaging the scores across all patches. We validate WaRPAD on real datasets of diverse resolutions and domains, and images generated by 23 different generative models. Our method consistently achieves competitive performance and demonstrates strong robustness to test-time corruptions. Furthermore, as invariance to RandomResizedCrop is a common training scheme across self-supervised models, we show that WaRPAD is applicable across self-supervised models.
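The base score relies on high-frequency directions from a Haar wavelet decomposition. For readers unfamiliar with it, here is a minimal single-level 2D Haar transform on a grayscale image stored as nested lists. It covers only the decomposition; the embedding-sensitivity score and the patch averaging of WaRPAD are not reproduced, and the function name and sub-band labels are illustrative assumptions.

```python
def haar2d_level1(img):
    """One-level orthonormal 2D Haar transform of an even-sized grayscale image.

    Returns (approx, horiz, vert, diag): the low-frequency band and three detail
    bands measuring differences across columns, across rows, and along diagonals.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, h, 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            e, f = img[r + 1][c], img[r + 1][c + 1]
            row_a.append((a + b + e + f) / 2.0)  # local average (scaled)
            row_h.append((a - b + e - f) / 2.0)  # left minus right
            row_v.append((a + b - e - f) / 2.0)  # top minus bottom
            row_d.append((a - b - e + f) / 2.0)  # diagonal contrast
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, horiz, vert, diag
```

The three detail bands are the "high-frequency" content: they vanish on flat regions and respond to neighboring-pixel differences, which the abstract notes are sensitive to resizing.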
[CV-109] LINGUAL: Language-INtegrated GUidance in Active Learning for Medical Image Segmentation
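[Quick Read]: This paper addresses the high labor and cognitive load of active learning (AL) for medical image segmentation, in particular the annotation difficulty caused by blurry boundaries and the trade-off between annotated region size and annotation cost. The key to the solution is the LINGUAL framework, which takes expert input as natural language instructions, uses in-context learning to translate the descriptions into executable programs, and runs the resulting sequence of sub-tasks automatically without human intervention. This reduces estimated annotation time by about 80% while matching or exceeding conventional AL methods in active domain adaptation (ADA) settings.
Link: https://arxiv.org/abs/2511.14028
Authors: Md Shazid Islam,Shreyangshu Bera,Sudipta Paul,Amit K. Roy-Chowdhury
Affiliations: UC Riverside (加州大学河滨分校); Samsung Research America (三星美国研究院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: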
Abstract:Although active learning (AL) in segmentation tasks enables experts to annotate selected regions of interest (ROIs) instead of entire images, it remains highly challenging, labor-intensive, and cognitively demanding due to the blurry and ambiguous boundaries commonly observed in medical images. Also, in conventional AL, annotation effort is a function of the ROI: larger regions make the task cognitively easier but incur higher annotation costs, whereas smaller regions demand finer precision and more attention from the expert. In this context, language guidance provides an effective alternative, requiring minimal expert effort while bypassing the cognitively demanding task of precise boundary delineation in segmentation. Towards this goal, we introduce LINGUAL: a framework that receives natural language instructions from an expert, translates them into executable programs through in-context learning, and automatically performs the corresponding sequence of sub-tasks without any human intervention. We demonstrate the effectiveness of LINGUAL in active domain adaptation (ADA), achieving comparable or superior performance to AL baselines while reducing estimated annotation time by approximately 80%.
[CV-110] MRI Plane Orientation Detection using a Context-Aware 2.5D Model
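[Quick Read]: This paper addresses the difficulty of automatically identifying the slice plane orientation (axial, coronal, or sagittal) in medical imaging. Missing orientation metadata complicates analysis, increases domain shift across datasets, and lowers the accuracy of diagnostic classifiers. The key to the solution is a 2.5D context-aware model that fuses multi-slice information to remove the ambiguity of isolated slices and strengthen feature learning. Compared with a reference model that uses only static 2D images (98.74% accuracy), the method raises accuracy to 99.49%. In a brain tumor detection task, a gated strategy that uses the generated metadata to enhance predictions raises accuracy from 97.0% to 98.0%, reducing misdiagnoses by 33.3%.
Link: https://arxiv.org/abs/2511.14021
Authors: SangHyuk Kim,Daniel Haehn,Sumientra Rampersad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 5 figures, 2 tables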
Abstract:Humans can easily identify anatomical planes (axial, coronal, and sagittal) on a 2D MRI slice, but automated systems struggle with this task. Missing plane orientation metadata can complicate analysis, increase domain shift when merging heterogeneous datasets, and reduce accuracy of diagnostic classifiers. This study develops a classifier that accurately generates plane orientation metadata. We adopt a 2.5D context-aware model that leverages multi-slice information to avoid ambiguity from isolated slices and enable robust feature learning. We train the 2.5D model on both 3D slice sequences and static 2D images. While our 2D reference model achieves 98.74% accuracy, our 2.5D method raises this to 99.49%, reducing errors by 60%, highlighting the importance of 2.5D context. We validate the utility of our generated metadata in a brain tumor detection task. A gated strategy selectively uses metadata-enhanced predictions based on uncertainty scores, boosting accuracy from 97.0% with an image-only model to 98.0%, reducing misdiagnoses by 33.3%. We integrate our plane orientation model into an interactive web application and provide it open-source.
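The "reducing errors by 60%" and "reducing misdiagnoses by 33.3%" figures are relative reductions in error rate, which can look surprising next to accuracy gains of under one point. The short sketch below reproduces the arithmetic from the accuracies quoted in the abstract; the function name is an illustrative assumption.

```python
def relative_error_reduction(acc_before, acc_after):
    """Percentage by which the error rate (100 - accuracy) shrinks."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before

# Plane orientation: error falls from 1.26% to 0.51%, about a 59.5% reduction,
# which the abstract rounds to 60%.
plane = relative_error_reduction(98.74, 99.49)

# Tumor detection: error falls from 3.0% to 2.0%, a one-third reduction.
tumor = relative_error_reduction(97.0, 98.0)
```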
zh
[CV-111] RISE: Single Static Radar-based Indoor Scene Understanding
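【Quick Read】: This paper tackles the long-standing robustness and privacy challenges of indoor scene understanding: conventional optical sensors such as RGB cameras and LiDAR suffer from occlusion and pose privacy risks indoors, while millimeter-wave (mmWave) radar has low spatial resolution that makes geometric reasoning difficult. The core solution is the RISE system, with two key ideas. First, it treats multipath reflections as geometric cues rather than noise: Bi-Angular Multipath Enhancement explicitly models the Angle-of-Arrival (AoA) and Angle-of-Departure (AoD) to recover secondary reflections and reveal invisible structures. Second, on top of the enhanced observations, a simulation-to-reality Hierarchical Diffusion framework turns fragmented radar responses into complete layout reconstruction and object detection results. In layout reconstruction the method reduces the Chamfer Distance by 60% (down to 16 cm), and it delivers the first mmWave-based object detection, reaching 58% IoU.
Link: https://arxiv.org/abs/2511.14019
Authors: Kaichen Zhou,Laura Dodds,Sayed Saad Afzal,Fadel Adib
Affiliations: Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: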
Abstract:Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar.
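For readers unfamiliar with the layout metric, here is a minimal sketch of a symmetric Chamfer Distance between two point sets. Conventions differ across papers (sum or mean, squared or plain distances), and the abstract does not say which one RISE uses, so the mean-of-nearest-neighbour form below is an assumption, and the toy points are made up.

```python
import math


def chamfer_distance(points_a, points_b):
    """Mean nearest-neighbour distance from A to B plus the same from B to A."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy 2D layouts (hypothetical values, same unit on both axes)
predicted = [(0.0, 0.0), (1.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

The brute-force double loop is quadratic in the number of points; real evaluations on dense layouts would normally use a spatial index instead.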
zh
[CV-112] CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution AAAI2026
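【Quick Read】: This paper addresses multi-contrast magnetic resonance imaging (MRI) super-resolution, where contrast disparities between modalities make it hard for reference-image textures to guide reconstruction of the target image; the goal is better anatomical detail and soft-tissue differentiation. The key to the solution is CD-DPE, a dual-prompt expert network based on convolutional dictionary feature decoupling. An iterative convolutional dictionary feature decoupling module (CD-FDM) separates features into cross-contrast and intra-contrast components to reduce redundancy and interference, and a dual-prompt feature fusion expert module (DP-FFEM) uses a frequency prompt to select relevant reference features and an adaptive routing prompt to choose the best fusion strategy, yielding high-quality reconstruction.
Link: https://arxiv.org/abs/2511.14014
Authors: Xianming Gu,Lihui Wang,Ying Cao,Zeyu Deng,Yingfeng Ou,Guodong Hu,Yi Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026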
Abstract:Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrast disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrate that CD-DPE exhibits strong generalization capabilities.
zh
[CV-113] Certified but Fooled! Breaking Certified Defences with Ghost Certificates AAAI AAAI-26
【Quick Read】: This paper studies how probabilistic certified defenses can be maliciously exploited: an attacker crafts small, imperceptible adversarial perturbations that not only cause misclassification but also lead the certified model to issue a spurious robustness guarantee for the adversarial input (certificate spoofing), thereby bypassing existing certified defenses. The key to the solution is region-focused adversarial examples, which manipulate the certification process while keeping the perturbation small and imperceptible, and induce certification radii for the target class that are larger than those of the source class ("ghost certificates"), effectively bypassing state-of-the-art certified defenses such as Densepure.
Link: https://arxiv.org/abs/2511.14003
Authors: Quoc Viet Vo,Tashreque M. Haq,Paul Montague,Tamas Abraham,Ehsan Abbasnejad,Damith C. Ranasinghe
Affiliations: 1. University of Melbourne; 2. Monash University; 3. Australian Institute for Machine Learning
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26). Code available at: this https URL
Abstract:Certified defenses promise provable robustness guarantees. We study the malicious exploitation of probabilistic certification frameworks to better understand the limits of guarantee provisions. Now, the objective is not only to mislead a classifier, but also to manipulate the certification process into generating a robustness guarantee for an adversarial input: certificate spoofing. A recent study at ICLR demonstrated that crafting large perturbations can shift inputs far into regions capable of generating a certificate for an incorrect class. Our study investigates whether the perturbations needed to cause a misclassification, and yet coax a certified model into issuing a deceptive, large robustness radius for a target class, can still be made small and imperceptible. We explore the idea of region-focused adversarial examples to craft imperceptible perturbations, spoof certificates, and achieve certification radii larger than those of the source class: ghost certificates. Extensive evaluations on ImageNet demonstrate the ability to effectively bypass state-of-the-art certified defenses such as Densepure. Our work underscores the need to better understand the limits of robustness certification methods.
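To make "certification radius" concrete, the sketch below computes the L2 radius used in randomized-smoothing certification, the common probabilistic framework in which the radius is sigma times the inverse normal CDF of a lower bound on the top-class probability. That the attacked defenses follow exactly this form is an assumption here, and the function and parameter names are illustrative.

```python
from statistics import NormalDist


def certified_radius(p_top_lower: float, sigma: float) -> float:
    """L2 radius sigma * Phi^{-1}(p); 0.0 stands for 'abstain' when p <= 0.5.

    p_top_lower must be strictly below 1.0, otherwise inv_cdf raises.
    """
    if p_top_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_top_lower)


# The certificate depends only on how consistently the smoothed classifier
# votes for one class around the input, not on whether that class is correct.
weak_vote = certified_radius(0.90, sigma=0.5)
strong_vote = certified_radius(0.99, sigma=0.5)
print(weak_vote, strong_vote)
```

Under this formula, an adversarial input that sits deep inside a region voting for the wrong class receives a large radius, which is the situation the abstract calls a ghost certificate.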
zh
[CV-114] Learning Skill-Attributes for Transferable Assessment in Video NEURIPS2025
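【Quick Read】: This paper addresses the poor cross-sport generalization of video skill assessment, which stems from scarce and costly expert annotation, especially across the long tail of sports that lack high-quality supervision. The key to the solution is CrossTrainer, which discovers skill-attributes shared across sports (such as balance, control, and hand positioning) to build transferable video representations, and trains a multimodal language model to generate actionable feedback for a new video (e.g., "lift hands more to generate more power") together with its proficiency level (e.g., early expert). The model improves performance in both cross-sport (transfer) and intra-sport (in-domain) settings, with gains of up to 60% relative to the state of the art.
Link: https://arxiv.org/abs/2511.13993
Authors: Kumar Ashutosh,Kristen Grauman
Affiliations: University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025, Project webpage: this https URL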
Abstract:Skill assessment from video entails rating the quality of a person’s physical performance and explaining what could be done better. Today’s models specialize for an individual sport, and suffer from the high cost and scarcity of expert-level supervision across the long tail of sports. Towards closing that gap, we explore transferable video representations for skill assessment. Our CrossTrainer approach discovers skill-attributes (such as balance, control, and hand positioning) whose meaning transcends the boundaries of any given sport, then trains a multimodal language model to generate actionable feedback for a novel video, e.g., “lift hands more to generate more power”, as well as its proficiency level, e.g., early expert. We validate the new model on multiple datasets for both cross-sport (transfer) and intra-sport (in-domain) settings, where it achieves gains up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques, enriching today’s multimodal large language models.
zh
[CV-115] Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
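【Quick Read】: This paper addresses the lack of realistic images for training vision models to detect workplace hazards, since accident-triggering scenarios are nearly impossible to capture as they occur. The key to the solution is a scene graph-guided generative AI framework: historical accident reports from the U.S. Occupational Safety and Health Administration (OSHA) are analyzed with GPT-4o to extract structured hazard reasoning, which is converted into object-level scene graphs that capture spatial and contextual relationships; these graphs then guide a text-to-image diffusion model to generate compositionally accurate hazard scenes.
Link: https://arxiv.org/abs/2511.13970
Authors: Sanjay Acharjee,Abir Khan Ratul,Diego Patino,Md Nazmus Sakib
Affiliations: University of Texas at Arlington
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: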
Abstract:Training vision models to detect workplace hazards accurately requires realistic images of unsafe conditions that could lead to accidents. However, acquiring such datasets is difficult because capturing accident-triggering scenarios as they occur is nearly impossible. To overcome this limitation, this study presents a novel scene graph-guided generative AI framework that synthesizes photorealistic images of hazardous scenarios grounded in historical Occupational Safety and Health Administration (OSHA) accident reports. OSHA narratives are analyzed using GPT-4o to extract structured hazard reasoning, which is converted into object-level scene graphs capturing spatial and contextual relationships essential for understanding risk. These graphs guide a text-to-image diffusion model to generate compositionally accurate hazard scenes. To evaluate the realism and semantic fidelity of the generated data, a visual question answering (VQA) framework is introduced. Across four state-of-the-art generative models, the proposed VQA Graph Score outperforms CLIP and BLIP metrics based on entropy-based validation, confirming its higher discriminative sensitivity.
zh
[CV-116] Single Tensor Cell Segmentation using Scalar Field Representations
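【Quick Read】: This paper addresses cell image segmentation, where the core challenge is extracting accurate cell instance boundaries from complex backgrounds while keeping the results robust and geometrically consistent. The key to the solution is to cast segmentation as learning a continuous scalar field: a neural network parameterizes a scalar field that solves the Poisson partial differential equation or mimics the steady-state solution of the heat equation, and the field is segmented with the watershed method to capture cell boundaries precisely. Training minimizes only the field residuals with no regularization term, which makes the regression robust to outliers; because a single tensor is all that is needed for training, implementation is simpler and compute and memory costs are lower, which suits edge computing.
Link: https://arxiv.org/abs/2511.13947
Authors: Kevin I. Ruiz Vargas,Gabriel G. Galdino,Tsang Ing Ren,Alexandre L. Cunha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to IEEE ISBI 2026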
Abstract:We investigate image segmentation of cells under the lens of scalar fields. Our goal is to learn a continuous scalar field on image domains such that its segmentation produces robust instances for cells present in images. This field is a function parameterized by the trained network, and its segmentation is realized by the watershed method. The fields we experiment with are solutions to the Poisson partial differential equation and a diffusion mimicking the steady-state solution of the heat equation. These solutions are obtained by minimizing just the field residuals; no regularization is needed, which provides a robust regression capable of diminishing the adverse impacts of outliers in the training data and allowing for sharp cell boundaries. A single tensor is all that is needed to train a U-Net, thus simplifying implementation, lowering training and inference times (hence reducing energy consumption), and requiring a small memory footprint, all attractive features in edge computing. We present competitive results on public datasets from the literature and show that our novel, simple yet geometrically insightful approach can achieve excellent cell segmentation results.
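As an illustration of what a Poisson scalar field over a cell mask looks like, the sketch below solves the discrete problem with plain Jacobi iterations on a toy mask. The unit source term, the zero boundary value, and the solver are assumptions made for illustration; the paper trains a network to regress such fields and segments them with the watershed method, neither of which is shown here.

```python
def poisson_field(mask, iterations=200):
    """Jacobi solve of -laplace(u) = 1 inside the mask, u = 0 elsewhere.

    Assumes unit grid spacing and a mask that does not touch the image border.
    """
    rows, cols = len(mask), len(mask[0])
    u = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        nxt = [[0.0] * cols for _ in range(rows)]
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if mask[i][j]:
                    neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                    nxt[i][j] = (neighbours + 1.0) / 4.0
        u = nxt
    return u


# A 3x3 "cell" surrounded by a one-pixel background border
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
field = poisson_field(mask)
print(field[2][2], field[1][1])
```

The field peaks at the cell center and falls to zero at the boundary, so each cell forms a single basin once the field is negated, which is what makes a watershed split of touching cells possible.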
zh
[CV-117] Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
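【Quick Read】: This paper addresses the heavy dependence of Vision Transformers (ViTs) on large-scale image data during training, aiming to improve data efficiency and cross-domain generalization. The key to the solution is a pretraining strategy based on procedurally generated data: simple algorithms (such as formal grammars) produce data devoid of visual or semantic content, and this data is used to "warm up" the ViT without any natural or synthetic images, bypassing its visual patch embedding mechanism and encouraging the model to internalise abstract computational priors. The warm-up improves convergence speed, data efficiency, and downstream performance in subsequent image training; for example, allocating just 1% of the training budget to procedural data yields a gain equivalent to 28% of the ImageNet-1k data.
Link: https://arxiv.org/abs/2511.13945
Authors: Zachary Shinnick,Liangze Jiang,Hemanth Saratchandran,Damien Teney,Anton van den Hengel
Affiliations: Australian Institute for Machine Learning (AIML), University of Adelaide; École Polytechnique Fédérale de Lausanne (EPFL), Switzerland; Idiap Research Institute, Switzerland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: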
Abstract:Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally-generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
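As a rough picture of "procedurally-generated data from a formal grammar", the sketch below samples balanced-bracket strings from a small context-free grammar. The abstract does not name the grammars actually used, so the bracket grammar, the stopping probability, and the function names are all illustrative assumptions.

```python
import random


def sample_balanced(rng, max_depth=4, pairs="()[]"):
    """Sample from the grammar S -> '' | open S close S, with a depth cap."""
    if max_depth == 0 or rng.random() < 0.3:
        return ""
    k = rng.randrange(len(pairs) // 2)
    open_tok, close_tok = pairs[2 * k], pairs[2 * k + 1]
    inner = sample_balanced(rng, max_depth - 1, pairs)
    rest = sample_balanced(rng, max_depth - 1, pairs)
    return open_tok + inner + close_tok + rest


def is_balanced(seq, pairs="()[]"):
    """Check that every closing token matches the most recent open token."""
    closers = {pairs[i + 1]: pairs[i] for i in range(0, len(pairs), 2)}
    stack = []
    for tok in seq:
        if tok in closers:
            if not stack or stack.pop() != closers[tok]:
                return False
        else:
            stack.append(tok)
    return not stack


rng = random.Random(0)
corpus = [sample_balanced(rng) for _ in range(5)]
print(corpus)
```

Sequences like these carry nesting structure but no visual or semantic content, which is the property the abstract relies on for the warm-up phase.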
zh
[CV-118] Find the Leak Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets
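【Quick Read】: This paper addresses information leakage in datasets of video-derived frames: because frames are highly correlated, redundant or overlapping information can appear across the training, validation, and test sets and undermine the reliability of model evaluation. The key to the solution is a cluster-based frame selection strategy: visually similar frames are grouped before the dataset is split, so that the training, validation, and test sets are more representative and balanced, which improves the quality of the partition and the reliability of performance evaluation.
Link: https://arxiv.org/abs/2511.13944
Authors: Noam Glazner,Noam Tsfaty,Sharon Shalev,Avishai Weizman
Affiliations: Bar-Ilan University; Afeka College of Engineering; Ben-Gurion University of the Negev
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 1 figure, 1 table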
Abstract:We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
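The core idea of splitting by group rather than by frame can be sketched as follows. The example assumes that a cluster id per frame has already been computed by some visual clustering step (not shown), and the greedy assignment rule and split ratios are illustrative choices, not the authors' procedure.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to train/val/test so no cluster spans two splits."""
    groups = defaultdict(list)
    for frame_idx, cid in enumerate(cluster_ids):
        groups[cid].append(frame_idx)

    clusters = sorted(groups)
    random.Random(seed).shuffle(clusters)

    names = ["train", "val", "test"]
    targets = {n: r * len(cluster_ids) for n, r in zip(names, ratios)}
    splits = {n: [] for n in names}
    for cid in clusters:
        # Put the cluster into the split that is furthest below its target size.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(groups[cid])
    return splits


# Hypothetical cluster id for each of ten frames
frame_clusters = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4]
splits = split_by_cluster(frame_clusters)
print(splits)
```

Because near-duplicate frames share a cluster, they always land in the same partition, which removes the train/test overlap that a per-frame random split would create.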
zh
[CV-119] Start Small Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding AAAI2026
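【Quick Read】: This paper addresses the finding that Chain-of-Thought (CoT) reasoning fine-tuned with reinforcement learning (RL) can degrade performance on Visual Grounding tasks, particularly as CoT outputs grow longer or more complex; it also notes that a larger dataset does not necessarily improve performance because data complexity varies widely. The key to the solution is Curriculum-based Relative Policy Optimization (CuRPO), which uses CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to order training samples from simple to complex and guide the model to learn progressively, achieving more efficient and robust visual grounding, with a clear advantage in few-shot settings.
Link: https://arxiv.org/abs/2511.13924
Authors: Qingyang Yan,Guangyao Chen,Yixiong Zou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026 (Oral)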
Abstract:Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks by explicitly generating intermediate reasoning steps. However, we find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance in Visual Grounding tasks, particularly as CoT outputs become lengthy or complex. Additionally, our analysis reveals that increased dataset size does not always enhance performance due to varying data complexities. Motivated by these findings, we propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to progressively structure training data from simpler to more challenging examples. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA datasets demonstrate the effectiveness of our approach. CuRPO consistently outperforms existing methods, including Visual-RFT, with notable improvements of up to +12.52 mAP on RefCOCO. Moreover, CuRPO exhibits exceptional efficiency and robustness, delivering strong localization performance even in few-shot learning scenarios, particularly benefiting tasks characterized by ambiguous and intricate textual this http URL. Code is released on this https URL.
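The two complexity signals named in the abstract are easy to compute. The sketch below gives the standard generalized IoU for axis-aligned boxes and a simple easy-to-hard ordering by CoT length; how CuRPO actually combines the two signals is not stated in the abstract, so the ordering rule and the sample records (`cot_len`, `id`) are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest box enclosing both inputs
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


# Hypothetical training records: shorter reasoning is treated as easier
samples = [
    {"id": "a", "cot_len": 120},
    {"id": "b", "cot_len": 35},
    {"id": "c", "cot_len": 80},
]
curriculum = sorted(samples, key=lambda s: s["cot_len"])

print(giou((0, 0, 1, 1), (2, 0, 3, 1)), [s["id"] for s in curriculum])
```

Unlike plain IoU, gIoU stays informative for non-overlapping boxes (it goes negative as they move apart), which is why it is a usable reward early in training when predictions often miss the target entirely.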
zh
[CV-120] Mind the Gap: Evaluating LLM Understanding of Human-Taught Road Safety Principles
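【Quick Read】: This paper examines the limits of multi-modal large language models (MLLMs) in understanding road safety concepts, in particular schematic and illustrative representations of traffic signs and road-safety norms. The key to the approach is a pilot dataset of images depicting traffic signs and road-safety norms sourced from school textbooks, on which models are evaluated in a zero-shot setting. The results expose weaknesses in safety reasoning and gaps between human learning and model interpretation, and provide an analysis to support future research.
Link: https://arxiv.org/abs/2511.13909
Authors: Chalamalasetti Kranti
Affiliations: University of Potsdam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint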
Abstract:Following road safety norms is non-negotiable not only for humans but also for the AI systems that govern autonomous vehicles. In this work, we evaluate how well multi-modal large language models (LLMs) understand road safety concepts, specifically through schematic and illustrative representations. We curate a pilot dataset of images depicting traffic signs and road-safety norms sourced from school textbooks and use it to evaluate models' capabilities in a zero-shot setting. Our preliminary results show that these models struggle with safety reasoning and reveal gaps between human learning and model interpretation. We further provide an analysis of these performance gaps for future research.
zh
[CV-121] SAE-MCVT: A Real-Time and Scalable Multi-Camera Vehicle Tracking Framework Powered by Edge Computing
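【Quick Read】: This paper addresses the real-time performance and scalability bottlenecks that Multi-Camera Vehicle Tracking (MCVT) systems face in city-scale deployment. Existing methods emphasize tracking accuracy but struggle to meet low-latency and efficiency requirements as the number of cameras grows. The key to the solution is the SAE-MCVT framework, which splits processing between the edge and a central workstation: edge devices process live RTSP video streams and extract lightweight metadata (vehicle locations and deep appearance features), and transmit only this metadata to the central side; the central side performs cross-camera association under spatial-temporal constraints learned by a self-supervised camera link model, enabling efficient and scalable global trajectory generation.
Link: https://arxiv.org/abs/2511.13904
Authors: Yuqiang Lin,Sam Lockyer,Florian Stanek,Markus Zarbock,Adrian Evans,Wenbin Li,Nic Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: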
Abstract:In modern Intelligent Transportation Systems (ITS), cameras are a key component due to their ability to provide valuable information for multiple stakeholders. A central task is Multi-Camera Vehicle Tracking (MCVT), which generates vehicle trajectories and enables applications such as anomaly detection, traffic density estimation, and suspect vehicle tracking. However, most existing studies on MCVT emphasize accuracy while overlooking real-time performance and scalability. These two aspects are essential for real-world deployment and become increasingly challenging in city-scale applications as the number of cameras grows. To address this issue, we propose SAE-MCVT, the first scalable real-time MCVT framework. The system includes several edge devices that interact with one central workstation separately. On the edge side, live RTSP video streams are serialized and processed through modules including object detection, object tracking, geo-mapping, and feature extraction. Only lightweight metadata – vehicle locations and deep appearance features – are transmitted to the central workstation. On the central side, cross-camera association is calculated under the constraint of spatial-temporal relations between adjacent cameras, which are learned through a self-supervised camera link model. Experiments on the RoundaboutHD dataset show that SAE-MCVT maintains real-time operation on 2K 15 FPS video streams and achieves an IDF1 score of 61.2. To the best of our knowledge, this is the first scalable real-time MCVT framework suitable for city-scale deployment.
zh
[CV-122] Temporal Realism Evaluation of Generated Videos Using Compressed-Domain Motion Vectors
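【Quick Read】: This paper addresses the weak temporal realism of current generative video models and the fact that existing evaluation metrics focus on spatial appearance and have limited sensitivity to motion. The key to the solution is a scalable, model-agnostic evaluation framework built on motion vectors (MVs) that codecs such as H.264 and HEVC generate in the compressed domain. Motion realism is quantified by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between the MV statistics of real and generated videos; the analysis also reveals temporal defects in generated videos, namely center bias, sparse and piecewise constant flows, and grid-like artifacts, that frame-level metrics cannot capture. The paper further explores MV-RGB fusion strategies and shows that adding MV information improves downstream classification (real-versus-generated discrimination), reaching up to 99.0% accuracy, which indicates that compressed-domain MVs are an effective signal for diagnosing motion defects in generated videos and strengthening temporal reasoning in discriminative models.
Link: https://arxiv.org/abs/2511.13897
Authors: Mert Onur Cakiroglu,Idil Bilge Altun,Zhihe Lu,Mehmet Dalkilic,Hasan Kurban
Affiliations: Indiana University Bloomington; Hamad Bin Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: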
Abstract:Temporal realism remains a central weakness of current generative video models, as most evaluation metrics prioritize spatial appearance and offer limited sensitivity to motion. We introduce a scalable, model-agnostic framework that assesses temporal behavior using motion vectors (MVs) extracted directly from compressed video streams. Codec-generated MVs from standards such as H.264 and HEVC provide lightweight, resolution-consistent descriptors of motion dynamics. We quantify realism by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between MV statistics of real and generated videos. Experiments on the GenVidBench dataset containing videos from eight state-of-the-art generators reveal systematic discrepancies from real motion: entropy-based divergences rank Pika and SVD as closest to real videos, MV-sum statistics favor VC2 and Text2Video-Zero, and CogVideo shows the largest deviations across both measures. Visualizations of MV fields and class-conditional motion heatmaps further reveal center bias, sparse and piecewise constant flows, and grid-like artifacts that frame-level metrics do not capture. Beyond evaluation, we investigate MV-RGB fusion through channel concatenation, cross-attention, joint embedding, and a motion-aware fusion module. Incorporating MVs improves downstream classification across ResNet, I3D, and TSN backbones, with ResNet-18 and ResNet-34 reaching up to 97.4% accuracy and I3D achieving 99.0% accuracy on real-versus-generated discrimination. These findings demonstrate that compressed-domain MVs provide an effective temporal signal for diagnosing motion defects in generative videos and for strengthening temporal reasoning in discriminative models. The implementation is available at: this https URL
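The three divergences named in the abstract can be computed directly on normalized histograms. The sketch below assumes the motion-vector statistics have already been binned into histograms over the same unit-spaced bins (extraction from the codec is not shown), and the example histograms are made-up values.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); requires q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """W1 between histograms on shared unit-spaced bins: sum of |CDF gaps|."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total


# Hypothetical MV-magnitude histograms for real and generated clips
real = [0.5, 0.3, 0.2]
generated = [0.2, 0.3, 0.5]
print(kl(real, generated), js(real, generated), wasserstein_1d(real, generated))
```

The measures can rank generators differently, as the abstract reports: KL and JS compare bin masses only, while Wasserstein also accounts for how far mass has to move between bins.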
zh
[CV-123] Weakly Supervised Ephemeral Gully Detection In Remote Sensing Images Using Vision Language Models
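【Quick Read】: This paper addresses the automatic detection of ephemeral gullies in agricultural fields. The core challenges are that (1) ephemeral gullies have short temporal cycles, which makes them hard to detect with classical computer vision and remote sensing methods, and (2) accurately labeled data is scarce and costly to produce, which limits machine-learning detection to zero-shot approaches that are hard to implement. The key to the solution is the first weakly supervised detection pipeline, which exploits the knowledge embedded in Vision Language Models (VLMs) to reduce manual labeling and uses a teacher-student architecture to learn from noisy labels: the teacher learns from noisy labels produced by the VLMs, and the student is trained under weak supervision on teacher-generated labels with a noise-aware loss function for robustness. Experiments show that the student model trained with weak supervision outperforms both the VLMs and the label model.
Link: https://arxiv.org/abs/2511.13891
Authors: Seyed Mohamad Ali Tousi,John A. Lory,G. N. DeSouza
Affiliations: Vision Guided and Intelligent Robotics Laboratory (ViGIR), EECS Dept.; Division of Plant Science and Technology; University of Missouri
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: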
Abstract:Among soil erosion problems, Ephemeral Gullies are one of the most concerning phenomena occurring in agricultural fields. Their short temporal cycles increase the difficulty in automatically detecting them using classical computer vision approaches and remote sensing. Also, due to the scarcity of accurate labeled data and the difficulty of producing it, automatic detection of ephemeral gullies using Machine Learning is limited to zero-shot approaches, which are hard to implement. To overcome these challenges, we present the first weakly supervised pipeline for detection of ephemeral gullies. Our method relies on remote sensing and uses Vision Language Models (VLMs) to drastically reduce the labor-intensive task of manual labeling. In order to achieve that, the method exploits: 1) the knowledge embedded in the VLM’s pretraining; 2) a teacher-student model where the teacher learns from noisy labels coming from the VLMs, and the student learns by weak supervision using teacher-generated labels and a noise-aware loss function. We also make available the first-of-its-kind dataset for semi-supervised detection of ephemeral gullies from remote-sensed images. The dataset consists of a number of locations labeled by a group of soil and plant scientists, as well as a large number of unlabeled locations. The dataset represents more than 18,000 high-resolution remote-sensing images obtained over the course of 13 years. Our experimental results demonstrate the validity of our approach by showing superior performance compared to VLMs and the label model itself when using weak supervision to train a student model. The code and dataset for this work are made publicly available.
zh
[CV-124] Uni-Hema: Unified Model for Digital Hematopathology
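【Quick Read】: This paper addresses the lack of unified multi-task, multi-modal reasoning in digital hematopathology: existing models (single-task, vision-language, WSI-optimized, or single-cell hematology models) cannot analyze different disease categories (such as leukemia, malaria, and sickle cell disease) jointly across tasks and modalities. The key to the solution is Uni-Hema, a unified multi-task model built on the Hema-Former module that integrates detection, classification, segmentation, morphology prediction, and reasoning. It is trained on 46 public datasets (over 700K images and 21K question-answer pairs) and bridges visual and textual representations hierarchically to support tasks at different granularities, matching or exceeding single-task models while providing interpretable single-cell morphological insights.
Link: https://arxiv.org/abs/2511.13889
Authors: Abdul Rehman,Iqra Rasool,Ayesha Imran,Mohsen Ali,Waqas Sultani
Affiliations: Intelligent Machine Lab, Information Technology University of Punjab, Lahore, Pakistan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: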
Abstract:Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Existing approaches, whether single-task, vision-language, WSI-optimized, or single-cell hematology models, share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, masked language modeling, and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to models trained on a single task and a single dataset across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.
zh
[CV-125] Revisiting Data Scaling Law for Medical Segmentation
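【Quick Read】: This paper addresses the limited understanding of the scaling law between dataset size and model performance in medical anatomical segmentation, in particular the lack of systematic study across imaging modalities and semantic tasks. The key to the solution is a novel, scalable image augmentation method based on diffeomorphic mappings derived from image registration: realistic deformations are generated from a geodesic subspace to improve data utilization efficiency, which accelerates convergence and improves performance without additional data, surpassing standard power-law scaling trends.
Link: https://arxiv.org/abs/2511.13883
Authors: Yuetan Chu,Zhongyi Han,Gongning Luo,Xin Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: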
Abstract:The population loss of trained deep neural networks often exhibits power law scaling with the size of the training dataset, guiding significant performance advancements in deep learning applications. In this study, we focus on the scaling relationship with data size in the context of medical anatomical segmentation, a domain that remains underexplored. We analyze scaling laws for anatomical segmentation across 15 semantic tasks and 4 imaging modalities, demonstrating that larger datasets significantly improve segmentation performance, following similar scaling trends. Motivated by the topological isomorphism in images sharing anatomical structures, we evaluate the impact of deformation-guided augmentation strategies on data scaling laws, specifically random elastic deformation and registration-guided deformation. We also propose a novel, scalable image augmentation approach that generates diffeomorphic mappings from geodesic subspace based on image registration to introduce realistic deformation. Our experimental results demonstrate that both registered and generated deformation-based augmentation considerably enhance data utilization efficiency. The proposed generated deformation method notably achieves superior performance and accelerated convergence, surpassing standard power law scaling trends without requiring additional data. Overall, this work provides insights into the understanding of segmentation scalability and topological variation impact in medical imaging, thereby leading to more efficient model development with reduced annotation and computational costs.
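A power-law scaling trend of the form loss = a * N^(-b) becomes a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below is a generic fit on synthetic numbers; the functional form without an irreducible-loss offset and the data are assumptions for illustration, not results from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic losses that follow an exact power law with a = 2.0, b = 0.5
sizes = [100, 200, 400, 800]
losses = [2.0 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(a, b)
```

An augmentation that "surpasses" the power-law trend, in the abstract's sense, would show up as measured losses falling below the fitted line at the same dataset size.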
zh
[CV-126] VLMs Guided Interpretable Decision Making for Autonomous Driving WACV2026
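【Quick Read】: This paper addresses the inconsistent performance and weak generalization of current vision-language model (VLM)-based decision-making systems for autonomous driving (AD), which depend on handcrafted prompts. The key to the solution is to redefine the role of the VLM: rather than generating decisions directly, it acts as a semantic enhancer whose strong scene understanding enriches existing vision-based benchmarks with structured, linguistically rich scene descriptions. On this basis, a multi-modal interactive architecture fuses visual and linguistic features, and a post-hoc refinement module further improves prediction reliability, yielding more accurate and interpretable driving decisions.
Link: https://arxiv.org/abs/2511.13881
Authors: Xin Hu,Taotao Jing,Renran Tian,Zhengming Ding
Affiliations: Tulane University; Qualcomm; North Carolina State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2026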
Abstract:Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
zh
[CV-127] AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection
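[Quick read]: This paper addresses the performance drop caused by catastrophic forgetting (CF) in class-incremental learning (CIL). Traditional methods that do not use pre-trained models (PTMs) must incrementally learn both the feature representation and the classifier, which readily triggers CF. Existing PTM-based methods train efficiently by combining a fixed feature extractor with analytic classifiers, but they cannot keep adapting the feature representation to new tasks, which caps their performance. The key to the solution is AnaCP (Analytic Contrastive Projection), which keeps the efficiency of analytic classifiers while adding incremental feature adaptation without gradient-based updates, thereby eliminating the CF caused by gradient updates. Experiments show that the method not only outperforms existing baselines but also reaches the accuracy of joint training, which is regarded as the upper bound of CIL.
Link: https://arxiv.org/abs/2511.13880
Authors: Saleh Momeni,Changnan Xiao,Bing Liu
Affiliations: University of Illinois Chicago
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: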
Abstract:This paper studies the problem of class-incremental learning (CIL), a core setting within continual learning where a model learns a sequence of tasks, each containing a distinct set of classes. Traditional CIL methods, which do not leverage pre-trained models (PTMs), suffer from catastrophic forgetting (CF) due to the need to incrementally learn both feature representations and the classifier. The integration of PTMs into CIL has recently led to efficient approaches that treat the PTM as a fixed feature extractor combined with analytic classifiers, achieving state-of-the-art performance. However, they still face a major limitation: the inability to continually adapt feature representations to best suit the CIL tasks, leading to suboptimal performance. To address this, we propose AnaCP (Analytic Contrastive Projection), a novel method that preserves the efficiency of analytic classifiers while enabling incremental feature adaptation without gradient-based training, thereby eliminating the CF caused by gradient updates. Our experiments show that AnaCP not only outperforms existing baselines but also achieves the accuracy level of joint training, which is regarded as the upper bound of CIL.
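To make the "analytic classifier" idea concrete, here is a minimal sketch of a closed-form ridge classifier trained task by task on frozen features. It is not AnaCP itself (the contrastive projection is not reproduced), and the class name, the `ridge` parameter, and the one-hot targets are assumptions made for illustration.

```python
import numpy as np

class AnalyticClassifier:
    """Closed-form ridge classifier over frozen features (illustrative only)."""

    def __init__(self, feat_dim, num_classes, ridge=1.0):
        self.gram = np.zeros((feat_dim, feat_dim))      # running sum of X^T X
        self.cross = np.zeros((feat_dim, num_classes))  # running sum of X^T Y
        self.ridge = ridge                              # assumed regularization strength
        self.weights = np.zeros((feat_dim, num_classes))

    def fit_task(self, feats, labels):
        # Accumulate sufficient statistics; no gradient step is involved.
        onehot = np.eye(self.cross.shape[1])[labels]
        self.gram += feats.T @ feats
        self.cross += feats.T @ onehot
        reg = self.ridge * np.eye(self.gram.shape[0])
        # Solve (G + ridge * I) W = C for the classifier weights.
        self.weights = np.linalg.solve(self.gram + reg, self.cross)

    def predict(self, feats):
        return np.argmax(feats @ self.weights, axis=1)
```

Because only the accumulated statistics are stored, fitting tasks one after another gives the same weights as fitting all data at once, which is why this family of classifiers avoids forgetting in the classifier itself.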
zh
[CV-128] Hybrid Convolution Neural Network Integrated with Pseudo-Newton Boosting for Lumbar Spine Degeneration Detection
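[Quick read]: This paper targets the performance bottleneck in classifying lumbar spine degeneration from medical images, in particular the limited feature selection and representation ability of conventional transfer learning on high-dimensional medical image data. The key to its solution is a multi-tiered architecture that combines EfficientNet and VGG19 with two new components: a Pseudo-Newton Boosting layer and a Sparsity-Induced Feature Reduction Layer. The former adjusts feature weights to capture finer anatomical detail, and the latter removes redundant features to produce compact, robust pathology representations. On DICOM images the model reports a precision of 0.9, a recall of 0.861, and an F1 score of 0.88.
Link: https://arxiv.org/abs/2511.13877
Authors: Pandiyaraju V,Abishek Karthik,Jaspin K,Kannan A,Jaime Lloret
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: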
Abstract:This paper proposes a new enhanced model architecture to perform classification of lumbar spine degeneration with DICOM images while using a hybrid approach, integrating EfficientNet and VGG19 together with custom-designed components. The proposed model is differentiated from traditional transfer learning methods as it incorporates a Pseudo-Newton Boosting layer along with a Sparsity-Induced Feature Reduction Layer that forms a multi-tiered framework, further improving feature selection and representation. The Pseudo-Newton Boosting layer makes smart variations of feature weights, with more detailed anatomical features, which are mostly left out in a transfer learning setup. In addition, the Sparsity-Induced Layer removes redundancy for learned features, producing lean yet robust representations for pathology in the lumbar spine. This architecture is novel as it overcomes the constraints in the traditional transfer learning approach, especially in the high-dimensional context of medical images, and achieves a significant performance boost, reaching a precision of 0.9, recall of 0.861, F1 score of 0.88, loss of 0.18, and an accuracy of 88.1%, compared to the baseline model, EfficientNet. This work will present the architectures, preprocessing pipeline, and experimental results. The results contribute to the development of automated diagnostic tools for medical images.
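As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The helper name below is illustrative; the two inputs are the values quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract: precision 0.9, recall 0.861.
print(round(f1_score(0.9, 0.861), 2))  # 0.88
```

The result agrees with the F1 score of 0.88 stated in the abstract.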
zh
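[CV-129] QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt tuning
[Quick read]: This paper addresses two core limitations of Contrastive Language-Image Pretraining (CLIP) in medical imaging: its text encoder accepts at most 77 tokens, which is too short for long, information-rich radiology reports; and existing fixes based on domain-specific encoders such as PubMedBERT or ClinicalBERT remain limited by a 512-token input length and relatively shallow semantic understanding. The key to the solution is the QwenCLIP framework, which replaces CLIP's text encoder with a large language model (LLM) based embedding module (e.g., Qwen3-Embedding) and introduces learnable prompts to strengthen cross-modal alignment. By exploiting the longer context window and richer language representations of LLMs, the method captures medical semantics from long text more completely, improving medical image-text alignment and downstream performance on radiology benchmarks.
Link: https://arxiv.org/abs/2511.13876
Authors: Xiaoyang Wei,Camille Kurtz,Florence Cloppet
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: This work has been submitted to the IEEE ISBI for possible publication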
Abstract:Contrastive Language-Image Pretraining (CLIP) has demonstrated strong generalization for vision-language tasks in computer vision and medical domains, yet its text encoder accepts only up to 77 tokens, which limits its ability to represent long and information-rich radiology reports. Recent adaptations using domain-specific encoders, such as PubMedBERT or ClinicalBERT, mitigate this issue by leveraging medical corpora, but remain constrained by their limited input length (typically 512 tokens) and relatively shallow semantic understanding. To address these limitations, we propose QwenCLIP, a vision-language framework that replaces CLIP’s text encoder with a large language model (LLM)-based embedding module (e.g., Qwen3-Embedding) and introduces learnable prompts to enhance cross-modal alignment. By leveraging the extended context window and richer representations of LLMs, QwenCLIP captures comprehensive medical semantics from long-form clinical text, substantially improving medical image-text alignment and downstream performance on radiology benchmarks. Our code is publicly available at this https URL.
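The image-text alignment objective that CLIP-style models optimize can be sketched as a symmetric contrastive loss over a batch of paired embeddings. This is a generic NumPy illustration, not the QwenCLIP training code; the function name and the temperature value of 0.07 are assumptions, and the embeddings are taken as already computed by whatever image and text encoders are in use.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # pairwise cosine similarities, scaled

    def cross_entropy_diag(lg):
        # Cross-entropy where the correct class for row i is column i.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Swapping the text encoder, as the paper proposes, changes only how `text_emb` is produced; the loss itself is agnostic to the encoder.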
zh
[CV-130] H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction
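[Quick read]: This paper addresses the difficulty of interpreting multi-sequence contrast-enhanced MRI when predicting post-operative bladder cancer recurrence, where scarring, swelling, and tissue remodeling reduce diagnostic accuracy. The key to the solution is twofold: a curated multi-modal MRI dataset built specifically for recurrence assessment, and a new Hierarchical Gated Attention Multi-Branch model (H-CNN-ViT) that uses a context-aware mechanism to dynamically weight and fuse features from the global (ViT) and local (CNN) paths while preserving the properties of each modality. On the authors' dataset the model reaches an AUC of 78.6%, surpassing state-of-the-art models.
Link: https://arxiv.org/abs/2511.13869
Authors: Xueyang Li,Zongren Wang,Yuliang Zhang,Zixuan Pan,Yu-Jen Chen,Nishchal Sapkota,Gelei Xu,Danny Z. Chen,Yiyu Shi
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: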
Abstract:Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment study. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at this https URL.
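One common way to realize "selective weighting" of a global and a local feature path is a learned sigmoid gate. The sketch below shows that generic pattern only; the abstract does not specify the actual gating layers of H-CNN-ViT, so the function names, the single linear gate, and the tensor shapes are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(global_feat, local_feat, w_gate, b_gate):
    """Fuse two (batch, dim) feature tensors with a per-channel gate.

    w_gate has shape (2 * dim, dim) and b_gate has shape (dim,);
    both are hypothetical learned parameters.
    """
    context = np.concatenate([global_feat, local_feat], axis=-1)
    gate = sigmoid(context @ w_gate + b_gate)  # values in (0, 1)
    # The gate decides, per channel, how much of each path to keep.
    return gate * global_feat + (1.0 - gate) * local_feat
```

With all-zero gate parameters the gate is 0.5 everywhere and the output is the plain average of the two paths; training would move it toward whichever path is more informative for a given input.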
zh
[CV-131] GRLoc: Geometric Representation Regression for Visual Localization
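[Quick read]: This paper addresses the tendency of Absolute Pose Regression (APR) models to overfit training views and generalize poorly because they lack geometric priors. Conventional APR maps an image directly to a 6-DoF camera pose, which is essentially a black-box operation that tends to memorize training data rather than understand 3D scene geometry. The key to the solution is Geometric Representation Regression (GRR), which treats APR as an inverse rendering process: instead of predicting the pose directly, the model explicitly regresses two disentangled geometric representations from the query image, namely (1) ray bundle directions in the world coordinate system, used to estimate camera rotation, and (2) a corresponding pointmap, used to estimate camera translation. A differentiable deterministic solver then combines the two components into the 6-DoF pose. Separating the visual-to-geometry mapping from the pose computation introduces a strong geometric prior and yields state-of-the-art results on the 7-Scenes and Cambridge Landmarks datasets, supporting inverse rendering as a robust path toward generalizable absolute pose estimation.
Link: https://arxiv.org/abs/2511.13864
Authors: Changyang Li,Xuejian Ma,Lixiang Liu,Zhan Li,Qingan Yan,Yi Xu
Affiliations: Goertek Alpha Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: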
Abstract:Absolute Pose Regression (APR) has emerged as a compelling paradigm for visual localization. However, APR models typically operate as black boxes, directly regressing a 6-DoF pose from a query image, which can lead to memorizing training views rather than understanding 3D scene geometry. In this work, we propose a geometrically-grounded alternative. Inspired by novel view synthesis, which renders images from intermediate geometric representations, we reformulate APR as its inverse that regresses the underlying 3D representations directly from the image, and we name this paradigm Geometric Representation Regression (GRR). Our model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) a ray bundle’s directions to estimate camera rotation, and (2) a corresponding pointmap to estimate camera translation. The final 6-DoF camera pose is then recovered from these geometric components using a differentiable deterministic solver. This disentangled approach, which separates the learned visual-to-geometry mapping from the final pose calculation, introduces a strong geometric prior into the network. We find that the explicit decoupling of rotation and translation predictions measurably boosts performance. We demonstrate state-of-the-art performance on 7-Scenes and Cambridge Landmarks datasets, validating that modeling the inverse rendering process is a more robust path toward generalizable absolute pose estimation.
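The abstract does not describe the solver, so the following is only one plausible way to recover a pose from the two predicted representations. It assumes unit ray directions in the camera frame are known from the intrinsics, aligns them to the predicted world-frame directions with the Kabsch (orthogonal Procrustes) method, and then finds the camera center as the least-squares intersection of the rays through the predicted 3D points. All function names are hypothetical.

```python
import numpy as np

def rotation_from_rays(cam_rays, world_rays):
    """Camera-to-world rotation R with world_rays[i] ~ R @ cam_rays[i] (Kabsch)."""
    h = cam_rays.T @ world_rays              # 3x3 correlation of the two ray sets
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against a reflection
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

def center_from_pointmap(world_rays, points):
    """Camera center c minimizing the distance from c to each line (point, ray)."""
    # P_i = I - d_i d_i^T projects onto the plane orthogonal to ray i.
    proj = np.eye(3)[None] - world_rays[:, :, None] * world_rays[:, None, :]
    a = proj.sum(axis=0)
    b = np.einsum("nij,nj->i", proj, points)
    return np.linalg.solve(a, b)
```

Both steps are closed-form linear algebra, so gradients can flow through them, which is consistent with the "differentiable deterministic solver" the abstract mentions, although the authors' actual formulation may differ.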
zh
[CV-132] Segmenting Collision Sound Sources in Egocentric Videos KR
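[Quick read]: This paper addresses Collision Sound Source Segmentation (CS3): locating, in video frames, the objects responsible for a given collision sound, conditioned on the audio. Unlike isolated sound events, a collision sound is produced by two interacting objects and its acoustic signature depends on both, which makes segmentation harder. In egocentric video the sound is usually clear, but the scene is cluttered, objects are small, and interactions are brief, which adds further difficulty. The key to the solution is a weakly-supervised audio-conditioned segmentation method that uses pre-trained foundation models (CLIP and SAM2) for cross-modal alignment and segmentation, and adds egocentric cues (such as objects held in hands) to identify likely sound-producing objects. On the two newly introduced benchmarks, EPIC-CS3 and Ego4D-CS3, the method improves mIoU (mean intersection over union) over baselines by 3× and 4.7×, respectively.
Link: https://arxiv.org/abs/2511.13863
Authors: Kranti Kumar Parida,Omar Emara,Hazel Doughty,Dima Damen
Affiliations: Samsung R&D Institute India – Bangalore; University of Bristol; Leiden University
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Under Review. Webpage: this https URL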
Abstract:Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by 3× and 4.7× in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.
zh
[CV-133] RSPose: Ranking Based Losses for Human Pose Estimation
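[Quick read]: This paper studies three problems in heatmap-based human pose estimation: (P1) the commonly used Mean Squared Error (MSE) loss treats all pixel deviations equally and does not focus on sharpening and correctly localizing the peak that corresponds to a joint; (P2) heatmaps are spatially and class-wise imbalanced; and (P3) the evaluation metric (mAP) and the loss functions are inconsistent. The key to the solution is a set of ranking-based losses, shown both theoretically and empirically to outperform conventional heatmap losses (MSE, KL divergence). They markedly increase the correlation between confidence scores and localization quality, which improves instance selection during Non-Maximum Suppression (NMS) and ultimately raises mean Average Precision (mAP). Models trained with these losses are called RSPose and are validated on COCO, CrowdPose, and MPII; the authors state that this is the first loss design aligned with the mAP evaluation metric for human pose estimation.
Link: https://arxiv.org/abs/2511.13857
Authors: Muhammed Can Keles,Bedrettin Cetinkaya,Sinan Kalkan,Emre Akbas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: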
Abstract:While heatmap-based human pose estimation methods have shown strong performance, they suffer from three main problems: (P1) the commonly used Mean Squared Error (MSE) loss may not always improve joint localization because it penalizes all pixel deviations equally, without focusing explicitly on sharpening and correctly localizing the peak corresponding to the joint; (P2) heatmaps are spatially and class-wise imbalanced; and, (P3) there is a discrepancy between the evaluation metric (i.e., mAP) and the loss functions. We propose ranking-based losses to address these issues. Both theoretically and empirically, we show that our proposed losses are superior to commonly used heatmap losses (MSE, KL-Divergence). Our losses considerably increase the correlation between confidence scores and localization qualities, which is desirable because higher correlation leads to more accurate instance selection during Non-Maximum Suppression (NMS) and better Average Precision (mAP) performance. We refer to the models trained with our losses as RSPose. We show the effectiveness of RSPose across two different modes: one-dimensional and two-dimensional heatmaps, on three different datasets (COCO, CrowdPose, MPII). To the best of our knowledge, we are the first to propose losses that align with the evaluation metric (mAP) for human pose estimation. RSPose outperforms the previous state of the art on the COCO-val set and achieves an mAP score of 79.9 with ViTPose-H, a vision transformer model for human pose estimation. We also improve SimCC Resnet-50, a coordinate classification-based pose estimation method, by 1.5 AP on the COCO-val set, achieving 73.6 AP.
zh
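[CV-134] Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
[Quick read]: This paper addresses the lack of evaluation for multi-step reasoning in video generation models: existing benchmarks focus on visual fidelity or alignment and cannot measure the core cognitive abilities involved in Chain-of-Frames (CoF) reasoning, such as multi-step planning, algorithmic logic, and abstract pattern extrapolation. The key to the solution is Gen-ViRe (Generative Visual Reasoning Benchmark), an evaluation framework grounded in cognitive science and real-world AI applications. It decomposes CoF reasoning into six cognitive dimensions and 24 subtasks, and combines multi-source data curation, minimal prompting protocols, and hybrid evaluation assisted by a vision-language model (VLM) to deliver the first quantitative assessment of video models as reasoners. Experiments show a substantial gap between visual quality and actual reasoning depth in state-of-the-art systems, providing baselines and diagnostic tools for building generative models that genuinely simulate the physical world.
Link: https://arxiv.org/abs/2511.13853
Authors: Xinxin Liu,Zhaopan Xu,Kai Wang,Yong Jae Lee,Yuzhang Shang
Affiliations: University of Central Florida; National University of Singapore; UW-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages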
Abstract:While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning – materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions – from perceptual logic to abstract planning – and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.
zh
[CV-135] Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video
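[Quick read]: This paper addresses language-independent screening for early neurocognitive change, specifically passive, non-interventional screening for dementia such as Alzheimer's disease by analyzing facial temporal micro dynamics in natural settings. Existing approaches mostly rely on speech or scripted interviews, which ties them to clinical settings and to language content; this work proposes a detection paradigm that needs no speech or text. The key steps are: first, stabilize the facial micro-movements in the video (blink frequency, small mouth and jaw motions, gaze variability, and subtle head adjustments) and convert them into interpretable time series; second, encode each window with statistical features whose core representation is the "activity mix", the relative share of motion across streams, so that the model attends to how motion is distributed across facial streams rather than to its magnitude, improving both discrimination and interpretability. On the self-built YT DemTalk dataset, lightweight shallow classifiers reach an AUROC of 0.953, an AP of 0.961, and an F1-score of 0.851, with gaze lability and mouth/jaw dynamics identified as the most informative cues.
Link: https://arxiv.org/abs/2511.13802
Authors: Filippo Cenacchi,Longbing Cao,Mitchell McEwan,Deborah Richards
Affiliations: Macquarie University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: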
Abstract:We target passive dementia screening from short camera-facing talking head video, developing a facial temporal micro dynamics analysis for language free detection of early neuro cognitive change. This enables unscripted, in the wild video analysis at scale to capture natural facial behaviors, transferrable across devices, topics, and cultures without active intervention by clinicians or researchers during recording. Most existing resources prioritize speech or scripted interviews, limiting use outside clinics and coupling predictions to language and transcription. In contrast, we identify and analyze whether temporal facial kinematics, including blink dynamics, small mouth jaw motions, gaze variability, and subtle head adjustments, are sufficient for dementia screening without speech or text. By stabilizing facial signals, we convert these micro movements into interpretable facial microdynamic time series, smooth them, and summarize short windows into compact clip level statistics for screening. Each window is encoded by its activity mix (the relative share of motion across streams), thus the predictor analyzes the distribution of motion across streams rather than its magnitude, making per channel effects transparent. We also introduce YT DemTalk, a new dataset curated from publicly available, in the wild camera facing videos. It contains 300 clips (150 with self reported dementia, 150 controls) to test our model and offer a first benchmarking of the corpus. On YT DemTalk, ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and light weighted shallow classifiers could attain a dementia prediction performance of (AUROC) 0.953, 0.961 Average Precision (AP), 0.851 F1-score, and 0.857 accuracy.
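The "activity mix" encoding can be illustrated as a per-window normalization of motion across streams. This is a minimal sketch under stated assumptions: the input is a hypothetical array with one column per stream (for example blink, mouth/jaw, gaze, head), motion is measured as the absolute frame-to-frame change, and windows are non-overlapping. The stabilization and smoothing steps described in the abstract are not reproduced.

```python
import numpy as np

def activity_mix(signals, window):
    """Relative share of motion per stream in each window.

    signals: (T, streams) array of stabilized facial signals (hypothetical input).
    Returns an (n_windows, streams) array whose rows sum to 1.
    """
    motion = np.abs(np.diff(signals, axis=0))   # frame-to-frame change
    n_windows = motion.shape[0] // window
    motion = motion[: n_windows * window].reshape(n_windows, window, -1)
    energy = motion.sum(axis=1)                 # total motion per stream
    total = energy.sum(axis=1, keepdims=True)
    # Windows with no motion at all are encoded as zeros.
    return np.divide(energy, total, out=np.zeros_like(energy), where=total > 0)
```

Because each row is normalized, scaling every signal by the same constant leaves the encoding unchanged, which matches the stated goal of analyzing the distribution of motion across streams rather than its magnitude.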
zh
[CV-136] Synergizing Multigrid Algorithms with Vision Transformer: A Novel Approach to Enhance the Seismic Foundation Model
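[Quick read]: This paper addresses a challenge in pretraining foundation models on seismic data (seismograms): existing Vision Transformers (ViTs) use sequential tokenization and therefore fail to capture the intrinsic structure of high- and low-frequency features in seismic signals. The core of the solution is an Adaptive Two-Grid Foundation Model Training Strategy (ADATG), which uses spectrum decomposition to separate high- and low-frequency components and hierarchical Hilbert encoding to represent the data efficiently. Building on the frequency principle observed in ViTs, it also adopts a progressive training scheme that first emphasizes coarse-level information and then gradually refines toward fine-level features, improving both the effectiveness and the efficiency of pretraining seismic image foundation models.
Link: https://arxiv.org/abs/2511.13800
Authors: Huiwen Wu,Shuo Zhang,Yi Liu,Hongbin Ye
Affiliations: Zhejiang Laboratory; Academy of Mathematics and Systems Science, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Notes: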
Abstract:Due to the emergence and homogenization of Artificial Intelligence (AI) technology development, transformer-based foundation models have revolutionized scientific applications, such as drug discovery, materials research, and astronomy. However, seismic data presents unique characteristics that require specialized processing techniques for pretraining foundation models in seismic contexts, with high- and low-frequency features playing crucial roles. Existing vision transformers (ViTs) with sequential tokenization ignore the intrinsic pattern and fail to grasp both the high- and low-frequency seismic information efficiently and effectively. This work introduces a novel adaptive two-grid foundation model training strategy (ADATG) with Hilbert encoding specifically tailored for seismogram data, leveraging the hierarchical structures inherent in seismic data. Specifically, our approach employs spectrum decomposition to separate high- and low-frequency components and utilizes hierarchical Hilbert encoding to represent the data effectively. Moreover, based on the frequency principle observed in ViTs, we propose an adaptive training strategy that initially emphasizes coarse-level information and then progressively refines the model’s focus on fine-level features. Our extensive experiments demonstrate the effectiveness and efficiency of our training methods. This research highlights the importance of data encoding and training strategies informed by the distinct characteristics of high- and low-frequency features in seismic images, ultimately contributing to the enhancement of visual seismic foundation model pretraining.
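The spectrum decomposition step can be pictured as splitting a 2D section into a low-frequency and a high-frequency part with a mask in the Fourier domain. The sketch below is a generic illustration, not the ADATG pipeline: the function name, the hard radial mask, and the cutoff (in cycles per sample) are assumptions, and the Hilbert encoding is not shown.

```python
import numpy as np

def split_frequencies(image, cutoff=0.1):
    """Split a 2D array into low- and high-frequency parts that sum to the input."""
    fy = np.fft.fftfreq(image.shape[0])[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(image.shape[1])[None, :]  # horizontal frequencies
    low_mask = np.sqrt(fx**2 + fy**2) <= cutoff   # keep frequencies near zero
    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high
```

A coarse-to-fine schedule of the kind the abstract describes could, for instance, train on `low` first and bring in `high` later; that schedule is an interpretation and is not taken from the paper.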
zh
[CV-137] KANGURA: Kolmogorov-Arnold Network-Based Geometry-Aware Learning with Unified Representation Attention for 3D Modeling of Complex Structures
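[Quick read]: This paper addresses the difficulty of modeling how anode structure design affects the performance of Microbial Fuel Cells (MFCs), where existing predictive models struggle to capture the complex three-dimensional (3D) geometric dependencies needed for optimization. The key to the solution is the KANGURA framework, a geometry-aware learning method based on the Kolmogorov-Arnold Network (KAN) that reconstructs geometric relationships through function decomposition instead of a conventional multi-layer perceptron (MLP). It also introduces geometry-disentangled representation learning and a unified attention mechanism that separate structural variations into interpretable components and dynamically emphasize critical geometric regions, improving modeling accuracy and generalization on complex 3D structures.
Link: https://arxiv.org/abs/2511.13798
Authors: Mohammad Reza Shafie,Morteza Hajiabadi,Hamed Khosravi,Mobina Noori,Imtiaz Ahmed
Affiliations: Iran University of Science and Technology; West Virginia University; Georgia Institute of Technology; University of California, Davis
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: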
Abstract:Microbial Fuel Cells (MFCs) offer a promising pathway for sustainable energy generation by converting organic matter into electricity through microbial processes. A key factor influencing MFC performance is the anode structure, where design and material properties play a crucial role. Existing predictive models struggle to capture the complex geometric dependencies necessary to optimize these structures. To solve this problem, we propose KANGURA: Kolmogorov-Arnold Network-Based Geometry-Aware Learning with Unified Representation Attention. KANGURA introduces a new approach to three-dimensional (3D) machine learning modeling. It formulates prediction as a function decomposition problem, where Kolmogorov-Arnold Network (KAN)-based representation learning reconstructs geometric relationships without a conventional multi-layer perceptron (MLP). To refine spatial understanding, geometry-disentangled representation learning separates structural variations into interpretable components, while unified attention mechanisms dynamically enhance critical geometric regions. Experimental results demonstrate that KANGURA outperforms over 15 state-of-the-art (SOTA) models on the ModelNet40 benchmark dataset, achieving 92.7% accuracy, and excels in a real-world MFC anode structure problem with 97% accuracy. This establishes KANGURA as a robust framework for 3D geometric modeling, unlocking new possibilities for optimizing complex structures in advanced manufacturing and quality-driven engineering applications.
zh
[CV-138] A Trajectory-free Crash Detection Framework with Generative Approach and Segment Map Diffusion
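[Quick read]: This paper addresses the limited accuracy of existing crash detection methods caused by difficult trajectory acquisition and unstable vehicle tracking. Its core solution is a two-stage trajectory-free crash detection framework that models individual-level traffic dynamics directly with road segment maps and uses a diffusion-based generative model to predict plausible future segment states. In the first stage, the Mapfusion model incorporates background context through ControlNet and uses sequential embedding components to capture the temporal evolution of segment map sequences, denoising from noise to a normal road segment map. In the second stage, crashes are identified by comparing the monitored segment map with the maps generated by the diffusion model, giving accurate and robust real-time crash detection.
Link: https://arxiv.org/abs/2511.13795
Authors: Weiying Shen,Hao Yu,Yu Dong,Pan Liu,Yu Han,Xin Wen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: To be presented at TRB 2026 (TRBAM-26-01711) and a revised version will be submitted to Transportation Research Part C: Emerging Technologies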
Abstract:Real-time crash detection is essential for developing proactive safety management strategies and enhancing overall traffic efficiency. To address the limitations associated with trajectory acquisition and vehicle tracking, road segment maps recording individual-level traffic dynamic data are used directly for crash detection. A novel two-stage trajectory-free crash detection framework is presented to generate plausible future road segment maps and identify crashes. The first-stage diffusion-based segment map generation model, Mapfusion, learns a noisy-to-normal process: noise is progressively added to the road segment map until the map is corrupted to pure Gaussian noise, and a denoising process, guided by sequential embedding components that capture the temporal dynamics of segment map sequences, reverses this corruption. Furthermore, the generation model is designed to incorporate background context through ControlNet to enhance generation control. In the second stage, crash detection is achieved by comparing the monitored segment map with the maps generated by the diffusion model. Trained on non-crash vehicle motion data, Mapfusion successfully generates realistic road segment evolution maps based on learned motion patterns and remains robust across different sampling intervals. Experiments on real-world crashes indicate the effectiveness of the proposed two-stage method in accurately detecting crashes.
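The two ingredients described here, corrupting a map with Gaussian noise and scoring a monitored map against a generated one, can be sketched in a few lines. This is a generic DDPM-style illustration, not Mapfusion: the linear noise schedule, the step count, and the mean-absolute-difference score are assumptions, and the learned denoiser and ControlNet conditioning are omitted.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample the noised map x_t directly from the clean map x0."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def crash_score(monitored_map, generated_map):
    """Mean absolute difference; larger values suggest abnormal traffic."""
    return float(np.mean(np.abs(monitored_map - generated_map)))
```

In a setup like the one described, a model trained only on non-crash data would generate the expected map, and a score above some threshold (to be chosen on validation data) would flag a potential crash.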
zh
[CV-139] FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching
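[Quick read]: This paper addresses two problems in current multi-modal image fusion: reliance on task-specific models, which raises training cost and limits scalability, and slow inference in generative methods caused by complex sampling trajectories. The key to the solution is to model image fusion as a direct probabilistic transport from the source modalities to the fused image distribution, using the flow matching paradigm to improve sampling efficiency and structural consistency. To compensate for the lack of high-quality supervision, fusion results from several state-of-the-art models are collected as priors, a task-aware selection function picks the most reliable pseudo-labels, and a Fusion Refiner module decomposes and enhances degraded components in those pseudo-labels. For multi-task scenarios, elastic weight consolidation and experience replay are integrated to balance parameter stability and memory retention, improving continual learning while keeping the model lightweight.
Link: https://arxiv.org/abs/2511.13794
Authors: Huayi Zhu,Xiu Shu,Youqiang Xiong,Qiao Liu,Rui Chen,Di Yuan,Xiaojun Chang,Zhenyu He
Affiliations: Guangzhou Institute of Technology, Xidian University; Guangzhou University; Chongqing Normal University; Australian Artificial Intelligence Institute; Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: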
Abstract:Current multi-modal image fusion methods typically rely on task-specific models, leading to high training costs and limited scalability. While generative methods provide a unified modeling perspective, they often suffer from slow inference due to the complex sampling trajectories from noise to image. To address this, we formulate image fusion as a direct probabilistic transport from source modalities to the fused image distribution, leveraging the flow matching paradigm to improve sampling efficiency and structural consistency. To mitigate the lack of high-quality fused images for supervision, we collect fusion results from multiple state-of-the-art models as priors, and employ a task-aware selection function to select the most reliable pseudo-labels for each task. We further introduce a Fusion Refiner module that employs a divide-and-conquer strategy to systematically identify, decompose, and enhance degraded components in selected pseudo-labels. For multi-task scenarios, we integrate elastic weight consolidation and experience replay mechanisms to preserve cross-task performance and enhance continual learning ability from both parameter stability and memory retention perspectives. Our approach achieves competitive performance across diverse fusion tasks, while significantly improving sampling efficiency and maintaining a lightweight model design. The code will be available at: this https URL.
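The flow matching idea, transporting a source sample directly to a target sample along a simple path, can be sketched with the standard straight-line formulation. This is a generic illustration rather than the FusionFM model: the function names are hypothetical, the linear path is an assumption, and `velocity_fn` stands in for a trained network.

```python
import numpy as np

def flow_matching_pair(x_source, x_target, t):
    """Training pair for time t in [0, 1]: interpolated point and target velocity."""
    x_t = (1.0 - t) * x_source + t * x_target  # point on the straight path
    velocity = x_target - x_source             # constant velocity along that path
    return x_t, velocity

def euler_sample(x_source, velocity_fn, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x_source.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x
```

A network would be trained to regress `velocity` from `x_t` and `t`. When the learned field is close to a straight line, a handful of Euler steps is enough, which is the usual argument for the sampling efficiency the abstract mentions.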
zh
[CV-140] Exploring Transferability of Self-Supervised Learning by Task Conflict Calibration
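[Quick read]: This paper addresses how to model and improve representation transferability in Self-Supervised Learning (SSL), that is, how to make features learned by SSL models transfer more effectively across downstream tasks. The core obstacle is task conflict, which suppresses the generality of the representations. The key to the solution is Task Conflict Calibration (TC²): it first splits each batch to create multiple SSL tasks and inject task-level information; it then uses a factor extraction network to produce causal generative factors and a weight extraction network to assign a dedicated weight to each sample, with data reconstruction, orthogonality, and sparsity constraints to ensure effectiveness; finally, it calibrates sample representations during SSL training within a two-stage bi-level optimization framework, improving the transferability of the learned representations.
Link: https://arxiv.org/abs/2511.13787
Authors: Huijie Guo,Jingyao Wang,Peizheng Guo,Xingchen Shen,Changwen Zheng,Wenwen Qiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: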
Abstract:In this paper, we explore the transferability of SSL by addressing two central questions: (i) what is the representation transferability of SSL, and (ii) how can we effectively model this transferability? Transferability is defined as the ability of a representation learned from one task to support the objective of another. Inspired by the meta-learning paradigm, we construct multiple SSL tasks within each training batch to support explicitly modeling transferability. Based on empirical evidence and causal analysis, we find that although introducing task-level information improves transferability, it is still hindered by task conflict. To address this issue, we propose a Task Conflict Calibration (TC$^2$) method to alleviate the impact of task conflict. Specifically, it first splits batches to create multiple SSL tasks, infusing task-level information. Next, it uses a factor extraction network to produce causal generative factors for all tasks and a weight extraction network to assign dedicated weights to each sample, employing data reconstruction, orthogonality, and sparsity to ensure effectiveness. Finally, TC$^2$ calibrates sample representations during SSL training and integrates into the pipeline via a two-stage bi-level optimization framework to boost the transferability of learned representations. Experimental results on multiple downstream tasks demonstrate that our method consistently improves the transferability of SSL models.
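[CV-141] Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection AAAI2026
Quick read: This paper targets two core challenges in few-shot video object detection (FSVOD): maintaining temporal consistency across frames under occlusion and appearance change, and generalizing to novel classes without relying on complex region proposals. Traditional methods typically need large amounts of labeled data and depend on compute-heavy region proposal mechanisms, which limits their use in few-shot settings. The paper proposes an object-aware temporal modeling approach whose key element is a filtering mechanism that propagates only high-confidence object features across frames, giving efficient feature propagation, less noise accumulation, and higher detection accuracy. By combining few-shot-trained detection and classification heads with this focused propagation strategy, the method achieves robust temporal consistency without explicit object tube proposals, improves average precision (AP) in the 5-shot setting, and also shows gains in the 1-shot, 3-shot, and 10-shot configurations.
Link: https://arxiv.org/abs/2511.13784
Authors: Yogesh Kumar,Anand Mishra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026 Main Track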
Abstract:Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals, which are often computationally expensive and require task-specific training. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in a few-shot setting. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Our approach achieves performance gains, with AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5% (VidVRD) in the 5-shot setting. Further results demonstrate improvements in 1-shot, 3-shot, and 10-shot configurations. We make the code public at: this https URL
[CV-142] Known Meets Unknown: Mitigating Overconfidence in Open Set Recognition
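Quick read: This paper addresses overconfidence in open set recognition (OSR): when unknown samples are semantically similar to known classes, inter-class overlap in the feature space leads the model to assign them unduly high confidence and misclassify them as known classes, weakening the decision boundary between known and unknown classes. The key to the solution is a two-module framework: a perturbation-based uncertainty estimation module, which applies controllable parameter perturbations to generate diverse predictions and quantify predictive uncertainty; and a learning-based unknown detection module, which uses the estimated uncertainty in a two-stage procedure to better separate known from unknown classes and thereby improve OSR performance.
Link: https://arxiv.org/abs/2511.13775
Authors: Dongdong Zhao,Ranxin Fang,Changtian Song,Zhihui Liu,Jianwen Xiang
Affiliations: Wuhan University of Technology; Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, 2 tables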
Abstract:Open Set Recognition (OSR) requires models not only to accurately classify known classes but also to effectively reject unknown samples. However, when unknown samples are semantically similar to known classes, inter-class overlap in the feature space often causes models to assign unjustifiably high confidence to them, leading to misclassification as known classes – a phenomenon known as overconfidence. This overconfidence undermines OSR by blurring the decision boundary between known and unknown classes. To address this issue, we propose a framework that explicitly mitigates overconfidence caused by inter-class overlap. The framework consists of two components: a perturbation-based uncertainty estimation module, which applies controllable parameter perturbations to generate diverse predictions and quantify predictive uncertainty, and an unknown detection module with distinct learning-based classifiers, implemented as a two-stage procedure, which leverages the estimated uncertainty to improve discrimination between known and unknown classes, thereby enhancing OSR performance. Experimental results on three public datasets show that the proposed framework achieves superior performance over existing OSR methods.
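The perturbation-based uncertainty idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy linear classifier, the Gaussian noise model, and the names `predict` and `perturbation_uncertainty` are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Toy linear classifier: one weight vector per class (hypothetical model)."""
    return softmax([sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights])


def perturbation_uncertainty(weights, x, scale, n_samples=32, seed=0):
    """Perturb the parameters with Gaussian noise of std `scale`, collect the
    resulting predictions, and return (mean prediction, total variance).
    A larger variance means the prediction is less stable under perturbation."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_samples):
        noisy = [[w_i + rng.gauss(0.0, scale) for w_i in w] for w in weights]
        runs.append(predict(noisy, x))
    n_classes = len(weights)
    mean = [sum(r[c] for r in runs) / n_samples for c in range(n_classes)]
    variance = sum(
        sum((r[c] - mean[c]) ** 2 for r in runs) / n_samples
        for c in range(n_classes)
    )
    return mean, variance
```

In this sketch the variance would be passed to a downstream detector as an extra feature; how the paper's two-stage unknown detection module actually consumes the uncertainty is not specified in the abstract.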
[CV-143] Can LLMs Create Legally Relevant Summaries and Analyses of Videos?
Quick read: This paper addresses the difficulty laypeople face when drafting legal documents, such as insurance claims or court filings, because they struggle to accurately understand and articulate the facts of an event. The key to the solution is to use large language models (LLMs) to understand and summarize legally relevant events directly from video content and to draft legal letters from those summaries. Analyzing 120 YouTube videos covering diverse legal scenarios, the study finds that 71.7% of the LLM-generated summaries were rated as high or medium quality without the user having to describe the event in text, pointing to a feasible technical route toward better access to justice.
Link: https://arxiv.org/abs/2511.13772
Authors: Lyra Hoeben-Kuil,Gijs van Dijck,Jaromir Savelka,Johanna Gunawan,Konrad Kollnig,Marta Kolacz,Mindy Duffourc,Shashank Chakravarthy,Hannes Westermann
Affiliations: Maastricht University; Brightlands Institute for Smart Society; Carnegie Mellon University
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Accepted for publication at JURIX 2025 Torino, Italy. This is the preprint version. Code and data available at: this https URL
Abstract:Understanding the legally relevant factual basis of an event and conveying it through text is a key skill of legal professionals. This skill is important for preparing forms (e.g., insurance claims) or other legal documents (e.g., court claims), but often presents a challenge for laypeople. Current AI approaches aim to bridge this gap, but mostly rely on the user to articulate what has happened in text, which may be challenging for many. Here, we investigate the capability of large language models (LLMs) to understand and summarize events occurring in videos. We ask an LLM to summarize and draft legal letters, based on 120 YouTube videos showing legal issues in various domains. Overall, 71.7% of the summaries were rated as of high or medium quality, which is a promising result, opening the door to a number of applications in e.g. access to justice.
[CV-144] MoETTA: Test-Time Adaptation Under Mixed Distribution Shifts with MoE-LayerNorm AAAI2026
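Quick read: This paper addresses the performance drop of test-time adaptation (TTA) under the mixed distribution shifts found in real-world deployments, in particular the inability of existing methods, which rely on a single unified adaptation path, to handle gradient directions that differ substantially across domains. The core of the solution is MoETTA, an entropy-based TTA framework that integrates a Mixture-of-Experts (MoE) architecture: a set of structurally decoupled experts lets the model update parameters along diverse directions for samples from different domains, yielding flexible, disentangled adaptation to heterogeneous shifts.
Link: https://arxiv.org/abs/2511.13760
Authors: Xiao Fan,Jingyan Jiang,Zhaoru Chen,Fanding Huang,Xiao Chen,Qinting Jiang,Bowen Zhang,Xing Tang,Zhi Wang
Affiliations: Tsinghua University; Shenzhen Technology University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026 Main Technical Track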
Abstract:Test-Time adaptation (TTA) has proven effective in mitigating performance drops under single-domain distribution shifts by updating model parameters during inference. However, real-world deployments often involve mixed distribution shifts, where test samples are affected by diverse and potentially conflicting domain factors, posing significant challenges even for SOTA TTA methods. A key limitation in existing approaches is their reliance on a unified adaptation path, which fails to account for the fact that optimal gradient directions can vary significantly across different domains. Moreover, current benchmarks focus only on synthetic or homogeneous shifts, failing to capture the complexity of real-world heterogeneous mixed distribution shifts. To address this, we propose MoETTA, a novel entropy-based TTA framework that integrates the Mixture-of-Experts (MoE) architecture. Rather than enforcing a single parameter update rule for all test samples, MoETTA introduces a set of structurally decoupled experts, enabling adaptation along diverse gradient directions. This design allows the model to better accommodate heterogeneous shifts through flexible and disentangled parameter updates. To simulate realistic deployment conditions, we introduce two new benchmarks: potpourri and potpourri+. While classical settings focus solely on synthetic corruptions, potpourri encompasses a broader range of domain shifts–including natural, artistic, and adversarial distortions–capturing more realistic deployment challenges. Additionally, potpourri+ further includes source-domain samples to evaluate robustness against catastrophic forgetting. Extensive experiments across three mixed distribution shifts settings show that MoETTA consistently outperforms strong baselines, establishing SOTA performance and highlighting the benefit of modeling multiple adaptation directions via expert-level diversity.
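For readers unfamiliar with entropy-based TTA, the quantity such methods typically minimize at test time is the Shannon entropy of the model's softmax output. The sketch below shows only that quantity; it assumes a generic entropy objective and does not reproduce MoETTA's expert routing or update rule, which the abstract does not detail. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_batch_entropy(batch_logits):
    """Average prediction entropy over a batch of unlabeled test samples.
    An entropy-based TTA method would take gradient steps to reduce this."""
    return sum(entropy(softmax(l)) for l in batch_logits) / len(batch_logits)
```

A confident prediction gives low entropy and a uniform one gives the maximum, so the objective needs no labels, which is what makes it usable at inference time.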
[CV-145] nuCarla: A nuScenes-Style Birds-Eye View Perception Dataset for CARLA Simulation
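Quick read: This paper addresses the lack of high-quality, large-scale, thoroughly verified datasets for closed-loop simulation of end-to-end (E2E) autonomous driving systems, and in particular the weak support for learning intermediate representations such as bird's-eye-view (BEV) features, which leaves current closed-loop models far behind even simple rule-based baselines. The key to the solution is nuCarla, a large-scale BEV perception dataset built in the CARLA simulator. Its main strengths are: full compatibility with the nuScenes format, enabling transfer of real-world perception models; a scale close to that of nuScenes with a more balanced class distribution; direct usability for closed-loop simulation deployment; and high-performance BEV backbones that achieve state-of-the-art detection results. By releasing data and models as open benchmarks, nuCarla substantially accelerates closed-loop E2E autonomous driving development.
Link: https://arxiv.org/abs/2511.13744
Authors: Zhijie Qiao,Zhong Cao,Henry X. Liu
Affiliations: University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: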
Abstract:End-to-end (E2E) autonomous driving heavily relies on closed-loop simulation, where perception, planning, and control are jointly trained and evaluated in interactive environments. Yet, most existing datasets are collected from the real world under non-interactive conditions, primarily supporting open-loop learning while offering limited value for closed-loop testing. Due to the lack of standardized, large-scale, and thoroughly verified datasets to facilitate learning of meaningful intermediate representations, such as bird’s-eye-view (BEV) features, closed-loop E2E models remain far behind even simple rule-based baselines. To address this challenge, we introduce nuCarla, a large-scale, nuScenes-style BEV perception dataset built within the CARLA simulator. nuCarla features (1) full compatibility with the nuScenes format, enabling seamless transfer of real-world perception models; (2) a dataset scale comparable to nuScenes, but with more balanced class distributions; (3) direct usability for closed-loop simulation deployment; and (4) high-performance BEV backbones that achieve state-of-the-art detection results. By providing both data and models as open benchmarks, nuCarla substantially accelerates closed-loop E2E development, paving the way toward reliable and safety-aware research in autonomous driving.
[CV-146] ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
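Quick read: This paper addresses the low compression efficiency in LiDAR point cloud geometry compression caused by processing each level independently: existing methods do not share context across bit-depth levels and must re-estimate local features from coordinates at every level, which creates redundant computation and limits the entropy model. The solution rests on three designs: (1) cross-bit-depth feature propagation, which reuses features extracted at dense, low bit-depth levels to support prediction at sparse, high bit-depth levels; (2) a Bag-of-Encoders (BoE) selection mechanism, which picks the best-suited coding network from a pool for each level to match its occupancy statistics, avoiding a separately trained model per level; and (3) a Morton-order-preserving hierarchy, which keeps a consistent global Z-order across levels, eliminating per-level sorting and reducing latency. Together these improve entropy modeling accuracy and computational efficiency, achieving state-of-the-art compression at real-time throughput on the Ford and SemanticKITTI datasets.
Link: https://arxiv.org/abs/2511.14070
Authors: Junsik Kim,Gun Bang,Soowoong Kim
Affiliations: Electronics and Telecommunications Research Institute (ETRI)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: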
Abstract:Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and computation efficiency, yielding state-of-the-art compression at real-time throughput on Ford and SemanticKITTI. Code and models will be released upon publication.
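Why a Morton (Z-order) hierarchy can avoid per-level sorting follows from a standard property of Morton codes, sketched below. This is a generic illustration, not ELiC's implementation; `part1by2`, `morton3d`, and `parent_code` are names chosen here, and the 21-bit coordinate limit is an assumption.

```python
def part1by2(v: int) -> int:
    """Spread the low 21 bits of v so that bit i moves to bit 3*i."""
    out = 0
    for i in range(21):
        out |= ((v >> i) & 1) << (3 * i)
    return out


def morton3d(x: int, y: int, z: int) -> int:
    """Interleave the bits of (x, y, z) into a single Z-order key."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)


def parent_code(code: int) -> int:
    """Key of the enclosing voxel one bit-depth coarser: drop the lowest
    bit of each coordinate, i.e. the lowest three interleaved bits."""
    return code >> 3


# Voxels sorted by Morton key at the fine level ...
voxels = [(5, 1, 7), (0, 0, 0), (3, 6, 2), (7, 7, 7), (1, 0, 4), (2, 2, 2)]
fine = sorted(morton3d(*v) for v in voxels)
# ... yield parent keys that are already in non-decreasing order,
# so the coarser level does not need to be sorted again.
coarse = [parent_code(c) for c in fine]
```

Because a right shift is monotone, ordering at one depth carries over to every coarser depth, which is the property a Morton-order-preserving hierarchy relies on.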
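[CV-147] The CHASM-SWPC Dataset for Coronal Hole Detection Analysis
Quick read: This paper addresses the accuracy and efficiency of automated solar coronal hole (CH) detection, in particular the scarcity of labeled data and the time cost of manual annotation when identifying CHs in extreme ultraviolet (EUV) images. The key to the solution is a high-quality annotated dataset, CHASM-SWPC-1111, together with CHASM (Coronal Hole Annotation using Semi-automatic Methods), a semi-automatic annotation tool that improves annotation efficiency and consistency. On this basis, a CHRONNOS neural network trained on the dataset outperforms the original pretrained model on accuracy, True Skill Statistic (TSS), and intersection-over-union (IoU), supporting the value of the new dataset and annotation method for automated detection.
Link: https://arxiv.org/abs/2511.14044
Authors: Cutter Beck,Evan Smith,Khagendra Katuwal,Rudra Kafle,Jacob Whitehill
Affiliations: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: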
Abstract:Coronal holes (CHs) are low-activity, low-density solar coronal regions with open magnetic field lines (Cranmer 2009). In the extreme ultraviolet (EUV) spectrum, CHs appear as dark patches. Using daily hand-drawn maps from the Space Weather Prediction Center (SWPC), we developed a semi-automated pipeline to digitize the SWPC maps into binary segmentation masks. The resulting masks constitute the CHASM-SWPC dataset, a high-quality dataset to train and test automated CH detection models, which is released with this paper. We developed CHASM (Coronal Hole Annotation using Semi-automatic Methods), a software tool for semi-automatic annotation that enables users to rapidly and accurately annotate SWPC maps. The CHASM tool enabled us to annotate 1,111 CH masks, comprising the CHASM-SWPC-1111 dataset. We then trained multiple neural networks with the CHRONNOS (Coronal Hole RecOgnition Neural Network Over multi-Spectral-data) architecture (Jarolim et al. 2021) using the CHASM-SWPC dataset and compared their performance. Training the CHRONNOS neural network on these data achieved an accuracy of 0.9805, a True Skill Statistic (TSS) of 0.6807, and an intersection-over-union (IoU) of 0.5668. This is higher than the original pretrained CHRONNOS model (Jarolim et al. 2021), which achieved an accuracy of 0.9708, a TSS of 0.6749, and an IoU of 0.4805 when evaluated on the CHASM-SWPC-1111 test set.
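The three metrics reported above have standard definitions for binary segmentation, sketched here for flattened masks. This is the textbook formulation and an assumption about how the paper computes them; `segmentation_metrics` is a hypothetical helper name.

```python
def segmentation_metrics(pred, truth):
    """Accuracy, True Skill Statistic, and IoU for flat binary masks (0/1).

    TSS = TP / (TP + FN) - FP / (FP + TN)   (sensitivity + specificity - 1)
    IoU = TP / (TP + FP + FN)
    Assumes both classes occur in `truth`, so no denominator is zero.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tss = tp / (tp + fn) - fp / (fp + tn)
    iou = tp / (tp + fp + fn)
    return accuracy, tss, iou


# Toy masks: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
accuracy, tss, iou = segmentation_metrics(pred, truth)
```

Coronal holes cover a small fraction of the solar disk, so accuracy is dominated by background pixels; that is one reason TSS and IoU are reported alongside it.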
[CV-148] PoCGM: Poisson-Conditioned Generative Model for Sparse-View CT Reconstruction
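Quick read: This paper addresses the severe aliasing artifacts and loss of structural detail that arise in sparse-view computed tomography (CT) reconstruction when the number of projection views is reduced, a significant obstacle to clinical use. The key to the solution is PoCGM (Poisson-Conditioned Generative Model), a conditional generative model built on the Poisson Flow Generative Model (PFGM++). By introducing sparse-view data as conditional guidance during both training and sampling, it turns the originally unconditional framework into a conditional one that models the posterior distribution of full-view reconstructions given sparse observations. This design lets PoCGM suppress artifacts while preserving fine structural detail, with strong performance in low-dose and time-critical imaging scenarios.
Link: https://arxiv.org/abs/2511.13967
Authors: Changsheng Fang,Yongtong Liu,Bahareh Morovati,Shuo Han,Li Zhou,Hengyong Yu
Affiliations: University of Massachusetts Lowell
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18th International Meeting on Fully 3D Image Reconstruction in Radiology and Nuclear Medicine, Shanghai, China, 2025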
Abstract:In computed tomography (CT), reducing the number of projection views is an effective strategy to lower radiation exposure and/or improve temporal resolution. However, this often results in severe aliasing artifacts and loss of structural details in reconstructed images, posing significant challenges for clinical applications. Inspired by the success of the Poisson Flow Generative Model (PFGM++) in natural image generation, we propose a PoCGM (Poisson-Conditioned Generative Model) to address the challenges of sparse-view CT reconstruction. Since PFGM++ was originally designed for unconditional generation, it lacks direct applicability to medical imaging tasks that require integrating conditional inputs. To overcome this limitation, the PoCGM reformulates PFGM++ into a conditional generative framework by incorporating sparse-view data as guidance during both training and sampling phases. By modeling the posterior distribution of full-view reconstructions conditioned on sparse observations, PoCGM effectively suppresses artifacts while preserving fine structural details. Qualitative and quantitative evaluations demonstrate that PoCGM outperforms the baselines, achieving improved artifact suppression, enhanced detail preservation, and reliable performance in dose-sensitive and time-critical imaging scenarios.
[CV-149] Self-Supervised Compression and Artifact Correction for Streaming Underwater Imaging Sonar WACV2026
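Quick read: This paper addresses two challenges for real-time imaging sonar in underwater monitoring: limited uplink bandwidth and severe sonar-specific artifacts (such as speckle noise, motion blur, reverberation, and acoustic shadows), which can affect up to 98% of frames. The key to the solution is the SCOPE framework, whose core components are: (i) Adaptive Codebook Compression (ACC), which learns frequency-encoded latent representations tailored to sonar; and (ii) Frequency-Aware Multiscale Segmentation (FAMS), which decomposes each frame into low-frequency structure and sparse high-frequency dynamics while suppressing rapidly fluctuating artifacts. A "hedging" training strategy additionally uses label-free low-pass proxy pairs to guide frequency-aware learning. Without clean-noise pairs or synthetic assumptions, the method reconstructs sonar images at very low bitrates (down to 0.0118 bpp) with an SSIM of 0.77, cuts uplink bandwidth by more than 80%, and improves downstream detection; it has been deployed long-term in real time on three rivers in the Pacific Northwest.
Link: https://arxiv.org/abs/2511.13922
Authors: Rongsheng Qian,Chi Xu,Xiaoqiang Ma,Hao Fang,Yili Jin,William I. Atlas,Jiangchuan Liu
Affiliations: Simon Fraser University; Douglas College; McGill University; Wild Salmon Center
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted to WACV 2026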
Abstract:Real-time imaging sonar has become an important tool for underwater monitoring in environments where optical sensing is unreliable. Its broader use is constrained by two coupled challenges: highly limited uplink bandwidth and severe sonar-specific artifacts (speckle, motion blur, reverberation, acoustic shadows) that affect up to 98% of frames. We present SCOPE, a self-supervised framework that jointly performs compression and artifact correction without clean-noise pairs or synthetic assumptions. SCOPE combines (i) Adaptive Codebook Compression (ACC), which learns frequency-encoded latent representations tailored to sonar, with (ii) Frequency-Aware Multiscale Segmentation (FAMS), which decomposes frames into low-frequency structure and sparse high-frequency dynamics while suppressing rapidly fluctuating artifacts. A hedging training strategy further guides frequency-aware learning using low-pass proxy pairs generated without labels. Evaluated on months of in-situ ARIS sonar data, SCOPE achieves a structural similarity index (SSIM) of 0.77, representing a 40% improvement over prior self-supervised denoising baselines, at bitrates down to 0.0118 bpp. It reduces uplink bandwidth by more than 80% while improving downstream detection. The system runs in real time, with 3.1 ms encoding on an embedded GPU and 97 ms full multi-layer decoding on the server end. SCOPE has been deployed for months in three Pacific Northwest rivers to support real-time salmon enumeration and environmental monitoring in the wild. Results demonstrate that learning frequency-structured latents enables practical, low-bitrate sonar streaming with preserved signal details under real-world deployment conditions.
Artificial Intelligence
[AI-0] Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration
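Quick read: This paper addresses power distribution system (PDS) restoration after large-scale outages, where topology reconfiguration and coordinated control of distributed energy resources (DERs) under nonlinear constraints (such as power balance, voltage limits, and thermal ratings) make conventional optimization and value-based reinforcement learning (RL) algorithms computationally inefficient and hard to scale. The key to the solution is a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), for coordinated restoration across microgrids. Each agent controls one microgrid with its own loads, DER capacities, and switch count, reflecting practical structural heterogeneity. Decentralized actor policies are paired with a centralized critic for stable on-policy updates, while a physics-informed OpenDSS simulation environment provides full power flow feedback and handles operational constraints through differentiable penalty signals instead of invalid-action masking. On the IEEE 123-bus and IEEE 8500-node systems, this yields faster convergence, higher restored power, and smoother multi-seed training.
Link: https://arxiv.org/abs/2511.14730
Authors: Parya Dolatyabi,Mahdi Khodayar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, TPEC 2025 Conference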
Abstract:Restoring power distribution systems (PDS) after large-scale outages requires sequential switching operations that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints such as power balance, voltage limits, and thermal ratings. These challenges make conventional optimization and value-based RL approaches computationally inefficient and difficult to scale. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts, introducing practical structural heterogeneity. Decentralized actor policies are trained with a centralized critic to compute advantage values for stable on-policy updates. A physics-informed OpenDSS environment provides full power flow feedback and enforces operational limits via differentiable penalty signals rather than invalid action masking. The total DER generation is capped at 2400 kW, and each microgrid must satisfy local supply-demand feasibility. Experiments on the IEEE 123-bus and IEEE 8500-node systems show that HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.
[AI-1] Automated proving in planar geometry based on the complex number identity method and elimination
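Quick read: This paper addresses the automation of proofs by the complex number identity method, that is, how to automatically verify that a target expression is real given hypotheses stated as real relations. The key to the solution is to reduce the problem to an elimination-ideal computation: each real-relational hypothesis $h_i$ is rewritten as $h_i - r_i$ and the thesis $t$ as $t - r$; denominators are cleared and an extra expression with a slack variable is introduced, so that all free and relational point variables can be eliminated. The result is an ideal $I$ in $\mathbb{Q}[r, r_1, r_2, \ldots]$; if $I$ contains a polynomial $p(r)$ that is linear in $r$, and no division by zero occurs when expressing $r$, then $r$ must be real whenever $r_1, r_2, \ldots$ are real, which completes the automated proof. The method is implemented in Mathematica, Maple, and Giac, and a prototype is integrated into an experimental version of GeoGebra.
Link: https://arxiv.org/abs/2511.14728
Authors: Zoltán Kovács,Xicheng Peng
Affiliations: Unknown
Subjects: Computational Geometry (cs.CG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures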
Abstract:We improve the complex number identity proving method to a fully automated procedure, based on elimination ideals. By using declarative equations or rewriting each real-relational hypothesis $h_i$ to $h_i - r_i$, and the thesis $t$ to $t - r$, clearing the denominators and introducing an extra expression with a slack variable, we eliminate all free and relational point variables. From the obtained ideal $I$ in $\mathbb{Q}[r, r_1, r_2, \ldots]$ we can find a conclusive result. It plays an important role that if $r_1, r_2, \ldots$ are real, $r$ must also be real if there is a linear polynomial $p(r) \in I$, unless division by zero occurs when expressing $r$. Our results are presented in Mathematica, Maple and in a new version of the Giac computer algebra system. Finally, we present a prototype of the automated procedure in an experimental version of the dynamic geometry software GeoGebra.
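[AI-2] FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning
Quick read: This paper addresses the threat of malicious clients in real-world federated learning (FL) deployments, in particular damage to model integrity from Byzantine attacks, data poisoning, and adaptive adversarial behavior. Existing defenses mostly rely on static thresholds and binary classification, so they adapt poorly to evolving client behavior. The key to the solution is FLARE, an adaptive reputation-based framework that turns client reliability from a binary decision into a continuous, multi-dimensional trust score. It combines a self-calibrating adaptive threshold, reputation-weighted aggregation with soft exclusion, and reputation scoring on updates privatized with Local Differential Privacy (LDP), so that suspicious contributions are limited proportionally rather than removed outright while convergence speed and accuracy are maintained. Experiments show that FLARE outperforms existing methods under a range of attacks, improving robustness by up to 16% while keeping model convergence within 30% of the non-attacked baseline.
Link: https://arxiv.org/abs/2511.14715
Authors: Abolfazl Younesi,Leon Kiss,Zahra Najafabadi Samani,Juan Aznar Poveda,Thomas Fahringer
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: Under Review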
Abstract:Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. this https URL
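Reputation-weighted aggregation with soft exclusion, item (iii) above, might look roughly like the following. The weighting rule, the `threshold` and `soft_factor` parameters, and both function names are assumptions for illustration; the abstract does not give FLARE's actual formula.

```python
def soft_exclusion_weights(reputations, threshold=0.5, soft_factor=0.1):
    """Turn reputation scores into normalized aggregation weights.

    Clients at or above `threshold` keep their reputation as raw weight;
    clients below it are scaled down by `soft_factor` instead of being
    dropped, so they still contribute a little (soft exclusion).
    """
    raw = [r if r >= threshold else r * soft_factor for r in reputations]
    total = sum(raw)
    if total == 0:
        raise ValueError("all clients have zero weight")
    return [w / total for w in raw]


def aggregate(updates, weights):
    """Weighted average of client update vectors (lists of floats)."""
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]


# Two well-behaved clients and one suspicious client with an outlying update.
updates = [[1.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
reputations = [0.9, 0.9, 0.2]
weights = soft_exclusion_weights(reputations)
global_update = aggregate(updates, weights)
```

With these toy numbers the outlier pulls the aggregate only slightly above 1.0, whereas a plain mean would give 4.0; the suspicious client keeps a small nonzero weight and can regain influence if its reputation recovers.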
[AI-3] SkillGen: Learning Domain Skills for In-Context Sequential Decision Making
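Quick read: This paper addresses the unstable performance of large language models (LLMs) in sequential decision-making via in-context learning (ICL), which is highly sensitive to prompt quality. Existing methods struggle to satisfy three principles at once: focusing on decision-critical information, providing step-level granularity, and minimizing reliance on expert annotation through label efficiency. To address this, the authors propose SkillGen, a skill-based ICL framework. It builds an action-centric, domain-level graph from sampled trajectories, identifies high-utility actions via temporal-difference credit assignment, and retrieves step-wise skills to generate fine-grained, context-aware prompts. A theoretical analysis shows that focusing on high-utility segments supports task identifiability and informs ICL prompt design. Experiments on ALFWorld, BabyAI, and ScienceWorld show that SkillGen improves progress rate by 5.9%-16.5% on average across models.
Link: https://arxiv.org/abs/2511.14670
Authors: Ruomeng Ding,Wei Cheng,Minglai Shao,Chen Zhao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: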
Abstract:Large language models (LLMs) are increasingly applied to sequential decision-making through in-context learning (ICL), yet their effectiveness is highly sensitive to prompt quality. Effective prompts should meet three principles: focus on decision-critical information, provide step-level granularity, and minimize reliance on expert annotations through label efficiency. However, existing ICL methods often fail to satisfy all three criteria simultaneously. Motivated by these challenges, we introduce SkillGen, a skill-based ICL framework for structured sequential reasoning. It constructs an action-centric, domain-level graph from sampled trajectories, identifies high-utility actions via temporal-difference credit assignment, and retrieves step-wise skills to generate fine-grained, context-aware prompts. We further present a theoretical analysis showing that focusing on high-utility segments supports task identifiability and informs more effective ICL prompt design. Experiments on ALFWorld, BabyAI, and ScienceWorld, using both open-source and proprietary LLMs, show that SkillGen achieves consistent gains, improving progress rate by 5.9%-16.5% on average across models.
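As a rough illustration of temporal-difference credit assignment over sampled trajectories, the sketch below learns state values with tabular TD(0) and then scores each action by its one-step backed-up return. It is a generic stand-in, not SkillGen's procedure; the transition format, the scoring rule, and the name `td_action_utility` are assumptions.

```python
from collections import defaultdict


def td_action_utility(trajectories, gamma=0.9, alpha=0.5, epochs=50):
    """Score actions from trajectories of (state, action, reward, next_state)
    tuples, where next_state is None at episode end.

    Step 1: learn state values V with tabular TD(0).
    Step 2: score each action by the mean of reward + gamma * V(next_state).
    """
    values = defaultdict(float)
    for _ in range(epochs):
        for traj in trajectories:
            for state, _action, reward, next_state in traj:
                bootstrap = values[next_state] if next_state is not None else 0.0
                values[state] += alpha * (reward + gamma * bootstrap - values[state])

    totals, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:
        for _state, action, reward, next_state in traj:
            bootstrap = values[next_state] if next_state is not None else 0.0
            totals[action] += reward + gamma * bootstrap
            counts[action] += 1
    return {a: totals[a] / counts[a] for a in totals}


# One successful and one unsuccessful toy trajectory (hypothetical task).
trajectories = [
    [("start", "open_drawer", 0.0, "drawer_open"), ("drawer_open", "take_key", 1.0, None)],
    [("start", "wander", 0.0, "hallway"), ("hallway", "wait", 0.0, None)],
]
scores = td_action_utility(trajectories)
```

Here the reward arrives only at the last step, yet `open_drawer` still receives credit through the learned value of the state it leads to, which is the effect credit assignment is meant to capture.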
[AI-4] NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards
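Quick read: This paper addresses the limited reliability and generalization of vision-language-action (VLA) models when deployed across different embodiments or in real-world environments. The solution has three parts. First, a flow-matching-based action expert is added to the pre-trained NORA backbone, which alone yields substantial gains. Second, a set of reward models is designed for post-training, combining an action-conditioned world model (WM) that evaluates whether generated actions lead toward the goal with a deviation-from-ground-truth heuristic that separates good actions from poor ones. Third, the VLA policy is fine-tuned with Direct Preference Optimization (DPO) on preference data built from these rewards, adapting it to the target embodiment. Experiments indicate that this post-training with simple but effective reward signals improves VLA reliability and task success in both simulation and real-robot settings.
Link: https://arxiv.org/abs/2511.14659
Authors: Chia-Yu Hung,Navonil Majumder,Haoyuan Deng,Liu Renhang,Yankang Ang,Amir Zadeh,Chuan Li,Dorien Herremans,Ziwei Wang,Soujanya Poria
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: this https URL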
Abstract:Vision–language–action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
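The preference-building and DPO steps can be sketched in a few lines. The loss below is the standard DPO objective for a single preference pair; the pair-construction helper, the scalar "actions", and the toy reward are assumptions for illustration and do not reflect NORA-1.5's reward models or training code.

```python
import math


def build_preference_pair(candidates, reward):
    """Pick the highest- and lowest-reward candidates as (chosen, rejected)."""
    return max(candidates, key=reward), min(candidates, key=reward)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)), where
    margin = beta * [(log pi(chosen) - log pi_ref(chosen))
                     - (log pi(rejected) - log pi_ref(rejected))]."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy example: scalar actions scored by closeness to a ground-truth action of 0.5.
chosen, rejected = build_preference_pair(
    [0.1, 0.45, 1.2], reward=lambda a: -abs(a - 0.5)
)
```

When the policy equals the reference model the loss is log 2; it falls as the policy shifts probability toward the chosen action relative to the reference, and rises if it shifts toward the rejected one.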
[AI-5] AutoTool: Efficient Tool Selection for Large Language Model Agents AAAI2026
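Quick read: This paper addresses the high inference cost that LLM agents incur from repeatedly calling the LLM to select tools during complex tasks, a bottleneck in frameworks such as ReAct, which invoke the LLM at every step to decide which tool to use. The key to the solution is AutoTool, a graph-based framework built on an empirical observation called tool usage inertia: tool invocations tend to follow predictable sequential patterns. AutoTool models this inertia with a directed graph built from historical agent trajectories (nodes are tools, edges carry transition probabilities) and integrates parameter-level information to refine tool input generation. Traversing this structured representation selects tools and their parameters with minimal reliance on LLM inference, cutting inference cost by up to 30% while maintaining competitive task completion rates.
Link: https://arxiv.org/abs/2511.14650
Authors: Jingyi Jia,Qinbin Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026, 18 pages, 11 figures, Code: this https URL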
Abstract:Large Language Model (LLM) agents have emerged as powerful tools for automating complex tasks by leveraging the reasoning and decision-making abilities of LLMs. However, a major bottleneck in current agent frameworks lies in the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to determine which tool to use at each step. In this work, we propose AutoTool, a novel graph-based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia - the tendency of tool invocations to follow predictable sequential patterns. AutoTool constructs a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, effectively modeling the inertia in tool selection. It further integrates parameter-level information to refine tool input generation. By traversing this structured representation, AutoTool efficiently selects tools and their parameters with minimal reliance on LLM inference. Extensive experiments across diverse agent tasks demonstrate that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates, offering a practical and scalable enhancement for inference-heavy frameworks. Our work highlights the promise of integrating statistical structure into LLM agent design for greater efficiency without sacrificing performance.
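The transition-graph idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy trajectories, and the confidence `threshold` that triggers an LLM fallback are all assumptions made for illustration, and the parameter-level modeling mentioned in the abstract is left out.

```python
from collections import Counter, defaultdict


def build_transition_graph(trajectories):
    """Count tool-to-tool transitions and normalise them into probabilities."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for prev_tool, next_tool_name in zip(traj, traj[1:]):
            counts[prev_tool][next_tool_name] += 1
    graph = {}
    for tool, successors in counts.items():
        total = sum(successors.values())
        graph[tool] = {t: c / total for t, c in successors.items()}
    return graph


def next_tool(graph, current_tool, threshold=0.6):
    """Return the most likely successor, or None to signal an LLM fallback.

    The threshold is a hypothetical knob, not a value from the paper.
    """
    successors = graph.get(current_tool)
    if not successors:
        return None
    best, prob = max(successors.items(), key=lambda kv: kv[1])
    return best if prob >= threshold else None


# Hypothetical historical trajectories (sequences of tool names).
trajectories = [
    ["search", "open_page", "extract", "answer"],
    ["search", "open_page", "extract", "answer"],
    ["search", "open_page", "search", "open_page", "answer"],
]
graph = build_transition_graph(trajectories)
```

In this toy data `search` is always followed by `open_page`, so the graph alone decides that step, whereas the successors of `open_page` are split and the sketch defers to the LLM.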
zh
[AI-6] Adapformer: Adaptive Channel Management for Multivariate Time Series Forecasting
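[Quick read]: This paper addresses the difficulty of accurately modeling complex inter-variable dependencies in multivariate time series forecasting (MTSF). Existing methods typically adopt either a channel-independent (CI) or a channel-dependent (CD) strategy: the former ignores interactions between variables, while the latter tends to introduce redundant noise, leading to overfitting and inefficient prediction. The authors propose the Adaptive Forecasting Transformer (Adapformer), whose core innovation is a dual-stage encoder-decoder architecture. An Adaptive Channel Enhancer (ACE) selectively incorporates essential dependencies at the embedding stage to improve feature representations, and an Adaptive Channel Forecaster (ACF) focuses on the most relevant covariates at the decoding stage, which markedly reduces noise and redundancy. The design combines the strengths of CI and CD strategies, and experiments on multiple datasets show gains in both predictive accuracy and computational efficiency.
Link: https://arxiv.org/abs/2511.14632
Authors: Yuchen Luo,Xinyu Li,Liuhua Peng,Mingming Gong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: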
Abstract:In multivariate time series forecasting (MTSF), accurately modeling the intricate dependencies among multiple variables remains a significant challenge due to the inherent limitations of traditional approaches. Most existing models adopt either channel-independent (CI) or channel-dependent (CD) strategies, each presenting distinct drawbacks. CI methods fail to leverage the potential insights from inter-channel interactions, resulting in models that may not fully exploit the underlying statistical dependencies present in the data. Conversely, CD approaches often incorporate too much extraneous information, risking model overfitting and predictive inefficiency. To address these issues, we introduce the Adaptive Forecasting Transformer (Adapformer), an advanced Transformer-based framework that merges the benefits of CI and CD methodologies through effective channel management. The core of Adapformer lies in its dual-stage encoder-decoder architecture, which includes the Adaptive Channel Enhancer (ACE) for enriching embedding processes and the Adaptive Channel Forecaster (ACF) for refining the predictions. ACE enhances token representations by selectively incorporating essential dependencies, while ACF streamlines the decoding process by focusing on the most relevant covariates, substantially reducing noise and redundancy. Our rigorous testing on diverse datasets shows that Adapformer achieves superior performance over existing models, enhancing both predictive accuracy and computational efficiency, thus making it state-of-the-art in MTSF.
zh
[AI-7] Failure to Mix: Large language models struggle to answer according to desired probability distributions
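[Quick read]: This paper addresses the inability of current large language models (LLMs) to follow a target probability distribution, a capability needed for scientific idea generation and selection. Existing AI benchmarks usually demand deterministic outputs, so LLMs trained with reinforcement learning are steered toward a single best answer, which suppresses probabilistic exploration. The key contribution is a systematic set of experiments that ask LLMs to produce outputs following simple preset distributions (for example, a binary output in which "1" should appear 49% of the time). All modern LLMs tested deviate severely from the requested distribution and behave almost like a step function: they nearly always emit the single most probable outcome, even when its probability is only marginally higher than the alternatives. The finding exposes a fundamental weakness in how LLMs handle probabilistic output and provides empirical evidence for improving generation diversity and controllability.
Link: https://arxiv.org/abs/2511.14630
Authors: Ivy Yuqian Yang,David Yu Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures. Code and reproducibility package: this https URL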
Abstract:Scientific idea generation and selection requires exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of “1” 49% of the time produces an answer of “0” nearly 100% of the time. This step function-like behavior of near-exclusively generating the output with marginally highest probability overrules even strong in-built LLM biases.
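One simple way to quantify how far a batch of responses drifts from a requested distribution is the total variation distance between empirical frequencies and the target. The sketch below is an illustration only, not the paper's evaluation code; the function name and both response lists are hypothetical.

```python
from collections import Counter


def total_variation(samples, target):
    """Total variation distance between empirical frequencies and a target distribution."""
    n = len(samples)
    freq = Counter(samples)
    outcomes = set(target) | set(freq)
    return 0.5 * sum(abs(freq[o] / n - target.get(o, 0.0)) for o in outcomes)


# Requested distribution from the abstract's example: "1" 49% of the time.
target = {"1": 0.49, "0": 0.51}

# Hypothetical response batches: one step-function-like, one well mixed.
step_like = ["0"] * 100
well_mixed = ["1"] * 49 + ["0"] * 51

tv_step = total_variation(step_like, target)
tv_mixed = total_variation(well_mixed, target)
```

A step-function-like responder that always answers "0" sits at a distance of about 0.49 from the target, while a responder that mixes its answers in the requested proportion sits at about 0.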
zh
[AI-8] Expert-Guided POMDP Learning for Data-Efficient Modeling in Healthcare
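[Quick read]: This paper addresses the difficulty of learning the parameters of Partially Observable Markov Decision Processes (POMDPs) from limited data. The key to the solution is the Fuzzy MAP EM algorithm, which brings expert knowledge into parameter estimation: fuzzy pseudo-counts generated by an expert-defined fuzzy model are embedded in the standard Expectation Maximization (EM) framework, which naturally recasts the problem as Maximum A Posteriori (MAP) estimation and improves learning under low-data and high-noise conditions.
Link: https://arxiv.org/abs/2511.14619
Authors: Marco Locatelli,Arjen Hommersom,Roberto Clemens Cerioli,Daniela Besozzi,Fabio Stella
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: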
Abstract:Learning the parameters of Partially Observable Markov Decision Processes (POMDPs) from limited data is a significant challenge. We introduce the Fuzzy MAP EM algorithm, a novel approach that incorporates expert knowledge into the parameter estimation process by enriching the Expectation Maximization (EM) framework with fuzzy pseudo-counts derived from an expert-defined fuzzy model. This integration naturally reformulates the problem as a Maximum A Posteriori (MAP) estimation, effectively guiding learning in environments with limited data. In synthetic medical simulations, our method consistently outperforms the standard EM algorithm under both low-data and high-noise conditions. Furthermore, a case study on Myasthenia Gravis illustrates the ability of the Fuzzy MAP EM algorithm to recover a clinically coherent POMDP, demonstrating its potential as a practical tool for data-efficient modeling in healthcare.
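The effect of pseudo-counts on an EM update can be sketched as follows. This is a generic illustration, not the Fuzzy MAP EM algorithm itself: it assumes the expert's pseudo-counts are strictly positive and are simply added to the expected counts from the E-step before row normalisation, and every name and number in it is hypothetical.

```python
def map_m_step(expected_counts, pseudo_counts):
    """Row-normalise (expected counts + pseudo-counts) into transition probabilities.

    Assumes each row of pseudo_counts has a positive sum, so no row total is zero.
    """
    rows = []
    for exp_row, prior_row in zip(expected_counts, pseudo_counts):
        combined = [e + p for e, p in zip(exp_row, prior_row)]
        total = sum(combined)
        rows.append([c / total for c in combined])
    return rows


# Hypothetical E-step output: state 1 was never visited in the data.
expected = [[3.0, 1.0], [0.0, 0.0]]
# Hypothetical expert-derived pseudo-counts.
pseudo = [[1.0, 1.0], [1.0, 3.0]]

probs = map_m_step(expected, pseudo)
```

For the unvisited state the estimate falls back entirely on the expert's pseudo-counts, which is the sense in which prior knowledge guides learning when data are scarce.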
zh
[AI-9] Rate-Distortion Guided Knowledge Graph Construction from Lecture Notes Using Gromov-Wasserstein Optimal Transport
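[Quick read]: This paper addresses the problem of automatically converting unstructured educational materials (such as lecture notes and slides) into task-oriented knowledge graphs (KGs) that accurately capture key pedagogical content. The core challenge is keeping the graph compact and interpretable while preserving semantic integrity. The key to the solution is a framework that combines rate-distortion (RD) theory with optimal transport geometry: lecture content is modeled as a metric-measure space to retain semantic and relational structure; candidate KGs are aligned through Fused Gromov-Wasserstein (FGW) couplings to quantify semantic distortion; a rate term expressed via graph size measures complexity; and refinement operators (add, merge, split, remove, rewire) minimize the rate-distortion Lagrangian, yielding compact, information-preserving KGs. Experiments show that the method markedly improves the quality of multiple-choice questions (MCQs) generated from the KGs.
Link: https://arxiv.org/abs/2511.14595
Authors: Yuan An,Ruhma Hashmi,Michelle Rogers,Jane Greenberg,Brian K. Smith
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Accepted in the 5th Workshop on Knowledge Graphs and Big Data in Conjunction with IEEE Big Data 2025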
Abstract:Task-oriented knowledge graphs (KGs) enable AI-powered learning assistant systems to automatically generate high-quality multiple-choice questions (MCQs). Yet converting unstructured educational materials, such as lecture notes and slides, into KGs that capture key pedagogical content remains difficult. We propose a framework for knowledge graph construction and refinement grounded in rate-distortion (RD) theory and optimal transport geometry. In the framework, lecture content is modeled as a metric-measure space, capturing semantic and relational structure, while candidate KGs are aligned using Fused Gromov-Wasserstein (FGW) couplings to quantify semantic distortion. The rate term, expressed via the size of KG, reflects complexity and compactness. Refinement operators (add, merge, split, remove, rewire) minimize the rate-distortion Lagrangian, yielding compact, information-preserving KGs. Our prototype applied to data science lectures yields interpretable RD curves and shows that MCQs generated from refined KGs consistently surpass those from raw notes on fifteen quality criteria. This study establishes a principled foundation for information-theoretic KG optimization in personalized and AI-assisted education.
zh
[AI-10] Is Your VLM for Autonomous Driving Safety-Ready? A Comprehensive Benchmark for Evaluating External and In-Cabin Risks
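[Quick read]: This paper addresses the unclear suitability of current Vision-Language Models (VLMs) for autonomous driving, a safety-critical setting, and in particular the lack of a comprehensive benchmark that evaluates external environmental risks and in-cabin driving behavior safety together. The key to the solution is DSBench, the first comprehensive driving safety benchmark, which covers these two major areas through 10 key categories and 28 sub-categories. Evaluation reveals significant performance degradation of mainstream VLMs in complex safety-critical scenarios. The authors further build a dataset of 98K instances and show that fine-tuning on it significantly improves safety performance, providing a quantifiable and reproducible evaluation framework and an optimization path toward safer autonomous driving.
Link: https://arxiv.org/abs/2511.14592
Authors: Xianhui Meng,Yuchen Zhang,Zhijian Huang,Zheng Lu,Ziling Ji,Yaoyao Yin,Hongyuan Zhang,Guangfeng Jiang,Yandan Lin,Long Chen,Hangjun Ye,Li Zhang,Jun Liu,Xiaoshuai Hao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: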
Abstract:Vision-Language Models (VLMs) show great promise for autonomous driving, but their suitability for safety-critical scenarios is largely unexplored, raising safety concerns. This issue arises from the lack of comprehensive benchmarks that assess both external environmental risks and in-cabin driving behavior safety simultaneously. To bridge this critical gap, we introduce DSBench, the first comprehensive Driving Safety Benchmark designed to assess a VLM’s awareness of various safety risks in a unified manner. DSBench encompasses two major categories: external environmental risks and in-cabin driving behavior safety, divided into 10 key categories and a total of 28 sub-categories. This comprehensive evaluation covers a wide range of scenarios, ensuring a thorough assessment of VLMs’ performance in safety-critical contexts. Extensive evaluations across various mainstream open-source and closed-source VLMs reveal significant performance degradation under complex safety-critical situations, highlighting urgent safety concerns. To address this, we constructed a large dataset of 98K instances focused on in-cabin and external safety scenarios, showing that fine-tuning on this dataset significantly enhances the safety performance of existing VLMs and paves the way for advancing autonomous driving technology. The benchmark toolkit, code, and model checkpoints will be publicly accessible.
zh
[AI-11] Biased Minds Meet Biased AI: How Class Imbalance Shapes Appropriate Reliance and Interacts with Human Base Rate Neglect
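[Quick read]: This paper asks how AI bias (here, class imbalance) and human cognitive bias (here, base rate neglect) interact in human-AI collaborative decision-making, and how that interaction affects people's appropriate reliance on an AI-based decision-support system. The key contribution is an "interactionist" perspective: the empirical study finds that class imbalance disrupts people's calibration of reliance on the AI and that it is mutually reinforcing with base rate neglect, which is evidence of a compound human-AI bias and offers a theoretical basis and direction for designing more robust human-AI collaboration systems.
Link: https://arxiv.org/abs/2511.14591
Authors: Nick von Felten,Johannes Schöning,Klaus Opwis,Nicolas Scharowksi
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: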
Abstract:Humans increasingly interact with artificial intelligence (AI) in decision-making. However, both AI and humans are prone to biases. While AI and human biases have been studied extensively in isolation, this paper examines their complex interaction. Specifically, we examined how class imbalance as an AI bias affects people’s ability to appropriately rely on an AI-based decision-support system, and how it interacts with base rate neglect as a human bias. In a within-subject online study (N= 46), participants classified three diseases using an AI-based decision-support system trained on either a balanced or unbalanced dataset. We found that class imbalance disrupted participants’ calibration of AI reliance. Moreover, we observed mutually reinforcing effects between class imbalance and base rate neglect, offering evidence of a compound human-AI bias. Based on these findings, we advocate for an interactionist perspective and further research into the mutually reinforcing effects of biases in human-AI interaction.
zh
[AI-12] ReflexGrad: Three-Way Synergistic Architecture for Zero-Shot Generalization in LLM Agents
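[Quick read]: This paper addresses a core challenge in reinforcement learning and decision-making: enabling agents to learn from experience and generalize across tasks without task-specific training. Prior work has explored episodic memory, gradient-based prompt optimization, and hierarchical task decomposition separately, but their potential for synergistic integration has not been exploited. The key to the solution is the ReflexGrad architecture, which tightly couples three complementary mechanisms to achieve zero-shot generalization: (1) LLM-based hierarchical TODO decomposition for strategic planning; (2) history-aware causal reflection, which analyzes recent action patterns to identify the root causes of failure and supports learning within a single trial; and (3) gradient-based optimization for systematic improvement. The method needs no task-specific examples, fine-tuning, or hardcoded similarity metrics and relies purely on LLM semantic reasoning. On the ALFWorld benchmark it reaches a 67% success rate on the first attempt, and it shows stable convergence and cross-task transfer (from 67% to 78%).
Link: https://arxiv.org/abs/2511.14584
Authors: Ankush Kadu,Ashwanth Krishnan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: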
Abstract:Enabling agents to learn from experience and generalize across diverse tasks without task-specific training remains a fundamental challenge in reinforcement learning and decision-making. While recent approaches have explored episodic memory (Reflexion), gradient-based prompt optimization (TextGrad), and hierarchical task decomposition independently, their potential for synergistic integration remains unexplored. We introduce ReflexGrad, a novel architecture that tightly couples three complementary mechanisms: (1) LLM-based hierarchical TODO decomposition for strategic planning, (2) history-aware causal reflection that analyzes recent action patterns to identify failure root causes and enable within-trial learning, and (3) gradient-based optimization for systematic improvement. Unlike prior work relying on few-shot demonstrations, our system achieves true zero-shot generalization through pure LLM semantic reasoning, requiring no task-specific examples, fine-tuning, or hardcoded similarity metrics. Evaluated on ALFWorld benchmark tasks, ReflexGrad demonstrates a 67% zero-shot success rate on Trial 0 without any prior task experience or demonstrations, establishing effective performance on first exposure. Through empirical analysis, we identify the architectural mechanisms underlying stable convergence (zero action loops) and effective cross-task transfer (67% to 78% improvement). Our work demonstrates that synergistic integration of complementary learning mechanisms enables robust zero-shot generalization that approaches few-shot baselines from prior work.
zh
[AI-13] SweeperBot: Making 3D Browsing Accessible through View Analysis and Visual Question Answering
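[Quick read]: This paper addresses the difficulty that Blind and Low-Vision (BLV) users face in accessing and understanding 3D models through a Screen Reader (SR). Most current 3D viewers let creators supply alternative text, but the descriptions usually lack sufficient detail. The key to the solution is SweeperBot, a system that combines an optimal view selection technique with generative and recognition-based foundation models so that SR users can explore and compare 3D models through Visual Question Answering (VQA), giving a more efficient and accurate interaction.
Link: https://arxiv.org/abs/2511.14567
Authors: Chen Chen,Cuong Nguyen,Alexa Siu,Dingzeyu Li,Nadir Weibel
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 28 pages, 16 figures, this article has been accepted for publication in the International Journal of Human-Computer Interaction (IJHCI), published by Taylor and Francis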
Abstract:Accessing 3D models remains challenging for Screen Reader (SR) users. While some existing 3D viewers allow creators to provide alternative text, they often lack sufficient detail about the 3D models. Grounded on a formative study, this paper introduces SweeperBot, a system that enables SR users to leverage visual question answering to explore and compare 3D models. SweeperBot answers SR users’ visual questions by combining an optimal view selection technique with the strength of generative- and recognition-based foundation models. An expert review with 10 Blind and Low-Vision (BLV) users with SR experience demonstrated the feasibility of using SweeperBot to assist BLV users in exploring and comparing 3D models. The quality of the descriptions generated by SweeperBot was validated by a second survey study with 30 sighted participants.
zh
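[AI-14] Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language
[Quick read]: This paper addresses the tendency of robots that learn reward functions from limited demonstrations to overfit to spurious correlations and generalize poorly. The core challenge is that demonstrations only show how to perform a task, not which state features actually matter for its goal, so the model may attend to irrelevant details. The key to the solution is Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to infer state-relevance masks from natural language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning in the context of the demonstrations to clarify them. Combining the complementary information in demonstrations and language improves sample efficiency, generalization, and robustness to ambiguous language.
Link: https://arxiv.org/abs/2511.14565
Authors: Minyoung Hwang,Alexandra Forsey-Smerek,Nathaniel Dennler,Andreea Bobu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: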
Abstract:Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language. Project page: this https URL and Code: this https URL
zh
[AI-15] MissHDD: Hybrid Deterministic Diffusion for Hetrogeneous Incomplete Data Imputation
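[Quick read]: This paper addresses joint imputation of mixed-type features in real-world tabular data with missing values. When numerical, categorical, and discrete variables coexist, existing diffusion-based imputation models assume a homogeneous feature space, which makes conditional consistency hard to maintain and leads to information collapse for categorical variables or unstable updates for numerical variables. The key to the solution is a hybrid deterministic diffusion framework that separates heterogeneous features into two complementary generative channels: a continuous channel based on DDIM (Denoising Diffusion Implicit Models) that performs efficient and stable deterministic denoising for numerical variables, and a discrete latent-path diffusion channel, inspired by loopholing-based discrete diffusion, that models categorical and discrete variables without leaving their valid sample manifolds. The two channels are trained jointly under a unified conditional imputation objective, which enables coherent reconstruction of mixed-type missing data and improves imputation accuracy, sampling-trajectory stability, and robustness across MCAR, MAR, and MNAR settings.
Link: https://arxiv.org/abs/2511.14543
Authors: Youran Zhou,Mohamed Reda Bouadjenek,Sunil Aryal
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: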
Abstract:Incomplete data are common in real-world tabular applications, where numerical, categorical, and discrete attributes coexist within a single dataset. This heterogeneous structure presents significant challenges for existing diffusion-based imputation models, which typically assume a homogeneous feature space and rely on stochastic denoising trajectories. Such assumptions make it difficult to maintain conditional consistency, and they often lead to information collapse for categorical variables or instability when numerical variables require deterministic updates. These limitations indicate that a single diffusion process is insufficient for mixed-type tabular imputation. We propose a hybrid deterministic diffusion framework that separates heterogeneous features into two complementary generative channels. A continuous DDIM-based channel provides efficient and stable deterministic denoising for numerical variables, while a discrete latent-path diffusion channel, inspired by loopholing-based discrete diffusion, models categorical and discrete features without leaving their valid sample manifolds. The two channels are trained under a unified conditional imputation objective, enabling coherent reconstruction of mixed-type incomplete data. Extensive experiments on multiple real-world datasets show that the proposed framework achieves higher imputation accuracy, more stable sampling trajectories, and improved robustness across MCAR, MAR, and MNAR settings compared with existing diffusion-based and classical methods. These results demonstrate the importance of structure-aware diffusion processes for advancing deep learning approaches to incomplete tabular data.
zh
[AI-16] A Neuro-Symbolic Framework for Reasoning under Perceptual Uncertainty: Bridging Continuous Perception and Discrete Symbolic Planning
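[Quick read]: This paper addresses the difficulty of connecting continuous perceptual signals with discrete symbolic reasoning in AI systems, especially under uncertainty. The core challenge is modeling and propagating uncertainty as perception is turned into symbolic states, so that symbolic planning remains reliable. The key to the solution is a neuro-symbolic framework that couples a transformer-based perceptual front-end with graph neural network (GNN) relational reasoning to extract probabilistic symbolic states with calibrated confidences from visual observations, together with an uncertainty-aware symbolic planner that actively gathers information when confidence is low. On tabletop robotic manipulation tasks the method clearly outperforms the strongest POMDP baseline, with a 90.7% average success rate and planning within 15 ms, and a probabilistic graphical-model analysis establishes a theoretical link between calibrated uncertainty and planning convergence.
Link: https://arxiv.org/abs/2511.14533
Authors: Jiahao Wu,Shengwen Yu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 29 pages, 10 figures, 12 tables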
Abstract:Bridging continuous perceptual signals and discrete symbolic reasoning is a fundamental challenge in AI systems that must operate under uncertainty. We present a neuro-symbolic framework that explicitly models and propagates uncertainty from perception to planning, providing a principled connection between these two abstraction levels. Our approach couples a transformer-based perceptual front-end with graph neural network (GNN) relational reasoning to extract probabilistic symbolic states from visual observations, and an uncertainty-aware symbolic planner that actively gathers information when confidence is low. We demonstrate the framework’s effectiveness on tabletop robotic manipulation as a concrete application: the translator processes 10,047 PyBullet-generated scenes (3–10 objects) and outputs probabilistic predicates with calibrated confidences (overall F1=0.68). When embedded in the planner, the system achieves 94%/90%/88% success on Simple Stack, Deep Stack, and Clear+Stack benchmarks (90.7% average), exceeding the strongest POMDP baseline by 10–14 points while planning within 15 ms. A probabilistic graphical-model analysis establishes a quantitative link between calibrated uncertainty and planning convergence, providing theoretical guarantees that are validated empirically. The framework is general-purpose and can be applied to any domain requiring uncertainty-aware reasoning from perceptual input to symbolic planning.
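The "gather information when confidence is low" behaviour can be illustrated with a small decision rule over probabilistic predicates. The sketch is an assumption-laden toy, not the paper's planner: the predicate names, the 0.8 threshold, and the rule of sensing the predicate closest to 0.5 are all hypothetical.

```python
def choose_step(predicates, required, threshold=0.8):
    """Return ('act', None) if every required predicate is confident, else ('sense', name).

    predicates maps a predicate name to its estimated probability of being true.
    A predicate counts as confident when it is clearly true or clearly false.
    """
    uncertain = [
        p for p in required
        if max(predicates[p], 1.0 - predicates[p]) < threshold
    ]
    if not uncertain:
        return ("act", None)
    # Sense the most ambiguous predicate first (probability closest to 0.5).
    most_ambiguous = min(uncertain, key=lambda p: abs(predicates[p] - 0.5))
    return ("sense", most_ambiguous)


# Hypothetical perception output for a stacking scene.
predicates = {"on(a,b)": 0.95, "clear(a)": 0.55, "clear(c)": 0.70}

confident_case = choose_step(predicates, ["on(a,b)"])
uncertain_case = choose_step(predicates, ["on(a,b)", "clear(a)", "clear(c)"])
```

When only well-estimated predicates matter the rule proceeds to act; otherwise it requests another observation for the least certain predicate before planning continues.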
zh
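[AI-17] Towards Stable and Structured Time Series Generation with Perturbation-Aware Flow Matching
[Quick read]: This paper addresses temporal heterogeneity in time series generation caused by localized perturbations, which makes generated series structurally inconsistent and harms downstream analysis and decision-making. Existing flow matching methods use globally shared parameters and therefore struggle to capture abrupt transitions in perturbed time series. The key to the solution is Perturbation-Aware Flow Matching (PAFM), which uses perturbation-guided training to simulate localized disturbances and a dual-path velocity field to model trajectory deviations under perturbation, giving a finer characterization of perturbed behavior. It also introduces a mixture-of-experts decoder with flow routing that dynamically allocates modeling capacity to different trajectory dynamics, leading to more stable and structurally consistent time series generation.
Link: https://arxiv.org/abs/2511.14488
Authors: Jintao Zhang,Mingyue Cheng,Zirui Liu,Xianquan Wang,Yitong Zhou,Qi Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: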
Abstract:Time series generation is critical for a wide range of applications, which greatly supports downstream analytical and decision-making tasks. However, the inherent temporal heterogeneity induced by localized perturbations presents significant challenges for generating structurally consistent time series. While flow matching provides a promising paradigm by modeling temporal dynamics through trajectory-level supervision, it fails to adequately capture abrupt transitions in perturbed time series, as the use of globally shared parameters constrains the velocity field to a unified representation. To address these limitations, we introduce PAFM, a Perturbation-Aware Flow Matching framework that models perturbed trajectories to ensure stable and structurally consistent time series generation. The framework incorporates perturbation-guided training to simulate localized disturbances and leverages a dual-path velocity field to capture trajectory deviations under perturbation, enabling refined modeling of perturbed behavior to enhance the structural coherence. In order to further improve sensitivity to trajectory perturbations while enhancing expressiveness, a mixture-of-experts decoder with flow routing dynamically allocates modeling capacity in response to different trajectory dynamics. Extensive experiments on both unconditional and conditional generation tasks demonstrate that PAFM consistently outperforms strong baselines. Code is available at this https URL.
zh
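[AI-18] Agentic AI Systems in Electrical Power Systems Engineering: Current State-of-the-Art and Challenges
[Quick read]: This paper addresses the lack of a clear conceptual definition and taxonomy for agentic AI systems, which is needed to distinguish them from traditional AI agents and contemporary generative AI models. The key to the solution is a precise definitional framework and systematic taxonomy, illustrated through four state-of-the-art use cases in electrical engineering. These range from an advanced agentic framework for complex power system studies and benchmarking to a novel system for survival analysis of dynamic pricing strategies in battery swapping stations. The paper also investigates failure modes in deployment and derives actionable design and implementation recommendations for safe, reliable, and accountable agentic AI systems.
Link: https://arxiv.org/abs/2511.14478
Authors: Soham Ghosh,Gaurav Mittal
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: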
Abstract:Agentic AI systems have recently emerged as a critical and transformative approach in artificial intelligence, offering capabilities that extend far beyond traditional AI agents and contemporary generative AI models. This rapid evolution necessitates a clear conceptual and taxonomical understanding to differentiate this new paradigm. Our paper addresses this gap by providing a comprehensive review that establishes a precise definition and taxonomy for “agentic AI,” with the aim of distinguishing it from previous AI paradigms. The concepts are gradually introduced, starting with a highlight of its diverse applications across the broader field of engineering. The paper then presents four detailed, state-of-the-art use case applications specifically within electrical engineering. These case studies demonstrate practical impact, ranging from an advanced agentic framework for streamlining complex power system studies and benchmarking to a novel system developed for survival analysis of dynamic pricing strategies in battery swapping stations. Finally, to ensure robust deployment, the paper provides detailed failure mode investigations. From these findings, we derive actionable recommendations for the design and implementation of safe, reliable, and accountable agentic AI systems, offering a critical resource for researchers and practitioners.
zh
[AI-19] Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety Inclusivity and Model Behavior
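[Quick read]: This paper asks how the alignment of Large Language Models (LLMs) can balance value differences across social groups so that model behavior is both safe and fair. Current alignment methods based on human feedback often overlook human social diversity, so model behavior may lean toward the value preferences of particular groups. The key to the solution is to introduce pluralistic value data systematically and to examine key design parameters of the alignment pipeline, including rating scales, disagreement handling, and optimization algorithms. The study finds that fine-tuning on group-specific preferences produces distinct model behaviors; that preserving rater disagreement reduces toxicity roughly 53% more than majority voting; that 5-point scales yield about 22% more reduction than binary formats; and that Direct Preference Optimization (DPO) outperforms Group Relative Policy Optimization (GRPO) in multi-value optimization. These results provide empirical evidence on how to combine expert-driven and user-driven signals for fair and safe alignment.
Link: https://arxiv.org/abs/2511.14476
Authors: Dalia Ali,Dora Zhao,Allison Koenecke,Orestis Papakyriakopoulos
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: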
Abstract:Although large language models (LLMs) are increasingly trained using human feedback for safety and alignment with human values, alignment decisions often overlook human social diversity. This study examines how incorporating pluralistic values affects LLM behavior by systematically evaluating demographic variation and design parameters in the alignment pipeline. We collected alignment data from US and German participants (N = 1,095, 27,375 ratings) who rated LLM responses across five dimensions: Toxicity, Emotional Awareness (EA), Sensitivity, Stereotypical Bias, and Helpfulness. We fine-tuned multiple Large Language Models and Large Reasoning Models using preferences from different social groups while varying rating scales, disagreement handling methods, and optimization techniques. The results revealed systematic demographic effects: male participants rated responses 18% less toxic than female participants; conservative and Black participants rated responses 27.9% and 44% more emotionally aware than liberal and White participants, respectively. Models fine-tuned on group-specific preferences exhibited distinct behaviors. Technical design choices showed strong effects: the preservation of rater disagreement achieved roughly 53% greater toxicity reduction than majority voting, and 5-point scales yielded about 22% more reduction than binary formats; and Direct Preference Optimization (DPO) consistently outperformed Group Relative Policy Optimization (GRPO) in multi-value optimization. These findings represent a preliminary step in answering a critical question: How should alignment balance expert-driven and user-driven signals to ensure both safety and fair representation?
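For readers unfamiliar with DPO, the per-pair objective it optimises can be written in a few lines. The sketch below shows the standard DPO loss for a single preference pair, given sequence log-probabilities under the policy and a frozen reference model. It is not taken from this paper's training code, and the value `beta=0.1` and the example log-probabilities are placeholders.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Computes -log(sigmoid(margin)), where margin is beta times the difference
    between the policy-vs-reference log-ratios of the chosen and rejected responses.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0 and the loss is log 2.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's, which is how preference data shape the fine-tuned model.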
zh
[AI-20] nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers NEURIPS2025
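[Quick read]: This paper addresses inconsistency in the tooling for mechanistic interpretability research, where existing approaches trade off flexibility across models against numerical fidelity. Custom implementations such as TransformerLens offer a consistent interface but require manual adaptation for each architecture and can introduce numerical mismatch, while direct HuggingFace access through NNsight preserves the original model's exact behavior but lacks standardization across models. The key to the solution is nnterp, a lightweight wrapper around NNsight that provides a unified interface. Through automatic module renaming and comprehensive validation tests, researchers can write intervention code once and deploy it across 50+ model variants spanning 16 architecture families while keeping the original HuggingFace implementations. The library also ships built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and access to attention probabilities, improving usability without sacrificing correctness.
Link: https://arxiv.org/abs/2511.14465
Authors: Clément Dumas
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure, accepted at the mechanistic interpretability workshop of NeurIPS 2025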
Abstract:Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. By packaging validation tests with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.
[AI-21] Effective Diversification of Multi-Carousel Book Recommendation
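[Quick Read]: This paper addresses the lack of diversity in current book recommendation systems based on collaborative filtering: existing mechanisms deliver accurate personalized recommendations but do little to diversify results, which hurts long-term user engagement. The key to the solution is to layer several diversity-enhancing strategies on top of a collaborative filtering algorithm, and to design dedicated metrics that measure the balance between accuracy and diversity, so that content coverage and users' willingness to explore improve while recommendation quality is maintained.
Link: https://arxiv.org/abs/2511.14461
Authors: Daniël Wilten,Gideon Maillette de Buy Wenniger,Arjen Hommersom,Paul Lucassen,Emiel Poortman
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted as a conference paper at BNAIC/BeNeLearn 2025; The 37th Benelux Conference on Artificial Intelligence and the 34th Belgian Dutch Conference on Machine Learning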
Abstract:Using multiple carousels, lists that wrap around and can be scrolled, is the basis for offering content in most contemporary movie streaming platforms. Carousels allow for highlighting different aspects of users’ taste that fall into categories such as genres and authors. However, while carousels offer structure and greater ease of navigation, they alone do not increase diversity in recommendations, even though diversity is essential to keep users engaged. In this work we propose several approaches to effectively increase item diversity within the domain of book recommendations, on top of a collaborative filtering algorithm. These approaches are intended to improve book recommendations in the web catalogs of public libraries. Furthermore, we introduce metrics to evaluate the resulting strategies, and show that the proposed system finds a suitable balance between accuracy and beyond-accuracy aspects.
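One generic way to add diversity on top of collaborative filtering scores is greedy re-ranking with a penalty for categories that are already represented. The sketch below illustrates that general idea only. It is not the authors' method, and the function name, the `trade_off` parameter, and the toy catalog are all assumptions.

```python
def rerank_diverse(candidates, k, trade_off=0.5):
    """Greedily pick k items, trading relevance against category repetition.

    candidates: list of (item, relevance, category) tuples, where relevance
    would come from a collaborative filtering model.
    trade_off: 0 means pure relevance; larger values penalize repeats more.
    """
    selected = []
    remaining = list(candidates)
    seen = {}  # category -> number of items already selected from it
    while remaining and len(selected) < k:
        def score(candidate):
            _, relevance, category = candidate
            return (1 - trade_off) * relevance - trade_off * seen.get(category, 0)
        best = max(remaining, key=score)
        selected.append(best[0])
        seen[best[2]] = seen.get(best[2], 0) + 1
        remaining.remove(best)
    return selected

# Hypothetical carousel candidates with collaborative filtering scores.
books = [("A", 0.90, "crime"), ("B", 0.85, "crime"),
         ("C", 0.60, "poetry"), ("D", 0.50, "history")]
by_relevance = rerank_diverse(books, k=3, trade_off=0.0)
diversified = rerank_diverse(books, k=3, trade_off=0.5)
```

With the penalty switched on, the second crime title is displaced by books from other categories, at the cost of some predicted relevance. This is the accuracy versus beyond-accuracy balance that the abstract refers to.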
[AI-22] Analyzing the Impact of Participant Failures in Cross-Silo Federated Learning
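[Quick Read]: This paper addresses the reliability question of how participant failures affect model quality in cross-silo federated learning (FL). In cross-silo FL, the participants (such as companies or institutions) are few and their data distributions are skewed, so a node failure (for example, a communication outage or a misconfiguration) can significantly affect the final model. Existing work focuses mostly on the cross-device setting and lacks a systematic analysis of failures in the cross-silo setting. The key contribution is an empirical study that identifies and quantifies two core factors: the timing of a failure, which affects the quality of the trained model, and the degree of data skew, which distorts evaluation, since under high skew the evaluation is overly optimistic and hides the real performance drop. These findings inform the design of robust cross-silo FL systems.
Link: https://arxiv.org/abs/2511.14456
Authors: Fabian Stricker,David Bermbach,Christian Zirpins
Institutions: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in 3rd IEEE International Conference on Federated Learning Applications and Technologies (FLTA2025)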
Abstract:Federated learning (FL) is a new paradigm for training machine learning (ML) models without sharing data. While applying FL in cross-silo scenarios, where organizations collaborate, it is necessary that the FL system is reliable; however, participants can fail due to various reasons (e.g., communication issues or misconfigurations). In order to provide a reliable system, it is necessary to analyze the impact of participant failures. While this problem received attention in cross-device FL where mobile devices with limited resources participate, there is comparatively little research in cross-silo FL. Therefore, we conduct an extensive study for analyzing the impact of participant failures on the model quality in the context of inter-organizational cross-silo FL with few participants. In our study, we focus on analyzing generally influential factors such as the impact of the timing and the data as well as the impact on the evaluation, which is important for deciding whether the model should be deployed. We show that under high skews the evaluation is optimistic and hides the real impact. Furthermore, we demonstrate that the timing impacts the quality of the trained model. Our results offer insights for researchers and software architects aiming to build robust FL systems.
[AI-23] Hybrid Modeling of Photoplethysmography for Non-invasive Monitoring of Cardiovascular Parameters
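[Quick Read]: This paper tackles the problem of accurately predicting key cardiac biomarkers (such as stroke volume and cardiac output) from non-invasive photoplethysmography (PPG) signals, a problem made harder by the scarcity of clinically annotated PPG data. The key to the solution is a hybrid modeling approach that combines a conditional variational autoencoder trained on paired PPG and arterial pressure waveform (APW) data with a conditional density estimator of cardiac biomarkers trained on labeled simulated APW segments, so that hemodynamic simulations and unlabeled clinical data can be used to estimate cardiovascular biomarkers directly. Experiments show that the approach captures temporal fluctuations in cardiac output and stroke volume and outperforms a supervised baseline.
Link: https://arxiv.org/abs/2511.14452
Authors: Emanuele Palumbo,Sorawit Saengkyongam,Maria R. Cervera,Jens Behrmann,Andrew C. Miller,Guillermo Sapiro,Christina Heinze-Deml,Antoine Wehenkel
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: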
Abstract:Continuous cardiovascular monitoring can play a key role in precision health. However, some fundamental cardiac biomarkers of interest, including stroke volume and cardiac output, require invasive measurements, e.g., arterial pressure waveforms (APW). As a non-invasive alternative, photoplethysmography (PPG) measurements are routinely collected in hospital settings. Unfortunately, the prediction of key cardiac biomarkers from PPG instead of APW remains an open challenge, further complicated by the scarcity of annotated PPG measurements. As a solution, we propose a hybrid approach that uses hemodynamic simulations and unlabeled clinical data to estimate cardiovascular biomarkers directly from PPG signals. Our hybrid model combines a conditional variational autoencoder trained on paired PPG-APW data with a conditional density estimator of cardiac biomarkers trained on labeled simulated APW segments. As a key result, our experiments demonstrate that the proposed approach can detect fluctuations of cardiac output and stroke volume and outperform a supervised baseline in monitoring temporal changes in these biomarkers.
[AI-24] Watchdogs and Oracles: Runtime Verification Meets Large Language Models for Autonomous Systems
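[Quick Read]: This paper addresses the difficulty of assuring the safety and trustworthiness of autonomous systems that involve learning-enabled components and open environments. Traditional formal methods give strong guarantees but depend on complete models and static assumptions. Runtime verification (RV) monitors executions in real time and can anticipate potential violations, but is limited in specification capture and in handling uncertainty. The key to the solution is a symbiotic integration of RV and large language models (LLMs): RV serves as a guardrail for LLM-driven autonomy to keep operation safe, while LLMs extend RV by assisting specification capture, supporting anticipatory reasoning, and helping to handle uncertainty, enabling dependable control of autonomous systems in complex, dynamic environments.
Link: https://arxiv.org/abs/2511.14435
Authors: Angelo Ferrando(University of Modena and Reggio Emilia)
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings FMAS 2025, arXiv:2511.13245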
Abstract:Assuring the safety and trustworthiness of autonomous systems is particularly difficult when learning-enabled components and open environments are involved. Formal methods provide strong guarantees but depend on complete models and static assumptions. Runtime verification (RV) complements them by monitoring executions at run time and, in its predictive variants, by anticipating potential violations. Large language models (LLMs), meanwhile, excel at translating natural language into formal artefacts and recognising patterns in data, yet they remain error-prone and lack formal guarantees. This vision paper argues for a symbiotic integration of RV and LLMs. RV can serve as a guardrail for LLM-driven autonomy, while LLMs can extend RV by assisting specification capture, supporting anticipatory reasoning, and helping to handle uncertainty. We outline how this mutual reinforcement differs from existing surveys and roadmaps, discuss challenges and certification implications, and identify future research directions towards dependable autonomy.
[AI-25] Context-aware Ante-hoc Explanations of Driving Behaviour
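[Quick Read]: This paper addresses the "black box" problem in the decision-making of autonomous vehicles (AVs): AI-based driving functions are hard for users to understand because their decision mechanisms are opaque, which undermines safety and trust. To improve explainability, the paper proposes an approach based on explanation models built at design time. Its key idea is to use the formal language Traffic Sequence Charts to define explanation contexts and the corresponding expectable or unexpectable driving manoeuvres, and to combine this with a runtime monitoring mechanism that recognises contexts and presents ante-hoc explanations dynamically, so that correct and fit-for-purpose explanations are available at runtime and users can better understand and trust the system's behaviour.
Link: https://arxiv.org/abs/2511.14428
Authors: Dominik Grundt(German Aerospace Center e.V.),Ishan Saxena(German Aerospace Center e.V.),Malte Petersen(German Aerospace Center e.V.),Bernd Westphal(German Aerospace Center e.V.),Eike Möhlmann(German Aerospace Center e.V.)
Institutions: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: In Proceedings FMAS 2025, arXiv:2511.13245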
Abstract:Autonomous vehicles (AVs) must be both safe and trustworthy to gain social acceptance and become a viable option for everyday public transportation. Explanations about the system behaviour can increase safety and trust in AVs. Unfortunately, explaining the system behaviour of AI-based driving functions is particularly challenging, as decision-making processes are often opaque. The field of Explainability Engineering tackles this challenge by developing explanation models at design time. These models are designed from system design artefacts and stakeholder needs to develop correct and good explanations. To support this field, we propose an approach that enables context-aware, ante-hoc explanations of (un)expectable driving manoeuvres at runtime. The visual yet formal language Traffic Sequence Charts is used to formalise explanation contexts, as well as corresponding (un)expectable driving manoeuvres. A dedicated runtime monitoring enables context-recognition and ante-hoc presentation of explanations at runtime. In combination, we aim to support the bridging of correct and good explanations. Our method is demonstrated in a simulated overtaking.
[AI-26] MiAD: Mirage Atom Diffusion for De Novo Crystal Generation
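[Quick Read]: This paper addresses the inability of diffusion models to change the number of atoms in the unit cell while generating crystalline materials, a restriction that markedly reduces the diversity of sampling trajectories and the generative capacity. The key to the solution is a simple yet powerful technique called mirage infusion, which lets a diffusion model switch atoms between an "existent" and a "non-existent" (mirage) state during generation, and thus adjust the atom count dynamically. The resulting Mirage Atom Diffusion (MiAD) model is an equivariant joint diffusion model that achieves an 8.2% S.U.N. (stable, unique, novel) rate on the MP-20 dataset, substantially exceeding existing state-of-the-art approaches.
Link: https://arxiv.org/abs/2511.14426
Authors: Andrey Okhotin,Maksim Nakhodnov,Nikita Kazeev,Andrey E Ustyuzhanin,Dmitry Vetrov
Institutions: Unknown
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: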
Abstract:In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models don’t have the ability to change the number of atoms in the crystal during the generation process, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by up to 2.5× compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an 8.2% S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. The source code can be found at this https URL.
[AI-27] Sigil: Server-Enforced Watermarking in U-Shaped Split Federated Learning via Gradient Injection
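[Quick Read]: This paper addresses intellectual property protection for capability-limited servers in decentralized machine learning paradigms such as Split Federated Learning (SFL). In these settings the server cannot access critical elements such as model parameters or labels, so traditional server-side watermarking is infeasible, while watermarking schemes that rely on client cooperation are unreliable in adversarial settings. The key to the solution is the Sigil framework, which defines the watermark as a statistical constraint on the server-visible activation space and embeds it via gradient injection without any knowledge of the data. An adaptive gradient clipping mechanism keeps the watermarking mandatory and stealthy, countering existing gradient anomaly detection methods and a specifically designed adaptive subspace removal attack.
Link: https://arxiv.org/abs/2511.14422
Authors: Zhengchunmin Dai,Jiaxiong Tang,Peng Sun,Honglong Chen,Liantao Wu
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages,8 figures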
Abstract:In decentralized machine learning paradigms such as Split Federated Learning (SFL) and its variant U-shaped SFL, the server’s capabilities are severely restricted. Although this enhances client-side privacy, it also leaves the server highly vulnerable to model theft by malicious clients. Ensuring intellectual property protection for such capability-limited servers presents a dual challenge: watermarking schemes that depend on client cooperation are unreliable in adversarial settings, whereas traditional server-side watermarking schemes are technically infeasible because the server lacks access to critical elements such as model parameters or labels. To address this challenge, this paper proposes Sigil, a mandatory watermarking framework designed specifically for capability-limited servers. Sigil defines the watermark as a statistical constraint on the server-visible activation space and embeds the watermark into the client model via gradient injection, without requiring any knowledge of the data. Besides, we design an adaptive gradient clipping mechanism to ensure that our watermarking process remains both mandatory and stealthy, effectively countering existing gradient anomaly detection methods and a specifically designed adaptive subspace removal attack. Extensive experiments on multiple datasets and models demonstrate Sigil’s fidelity, robustness, and stealthiness.
[AI-28] When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling
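[Quick Read]: This paper asks whether the apparent success of large language models (LLMs) at automatically generating constraint programming (CP) models reflects a real understanding of the problems, or merely the fact that many standard CP problems appear in the training data (data contamination). To test this hypothesis, the authors systematically rephrase and perturb classic CSPLib problem descriptions, changing the context and introducing misleading elements while preserving the problem structure. The key step is to compare the quality of the models that LLMs generate from the original and the modified descriptions. Although LLMs produce syntactically valid and semantically plausible models, their performance drops sharply when the context and wording change, which reveals high sensitivity to phrasing and an understanding that remains shallow rather than genuine reasoning.
Link: https://arxiv.org/abs/2511.14334
Authors: Alessio Pellegrino,Jacopo Mauro
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: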
Abstract:One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems to preserve their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.
[AI-29] H-LDM: Hierarchical Latent Diffusion Models for Controllable and Interpretable PCG Synthesis from Clinical Metadata
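[Quick Read]: This paper addresses the limited performance of AI systems for phonocardiogram (PCG) analysis caused by the scarcity of labeled pathological data. The key to the solution is a Hierarchical Latent Diffusion Model (H-LDM) with three core innovations: (1) a multi-scale VAE that builds a physiologically disentangled latent space separating rhythm, heart sounds, and murmurs; (2) a hierarchical text-to-biosignal generation pipeline driven by structured clinical metadata that supports fine-grained control over 17 distinct cardiac conditions; and (3) a Medical Attention module that guides an interpretable diffusion process and improves the clinical plausibility of the generated signals. On the PhysioNet CirCor dataset the method reaches a Fréchet Audio Distance of 9.7 and a clinical validity score of 87.1%, and augmenting with the synthetic data improves rare disease classification accuracy by 11.3%, offering a new approach to data augmentation in cardiac diagnostics.
Link: https://arxiv.org/abs/2511.14312
Authors: Chenyang Xu,Siming Li,Hao Wang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper was accepted by IEEE BIBM 2025 conference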
Abstract:Phonocardiogram (PCG) analysis is vital for cardiovascular disease diagnosis, yet the scarcity of labeled pathological data hinders the capability of AI systems. To bridge this, we introduce H-LDM, a Hierarchical Latent Diffusion Model for generating clinically accurate and controllable PCG signals from structured metadata. Our approach features: (1) a multi-scale VAE that learns a physiologically-disentangled latent space, separating rhythm, heart sounds, and murmurs; (2) a hierarchical text-to-biosignal pipeline that leverages rich clinical metadata for fine-grained control over 17 distinct conditions; and (3) an interpretable diffusion process guided by a novel Medical Attention module. Experiments on the PhysioNet CirCor dataset demonstrate state-of-the-art performance, achieving a Fréchet Audio Distance of 9.7, a 92% attribute disentanglement score, and 87.1% clinical validity confirmed by cardiologists. Augmenting diagnostic models with our synthetic data improves the accuracy of rare disease classification by 11.3%. H-LDM establishes a new direction for data augmentation in cardiac diagnostics, bridging data scarcity with interpretable clinical insights.
[AI-30] Weight Variance Amplifier Improves Accuracy in High-Sparsity One-Shot Pruning
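[Quick Read]: This paper addresses the significant accuracy drop that deep neural networks suffer after aggressive pruning, especially models trained with standard objectives. Existing methods such as SAM and CrAM improve pruning robustness by guiding the model toward flat regions of the parameter space, but they add computational overhead. The paper proposes a Variance Amplifying Regularizer (VAR) whose core idea is to deliberately increase the variance of the model parameters during training, based on the empirical finding that high-variance parameters are more robust to pruning. By promoting such variance in the weight distribution, VAR markedly improves tolerance to pruning without additional training or complex optimization strategies, and the paper backs this with a theoretical convergence analysis and extensive empirical results.
Link: https://arxiv.org/abs/2511.14282
Authors: Vincent-Daniel Yun,Junhyuk Jo,Sunwoo Lee
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: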
Abstract:Deep neural networks achieve outstanding performance in visual recognition tasks, yet their large number of parameters makes them less practical for real-world applications. Recently, one-shot pruning has emerged as an effective strategy for reducing model size without additional training. However, models trained with standard objective functions often suffer a significant drop in accuracy after aggressive pruning. Some existing pruning-robust optimizers, such as SAM and CrAM, mitigate this accuracy drop by guiding the model toward flatter regions of the parameter space, but they inevitably incur non-negligible additional computations. We propose a Variance Amplifying Regularizer (VAR) that deliberately increases the variance of model parameters during training. Our study reveals an intriguing finding that parameters with higher variance exhibit greater pruning robustness. VAR exploits this property by promoting such variance in the weight distribution, thereby mitigating the adverse effects of pruning. We further provide a theoretical analysis of its convergence behavior, supported by extensive empirical results demonstrating the superior pruning robustness of VAR.
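To make the two ingredients above concrete, here is a minimal sketch of a variance-rewarding penalty and of one-shot magnitude pruning. The abstract does not give the exact form of VAR, so the subtraction of a scaled weight variance from the task loss is an assumption, as are the function names and the `strength` parameter.

```python
def weight_variance(weights):
    """Population variance of a flat list of weights."""
    mean = sum(weights) / len(weights)
    return sum((w - mean) ** 2 for w in weights) / len(weights)

def regularized_loss(task_loss, weights, strength=0.01):
    """Assumed form of a variance-amplifying objective.

    Subtracting the variance means that minimizing the loss pushes the
    weight distribution to spread out.
    """
    return task_loss - strength * weight_variance(weights)

def magnitude_prune(weights, sparsity):
    """One-shot pruning: zero the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned

# Hypothetical layer with two large and two near-zero weights.
layer = [0.1, -2.0, 0.05, 1.5]
pruned_layer = magnitude_prune(layer, sparsity=0.5)
```

The intuition this sketch supports is that a spread-out weight distribution concentrates magnitude in few weights, so removing the small ones changes the layer's output less. The paper's own analysis should be consulted for the actual argument.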
[AI-31] Comparing Task-Agnostic Embedding Models for Tabular Data
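[Quick Read]: This paper addresses the neglect of transferable, task-agnostic representation learning in current tabular foundation models, which are optimized for task-specific performance. These models typically encapsulate representation learning and task-specific inference in a single resource-intensive network, which is inefficient and limits generality. The key to the approach is to isolate the representation learning step and systematically evaluate task-agnostic embeddings from tabular foundation models such as TabPFN and TabICL against the classical feature engineering method TableVectorizer. Experiments show that TableVectorizer features perform comparably or better on tasks including outlier detection (ADBench) and supervised learning (TabArena Lite) while being up to three orders of magnitude faster than tabular foundation models, underscoring the value of efficient, transferable feature representations.
Link: https://arxiv.org/abs/2511.14276
Authors: Frederik Hoppe,Lars Kleinemeier,Astrid Franz,Udo Göbel
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AI for Tabular Data (EurIPS 2025 Workshop)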
Abstract:Recent foundation models for tabular data achieve strong task-specific performance via in-context learning. Nevertheless, they focus on direct prediction by encapsulating both representation learning and task-specific inference inside a single, resource-intensive network. This work specifically focuses on representation learning, i.e., on transferable, task-agnostic embeddings. We systematically evaluate task-agnostic representations from tabular foundation models (TabPFN and TabICL) alongside classical feature engineering (TableVectorizer) across a variety of application tasks, such as outlier detection (ADBench) and supervised learning (TabArena Lite). We find that simple TableVectorizer features achieve comparable or superior performance while being up to three orders of magnitude faster than tabular foundation models. The code is available at this https URL.
[AI-32] Object-Centric World Models for Causality-Aware Reinforcement Learning AAAI-26
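[Quick Read]: This paper addresses the difficulty that current world models have in accurately modeling environments that are high-dimensional, non-stationary, and composed of multiple objects with complex interactions, since existing methods usually learn holistic representations of the environment rather than object-level decompositions. The key to the solution is the STICA framework, which uses object-centric Transformers as the world model together with causality-aware policy and value networks. First, each observation is represented as a set of object-centric tokens (plus tokens for the agent action and the reward), so that the world model can predict token-level dynamics and interactions. Second, the policy and value networks explicitly model token-level causal relations through the attention mechanism, yielding causality-guided decision-making and markedly better sample efficiency and final performance.
Link: https://arxiv.org/abs/2511.14262
Authors: Yosuke Nishimoto,Takashi Matsubara
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI-26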
Abstract:World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose Slot Transformer Imagination with CAusality-aware reinforcement learning (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause–effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.
[AI-33] PathMind: A Retrieve-Prioritize-Reason Framework for Knowledge Graph Reasoning with Large Language Models AAAI2026
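[Quick Read]: This paper addresses two key problems in current knowledge graph reasoning (KGR) methods based on large language models (LLMs): existing methods extract reasoning paths without assessing their importance, which introduces irrelevant noise that misleads the LLM; and most methods rely on frequent LLM calls for dynamic path exploration, which incurs high retrieval overhead and computational cost. The key to the solution is the PathMind framework, built around a three-stage "Retrieve-Prioritize-Reason" paradigm. It first retrieves a query subgraph from the knowledge graph, then identifies important reasoning paths with a semantic-aware path priority function that considers both the accumulated cost and the estimated future cost, and finally generates accurate and logically consistent answers through a dual-phase training strategy (task-specific instruction tuning and path-wise preference alignment). The result is more faithful and interpretable reasoning with significantly fewer input tokens.
Link: https://arxiv.org/abs/2511.14256
Authors: Yu Liu,Xixun Lin,Yanmin Shang,Yangxi Li,Shi Wang,Yanan Cao
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: AAAI 2026, Long Paper, Oral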
Abstract:Knowledge graph reasoning (KGR) is the task of inferring new knowledge by performing logical deductions on knowledge graphs. Recently, large language models (LLMs) have demonstrated remarkable performance in complex reasoning tasks. Despite promising success, current LLM-based KGR methods still face two critical limitations. First, existing methods often extract reasoning paths indiscriminately, without assessing their different importance, which may introduce irrelevant noise that misleads LLMs. Second, while many methods leverage LLMs to dynamically explore potential reasoning paths, they require high retrieval demands and frequent LLM calls. To address these limitations, we propose PathMind, a novel framework designed to enhance faithful and interpretable reasoning by selectively guiding LLMs with important reasoning paths. Specifically, PathMind follows a “Retrieve-Prioritize-Reason” paradigm. First, it retrieves a query subgraph from KG through the retrieval module. Next, it introduces a path prioritization mechanism that identifies important reasoning paths using a semantic-aware path priority function, which simultaneously considers the accumulative cost and the estimated future cost for reaching the target. Finally, PathMind generates accurate and logically consistent responses via a dual-phase training strategy, including task-specific instruction tuning and path-wise preference alignment. Extensive experiments on benchmark datasets demonstrate that PathMind consistently outperforms competitive baselines, particularly on complex reasoning tasks with fewer input tokens, by identifying essential reasoning paths.
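A priority that adds the accumulated cost to an estimate of the remaining cost has the same shape as best-first (A*-style) search. The sketch below illustrates that shape on a toy graph. It is not PathMind's learned, semantic-aware priority function, and the graph, the costs, and the `future_cost` estimates are hypothetical.

```python
import heapq

def prioritized_paths(graph, start, target, future_cost, max_paths=2):
    """Return up to max_paths (cost, path) pairs in priority order.

    graph: {node: [(relation, neighbor, step_cost), ...]}
    future_cost: function giving an estimated remaining cost for a node.
    Priority of a partial path = accumulated cost + estimated future cost.
    """
    frontier = [(future_cost(start), 0.0, [start])]
    found = []
    while frontier and len(found) < max_paths:
        _, cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for _relation, neighbor, step_cost in graph.get(node, []):
            if neighbor in path:  # skip cycles
                continue
            new_cost = cost + step_cost
            priority = new_cost + future_cost(neighbor)
            heapq.heappush(frontier, (priority, new_cost, path + [neighbor]))
    return found

# Hypothetical query subgraph: question entity Q, answer entity T.
graph = {
    "Q": [("r1", "A", 1.0), ("r2", "B", 4.0)],
    "A": [("r3", "T", 1.0)],
    "B": [("r4", "T", 1.0)],
}
estimates = {"Q": 2.0, "A": 1.0, "B": 1.0, "T": 0.0}
paths = prioritized_paths(graph, "Q", "T", lambda n: estimates[n])
```

Only the highest-priority paths would then be passed to the LLM, which is how a prioritization step can cut both noise and input tokens.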
[AI-34] Enhancing Regional Airbnb Trend Forecasting Using LLM-Based Embeddings of Accessibility and Human Mobility
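[Quick Read]: This paper addresses the impact that the expansion of short-term rental platforms such as Airbnb has on local housing markets, in particular rising rents and declining housing affordability. To support interventions by policymakers and urban planners, it proposes a novel time-series forecasting framework that predicts three key region-level indicators: Revenue, Reservation Days, and Number of Reservations. The key to the solution is to build regional representations that fuse listing features with external contextual factors (such as urban accessibility and human mobility), convert the structured tabular data into prompt-based inputs from which a Large Language Model (LLM) produces regional embeddings, and feed those embeddings into time-series models such as RNN, LSTM, or Transformer to better capture complex spatio-temporal dynamics. On a Seoul Airbnb dataset, the method reduces average RMSE and MAE by about 48% relative to traditional statistical and machine learning baselines, and it offers practical insight for identifying oversupplied regions and supporting data-driven urban policy.
Link: https://arxiv.org/abs/2511.14248
Authors: Hongju Lee,Youngjun Park,Jisun An,Dongman Lee
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ASONAM 2025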
Abstract:The expansion of short-term rental platforms, such as Airbnb, has significantly disrupted local housing markets, often leading to increased rental prices and housing affordability issues. Accurately forecasting regional Airbnb market trends can thus offer critical insights for policymakers and urban planners aiming to mitigate these impacts. This study proposes a novel time-series forecasting framework to predict three key Airbnb indicators – Revenue, Reservation Days, and Number of Reservations – at the regional level. Using a sliding-window approach, the model forecasts trends 1 to 3 months ahead. Unlike prior studies that focus on individual listings at fixed time points, our approach constructs regional representations by integrating listing features with external contextual factors such as urban accessibility and human mobility. We convert structured tabular data into prompt-based inputs for a Large Language Model (LLM), producing comprehensive regional embeddings. These embeddings are then fed into advanced time-series models (RNN, LSTM, Transformer) to better capture complex spatio-temporal dynamics. Experiments on Seoul’s Airbnb dataset show that our method reduces both average RMSE and MAE by approximately 48% compared to conventional baselines, including traditional statistical and machine learning models. Our framework not only improves forecasting accuracy but also offers practical insights for detecting oversupplied regions and supporting data-driven urban policy decisions.
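The sliding-window setup described above can be sketched as follows. Each training sample pairs a fixed-length history with the value a given number of months ahead. The window length, the helper name `sliding_windows`, and the revenue figures are assumptions chosen for illustration. In the paper's setting each time step would carry an LLM-derived regional embedding instead of a single number.

```python
def sliding_windows(series, window, horizon):
    """Build (inputs, target) pairs for forecasting `horizon` steps ahead.

    series: monthly values for one region, oldest first.
    window: number of past months fed to the model.
    horizon: how many months ahead the target lies (1 to 3 in the abstract).
    """
    samples = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        samples.append((inputs, target))
    return samples

# Hypothetical monthly revenue index for one region.
revenue = [10, 12, 11, 15, 14, 18]
one_month_ahead = sliding_windows(revenue, window=3, horizon=1)
three_months_ahead = sliding_windows(revenue, window=3, horizon=3)
```

Longer horizons leave fewer usable samples from the same series, which is one reason forecasts further ahead are usually harder to train and evaluate.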
[AI-35] DevPiolt: Operation Recommendation for IoT Devices at Xiaomi Home
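[Quick Read]: This paper addresses the challenges of operation recommendation for Internet of Things (IoT) devices: complex operation logic that is hard to model, diverse user preferences, and sensitivity to poor recommendations, all of which limit how well existing recommendation models work in practice. The key to the solution is DevPiolt, a recommendation framework based on a large language model (LLM). It first gives the LLM fundamental domain knowledge of IoT operations through continual pre-training and multi-task fine-tuning, then aligns the model with specific user preferences using Direct Preference Optimization (DPO), and finally applies a confidence-based exposure control mechanism so that low-quality recommendations do not cause negative user experiences. The method significantly outperforms baselines on multiple datasets and has been deployed in the Xiaomi Home app, where it increased unique visitor device coverage and page view acceptance rates.
Link: https://arxiv.org/abs/2511.14227
Authors: Yuxiang Wang,Siwen Wang,Haowei Han,Ao Wang,Boya Liu,Yong Zhao,Chengbo Wu,Bin Zhu,Bin Qin,Xiaokai Zhou,Xiao Yan,Jiawei Jiang,Bo Du
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: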
Abstract:Operation recommendation for IoT devices refers to generating personalized device operations for users based on their context, such as historical operations, environment information, and device status. This task is crucial for enhancing user satisfaction and corporate profits. Existing recommendation models struggle with complex operation logic, diverse user preferences, and sensitivity to suboptimal suggestions, limiting their applicability to IoT device operations. To address these issues, we propose DevPiolt, an LLM-based recommendation model for IoT device operations. Specifically, we first equip the LLM with fundamental domain knowledge of IoT operations via continual pre-training and multi-task fine-tuning. Then, we employ direct preference optimization to align the fine-tuned LLM with specific user preferences. Finally, we design a confidence-based exposure control mechanism to avoid negative user experiences from low-quality recommendations. Extensive experiments show that DevPiolt significantly outperforms baselines on all datasets, with an average improvement of 69.5% across all metrics. DevPiolt has been practically deployed in Xiaomi Home app for one quarter, providing daily operation recommendations to 255,000 users. Online experiment results indicate a 21.6% increase in unique visitor device coverage and a 29.1% increase in page view acceptance rates.
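Two of the steps above lend themselves to a short sketch: the standard DPO loss for a single preference pair, and a threshold-style exposure gate. The DPO formula is the commonly published one. How DevPiolt computes confidence is not stated in the abstract, so the gate, its `threshold`, and the operation names are assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    The loss is -log(sigmoid(margin)), written here as log(1 + exp(-margin)).
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))

def gate_recommendations(scored_operations, threshold=0.8):
    """Assumed exposure control: show only operations above a confidence threshold."""
    return [op for op, confidence in scored_operations if confidence >= threshold]

# Hypothetical log-probabilities: the policy favors the chosen operation more
# strongly than the reference model does, so the loss falls below log(2).
loss_aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0)
loss_neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)

shown = gate_recommendations([("turn_on_ac", 0.93), ("open_curtain", 0.41)])
```

A gate like this trades coverage for precision: fewer users see a suggestion, but the ones shown are less likely to be wrong, which matters when a bad suggestion could trigger an unwanted device action.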
[AI-36] LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation
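[Quick Read]: This paper addresses the failure of traditional text-based recommendation methods to capture fine-grained spatial characteristics and real-world distance awareness in local-life service recommendation. Existing methods usually inject location information through prompts, which makes it hard to model the complex spatial relationships among items. The key to the solution is the LGSID framework, which has two core components. The first is an RL-based geographic semantic alignment module that uses a list-wise reward model and a novel G-DPO algorithm to inject spatial knowledge and collaborative signals into the LLM while preserving its semantic understanding. The second is a hierarchical geographic item tokenization strategy that derives primary tokens from discrete spatial and content attributes and refines residual tokens using the aligned LLM's geographic representation vectors, enabling more precise geography-aware recommendation.
Link: https://arxiv.org/abs/2511.14221
Authors: Hao Jiang,Guoquan Wang,Donglin Zhou,Sheng Yu,Yang Zeng,Wencong Zeng,Kun Gai,Guorui Zhou
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: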
Abstract:Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses the pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM’s geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.
[AI-37] Parallelizing Tree Search with Twice Sequential Monte Carlo
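[Quick Read]: This paper addresses the high variance and path degeneracy that Sequential Monte Carlo (SMC) methods suffer in deep search, which limit how well they scale with additional sequential compute. The key to the solution is Twice Sequential Monte Carlo Tree Search (TSMCTS), which introduces a double sampling mechanism to reduce variance and mitigate path degeneracy, markedly improving performance at greater search depth while retaining the natural parallelism of SMC.
Link: https://arxiv.org/abs/2511.14220
Authors: Yaniv Oren,Joery A. de Vries,Pascal R. van der Vaart,Matthijs T. J. Spaan,Wendelin Böhmer
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: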
Abstract:Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS. Through variance reduction and mitigation of path degeneracy, TSMCTS scales favorably with sequential compute while retaining the properties that make SMC natural to parallelize.
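Path degeneracy in plain SMC can be seen in a few lines. The sketch below shows the textbook effective sample size and multinomial resampling, not TSMCTS itself. The particle labels and weights are made up.

```python
import random

def effective_sample_size(weights):
    """ESS = 1 / sum of squared normalized weights; ranges from 1 to len(weights)."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)

def resample(particles, weights, rng):
    """Multinomial resampling: draw a new population in proportion to the weights."""
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(0)
particles = ["a", "b", "c", "d"]    # hypothetical partial search paths
weights = [0.97, 0.01, 0.01, 0.01]  # one path dominates

ess = effective_sample_size(weights)
survivors = resample(particles, weights, rng)
distinct_survivors = len(set(survivors))
```

When one weight dominates, the effective sample size approaches 1 and resampling tends to copy the same path many times. Repeated over many search steps, the population collapses onto a single ancestor, which is the degeneracy that limits plain SMC at depth.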
[AI-38] Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation AAAI2026
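Quick read: This paper addresses the tendency of the Whisper model to produce hallucination errors in noisy conditions, which makes it unreliable under the complex acoustic conditions of real deployments. The key to the solution is a two-stage architecture. The first stage strengthens encoder robustness with Adaptive Layer Attention (ALA): inter-layer correlation analysis groups the encoder layers into semantically coherent blocks, and a learnable multi-head attention module fuses the block representations so that low- and high-level features are exploited jointly. The second stage uses a multi-objective knowledge distillation (KD) framework that trains the student model on noisy audio so that its semantic and attention distributions align with those of a teacher model processing clean audio, directly suppressing hallucinations and improving noise robustness. Experiments show notable reductions in word error rate and hallucination rate while performance on clean speech is preserved.
Link: https://arxiv.org/abs/2511.14219
Authors: Kumud Tripathi,Aditya Srinivas Menon,Aman Gaurav,Raj Prakash Gohil,Pankaj Wasnik
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted at AAAI 2026 - Main Technical Track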
Abstract:The Whisper model, an open-source automatic speech recognition system, is widely adopted for its strong performance across multilingual and zero-shot settings. However, it frequently suffers from hallucination errors, especially under noisy acoustic conditions. Previous works to reduce hallucinations in Whisper-style ASR systems have primarily focused on audio preprocessing or post-processing of transcriptions to filter out erroneous content. However, modifications to the Whisper model itself remain largely unexplored to mitigate hallucinations directly. To address this challenge, we present a two-stage architecture that first enhances encoder robustness through Adaptive Layer Attention (ALA) and further suppresses hallucinations using a multi-objective knowledge distillation (KD) framework. In the first stage, ALA groups encoder layers into semantically coherent blocks via inter-layer correlation analysis. A learnable multi-head attention module then fuses these block representations, enabling the model to jointly exploit low- and high-level features for more robust encoding. In the second stage, our KD framework trains the student model on noisy audio to align its semantic and attention distributions with a teacher model processing clean inputs. Our experiments on noisy speech benchmarks show notable reductions in hallucinations and word error rates, while preserving performance on clean speech. Together, ALA and KD offer a principled strategy to improve Whisper’s reliability under real-world noisy conditions.
[AI-39] Bridging the Gap Between Bayesian Deep Learning and Ensemble Weather Forecasts
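Quick read: This paper addresses the difficulty of quantifying uncertainty in weather forecasting caused by the chaotic nature of the atmosphere. Traditional ensemble prediction systems (EPS) provide probabilistic forecasts but are computationally expensive, while Bayesian Deep Learning (BDL) is promising but often disconnected from EPS. The key to the solution is a unified hybrid BDL framework that uses variational inference to explicitly decompose predictive uncertainty into epistemic and aleatoric components: the former is learned by the BDL model, and the latter is modeled by a physics-informed stochastic perturbation scheme that captures flow-dependent atmospheric dynamics. The paper also establishes a formal theoretical framework proving that total predictive uncertainty decomposes into these two components under the hybrid framework, connecting BDL and EPS in both theory and practice.
Link: https://arxiv.org/abs/2511.14218
Authors: Xinlei Xiong,Wenbo Hu,Shuxun Zhou,Kaifeng Bi,Lingxi Xie,Ying Liu,Richang Hong,Qi Tian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: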
Abstract:Weather forecasting is fundamentally challenged by the chaotic nature of the atmosphere, necessitating probabilistic approaches to quantify uncertainty. While traditional ensemble prediction (EPS) addresses this through computationally intensive simulations, recent advances in Bayesian Deep Learning (BDL) offer a promising but often disconnected alternative. We bridge these paradigms through a unified hybrid Bayesian Deep Learning framework for ensemble weather forecasting that explicitly decomposes predictive uncertainty into epistemic and aleatoric components, learned via variational inference and a physics-informed stochastic perturbation scheme modeling flow-dependent atmospheric dynamics, respectively. We further establish a unified theoretical framework that rigorously connects BDL and EPS, providing formal theorems that decompose total predictive uncertainty into epistemic and aleatoric components under the hybrid BDL framework. We validate our framework on the large-scale 40-year ERA5 reanalysis dataset (1979-2019) with 0.25° spatial resolution. Experimental results show that our method not only improves forecast accuracy and yields better-calibrated uncertainty quantification but also achieves superior computational efficiency compared to state-of-the-art probabilistic diffusion models. We commit to making our code open-source upon acceptance of this paper.
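As a rough illustration of the epistemic/aleatoric split mentioned in the abstract, the sketch below applies the textbook law-of-total-variance decomposition to an ensemble whose members each output a predictive mean and variance. This is not the paper's hybrid framework or its theorems; the function name and the toy numbers are assumptions made for illustration.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(member_means, member_variances):
    """Split ensemble predictive variance into epistemic and aleatoric parts.

    Assumes each ensemble member predicts a mean and a variance for the same
    target (a hypothetical setup, not taken from the paper).
    """
    # Epistemic: disagreement between members, i.e. variance of their means.
    epistemic = pvariance(member_means)
    # Aleatoric: average noise level that the members themselves predict.
    aleatoric = fmean(member_variances)
    return epistemic, aleatoric, epistemic + aleatoric

if __name__ == "__main__":
    # Toy numbers: three members forecasting one temperature value.
    print(decompose_uncertainty([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

If all members agree on the mean, the epistemic term vanishes and only the predicted noise remains; if they predict zero noise but disagree, the total is purely epistemic.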
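[AI-40] Do Large Language Models (LLMs) Understand Chronology?
Quick read: This paper examines whether large language models (LLMs) used in finance and economics may introduce look-ahead bias because they lack an accurate understanding of chronology; the core question is whether LLMs can correctly order facts they already know. The approach is to design three tasks of increasing complexity (chronological ordering, conditional sorting in which items are filtered and then ordered, and anachronism detection) and to evaluate GPT-4.1, Claude-3.7 Sonnet, and GPT-5 under different reasoning budgets. The study finds that allocating an explicit reasoning budget, especially GPT-5 at medium or high reasoning effort, markedly improves chronological consistency, yielding flawless ordering at all lengths and perfect conditional sorting. This suggests that allocating reasoning effort appropriately is the main lever for overcoming current LLM limitations on chronological tasks.
Link: https://arxiv.org/abs/2511.14214
Authors: Pattaraphon Kenny Wongchamcharoen,Paul Glasserman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 47 pages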
Abstract:Large language models (LLMs) are increasingly used in finance and economics, where prompt-based attempts against look-ahead bias implicitly assume that models understand chronology. We test this fundamental question with a series of chronological ordering tasks of increasing complexity over facts the model already knows from pre-training. Our tasks cover (1) chronological ordering, (2) conditional sorting (filter, then order), and (3) anachronism detection. We evaluate GPT-4.1, Claude-3.7 Sonnet with and without Extended Thinking (ET), and GPT-5 across multiple reasoning-effort settings. Across models, the exact-match rate drops sharply as sequences lengthen even while rank correlations stay high, as LLMs largely preserve local order but struggle to maintain a single globally consistent timeline. In conditional sorting, most failures stem from the filtering step rather than the ordering step, but GPT-5 and Claude-3.7 Sonnet with Extended Thinking significantly outperform normal models. Lastly, anachronism detection is found to be the easiest task for the LLMs, but performance still declines with increasingly overlapping timelines or entities. Overall, our main contribution is showing that allocating an explicit reasoning budget helps with chronological ordering: GPT-5 at medium/high reasoning effort achieves flawless ordering at all lengths and perfect conditional sorting (both self-filtered and given-subset), whereas low/minimal effort degrades with longer lists, mirroring earlier models. Our findings delineate limits of current LLMs on chronological tasks, provide insights into task complexity, and demonstrate scenarios in which reasoning helps. These patterns are important for the real-time application of LLMs in finance. We release all code and evaluation templates to support full reproducibility.
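The contrast between exact match and rank correlation can be seen with a small worked example. The sketch below assumes the rank correlation is Spearman's rho over a tie-free ordering (the abstract does not say which coefficient is used), and the function names are illustrative rather than the authors' evaluation code.

```python
def exact_match(predicted, reference):
    """True only if the whole predicted ordering equals the reference."""
    return list(predicted) == list(reference)

def spearman_rho(predicted, reference):
    """Spearman rank correlation between two orderings of the same items.

    Assumes both lists are permutations of the same distinct items
    and contain at least two items.
    """
    n = len(reference)
    position = {item: rank for rank, item in enumerate(predicted)}
    # Sum of squared rank differences between the two orderings.
    d2 = sum((position[item] - rank) ** 2 for rank, item in enumerate(reference))
    return 1 - 6 * d2 / (n * (n * n - 1))

if __name__ == "__main__":
    reference = list(range(10))                    # true chronological order
    predicted = [0, 1, 2, 3, 5, 4, 6, 7, 8, 9]     # one adjacent pair swapped
    print(exact_match(predicted, reference))       # False
    print(round(spearman_rho(predicted, reference), 4))  # 0.9879
```

A single local swap zeroes out exact match while leaving the rank correlation near 1, which is consistent with the pattern the abstract describes for longer sequences.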
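[AI-41] HFL-FlowLLM: Large Language Models for Network Traffic Flow Classification in Heterogeneous Federated Learning
Quick read: This paper addresses network traffic classification under heterogeneous federated learning in modern communication networks. Traditional centralized machine learning struggles with distributed data and privacy constraints, while existing federated learning methods suffer from high training cost and poor generalization. The key to the solution is HFL-FlowLLM, described as the first framework to bring large language models (LLMs) to network traffic classification in heterogeneous federated learning. By drawing on the semantic understanding of LLMs and knowledge transfer across clients, it improves the average F1 score by about 13%, cuts training cost by about 87%, and gains up to a further 5% in F1 as the number of participating clients increases.
Link: https://arxiv.org/abs/2511.14199
Authors: Jiazhuo Tian,Yachao Yuan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: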
Abstract:In modern communication networks driven by 5G and the Internet of Things (IoT), effective network traffic flow classification is crucial for Quality of Service (QoS) management and security. Traditional centralized machine learning struggles with the distributed data and privacy concerns in these heterogeneous environments, while existing federated learning approaches suffer from high costs and poor generalization. To address these challenges, we propose HFL-FlowLLM, which to our knowledge is the first framework to apply large language models to network traffic flow classification in heterogeneous federated learning. Compared to state-of-the-art heterogeneous federated learning methods for network traffic flow classification, the proposed approach improves the average F1 score by approximately 13%, demonstrating compelling performance and strong robustness. Compared to existing federated learning frameworks for large language models, the proposed method achieves up to a 5% improvement in average F1 score as the number of clients participating in each training round increases, while reducing training costs by about 87%. These findings demonstrate the potential and practical value of HFL-FlowLLM for the security of modern communication networks.
[AI-42] DiverseClaire: Simulating Students to Improve Introductory Programming Course Materials for All CS1 Learners
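Quick read: This paper addresses the one-size-fits-all teaching model common in introductory computer science courses such as CS1, which can increase cognitive load and particularly disadvantages neurodiverse learners, including those with autism, attention deficit hyperactivity disorder (ADHD), and dyslexia. The proposed solution is DiverseClaire, a simulated teaching environment based on large language models (LLMs). It builds student personas with different neurodiverse profiles, applies Bloom's Taxonomy and Universal Design for Learning (UDL) principles to restructure traditional lecture slides, and runs controlled experiments. The key idea is to use LLMs to generate feedback from diverse simulated students and to quantify how different material formats affect their performance, supporting the case for offering course materials in multiple formats.
Link: https://arxiv.org/abs/2511.14198
Authors: Wendy Wong,Yuchao Jiang,Yuekang Li
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 2 pages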
Abstract:Although CS programs are booming, introductory courses like CS1 still adopt one-size-fits-all formats that can exacerbate cognitive load and discourage learners with autism, ADHD, dyslexia and other neurological conditions. These challenges call for compassionate pedagogies and Universal Design for Learning (UDL) to create learning environments and materials where cognitive diversity is welcomed. To address this, we introduce DiverseClaire, a pilot study that simulates students, including neurodiverse profiles, using LLMs and diverse personas. By leveraging Bloom’s Taxonomy and UDL, DiverseClaire compared UDL-transformed lecture slides with traditional formats. To evaluate DiverseClaire in controlled experiments, we used the average score as the evaluation metric. The findings revealed that the simulated neurodiverse students struggled with learning due to lecture slides that were in inaccessible formats. These results highlight the need to provide course materials in multiple formats for diverse learner preferences. Data from our pilot study will be made available to assist future CS1 instructors.
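[AI-43] Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion
Quick read: This paper addresses the substantial performance drop that pre-trained Vision-Language-Action (VLA) models suffer during downstream deployment, and the impracticality of fine-tuning, which depends on costly demonstration collection and heavy computation. The key to the solution is VLA-Pilot, a plug-and-play inference-time policy steering method that enables zero-shot deployment without additional fine-tuning or data collection. By adjusting the policy output at inference time, it substantially raises the success rate of pre-trained VLA policies across multiple real-world robot tasks and embodiments, giving robust zero-shot generalization.
Link: https://arxiv.org/abs/2511.14178
Authors: Zhuo Li,Junjia Liu,Zhipeng Dong,Tao Teng,Quentin Rouxel,Darwin Caldwell,Fei Chen
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures, submitted to IEEE RA-L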
Abstract:Vision-Language-Action (VLA) models have demonstrated significant potential in real-world robotic manipulation. However, pre-trained VLA policies still suffer from substantial performance degradation during downstream deployment. Although fine-tuning can mitigate this issue, its reliance on costly demonstration collection and intensive computation makes it impractical in real-world settings. In this work, we introduce VLA-Pilot, a plug-and-play inference-time policy steering method for zero-shot deployment of pre-trained VLA without any additional fine-tuning or data collection. We evaluate VLA-Pilot on six real-world downstream manipulation tasks across two distinct robotic embodiments, encompassing both in-distribution and out-of-distribution scenarios. Experimental results demonstrate that VLA-Pilot substantially boosts the success rates of off-the-shelf pre-trained VLA policies, enabling robust zero-shot generalization to diverse tasks and embodiments. Experimental videos and code are available at: this https URL.
[AI-44] Certified Signed Graph Unlearning
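Quick read: This paper addresses graph unlearning with provable privacy guarantees for Signed Graph Neural Networks (SGNNs). Existing methods ignore the heterogeneous nature of signed graphs and lose critical positive and negative edge information when removing the influence of data, which degrades both model utility and unlearning effectiveness. The key to the solution is Certified Signed Graph Unlearning (CSGU), a three-stage mechanism: it first uses triangular structures to efficiently identify the minimal neighborhood influenced by specific nodes, then quantifies node importance using sociological theories to optimize privacy budget allocation, and finally applies importance-weighted parameter updates to achieve certified model modification with minimal utility loss.
Link: https://arxiv.org/abs/2511.14168
Authors: Junpeng Zhao,Lin Li,Kaixi Hu,Kaize Shi,Jingling Yuan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: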
Abstract:Signed graphs model complex relationships through positive and negative edges, with widespread real-world applications. Given the sensitive nature of such data, selective removal mechanisms have become essential for privacy protection. While graph unlearning enables the removal of specific data influences from Graph Neural Networks (GNNs), existing methods are designed for conventional GNNs and overlook the unique heterogeneous properties of signed graphs. When applied to Signed Graph Neural Networks (SGNNs), these methods lose critical sign information, degrading both model utility and unlearning effectiveness. To address these challenges, we propose Certified Signed Graph Unlearning (CSGU), which provides provable privacy guarantees while preserving the sociological principles underlying SGNNs. CSGU employs a three-stage method: (1) efficiently identifying minimal influenced neighborhoods via triangular structures, (2) applying sociological theories to quantify node importance for optimal privacy budget allocation, and (3) performing importance-weighted parameter updates to achieve certified modifications with minimal utility degradation. Extensive experiments demonstrate that CSGU outperforms existing methods, achieving superior performance in both utility preservation and unlearning effectiveness on SGNNs.
[AI-45] AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
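Quick read: This paper addresses the instability of traditional Vision-Language-Action (VLA) models on long-horizon tasks, where synchronous flow matching (SFM) lets a single action error cascade into failure. The key to the solution is asynchronous flow matching VLA (AsyncVLA), which uses asynchronous flow matching (AFM) to generate actions on a non-uniform time schedule and combines action context awareness with a confidence rater, so that the model can selectively refine inaccurate action tokens. This improves robustness and self-correction on long-horizon tasks.
Link: https://arxiv.org/abs/2511.14148
Authors: Yuhua Jiang,Shuang Cheng,Yan Ding,Feifei Gao,Biqing Qi
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: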
Abstract:Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA achieves state-of-the-art results across general embodied evaluations due to its asynchronous generation in AFM. Our code is available at this https URL.
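[AI-46] Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems
Quick read: This paper addresses the fact that current agentic AI benchmarks focus on task completion accuracy and overlook dimensions that matter for enterprise deployment, such as cost efficiency, reliability, and operational stability. The core of the solution is CLEAR (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework with multi-dimensional metrics. A systematic evaluation of six leading agents on 300 enterprise tasks shows that optimizing for accuracy alone makes agents 4.4 to 10.8 times more expensive, and that CLEAR predicts production success better (correlation ρ = 0.83) than accuracy-only evaluation (ρ = 0.41).
Link: https://arxiv.org/abs/2511.14136
Authors: Sushant Mehta
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: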
Abstract:Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60% (single run) to 25% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose CLEAR (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation ρ=0.83) compared to accuracy-only evaluation (ρ=0.41).
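The gap between single-run success and multi-run consistency can be illustrated with a short sketch. The abstract does not define "8-run consistency"; the code below assumes it means a task counts only if all 8 runs succeed. The function names and the toy outcomes are hypothetical and are not data from the paper.

```python
def single_run_rate(outcomes):
    """Fraction of individual runs that succeeded, pooled over all tasks."""
    runs = [ok for task_runs in outcomes for ok in task_runs]
    return sum(runs) / len(runs)

def all_runs_consistency(outcomes):
    """Fraction of tasks on which every run succeeded (assumed definition)."""
    return sum(all(task_runs) for task_runs in outcomes) / len(outcomes)

if __name__ == "__main__":
    # Hypothetical outcomes: 4 tasks, 8 runs each (True = success).
    outcomes = [
        [True] * 8,
        [True] * 6 + [False] * 2,
        [True] * 4 + [False] * 4,
        [True] * 1 + [False] * 7,
    ]
    print(single_run_rate(outcomes))       # 19 successes out of 32 runs
    print(all_runs_consistency(outcomes))  # 1 task out of 4 is fully consistent
```

An agent that looks acceptable on pooled single runs can still be reliable on only a small share of tasks, which is the kind of gap the abstract's second limitation points to.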
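[AI-47] Fair-GNE: Generalized Nash Equilibrium-Seeking Fairness in Multiagent Healthcare Automation
Quick read: This paper addresses fairness in task allocation for multi-agent reinforcement learning (MARL) in healthcare settings: how to achieve a certifiable, tamper-proof fair workload allocation among multiple self-interested decision makers so that runtime performance is consistent and reliable. Existing methods steer fairness through post hoc reward shaping and lack an intrinsic, self-enforcing fairness mechanism. The key to the solution is to model MARL as a constrained generalized Nash equilibrium (GNE) game. The resulting Fair-GNE framework drives the group policy in a shared-resource environment toward a safe and locally efficient equilibrium at which no single agent can improve its utility by unilaterally changing its decisions. Through adaptive constraint enforcement, the method achieves a statistically significant fairness gain (JFI rising from 0.33 to 0.89, p < 0.01) while maintaining an 86% task success rate, providing principled fairness enforcement in complex multi-agent healthcare systems.
Link: https://arxiv.org/abs/2511.14135
Authors: Promise Ekpo,Saesha Agarwal,Felix Grimm,Lekan Molu,Angelique Taylor
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: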
Abstract:Enforcing a fair workload allocation among multiple agents tasked to achieve an objective in learning-enabled, demand-side healthcare worker settings is crucial for consistent and reliable performance at runtime. Existing multi-agent reinforcement learning (MARL) approaches steer fairness by shaping reward through post hoc orchestrations, leaving no certifiable self-enforceable fairness that is immutable by individual agents at runtime. Contextualized within a setting where each agent shares resources with others, we address this shortcoming with a learning-enabled optimization scheme among self-interested decision makers whose individual actions affect those of other agents. This extends the problem to a generalized Nash equilibrium (GNE) game-theoretic framework where we steer group policy to a safe and locally efficient equilibrium, so that no agent can improve its utility function by unilaterally changing its decisions. Fair-GNE models MARL as a constrained generalized Nash equilibrium-seeking (GNE) game, prescribing an ideal equitable collective equilibrium within the problem’s natural fabric. Our hypothesis is rigorously evaluated in our custom-designed high-fidelity resuscitation simulator. Across all our numerical experiments, Fair-GNE achieves significant improvement in workload balance over fixed-penalty baselines (0.89 vs. 0.33 JFI, p < 0.01) while maintaining 86% task success, demonstrating statistically significant fairness gains through adaptive constraint enforcement. Our results communicate our formulations, evaluation metrics, and equilibrium-seeking innovations in large multi-agent learning-based healthcare systems with clarity and principled fairness enforcement.
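For readers unfamiliar with the fairness metric, the sketch below computes Jain's fairness index over per-agent workloads. It assumes that "JFI" in the abstract refers to Jain's fairness index, which the abstract does not spell out; the function name and the example workloads are illustrative and not taken from the paper.

```python
def jain_fairness_index(workloads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Ranges from 1/n (one agent carries everything) to 1.0 (perfectly even).
    """
    total = sum(workloads)
    squares = sum(w * w for w in workloads)
    if squares == 0:
        # Convention chosen here (an assumption): no work at all counts as fair.
        return 1.0
    return total * total / (len(workloads) * squares)

if __name__ == "__main__":
    print(jain_fairness_index([2, 2, 2]))  # even split across three agents
    print(jain_fairness_index([3, 0, 0]))  # one of three agents does all the work
```

Under this definition, three agents with one agent doing all the work score 1/3, so values near 0.33 indicate a highly lopsided allocation while values near 1 indicate a balanced one.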
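[AI-48] Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation
Quick read: This paper addresses two problems with methods based on large language models (LLMs) for Vision-and-Language Navigation (VLN): LLMs have a limited grasp of real-world spatial correlations, leaving a substantial gap in task completion between them and domain experts, and they add considerable computational cost and inference latency. The key to the solution is R3, a dual-process thinking framework with three cooperating modules. The Runner is a lightweight transformer-based expert model that keeps navigation efficient and accurate in regular situations; the Ruminator uses a multimodal LLM with chain-of-thought (CoT) prompting to elicit structured reasoning; and the Regulator switches the thinking mode according to navigation progress, coordinating the Runner and Ruminator. The framework thus combines the generalization ability of LLMs with VLN-specific expertise in a zero-shot manner. On the REVERIE benchmark, R3 improves SPL and RGSPL by 3.28% and 3.30%, respectively.
Link: https://arxiv.org/abs/2511.14131
Authors: Yu Zhong,Zihao Zhang,Rui Zhang,Lingdong Huang,Haihan Gao,Shuo Wang,Da Li,Ruijian Han,Jiaming Guo,Shaohui Peng,Di Huang,Yunji Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: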
Abstract:Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite their strengths, a substantial gap in task completion performance persists between LLM-based approaches and domain experts, as LLMs inherently struggle to comprehend real-world spatial correlations precisely. Additionally, introducing LLMs is accompanied by substantial computational cost and inference latency. To address these issues, we propose a novel dual-process thinking framework dubbed R3, integrating LLMs’ generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a powerful multimodal LLM as the backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning. The Regulator monitors the navigation progress and controls the appropriate thinking mode according to three criteria, integrating Runner and Ruminator harmoniously. Experimental results illustrate that R3 significantly outperforms other state-of-the-art methods, surpassing them by 3.28% and 3.30% in SPL and RGSPL respectively on the REVERIE benchmark. This pronounced enhancement highlights the effectiveness of our method in handling challenging VLN tasks.
[AI-49] Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services
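Quick read: This paper addresses the limited video streaming and analytics capabilities of current emergency medical services (EMS) in the pre-arrival phase: streaming is restricted to one-to-one video and automated analysis is limited, so dispatchers and emergency medical technicians (EMTs) must manually process large volumes of redundant or noisy information under high stress, delaying critical interventions. The core of the solution is TeleEMS, a system that fuses audio and video for multimodal inference in a unified decision-making pipeline before EMTs arrive on scene. Its key elements are: (1) EMS-Stream, an edge-deployed communication framework that supports smooth multi-party video streaming; (2) EMSLlama, a domain-specialized large language model for robust symptom extraction and normalization; and (3) heart rate estimation from video using rPPG methods, combined with PreNet, a multitask model that jointly models text and vital signs, improving the reliability and accuracy of pre-arrival intervention recommendations.
Link: https://arxiv.org/abs/2511.14119
Authors: Liuyi Jin,Amran Haroon,Radu Stoleru,Pasan Gunawardena,Michael Middleton,Jeeeun Kim
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: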
Abstract:Timely and accurate pre-arrival video streaming and analytics are critical for emergency medical services (EMS) to deliver life-saving interventions. Yet, current-generation EMS infrastructure remains constrained by one-to-one video streaming and limited analytics capabilities, leaving dispatchers and EMTs to manually interpret overwhelming, often noisy or redundant information in high-stress environments. We present TeleEMS, a mobile live video analytics system that enables pre-arrival multimodal inference by fusing audio and video into a unified decision-making pipeline before EMTs arrive on scene. TeleEMS comprises two key components: TeleEMS Client and TeleEMS Server. The TeleEMS Client runs across phones, smart glasses, and desktops to support bystanders, EMTs en route, and 911 dispatchers. The TeleEMS Server, deployed at the edge, integrates EMS-Stream, a communication backbone that enables smooth multi-party video streaming. On top of EMS-Stream, the server hosts three real-time analytics modules: (1) audio-to-symptom analytics via EMSLlama, a domain-specialized LLM for robust symptom extraction and normalization; (2) video-to-vital analytics using state-of-the-art rPPG methods for heart rate estimation; and (3) joint text-vital analytics via PreNet, a multimodal multitask model predicting EMS protocols, medication types, medication quantities, and procedures. Evaluation shows that EMSLlama outperforms GPT-4o (exact-match 0.89 vs. 0.57) and that text-vital fusion improves inference robustness, enabling reliable pre-arrival intervention recommendations. TeleEMS demonstrates the potential of mobile live video analytics to transform EMS operations, bridging the gap between bystanders, dispatchers, and EMTs, and paving the way for next-generation intelligent EMS infrastructure.
[AI-50] Soft-Label Training Preserves Epistemic Uncertainty
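Quick read: This paper addresses uncertainty modeling in machine learning tasks with subjective annotations. Standard practice collapses the distribution of annotator labels into a single hard label, forcing models to express false certainty on inherently ambiguous data and creating a mismatch between model confidence and the diversity of human perception. The key to the solution is soft-label training, which treats the annotation distribution itself as the ground truth so that the model preserves epistemic uncertainty. Empirically, across vision and natural language processing tasks, the approach lowers the KL divergence between model and human annotations by 32% and strengthens the correlation between model entropy and annotation entropy by 61%, while matching the accuracy of hard-label training.
Link: https://arxiv.org/abs/2511.14117
Authors: Agamdeep Singh,Ashish Tiwari,Hosein Hasanbeig,Priyanshu Gupta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: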
Abstract:Many machine learning tasks involve inherent subjectivity, where annotators naturally provide varied labels. Standard practice collapses these label distributions into single labels, aggregating diverse human judgments into point estimates. We argue that this approach is epistemically misaligned for ambiguous data–the annotation distribution itself should be regarded as the ground truth. Training on collapsed single labels forces models to express false confidence on fundamentally ambiguous cases, creating a misalignment between model certainty and the diversity of human perception. We demonstrate empirically that soft-label training, which treats annotation distributions as ground truth, preserves epistemic uncertainty. Across both vision and NLP tasks, soft-label training achieves 32% lower KL divergence from human annotations and 61% stronger correlation between model and annotation entropy, while matching the accuracy of hard-label training. Our work repositions annotation distributions from noisy signals to be aggregated away, to faithful representations of epistemic uncertainty that models should learn to reproduce.
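The difference between hard-label and soft-label targets is easy to show numerically. The following is a minimal sketch, not the authors' training code: it turns hypothetical annotator vote counts into a soft label and measures how far two hypothetical model predictions are from it using KL divergence. All names and numbers are assumptions for illustration.

```python
import math

def normalize_votes(votes):
    """Turn per-class annotator vote counts into a probability distribution."""
    total = sum(votes)
    return [v / total for v in votes]

def hard_label(votes):
    """Collapse votes to a one-hot majority label (ties go to the first class)."""
    winner = votes.index(max(votes))
    return [1.0 if i == winner else 0.0 for i in range(len(votes))]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target probability are skipped."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, predicted) if t > 0)

if __name__ == "__main__":
    votes = [6, 4]                      # hypothetical: 10 annotators, 2 classes
    soft = normalize_votes(votes)       # [0.6, 0.4]
    calibrated = [0.6, 0.4]             # model that mirrors the annotators
    overconfident = [0.99, 0.01]        # model pushed toward the hard label
    print(hard_label(votes))
    print(kl_divergence(soft, calibrated))
    print(kl_divergence(soft, overconfident))
```

A model trained toward the one-hot target is rewarded for the overconfident prediction, whereas measuring against the annotation distribution penalizes it, which is the misalignment the abstract describes.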
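[AI-51] APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design
Quick read: This paper addresses the time-consuming and tedious nature of page layout design for mobile apps: designers repeatedly adjust the size, position, and style of controls and content to achieve good aesthetics and structure, and collaboration across pages makes it hard to align standards and keep styling consistent. The key to the solution is APD-agents, a multi-agent framework driven by a large language model (LLM). Its core components are the OrchestratorAgent (task orchestration), the SemanticParserAgent (semantic parsing), the PrimaryLayoutAgent (initial coarse-grained layout generation), the TemplateRetrievalAgent (retrieval of relevant templates to improve quality), and the RecursiveComponentAgent (recursive refinement of sub-elements). Through automatic collaboration among the agents, the framework generates page layouts end to end from a user's description, improving design efficiency and consistency.
Link: https://arxiv.org/abs/2511.14101
Authors: Xinpeng Chen,Xiaofeng Han,Kaihao Zhang,Guochao Ren,Yujie Wang,Wenhao Cao,Yang Zhou,Jianfeng Lu,Zhenbo Song
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: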
Abstract:Layout design is a crucial step in developing mobile app pages. However, crafting satisfactory designs is time-intensive for designers: they need to consider which controls and content to present on the page, and then repeatedly adjust their size, position, and style for better aesthetics and structure. Although many design tools can now help to perform these repetitive tasks, extensive training is needed to use them effectively. Moreover, collaborative design across app pages demands extra time to align standards and ensure consistent styling. In this work, we propose APD-agents, a large language model (LLM) driven multi-agent framework for automated page design in mobile applications. Our framework contains OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent. Upon receiving the user’s description of the page, the OrchestratorAgent can dynamically direct other agents to accomplish the user’s design task. Specifically, the SemanticParserAgent is responsible for converting users’ descriptions of page content into structured data. The PrimaryLayoutAgent can generate an initial coarse-grained layout of this page. The TemplateRetrievalAgent can fetch semantically relevant few-shot examples and enhance the quality of layout generation. In addition, a RecursiveComponentAgent decides, for each element in the layout, how to recursively generate all the fine-grained sub-elements it contains. Our work fully leverages the automatic collaboration capabilities of large-model-driven multi-agent systems. Experimental results on the RICO dataset show that our APD-agents achieve state-of-the-art performance.
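[AI-52] Collaborative QA using Interacting LLMs. Impact of Network Structure, Node Capability, and Distributed Data
Quick read: This paper addresses the spread of hallucinations when multiple large language models (LLMs) perform collaborative question-answering (CQA) over distributed documents. When LLMs answer without sufficient supporting evidence, their hallucinations propagate through the interaction network and cause previously accurate models to produce errors as well. The key to the solution is an analytically tractable generative model that combines mean-field dynamics (MFD) from network science with the randomized utility model from economics. Each LLM is modeled as a node with a latent state, truthful or not truthful; MFD describes how information diffuses through the network, and the randomized utility model defines the state transition probabilities. On this basis the paper establishes the existence and uniqueness of a fixed point in theory and analyzes how incentives, such as test-time compute, affect system stability and accuracy.
Link: https://arxiv.org/abs/2511.14098
Authors: Adit Jain,Vikram Krishnamurthy,Yiming Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Systems and Control (eess.SY)
Comments: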
Abstract:In this paper, we model and analyze how a network of interacting LLMs performs collaborative question-answering (CQA) in order to estimate a ground truth given a distributed set of documents. This problem is interesting because LLMs often hallucinate when direct evidence to answer a question is lacking, and these effects become more pronounced in a network of interacting LLMs. The hallucination spreads, causing previously accurate LLMs to hallucinate. We study interacting LLMs and their hallucination by combining novel ideas of mean-field dynamics (MFD) from network science and the randomized utility model from economics to construct a useful generative model. We model the LLM with a latent state that indicates if it is truthful or not with respect to the ground truth, and extend a tractable analytical model considering an MFD to model the diffusion of information in a directed network of LLMs. To specify the probabilities that govern the dynamics of the MFD, we propose a randomized utility model. For a network of LLMs, where each LLM has two possible latent states, we posit sufficient conditions for the existence and uniqueness of a fixed point and analyze the behavior of the fixed point in terms of the incentive (e.g., test-time compute) given to individual LLMs. We experimentally study and analyze the behavior of a network of 100 open-source LLMs with respect to data heterogeneity, node capability, network structure, and sensitivity to framing on multiple semi-synthetic datasets.
[AI-53] NeuroPath: Neurobiology-Inspired Path Tracking and Reflection for Semantically Coherent Retrieval NEURIPS2025
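Quick read: This paper addresses two limitations: conventional retrieval-augmented generation (RAG) methods perform poorly on multi-hop question answering because they struggle to capture complex dependencies across documents, and existing graph-based RAG methods tend to introduce semantic incoherence and irrelevant noise during node matching and subgraph construction. The key to the solution is NeuroPath, an LLM-driven semantic path tracking framework inspired by the path navigation mechanism of place cells in neurobiology. It has two core steps. Dynamic Path Tracking performs goal-directed semantic path tracking and pruning over the knowledge graph (KG), suppressing noise and improving semantic coherence. Post-retrieval Completion performs a second retrieval using intermediate reasoning results and the original query, refining the query goal and filling in missing information in the reasoning chain. Together these improve the accuracy and efficiency of multi-hop question answering.
Link: https://arxiv.org/abs/2511.14096
Authors: Junchen Li,Rongzheng Wang,Yihong Huang,Qizhi Chen,Jiasheng Zhang,Shuang Liang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2025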
Abstract:Retrieval-augmented generation (RAG) greatly enhances the performance of large language models (LLMs) in knowledge-intensive tasks. However, naive RAG methods struggle with multi-hop question answering due to their limited capacity to capture complex dependencies across documents. Recent studies employ graph-based RAG to capture document connections. However, these approaches often result in a loss of semantic coherence and introduce irrelevant noise during node matching and subgraph construction. To address these limitations, we propose NeuroPath, an LLM-driven semantic path tracking RAG framework inspired by the path navigational planning of place cells in neurobiology. It consists of two steps: Dynamic Path Tracking and Post-retrieval Completion. Dynamic Path Tracking performs goal-directed semantic path tracking and pruning over the constructed knowledge graph (KG), improving noise reduction and semantic coherence. Post-retrieval Completion further reinforces these benefits by conducting second-stage retrieval using intermediate reasoning and the original query to refine the query goal and complete missing information in the reasoning path. NeuroPath surpasses current state-of-the-art baselines on three multi-hop QA datasets, achieving average improvements of 16.3% on recall@2 and 13.5% on recall@5 over advanced graph-based RAG methods. Moreover, compared to existing iter-based RAG methods, NeuroPath achieves higher accuracy and reduces token consumption by 22.8%. Finally, we demonstrate the robustness of NeuroPath across four smaller LLMs (Llama3.1, GLM4, Mistral0.3, and Gemma3), and further validate its scalability across tasks of varying complexity. Code is available at this https URL.
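The abstract reports gains on recall@2 and recall@5. For readers unfamiliar with the metric, here is a minimal sketch of its common definition (the share of gold passages found in the top-k results). The function name and the passage IDs are illustrative assumptions, and the paper may compute a variant.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant passages that appear among the top-k retrieved ones."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking for one multi-hop question with two gold passages.
ranking = ["p7", "p2", "p9", "p4", "p1"]
gold = {"p2", "p4"}

print(recall_at_k(ranking, gold, 2))  # 0.5
print(recall_at_k(ranking, gold, 5))  # 1.0
```

In multi-hop QA, both gold passages are usually needed to answer the question, which is why recall at small k is a natural headline number.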
[AI-54] CFG-EC: Error Correction Classifier-Free Guidance
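[Quick read]: This paper addresses the loss of generation quality that Classifier-Free Guidance (CFG) suffers during sampling because its conditional and unconditional noise estimates are inconsistent. Specifically, CFG is trained by randomly alternating between conditional and null prompts to enable both conditional and unconditional generation, but at sampling time it evaluates both kinds of prompt simultaneously, which biases the noise estimate and harms image fidelity and prompt alignment. The key to the solution is CFG-EC (Classifier-Free Guidance with Error Correction), whose core mechanism actively adjusts the unconditional noise prediction so that its error component is orthogonal to the conditional error component. This removes interference between the two guidance components, bounds the sampling error from above, and improves both performance at low guidance scales and overall prompt alignment.
Link: https://arxiv.org/abs/2511.14075
Authors: Nakkyu Yang,Yechan Lee,SooJean Han
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: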
Abstract:Classifier-Free Guidance (CFG) has become a mainstream approach for simultaneously improving prompt fidelity and generation quality in conditional generative models. During training, CFG stochastically alternates between conditional and null prompts to enable both conditional and unconditional generation. However, during sampling, CFG outputs both null and conditional prompts simultaneously, leading to inconsistent noise estimates between the training and sampling processes. To reduce this error, we propose CFG-EC, a versatile correction scheme augmentable to any CFG-based method by refining the unconditional noise predictions. CFG-EC actively realigns the unconditional noise error component to be orthogonal to the conditional error component. This corrective maneuver prevents interference between the two guidance components, thereby constraining the sampling error’s upper bound and establishing more reliable guidance trajectories for high-fidelity image generation. Our numerical experiments show that CFG-EC handles the unconditional component more effectively than CFG and CFG++, delivering a marked performance increase in the low guidance sampling regime and consistently higher prompt alignment across the board.
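As background for the abstract above, the sketch below shows the standard CFG combination of conditional and unconditional noise predictions, plus a generic projection that makes one vector orthogonal to another. The projection only illustrates what "orthogonal to the conditional error component" means geometrically; the abstract does not give CFG-EC's actual correction rule, so `remove_component` should not be read as the authors' method. All names and numbers are assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def remove_component(vec, direction):
    """Subtract from vec its projection onto direction (generic linear algebra)."""
    dot_vd = sum(v * d for v, d in zip(vec, direction))
    dot_dd = sum(d * d for d in direction)
    if dot_dd == 0.0:
        return list(vec)  # nothing to project onto
    coeff = dot_vd / dot_dd
    return [v - coeff * d for v, d in zip(vec, direction)]


# Toy two-dimensional noise predictions.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], scale=2.0)
print(guided)  # [2.0, 1.0]

# A toy "unconditional error" made orthogonal to a toy "conditional error".
orthogonal = remove_component([1.0, 1.0], [1.0, 0.0])
print(orthogonal)  # [0.0, 1.0]
```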
[AI-55] CafeMed: Causal Attention Fusion Enhanced Medication Recommendation
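[Quick read]: This paper addresses two limitations of existing medication recommendation systems in personalized treatment decisions: they treat medical entities as independent features and fail to model their synergistic effects on medication selection, and they rely on static causal relationships that cannot adapt to an individual patient's health state and clinical context. The key to the solution is the CafeMed framework, which has two core components: a Causal Weight Generator (CWG) that converts static causal effects into dynamic modulation weights based on the patient's state, and a Channel Harmonized Attention Refinement Module (CHARM) that captures the complex interactions between diagnoses and procedures. This design lets the model capture how multiple conditions jointly influence treatment decisions while respecting medication safety constraints, yielding personalized recommendations that are better aligned with clinical reasoning.
Link: https://arxiv.org/abs/2511.14064
Authors: Kelin Ren,Chan-Yang Ju,Dong-Ho Lee
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: Accepted by BIBM 2025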
Abstract:Medication recommendation systems play a crucial role in assisting clinicians with personalized treatment decisions. While existing approaches have made significant progress in learning medication representations, they suffer from two fundamental limitations: (i) treating medical entities as independent features without modeling their synergistic effects on medication selection; (ii) employing static causal relationships that fail to adapt to patient-specific contexts and health states. To address these challenges, we propose CafeMed, a framework that integrates dynamic causal reasoning with cross-modal attention for safe and accurate medication recommendation. CafeMed introduces two key components: the Causal Weight Generator (CWG) that transforms static causal effects into dynamic modulation weights based on individual patient states, and the Channel Harmonized Attention Refinement Module (CHARM) that captures complex interdependencies between diagnoses and procedures. This design enables CafeMed to model how different medical conditions jointly influence treatment decisions while maintaining medication safety constraints. Extensive experiments on MIMIC-III and MIMIC-IV datasets demonstrate that CafeMed significantly outperforms state-of-the-art baselines, achieving superior accuracy in medication prediction while maintaining lower drug–drug interaction rates. Our results indicate that incorporating dynamic causal relationships and cross-modal synergies leads to more clinically-aligned and personalized medication recommendations. Our code is released publicly at this https URL.
[AI-56] A Machine Learning-Based Multimodal Framework for Wearable Sensor-Based Archery Action Recognition and Stress Estimation
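[Quick read]: This paper addresses the fact that performance in precision sports such as archery depends on both biomechanical stability and psychological resilience, while overcoming the high cost and intrusiveness that keep traditional motion analysis systems out of natural training environments. The key to the solution is a machine learning-based multimodal framework that uses a self-developed wrist-worn device to collect synchronized accelerometer and photoplethysmography (PPG) data for motion phase recognition and stress estimation, respectively. For action recognition, it introduces the Smoothed Differential Acceleration (SmoothDiff) feature and pairs it with an LSTM model, reaching 96.8% accuracy; for stress estimation, it extracts heart rate variability (HRV) features and uses a Multi-Layer Perceptron (MLP) classifier, reaching 80% accuracy in separating high- and low-stress states. The method jointly senses an athlete's technical and mental state and offers a feasible path toward intelligent real-time feedback systems.
Link: https://arxiv.org/abs/2511.14057
Authors: Xianghe Liu,Jiajia Liu,Chuxian Xu,Minghan Wang,Hongbo Peng,Tao Sun,Jiaqi Xu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: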
Abstract:In precision sports such as archery, athletes’ performance depends on both biomechanical stability and psychological resilience. Traditional motion analysis systems are often expensive and intrusive, limiting their use in natural training environments. To address this limitation, we propose a machine learning-based multimodal framework that integrates wearable sensor data for simultaneous action recognition and stress estimation. Using a self-developed wrist-worn device equipped with an accelerometer and photoplethysmography (PPG) sensor, we collected synchronized motion and physiological data during real archery sessions. For motion recognition, we introduce a novel feature, Smoothed Differential Acceleration (SmoothDiff), and employ a Long Short-Term Memory (LSTM) model to identify motion phases, achieving 96.8% accuracy and 95.9% F1-score. For stress estimation, we extract heart rate variability (HRV) features from PPG signals and apply a Multi-Layer Perceptron (MLP) classifier, achieving 80% accuracy in distinguishing high- and low-stress levels. The proposed framework demonstrates that integrating motion and physiological sensing can provide meaningful insights into athletes’ technical and mental states. This approach offers a foundation for developing intelligent, real-time feedback systems for training optimization in archery and other precision sports.
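The abstract mentions HRV features extracted from PPG signals without listing them. As a rough illustration, the sketch below computes three widely used time-domain HRV features from a list of inter-beat intervals in milliseconds. Which features the authors actually feed to their MLP is not stated, so the feature set, the function name, and the sample intervals are all assumptions.

```python
import math
import statistics


def hrv_time_domain(ibi_ms):
    """Common time-domain HRV features from inter-beat intervals (milliseconds)."""
    if len(ibi_ms) < 2:
        raise ValueError("need at least two inter-beat intervals")
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return {
        # Mean heart rate in beats per minute.
        "mean_hr_bpm": 60000.0 / statistics.mean(ibi_ms),
        # SDNN: sample standard deviation of the intervals.
        "sdnn_ms": statistics.stdev(ibi_ms),
        # RMSSD: root mean square of successive differences.
        "rmssd_ms": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
    }


# Hypothetical intervals detected between PPG pulse peaks.
features = hrv_time_domain([800, 810, 790, 805, 795])
print(features["mean_hr_bpm"])  # 75.0
```

A feature vector like this would then be passed to a classifier; lower RMSSD is often associated with higher stress, although the relationship varies between individuals.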
[AI-57] Radial Compensation: Stable and Semantically Decoupled Generative Models on Riemannian Manifolds
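[Quick read]: This paper addresses the increased gradient variance, numerical instability, and degraded likelihoods that arise when generative models on curved spaces (manifolds) entangle curvature with model parameters. Existing approaches each have a drawback: the exponential map preserves geodesic structure but has stiff, radius-dependent Jacobians, while volume-preserving maps keep densities consistent but distort geodesic distances; both mix curvature into parameter semantics. The key to the solution is Radial Compensation (RC), an information-geometric method that chooses a specific base density in the tangent space so that the likelihood depends only on the geodesic distance from a pole, decoupling parameter semantics from curvature. RC lets radial parameters keep their usual meaning in geodesic units while the chart acts as a numerical preconditioner. The authors further derive the Balanced-Exponential (bExp) chart family, which balances volume distortion against geodesic error, and prove that RC is the only construction that yields curvature-invariant Fisher information under geodesic-radial likelihoods, substantially improving the training stability and generation quality of normalizing flows in high-dimensional latent spaces.
Link: https://arxiv.org/abs/2511.14056
Authors: Marios Papamichals,Regina Ruane
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Differential Geometry (math.DG); Machine Learning (stat.ML)
Comments: This is the first version of the paper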
Abstract:Generative models on curved spaces rely on charts to map Euclidean spaces to manifolds. Exponential maps preserve geodesics but have stiff, radius-dependent Jacobians, while volume-preserving charts maintain densities but distort geodesic distances. Both approaches entangle curvature with model parameters, inflating gradient variance. In high-dimensional latent normalizing flows, the wrapped exponential prior can stretch radii far beyond the curvature scale, leading to poor test likelihoods and stiff solvers. We introduce Radial Compensation (RC), an information-geometric method that selects the base density in the tangent space so that the likelihood depends only on geodesic distance from a pole, decoupling parameter semantics from curvature. RC lets radial parameters retain their usual meaning in geodesic units, while the chart can be tuned as a numerical preconditioner. We extend RC to manifolds with known geodesic polar volume and show that RC is the only construction for geodesic-radial likelihoods with curvature-invariant Fisher information. We derive the Balanced-Exponential (bExp) chart family, balancing volume distortion and geodesic error. Under RC, all bExp settings preserve the same manifold density and Fisher information, with smaller dial values reducing gradient variance and flow cost. Empirically, RC yields stable generative models across densities, VAEs, flows on images and graphs, and protein models. RC improves likelihoods, restores clean geodesic radii, and prevents radius blow-ups in high-dimensional flows, making RC-bExp a robust default for likelihood-trained generative models on manifolds.
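The following is a sketch of the standard change-of-variables argument that appears to motivate choosing the tangent-space base density so that the manifold likelihood depends only on geodesic distance. The notation is an assumption for illustration and is not taken from the paper, whose actual construction may differ.

```latex
% Illustrative notation: p_T is a density on the tangent space at the pole o,
% p_M is its pushforward under the exponential map, r = d(o, x) is the geodesic
% radius, and A(r) is the geodesic polar volume density, dV = A(r)\,dr\,d\omega.
\begin{align}
  p_M(x) &= p_T(v)\,\frac{r^{\,n-1}}{A(r)},
         \qquad x = \exp_o(v),\quad r = \lVert v \rVert, \\
  A(r)   &= \sinh^{n-1}(r)
         \quad \text{(hyperbolic space of curvature } -1\text{)}, \\
  p_T(v) &\propto f(r)\,\frac{A(r)}{r^{\,n-1}}
         \;\Longrightarrow\;
         p_M(x) \propto f\bigl(d(o, x)\bigr).
\end{align}
```

With a plain Gaussian for p_T, the factor r^{n-1}/A(r) leaks curvature into the manifold density; absorbing A(r)/r^{n-1} into the base density cancels it, leaving a likelihood that is a function of geodesic radius alone.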
[AI-58] Making Evidence Actionable in Adaptive Learning
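[Quick read]: This paper addresses the problem that adaptive learning systems "diagnose precisely but intervene weakly": they can accurately identify a student's learning gaps through concept-level assessment, yet the follow-up interventions are often mistimed or misaligned with learning goals, which limits instructional impact. The key to the solution is an instructor-governed feedback loop that converts concept-level assessment evidence into vetted micro-interventions, implemented by an adaptive algorithm with three safeguards: adequacy as a hard constraint ensuring that learning gaps are closed, attention as a budget constraint on time and redundancy, and diversity as protection against overfitting to a single resource. The algorithm formalizes intervention assignment as a binary integer program with constraints for coverage, duration, difficulty windows (based on ability estimates), prerequisites (encoded by a concept matrix), and anti-redundancy (enforced through diversity). Greedy selection suits low-resource or latency-critical settings, gradient-based relaxation suits resource-rich repositories, and a hybrid method covers the range in between. In simulation and in a physics course deployment with 1,204 students, the approach achieved full skill coverage within bounded watch time while markedly reducing redundancy and balancing difficulty, yielding a tractable, auditable controller for equitable, load-aware personalization at classroom scale.
Link: https://arxiv.org/abs/2511.14052
Authors: Amirreza Mehrabi,Jason W. Morphew,Breejha Quezada,N. Sanjay Rebello
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Applications (stat.AP); Other Statistics (stat.OT)
Comments: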
Abstract:Adaptive learning often diagnoses precisely yet intervenes weakly, yielding help that is mistimed or misaligned. This study presents evidence supporting an instructor-governed feedback loop that converts concept-level assessment evidence into vetted micro-interventions. The adaptive learning algorithm contains three safeguards: adequacy as a hard guarantee of gap closure, attention as a budgeted constraint for time and redundancy, and diversity as protection against overfitting to a single resource. We formalize intervention assignment as a binary integer program with constraints for coverage, time, difficulty windows informed by ability estimates, prerequisites encoded by a concept matrix, and anti-redundancy enforced through diversity. Greedy selection serves low-richness and tight-latency regimes, gradient-based relaxation serves rich repositories, and a hybrid method transitions along a richness-latency frontier. In simulation and in an introductory physics deployment with one thousand two hundred four students, both solvers achieved full skill coverage for essentially all learners within bounded watch time. The gradient-based method reduced redundant coverage by approximately twelve percentage points relative to greedy and harmonized difficulty across slates, while greedy delivered comparable adequacy with lower computational cost in scarce settings. Slack variables localized missing content and supported targeted curation, sustaining sufficiency across subgroups. The result is a tractable and auditable controller that closes the diagnostic-pedagogical loop and delivers equitable, load-aware personalization at classroom scale.
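To make the greedy solver mentioned in the abstract more concrete, here is a deliberately simplified sketch: it covers a learner's gap concepts under a watch-time budget by repeatedly picking the resource with the most newly covered concepts per minute. It ignores the difficulty windows, prerequisites, and diversity constraints of the paper's integer program, and every name and number in it is a hypothetical placeholder.

```python
def greedy_assign(gaps, resources, time_budget):
    """Pick resources greedily by newly covered concepts per minute.

    resources maps a name to (set of concepts covered, duration in minutes).
    Returns the chosen names in order and the concepts left uncovered.
    """
    uncovered = set(gaps)
    chosen = []
    remaining = time_budget
    while uncovered:
        best_name, best_ratio = None, 0.0
        for name, (concepts, minutes) in resources.items():
            if name in chosen or minutes > remaining:
                continue
            ratio = len(uncovered & concepts) / minutes
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing useful fits in the remaining budget
        concepts, minutes = resources[best_name]
        chosen.append(best_name)
        uncovered -= concepts
        remaining -= minutes
    return chosen, uncovered


resources = {
    "video_a": ({"vectors", "forces"}, 6),
    "video_b": ({"forces"}, 2),
    "video_c": ({"energy"}, 4),
    "video_d": ({"vectors", "forces", "energy"}, 15),
}
chosen, uncovered = greedy_assign({"vectors", "forces", "energy"}, resources, 12)
print(chosen)     # ['video_b', 'video_c', 'video_a']
print(uncovered)  # set()
```

A non-empty `uncovered` set plays a role loosely similar to the slack variables described in the abstract: it points to concepts for which no suitable content fits the budget.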
[AI-59] Syn-STARTS: Synthesized START Triage Scenario Generation Framework for Scalable LLM Evaluation
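[Quick read]: This paper addresses the scarcity of high-quality annotated data for mass casualty incidents (MCIs), which hinders the development and performance evaluation of generative AI for triage decisions. The key to the solution is the Syn-STARTS framework, which uses large language models (LLMs) to automatically generate synthetic triage cases that follow the standard START triage method. Experiments show that the generated cases are qualitatively indistinguishable from the manually curated TRIAGE open dataset, and that LLM accuracy is highly stable across the four triage categories (red, yellow, green, and black), indicating that synthetic data can effectively support the development of high-accuracy AI models for critical medical scenarios.
Link: https://arxiv.org/abs/2511.14023
Authors: Chiharu Hagiwara,Naoki Nonaka,Yuhta Hashimoto,Ryu Uchimido,Jun Seita
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Introducing an open dataset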
Abstract:Triage is a critically important decision-making process in mass casualty incidents (MCIs) to maximize victim survival rates. While the role of AI in such situations is gaining attention for making optimal decisions within limited resources and time, its development and performance evaluation require benchmark datasets of sufficient quantity and quality. However, MCIs occur infrequently, and sufficient records are difficult to accumulate at the scene, making it challenging to collect large-scale real-world data for research use. Therefore, we developed Syn-STARTS, a framework that uses LLMs to generate triage cases, and verified its effectiveness. The results showed that the triage cases generated by Syn-STARTS were qualitatively indistinguishable from the TRIAGE open dataset generated by manual curation from training materials. Furthermore, when evaluating the LLM accuracy using hundreds of cases each from the green, yellow, red, and black categories defined by the standard triage method START, the results were found to be highly stable. This strongly indicates the possibility of synthetic data in developing high-performance AI models for severe and critical medical situations.
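[AI-60] Keeping Code-Aware LLMs Fresh: Full Refresh In-Context Deltas and Incremental Fine-Tuning
[Quick read]: This paper addresses the model degradation caused by continuously evolving codebases, namely how to keep a model fresh on new code without sacrificing retention on earlier code. The core challenge is domain drift in the codebase, including file renames, API changes, and behavior shifts. The key to the solution is a comparison of three update strategies: (A) Full Refresh, which retrains the entire model; (B) In-Context Learning (ICL), which injects recent changes (such as Git diffs or English summaries) at inference time; and (C) Incremental Fine-Tuning (Inc-FT), which fine-tunes on delta-derived datasets and mitigates catastrophic forgetting through a controlled mix of new and old samples. Experiments show that Inc-FT with old-aware mixing gives the best overall balance, ICL provides the fastest lift on new code when training is not feasible, and Full Refresh remains the ceiling when maximum accuracy on new code matters.
Link: https://arxiv.org/abs/2511.14022
Authors: Pradeep Kumar Sharma,Ishaan Puri,Mantinder Jit Singh,Swapnil Shivaprasad,Hritvik Shrivastava
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: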
Abstract:Modern codebases evolve continuously: files are renamed or deleted; public APIs drift; behavior shifts within otherwise familiar modules. A model trained yesterday to map a developer’s natural-language question to the exact set of repository file paths that matter will degrade tomorrow, even if the questions themselves look unchanged. In this paper we study, at system scale and across several widely used repositories, how to keep such a model fresh without surrendering retention on earlier code. We frame freshness as a form of domain drift between a base snapshot and the current HEAD, and we compare three families of update strategies: (A) Full Refresh, retraining the entire model at the new snapshot; (B) In-Context Learning (ICL) that injects recent deltas (raw git diffs or concise English summaries) at inference; and (C) Incremental Fine-Tuning (Inc-FT) on delta-derived training sets, with carefully controlled NEW:OLD mixing to mitigate catastrophic forgetting. We contribute an alias-aware evaluation protocol that credits renames while never rewarding deleted paths, and a practical Forgetting Probe that quantifies residual emissions of obsolete paths. Across Flask, SQLAlchemy, Pandas, and Poetry, Inc-FT with old-aware mixes delivers the best overall balance on mixed sets, ICL with English delta summaries delivers the fastest new-code lift when training is not feasible, and Full Refresh remains the ceiling when maximum NEW accuracy matters. We also compare Git-diff Inc-FT to full-file Inc-FT, showing that diffs excel in rename/delete-heavy windows while full-file context wins in behavior-change-heavy windows.
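The alias-aware evaluation idea (credit renames, never reward deleted paths) can be sketched roughly as follows. This is only one plausible reading of the abstract: the scoring formulas, the function name, and the file paths are hypothetical, and the paper's protocol and Forgetting Probe may be defined differently.

```python
def alias_aware_score(predicted, gold, renames, deleted):
    """Score predicted file paths against gold paths at the current HEAD.

    renames maps an old path to its new path, so predicting the old name
    still earns credit. Deleted paths never count as hits and are reported
    separately as obsolete emissions.
    """
    gold = set(gold)
    resolved = set()
    obsolete = []
    for path in predicted:
        if path in deleted:
            obsolete.append(path)
            continue
        resolved.add(renames.get(path, path))
    hits = resolved & gold
    return {
        "recall": len(hits) / len(gold) if gold else 0.0,
        "precision": len(hits) / len(predicted) if predicted else 0.0,
        "obsolete": obsolete,
    }


# Hypothetical repository state: helpers.py was renamed, legacy.py was deleted.
score = alias_aware_score(
    predicted=["src/app/helpers.py", "src/app/legacy.py", "src/app/cli.py"],
    gold={"src/app/utils.py", "src/app/cli.py"},
    renames={"src/app/helpers.py": "src/app/utils.py"},
    deleted={"src/app/legacy.py"},
)
print(score["recall"])    # 1.0
print(score["obsolete"])  # ['src/app/legacy.py']
```

Counting obsolete emissions separately keeps stale-path predictions visible even when recall looks perfect.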
[AI-61] ALEX: A Light Editing-knowledge Extractor
[Quick read]: This paper addresses the difficulty that large language models (LLMs) have in adapting to evolving information because their knowledge is static, focusing on the scalability and retrieval-efficiency bottlenecks of existing knowledge editing methods on complex multi-hop question answering. The key to the solution is ALEX (A Light Editing-knowledge Extractor), a lightweight knowledge editing framework whose core innovation is a hierarchical memory architecture that organizes knowledge updates into semantic clusters, reducing retrieval complexity from linear O(N) to a highly scalable O(K+N/C). It also integrates an Inferential Query Synthesis (IQS) module and a Dynamic Evidence Adjudication (DEA) engine, which improve multi-hop reasoning accuracy and reasoning-path reliability and shrink the search space by more than 80%.
Link: https://arxiv.org/abs/2511.14018
Authors: Minghu Wang(1, 2, 3),Shuliang Zhao(1, 2, 3),Yuanyuan Zhao(2, 3, 4, 5),Hongxia Xu(1, 2, 3) ((1) College of Computer and Cyber Security, Hebei Normal University, Hebei, China (2) Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics and Data Security, Hebei, China (3) Hebei Provincial Key Laboratory of Network and Information Security, Hebei, China (4) School of Mathematical Sciences, Hebei Normal University, Hebei, China (5) Dept of Information Engineering, Shijiazhuang College of Applied Technology, Hebei, China)
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:The static nature of knowledge within Large Language Models (LLMs) makes it difficult for them to adapt to evolving information, rendering knowledge editing a critical task. However, existing methods struggle with challenges of scalability and retrieval efficiency, particularly when handling complex, multi-hop questions that require multi-step reasoning. To address these challenges, this paper introduces ALEX (A Light Editing-knowledge Extractor), a lightweight knowledge editing framework. The core innovation of ALEX is its hierarchical memory architecture, which organizes knowledge updates (edits) into semantic clusters. This design fundamentally reduces retrieval complexity from a linear O(N) to a highly scalable O(K+N/C). Furthermore, the framework integrates an Inferential Query Synthesis (IQS) module to bridge the semantic gap between queries and facts, and a Dynamic Evidence Adjudication (DEA) engine that executes an efficient two-stage retrieval process. Experiments on the MQUAKE benchmark demonstrate that ALEX significantly improves both the accuracy of multi-hop answers (MultiHop-ACC) and the reliability of reasoning paths (HopWise-ACC). It also reduces the required search space by over 80%, presenting a promising path toward building scalable, efficient, and accurate knowledge editing systems.
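The intuition behind clustering edits to cut retrieval cost can be sketched as a two-stage lookup: compare the query with one centroid per cluster, then search only inside the best cluster rather than across all stored edits. The abstract does not define K or C precisely or describe the DEA engine's internals, so the sketch below is a generic coarse-to-fine retrieval under assumed toy embeddings and placeholder facts, not ALEX itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(query, clusters):
    """Stage 1: pick the closest centroid. Stage 2: pick the closest edit in it."""
    best_cluster = max(clusters, key=lambda c: cosine(query, c["centroid"]))
    text, _ = max(best_cluster["edits"], key=lambda e: cosine(query, e[1]))
    return best_cluster["name"], text


# Toy 2-D embeddings and placeholder edited facts.
clusters = [
    {
        "name": "geography",
        "centroid": [1.0, 0.0],
        "edits": [
            ("The capital of X is A.", [0.9, 0.1]),
            ("River Y flows through B.", [0.8, -0.2]),
        ],
    },
    {
        "name": "sports",
        "centroid": [0.0, 1.0],
        "edits": [("Team Z is coached by C.", [0.1, 0.9])],
    },
]

print(two_stage_retrieve([0.95, 0.05], clusters))
# ('geography', 'The capital of X is A.')
```

The query is scored against two centroids and two edits instead of all three edits; the saving only becomes meaningful when the number of stored edits is large.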
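[AI-62] From Narrow Unlearning to Emergent Misalignment: Causes Consequences and Containment in LLMs
[Quick read]: This paper addresses the risk that refusal unlearning in a specific domain causes harmful behavior to generalize across domains in generative AI, that is, emergent misalignment (EMA): the model becomes less likely to refuse in unrelated domains it was not trained on (such as bias and privacy), which creates potential risk. The core solution is as follows: after narrow-domain refusal unlearning (for example on the Cybersecurity or Safety concept), adding a cross-entropy loss on a small set of retain data from the affected domains largely restores alignment in the other domains while keeping the refusal rate low on the targeted concept. A key finding is that concepts whose representations are more similar in earlier layers are more prone to EMA when the refusal stream is perturbed, which offers mechanistic insight for understanding and mitigating EMA.
Link: https://arxiv.org/abs/2511.14017
Authors: Erum Mushtaq,Anil Ramakrishna,Satyapriya Krishna,Sattvik Sahai,Prasoon Goyal,Kai-Wei Chang,Tao Zhang,Rahul Gupta
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: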
Abstract:Recent work has shown that fine-tuning on insecure code data can trigger an emergent misalignment (EMA) phenomenon, where models generate malicious responses even to prompts unrelated to the original insecure code-writing task. Such cross-domain generalization of harmful behavior underscores the need for a deeper understanding of the algorithms, tasks, and datasets that induce emergent misalignment. In this work, we extend this study by demonstrating that emergent misalignment can also arise from narrow refusal unlearning in specific domains. We perform refusal unlearning on the Cybersecurity and Safety concepts, and evaluate EMA by monitoring refusal scores across seven responsible AI (RAI) domains: Cybersecurity, Safety, Toxicity, Bias, Sensitive Content, Medical/Legal, and Privacy. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept; however, it may also propagate EMA to unrelated domains. Among the two intervened concepts, Cybersecurity and Safety, we find that the Safety concept can have a larger EMA impact, i.e., causing lower refusal scores, across other unrelated domains such as bias. We observe this effect consistently across two model families, Mistral-7b-0.3v and Qwen-7b-2.5. Further, we show that refusal unlearning augmented with a cross-entropy loss function on a small set of retain data from the affected domains can largely, if not fully, restore alignment across the impacted domains while keeping a lower refusal rate on the concept we perform unlearning on. To investigate the underlying causes of EMA, we analyze concept entanglements at the representation level via concept vectors. Our analysis reveals that concepts with higher representation similarity in earlier layers are more susceptible to EMA after intervention when the refusal stream is altered through targeted refusal unlearning.
[AI-63] Developing a Grounded View of AI
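[Quick read]: This paper seeks to clarify, from an engineering perspective, the fundamental behavioral difference between artificial intelligence (AI) and rule-based software programs, and thereby to delineate AI's boundaries and conditions of use. The key to the solution is a methodology that distinguishes the behavior of AI models across three types of decisions. This is a prerequisite for human responsibility when using AI, that is, for making reasonable choices among alternative possibilities, which in turn supports AI system soundness for the well-being of humans, society, and the environment.
Link: https://arxiv.org/abs/2511.14013
Authors: Bifei Mao,Lanqing Hong
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: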
Abstract:As a capability coming from computation, how does AI differ fundamentally from the capabilities delivered by rule-based software programs? The paper examines the behavior of artificial intelligence (AI) from an engineering point of view to clarify its nature and limits. The paper argues that the rationality underlying humanity’s impulse to pursue, articulate, and adhere to rules deserves to be valued and preserved. Identifying where rule-based practical rationality ends is the beginning of making it aware until action. Although the rules of AI behaviors are still hidden or only weakly observable, the paper has proposed a methodology to make a sense of discrimination possible and practical to identify the distinctions of the behavior of AI models with three types of decisions. It is a prerequisite for human responsibilities with alternative possibilities, considering how and when to use AI. It would be a solid start for people to ensure AI system soundness for the well-being of humans, society, and the environment.
[AI-64] Can Artificial Intelligence Accelerate Technological Progress? Researchers Perspectives on AI in Manufacturing and Materials Science
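[Quick read]: This paper addresses the following gap: although AI is widely expected to substantially accelerate technological and scientific progress, there is little detailed empirical study of how AI is actually used in innovation processes, so its real accelerating effect remains unclear. The key to the approach is a set of 32 in-depth interviews with U.S.-based academic researchers in manufacturing and materials science, used to analyze the practical benefits and limits of AI/machine learning (ML) tools for modeling materials and manufacturing processes. The main benefits are lower development cost, shorter timelines, and computation savings; the challenges are that the tools are unreliable outside design spaces already covered by dense data, must be applied judiciously alongside conventional research methods, and may crowd out disruptive theoretical breakthroughs. The paper therefore recommends continued support for conventional empirical, computational, and theoretical research alongside AI-driven incremental innovation, to preserve the likelihood of major scientific advances.
Link: https://arxiv.org/abs/2511.14007
Authors: John P. Nelson,Olajide Olugbade,Philip Shapira,Justin B. Biddle
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: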
Abstract:Artificial intelligence (AI) raises expectations of substantial increases in rates of technological and scientific progress, but such anticipations are often not connected to detailed ground-level studies of AI use in innovation processes. Accordingly, it remains unclear how and to what extent AI can accelerate innovation. To help fill this gap, we report results from 32 interviews with U.S.-based academic manufacturing and materials sciences researchers experienced with AI and machine learning (ML) techniques. Interviewees primarily used AI for modeling of materials and manufacturing processes, facilitating cheaper and more rapid search of design spaces for materials and manufacturing processes alike. They report benefits including cost, time, and computation savings in technology development. However, interviewees also report that AI/ML tools are unreliable outside design spaces for which dense data are already available; that they require skilled and judicious application in tandem with older research techniques; and that AI/ML tools may detrimentally circumvent opportunities for disruptive theoretical advancement. Based on these results, we suggest there is reason for optimism about acceleration in sustaining innovations through the use of AI/ML; but that support for conventional empirical, computational, and theoretical research is required to maintain the likelihood of further major advances in manufacturing and materials science.
[AI-65] FlakyGuard: Automatically Fixing Flaky Tests at Industry Scale
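[Quick read]: This paper addresses the low efficiency of repairing flaky tests (tests that pass or fail non-deterministically) in industrial settings. The core challenge is that existing approaches such as FlakyDoctor fail because of the "context problem": they provide either too little context (missing critical production code) or too much (overwhelming the large language model (LLM) with irrelevant information). The key to the solution is FlakyGuard, which treats code as a graph structure and uses selective graph exploration to locate only the most relevant context, significantly raising the repair success rate. Experiments on real industrial data show that FlakyGuard repairs 47.6% of reproducible flaky tests, that 51.8% of its fixes are accepted by developers, and that it outperforms state-of-the-art approaches by at least 22% in repair success rate; in developer surveys, all respondents found its root cause explanations useful.
Link: https://arxiv.org/abs/2511.14002
Authors: Chengpeng Li,Farnaz Behrang,August Shi,Peng Liu
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: To appear in ASE 2025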
Abstract:Flaky tests that non-deterministically pass or fail waste developer time and slow release cycles. While large language models (LLMs) show promise for automatically repairing flaky tests, existing approaches like FlakyDoctor fail in industrial settings due to the context problem: providing either too little context (missing critical production code) or too much context (overwhelming the LLM with irrelevant information). We present FlakyGuard, which addresses this problem by treating code as a graph structure and using selective graph exploration to find only the most relevant context. Evaluation on real-world flaky tests from industrial repositories shows that FlakyGuard repairs 47.6% of reproducible flaky tests, with 51.8% of the fixes accepted by developers. It also outperforms state-of-the-art approaches by at least 22% in repair success rate. Developer surveys confirm that 100% find FlakyGuard’s root cause explanations useful.
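To give a feel for what "selective graph exploration" could look like, here is a generic sketch: a breadth-first walk over a call graph that starts at the flaky test, stops at a depth limit, and skips nodes judged irrelevant. The abstract does not describe FlakyGuard's actual selection strategy, so the depth limit, the relevance predicate, and all node names are hypothetical.

```python
from collections import deque


def collect_context(graph, start, is_relevant, max_depth=2):
    """Breadth-first walk from start, keeping only relevant nodes within max_depth."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, []):
            if neighbor in seen or not is_relevant(neighbor):
                continue
            seen.add(neighbor)
            order.append(neighbor)
            queue.append((neighbor, depth + 1))
    return order


# Hypothetical call graph: test -> production code -> helpers.
call_graph = {
    "test_checkout": ["CartService.total", "Logger.info"],
    "CartService.total": ["PriceCache.get", "Logger.info"],
    "PriceCache.get": ["Clock.now"],
    "Clock.now": [],
}

context = collect_context(
    call_graph,
    "test_checkout",
    is_relevant=lambda name: not name.startswith("Logger."),
)
print(context)  # ['test_checkout', 'CartService.total', 'PriceCache.get']
```

The depth limit guards against the "too much context" failure mode, while following edges into production code guards against "too little"; a real system would presumably use a far richer relevance signal than a name prefix.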
[AI-66] How to Marginalize in Causal Structure Learning? AAAI2026
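[Quick read]: This paper addresses the challenge of inferring the structure of Bayesian networks (BNs) from data, in particular the restriction that conventional dynamic programming methods for marginalizing over probability distributions place on each node's set of possible parents. The key to the solution is the use of tractable probabilistic circuits: the circuits are trained on both the original distribution and marginal queries, which enables fast and exact marginalization on the learned distribution and in turn improves the performance of Bayesian structure learners.
Link: https://arxiv.org/abs/2511.14001
Authors: William Zhao,Guy Van den Broeck,Benjie Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages. Accepted for presentation at the GCLR 2026 Workshop (colocated with AAAI 2026)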
Abstract:Bayesian networks (BNs) are a widely used class of probabilistic graphical models employed in numerous application domains. However, inferring the network’s graphical structure from data remains challenging. Bayesian structure learners approach this problem by inferring a posterior distribution over the possible directed acyclic graphs underlying the BN. The inference process often requires marginalizing over probability distributions, which is typically done using dynamic programming methods that restrict the set of possible parents for each node. Instead, we present a novel method that utilizes tractable probabilistic circuits to circumvent this restriction. This method utilizes a new learning routine that trains these circuits on both the original distribution and marginal queries. The architecture of probabilistic circuits then inherently allows for fast and exact marginalization on the learned distribution. We then show empirically that utilizing our method to answer marginals allows Bayesian structure learners to improve their performance compared to current methods.
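The claim that probabilistic circuits allow fast and exact marginalization can be illustrated on a tiny example. In a smooth, decomposable circuit, a variable is marginalized by letting its leaves output 1, so a marginal costs a single bottom-up pass. The sketch below builds a two-component mixture over two binary variables. It shows only this general property, not the paper's learning routine, and all names and parameters are made up.

```python
def leaf(var, p_true):
    """Bernoulli leaf; returns 1.0 when var is absent from the evidence."""
    def evaluate(evidence):
        if var not in evidence:
            return 1.0  # marginalized out: P(var=1) + P(var=0) = 1
        return p_true if evidence[var] else 1.0 - p_true
    return evaluate


def product(*children):
    """Product node over children with disjoint variable scopes."""
    def evaluate(evidence):
        result = 1.0
        for child in children:
            result *= child(evidence)
        return result
    return evaluate


def weighted_sum(weighted_children):
    """Sum node: a mixture whose weights add up to one."""
    def evaluate(evidence):
        return sum(w * child(evidence) for w, child in weighted_children)
    return evaluate


# P(A, B) = 0.3 * P1(A) P1(B) + 0.7 * P2(A) P2(B)
circuit = weighted_sum([
    (0.3, product(leaf("A", 0.9), leaf("B", 0.2))),
    (0.7, product(leaf("A", 0.1), leaf("B", 0.6))),
])

marginal_a = circuit({"A": True})  # one pass, B summed out at its leaves
brute_force = circuit({"A": True, "B": True}) + circuit({"A": True, "B": False})
print(abs(marginal_a - brute_force) < 1e-12)  # True
```

The single-pass marginal matches the brute-force sum over B; for many variables the brute-force sum grows exponentially while the circuit pass stays linear in circuit size.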
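[AI-67] LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
[Quick read]: This paper addresses the limitations of existing benchmarks such as LoCoBench for evaluating large language models (LLMs) as software development agents: they focus on single-turn interaction and long-context code understanding and cannot capture the multi-turn interaction, tool usage patterns, and adaptive reasoning that real-world agents require. The key to the solution is LoCoBench-Agent, a comprehensive evaluation framework for realistic software engineering workflows. It extends the original 8,000 scenarios into interactive agent environments, introduces 9 metrics across comprehension and efficiency dimensions, provides 8 specialized tools (file operations, search, code analysis, and others), and supports long-context evaluation from 10K to 1M tokens, enabling systematic measurement of multi-turn conversation, tool usage efficiency, error recovery, and architectural consistency.
Link: https://arxiv.org/abs/2511.13998
Authors: Jielin Qiu,Zuxin Liu,Zhiwei Liu,Rithesh Murthy,Jianguo Zhang,Haolin Chen,Shiyu Wang,Ming Zhu,Liangwei Yang,Juntao Tan,Roshan Ram,Akshara Prabhakar,Tulika Awalgaonkar,Zixiang Chen,Zhepeng Cen,Cheng Qian,Shelby Heinecke,Weiran Yao,Silvio Savarese,Caiming Xiong,Huan Wang
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 54-pages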
Abstract:As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like LoCoBench assess long-context code understanding, they focus on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning required by real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench’s 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics across comprehension and efficiency dimensions. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens, enabling precise assessment of long-context performance. Through systematic evaluation of state-of-the-art models, we reveal several key findings: (1) agents exhibit remarkable long-context robustness; (2) comprehension-efficiency trade-off exists with negative correlation, where thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns differentiating high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.
[AI-68] Artificial Intelligence Agents in Music Analysis: An Integrative Perspective Based on Two Use Cases
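[Quick read]: This paper addresses how to apply AI agents effectively to music analysis and education in order to improve pattern recognition, compositional parameterization, and the accuracy and adaptability of educational feedback. The core of the solution is an integrative framework that combines deep learning, multi-agent architectures, and retrieval-augmented generation (RAG) to enable modular, scalable, and explainable music analysis workflows. Its effectiveness is validated through two case studies, one in secondary education and one on a symbolic music analysis system, where it outperforms traditional automated methods in interpretability and adaptability.
Link: https://arxiv.org/abs/2511.13987
Authors: Antonio Manuel Martínez-Heredia,Dolores Godrid Rodríguez,Andrés Ortiz García
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Extended version of the conference paper presented at SATMUS 2025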
Abstract:This paper presents an integrative review and experimental validation of artificial intelligence (AI) agents applied to music analysis and education. We synthesize the historical evolution from rule-based models to contemporary approaches involving deep learning, multi-agent architectures, and retrieval-augmented generation (RAG) frameworks. The pedagogical implications are evaluated through a dual-case methodology: (1) the use of generative AI platforms in secondary education to foster analytical and creative skills; (2) the design of a multi-agent system for symbolic music analysis, enabling modular, scalable, and explainable workflows. Experimental results demonstrate that AI agents effectively enhance musical pattern recognition, compositional parameterization, and educational feedback, outperforming traditional automated methods in terms of interpretability and adaptability. The findings highlight key challenges concerning transparency, cultural bias, and the definition of hybrid evaluation metrics, emphasizing the need for responsible deployment of AI in educational environments. This research contributes to a unified framework that bridges technical, pedagogical, and ethical considerations, offering evidence-based guidance for the design and application of intelligent agents in computational musicology and music education.
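[AI-69] Node-Level Uncertainty Estimation in LLM-Generated SQL
[Quick read]: This paper addresses the difficulty of pinpointing errors in SQL generated by large language models (LLMs); conventional token-level confidence estimates cannot provide fine-grained error diagnosis. The key to its solution is an uncertainty-estimation framework centered on abstract syntax tree (AST) nodes. First, a semantically aware labeling algorithm judges the correctness of each AST node while avoiding over-penalizing structural containers or alias differences. Second, a rich node representation built from schema-aware and lexical features is used to train a supervised classifier that predicts each node's error probability, which is interpreted as calibrated uncertainty and enables node-level error localization and diagnosis. The method substantially outperforms token-probability baselines across multiple databases and datasets, and supports downstream uses such as targeted repair, human-in-the-loop review, and selective execution.
Link: https://arxiv.org/abs/2511.13984
Authors: Hilaf Hasson,Ruocheng Guo
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: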
Abstract:We present a practical framework for detecting errors in LLM-generated SQL by estimating uncertainty at the level of individual nodes in the query's abstract syntax tree (AST). Our approach proceeds in two stages. First, we introduce a semantically aware labeling algorithm that, given a generated SQL query and a gold reference, assigns node-level correctness without over-penalizing structural containers or alias variation. Second, we represent each node with a rich set of schema-aware and lexical features (capturing identifier validity, alias resolution, type compatibility, ambiguity in scope, and typo signals) and train a supervised classifier to predict per-node error probabilities. We interpret these probabilities as calibrated uncertainty, enabling fine-grained diagnostics that pinpoint exactly where a query is likely to be wrong. Across multiple databases and datasets, our method substantially outperforms token log-probabilities: average AUC improves by +27.44% while maintaining robustness under cross-database evaluation. Beyond serving as an accuracy signal, node-level uncertainty supports targeted repair, human-in-the-loop review, and downstream selective execution. Together, these results establish node-centric, semantically grounded uncertainty estimation as a strong and interpretable alternative to aggregate sequence-level confidence measures.
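[AI-70] Data Whitening Improves Sparse Autoencoder Learning AAAI2026
[Quick read]: This paper addresses the optimization difficulty that correlated input data causes when training sparse autoencoders (SAEs), with the goal of improving their ability to learn interpretable features. The key to its solution is PCA whitening as a preprocessing step: removing redundant correlations from the input activations makes the optimization landscape more convex and easier to navigate. Empirically, the method improves interpretability metrics (such as sparse probing accuracy and feature disentanglement) across multiple model architectures and sparsity settings, at the cost of a slight drop in reconstruction quality. This challenges the assumption that interpretability coincides with the optimal sparsity-fidelity trade-off, and the authors suggest whitening as a default preprocessing step for SAE training, particularly when interpretability is the priority.
Link: https://arxiv.org/abs/2511.13981
Authors: Ashwin Saraswatula,David Klindt
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the AAAI 2026 XAI4Science Workshop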
Abstract:Sparse autoencoders (SAEs) have emerged as a promising approach for learning interpretable features from neural network activations. However, the optimization landscape for SAE training can be challenging due to correlations in the input data. We demonstrate that applying PCA Whitening to input activations – a standard preprocessing technique in classical sparse coding – improves SAE performance across multiple metrics. Through theoretical analysis and simulation, we show that whitening transforms the optimization landscape, making it more convex and easier to navigate. We evaluate both ReLU and Top-K SAEs across diverse model architectures, widths, and sparsity regimes. Empirical evaluation on SAEBench, a comprehensive benchmark for sparse autoencoders, reveals that whitening consistently improves interpretability metrics, including sparse probing accuracy and feature disentanglement, despite minor drops in reconstruction quality. Our results challenge the assumption that interpretability aligns with an optimal sparsity–fidelity trade-off and suggest that whitening should be considered as a default preprocessing step for SAE training, particularly when interpretability is prioritized over perfect reconstruction.
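For readers unfamiliar with the preprocessing step named in this abstract, the following is a minimal sketch of textbook PCA whitening in NumPy. It is not the authors' code: the function name `pca_whiten`, the `eps` regularizer, and the synthetic "activations" are all assumptions made for illustration.

```python
import numpy as np

def pca_whiten(x, eps=1e-8):
    """PCA-whiten x (rows are samples); returns whitened data, mean, and transform.

    Illustrative helper only; `eps` is an assumed regularizer that guards
    against division by near-zero eigenvalues.
    """
    mean = x.mean(axis=0, keepdims=True)
    centered = x - mean
    # Sample covariance of the centered data (features x features).
    cov = centered.T @ centered / (x.shape[0] - 1)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Dividing column j of eigvecs by sqrt(eigvals[j]) rotates onto the
    # principal axes and rescales each axis to unit variance.
    transform = eigvecs / np.sqrt(eigvals + eps)
    return centered @ transform, mean, transform

# Synthetic correlated "activations" (an assumption, not real model data).
rng = np.random.default_rng(0)
mixing = np.eye(4) + 0.5 * np.ones((4, 4))
activations = rng.normal(size=(2000, 4)) @ mixing

whitened, mean, transform = pca_whiten(activations)
whitened_cov = np.cov(whitened, rowvar=False)
print(np.round(whitened_cov, 3))
```

After whitening, the covariance of the data is approximately the identity matrix, which is the decorrelation property the abstract credits with easing SAE optimization.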
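[AI-71] CORGI: Efficient Pattern Matching With Quadratic Guarantees
[Quick read]: This paper addresses the performance bottleneck that inefficient pattern matching creates for rule systems in real-time applications. In particular, when rules contain many underconstrained variables or produce combinatorial intermediate partial matches, traditional RETE-based matching can require exponential time and space, leading to memory overflow or stalled execution. The key to its solution is a new matching algorithm, CORGI (Collection-Oriented Relational Graph Iteration). Its central innovation is to drop the β-memory that RETE uses to store partial matches and adopt a two-step strategy: a forward pass builds and maintains a graph of grounded relations, and a backward traversal of that graph generates matches on demand. This design guarantees quadratic time and space for finding a single satisficing match and supports iteratively streaming subsequent matches without loading the entire conflict set into memory, which avoids high latency and memory overflow.
Link: https://arxiv.org/abs/2511.13942
Authors: Daniel Weitekamp
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Comments: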
Abstract:Rule-based systems must solve complex matching problems within tight time constraints to be effective in real-time applications, such as planning and reactive control for AI agents, as well as low-latency relational database querying. Pattern-matching systems can encounter issues where exponential time and space are required to find matches for rules with many underconstrained variables, or which produce combinatorial intermediate partial matches (but are otherwise well-constrained). When online AI systems automatically generate rules from example-driven induction or code synthesis, they can easily produce worst-case matching patterns that slow or halt program execution by exceeding available memory. In our own work with cognitive systems that learn from example, we've found that aggressive forms of anti-unification-based generalization can easily produce these circumstances. To make these systems practical without hand-engineering constraints or succumbing to unpredictable failure modes, we introduce a new matching algorithm called CORGI (Collection-Oriented Relational Graph Iteration). Unlike RETE-based approaches, CORGI offers quadratic time and space guarantees for finding single satisficing matches, and the ability to iteratively stream subsequent matches without committing entire conflict sets to memory. CORGI differs from RETE in that it does not have a traditional β-memory for collecting partial matches. Instead, CORGI takes a two-step approach: a graph of grounded relations is built/maintained in a forward pass, and an iterator generates matches as needed by working backward through the graph. This approach eliminates the high-latency delays and memory overflows that can result from populating full conflict sets. In a performance evaluation, we demonstrate that CORGI significantly outperforms RETE implementations from SOAR and OPS5 on a simple combinatorial matching task.
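[AI-72] Preference-Based Learning in Audio Applications: A Systematic Analysis
[Quick read]: This paper addresses the lack of effective preference learning methods for evaluating generative audio models; compared with the text domain, preference learning in audio is markedly underexplored. Its approach is a systematic literature review following the PRISMA guidelines that identifies how preference learning is currently applied in audio generation tasks and yields three core findings: (1) multi-dimensional evaluation strategies are emerging that combine synthetic data, automated metrics, and human preferences; (2) traditional objective metrics (such as WER and PESQ) are inconsistently aligned with human subjective judgments; and (3) multi-stage training pipelines that integrate several reward signals to improve output quality are becoming mainstream. These findings point the way for future evaluation of audio generation models and stress the need for standardized benchmarks, high-quality datasets, and deeper study of how the temporal nature of audio affects preference learning.
Link: https://arxiv.org/abs/2511.13936
Authors: Aaron Broukhim,Yiran Shen,Prithviraj Ammanabrolu,Nadir Weibel
Affiliation: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: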
Abstract:Despite the parallel challenges that audio and text domains face in evaluating generative model outputs, preference learning remains remarkably underexplored in audio applications. Through a PRISMA-guided systematic review of approximately 500 papers, we find that only 30 (6%) apply preference learning to audio tasks. Our analysis reveals a field in transition: pre-2021 works focused on emotion recognition using traditional ranking methods (rankSVM), while post-2021 studies have pivoted toward generation tasks employing modern RLHF frameworks. We identify three critical patterns: (1) the emergence of multi-dimensional evaluation strategies combining synthetic, automated, and human preferences; (2) inconsistent alignment between traditional metrics (WER, PESQ) and human judgments across different contexts; and (3) convergence on multi-stage training pipelines that combine reward signals. Our findings suggest that while preference learning shows promise for audio, particularly in capturing subjective qualities like naturalness and musicality, the field requires standardized benchmarks, higher-quality datasets, and systematic investigation of how temporal factors unique to audio impact preference learning frameworks.
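[AI-73] Jailbreaking Large Vision Language Models in Intelligent Transportation Systems
[Quick read]: This paper addresses the vulnerability of Large Vision Language Models (LVLMs) deployed in Intelligent Transportation Systems (ITS) to jailbreaking attacks. The core challenge is that, when processing multimodal inputs containing images and text, LVLMs can be induced by maliciously crafted inputs to produce harmful or inappropriate responses, creating serious safety risks. The paper proposes a novel jailbreaking attack that exploits these weaknesses by combining image typography manipulation with multi-turn prompting, and then designs a multi-layered response filtering defense to block harmful outputs. Experiments show that the attack markedly raises the jailbreak success rate against mainstream LVLMs (both open-source and closed-source), and that the proposed defense effectively reduces the incidence of toxic responses.
Link: https://arxiv.org/abs/2511.13892
Authors: Badhan Chandra Das,Md Tasnim Jawad,Md Jueal Mia,M. Hadi Amini,Yanzhao Wu
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: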
Abstract:Large Vision Language Models (LVLMs) demonstrate strong capabilities in multimodal reasoning and many real-world applications, such as visual question answering. However, LVLMs are highly vulnerable to jailbreaking attacks. This paper systematically analyzes the vulnerabilities of LVLMs integrated in Intelligent Transportation Systems (ITS) under carefully crafted jailbreaking attacks. First, we carefully construct a dataset with harmful queries relevant to transportation, following OpenAI’s prohibited categories to which the LVLMs should not respond. Second, we introduce a novel jailbreaking attack that exploits the vulnerabilities of LVLMs through image typography manipulation and multi-turn prompting. Third, we propose a multi-layered response filtering defense technique to prevent the model from generating inappropriate responses. We perform extensive experiments with the proposed attack and defense on the state-of-the-art LVLMs (both open-source and closed-source). To evaluate the attack method and defense technique, we use GPT-4’s judgment to determine the toxicity score of the generated responses, as well as manual verification. Further, we compare our proposed jailbreaking method with existing jailbreaking techniques and highlight severe security risks involved with jailbreaking attacks with image typography manipulation and multi-turn prompting in the LVLMs integrated in ITS.
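[AI-74] Causal computations in Semi Markovian Structural Causal Models using divide and conquer
[Quick read]: This paper addresses how to extend the divide-and-conquer algorithm of Bjøru et al. from Markovian structural causal models (SCMs) to semi-Markovian SCMs. Semi-Markovian models can represent confounding in which one exogenous variable influences several endogenous variables, a dependency that the original method cannot handle. The key to the solution is to identify the inference challenges that arise when an exogenous variable affects multiple endogenous variables, introduce alternative strategies that restructure how the model is decomposed into sub-models, and evaluate those strategies through theoretical analysis and computational experiments, so that counterfactual probabilities in the original high-cardinality model can be bounded.
Link: https://arxiv.org/abs/2511.13852
Authors: Anna Rodum Bjøru,Rafael Cabañas,Helge Langseth,Antonio Salmerón
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 36 pages, 7 figures, 1 appendix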
Abstract:Recently, Bjøru et al. proposed a novel divide-and-conquer algorithm for bounding counterfactual probabilities in structural causal models (SCMs). They assumed that the SCMs were learned from purely observational data, leading to an imprecise characterization of the marginal distributions of exogenous variables. Their method leveraged the canonical representation of structural equations to decompose a general SCM with high-cardinality exogenous variables into a set of sub-models with low-cardinality exogenous variables. These sub-models had precise marginals over the exogenous variables and therefore admitted efficient exact inference. The aggregated results were used to bound counterfactual probabilities in the original model. The approach was developed for Markovian models, where each exogenous variable affects only a single endogenous variable. In this paper, we investigate extending the methodology to semi-Markovian SCMs, where exogenous variables may influence multiple endogenous variables. Such models are capable of representing confounding relationships that Markovian models cannot. We illustrate the challenges of this extension using a minimal example, which motivates a set of alternative solution strategies. These strategies are evaluated both theoretically and through a computational study.
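[AI-75] ScoresActivation: A New Activation Function for Model Agnostic Global Explainability by Design ECAI2025
[Quick read]: This paper addresses the lack of transparency in the decision process of deep learning models, that is, how to achieve explainability while preserving predictive performance. Current post hoc explanation methods provide feature-importance information, but because they are disconnected from training, the faithfulness and utility of their explanations are limited. The paper proposes a differentiable approach to global explainability by design. Its core innovation is the ScoresActivation function, a feature-ranking mechanism embedded in the learning pipeline that lets the model estimate and optimize feature importance directly during end-to-end training. The method produces stable global feature rankings that agree closely with SHAP values and ground-truth feature importance, is far more computationally efficient than classical SHAP (150 times faster), and also improves classification accuracy, demonstrating robustness to noisy or redundant inputs.
Link: https://arxiv.org/abs/2511.13809
Authors: Emanuel Covaci,Fabian Galis,Radu Balan,Daniela Zaharie,Darian Onchis
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper submitted to ECAI 2025 Conference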
Abstract:Understanding the decisions of large deep learning models is a critical challenge for building transparent and trustworthy systems. Although current post hoc explanation methods offer valuable insights into feature importance, they are inherently disconnected from the model training process, limiting their faithfulness and utility. In this work, we introduce a novel differentiable approach to global explainability by design, integrating feature importance estimation directly into model training. Central to our method is the ScoresActivation function, a feature-ranking mechanism embedded within the learning pipeline. This integration enables models to prioritize features according to their contribution to predictive performance in a differentiable and end-to-end trainable manner. Evaluations across benchmark datasets show that our approach yields globally faithful, stable feature rankings aligned with SHAP values and ground-truth feature importance, while maintaining high predictive performance. Moreover, feature scoring is 150 times faster than the classical SHAP method, requiring only 2 seconds during training compared to SHAP's 300 seconds for feature ranking in the same configuration. Our method also improves classification accuracy by 11.24% with 10 features (5 relevant) and 29.33% with 16 features (5 relevant, 11 irrelevant), demonstrating robustness to irrelevant inputs. This work bridges the gap between model accuracy and interpretability, offering a scalable framework for inherently explainable machine learning.
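[AI-76] GAEA: Experiences and Lessons Learned from a Country-Scale Environmental Digital Twin
[Quick read]: This paper addresses how to deploy and apply an environmental digital twin effectively at country scale to support decision-making across multiple domains. The key to its solution is an integrated platform named GAEA, which brings together 27 environmental geospatial services and serves sectors including urban planning, policymaking, agriculture, real estate, forestry, insurance, and banking, enabling efficient analysis and visualization of environmental data and supporting evidence-based decisions.
Link: https://arxiv.org/abs/2511.13807
Authors: Andreas Kamilaris,Chirag Padubidri,Asfa Jamil,Arslan Amin,Indrajit Kalita,Jyoti Harti,Savvas Karatsiolis,Aytac Guley
Affiliation: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: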
Abstract:This paper describes the experiences and lessons learned from three years of operating a country-scale environmental digital twin deployed on the island of Cyprus. This digital twin, called GAEA, contains 27 environmental geospatial services and is suitable for urban planners, policymakers, farmers, property owners, real-estate and forestry professionals, as well as insurance companies and banks that have properties in their portfolio. This paper demonstrates the power and potential of geospatial analytics and environmental digital twins on a large scale, along with their current and future challenges.
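[AI-77] Modeling Fairness in Recruitment AI via Information Flow
[Quick read]: This paper addresses insufficient fairness and unclear accountability in AI-supported decision-making, focusing on how algorithmic and human decision steps interact in practice and may introduce or amplify bias. The key to its solution is an information flow-based modeling framework. Using semi-structured stakeholder interviews and iterative modeling, the authors build a multi-level representation of a recruitment process that makes explicit how information is transformed, filtered, and interpreted across algorithmic and human components. This reveals where biases arise, how they propagate, and what downstream effects they have on candidates, providing an actionable and transparent tool for analyzing fairness risks in complex socio-technical systems.
Link: https://arxiv.org/abs/2511.13793
Authors: Mattias Brännström,Themis Dimitra Xanthopoulou,Lili Jiang
Affiliation: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: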
Abstract:Avoiding bias and understanding the real-world consequences of AI-supported decision-making are critical to addressing fairness and assigning accountability. Existing approaches often focus either on technical aspects, such as datasets and models, or on high-level socio-ethical considerations, and rarely capture how these elements interact in practice. In this paper, we apply an information flow-based modeling framework to a real-world recruitment process that integrates automated candidate matching with human decision-making. Through semi-structured stakeholder interviews and iterative modeling, we construct a multi-level representation of the recruitment pipeline, capturing how information is transformed, filtered, and interpreted across both algorithmic and human components. We identify where biases may emerge, how they can propagate through the system, and what downstream impacts they may have on candidates. This case study illustrates how information flow modeling can support structured analysis of fairness risks, providing transparency across complex socio-technical systems.
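[AI-78] Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
[Quick read]: This paper addresses the security of large language models (LLMs) against backdoor attacks, especially attacks that use dynamic or implicit triggers, which conventional defenses struggle to detect and mitigate because they depend on specific trigger types or an additional clean model. The key to its solution is the observation that backdoor attacks cause unusually high similarity among attention heads. Building on this, the paper proposes an attention safety alignment method that combines attention-similarity-based detection with head-wise fine-tuning, identifying and repairing potentially contaminated attention heads without prior knowledge of the trigger. The method significantly lowers the attack success rate while preserving performance on downstream tasks.
Link: https://arxiv.org/abs/2511.13789
Authors: Haotian Jin,Yang Li,Haihui Fan,Lin Shen,Xiangfang Li,Bo Li
Affiliation: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: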
Abstract:Backdoor attacks pose a serious threat to the security of large language models (LLMs), causing them to exhibit anomalous behavior under specific trigger conditions. The design of backdoor triggers has evolved from fixed triggers to dynamic or implicit triggers. This increased flexibility in trigger design makes it challenging for defenders to identify their specific forms accurately. Most existing backdoor defense methods are limited to specific types of triggers or rely on an additional clean model for support. To address this issue, we propose a backdoor detection method based on attention similarity, enabling backdoor detection without prior knowledge of the trigger. Our study reveals that models subjected to backdoor attacks exhibit unusually high similarity among attention heads when exposed to triggers. Based on this observation, we propose an attention safety alignment approach combined with head-wise fine-tuning to rectify potentially contaminated attention heads, thereby effectively mitigating the impact of backdoor attacks. Extensive experimental results demonstrate that our method significantly reduces the success rate of backdoor attacks while preserving the model’s performance on downstream tasks.
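The observation that triggered inputs produce unusually similar attention heads can be illustrated with a small sketch. The abstract does not specify the similarity measure or the detection threshold, so the mean pairwise cosine similarity used below, the function names, and the toy attention rows are all assumptions rather than the paper's method.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_head_similarity(heads):
    """Average cosine similarity over all pairs of attention heads.

    Each head is a flattened attention distribution. The aggregation is an
    illustrative choice, not taken from the paper.
    """
    pairs = list(combinations(heads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy attention rows over three tokens (hypothetical numbers).
clean_heads = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
# With a trigger present, every head collapses onto the same (last) token.
triggered_heads = [[0.05, 0.05, 0.9], [0.1, 0.05, 0.85], [0.02, 0.08, 0.9]]

clean_score = mean_head_similarity(clean_heads)
triggered_score = mean_head_similarity(triggered_heads)
print(f"clean: {clean_score:.3f}, triggered: {triggered_score:.3f}")
```

In this toy setting the triggered score is close to 1 while the clean score is much lower; a real detector would need a threshold calibrated on actual model attention maps.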
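[AI-79] Quantifying Distribution Shift in Traffic Signal Control with Histogram-Based GEH Distance
[Quick read]: This paper addresses the performance degradation of traffic signal control algorithms under distribution shift, where control policies deteriorate markedly when real traffic conditions differ from those seen during training or design. The key to its solution is to represent traffic scenarios as demand histograms and compare them quantitatively with a GEH distance function. The method is policy-independent, interpretable, and built on a statistic widely used in traffic engineering. Experiments show that larger distances between scenarios correspond to longer travel times and lower throughput, with especially strong explanatory power for learning-based control policies, so the method can better predict and monitor performance degradation in adaptive traffic signal control.
Link: https://arxiv.org/abs/2511.13785
Authors: Federico Taschin,Ozan K. Tonguz
Affiliation: unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: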
Abstract:Traffic signal control algorithms are vulnerable to distribution shift, where performance degrades under traffic conditions that differ from those seen during design or training. This paper introduces a principled approach to quantify distribution shift by representing traffic scenarios as demand histograms and comparing them with a GEH-based distance function. The method is policy-independent, interpretable, and leverages a widely used traffic engineering statistic. We validate the approach on 20 simulated scenarios using both a NEMA actuated controller and a reinforcement learning controller (FRAP++). Results show that larger scenario distances consistently correspond to increased travel time and reduced throughput, with particularly strong explanatory power for learning-based control. Overall, this method can predict performance degradation under distribution shift better than previously published techniques. These findings highlight the utility of the proposed framework for benchmarking, training regime design, and monitoring in adaptive traffic signal control.
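As background for the distance mentioned in this abstract, the GEH statistic commonly used in traffic engineering compares two hourly volumes M and C as sqrt(2(M - C)^2 / (M + C)). The sketch below applies it bin by bin to two demand histograms. Averaging over bins is an assumption made here for illustration; the abstract does not state how the paper aggregates per-bin values, and the function names and volumes are hypothetical.

```python
import math

def geh(m, c):
    """GEH statistic between two hourly traffic volumes."""
    total = m + c
    if total == 0:
        return 0.0  # both volumes are zero, so there is no discrepancy
    return math.sqrt(2.0 * (m - c) ** 2 / total)

def histogram_geh_distance(hist_a, hist_b):
    """Mean per-bin GEH between two demand histograms of equal length.

    The mean is an illustrative aggregation, not necessarily the paper's.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    return sum(geh(a, b) for a, b in zip(hist_a, hist_b)) / len(hist_a)

# Hypothetical hourly demand (vehicles/hour) for four approaches.
training_scenario = [900, 400, 650, 300]
shifted_scenario = [1000, 700, 600, 150]

distance = histogram_geh_distance(training_scenario, shifted_scenario)
print(f"GEH-based distance: {distance:.2f}")
```

Because GEH normalizes the squared difference by the combined volume, a gap of 100 vehicles/hour counts for more on a low-volume approach than on a high-volume one.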
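[AI-80] Imagine in Space: Exploring the Frontier of Spatial Intelligence and Reasoning Efficiency in Vision Language Models
[Quick read]: This paper addresses the marked shortcomings of current vision language models (VLMs) in spatial reasoning, particularly on tasks fundamental to human cognition such as mental rotation, navigation, and spatial relationship comprehension. The study finds that advanced VLMs rely mainly on linguistic representations for reasoning and imagination, which leads to poor performance on tasks that require perceiving spatial relations and 3D geometric transformations, and that their reasoning efficiency drops sharply as transformation complexity grows. The key to its solution is an Imagery Driven Framework (IDF) that implicitly builds an internal world model through data synthesis and training, improving both the model's understanding of spatial relations and its reasoning efficiency.
Link: https://arxiv.org/abs/2511.13782
Authors: Xiaoxing Lian,Aidong Yang,Jun Zhu,Peng Wang,Yue Zhang
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages; a detailed and effective benchmark for spatial reasoning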
Abstract:Large language models (LLMs) and vision language models (VLMs), such as DeepSeek R1, OpenAI o3, and Gemini 2.5 Pro, have demonstrated remarkable reasoning capabilities across logical inference, problem solving, and decision making. However, spatial reasoning, a fundamental component of human cognition that includes mental rotation, navigation, and spatial relationship comprehension, remains a significant challenge for current advanced VLMs. We hypothesize that imagination, the internal simulation of spatial states, is the dominant reasoning mechanism within a spatial world model. To test this hypothesis and systematically probe current VLM spatial reasoning mechanisms, we introduce SpatiaLite, a fully synthetic benchmark that jointly measures spatial reasoning accuracy and reasoning efficiency. Comprehensive experiments reveal three key findings. First, advanced VLMs predominantly rely on linguistic representations for reasoning and imagination, resulting in significant deficiencies on visual-centric tasks that demand perceptual spatial relations and 3D geometry transformations such as mental rotation or projection prediction. Second, advanced VLMs exhibit severe inefficiency in their current spatial reasoning mechanisms, with token usage growing rapidly as transformation complexity increases. Third, we propose an Imagery Driven Framework (IDF) for data synthesis and training, which can implicitly construct an internal world model that is critical for spatial reasoning in VLMs. Building on SpatiaLite, this work delineates the spatial reasoning limits and patterns of advanced VLMs, identifies key shortcomings, and informs future advances.
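[AI-81] Semantic Multiplexing
[Quick read]: This paper addresses the limited task concurrency that results when mobile devices run multiple computing tasks in parallel at the wireless edge while existing communication systems support parallel transmission only at the bit level. The key to its solution is the new concept of Semantic Multiplexing: compressed representations related to multiple tasks are merged into a single semantic representation, so multiplexing happens at the task level. This extends the effective degrees of freedom without adding antennas or bandwidth and, by operating at the semantic layer, does not contradict Shannon capacity limits. The method substantially improves multi-task parallel processing; on tasks such as image classification and sentiment analysis it maintains high accuracy while reducing latency by up to 8×, energy consumption by up to 25×, and communication load by up to 54× relative to the baselines.
Link: https://arxiv.org/abs/2511.13779
Authors: Mohammad Abdi,Francesca Meneghello,Francesco Restuccia
Affiliation: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Image and Video Processing (eess.IV)
Comments: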
Abstract:Mobile devices increasingly require the parallel execution of several computing tasks offloaded at the wireless edge. Existing communication systems only support parallel transmissions at the bit level, which fundamentally limits the number of tasks that can be concurrently processed. To address this bottleneck, this paper introduces the new concept of Semantic Multiplexing. Our approach shifts stream multiplexing from bits to tasks by merging multiple task-related compressed representations into a single semantic representation. As such, Semantic Multiplexing can multiplex more tasks than the number of physical channels without adding antennas or widening bandwidth by extending the effective degrees of freedom at the semantic layer, without contradicting Shannon capacity rules. We have prototyped Semantic Multiplexing on an experimental testbed with Jetson Orin Nano and millimeter-wave software-defined radios and tested its performance on image classification and sentiment analysis while comparing to several existing baselines in semantic communications. Our experiments demonstrate that Semantic Multiplexing allows jointly processing multiple tasks at the semantic level while maintaining sufficient task accuracy. For example, image classification accuracy drops by less than 4% when the number of tasks multiplexed over a 4×4 channel increases from 2 to 8. Semantic Multiplexing reduces latency, energy consumption, and communication load by up to 8×, 25×, and 54×, respectively, compared to the baselines while keeping comparable performance. We pledge to publicly share the complete software codebase and the collected datasets for reproducibility.
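[AI-82] ExplainableGuard: Interpretable Adversarial Defense for Large Language Models Using Chain-of-Thought Reasoning
[Quick read]: This paper addresses the susceptibility of large language models (LLMs) to adversarial attacks in which subtle perturbations mislead the model, causing outputs to deviate from what is expected in ways that are hard to notice. Most existing defenses operate as black boxes and offer no insight into their decisions, which limits trustworthy deployment. The key to its solution is ExplainableGuard, an interpretable defense framework that uses the chain-of-thought (CoT) reasoning capabilities of DeepSeek-Reasoner. Tailored CoT prompts guide the model through a multi-faceted analysis of the input text at the character, word, structural, and semantic levels, and it produces a purified output together with a human-readable, step-by-step explanation, giving a defense that is both effective and transparent.
Link: https://arxiv.org/abs/2511.13771
Authors: Shaowei Guan,Yu Zhai,Zhengyu Zhang,Yanze Wang,Hin Chi Kwok
Affiliation: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures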
Abstract:Large Language Models (LLMs) are increasingly vulnerable to adversarial attacks that can subtly manipulate their outputs. While various defense mechanisms have been proposed, many operate as black boxes, lacking transparency in their decision-making. This paper introduces ExplainableGuard, an interpretable adversarial defense framework leveraging the chain-of-thought (CoT) reasoning capabilities of DeepSeek-Reasoner. Our approach not only detects and neutralizes adversarial perturbations in text but also provides step-by-step explanations for each defense action. We demonstrate how tailored CoT prompts guide the LLM to perform a multi-faceted analysis (character, word, structural, and semantic) and generate a purified output along with a human-readable justification. Preliminary results on the GLUE Benchmark and IMDB Movie Reviews dataset show promising defense efficacy. Additionally, a human evaluation study reveals that ExplainableGuard’s explanations outperform ablated variants in clarity, specificity, and actionability, with a 72.5% deployability-trust rating, underscoring its potential for more trustworthy LLM deployments.
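[AI-83] Dynamic Temperature Scheduler for Knowledge Distillation
[Quick read]: This paper addresses the inefficiency that a fixed temperature causes in knowledge distillation (KD), along with the mismatch in logit magnitudes that arises from architectural differences between teacher and student. Conventional methods use a static temperature, which cannot adapt as the gap between teacher and student distributions changes during training. The paper proposes the Dynamic Temperature Scheduler (DTS), whose core innovation is to adapt the temperature to the cross-entropy loss gap between teacher and student, so that the student benefits from softer output probabilities early in training and moves to sharper distributions later. The scheme integrates into existing KD frameworks without modification and consistently outperforms static-temperature baselines on vision and natural language processing tasks.
Link: https://arxiv.org/abs/2511.13767
Authors: Sibgat Ul Islam,Jawad Ibn Ahad,Fuad Rahman,Mohammad Ruhul Amin,Nabeel Mohammed,Shafin Rahman
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: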
Abstract:Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model, with temperature as a key hyperparameter controlling the softness of output probabilities. Traditional methods use a fixed temperature throughout training, which is suboptimal. Moreover, architectural differences between teacher and student often result in mismatched logit magnitudes. We demonstrate that students benefit from softer probabilities early in training but require sharper probabilities in later stages. We introduce Dynamic Temperature Scheduler (DTS), which adjusts temperature dynamically based on the cross-entropy loss gap between teacher and student. To our knowledge, this is the first temperature scheduling method that adapts based on the divergence between teacher and student distributions. Our method integrates seamlessly with existing KD frameworks. We validate DTS across multiple KD strategies on vision (CIFAR-100, Tiny-ImageNet) and NLP tasks (GLUE, Dolly, SelfIns, UnNI, S-NI), consistently outperforming static-temperature baselines. Code is available at this https URL.
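To make the role of temperature concrete, here is a minimal sketch of a temperature-scaled softmax together with a gap-driven schedule. The softmax is standard; the schedule `gap_based_temperature`, with its linear form and its `t_min`, `t_max`, and `scale` parameters, is a hypothetical stand-in chosen for illustration and is not the DTS formula, which the abstract does not give.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gap_based_temperature(student_loss, teacher_loss,
                          t_min=1.0, t_max=8.0, scale=2.0):
    """Hypothetical schedule: temperature grows with the student-teacher loss gap.

    This linear rule is an assumption for illustration, not the paper's method.
    """
    gap = max(0.0, student_loss - teacher_loss)
    return min(t_max, t_min + scale * gap)

teacher_logits = [4.0, 1.0, 0.5]

# Early in training the student lags far behind the teacher: large gap.
early_t = gap_based_temperature(student_loss=2.5, teacher_loss=0.5)
# Late in training the gap has nearly closed.
late_t = gap_based_temperature(student_loss=0.6, teacher_loss=0.5)

early_targets = softmax_with_temperature(teacher_logits, early_t)
late_targets = softmax_with_temperature(teacher_logits, late_t)
print(early_t, [round(p, 3) for p in early_targets])
print(late_t, [round(p, 3) for p in late_targets])
```

Under this assumed rule, the early targets spread probability mass across classes while the late targets concentrate on the top class, matching the soft-then-sharp behavior the abstract describes.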
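[AI-84] Credal Ensemble Distillation for Uncertainty Quantification AAAI2026
[Quick read]: This paper addresses the high computational and memory cost of deep ensembles (DE) at inference time, which limits their practical deployment. The key to its solution is a new framework, Credal Ensemble Distillation (CED), which compresses a multi-model DE into a single model, CREDIT. Rather than a single softmax distribution, CREDIT outputs class-wise probability intervals that define a credal set, which it uses to quantify predictive uncertainty. The method achieves uncertainty estimation comparable to or better than several existing baselines while substantially reducing inference overhead relative to DE.
Link: https://arxiv.org/abs/2511.13766
Authors: Kaizheng Wang,Fabio Cuzzolin,David Moens,Hans Hallez
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: An extended version for Credal Ensemble Distillation for Uncertainty Quantification, which has been accepted for publication at AAAI 2026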
Abstract:Deep ensembles (DE) have emerged as a powerful approach for quantifying predictive uncertainty and distinguishing its aleatoric and epistemic components, thereby enhancing model robustness and reliability. However, their high computational and memory costs during inference pose significant challenges for wide practical deployment. To overcome this issue, we propose credal ensemble distillation (CED), a novel framework that compresses a DE into a single model, CREDIT, for classification tasks. Instead of a single softmax probability distribution, CREDIT predicts class-wise probability intervals that define a credal set, a convex set of probability distributions, for uncertainty quantification. Empirical results on out-of-distribution detection benchmarks demonstrate that CED achieves superior or comparable uncertainty estimation compared to several existing baselines, while substantially reducing inference overhead compared to DE.
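The idea of class-wise probability intervals defining a credal set can be sketched as follows. Taking the per-class minimum and maximum over ensemble members is one simple way to obtain such intervals and is an assumption here; the abstract does not describe how CREDIT's interval targets are actually built, and the function names and probabilities are hypothetical.

```python
def probability_intervals(member_probs):
    """Per-class [lower, upper] bounds across ensemble members' predictions.

    Min/max over members is an illustrative construction, not necessarily
    the one used in the paper.
    """
    n_classes = len(member_probs[0])
    lower = [min(p[k] for p in member_probs) for k in range(n_classes)]
    upper = [max(p[k] for p in member_probs) for k in range(n_classes)]
    return lower, upper

def in_credal_set(dist, lower, upper, tol=1e-9):
    """True if dist is a probability vector lying inside every class interval."""
    sums_to_one = abs(sum(dist) - 1.0) <= 1e-6
    within = all(lo - tol <= p <= hi + tol
                 for p, lo, hi in zip(dist, lower, upper))
    return sums_to_one and within

# Hypothetical softmax outputs of three ensemble members for one input.
ensemble = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
lower, upper = probability_intervals(ensemble)

# Interval widths give a rough per-class indication of member disagreement.
widths = [hi - lo for lo, hi in zip(lower, upper)]
print(lower, upper, [round(w, 2) for w in widths])
print(in_credal_set([0.6, 0.25, 0.15], lower, upper))
print(in_credal_set([0.9, 0.05, 0.05], lower, upper))
```

Wider intervals mean the members disagree more about that class, which is the kind of signal a single interval-predicting model could expose without running the whole ensemble.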
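[AI-85] PROF: An LLM-based Reward Code Preference Optimization Framework for Offline Imitation Learning
[Quick read]: This paper addresses how to construct and optimize a reward function for training high-quality policies in offline imitation learning (offline IL) when explicit reward annotations are missing. Existing methods typically assume that a trajectory's similarity to expert demonstrations is positively correlated with reward, an assumption that oversimplifies the true reward structure and limits performance. The key to the solution is the PROF framework, which uses large language models (LLMs) to generate and iteratively improve executable reward function code from natural language descriptions and a single expert trajectory. It also introduces Reward Preference Ranking (RPR), a strategy that assesses and ranks reward functions against expert preferences without environment interaction or RL training. Alternating between RPR and text-based gradient optimization automates the selection and refinement of reward functions for downstream policy learning.
Link: https://arxiv.org/abs/2511.13765
Authors: Shengjie Sun,Jiafei Lyu,Runze Liu,Mengbei Yan,Bo Liu,Deheng Ye,Xiu Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: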
Abstract:Offline imitation learning (offline IL) enables training effective policies without requiring explicit reward annotations. Recent approaches attempt to estimate rewards for unlabeled datasets using a small set of expert demonstrations. However, these methods often assume that the similarity between a trajectory and an expert demonstration is positively correlated with the reward, which oversimplifies the underlying reward structure. We propose PROF, a novel framework that leverages large language models (LLMs) to generate and improve executable reward function codes from natural language descriptions and a single expert trajectory. We propose Reward Preference Ranking (RPR), a novel reward function quality assessment and ranking strategy without requiring environment interactions or RL training. RPR calculates the dominance scores of the reward functions, where higher scores indicate better alignment with expert preferences. By alternating between RPR and text-based gradient optimization, PROF fully automates the selection and refinement of optimal reward functions for downstream policy learning. Empirical results on D4RL demonstrate that PROF surpasses or matches recent strong baselines across numerous datasets and domains, highlighting the effectiveness of our approach.
[AI-86] Gene Incremental Learning for Single-Cell Transcriptomics AAAI2026
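[Quick read]: This paper addresses the lack of systematic study of **gene incremental learning** in single-cell transcriptomics, in particular the forgetting that occurs when tokens (such as genes) are learned incrementally. The key to the solution is to adapt class incremental learning methods to genes as a specific type of token and to build a complete benchmark for evaluating them. Extensive experiments support the soundness of the framework design and the effectiveness of the adapted methods, filling a gap in token-level incremental learning research.
Link: https://arxiv.org/abs/2511.13762
Authors: Jiaxin Qi,Yan Cui,Jianqiang Huang,Gaogang Xie
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: Accepted by AAAI 2026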
Abstract:Classes, as fundamental elements of Computer Vision, have been extensively studied within incremental learning frameworks. In contrast, tokens, which play essential roles in many research fields, exhibit similar characteristics of growth, yet investigations into their incremental learning remain significantly scarce. This research gap primarily stems from the holistic nature of tokens in language, which imposes significant challenges on the design of incremental learning frameworks for them. To overcome this obstacle, in this work, we turn to a type of token, gene, for a large-scale biological dataset–single-cell transcriptomics–to formulate a pipeline for gene incremental learning and establish corresponding evaluations. We found that the forgetting problem also exists in gene incremental learning, thus we adapted existing class incremental learning methods to mitigate the forgetting of genes. Through extensive experiments, we demonstrated the soundness of our framework design and evaluations, as well as the effectiveness of our method adaptations. Finally, we provide a complete benchmark for gene incremental learning in single-cell transcriptomics.
[AI-87] What happens when nanochat meets DiLoCo?
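[Quick read]: This paper studies the model quality trade-offs introduced by training large language models (LLMs) in communication-constrained distributed environments. LLM training usually relies on high-bandwidth interconnects and large compute budgets. Emerging methods aim for efficient training with low communication overhead, but their effect on model quality remains underexplored. The key to the approach is implementing the DiLoCo algorithm as a lightweight wrapper over the nanochat training loop: each worker performs multiple local optimization steps before synchronizing through an outer optimizer, which reduces communication by orders of magnitude. This inner-outer training converges stably in pretraining, but the study finds representation drift that later DDP training does not reverse and that lowers downstream scores on MMLU, GSM8K, and HumanEval.
Link: https://arxiv.org/abs/2511.13761
Authors: Alexander Acker,Soeren Becker,Sasho Nedelkoski,Dominik Scheinert,Odej Kao,Philipp Wiesner
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, technical report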
Abstract:Although LLM training is typically centralized with high-bandwidth interconnects and large compute budgets, emerging methods target communication-constrained training in distributed environments. The model trade-offs introduced by this shift remain underexplored, and our goal is to study them. We use the open-source nanochat project, a compact 8K-line full-stack ChatGPT-like implementation containing tokenization, pretraining, fine-tuning, and serving, as a controlled baseline. We implement the DiLoCo algorithm as a lightweight wrapper over nanochat’s training loop, performing multiple local steps per worker before synchronization with an outer optimizer, effectively reducing communication by orders of magnitude. This inner-outer training is compared against a standard data-parallel (DDP) setup. Because nanochat is small and inspectable, it enables controlled pipeline adaptations and allows direct comparison with the conventional centralized baseline. DiLoCo achieves stable convergence and competitive loss in pretraining but yields worse MMLU, GSM8K, and HumanEval scores after mid-training and SFT. We discover that using DiLoCo-pretrained weights and running mid- and post-training with DDP fails to recover performance, revealing irreversible representation drift from asynchronous updates that impairs downstream alignment. We provide this implementation as an official fork of nanochat on GitHub.
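The inner-outer structure described in the abstract (several local steps per worker, then one synchronization through an outer optimizer) can be sketched on a toy scalar problem. Everything here is an illustrative assumption: plain SGD stands in for the inner optimizer, SGD with momentum for the outer one, and the quadratic per-worker loss replaces real training. It is not the nanochat fork's code.

```python
def local_steps(theta, target, steps, lr):
    """Run `steps` SGD updates on the worker loss 0.5 * (theta - target)^2."""
    for _ in range(steps):
        grad = theta - target
        theta = theta - lr * grad
    return theta

def diloco_round(theta_global, worker_targets, momentum_buf,
                 inner_steps=10, inner_lr=0.1, outer_lr=0.7, outer_momentum=0.5):
    """One communication round: local training, then a single outer update."""
    local = [local_steps(theta_global, t, inner_steps, inner_lr)
             for t in worker_targets]
    # Outer "gradient": how far the averaged workers moved from the global weights.
    outer_grad = theta_global - sum(local) / len(local)
    momentum_buf = outer_momentum * momentum_buf + outer_grad
    theta_global = theta_global - outer_lr * momentum_buf
    return theta_global, momentum_buf

def train(rounds=30):
    theta, buf = 5.0, 0.0
    targets = [1.0, 2.0, 3.0]  # each worker sees a different data shard
    for _ in range(rounds):
        theta, buf = diloco_round(theta, targets, buf)
    return theta
```

With `inner_steps=10`, workers communicate once per ten gradient steps, whereas a data-parallel setup would synchronize gradients at every step. In this toy setting the global parameter approaches the average of the worker optima.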
[AI-88] Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection AAAI AAAI-26
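[Quick read]: This paper addresses the performance bottleneck of offensive content detection on social media in low-resource settings, where scarce labeled data and costly manual annotation constrain model training. The core of the solution is a self-training framework based on collaborative pseudo-labeling. It exploits abundant unlabeled data and uses Multi-Agent Vision-Language Models (MA-VLMs) that simulate two perspectives, moderator and user, to make pseudo-labels more reliable. It also introduces a Positive-Negative-Unlabeled (PNU) loss that jointly optimizes over labeled data, the Agreed-Unknown set, and the Disagreed-Unknown set, mitigating pseudo-label noise and improving detection under limited supervision.
Link: https://arxiv.org/abs/2511.13759
Authors: Han Wang,Deyi Ji,Junyu Lu,Lanyun Zhu,Hailong Zhang,Haiyang Wu,Liqun Liu,Peng Shu,Roy Ka-Wei Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, Fortieth AAAI Conference on Artificial Intelligence (AAAI-26)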
Abstract:Accurate detection of offensive content on social media demands high-quality labeled data; however, such data is often scarce due to the low prevalence of offensive instances and the high cost of manual annotation. To address this low-resource challenge, we propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling. Starting with a lightweight classifier trained on limited labeled data, our method iteratively assigns pseudo-labels to unlabeled instances with the support of Multi-Agent Vision-Language Models (MA-VLMs). Unlabeled data on which the classifier and MA-VLMs agree are designated as the Agreed-Unknown set, while conflicting samples form the Disagreed-Unknown set. To enhance label reliability, MA-VLMs simulate dual perspectives, moderator and user, capturing both regulatory and subjective viewpoints. The classifier is optimized using a novel Positive-Negative-Unlabeled (PNU) loss, which jointly exploits labeled, Agreed-Unknown, and Disagreed-Unknown data while mitigating pseudo-label noise. Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision and approaches the performance of large-scale models.
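The split into Agreed-Unknown and Disagreed-Unknown sets can be sketched as follows. The abstract does not say how the moderator and user views are merged, so this sketch assumes the MA-VLMs already yield one label per sample. The function name is hypothetical.

```python
def split_by_agreement(classifier_labels, vlm_labels):
    """Return (agreed, disagreed) index lists for unlabeled samples."""
    agreed, disagreed = [], []
    for idx, (c, v) in enumerate(zip(classifier_labels, vlm_labels)):
        # Samples where both sources give the same pseudo-label are "agreed".
        (agreed if c == v else disagreed).append(idx)
    return agreed, disagreed
```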
[AI-89] ChemFixer: Correcting Invalid Molecules to Unlock Previously Unseen Chemical Space
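[Quick read]: This paper addresses the tendency of deep learning-based molecular generation models in drug discovery to produce chemically invalid molecules, which limits the usable chemical space and hinders practical application. The key to the solution is ChemFixer, a Transformer-based framework that is pre-trained with masking and fine-tuned on a large-scale dataset of valid/invalid molecular pairs. It corrects invalid molecules into valid structures while preserving the chemical and biological distributional properties of the original outputs, improving validity and expanding the diversity of potential drug candidates.
Link: https://arxiv.org/abs/2511.13758
Authors: Jun-Hyoung Park,Ho-Jun Song,Seong-Whan Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This is the author’s preprint version of the article accepted to IEEE JBHI. Final published version: this https URL . High-quality PDF (publisher version): this https URL . Note: Some figures may appear distorted due to arXiv’s TeXLive rendering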
Abstract:Deep learning-based molecular generation models have shown great potential in efficiently exploring vast chemical spaces by generating potential drug candidates with desired properties. However, these models often produce chemically invalid molecules, which limits the usable scope of the learned chemical space and poses significant challenges for practical applications. To address this issue, we propose ChemFixer, a framework designed to correct invalid molecules into valid ones. ChemFixer is built on a transformer architecture, pre-trained using masking techniques, and fine-tuned on a large-scale dataset of valid/invalid molecular pairs that we constructed. Through comprehensive evaluations across diverse generative models, ChemFixer improved molecular validity while effectively preserving the chemical and biological distributional properties of the original outputs. This indicates that ChemFixer can recover molecules that could not be previously generated, thereby expanding the diversity of potential drug candidates. Furthermore, ChemFixer was effectively applied to a drug-target interaction (DTI) prediction task using limited data, improving the validity of generated ligands and discovering promising ligand-protein pairs. These results suggest that ChemFixer is not only effective in data-limited scenarios, but also extensible to a wide range of downstream tasks. Taken together, ChemFixer shows promise as a practical tool for various stages of deep learning-based drug discovery, enhancing molecular validity and expanding accessible chemical space.
[AI-90] VitalBench: A Rigorous Multi-Center Benchmark for Long-Term Vital Sign Prediction in Intraoperative Care
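[Quick read]: This paper addresses key obstacles to the clinical use of intraoperative vital sign prediction models: the lack of standardized evaluation, incomplete data, and limited cross-center generalization. The core of the solution is VitalBench, a new benchmark that includes vital sign data from over 4,000 surgeries at two independent medical centers and defines three evaluation tracks (complete data, incomplete data, and cross-center generalization) to better reflect the complexity of clinical practice. It incorporates masked loss techniques for robust, unbiased model evaluation with less reliance on extensive preprocessing. It also gives researchers a unified standard for data handling and a platform for model development, with the aim of making intraoperative vital sign prediction models more accurate and more adaptable across clinical settings.
Link: https://arxiv.org/abs/2511.13757
Authors: Xiuding Cai,Xueyao Wang,Sen Wang,Yaoyao Zhu,Jiao Chen,Yu Yao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE Sensors Journal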
Abstract:Intraoperative monitoring and prediction of vital signs are critical for ensuring patient safety and improving surgical outcomes. Despite recent advances in deep learning models for medical time-series forecasting, several challenges persist, including the lack of standardized benchmarks, incomplete data, and limited cross-center validation. To address these challenges, we introduce VitalBench, a novel benchmark specifically designed for intraoperative vital sign prediction. VitalBench includes data from over 4,000 surgeries across two independent medical centers, offering three evaluation tracks: complete data, incomplete data, and cross-center generalization. This framework reflects the real-world complexities of clinical practice, minimizing reliance on extensive preprocessing and incorporating masked loss techniques for robust and unbiased model evaluation. By providing a standardized and unified platform for model development and comparison, VitalBench enables researchers to focus on architectural innovation while ensuring consistency in data handling. This work lays the foundation for advancing predictive models for intraoperative vital sign forecasting, ensuring that these models are not only accurate but also robust and adaptable across diverse clinical environments. Our code and data are available at this https URL.
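A masked loss of the kind mentioned in the abstract can be sketched as a mean squared error that skips missing observations. This is a generic illustration, not VitalBench's implementation. The function name and the 0/1 mask convention are assumptions.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask is truthy (value observed)."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:  # missing targets (mask == 0) never enter the loss
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0
```

Because masked positions are never read, missing vital sign samples do not need to be imputed before evaluation. That matches the abstract's stated aim of reducing reliance on preprocessing.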
[AI-91] Multi-Horizon Time Series Forecasting of non-parametric CDFs with Deep Lattice Networks
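[Quick read]: This paper addresses probabilistic forecasting for time series, specifically how to produce a complete, non-parametric, monotonically increasing cumulative distribution function (CDF) that captures sudden changes better than point prediction. The core challenge is that quantile regression often produces quantile crossovers, which break the monotonicity of the CDF so that it is no longer a legitimate probability distribution. The key to the solution is an adaptation of Deep Lattice Networks (DLN): long short-term memory (LSTM) units serve as the embedding layer that extracts temporal features, and the quantile inputs are spread to all sub-lattices of the DLN, whose monotonicity constraints prevent quantile crossovers and yield an implicit, complete, non-parametric CDF over multiple horizons. Experiments on day-ahead, hourly solar irradiance forecasting show that the method performs as well as or better than an unconstrained model and better than a scalable monotonic neural network.
Link: https://arxiv.org/abs/2511.13756
Authors: Niklas Erdmann,Lars Bentsen,Roy Stenbro,Heine Nygard Riise,Narada Dilp Warakagoda,Paal E. Engelstad
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: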
Abstract:Probabilistic forecasting is not only a way to add more information to a prediction of the future, but it also builds on weaknesses in point prediction. Sudden changes in a time series can still be captured by a cumulative distribution function (CDF), while a point prediction is likely to miss it entirely. The modeling of CDFs within forecasts has historically been limited to parametric approaches, but due to recent advances, this no longer has to be the case. We aim to advance the fields of probabilistic forecasting and monotonic networks by connecting them and propose an approach that permits the forecasting of implicit, complete, and nonparametric CDFs. For this purpose, we propose an adaptation to deep lattice networks (DLN) for monotonically constrained simultaneous/implicit quantile regression in time series forecasting. Quantile regression usually produces quantile crossovers, which need to be prevented to achieve a legitimate CDF. By leveraging long short term memory units (LSTM) as the embedding layer, and spreading quantile inputs to all sub-lattices of a DLN with an extended output size, we can produce a multi-horizon forecast of an implicit CDF due to the monotonic constraintability of DLNs that prevent quantile crossovers. We compare and evaluate our approach’s performance to relevant state of the art within the context of a highly relevant application of time series forecasting: Day-ahead, hourly forecasts of solar irradiance observations. Our experiments show that the adaptation of a DLN performs just as well or even better than an unconstrained approach. Further comparison of the adapted DLN against a scalable monotonic neural network shows that our approach performs better. With this adaptation of DLNs, we intend to create more interest and crossover investigations in techniques of monotonic neural networks and probabilistic forecasting.
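The quantile crossover problem can be made concrete with a small sketch: the pinball loss normally used to train quantile regressors, and a check for whether predicted quantiles are non-decreasing in the quantile level. This is a generic illustration with hypothetical function names, not the DLN adaptation described in the abstract.

```python
def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for quantile level tau in (0, 1)."""
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1.0) * diff)

def has_quantile_crossing(quantile_preds):
    """quantile_preds must be ordered by increasing quantile level."""
    return any(b < a for a, b in zip(quantile_preds, quantile_preds[1:]))
```

Predictions such as `[1.0, 3.0, 2.0]` for levels 0.1, 0.5, and 0.9 cross, so they do not define a valid CDF. A monotonically constrained model rules such outputs out by construction.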
[AI-92] Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement
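[Quick read]: This paper addresses imbalanced optimization caused by modality bias in multimodal learning: the advantaged modality dominates gradient updates during backpropagation over the long term, which weakens the coupling between modality representations and outputs and lets redundant information accumulate. The key to the solution is Adaptive Redundancy Regulation (RedReg), inspired by the information bottleneck principle. A redundancy phase monitor triggers intervention only when redundancy is high. A co-information gating mechanism estimates the contribution of the current dominant modality from cross-modal semantics and automatically disables the suppression term when the task relies mainly on a single modality, preserving modality-specific information. Finally, the gradient of the dominant modality is projected onto the orthogonal complement of the joint multimodal gradient subspace and suppressed according to the degree of redundancy, giving more balanced information refinement and optimization.
Link: https://arxiv.org/abs/2511.13755
Authors: Zhe Yang,Wenrui Li,Hongtao Chen,Penghong Wang,Ruiqin Xiong,Xiaopeng Fan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: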
Abstract:Multimodal learning aims to improve performance by leveraging data from multiple sources. During joint multimodal training, due to modality bias, the advantaged modality often dominates backpropagation, leading to imbalanced optimization. Existing methods still face two problems: First, the long-term dominance of the dominant modality weakens representation-output coupling in the late stages of training, resulting in the accumulation of redundant information. Second, previous methods often directly and uniformly adjust the gradients of the advantaged modality, ignoring the semantics and directionality between modalities. To address these limitations, we propose Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement (RedReg), which is inspired by information bottleneck principle. Specifically, we construct a redundancy phase monitor that uses a joint criterion of effective gain growth rate and redundancy to trigger intervention only when redundancy is high. Furthermore, we design a co-information gating mechanism to estimate the contribution of the current dominant modality based on cross-modal semantics. When the task primarily relies on a single modality, the suppression term is automatically disabled to preserve modality-specific information. Finally, we project the gradient of the dominant modality onto the orthogonal complement of the joint multimodal gradient subspace and suppress the gradient according to redundancy. Experiments show that our method demonstrates superiority among current major methods in most scenarios. Ablation experiments verify the effectiveness of our method. The code is available at this https URL
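The projection step in the abstract can be illustrated in the simplest case. The sketch assumes the joint gradient subspace is spanned by a single vector. How RedReg actually builds that subspace and scales the suppression is not stated in the abstract, so the function below is only a generic orthogonal projection with a hypothetical name.

```python
def project_to_orthogonal_complement(grad, basis_vec):
    """Remove from grad its component along basis_vec (a 1-D subspace)."""
    norm_sq = sum(b * b for b in basis_vec)
    if norm_sq == 0.0:
        return list(grad)  # nothing to project out
    coeff = sum(g * b for g, b in zip(grad, basis_vec)) / norm_sq
    return [g - coeff * b for g, b in zip(grad, basis_vec)]
```

The result has zero inner product with `basis_vec`, so it carries only the part of the dominant-modality gradient that lies outside the assumed joint direction.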
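[AI-93] Robustness of LLM-enabled vehicle trajectory prediction under data security threats
[Quick read]: This paper addresses the under-studied robustness of LLM-driven vehicle trajectory prediction models in automated driving systems, in particular their vulnerability to adversarial attacks. The key to the approach is a one-feature differential evolution attack that, in a black-box setting, perturbs a single kinematic feature of the surrounding vehicles in the input prompt in order to assess the model's vulnerabilities systematically. Experiments show that even minor, physically plausible perturbations can significantly disrupt model outputs, revealing how susceptible LLM-based predictors are to adversarial manipulation. The paper further analyzes the trade-off between accuracy and robustness, examines the failure mechanism, and explores potential mitigations, offering guidance for designing LLM-based intelligent transportation systems for safety-critical scenarios.
Link: https://arxiv.org/abs/2511.13753
Authors: Feilong Wang,Fuqiang Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 20 pages, 2 figures, 11 tables, working paper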
Abstract:The integration of large language models (LLMs) into automated driving systems has opened new possibilities for reasoning and decision-making by transforming complex driving contexts into language-understandable representations. Recent studies demonstrate that fine-tuned LLMs can accurately predict vehicle trajectories and lane-change intentions by gathering and transforming data from surrounding vehicles. However, the robustness of such LLM-based prediction models for safety-critical driving systems remains unexplored, despite the increasing concerns about the trustworthiness of LLMs. This study addresses this gap by conducting a systematic vulnerability analysis of LLM-enabled vehicle trajectory prediction. We propose a one-feature differential evolution attack that perturbs a single kinematic feature of surrounding vehicles within the LLM’s input prompts under a black-box setting. Experiments on the highD dataset reveal that even minor, physically plausible perturbations can significantly disrupt model outputs, underscoring the susceptibility of LLM-based predictors to adversarial manipulation. Further analyses reveal a trade-off between accuracy and robustness, examine the failure mechanism, and explore potential mitigation solutions. The findings provide the very first insights into adversarial vulnerabilities of LLM-driven automated vehicle models in the context of vehicular interactions and highlight the need for robustness-oriented design in future LLM-based intelligent transportation systems.
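As a rough illustration of how a one-feature, black-box differential evolution search works, the sketch below perturbs one entry of a numeric feature vector within a bound and keeps perturbations that increase prediction error. The `predict` callable, the numeric features, and all parameter values are assumptions for a toy robustness check. The paper's actual prompt construction and objective are not given in the abstract.

```python
import random

def one_feature_de_attack(predict, features, index, bound, truth,
                          pop_size=12, generations=30, scale=0.6, seed=0):
    """Search a bounded perturbation of features[index] that maximizes error."""
    rng = random.Random(seed)

    def error(delta):
        perturbed = list(features)
        perturbed[index] += delta
        return abs(predict(perturbed) - truth)  # black box: outputs only

    pop = [rng.uniform(-bound, bound) for _ in range(pop_size)]
    fit = [error(d) for d in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # DE/rand/1 mutation from three distinct other candidates.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            trial = pop[a] + scale * (pop[b] - pop[c])
            trial = max(-bound, min(bound, trial))  # keep it physically bounded
            trial_fit = error(trial)
            if trial_fit >= fit[i]:  # greedy selection
                pop[i], fit[i] = trial, trial_fit
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

Only model outputs are queried, which is what makes the setting black-box. The `bound` argument stands in for the requirement that perturbations stay small and physically plausible.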
[AI-94] Motor Imagery Classification Using Feature Fusion of Spatially Weighted Electroencephalography
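[Quick read]: This paper addresses the high computational complexity that multichannel electroencephalography (EEG) signals impose on Brain Computer Interface (BCI) systems, while also improving the accuracy of motor imagery (MI) task classification. The key to the solution is brain region-specific channel selection combined with multi-domain feature fusion. EEG channels are first grouped by functional relevance, and only channels from the brain regions involved in the MI tasks are kept, reducing data dimensionality and improving computational efficiency. Three feature extraction methods, Common Spatial Pattern (CSP), Fuzzy C-means clustering, and Tangent Space Mapping (TSM), are then applied to each channel group to capture different characteristics of the EEG signal, and the combined features are classified with a Support Vector Machine (SVM). The approach reaches accuracies of 90.77% and 84.50% on the public benchmark datasets IVA and I, respectively.
Link: https://arxiv.org/abs/2511.13752
Authors: Abdullah Al Shiam,Md. Khademul Islam Molla,Abu Saleh Musa Miah,Md. Abdus Samad Kamal
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: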
Abstract:A Brain Computer Interface (BCI) connects the human brain to the outside world, providing a direct communication channel. Electroencephalography (EEG) signals are commonly used in BCIs to reflect cognitive patterns related to motor function activities. However, due to the multichannel nature of EEG signals, explicit information processing is crucial to lessen computational complexity in BCI systems. This study proposes an innovative method based on brain region-specific channel selection and multi-domain feature fusion to improve classification accuracy. The novelty of the proposed approach lies in region-based channel selection, where EEG channels are grouped according to their functional relevance to distinct brain regions. By selecting channels based on specific regions involved in motor imagery (MI) tasks, this technique eliminates irrelevant channels, reducing data dimensionality and improving computational efficiency. This also ensures that the extracted features are more reflective of the brain's actual activity related to motor tasks. Three distinct feature extraction methods, Common Spatial Pattern (CSP), Fuzzy C-means clustering, and Tangent Space Mapping (TSM), are applied to each group of channels based on their brain region. Each method targets different characteristics of the EEG signal: CSP focuses on spatial patterns, Fuzzy C-means identifies clusters within the data, and TSM captures non-linear patterns in the signal. The combined feature vector is used to classify motor imagery tasks (left hand, right hand, and right foot) using a Support Vector Machine (SVM). The proposed method was validated on publicly available benchmark EEG datasets (IVA and I) from the BCI competition III and IV. The results show that the approach outperforms existing methods, achieving classification accuracies of 90.77% and 84.50% for datasets IVA and I, respectively.
[AI-95] SCALEX: Scalable Concept and Latent Exploration for Diffusion Models
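[Quick read]: This paper addresses the difficulty of analyzing social biases (such as gender, race, and profession stereotypes) in diffusion models in an automated, scalable way. Existing methods are either restricted to predefined categories or depend on manual interpretation of latent directions, which makes subtle or unanticipated bias patterns hard to find. The key to the solution is the SCALEX framework, which extracts semantically meaningful latent directions from H-space using only natural language prompts, enabling zero-shot interpretation without retraining or labelling. This supports systematic comparison across concepts and large-scale discovery of internal associations, making bias analysis more scalable, interpretable, and extensible.
Link: https://arxiv.org/abs/2511.13750
Authors: E. Zhixuan Zeng,Yuhao Chen,Alexander Wong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: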
Abstract:Image generation models frequently encode social biases, including stereotypes tied to gender, race, and profession. Existing methods for analyzing these biases in diffusion models either focus narrowly on predefined categories or depend on manual interpretation of latent directions. These constraints limit scalability and hinder the discovery of subtle or unanticipated patterns. We introduce SCALEX, a framework for scalable and automated exploration of diffusion model latent spaces. SCALEX extracts semantically meaningful directions from H-space using only natural language prompts, enabling zero-shot interpretation without retraining or labelling. This allows systematic comparison across arbitrary concepts and large-scale discovery of internal model associations. We show that SCALEX detects gender bias in profession prompts, ranks semantic alignment across identity descriptors, and reveals clustered conceptual structure without supervision. By linking prompts to latent directions directly, SCALEX makes bias analysis in diffusion models more scalable, interpretable, and extensible than prior approaches.
[AI-96] DeepDefense: Layer-Wise Gradient-Feature Alignment for Building Robust Neural Networks
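[Quick read]: This paper addresses the vulnerability of deep neural networks to adversarial perturbations, that is, small but carefully crafted input changes that cause incorrect predictions. The key to the solution is DeepDefense, a defense framework that applies Gradient-Feature Alignment (GFA) regularization across multiple network layers. Aligning input gradients with internal feature representations suppresses loss variation in the tangential directions that adversarial attacks rely on, giving a smoother loss landscape and stronger decision boundaries. The method is architecture-agnostic and improves robustness under both gradient-based and optimization-based attacks.
Link: https://arxiv.org/abs/2511.13749
Authors: Ci Lin,Tet Yeap,Iluju Kiringa,Biwei Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: no available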
Abstract:Deep neural networks are known to be vulnerable to adversarial perturbations, which are small and carefully crafted inputs that lead to incorrect predictions. In this paper, we propose DeepDefense, a novel defense framework that applies Gradient-Feature Alignment (GFA) regularization across multiple layers to suppress adversarial vulnerability. By aligning input gradients with internal feature representations, DeepDefense promotes a smoother loss landscape in tangential directions, thereby reducing the model’s sensitivity to adversarial noise. We provide theoretical insights into how adversarial perturbation can be decomposed into radial and tangential components and demonstrate that alignment suppresses loss variation in tangential directions, where most attacks are effective. Empirically, our method achieves significant improvements in robustness across both gradient-based and optimization-based attacks. For example, on CIFAR-10, CNN models trained with DeepDefense outperform standard adversarial training by up to 15.2% under APGD attacks and 24.7% under FGSM attacks. Against optimization-based attacks such as DeepFool and EADEN, DeepDefense requires 20 to 30 times higher perturbation magnitudes to cause misclassification, indicating stronger decision boundaries and a flatter loss landscape. Our approach is architecture-agnostic, simple to implement, and highly effective, offering a promising direction for improving the adversarial robustness of deep learning models.
[AI-97] Deep reinforcement learning-based spacecraft attitude control with pointing keep-out constraint
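[Quick read]: This paper addresses spacecraft attitude reorientation control under a single pointing keep-out zone constraint. The key to the solution is deep reinforcement learning (DRL) with the Soft Actor-Critic (SAC) algorithm. A new state representation explicitly includes a compact representation of the attitude constraint zone, the reward function is formulated to achieve the control objective while enforcing the attitude constraint, and a curriculum learning approach is used to train the agent.
Link: https://arxiv.org/abs/2511.13746
Authors: Juntang Yang,Mohamed Khalil Ben-Larbi
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: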
Abstract:This paper implements deep reinforcement learning (DRL) for spacecraft reorientation control with a single pointing keep-out zone. The Soft Actor-Critic (SAC) algorithm is adopted to handle continuous state and action space. A new state representation is designed to explicitly include a compact representation of the attitude constraint zone. The reward function is formulated to achieve the control objective while enforcing the attitude constraint. A curriculum learning approach is used for the agent training. Simulation results demonstrate the effectiveness of the proposed DRL-based method for spacecraft pointing-constrained attitude control.
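A pointing keep-out zone is commonly expressed as a cone around a forbidden direction (for example, toward a bright object) that an instrument boresight must stay out of. The sketch below checks that geometric condition. It assumes a conical zone and hypothetical names, and it does not reproduce the paper's state representation or reward.

```python
import math

def keep_out_violated(boresight, forbidden_dir, half_angle_deg):
    """True if boresight lies inside the cone around forbidden_dir."""
    dot = sum(a * b for a, b in zip(boresight, forbidden_dir))
    norm_a = math.sqrt(sum(a * a for a in boresight))
    norm_b = math.sqrt(sum(b * b for b in forbidden_dir))
    cos_angle = dot / (norm_a * norm_b)
    # Inside the cone means the angle is smaller than the half-angle,
    # i.e. its cosine is larger than cos(half_angle).
    return cos_angle > math.cos(math.radians(half_angle_deg))
```

A reward function could, for instance, penalize steps where this check returns `True`. That is one possible design and is not confirmed by the abstract.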
[AI-98] Review of Passenger Flow Modelling Approaches Based on a Bibliometric Analysis
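[Quick read]: This paper addresses the unclear development trajectory, method evolution, and insufficiently identified challenges of research on short-term passenger flow forecasting. The key to the approach is combining bibliometric analysis with a citation network variant and topic modelling to trace 814 publications from 1984 to 2024. The analysis reveals a shift from conventional statistical and machine learning methods (e.g., ARIMA, SVM, and basic neural networks) to specialised deep learning architectures, and it identifies and quantifies research gaps including constrained data fusion, scarce open multivariate data, limited model interpretability, cost-efficiency, and the balance between algorithmic performance and practical deployment. It also connects the field to the broader areas of machine learning and time series modelling and notes the growing relevance of foundation models.
Link: https://arxiv.org/abs/2511.13742
Authors: Jonathan Hecht,Weilian Li,Ziyue Li,Youness Dehbi
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: under the review of IEEE Transactions on Intelligent Transportation Systems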
Abstract:This paper presents a bibliometric analysis of the field of short-term passenger flow forecasting within local public transit, covering 814 publications that span from 1984 to 2024. In addition to common bibliometric analysis tools, a variant of a citation network was developed, and topic modelling was conducted. The analysis reveals that research activity exhibited sporadic patterns prior to 2008, followed by a marked acceleration, characterised by a shift from conventional statistical and machine learning methodologies (e.g., ARIMA, SVM, and basic neural networks) to specialised deep learning architectures. Based on this insight, a connection to more general fields such as machine learning and time series modelling was established. In addition to modelling, spatial, linguistic, and modal biases were identified and findings from existing secondary literature were validated and quantified. This revealed existing gaps, such as constrained data fusion, open (multivariate) data, and underappreciated challenges related to model interpretability, cost-efficiency, and a balance between algorithmic performance and practical deployment considerations. In connection with the superordinate fields, the growth in relevance of foundation models is also noteworthy.
[AI-99] AI Kill Switch for malicious web-based LLM agent
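[Quick read]: This paper addresses the risk that web-based Large Language Model (LLM) agents, which autonomously carry out complex tasks, are misused maliciously, for example to collect personally identifiable information (PII) without authorization, generate socially divisive content, or automate web hacking. The key to the solution is an "AI Kill Switch" technique called AutoGuard. Its core idea is to generate defensive prompts that trigger the internal safety mechanisms of a malicious LLM agent and to embed them transparently in the website's DOM (Document Object Model), so that they are invisible to human users but are picked up by the agent's crawling process, which then aborts the malicious action. Experiments report a Defense Success Rate (DSR) above 80% across several mainstream LLM agents, with good generalization across models.
Link: https://arxiv.org/abs/2511.13725
Authors: Sechan Lee,Sangdon Park
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: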
Abstract:Recently, web-based Large Language Model (LLM) agents autonomously perform increasingly complex tasks, thereby bringing significant convenience. However, they also amplify the risks of malicious misuse cases such as unauthorized collection of personally identifiable information (PII), generation of socially divisive content, and even automated web hacking. To address these threats, we propose an AI Kill Switch technique that can immediately halt the operation of malicious web-based LLM agents. To achieve this, we introduce AutoGuard - the key idea is generating defensive prompts that trigger the safety mechanisms of malicious LLM agents. In particular, generated defense prompts are transparently embedded into the website’s DOM so that they remain invisible to human users but can be detected by the crawling process of malicious agents, triggering its internal safety mechanisms to abort malicious actions once read. To evaluate our approach, we constructed a dedicated benchmark consisting of three representative malicious scenarios (PII collection, social rift content generation, and web hacking attempts). Experimental results show that the AutoGuard method achieves over 80% Defense Success Rate (DSR) on malicious agents, including GPT-4o, Claude-3, and Llama3.3-70B-Instruct. It also maintains strong performance, achieving around 90% DSR on GPT-5, GPT-4.1, and Gemini-2.5-Flash when used as the malicious agent, demonstrating robust generalization across models and scenarios. Through this research, we have demonstrated the controllability of web-based LLM agents across various scenarios and models, thereby contributing to the broader effort of AI control and safety.
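Embedding a prompt in the DOM so that it is invisible to human visitors could look roughly like the sketch below. The abstract does not say which hiding mechanism AutoGuard uses, so the `display:none` container is an assumption. The function name is hypothetical as well.

```python
import html

def embed_hidden_prompt(page_html, prompt):
    """Insert an escaped, visually hidden prompt just before </body>."""
    block = ('<div style="display:none" aria-hidden="true">'
             + html.escape(prompt) + '</div>')
    pos = page_html.lower().rfind("</body>")
    if pos == -1:
        return page_html + block  # no body tag: append at the end
    return page_html[:pos] + block + page_html[pos:]
```

A browser does not render the block, but a crawler that reads the raw HTML still sees its text. That is the property the abstract relies on.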
[AI-100] Preparation Meets Opportunity: Enhancing Data Preprocessing for ML Training With Seneca FAST’26
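【Quick Read】: This paper addresses the input data preprocessing bottleneck that arises when multimedia machine learning (ML) models are trained concurrently on modern systems. The key to the solution is Seneca, a data loading system built on two techniques. First, it uses a performance model of the data pipeline to optimally partition the cache among three forms of data: encoded, decoded, and augmented. Second, during random batch sampling it opportunistically serves cached data in preference to uncached data, so that concurrent jobs benefit from each other. Compared to PyTorch, Seneca reduces the makespan by 45.23%, and it increases data processing throughput by up to 3.45x over the next best dataloader.
Link: https://arxiv.org/abs/2511.13724
Authors: Omkar Desai (Syracuse University), Ziyang Jiao (Syracuse University), Shuyi Pei (Samsung Semiconductor Inc.), Janki Bhimani (Florida International University), Bryan S. Kim (Syracuse University)
Affiliations: Syracuse University; Samsung Semiconductor Inc.; Florida International University
Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 15 figures, To be published in the 24th USENIX Conference on File and Storage Technologies (FAST '26)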
Abstract:Input data preprocessing is a common bottleneck when concurrently training multimedia machine learning (ML) models in modern systems. To alleviate these bottlenecks and reduce the training time for concurrent jobs, we present Seneca, a data loading system that optimizes cache partitioning and data sampling for the data storage and ingestion (DSI) pipeline. The design of Seneca contains two key techniques. First, Seneca uses a performance model for the data pipeline to optimally partition the cache for three different forms of data (encoded, decoded, and augmented). Second, Seneca opportunistically serves cached data over uncached ones during random batch sampling so that concurrent jobs benefit from each other. We implement Seneca by modifying PyTorch and demonstrate its effectiveness by comparing it against several state-of-the-art caching systems for DNN training. Seneca reduces the makespan by 45.23% compared to PyTorch and increases data processing throughput by up to 3.45x compared to the next best dataloader.
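The second technique in the abstract, preferring cached samples during random batch sampling, can be illustrated with a minimal Python sketch. This is not Seneca's implementation: the function name `sample_batch`, the set-based cache, and the policy of taking every available cache hit before any miss are assumptions made purely for illustration.

```python
import random


def sample_batch(remaining, cached, batch_size, rng):
    """Draw one batch from `remaining`, preferring ids that are in `cached`.

    remaining: set of sample ids not yet served in this epoch (mutated here).
    cached:    set of sample ids currently held in the cache (hypothetical).
    """
    # Split the unserved samples into cache hits and cache misses.
    hits = [i for i in sorted(remaining) if i in cached]
    misses = [i for i in sorted(remaining) if i not in cached]
    # Keep the order random within each group.
    rng.shuffle(hits)
    rng.shuffle(misses)
    # Cache hits are served first; misses only fill the leftover slots.
    batch = (hits + misses)[:batch_size]
    for i in batch:
        remaining.remove(i)  # each sample is still served once per epoch
    return batch


remaining = set(range(10))
cached = {2, 5, 7}
batch = sample_batch(remaining, cached, 4, random.Random(0))
print(batch)
```

In this sketch every sample is still visited exactly once per epoch; only the order is biased toward data that some concurrent job has already loaded.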
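[AI-101] From Legacy Fortran to Portable Kokkos: An Autonomous Agentic AI Workflow
【Quick Read】: This paper addresses the portability problem that legacy Fortran code in scientific computing faces when migrating to heterogeneous GPU-accelerated architectures, particularly where accelerators lack native Fortran bindings. The core challenge is to convert traditional parallel Fortran code efficiently and automatically into performance-portable Kokkos C++ implementations that run across diverse high-performance computing (HPC) hardware. The key to the solution is an agentic AI workflow in which multiple large language model (LLM) "agents" collaborate to translate, validate, compile, run, test, debug, and optimize the code, automating the end-to-end modernization of Fortran kernels. Experiments show that paid OpenAI models such as GPT-5 and o4-mini-high completed the workflow for only a few U.S. dollars and generated optimized Kokkos code that surpassed the Fortran baselines, whereas open-source models such as Llama4-Maverick often failed to produce functional code.
Link: https://arxiv.org/abs/2509.12443
Authors: Sparsh Gupta, Kamalavasan Kamalakkannan, Maxim Moraru, Galen Shipman, Patrick Diehl
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, 7 tables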
Abstract:Scientific applications continue to rely on legacy Fortran codebases originally developed for homogeneous, CPU-based systems. As High-Performance Computing (HPC) shifts toward heterogeneous GPU-accelerated architectures, many accelerators lack native Fortran bindings, creating an urgent need to modernize legacy codes for portability. Frameworks like Kokkos provide performance portability and a single-source C++ abstraction, but manual Fortran-to-Kokkos porting demands significant expertise and time. Large language models (LLMs) have shown promise in source-to-source code generation, yet their use in fully autonomous workflows for translating and optimizing parallel code remains largely unexplored, especially for performance portability across diverse hardware. This paper presents an agentic AI workflow where specialized LLM “agents” collaborate to translate, validate, compile, run, test, debug, and optimize Fortran kernels into portable Kokkos C++ programs. Results show the pipeline modernizes a range of benchmark kernels, producing performance-portable Kokkos codes across hardware partitions. Paid OpenAI models such as GPT-5 and o4-mini-high executed the workflow for only a few U.S. dollars, generating optimized codes that surpassed Fortran baselines, whereas open-source models like Llama4-Maverick often failed to yield functional codes. This work demonstrates the feasibility of agentic AI for Fortran-to-Kokkos transformation and offers a pathway for autonomously modernizing legacy scientific applications to run portably and efficiently on diverse supercomputers. It further highlights the potential of LLM-driven agentic systems to perform structured, domain-specific reasoning tasks in scientific and systems-oriented applications.
[AI-102] Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models
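【Quick Read】: This paper addresses two practical bottlenecks that DNA large language models (DNA LLMs) face on ultra-long genomic sequences: the quadratic computational cost of self-attention, O(N^2), and the memory consumed by the key-value (KV) cache, which grows with sequence length during autoregressive decoding. These constraints force existing methods to rely on heuristics such as fixed-window truncation or sliding windows, which discard long-range dependencies and reduce modeling fidelity. The key to the solution is FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a pluggable progressive context-compression module. It inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, keeping only the KV states of summary tokens across windows and discarding those of ordinary tokens. A shared-boundary windowing scheme provides a stationary interface for passing information between windows. Together these reduce effective inference scaling from O(N^2) to near-linear O(N) and enable about 100x longer inference windows on commodity GPUs with near-lossless fidelity.
Link: https://arxiv.org/abs/2511.14694
Authors: Rui Zhu, Xiaopu Zhou, Haixu Tang, Stephen W. Scherer, Lucila Ohno-Machado
Affiliations: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Comments: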
Abstract:Trained on massive cross-species DNA corpora, DNA large language models (LLMs) learn the fundamental “grammar” and evolutionary patterns of genomic sequences. This makes them powerful priors for DNA sequence modeling, particularly over long ranges. However, two major constraints hinder their use in practice: the quadratic computational cost of self-attention and the growing memory required for key-value (KV) caches during autoregressive decoding. These constraints force the use of heuristics such as fixed-window truncation or sliding windows, which compromise fidelity on ultra-long sequences by discarding distant information. We introduce FOCUS (Feature-Oriented Compression for Ultra-long Self-attention), a progressive context-compression module that can be plugged into pretrained DNA LLMs. FOCUS combines the established k-mer representation in genomics with learnable hierarchical compression: it inserts summary tokens at k-mer granularity and progressively compresses attention key and value activations across multiple Transformer layers, retaining only the summary KV states across windows while discarding ordinary-token KV. A shared-boundary windowing scheme yields a stationary cross-window interface that propagates long-range information with minimal loss. We validate FOCUS on an Evo-2-based DNA LLM fine-tuned on GRCh38 chromosome 1 with self-supervised training and randomized compression schedules to promote robustness across compression ratios. On held-out human chromosomes, FOCUS achieves near-lossless fidelity: compressing a 1 kb context into only 10 summary tokens (about 100x) shifts the average per-nucleotide probability by only about 0.0004. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N), enabling about 100x longer inference windows on commodity GPUs with near-lossless fidelity.
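The "about 100x" figure in the abstract can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope count, not the FOCUS implementation: it assumes one token per nucleotide, counts KV entries per layer and head as a single unit, and uses a hypothetical helper `retained_kv_entries` in which every completed window keeps only its summary tokens while the current partial window keeps all of its ordinary tokens.

```python
def retained_kv_entries(n_tokens, window, n_summary):
    """Count KV entries kept after processing `n_tokens` tokens.

    Assumption: each completed window of `window` tokens is reduced to
    `n_summary` summary-token KV states; the unfinished tail window still
    holds one KV state per ordinary token.
    """
    full_windows, tail = divmod(n_tokens, window)
    return full_windows * n_summary + tail


n = 100_000                      # tokens processed so far
baseline = n                     # uncompressed cache: one KV entry per token
compressed = retained_kv_entries(n, window=1000, n_summary=10)
print(baseline / compressed)     # ratio of cache sizes under these assumptions
```

With a 1,000-token window reduced to 10 summary tokens, the retained cache is 1/100 of the uncompressed one whenever the sequence length is a multiple of the window, which matches the compression ratio quoted in the abstract.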
[AI-103] Active Matter as a framework for living systems-inspired Robophysics
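【Quick Read】: This paper addresses the efficiency limits that robot collectives (robot swarms) face in complex real-world environments: locomotion remains difficult for individual units, and at the collective level swarms struggle to achieve shared purpose, coordination, communication, and cost efficiency. As a perspective article, it surveys these challenges and highlights recent work that incorporates principles from active-matter physics and biology into the modeling and design of robot swarms.
Link: https://arxiv.org/abs/2511.14624
Authors: Giulia Janzen, Gaia Maselli, Juan F. Jimenez, Lia Garcia-Perez, D A Matoz Fernandez, Chantal Valeriani
Affiliations: Unknown
Categories: Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: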
Abstract:Robophysics investigates the physical principles that govern living-like robots operating in complex, real-world environments. Despite remarkable technological advances, robots continue to face fundamental efficiency limitations. At the level of individual units, locomotion remains a challenge, while at the collective level, robot swarms struggle to achieve shared purpose, coordination, communication, and cost efficiency. This perspective article examines the key challenges faced by bio-inspired robotic collectives and highlights recent research efforts that incorporate principles from active-matter physics and biology into the modeling and design of robot swarms.
[AI-104] Apo2Mol: 3D Molecule Generation via Dynamic Pocket-Aware Diffusion Models AAAI2026
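【Quick Read】: This paper addresses a common limitation in structure-based drug design: existing generative models usually assume a rigid protein binding pocket, ignoring the protein's intrinsic flexibility and the conformational changes induced by ligand binding, which limits their use in practical drug discovery. The key to the solution is Apo2Mol, a diffusion-based framework for 3D molecule generation that explicitly models the conformational flexibility of the binding pocket. Its main contributions are a curated dataset of over 24,000 experimentally resolved apo-holo structure pairs that characterizes the structural changes associated with ligand binding, and a full-atom hierarchical graph-based diffusion model that, given an apo state as input, simultaneously generates a high-affinity ligand and the corresponding holo pocket conformation.
Link: https://arxiv.org/abs/2511.14559
Authors: Xinzhe Zheng, Shiyu Jiang, Gustavo Seabra, Chenglong Li, Yanjun Li
Affiliations: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Accepted by AAAI 2026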
Abstract:Deep generative models are rapidly advancing structure-based drug design, offering substantial promise for generating small molecule ligands that bind to specific protein targets. However, most current approaches assume a rigid protein binding pocket, neglecting the intrinsic flexibility of proteins and the conformational rearrangements induced by ligand binding, limiting their applicability in practical drug discovery. Here, we propose Apo2Mol, a diffusion-based generative framework for 3D molecule design that explicitly accounts for conformational flexibility in protein binding pockets. To support this, we curate a dataset of over 24,000 experimentally resolved apo-holo structure pairs from the Protein Data Bank, enabling the characterization of protein structure changes associated with ligand binding. Apo2Mol employs a full-atom hierarchical graph-based diffusion model that simultaneously generates 3D ligand molecules and their corresponding holo pocket conformations from input apo states. Empirical studies demonstrate that Apo2Mol can achieve state-of-the-art performance in generating high-affinity ligands and accurately capture realistic protein pocket conformational changes.
[AI-105] DecNefLab: A Modular and Interpretable Simulation Framework for Decoded Neurofeedback
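【Quick Read】: This paper addresses three core obstacles in Decoded Neurofeedback (DecNef) research: large subject-dependent variability in learning, the lack of direct measures of learning progress, and the high cost and time demands of experiments. The key to the solution is DecNefLab, a modular and interpretable simulation framework that formalizes DecNef as a machine learning problem. It uses latent variable generative models as simulated participants, which makes internal cognitive states directly observable and allows systematic evaluation of how protocol designs and subject characteristics influence learning. The approach reproduces empirical phenomena of DecNef learning, identifies conditions under which feedback fails to induce learning, and guides the design of more robust protocols in silico before human implementation, bridging computational modeling and cognitive neuroscience.
Link: https://arxiv.org/abs/2511.14555
Authors: Alexander Olza, Roberto Santana, David Soto
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: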
Abstract:Decoded Neurofeedback (DecNef) is a flourishing non-invasive approach to brain modulation with wide-ranging applications in neuromedicine and cognitive neuroscience. However, progress in DecNef research remains constrained by subject-dependent learning variability, reliance on indirect measures to quantify progress, and the high cost and time demands of experimentation. We present DecNefLab, a modular and interpretable simulation framework that formalizes DecNef as a machine learning problem. Beyond providing a virtual laboratory, DecNefLab enables researchers to model, analyze and understand neurofeedback dynamics. Using latent variable generative models as simulated participants, DecNefLab allows direct observation of internal cognitive states and systematic evaluation of how different protocol designs and subject characteristics influence learning. We demonstrate how this approach can (i) reproduce empirical phenomena of DecNef learning, (ii) identify conditions under which DecNef feedback fails to induce learning, and (iii) guide the design of more robust and reliable DecNef protocols in silico before human implementation. In summary, DecNefLab bridges computational modeling and cognitive neuroscience, offering a principled foundation for methodological innovation, robust protocol design, and ultimately, a deeper understanding of DecNef-based brain modulation.
[AI-106] Compute-in-Memory Implementation of State Space Models for Event Sequence Processing
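【Quick Read】: This paper addresses the challenge of processing long sequences efficiently, in particular achieving energy-efficient, real-time processing for event-driven vision and audio tasks. The key to the solution is algorithm and hardware co-design. State Space Models (SSMs) are re-parameterized to use real-valued coefficients and shared decay constants, which reduces the complexity of mapping the model onto practical Compute-in-Memory (CIM) hardware. By exploiting device dynamics and diagonalized state transition parameters, the state evolution can be implemented natively in crossbar-based CIM systems combined with memristors that exhibit short-term memory effects, giving high accuracy and high energy efficiency while supporting fully asynchronous processing.
Link: https://arxiv.org/abs/2511.13912
Authors: Xiaoyu Zhang, Mingtao Hu, Sen Lu, Soohyeon Kim, Eric Yeu-Jer Lee, Yuyang Liu, Wei D. Lu
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Xiaoyu Zhang and Mingtao Hu contributed equally to this work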
Abstract:State space models (SSMs) have recently emerged as a powerful framework for long sequence processing, outperforming traditional methods on diverse benchmarks. Fundamentally, SSMs can generalize both recurrent and convolutional networks and have been shown to even capture key functions of biological systems. Here we report an approach to implement SSMs in energy-efficient compute-in-memory (CIM) hardware to achieve real-time, event-driven processing. Our work re-parameterizes the model to function with real-valued coefficients and shared decay constants, reducing the complexity of model mapping onto practical hardware systems. By leveraging device dynamics and diagonalized state transition parameters, the state evolution can be natively implemented in crossbar-based CIM systems combined with memristors exhibiting short-term memory effects. Through this algorithm and hardware co-design, we show the proposed system offers both high accuracy and high energy efficiency while supporting fully asynchronous processing for event-based vision and audio tasks.
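To make the idea of a diagonal SSM with real-valued coefficients and a shared decay constant concrete, here is a minimal Python sketch of an event-driven update. It is an illustration under stated assumptions, not the paper's model or hardware mapping: the function `run_event_ssm`, the exponential decay `exp(-dt / tau)` between events, and the scalar input are all hypothetical choices.

```python
import math


def run_event_ssm(events, b, c, tau):
    """Run a diagonal state space model over a sparse event stream.

    events: list of (timestamp, input value) pairs, sorted by timestamp.
    b, c:   real-valued input and output weights, one per state element.
    tau:    a single decay time constant shared by all state elements
            (an assumed stand-in for the shared decay constants).
    """
    h = [0.0] * len(b)
    t_prev = None
    outputs = []
    for t, x in events:
        if t_prev is not None:
            # Between events the state only decays, so nothing has to be
            # computed until the next event arrives (asynchronous update).
            decay = math.exp(-(t - t_prev) / tau)
            h = [decay * hi for hi in h]
        # Diagonal update: each state element is driven independently.
        h = [hi + bi * x for hi, bi in zip(h, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, h)))
        t_prev = t
    return outputs


outs = run_event_ssm([(0.0, 1.0), (1.0, 0.0)], b=[1.0, 2.0], c=[1.0, 1.0], tau=1.0)
print(outs)
```

Because the transition is diagonal and the decay is shared, the only work per event is one scalar decay factor plus element-wise updates, which is the kind of structure that a device with built-in short-term memory decay could plausibly realize physically.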
[AI-107] Randomized Controlled Trials for Conditional Access Optimization Agent
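【Quick Read】: This paper addresses the limited evidence on how effective AI agents actually are in identity governance, specifically for Conditional Access (CA) policy management in Microsoft Entra. The key to the solution is a purpose-built AI agent that assists identity administrators with four high-value tasks: policy merging, Zero-Trust baseline gap detection, phased rollout planning, and user-policy alignment. In the first randomized controlled trial (RCT) of such an agent, with 162 identity administrators, agent access improved accuracy by 48% and reduced task completion time by 43% while holding accuracy constant. The largest benefits appeared on cognitively demanding tasks such as baseline gap detection.
Link: https://arxiv.org/abs/2511.13865
Authors: James Bono, Beibei Cheng, Joaquin Lozano
Affiliations: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments: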
Abstract:AI agents are increasingly deployed to automate complex enterprise workflows, yet evidence of their effectiveness in identity governance is limited. We report results from the first randomized controlled trial (RCT) evaluating an AI agent for Conditional Access (CA) policy management in Microsoft Entra. The agent assists with four high-value tasks: policy merging, Zero-Trust baseline gap detection, phased rollout planning, and user-policy alignment. In a production-grade environment, 162 identity administrators were randomly assigned to a control group (no agent) or treatment group (agent-assisted) and asked to perform these tasks. Agent access produced substantial gains: accuracy improved by 48% and task completion time decreased by 43% while holding accuracy constant. The largest benefits emerged on cognitively demanding tasks such as baseline gap detection. These findings demonstrate that purpose-built AI agents can significantly enhance both speed and accuracy in identity administration.
[AI-108] Randomized Controlled Trials for Phishing Triage Agent
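【Quick Read】: This paper addresses the difficulty security operations centers (SOCs) have in triaging a high volume of user-reported phishing emails efficiently while maintaining robust protection against threats. The key to the solution is a domain-specific AI agent, the Microsoft Security Copilot Phishing Triage Agent, which supports analysts through queue prioritization and verdict explanations. In a randomized controlled trial, agent-augmented analysts achieved up to 6.5 times as many true positives per analyst minute and a 77% improvement in verdict accuracy compared with a control group. Behavioral analysis shows that these analysts reallocated their attention, spending 53% more time on malicious emails, and were not prone to rubber-stamping the agent's malicious verdicts, which suggests that agents could change the optimal allocation of SOC resources.
Link: https://arxiv.org/abs/2511.13860
Authors: James Bono
Affiliations: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments: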
Abstract:Security operations centers (SOCs) face a persistent challenge: efficiently triaging a high volume of user-reported phishing emails while maintaining robust protection against threats. This paper presents the first randomized controlled trial (RCT) evaluating the impact of a domain-specific AI agent - the Microsoft Security Copilot Phishing Triage Agent - on analyst productivity and accuracy. Our results demonstrate that agent-augmented analysts achieved up to 6.5 times as many true positives per analyst minute and a 77% improvement in verdict accuracy compared to a control group. The agent’s queue prioritization and verdict explanations were both significant drivers of efficiency. Behavioral analysis revealed that agent-augmented analysts reallocated their attention, spending 53% more time on malicious emails, and were not prone to rubber-stamping the agent’s malicious verdicts. These findings offer actionable insights for SOC leaders considering AI adoption, including the potential for agents to fundamentally change the optimal allocation of SOC resources.
[AI-109] MAT-MPNN: A Mobility-Aware Transformer-MPNN Model for Dynamic Spatiotemporal Prediction of HIV Diagnoses in California Florida and New England
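【Quick Read】: This paper addresses the difficulty of modeling the complex spatial and temporal dependencies of HIV diagnosis rates. Conventional Message Passing Neural Network (MPNN) models rely on a fixed binary adjacency matrix that encodes only geographic adjacency, so they cannot represent interactions between non-contiguous counties. The key to the solution is a deep learning framework, the Mobility-Aware Transformer-MPNN (MAT-MPNN), which combines temporal features extracted by a Transformer encoder with spatial relationships captured by a Mobility Graph Generator (MGG) that augments the conventional adjacency matrix with geographic and demographic information. Compared with the best hybrid baseline (Transformer MPNN), MAT-MPNN reduces the Mean Squared Prediction Error (MSPE) by 27.9% in Florida, 39.1% in California, and 12.5% in New England, and improves the Predictive Model Choice Criterion (PMCC) by 7.7%, 3.5%, and 3.9%, respectively, for county-level HIV diagnosis rates.
Link: https://arxiv.org/abs/2511.13797
Authors: Zhaoxuan Wang, Weichen Kang, Yutian Han, Lingyuan Zhao, Bo Li
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 20 figures, 1 table. Preprint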
Abstract:Human Immunodeficiency Virus (HIV) has posed a major global health challenge for decades, and forecasting HIV diagnoses continues to be a critical area of research. However, capturing the complex spatial and temporal dependencies of HIV transmission remains challenging. Conventional Message Passing Neural Network (MPNN) models rely on a fixed binary adjacency matrix that only encodes geographic adjacency, which is unable to represent interactions between non-contiguous counties. Our study proposes a deep learning architecture Mobility-Aware Transformer-Message Passing Neural Network (MAT-MPNN) framework to predict county-level HIV diagnosis rates across California, Florida, and the New England region. The model combines temporal features extracted by a Transformer encoder with spatial relationships captured through a Mobility Graph Generator (MGG). The MGG improves conventional adjacency matrices by combining geographic and demographic information. Compared with the best-performing hybrid baseline, the Transformer MPNN model, MAT-MPNN reduced the Mean Squared Prediction Error (MSPE) by 27.9% in Florida, 39.1% in California, and 12.5% in New England, and improved the Predictive Model Choice Criterion (PMCC) by 7.7%, 3.5%, and 3.9%, respectively. MAT-MPNN also achieved better results than the Spatially Varying Auto-Regressive (SVAR) model in Florida and New England, with comparable performance in California. These results demonstrate that applying mobility-aware dynamic spatial structures substantially enhances predictive accuracy and calibration in spatiotemporal epidemiological prediction.
[AI-110] XAI-Driven Deep Learning for Protein Sequence Functional Group Classification
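【Quick Read】: This paper addresses both the accuracy and the interpretability of protein sequence functional classification: how to predict functional classes with high accuracy while also identifying biologically meaningful sequence features. The key to the solution is a deep learning framework that uses k-mer integer encoding to capture local and long-range sequence dependencies and compares four models: a CNN, a BiLSTM, a CNN-BiLSTM hybrid, and a CNN with attention. The Convolutional Neural Network (CNN) achieved the highest validation accuracy, 91.8%, which indicates that localized motif detection is especially effective for this task. Explainable AI techniques, including Grad-CAM and Integrated Gradients, then identified functionally relevant motifs enriched in histidine, aspartate, glutamate, and lysine, residues commonly found in the catalytic and metal-binding regions of transferase enzymes, linking predictive performance to biological interpretation.
Link: https://arxiv.org/abs/2511.13791
Authors: Pratik Chakraborty, Aryan Bhargava
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures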
Abstract:Proteins perform essential biological functions, and accurate classification of their sequences is critical for understanding structure-function relationships, enzyme mechanisms, and molecular interactions. This study presents a deep learning-based framework for functional group classification of protein sequences derived from the Protein Data Bank (PDB). Four architectures were implemented: Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), CNN-BiLSTM hybrid, and CNN with Attention. Each model was trained using k-mer integer encoding to capture both local and long-range dependencies. Among these, the CNN achieved the highest validation accuracy of 91.8%, demonstrating the effectiveness of localized motif detection. Explainable AI techniques, including Grad-CAM and Integrated Gradients, were applied to interpret model predictions and identify biologically meaningful sequence motifs. The discovered motifs, enriched in histidine, aspartate, glutamate, and lysine, represent amino acid residues commonly found in catalytic and metal-binding regions of transferase enzymes. These findings highlight that deep learning models can uncover functionally relevant biochemical signatures, bridging the gap between predictive accuracy and biological interpretability in protein sequence analysis.
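The k-mer integer encoding mentioned in the abstract can be sketched in a few lines of Python. The abstract does not give the value of k, the vocabulary, or the padding scheme, so everything below is an assumption for illustration: overlapping k-mers with stride 1, the 20 standard amino acids, and id 0 shared by padding and unknown k-mers. The names `build_kmer_vocab` and `encode` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_kmer_vocab(k):
    """Map every possible k-mer to a positive integer id.

    Id 0 is reserved for padding and for k-mers containing
    non-standard residues (an assumed convention).
    """
    kmers = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return {kmer: idx + 1 for idx, kmer in enumerate(kmers)}


def encode(seq, k, vocab, max_len):
    """Turn a protein sequence into a fixed-length list of k-mer ids."""
    # Slide a window of width k over the sequence with stride 1.
    ids = [vocab.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]
    ids = ids[:max_len]                      # truncate long sequences
    return ids + [0] * (max_len - len(ids))  # right-pad short ones


vocab = build_kmer_vocab(2)
print(len(vocab))                    # 400 possible 2-mers
print(encode("AAC", 2, vocab, 4))    # [1, 2, 0, 0]
```

The resulting integer lists can be fed to an embedding layer in front of a CNN or BiLSTM; larger k gives each token more local context at the cost of a vocabulary that grows as 20^k.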
[AI-111] GeoPl@ntNet: A Platform for Exploring Essential Biodiversity Variables
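【Quick Read】: This paper addresses how to make Essential Biodiversity Variables (EBVs) accessible and understandable to both the public and researchers. The key to the solution is GeoPl@ntNet, an interactive web application that uses dynamic maps and fact sheets to present high-resolution AI-generated maps (down to 50x50 meters) of species distributions, habitat types, and biodiversity indicators across Europe. The maps are produced by a cascading pipeline that combines Convolutional Neural Networks (CNNs) and Large Language Models (LLMs). Users can select specific regions, such as urban green spaces, protected areas, or riverbanks, to view local species and their coverage and to generate reports on the number of protected, invasive, and endemic species.
Link: https://arxiv.org/abs/2511.13790
Authors: Lukas Picek, César Leblanc, Alexis Joly, Pierre Bonnet, Rémi Palard, Maximilien Servajean
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 4 pages, 5 figures, and 2 tables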
Abstract:This paper describes GeoPl@ntNet, an interactive web application designed to make Essential Biodiversity Variables accessible and understandable to everyone through dynamic maps and fact sheets. Its core purpose is to allow users to explore high-resolution AI-generated maps of species distributions, habitat types, and biodiversity indicators across Europe. These maps, developed through a cascading pipeline involving convolutional neural networks and large language models, provide an intuitive yet information-rich interface to better understand biodiversity, with resolutions as precise as 50x50 meters. The website also enables exploration of specific regions, allowing users to select areas of interest on the map (e.g., urban green spaces, protected areas, or riverbanks) to view local species and their coverage. Additionally, GeoPl@ntNet generates comprehensive reports for selected regions, including insights into the number of protected species, invasive species, and endemic species.
[AI-112] Subject-Independent Imagined Speech Detection via Cross-Subject Generalization and Calibration
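【Quick Read】: This paper addresses the poor cross-subject generalization of electroencephalogram (EEG) based imagined speech decoding, which is caused by substantial variability in neural activity patterns between individuals. The key to the solution is combining cyclic inter-subject training with lightweight subject-specific adaptation. Training on shorter per-subject segments and alternating frequently among subjects yields modest but consistent improvements on unseen target data. In addition, calibrating with only 10% of the target subject's data achieves an accuracy of 0.781 and an AUC of 0.801, demonstrating the effectiveness of few-shot adaptation. This offers a simple route to scalable, user-adaptive brain-computer interface systems that balance generalization and personalization.
Link: https://arxiv.org/abs/2511.13739
Authors: Byung-Kwan Ko, Soowon Kim, Seo-Hyun Lee
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 4 pages, 2 figures, Name of Conference: International Conference on Brain-Computer Interface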
Abstract:Achieving robust generalization across individuals remains a major challenge in electroencephalogram-based imagined speech decoding due to substantial variability in neural activity patterns. This study examined how training dynamics and lightweight subject-specific adaptation influence cross-subject performance in a neural decoding framework. A cyclic inter-subject training approach, involving shorter per-subject training segments and frequent alternation among subjects, led to modest yet consistent improvements in decoding performance across unseen target data. Furthermore, under the subject-calibrated leave-one-subject-out scheme, incorporating only 10% of the target subject's data for calibration achieved an accuracy of 0.781 and an AUC of 0.801, demonstrating the effectiveness of few-shot adaptation. These findings suggest that integrating cyclic training with minimal calibration provides a simple and effective strategy for developing scalable, user-adaptive brain-computer interface systems that balance generalization and personalization.
[AI-113] DualLaguerreNet: A Decoupled Spectral Filter GNN and the Uncovering of the Flexibility-Stability Trade-off
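【Quick Read】: This paper addresses the "compromise" problem in spectral-filter Graph Neural Networks (GNNs) that target heterophily and over-smoothing: a single adaptive parameter cannot produce an optimal response across the entire graph spectrum at once. The key to the solution is DualLaguerreNet, which introduces "Decoupled Spectral Flexibility". It splits the graph Laplacian into a low-frequency operator (L_low) and a high-frequency operator (L_high) and learns two independent adaptive Laguerre polynomial filters, parameterized by alpha_1 and alpha_2, so that different frequency components are modeled separately. This design improves results on complex heterophilic tasks, but it also exposes a trade-off between flexibility and stability: the extra parameterization leads to overfitting on simpler homophilic tasks, which shows that the single-parameter "compromise" in simpler models acts as an implicit regularizer.
Link: https://arxiv.org/abs/2511.13729
Authors: Huseyin Goksu
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: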
Abstract:Graph Neural Networks (GNNs) based on spectral filters, such as the Adaptive Orthogonal Polynomial Filter (AOPF) class (e.g., LaguerreNet), have shown promise in unifying the solutions for heterophily and over-smoothing. However, these single-filter models suffer from a “compromise” problem, as their single adaptive parameter (e.g., alpha) must learn a suboptimal, averaged response across the entire graph spectrum. In this paper, we propose DualLaguerreNet, a novel GNN architecture that solves this by introducing “Decoupled Spectral Flexibility.” DualLaguerreNet splits the graph Laplacian into two operators, L_low (low-frequency) and L_high (high-frequency), and learns two independent, adaptive Laguerre polynomial filters, parameterized by alpha_1 and alpha_2, respectively. This work, however, uncovers a deeper finding. While our experiments show DualLaguerreNet’s flexibility allows it to achieve state-of-the-art results on complex heterophilic tasks (outperforming LaguerreNet), it simultaneously underperforms on simpler, homophilic tasks. We identify this as a fundamental “Flexibility-Stability Trade-off”. The increased parameterization (2x filter parameters and 2x model parameters) leads to overfitting on simple tasks, demonstrating that the “compromise” of simpler models acts as a crucial regularizer. This paper presents a new SOTA architecture for heterophily while providing a critical analysis of the bias-variance trade-off inherent in adaptive GNN filter design.
Machine Learning
[LG-0] π*_0.6: a VLA That Learns From Experience
Link: https://arxiv.org/abs/2511.14759
Authors: Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call π*_0.6, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the π*_0.6 model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
[LG-1] Robust Verification of Controllers under State Uncertainty via Hamilton-Jacobi Reachability Analysis
Link: https://arxiv.org/abs/2511.14755
Authors: Albert Lin, Alessandro Pinto, Somil Bansal
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to the 8th Annual Learning for Dynamics Control Conference
Abstract:As perception-based controllers for autonomous systems become increasingly popular in the real world, it is important that we can formally verify their safety and performance despite perceptual uncertainty. Unfortunately, the verification of such systems remains challenging, largely due to the complexity of the controllers, which are often nonlinear, nonconvex, learning-based, and/or black-box. Prior works propose verification algorithms that are based on approximate reachability methods, but they often restrict the class of controllers and systems that can be handled or result in overly conservative analyses. Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for general nonlinear systems that can compute optimal reachable sets under worst-case system uncertainties; however, its application to perception-based systems is currently underexplored. In this work, we propose RoVer-CoRe, a framework for the Robust Verification of Controllers via HJ Reachability. To the best of our knowledge, RoVer-CoRe is the first HJ reachability-based framework for the verification of perception-based systems under perceptual uncertainty. Our key insight is to concatenate the system controller, observation function, and the state estimation modules to obtain an equivalent closed-loop system that is readily compatible with existing reachability frameworks. Within RoVer-CoRe, we propose novel methods for formal safety verification and robust controller design. We demonstrate the efficacy of the framework in case studies involving aircraft taxiing and NN-based rover navigation. Code is available at the link in the footnote.
[LG-2] SparseST: Exploiting Data Sparsity in Spatiotemporal Modeling and Prediction
Link: https://arxiv.org/abs/2511.14753
Authors: Junfeng Wu, Hadjer Benmeziane, Kaoutar El Maghraoui, Liu Liu, Yinan Wang
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Spatiotemporal data mining (STDM) has a wide range of applications in various complex physical systems (CPS), i.e., transportation, manufacturing, healthcare, etc. Among all the proposed methods, the Convolutional Long Short-Term Memory (ConvLSTM) has proved to be generalizable and extendable in different applications and has multiple variants achieving state-of-the-art performance in various STDM applications. However, ConvLSTM and its variants are computationally expensive, which makes them inapplicable in edge devices with limited computational resources. With the emerging need for edge computing in CPS, efficient AI is essential to reduce the computational cost while preserving the model performance. Common methods of efficient AI are developed to reduce redundancy in model capacity (i.e., model pruning, compression, etc.). However, spatiotemporal data mining naturally requires extensive model capacity, as the embedded dependencies in spatiotemporal data are complex and hard to capture, which limits the model redundancy. Instead, there is a fairly high level of data and feature redundancy that introduces an unnecessary computational burden, which has been largely overlooked in existing research. Therefore, we developed a novel framework SparseST, that pioneered in exploiting data sparsity to develop an efficient spatiotemporal model. In addition, we explore and approximate the Pareto front between model performance and computational efficiency by designing a multi-objective composite loss function, which provides a practical guide for practitioners to adjust the model according to computational resource constraints and the performance requirements of downstream tasks.
[LG-3] Look-Ahead Reasoning on Learning Platforms NEURIPS2025
Link: https://arxiv.org/abs/2511.14745
Authors: Haiqing Zhu, Tijana Zrnic, Celestine Mendler-Dünner
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: accepted to NeurIPS 2025
Abstract:On many learning platforms, the optimization criteria guiding model training reflect the priorities of the designer rather than those of the individuals they affect. Consequently, users may act strategically to obtain more favorable outcomes, effectively contesting the platform’s predictions. While past work has studied strategic user behavior on learning platforms, the focus has largely been on strategic responses to a deployed model, without considering the behavior of other users. In contrast, look-ahead reasoning takes into account that user actions are coupled, and – at scale – impact future predictions. Within this framework, we first formalize level-k thinking, a concept from behavioral economics, where users aim to outsmart their peers by looking one step ahead. We show that, while convergence to an equilibrium is accelerated, the equilibrium remains the same, providing no benefit of higher-level reasoning for individuals in the long run. Then, we focus on collective reasoning, where users take coordinated actions by optimizing through their joint impact on the model. By contrasting collective with selfish behavior, we characterize the benefits and limits of coordination; a new notion of alignment between the learner’s and the users’ utilities emerges as a key concept. We discuss connections to several related mathematical frameworks, including strategic classification, performative prediction, and algorithmic collective action.
[LG-4] Measuring AI Progress in Drug Discovery: A Reproducible Leaderboard for the Tox21 Challenge
Link: https://arxiv.org/abs/2511.14744
Authors: Antonia Ebner, Christoph Bartmann, Sonja Topf, Sohvi Luukkonen, Johannes Schimunek, Günter Klambauer
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Deep learning’s rise since the early 2010s has transformed fields like computer vision and natural language processing and strongly influenced biomedical research. For drug discovery specifically, a key inflection - akin to vision’s “ImageNet moment” - arrived in 2015, when deep neural networks surpassed traditional approaches on the Tox21 Data Challenge. This milestone accelerated the adoption of deep learning across the pharmaceutical industry, and today most major companies have integrated these methods into their research pipelines. After the Tox21 Challenge concluded, its dataset was included in several established benchmarks, such as MoleculeNet and the Open Graph Benchmark. However, during these integrations, the dataset was altered and labels were imputed or manufactured, resulting in a loss of comparability across studies. Consequently, the extent to which bioactivity and toxicity prediction methods have improved over the past decade remains unclear. To this end, we introduce a reproducible leaderboard, hosted on Hugging Face with the original Tox21 Challenge dataset, together with a set of baseline and representative methods. The current version of the leaderboard indicates that the original Tox21 winner - the ensemble-based DeepTox method - and the descriptor-based self-normalizing neural networks introduced in 2017, continue to perform competitively and rank among the top methods for toxicity prediction, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade. As part of this work, we make all baselines and evaluated models publicly accessible for inference via standardized API calls to Hugging Face Spaces.
[LG-5] Beyond Means: A Dynamic Framework for Predicting Customer Satisfaction
Link: https://arxiv.org/abs/2511.14743
Authors: Christof Naumzik, Abdurahman Maarouf, Stefan Feuerriegel, Markus Weinmann
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Online ratings influence customer decision-making, yet standard aggregation methods, such as the sample mean, fail to adapt to quality changes over time and ignore review heterogeneity (e.g., review sentiment, a review’s helpfulness). To address these challenges, we demonstrate the value of using the Gaussian process (GP) framework for rating aggregation. Specifically, we present a tailored GP model that captures the dynamics of ratings over time while additionally accounting for review heterogeneity. Based on 121,123 ratings from Yelp, we compare the predictive power of different rating aggregation methods in predicting future ratings, thereby finding that the GP model is considerably more accurate and reduces the mean absolute error by 10.2% compared to the sample mean. Our findings have important implications for marketing practitioners and customers. By moving beyond means, designers of online reputation systems can display more informative and adaptive aggregated rating scores that are accurate signals of expected customer satisfaction.
[LG-6] LAUD: Integrating Large Language Models with Active Learning for Unlabeled Data
Link: https://arxiv.org/abs/2511.14738
Authors: Tzu-Hsuan Chou, Chun-Nan Chou
Subjects: Machine Learning (cs.LG)
Comments: 7 pages and one figure
Abstract:Large language models (LLMs) have shown a remarkable ability to generalize beyond their pre-training data, and fine-tuning LLMs can elevate performance to human-level and beyond. However, in real-world scenarios, lacking labeled data often prevents practitioners from obtaining well-performing models, thereby forcing practitioners to highly rely on prompt-based approaches that are often tedious, inefficient, and driven by trial and error. To alleviate this issue of lacking labeled data, we present a learning framework integrating LLMs with active learning for unlabeled dataset (LAUD). LAUD mitigates the cold-start problem by constructing an initial label set with zero-shot learning. Experimental results show that LLMs derived from LAUD outperform LLMs with zero-shot or few-shot learning on commodity name classification tasks, demonstrating the effectiveness of LAUD.
[LG-7] AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training NEURIPS2025
Link: https://arxiv.org/abs/2511.14721
Authors: Fu-Ming Guo, Yingfang Fan
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: GPU-Accelerated and Scalable Optimization (ScaleOpt)
Abstract:Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all parameters toward the origin at the same rate, making the update vulnerable to rare but extreme gradient directions and often over-penalizing well-conditioned coordinates. We propose AdamHuberDecay, a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. The resulting update decays parameters quadratically while their magnitude remains below a threshold $\delta$, and linearly ($\ell_1$-like) once they exceed $\delta$, yielding (i) bounded regularization gradients, (ii) invariance to per-coordinate second-moment rescaling, and (iii) stronger sparsity pressure on overgrown weights. We derive the closed-form decoupled Huber decay step and show how to integrate it with any Adam-family optimizer at $O(1)$ extra cost. Extensive experiments on GPT-2 and GPT-3 pre-training demonstrate that AdamHuberDecay (a) converges 10-15% faster in wall-clock time, (b) reduces validation perplexity by up to 4 points, (c) delivers performance improvements of 2.5-4.7% across downstream tasks, and (d) yields visibly sparser weight histograms that translate into 20-30% memory savings after magnitude pruning, without tuning the decay coefficient beyond the default grid used for AdamW. Ablations confirm robustness to outlier gradients and large-batch regimes, together with theoretical analyses that bound the expected parameter norm under noisy updates. AdamHuberDecay therefore provides a simple, principled path toward more efficient and resilient training of next-generation foundational generative transformers.
MSC classes: 68Txx. ACM classes: F.0; G.4. DOI: https://doi.org/10.48550/arXiv.2511.14721
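The decay behaviour described in the abstract above (quadratic below a threshold, linear above it) can be illustrated with a small sketch. This is not the paper's closed-form step, which the abstract does not give. It assumes the decay is a plain gradient step on a Huber penalty, whose gradient equals the parameter clipped to [-delta, delta]. The function names `huber_decay_step` and `l2_decay_step` are hypothetical.

```python
def huber_decay_step(params, lr, weight_decay, delta):
    """Apply one decoupled Huber-style decay step to a list of floats.

    Assumption (not taken from the paper): the decay is a gradient step on
    the Huber penalty, whose gradient is theta when |theta| <= delta and
    delta * sign(theta) otherwise, i.e. theta clipped to [-delta, delta].
    """
    decayed = []
    for theta in params:
        clipped = max(-delta, min(delta, theta))  # bounded regularization gradient
        decayed.append(theta - lr * weight_decay * clipped)
    return decayed


def l2_decay_step(params, lr, weight_decay):
    """AdamW-style decoupled L2 decay, shown only for comparison."""
    return [theta - lr * weight_decay * theta for theta in params]


if __name__ == "__main__":
    weights = [0.5, 4.0, -4.0]
    print(huber_decay_step(weights, lr=0.1, weight_decay=1.0, delta=1.0))
    print(l2_decay_step(weights, lr=0.1, weight_decay=1.0))
```

In this sketch the small weight is shrunk exactly as under L2 decay, while each large weight moves by a fixed amount `lr * weight_decay * delta` regardless of its size, which is the bounded-gradient property the abstract lists.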
[LG-8] Machine Learning Models for Predicting Smoking-Related Health Decline and Disease Risk
Link: https://arxiv.org/abs/2511.14682
Authors: Vaskar Chakma, MD Jaheid Hasan Nerab, Abdur Rouf, Abu Sayed, Hossem MD Saim, Md. Nournabi Khan
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: This paper has been officially accepted for publication in the Journal of Intelligent Medicine and Healthcare. Once the final published version is available online, this document will be updated accordingly
Abstract:Smoking continues to be a major preventable cause of death worldwide, affecting millions through damage to the heart, metabolism, liver, and kidneys. However, current medical screening methods often miss the early warning signs of smoking-related health problems, leading to late-stage diagnoses when treatment options become limited. This study presents a systematic comparative evaluation of machine learning approaches for smoking-related health risk assessment, emphasizing clinical interpretability and practical deployment over algorithmic innovation. We analyzed health screening data from 55,691 individuals, examining various health indicators, including body measurements, blood tests, and demographic information. We tested three advanced prediction algorithms - Random Forest, XGBoost, and LightGBM - to determine which could most accurately identify people at high risk. This study employed a cross-sectional design to classify current smoking status based on health screening biomarkers, not to predict future disease development. Our Random Forest model performed best, achieving an Area Under the Curve (AUC) of 0.926, meaning it could reliably distinguish between high-risk and lower-risk individuals. Using SHAP (SHapley Additive exPlanations) analysis to understand what the model was detecting, we found that key health markers played crucial roles in prediction: blood pressure levels, triglyceride concentrations, liver enzyme readings, and kidney function indicators (serum creatinine) were the strongest signals of declining health in smokers.
[LG-9] Derivative of the truncated singular value and eigen decomposition
Link: https://arxiv.org/abs/2511.14651
Authors: Jan Naumann
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: Technical report
Abstract:Recently developed applications in the field of machine learning and computational physics rely on automatic differentiation techniques, that require stable and efficient linear algebra gradient computations. This technical note provides a comprehensive and detailed discussion of the derivative of the truncated singular and eigenvalue decomposition. It summarizes previous work and builds on them with an extensive description of how to derive the relevant terms. A main focus is correctly expressing the derivative in terms of the truncated part, despite lacking knowledge of the full decomposition.
[LG-10] Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
Link: https://arxiv.org/abs/2511.14617
Authors: Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, Mingxing Zhang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 16 pages, 12 figures, 6 tables
Abstract:Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.
[LG-11] Task Addition and Weight Disentanglement in Closed-Vocabulary Models
Link: https://arxiv.org/abs/2511.14569
Authors: Adam Hazimeh, Alessandro Favero, Pascal Frossard
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Task arithmetic has recently emerged as a promising method for editing pre-trained open-vocabulary models, offering a cost-effective alternative to standard multi-task fine-tuning. However, despite the abundance of closed-vocabulary models that are not pre-trained with language supervision, applying task arithmetic to these models remains unexplored. In this paper, we deploy and study task addition in closed-vocabulary image classification models. We consider different pre-training schemes and find that weight disentanglement – the property enabling task arithmetic – is a general consequence of pre-training, as it appears in different pre-trained closed-vocabulary models. In fact, we find that pre-trained closed-vocabulary vision transformers can also be edited with task arithmetic, achieving high task addition performance and enabling the efficient deployment of multi-task models. Finally, we demonstrate that simple linear probing is a competitive baseline to task addition. Overall, our findings expand the applicability of task arithmetic to a broader class of pre-trained models and open the way for more efficient use of pre-trained models in diverse settings.
[LG-12] Mind the Gaps: Measuring Visual Artifacts in Dimensionality Reduction
Link: https://arxiv.org/abs/2511.14544
Authors: Jaume Ros, Alessio Arleo, Fernando Paulovich
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Dimensionality Reduction (DR) techniques are commonly used for the visual exploration and analysis of high-dimensional data due to their ability to project datasets of high-dimensional points onto the 2D plane. However, projecting datasets in lower dimensions often entails some distortion, which is not necessarily easy to recognize but can lead users to misleading conclusions. Several Projection Quality Metrics (PQMs) have been developed as tools to quantify the goodness-of-fit of a DR projection; however, they mostly focus on measuring how well the projection captures the global or local structure of the data, without taking into account the visual distortion of the resulting plots, thus often ignoring the presence of outliers or artifacts that can mislead a visual analysis of the projection. In this work, we introduce the Warping Index (WI), a new metric for measuring the quality of DR projections onto the 2D plane, based on the assumption that the correct preservation of empty regions between points is of crucial importance towards a faithful visual representation of the data.
[LG-13] Full Atom Peptide Design via Riemannian Euclidean Bayesian Flow Networks AAAI2026
Link: https://arxiv.org/abs/2511.14516
Authors: Hao Qian, Shikui Tu, Lei Xu
Subjects: Machine Learning (cs.LG)
Comments: 7 pages, 4 figures, AAAI 2026
Abstract:Diffusion and flow matching models have recently emerged as promising approaches for peptide binder design. Despite their progress, these models still face two major challenges. First, categorical sampling of discrete residue types collapses their continuous parameters into one-hot assignments, while continuous variables (e.g., atom positions) evolve smoothly throughout the generation process. This mismatch disrupts the update dynamics and results in suboptimal performance. Second, current models assume unimodal distributions for side-chain torsion angles, which conflicts with the inherently multimodal nature of side chain rotameric states and limits prediction accuracy. To address these limitations, we introduce PepBFN, the first Bayesian flow network for full atom peptide design that directly models parameter distributions in fully continuous space. Specifically, PepBFN models discrete residue types by learning their continuous parameter distributions, enabling joint and smooth Bayesian updates with other continuous structural parameters. It further employs a novel Gaussian mixture based Bayesian flow to capture the multimodal side chain rotameric states and a Matrix Fisher based Riemannian flow to directly model residue orientations on the $\mathrm{SO}(3)$ manifold. Together, these parameter distributions are progressively refined via Bayesian updates, yielding smooth and coherent peptide generation. Experiments on side chain packing, reverse folding, and binder design tasks demonstrate the strong potential of PepBFN in computational peptide design.
[LG-14] CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design
Link: https://arxiv.org/abs/2511.14510
Authors: Jiawei Yi, Ping Gong, Youhui Bai, Jiaqi Ruan, Shengnan Wang, Pengcheng Wang, Haibo Wang, Weiguang Wang, Xia Zhu, Feng Wu, Cheng Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The growth of million-token LLMs exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, while further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations at the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system via algorithm-system co-design. CLO features: (1) a coarse-grained head-wise approximate on-GPU caching strategy with negligible cache management cost, (2) seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, (3) a zero-copy transfer engine to fully exploit PCIe bandwidth, and a GPU-centric synchronization method to eliminate GPU stalls. Evaluation on two widely-used LLMs demonstrates that CLO achieves comparable accuracy to state-of-the-art systems, while substantially minimizing CPU overhead, fully utilizing PCIe bandwidth, thus improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open source CLO at this https URL.
[LG-15] Notes on Kernel Methods in Machine Learning
Link: https://arxiv.org/abs/2511.14485
Authors: Diego Armando Pérez-Rosero, Danna Valentina Salazar-Dubois, Juan Camilo Lugo-Rojas, Andrés Marino Álvarez-Meza, Germán Castellanos-Dominguez
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:These notes provide a self-contained introduction to kernel methods and their geometric foundations in machine learning. Starting from the construction of Hilbert spaces, we develop the theory of positive definite kernels, reproducing kernel Hilbert spaces (RKHS), and Hilbert-Schmidt operators, emphasizing their role in statistical estimation and representation of probability measures. Classical concepts such as covariance, regression, and information measures are revisited through the lens of Hilbert space geometry. We also introduce kernel density estimation, kernel embeddings of distributions, and the Maximum Mean Discrepancy (MMD). The exposition is designed to serve as a foundation for more advanced topics, including Gaussian processes, kernel Bayesian inference, and functional analytic approaches to modern machine learning.
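As a concrete companion to the Maximum Mean Discrepancy mentioned in the abstract above, here is a minimal sketch of the standard biased (V-statistic) estimate of squared MMD for one-dimensional samples with a Gaussian kernel. It is not taken from the notes themselves; the function names and the default bandwidth are illustrative assumptions.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two real numbers."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    Computes mean k(x, x') + mean k(y, y') - 2 * mean k(x, y),
    where each mean runs over all pairs, including identical ones.
    """
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    same = mmd_squared([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
    far = mmd_squared([0.0, 0.1, 0.2], [10.0, 10.1, 10.2])
    print(same, far)
```

Identical samples give an estimate of zero up to rounding, and well-separated samples give a clearly positive value. The pairwise loops cost O(n·m) kernel evaluations, which is fine for a small illustration but not for large samples.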
[LG-16] Gradient-Based Join Ordering
Link: https://arxiv.org/abs/2511.14482
Authors: Tim Schwabe, Maribel Acosta
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments:
Abstract:Join ordering is the NP-hard problem of selecting the most efficient sequence in which to evaluate joins (conjunctive, binary operators) in a database query. As the performance of query execution critically depends on this choice, join ordering lies at the core of query optimization. Traditional approaches cast this problem as a discrete combinatorial search over binary trees guided by a cost model, but they often suffer from high computational complexity and limited scalability. We show that, when the cost model is differentiable, the query plans can be continuously relaxed into a soft adjacency matrix representing a superposition of plans. This continuous relaxation, together with a Gumbel-Softmax parameterization of the adjacency matrix and differentiable constraints enforcing plan validity, enables gradient-based search for plans within this relaxed space. Using a learned Graph Neural Network as the cost model, we demonstrate that this gradient-based approach can find comparable and even lower-cost plans compared to traditional discrete local search methods on two different graph datasets. Furthermore, we empirically show that the runtime of this approach scales linearly with query size, in contrast to quadratic or exponential runtimes of classical approaches. We believe this first step towards gradient-based join ordering can lead to more effective and efficient query optimizers in the future.
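The Gumbel-Softmax parameterization named in the abstract above can be sketched for a single row of a soft adjacency matrix. This shows only the generic Gumbel-Softmax relaxation of a categorical choice. How the paper applies it to query plans, and how it enforces plan validity, is not described in the abstract and is not modeled here. The function name and arguments are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed categorical sample (a probability vector).

    Adds Gumbel(0, 1) noise to each logit, divides by the temperature,
    and applies a numerically stable softmax. Lower temperatures push
    the output closer to a one-hot vector.
    """
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max before exponentiating for stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    # One hypothetical row: soft choice among three candidate join partners.
    print(gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng))
```

Each call returns a vector that sums to one, so a matrix built row by row this way is row-stochastic. In an automatic-differentiation framework the same expression is differentiable with respect to the logits, which is what makes gradient-based search possible.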
[LG-17] Nonparametric estimation of conditional probability distributions using a generative approach based on conditional push-forward neural networks
Link: https://arxiv.org/abs/2511.14455
Authors: Nicola Rares Franco, Lorenzo Tedesco
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:We introduce conditional push-forward neural networks (CPFN), a generative framework for conditional distribution estimation. Instead of directly modeling the conditional density $f_{Y|X}$, CPFN learns a stochastic map $\varphi=\varphi(x,u)$ such that $\varphi(x,U)$ and $Y|X=x$ follow approximately the same law, with $U$ a suitable random vector of pre-defined latent variables. This enables efficient conditional sampling and straightforward estimation of conditional statistics through Monte Carlo methods. The model is trained via an objective function derived from a Kullback-Leibler formulation, without requiring invertibility or adversarial training. We establish a near-asymptotic consistency result and demonstrate experimentally that CPFN can achieve performance competitive with, or even superior to, state-of-the-art methods, including kernel estimators, tree-based algorithms, and popular deep learning techniques, all while remaining lightweight and easy to train.
[LG-18] Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
Link: https://arxiv.org/abs/2511.14427
Authors: Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 9 pages, 10 figures, preprint
Abstract:Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control.
[LG-19] FlowRoI: A Fast Optical Flow Driven Region of Interest Extraction Framework for High-Throughput Image Compression in Immune Cell Migration Analysis
Link: https://arxiv.org/abs/2511.14419
Authors: Xiaowei Xu, Justin Sonneck, Hongxiao Wang, Roman Burkard, Hendrik Wohrle, Anton Grabmasier, Matthias Gunzer, Jianxu Chen
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 9 figures, 2 tables
Abstract:Autonomous migration is essential for the function of immune cells such as neutrophils and plays a pivotal role in diverse diseases. Recently, we introduced ComplexEye, a multi-lens array microscope comprising 16 independent aberration-corrected glass lenses arranged at the pitch of a 96-well plate, capable of capturing high-resolution movies of migrating cells. This architecture enables high-throughput live-cell video microscopy for migration analysis, supporting routine quantification of autonomous motility with strong potential for clinical translation. However, ComplexEye and similar high-throughput imaging platforms generate data at an exponential rate, imposing substantial burdens on storage and transmission. To address this challenge, we present FlowRoI, a fast optical-flow-based region of interest (RoI) extraction framework designed for high-throughput image compression in immune cell migration studies. FlowRoI estimates optical flow between consecutive frames and derives RoI masks that reliably cover nearly all migrating cells. The raw image and its corresponding RoI mask are then jointly encoded using JPEG2000 to enable RoI-aware compression. FlowRoI operates with high computational efficiency, achieving runtimes comparable to standard JPEG2000 and reaching an average throughput of about 30 frames per second on a modern laptop equipped with an Intel i7-1255U CPU. In terms of image quality, FlowRoI yields higher peak signal-to-noise ratio (PSNR) in cellular regions and achieves 2.0-2.2x higher compression rates at matched PSNR compared to standard JPEG2000.
[LG-20] Toward Robust and Harmonious Adaptation for Cross-modal Retrieval
Link: https://arxiv.org/abs/2511.14416
Authors: Haobin Li, Mouxing Yang, Xi Peng
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 6 figures
Abstract:Recently, the general-to-customized paradigm has emerged as the dominant approach for Cross-Modal Retrieval (CMR), which reconciles the distribution shift problem between the source domain and the target domain. However, existing general-to-customized CMR methods typically assume that the entire target-domain data is available, which is easily violated in real-world scenarios and thus inevitably suffer from the query shift (QS) problem. Specifically, query shift embraces the following two characteristics and thus poses new challenges to CMR. i) Online Shift: real-world queries always arrive in an online manner, rendering it impractical to access the entire query set beforehand for customization approaches; ii) Diverse Shift: even with domain customization, the CMR models struggle to satisfy queries from diverse users or scenarios, leaving an urgent need to accommodate diverse queries. In this paper, we observe that QS would not only undermine the well-structured common space inherited from the source model, but also steer the model toward forgetting the indispensable general knowledge for CMR. Inspired by the observations, we propose a novel method for achieving online and harmonious adaptation against QS, dubbed Robust adaptation with quEry ShifT (REST). To deal with online shift, REST first refines the retrieval results to formulate the query predictions and accordingly designs a QS-robust objective function on these predictions to preserve the well-established common space in an online manner. As for tackling the more challenging diverse shift, REST employs a gradient decoupling module to dexterously manipulate the gradients during the adaptation process, thus preventing the CMR model from forgetting the general knowledge. Extensive experiments on 20 benchmarks across three CMR tasks verify the effectiveness of our method against QS.
[LG-21] Watch Out for the Lifespan: Evaluating Backdoor Attacks Against Federated Model Adaptation
Link: https://arxiv.org/abs/2511.14406
Authors: Bastien Vuillod, Pierre-Alain Moellic, Jean-Max Dutertre
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Accepted at FPS 2025
Abstract:Large models adaptation through Federated Learning (FL) addresses a wide range of use cases and is enabled by Parameter-Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA). However, this distributed learning paradigm faces several security threats, particularly to its integrity, such as backdoor attacks that aim to inject malicious behavior during the local training steps of certain clients. We present the first analysis of the influence of LoRA on state-of-the-art backdoor attacks targeting model adaptation in FL. Specifically, we focus on backdoor lifespan, a critical characteristic in FL, that can vary depending on the attack scenario and the attacker’s ability to effectively inject the backdoor. A key finding in our experiments is that for an optimally injected backdoor, the backdoor persistence after the attack is longer when the LoRA’s rank is lower. Importantly, our work highlights evaluation issues of backdoor attacks against FL and contributes to the development of more robust and fair evaluations of backdoor attacks, enhancing the reliability of risk assessments for critical FL systems. Our code is publicly available.
[LG-22] Enforcing hidden physics in physics-informed neural networks
Link: https://arxiv.org/abs/2511.14348
Authors: Nanxi Chen, Sifan Wang, Rujin Ma, Airong Chen, Chuanjie Cui
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:Physics-informed neural networks (PINNs) represent a new paradigm for solving partial differential equations (PDEs) by integrating physical laws into the learning process of neural networks. However, despite their foundational role, the hidden irreversibility implied by the Second Law of Thermodynamics is often neglected during training, leading to unphysical solutions or even training failures in conventional PINNs. In this paper, we identify this critical gap and introduce a simple, generalized, yet robust irreversibility-regularized strategy that enforces hidden physical laws as soft constraints during training. This approach ensures that the learned solutions consistently respect the intrinsic one-way nature of irreversible physical processes. Across a wide range of benchmarks spanning traveling wave propagation, steady combustion, ice melting, corrosion evolution, and crack propagation, we demonstrate that our regularization scheme reduces predictive errors by more than an order of magnitude, while requiring only minimal modification to existing PINN frameworks. We believe that the proposed framework is broadly applicable to a wide class of PDE-governed physical systems and will have significant impact within the scientific machine learning community.
[LG-23] Learning with Statistical Equality Constraints
Link: https://arxiv.org/abs/2511.14320
Authors: Aneesh Barthakur, Luiz F. O. Chamon
Subjects: Machine Learning (cs.LG)
Comments: To be published in the 39th Annual Conference on Neural Information Processing Systems
Abstract:As machine learning applications grow increasingly ubiquitous and complex, they face an increasing set of requirements beyond accuracy. The prevalent approach to handle this challenge is to aggregate a weighted combination of requirement violation penalties into the training objective. To be effective, this approach requires careful tuning of these hyperparameters (weights), involving trial-and-error and cross-validation, which becomes ineffective even for a moderate number of requirements. These issues are exacerbated when the requirements involve parities or equalities, as is the case in fairness and boundary value problems. An alternative technique uses constrained optimization to formulate these learning problems. Yet, existing approximation and generalization guarantees do not apply to problems involving equality constraints. In this work, we derive a generalization theory for equality-constrained statistical learning problems, showing that their solutions can be approximated using samples and rich parametrizations. Using these results, we propose a practical algorithm based on solving a sequence of unconstrained, empirical learning problems. We showcase its effectiveness and the new formulations enabled by equality constraints in fair learning, interpolating classifiers, and boundary value problems.
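The idea of handling an equality constraint through a sequence of unconstrained problems can be illustrated with generic dual ascent. This is a textbook sketch on a toy quadratic problem, not the algorithm proposed in the paper; the objective, the constraint `sum(x) == b`, the step size, and the function name are all assumptions made for illustration.

```python
import numpy as np

def dual_ascent(a, b, step=0.5, n_iters=50):
    """Minimize ||x - a||^2 subject to sum(x) == b by dual ascent."""
    a = np.asarray(a, dtype=float)
    lam = 0.0
    x = a.copy()
    for _ in range(n_iters):
        # Unconstrained subproblem: argmin_x ||x - a||^2 + lam * (sum(x) - b),
        # whose closed-form solution shifts every coordinate by lam / 2.
        x = a - lam / 2.0
        # Move the multiplier in the direction of the constraint violation.
        lam += step * (x.sum() - b)
    return x, lam

x_opt, lam_opt = dual_ascent([1.0, 2.0, 3.0], b=3.0)
```

No penalty weight has to be tuned by hand here: the multiplier `lam` adapts until the constraint holds, which is the practical appeal of the constrained formulation over weighted penalties.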
[LG-24] Intervention Efficiency and Perturbation Validation Framework: Capacity-Aware and Robust Clinical Model Selection under the Rashomon Effect
Link: https://arxiv.org/abs/2511.14317
Authors: Yuwen Zhang, Viet Tran, Paul Weng
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In clinical machine learning, the coexistence of multiple models with comparable performance – a manifestation of the Rashomon Effect – poses fundamental challenges for trustworthy deployment and evaluation. Small, imbalanced, and noisy datasets, coupled with high-dimensional and weakly identified clinical features, amplify this multiplicity and make conventional validation schemes unreliable. As a result, selecting among equally performing models becomes uncertain, particularly when resource constraints and operational priorities are not considered by conventional metrics like F1 score. To address these issues, we propose two complementary tools for robust model assessment and selection: Intervention Efficiency (IE) and the Perturbation Validation Framework (PVF). IE is a capacity-aware metric that quantifies how efficiently a model identifies actionable true positives when only limited interventions are feasible, thereby linking predictive performance with clinical utility. PVF introduces a structured approach to assess the stability of models under data perturbations, identifying models whose performance remains most invariant across noisy or shifted validation sets. Empirical results on synthetic and real-world healthcare datasets show that using these tools facilitates the selection of models that generalize more robustly and align with capacity constraints, offering a new direction for tackling the Rashomon Effect in clinical settings.
[LG-25] Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Link: https://arxiv.org/abs/2511.14307
Authors: Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Submission to Track 5 of the DCASE 2025 Challenge
Abstract:In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the Audioset ontology. These segment-level predictions are calibrated before event-level predictions are produced. Finally, the event-level predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
[LG-26] Segmentwise Pruning in Audio-Language Models ICASSP2026
Link: https://arxiv.org/abs/2511.14293
Authors: Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Submitted to ICASSP 2026 (under review)
Abstract:Recent audio-language models have shown impressive performance across a wide range of audio tasks and are increasingly capable of handling long audio inputs. However, the computing costs in these models heavily depend on sequence length, which can become very large given the nature of audio data. In the vision-language domain, token pruning methods have proven effective in reducing token counts while preserving strong performance on standard benchmarks. In this work, we investigate the relevance and effectiveness of such token selection strategies in the context of audio-language models. We also improve them by proposing a lightweight strategy that takes the time dimension into account. While retaining only a quarter of the initial tokens, our approach results in a relative maximum decrease of 2% in CIDEr on Clotho v2 and a relative maximum decrease of 4% in accuracy on MMAU.
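One way to take the time dimension into account when pruning is to select the top-scoring tokens within each contiguous time segment instead of globally. The sketch below is a hypothetical illustration of that idea, not the authors' method: the function name, the equal-length segmentation, and the way `scores` are obtained are all assumptions.

```python
import numpy as np

def segmentwise_topk(scores, n_segments, keep_per_segment):
    """Return sorted indices of tokens kept by per-segment top-k selection.

    scores : 1-D array of per-token importance scores in temporal order
             (how the scores are computed is outside this sketch)
    """
    kept = []
    # array_split yields contiguous, nearly equal-length index segments.
    for segment in np.array_split(np.arange(len(scores)), n_segments):
        k = min(keep_per_segment, len(segment))
        # Indices of the k highest-scoring tokens within this segment.
        top = segment[np.argsort(scores[segment])[::-1][:k]]
        kept.extend(top.tolist())
    return sorted(kept)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.05, 0.3])
# Keep a quarter of the tokens: one from each half of the sequence.
kept = segmentwise_topk(scores, n_segments=2, keep_per_segment=1)
# Global top-k with the same budget, for comparison.
global_top = sorted(np.argsort(scores)[::-1][:2].tolist())
```

With these toy scores, global top-k keeps only tokens from the start of the clip, while the segmentwise variant keeps one token from each half, so later audio is still represented.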
[LG-27] Unified Multimodal Vessel Trajectory Prediction with Explainable Navigation Intention
Link: https://arxiv.org/abs/2511.14265
Authors: Rui Zhang, Chao Li, Kezhong Liu, Chen Wang, Bolong Zheng, Hongbo Jiang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Vessel trajectory prediction is fundamental to intelligent maritime systems. Within this domain, short-term prediction of rapid behavioral changes in complex maritime environments has established multimodal trajectory prediction (MTP) as a promising research area. However, existing vessel MTP methods suffer from limited scenario applicability and insufficient explainability. To address these challenges, we propose a unified MTP framework incorporating explainable navigation intentions, which we classify into sustained and transient categories. Our method constructs sustained intention trees from historical trajectories and models dynamic transient intentions using a Conditional Variational Autoencoder (CVAE), while using a non-local attention mechanism to maintain global scenario consistency. Experiments on real Automatic Identification System (AIS) datasets demonstrate our method’s broad applicability across diverse scenarios, achieving significant improvements in both ADE and FDE. Furthermore, our method improves explainability by explicitly revealing the navigational intentions underlying each predicted trajectory.
[LG-28] Algebraformer: A Neural Approach to Linear Systems
Link: https://arxiv.org/abs/2511.14263
Authors: Pietro Sittoni, Francesco Tudisco
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Recent work in deep learning has opened new possibilities for solving classical algorithmic tasks using end-to-end learned models. In this work, we investigate the fundamental task of solving linear systems, particularly those that are ill-conditioned. Existing numerical methods for ill-conditioned systems often require careful parameter tuning, preconditioning, or domain-specific expertise to ensure accuracy and stability. We propose Algebraformer, a Transformer-based architecture that learns to solve linear systems end-to-end, even in the presence of severe ill-conditioning. Our model leverages a novel encoding scheme that enables efficient representation of matrix and vector inputs, with a memory complexity of $O(n^2)$, supporting scalable inference. We demonstrate its effectiveness on application-driven linear problems, including interpolation tasks from spectral methods for boundary value problems and acceleration of the Newton method. Algebraformer achieves competitive accuracy with significantly lower computational overhead at test time, demonstrating that general-purpose neural architectures can effectively reduce complexity in traditional scientific computing pipelines.
[LG-29] Count The Notes: Histogram-Based Supervision for Automatic Music Transcription
Link: https://arxiv.org/abs/2511.14250
Authors: Jonathan Yaffe, Ben Maman, Meinard Müller, Amit H. Bermano
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: ISMIR 2025
Abstract:Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation efforts while maintaining high transcription accuracy. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods, improving AMT’s robustness, scalability, and efficiency. Our project page is available at this https URL.
[LG-30] EBind: a practical approach to space binding
Link: https://arxiv.org/abs/2511.14229
Authors: Jim Broadbent, Felix Cohen, Frederik Hvilshøj, Eric Landau, Eren Sasoglu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We simplify space binding by focusing on two core components: a single encoder per modality and high-quality data. This enables training state-of-the-art models on a single GPU in a few hours rather than multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x its size. The key to achieving this is a carefully curated dataset of three complementary data sources: i) 6.7M fully-automated multimodal quintuples sourced via SOTA retrieval models, ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Due to limitations with existing benchmarks, we further introduce the first high-quality, consensus-annotated zero-shot classification benchmark between audio and PCs. In contrast to related work, we will open-source our code, model weights, and datasets.
[LG-31] N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator
Link: https://arxiv.org/abs/2511.14195
Authors: Zheyu Lin, Jirui Yang, Hengqi Guo, Yubing Bao, Yao Guan
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream Red Teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model’s latent representations, bypassing the need for full text generation. It characterizes hidden layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric exhibits high consistency with the safety rankings derived from Red Teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1% of the token cost and the runtime cost, providing an efficient output-free evaluation proxy for real-time diagnostics.
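The abstract does not define how the JSS metric is computed from latent trajectories, so the sketch below shows only the standard Jensen-Shannon divergence between two discrete distributions, the quantity the metric's name refers to. The function name and the toy inputs are assumptions for illustration.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()        # normalize to probability vectors
    m = 0.5 * (p + q)                      # mixture distribution

    def kl(a, b):
        mask = a > 0                       # 0 * log(0) is treated as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

identical = js_divergence([0.5, 0.5], [0.5, 0.5])   # no separation
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])    # maximal separation
```

In base 2 the divergence is bounded between 0 (identical distributions) and 1 (disjoint supports), which makes it convenient as a separability score.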
[LG-32] A Comprehensive Study of Implicit and Explicit Biases in Large Language Models
Link: https://arxiv.org/abs/2511.14153
Authors: Fatima Kazi, Alex Young, Yash Inani, Setareh Rafatirad
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:Large Language Models (LLMs) inherit explicit and implicit biases from their training datasets. Identifying and mitigating biases in LLMs is crucial to ensure fair outputs, as they can perpetuate harmful stereotypes and misinformation. This study highlights the need to address biases in LLMs amid growing generative AI. We studied bias-specific benchmarks such as StereoSet and CrowSPairs to evaluate the existence of various biases in multiple generative models such as BERT and GPT 3.5. We proposed an automated Bias-Identification Framework to recognize various social biases in LLMs such as gender, race, profession, and religion. We adopted a two-pronged approach to detect explicit and implicit biases in text data. Results indicated fine-tuned models struggle with gender biases but excelled at identifying and avoiding racial biases. Our findings illustrated that despite having some success, LLMs often over-relied on keywords. To illuminate the capability of the analyzed LLMs in detecting implicit biases, we employed Bag-of-Words analysis and unveiled indications of implicit stereotyping within the vocabulary. To bolster the model performance, we applied an enhancement strategy involving fine-tuning models using prompting techniques and data augmentation of the bias benchmarks. The fine-tuned models exhibited promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.
[LG-33] Synthetic Survival Control: Extending Synthetic Controls for “When-If” Decision
Link: https://arxiv.org/abs/2511.14133
Authors: Jessy Xinyi Han, Devavrat Shah
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
Comments:
Abstract:Estimating causal effects on time-to-event outcomes from observational data is particularly challenging due to censoring, limited sample sizes, and non-random treatment assignment. The need for answering such “when-if” questions–how the timing of an event would change under a specified intervention–commonly arises in real-world settings with heterogeneous treatment adoption and confounding. To address these challenges, we propose Synthetic Survival Control (SSC) to estimate counterfactual hazard trajectories in a panel data setting where multiple units experience potentially different treatments over multiple periods. In such a setting, SSC estimates the counterfactual hazard trajectory for a unit of interest as a weighted combination of the observed trajectories from other units. To provide formal justification, we introduce a panel framework with a low-rank structure for causal survival analysis. Indeed, such a structure naturally arises under classical parametric survival models. Within this framework, for the causal estimand of interest, we establish identification and finite sample guarantees for SSC. We validate our approach using a multi-country clinical dataset of cancer treatment outcomes, where the staggered introduction of new therapies creates a quasi-experimental setting. Empirically, we find that access to novel treatments is associated with improved survival, as reflected by lower post-intervention hazard trajectories relative to their synthetic counterparts. Given the broad relevance of survival analysis across medicine, economics, and public policy, our framework offers a general and interpretable tool for counterfactual survival inference using observational data.
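The core synthetic-control step, expressing the unit of interest as a weighted combination of other units, can be sketched with ordinary least squares. This is a simplified illustration under stated assumptions: the abstract does not specify SSC's estimator, and the function name, the unconstrained least-squares fit, and the toy hazard values are all hypothetical.

```python
import numpy as np

def synthetic_control(donors_pre, target_pre, donors_post):
    """Fit donor weights on pre-intervention data, then build the counterfactual.

    donors_pre  : (T_pre, n_donors) pre-intervention donor trajectories
    target_pre  : (T_pre,) pre-intervention trajectory of the unit of interest
    donors_post : (T_post, n_donors) post-intervention donor trajectories
    """
    # Weights that best reproduce the target from the donors before treatment.
    weights, *_ = np.linalg.lstsq(donors_pre, target_pre, rcond=None)
    # Counterfactual: the same weighted combination applied after treatment.
    return weights, donors_post @ weights

donors_pre = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target_pre = 0.25 * donors_pre[:, 0] + 0.75 * donors_pre[:, 1]
donors_post = np.array([[2.0, 4.0]])
weights, counterfactual = synthetic_control(donors_pre, target_pre, donors_post)
```

Comparing the observed post-intervention trajectory of the treated unit against `counterfactual` gives the estimated effect; classical synthetic control additionally constrains the weights to be non-negative and sum to one.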
[LG-34] MalRAG: A Retrieval-Augmented LLM Framework for Open-set Malicious Traffic Identification
Link: https://arxiv.org/abs/2511.14129
Authors: Xiang Luo, Chang Liu, Gang Xiong, Chen Yang, Gaopeng Gou, Yaochen Ren, Zhen Li
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages, 13 figures. Intended for submission to IEEE Transactions on Information Forensics and Security (TIFS)
Abstract:Fine-grained identification of IDS-flagged suspicious traffic is crucial in cybersecurity. In practice, cyber threats evolve continuously, making the discovery of novel malicious traffic a critical necessity alongside the identification of known classes. Recent studies have advanced this goal with deep models, but they often rely on task-specific architectures that limit transferability and require per-dataset tuning. In this paper we introduce MalRAG, the first LLM-driven retrieval-augmented framework for open-set malicious traffic identification. MalRAG freezes the LLM and operates via comprehensive traffic knowledge construction, adaptive retrieval, and prompt engineering. Concretely, we construct a multi-view traffic database by mining prior malicious traffic from content, structural, and temporal perspectives. Furthermore, we introduce a Coverage-Enhanced Retrieval Algorithm that queries across these views to assemble the most probable candidates, thereby improving the inclusion of correct evidence. We then employ Traffic-Aware Adaptive Pruning to select a variable subset of these candidates based on traffic-aware similarity scores, suppressing incorrect matches and yielding reliable retrieved evidence. Moreover, we develop a suite of guidance prompts where task instruction, evidence referencing, and decision guidance are integrated with the retrieved evidence to improve LLM performance. Across diverse real-world datasets and settings, MalRAG delivers state-of-the-art results in both fine-grained identification of known classes and novel malicious traffic discovery. Ablation and deep-dive analyses further show that MalRAG effectively leverages LLM capabilities yet achieves open-set malicious traffic identification without relying on a specific LLM.
[LG-35] 10Cache: Heterogeneous Resource-Aware Tensor Caching and Migration for LLM Training SOCC’25
Link: https://arxiv.org/abs/2511.14124
Authors: Sabiha Afroz, Redwan Ibne Seraj Khan, Hadeel Albahar, Jingoo Han, Ali R. Butt
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted for presentation at the 16th ACM Symposium on Cloud Computing (SOCC’25)
Abstract:Training large language models (LLMs) in the cloud faces growing memory bottlenecks due to the limited capacity and high cost of GPUs. While GPU memory offloading to CPU and NVMe has made large-scale training more feasible, existing approaches suffer from high tensor migration latency and suboptimal device memory utilization, ultimately increasing training time and cloud costs. To address these challenges, we present 10Cache, a resource-aware tensor caching and migration system that accelerates LLM training by intelligently coordinating memory usage across GPU, CPU, and NVMe tiers. 10Cache profiles tensor execution order to construct prefetch policies, allocates memory buffers in pinned memory based on tensor size distributions, and reuses memory buffers to minimize allocation overhead. Designed for cloud-scale deployments, 10Cache improves memory efficiency and reduces reliance on high-end GPUs. Across diverse LLM workloads, it achieves up to 2x speedup in training time, improves GPU cache hit rate by up to 86.6x, and increases CPU/GPU memory utilization by up to 2.15x and 1.33x, respectively, compared to state-of-the-art offloading methods. These results demonstrate that 10Cache is a practical and scalable solution for optimizing LLM training throughput and resource efficiency in cloud environments.
[LG-36] MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
Link: https://arxiv.org/abs/2511.14102
Authors: Wenfeng Wang, Jiacheng Liu, Xiaofeng Hou, Xinfeng Xia, Peng Tang, Mingxuan Zhang, Chao Li, Minyi Guo
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus, as the data-dependent nature of expert selection places these synchronous transfers directly on the critical path of execution, crippling performance. This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to prefetch these experts from host memory, effectively overlapping the expensive I/O with useful computation and hiding the latency from the critical path. To maximize performance, an adaptive governor, guided by an Amortization Roofline Model, dynamically tunes the speculation strategy to the underlying hardware. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves at most 2.34x speedup over the state-of-the-art offloading framework. Our work establishes a new, principled approach for managing data-dependent memory access in resource-limited environments, making MoE inference more accessible on commodity hardware.
[LG-37] Observational Auditing of Label Privacy
Link: https://arxiv.org/abs/2511.14084
Authors: Iden Kalemaj, Luca Melis, Maxime Boucher, Ilya Mironov, Saeed Mahloujifar
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Differential privacy (DP) auditing is essential for evaluating privacy guarantees in machine learning systems. Existing auditing methods, however, pose a significant challenge for large-scale systems since they require modifying the training dataset – for instance, by injecting out-of-distribution canaries or removing samples from training. Such interventions on the training data pipeline are resource-intensive and involve considerable engineering overhead. We introduce a novel observational auditing framework that leverages the inherent randomness of data distributions, enabling privacy evaluation without altering the original dataset. Our approach extends privacy auditing beyond traditional membership inference to protected attributes, with labels as a special case, addressing a key gap in existing techniques. We provide theoretical foundations for our method and perform experiments on Criteo and CIFAR-10 datasets that demonstrate its effectiveness in auditing label privacy guarantees. This work opens new avenues for practical privacy auditing in large-scale production environments.
[LG-38] Meta-SimGNN: Adaptive and Robust WiFi Localization Across Dynamic Configurations and Diverse Scenarios
Link: https://arxiv.org/abs/2511.14076
Authors: Qiqi Xiao, Ziqi Ye, Yinghui He, Jianwei Liu, Guanding Yu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:To promote the practicality of deep learning-based localization, existing studies aim to address the issue of scenario dependence through meta-learning. However, these studies primarily focus on variations in environmental layouts while overlooking the impact of changes in device configurations, such as bandwidth, the number of access points (APs), and the number of antennas used. Unlike environmental changes, variations in device configurations affect the dimensionality of channel state information (CSI), thereby compromising neural network usability. To address this issue, we propose Meta-SimGNN, a novel WiFi localization system that integrates graph neural networks with meta-learning to improve localization generalization and robustness. First, we introduce a fine-grained CSI graph construction scheme, where each AP is treated as a graph node, allowing for adaptability to changes in the number of APs. To structure the features of each node, we propose an amplitude-phase fusion method and a feature extraction method. The former utilizes both amplitude and phase to construct CSI images, enhancing data reliability, while the latter extracts dimension-consistent features to address variations in bandwidth and the number of antennas. Second, a similarity-guided meta-learning strategy is developed to enhance adaptability in diverse scenarios. The initial model parameters for the fine-tuning stage are determined by comparing the similarity between the new scenario and historical scenarios, facilitating rapid adaptation of the model to the new localization scenario. Extensive experimental results over commodity WiFi devices in different scenarios show that Meta-SimGNN outperforms the baseline methods in terms of localization generalization and accuracy.
[LG-39] Dynamic Black-box Backdoor Attacks on IoT Sensory Data
Link: https://arxiv.org/abs/2511.14074
Authors: Ajesh Koyatan Chathoth, Stephen Lee
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Sensor data-based recognition systems are widely used in various applications, such as gait-based authentication and human activity recognition (HAR). Modern wearable and smart devices feature various built-in Inertial Measurement Unit (IMU) sensors, and such sensor-based measurements can be fed to a machine learning-based model to train and classify human activities. While deep learning-based models have proven successful in classifying human activity and gestures, they pose various security risks. In our paper, we discuss a novel dynamic trigger-generation technique for performing black-box adversarial attacks on sensor data-based IoT systems. Our empirical analysis shows that the attack is successful on various datasets and classifier models with minimal perturbation on the input data. We also provide a detailed comparative analysis of performance and stealthiness to various other poisoning techniques found in backdoor attacks. We also discuss some adversarial defense mechanisms and their impact on the effectiveness of our trigger-generation technique.
[LG-40] LogPurge: Log Data Purification for Anomaly Detection via Rule-Enhanced Filtering
Link: https://arxiv.org/abs/2511.14062
Authors: Shenglin Zhang, Ziang Chen, Zijing Que, Yilun Liu, Yongqian Sun, Sicheng Wei, Dan Pei, Hailin Li
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:
Abstract:Log anomaly detection, which is critical for identifying system failures and preempting security breaches, detects irregular patterns within large volumes of log data, and impacts domains such as service reliability, performance optimization, and database log analysis. Modern log anomaly detection methods rely on training deep learning models on clean, anomaly-free log sequences. However, obtaining such clean log data requires costly and tedious human labeling, and existing automatic cleaning methods fail to fully integrate the specific characteristics and actual semantics of logs in their purification process. In this paper, we propose a cost-aware, rule-enhanced purification framework, LogPurge, that automatically selects a sufficient subset of normal log sequences from contaminated log sequences to train an anomaly detection model. Our approach involves a two-stage filtering algorithm: in the first stage, we use a large language model (LLM) to remove clustered anomalous patterns and enhance system rules to improve the LLM’s understanding of system logs; in the second stage, we utilize a divide-and-conquer strategy that decomposes the remaining contaminated regions into smaller subproblems, allowing each to be effectively purified through the first-stage procedure. Our experiments, conducted on two public datasets and one industrial dataset, show that our method removes an average of 98.74% of anomalies while retaining 82.39% of normal samples. Compared to the latest unsupervised log sample selection algorithms, our method achieves F-1 score improvements of 35.7% and 84.11% on the public datasets, and a 149.72% F-1 improvement on the private dataset, demonstrating the effectiveness of our approach.
[LG-41] SmallML: Bayesian Transfer Learning for Small-Data Predictive Analytics
Link: https://arxiv.org/abs/2511.14049
Authors: Semen Leontev
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 64 pages, 5 figures, 15 tables
Abstract:Small and medium-sized enterprises (SMEs) represent 99.9% of U.S. businesses yet remain systematically excluded from AI due to a mismatch between their operational scale and modern machine learning’s data requirements. This paper introduces SmallML, a Bayesian transfer learning framework achieving enterprise-level prediction accuracy with datasets as small as 50-200 observations. We develop a three-layer architecture integrating transfer learning, hierarchical Bayesian modeling, and conformal prediction. Layer 1 extracts informative priors from 22,673 public records using a SHAP-based procedure transferring knowledge from gradient boosting to logistic regression. Layer 2 implements hierarchical pooling across J=5-50 SMEs with adaptive shrinkage, balancing population patterns with entity-specific characteristics. Layer 3 provides conformal sets with finite-sample coverage guarantees $P(y \in C(x)) \geq 1-\alpha$ for distribution-free uncertainty quantification. Validation on customer churn data demonstrates 96.7% ± 4.2% AUC with 100 observations per business, a 24.2-point improvement over independent logistic regression (72.5% ± 8.1%), with p < 0.000001. Conformal prediction achieves 92% empirical coverage at 90% target. Training completes in 33 minutes on standard CPU hardware. By enabling enterprise-grade predictions for 33 million U.S. SMEs previously excluded from machine learning, SmallML addresses a critical gap in AI democratization.
Keywords: Bayesian transfer learning, hierarchical models, conformal prediction, small-data analytics, SME machine learning
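The finite-sample coverage guarantee mentioned for Layer 3 is the kind provided by split conformal prediction. The sketch below shows that generic technique for a classifier; it is not SmallML's implementation, and the function names, the nonconformity score (one minus the true-class probability), and the toy calibration data are assumptions.

```python
import math
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold from a held-out calibration set (split conformal)."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    k = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    # With too few calibration points the only valid threshold is infinite.
    return float("inf") if k > n else float(scores[k - 1])

def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]

cal_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
cal_labels = np.array([0, 1, 0, 1])
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
confident_set = prediction_set([0.72, 0.28], threshold)
```

The `(n + 1)` correction in the rank is what yields coverage of at least `1 - alpha` for exchangeable data without distributional assumptions; with only four calibration points, a strict target such as `alpha=0.1` forces the threshold to infinity and every class is included.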
[LG-42] On the Gradient Complexity of Private Optimization with Private Oracles
Link: https://arxiv.org/abs/2511.13999
Authors: Michael Menart, Aleksandar Nikolov
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:We study the running time, in terms of first order oracle queries, of differentially private empirical/population risk minimization of Lipschitz convex losses. We first consider the setting where the loss is non-smooth and the optimizer interacts with a private proxy oracle, which sends only private messages about a minibatch of gradients. In this setting, we show that expected running time $\Omega\big(\min\big\{\frac{\sqrt{d}}{\alpha^2}, \frac{d}{\log(1/\alpha)}\big\}\big)$ is necessary to achieve $\alpha$ excess risk on problems of dimension $d$ when $d \geq 1/\alpha^2$. Upper bounds via DP-SGD show these results are tight when $d > \tilde{\Omega}(1/\alpha^4)$. We further show our lower bound can be strengthened to $\Omega\big(\min\big\{\frac{d}{\bar{m}\alpha^2}, \frac{d}{\log(1/\alpha)}\big\}\big)$ for algorithms which use minibatches of size at most $\bar{m} < \sqrt{d}$. We next consider smooth losses, where we relax the private oracle assumption and give lower bounds under only the condition that the optimizer is private. Here, we lower bound the expected number of first order oracle calls by $\tilde{\Omega}\big(\frac{\sqrt{d}}{\alpha} + \min\big\{\frac{1}{\alpha^2}, n\big\}\big)$, where $n$ is the size of the dataset. Modifications to existing algorithms show this bound is nearly tight. Compared to non-private lower bounds, our results show that differentially private optimizers pay a dimension dependent runtime penalty. Finally, as a natural extension of our proof technique, we show lower bounds in the non-smooth setting for optimizers interacting with information limited oracles. Specifically, if the proxy oracle transmits at most $\Gamma$ bits of information about the gradients in the minibatch, then $\Omega\big(\min\big\{\frac{d}{\alpha^2\Gamma}, \frac{d}{\log(1/\alpha)}\big\}\big)$ oracle calls are needed. This result shows fundamental limitations of gradient quantization techniques in optimization.
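For readers unfamiliar with DP-SGD, which the abstract uses for its upper bounds, the following is a minimal sketch of one step in its common textbook form: clip each per-example gradient, sum, add Gaussian noise, average. The function names and the `noise_multiplier` parameter are assumptions of this sketch, and no privacy accounting is performed.

```python
import math
import random


def clip(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_minibatch_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Sum clipped per-example gradients, add Gaussian noise, then average."""
    m = len(per_example_grads)
    total = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    sigma = noise_multiplier * max_norm  # noise std applied to the clipped sum
    return [(t + rng.gauss(0.0, sigma)) / m for t in total]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One gradient step using only the privatized minibatch gradient."""
    noisy = private_minibatch_gradient(per_example_grads, max_norm, noise_multiplier, rng)
    return [p - lr * g for p, g in zip(params, noisy)]


rng = random.Random(0)
new_params = dp_sgd_step([1.0, 1.0], [[3.0, 4.0], [0.0, 0.5]], 0.1, 1.0, 1.0, rng)
```

Each call to `private_minibatch_gradient` corresponds to one query in the oracle-complexity sense the abstract counts.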
[LG-43] Efficient reconstruction of multidimensional random field models with heterogeneous data using stochastic neural networks
Link: https://arxiv.org/abs/2511.13977
Authors: Mingtao Xia, Qijing Shen
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
Comments:
Abstract:In this paper, we analyze the scalability of a recent Wasserstein-distance approach for training stochastic neural networks (SNNs) to reconstruct multidimensional random field models. We prove a generalization error bound for reconstructing multidimensional random field models when training stochastic neural networks with a limited amount of training data. Our results indicate that when noise is heterogeneous across dimensions, the convergence rate of the generalization error may not depend explicitly on the model’s dimensionality, partially alleviating the “curse of dimensionality” for learning multidimensional random field models from a finite number of data points. Additionally, we improve the previous Wasserstein-distance SNN training approach and showcase the robustness of the SNN. Through numerical experiments on different multidimensional uncertainty quantification tasks, we show that our Wasserstein-distance approach can successfully train stochastic neural networks to learn multidimensional uncertainty models.
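As a small illustration of the kind of loss involved, the sketch below computes the Wasserstein-1 distance between two equal-size empirical samples on the real line, where it reduces to comparing sorted values. This one-dimensional special case is an assumption chosen for brevity; the paper's multidimensional training objective is not reproduced here, and the function name is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical distributions on the line.

    In one dimension the optimal coupling matches sorted samples, so the
    distance is the mean absolute difference of the order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Samples drawn from a model and from data (toy values).
model_samples = [0.1, 0.9, 0.4, 0.6]
data_samples = [0.2, 1.0, 0.5, 0.7]
loss = wasserstein_1d(model_samples, data_samples)
```

A stochastic network would be trained by drawing `model_samples` from its output distribution and reducing such a distance to the observed data.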
[LG-44] The Impact of Bootstrap Sampling Rate on Random Forest Performance in Regression Tasks
Link: https://arxiv.org/abs/2511.13952
Authors: Michał Iwaniuk, Mateusz Jarosz, Bartłomiej Borycki, Bartosz Jezierski, Jan Cwalina, Stanisław Kaźmierczak, Jacek Mańdziuk
Categories: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Random Forests (RFs) typically train each tree on a bootstrap sample of the same size as the training set, i.e., bootstrap rate (BR) equals 1.0. We systematically examine how varying BR from 0.2 to 5.0 affects RF performance across 39 heterogeneous regression datasets and 16 RF configurations, evaluating with repeated two-fold cross-validation and mean squared error. Our results demonstrate that tuning the BR can yield significant improvements over the default: the best setup relied on BR $\leq 1.0$ for 24 datasets, BR $> 1.0$ for 15, and BR $= 1.0$ was optimal in 4 cases only. We establish a link between dataset characteristics and the preferred BR: datasets with strong global feature-target relationships favor higher BRs, while those with higher local target variance benefit from lower BRs. To further investigate this relationship, we conducted experiments on synthetic datasets with controlled noise levels. These experiments reproduce the observed bias-variance trade-off: in low-noise scenarios, higher BRs effectively reduce model bias, whereas in high-noise settings, lower BRs help reduce model variance. Overall, BR is an influential hyperparameter that should be tuned to optimize RF regression models.
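To make the bootstrap rate concrete, here is a minimal sketch of how a per-tree sample could be drawn for a given BR, including rates above 1.0. The helper names are hypothetical and this is not the authors' experimental code.

```python
import random


def bootstrap_indices(n_samples, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n_samples) row indices with replacement."""
    size = max(1, round(bootstrap_rate * n_samples))
    return [rng.randrange(n_samples) for _ in range(size)]


def unique_fraction(n_samples, bootstrap_rate, rng):
    """Fraction of distinct training rows that one tree would see."""
    drawn = bootstrap_indices(n_samples, bootstrap_rate, rng)
    return len(set(drawn)) / n_samples


rng = random.Random(0)
for rate in (0.2, 1.0, 5.0):
    print(rate, unique_fraction(1000, rate, rng))
```

For large datasets the expected unique fraction approaches 1 - exp(-BR), so BR = 1.0 exposes each tree to roughly 63% of the rows, while BR = 5.0 exposes it to nearly all of them.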
[LG-45] ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Link: https://arxiv.org/abs/2511.13940
Authors: Stuart H. Sul, Simran Arora, Benjamin F. Spector, Christopher Ré
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to $2.33\times$ speedup for data- and tensor-parallel workloads, $4.08\times$ for sequence-parallel workloads, and $1.22\times$ for expert-parallel workloads.
[LG-46] Complex-Weighted Convolutional Networks: Provable Expressiveness via Complex Diffusion
Link: https://arxiv.org/abs/2511.13937
Authors: Cristina López Amado, Tassilo Schwarz, Yu Tian, Renaud Lambiotte
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Dynamical Systems (math.DS); Physics and Society (physics.soc-ph)
Comments: 19 pages, 6 figures. Learning on Graphs Conference 2025
Abstract:Graph Neural Networks (GNNs) have achieved remarkable success across diverse applications, yet they remain limited by oversmoothing and poor performance on heterophilic graphs. To address these challenges, we introduce a novel framework that equips graphs with a complex-weighted structure, assigning each edge a complex number to drive a diffusion process that extends random walks into the complex domain. We prove that this diffusion is highly expressive: with appropriately chosen complex weights, any node-classification task can be solved in the steady state of a complex random walk. Building on this insight, we propose the Complex-Weighted Convolutional Network (CWCN), which learns suitable complex-weighted structures directly from data while enriching diffusion with learnable matrices and nonlinear activations. CWCN is simple to implement, requires no additional hyperparameters beyond those of standard GNNs, and achieves competitive performance on benchmark datasets. Our results demonstrate that complex-weighted diffusion provides a principled and general mechanism for enhancing GNN expressiveness, opening new avenues for models that are both theoretically grounded and practically effective.
[LG-47] Weather Maps as Tokens: Transformers for Renewable Energy Forecasting
Link: https://arxiv.org/abs/2511.13935
Authors: Federico Battini
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Accurate renewable energy forecasting is essential to reducing dependence on fossil fuels and enabling grid decarbonization. However, current approaches fail to effectively integrate the rich spatial context of weather patterns with their temporal evolution. This work introduces a novel approach that treats weather maps as tokens in transformer sequences to predict renewable energy. Hourly weather maps are encoded as spatial tokens using a lightweight convolutional neural network, and then processed by a transformer to capture temporal dynamics across a 45-hour forecast horizon. Despite disadvantages in input initialization, evaluation against ENTSO-E operational forecasts shows a reduction in RMSE of about 60% and 20% for wind and solar respectively. A live dashboard showing daily forecasts is available at: this https URL.
[LG-48] Beyond One-Size-Fits-All: Neural Networks for Differentially Private Tabular Data Synthesis
Link: https://arxiv.org/abs/2511.13893
Authors: Kai Chen, Chen Gong, Tianhao Wang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 18 pages. Github Link provided: this https URL
Abstract:In differentially private (DP) tabular data synthesis, the consensus is that statistical models are better than neural network (NN)-based methods. However, we argue that this conclusion is incomplete and overlooks the challenge of densely correlated datasets, where intricate dependencies can overwhelm statistical models. In such complex scenarios, neural networks are more suitable due to their capacity to fit complex distributions by learning directly from samples. Despite this potential, existing NN-based algorithms still suffer from significant limitations. We therefore propose MargNet, incorporating successful algorithmic designs of statistical models into neural networks. MargNet applies an adaptive marginal selection strategy and trains the neural networks to generate data that conforms to the selected marginals. On sparsely correlated datasets, our approach achieves utility close to the best statistical method while offering an average $7\times$ speedup over it. More importantly, on densely correlated datasets, MargNet establishes a new state-of-the-art, reducing fidelity error by up to 26% compared to the previous best. We release our code on GitHub (this https URL).
[LG-49] Tractable Probabilistic Models for Investment Planning
Link: https://arxiv.org/abs/2511.13888
Authors: Nicolas M. Cuadrado A., Mohannad Takrouri, Jiří Němeček, Martin Takáč, Jakub Mareček
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Investment planning in power utilities, such as generation and transmission expansion, requires decade-long forecasts under profound uncertainty. Forecasting of energy mix and energy use decades ahead is nontrivial. Classical approaches focus on generating a finite number of scenarios (modeled as a mixture of Diracs in statistical theory terms), which limits insight into scenario-specific volatility and hinders robust decision-making. We propose an alternative using tractable probabilistic models (TPMs), particularly sum-product networks (SPNs). These models enable exact, scalable inference of key quantities such as scenario likelihoods, marginals, and conditional probabilities, supporting robust scenario expansion and risk assessment. This framework enables direct embedding of chance-constrained optimization into investment planning, enforcing safety or reliability with prescribed confidence levels. TPMs allow both scenario analysis and volatility quantification by compactly representing high-dimensional uncertainties. We demonstrate the approach’s effectiveness through a representative power system planning case study, illustrating computational and reliability advantages over traditional scenario-based models.
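The exact inference that sum-product networks allow can be shown with a toy example. The sketch below builds a two-component SPN over two binary variables and evaluates a joint probability, a marginal, and a conditional in a single bottom-up pass each. The variable names, weights, and structure are hypothetical and are not taken from the paper's case study.

```python
def leaf(var, p_one):
    """Bernoulli leaf; a variable absent from the query is marginalized out."""
    def evaluate(assignment):
        value = assignment.get(var)
        if value is None:
            return 1.0  # summing the leaf over both values gives 1
        return p_one if value == 1 else 1.0 - p_one
    return evaluate


def product(*children):
    """Product node over children with disjoint variable scopes."""
    def evaluate(assignment):
        result = 1.0
        for child in children:
            result *= child(assignment)
        return result
    return evaluate


def weighted_sum(weights, children):
    """Sum node: a mixture whose weights add up to 1."""
    def evaluate(assignment):
        return sum(w * c(assignment) for w, c in zip(weights, children))
    return evaluate


# Two hypothetical scenarios over "high_demand" and "high_price".
spn = weighted_sum(
    [0.3, 0.7],
    [
        product(leaf("high_demand", 0.9), leaf("high_price", 0.8)),
        product(leaf("high_demand", 0.2), leaf("high_price", 0.1)),
    ],
)

joint = spn({"high_demand": 1, "high_price": 1})
marginal = spn({"high_demand": 1})
conditional = joint / marginal  # P(high_price = 1 | high_demand = 1)
```

Because marginalization only changes what the leaves return, the cost of each query stays linear in the network size, which is the tractability property the abstract relies on.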
[LG-50] Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
Link: https://arxiv.org/abs/2511.13841
Authors: Zelei Shao, Vikranth Srivatsa, Sanjana Srivastava, Qingyang Wu, Alpay Ariyak, Xiaoxia Wu, Ameen Patel, Jue Wang, Percy Liang, Tri Dao, Ce Zhang, Yiying Zhang, Ben Athiwaratkun, Chenfeng Xu, Junxiong Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck: the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time, and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
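The core idea of drafting from rollout history while leaving outputs unchanged can be sketched as follows. This is a simplification under stated assumptions: a dictionary keyed by fixed-length contexts stands in for the paper's suffix tree, `target_next` is a hypothetical greedy next-token function, and `budget` is a fixed draft length rather than a length-aware policy.

```python
def build_drafter(history, context_len):
    """Map each context window seen in past rollouts to the tokens that followed."""
    table = {}
    for seq in history:
        for i in range(len(seq) - context_len):
            key = tuple(seq[i:i + context_len])
            table[key] = seq[i + context_len:]  # the most recent match wins
    return table


def speculative_decode(target_next, prompt, max_new, table, context_len, budget):
    """Greedy decoding that keeps drafted tokens only when the target agrees."""
    out = list(prompt)
    end = len(prompt) + max_new
    passes = 0
    while len(out) < end:
        draft = table.get(tuple(out[-context_len:]), [])[:budget]
        passes += 1  # one (conceptually batched) verification pass per loop
        for token in draft:
            if len(out) >= end or target_next(out) != token:
                break  # first disagreement: discard the rest of the draft
            out.append(token)
        if len(out) < end:
            out.append(target_next(out))  # the target's own correction or bonus token
    return out, passes


def toy_target(seq):
    """Hypothetical deterministic model: count upward modulo 5."""
    return (seq[-1] + 1) % 5


drafter = build_drafter([[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]], context_len=2)
tokens, passes = speculative_decode(toy_target, [0, 1], 6, drafter, 2, 4)
```

Every emitted token is either confirmed or produced by the target, so the result matches plain greedy decoding; the savings come from needing fewer verification passes than generated tokens. A length-aware policy would raise `budget` for prompts that historically produce long rollouts.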
[LG-51] Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora CIKM2025
Link: https://arxiv.org/abs/2511.13808
Authors: Edward Raff, Ryan R. Curtin, Derek Everett, Robert J. Joyce, James Holt
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments: Published in CIKM 2025
Abstract:A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we’ve consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach will select the top-k items with little error, and we discuss the interplay between theory and engineering required to achieve these results.
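For context, the problem being accelerated is the selection of the k most frequent byte n-grams in a corpus. The sketch below is the exact, naive baseline, not the Zipf-Gramming algorithm; the function name and toy inputs are assumptions of this example.

```python
from collections import Counter


def top_k_byte_ngrams(blobs, n, k):
    """Exact top-k byte n-grams by counting every n-gram in every blob."""
    counts = Counter()
    for blob in blobs:
        for i in range(len(blob) - n + 1):
            counts[blob[i:i + n]] += 1
    return counts.most_common(k)


corpus = [b"abcabc", b"abcd"]
print(top_k_byte_ngrams(corpus, n=3, k=1))
```

The counter holds one entry per distinct n-gram, so memory grows quickly for 6-8 grams over terabytes of input, which is the cost the paper's extractor is designed to avoid.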
[LG-52] Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
Link: https://arxiv.org/abs/2511.13780
Authors: Nihal Mehta
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 0 figures. This work provides a mathematical interpretation of self-attention mechanisms in Transformers through distributional semantics principles
Abstract:This paper presents a mathematical interpretation of self-attention by connecting it to distributional semantics principles. We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Starting from the co-occurrence matrix underlying GloVe embeddings, we demonstrate how the projection naturally captures contextual influence, with the query-key-value mechanism arising as the natural asymmetric extension for modeling directional relationships. Positional encodings and multi-head attention then follow as structured refinements of this same projection principle. Our analysis demonstrates that the Transformer architecture’s particular algebraic form follows from these projection principles rather than being an arbitrary design choice.
[LG-53] Compiling to linear neurons
Link: https://arxiv.org/abs/2511.13769
Authors: Joey Velez-Ginorio, Nada Amin, Konrad Kording, Steve Zdancewic
Categories: Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:
Abstract:We don’t program neural networks directly. Instead, we rely on an indirect style where learning algorithms, like gradient descent, determine a neural network’s function by learning from data. This indirect style is often a virtue; it empowers us to solve problems that were previously impossible. But it lacks discrete structure. We can’t compile most algorithms into a neural network – even if these algorithms could help the network learn. This limitation occurs because discrete algorithms are not obviously differentiable, making them incompatible with the gradient-based learning algorithms that determine a neural network’s function. To address this, we introduce Cajal: a typed, higher-order and linear programming language intended to be a minimal vehicle for exploring a direct style of programming neural networks. We prove Cajal programs compile to linear neurons, allowing discrete algorithms to be expressed in a differentiable form compatible with gradient-based learning. With our implementation of Cajal, we conduct several experiments where we link these linear neurons against other neural networks to determine part of their function prior to learning. Linking with these neurons allows networks to learn faster, with greater data-efficiency, and in a way that’s easier to debug. A key lesson is that linear programming languages provide a path towards directly programming neural networks, enabling a rich interplay between learning and the discrete structures of ordinary programming.
[LG-54] Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels
Link: https://arxiv.org/abs/2511.13764
Authors: Arun Thangamani, Md Asghar Ahmad Shahid, Adam Siemieniuk, Rolf Morel, Renato Golin, Alexander Heinecke
Categories: Machine Learning (cs.LG); Performance (cs.PF); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments:
Abstract:The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise: experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS), both of which add complexity and limit scalability for most ML practitioners. This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging the MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler supporting both vector and tile based CPU instructions. Experiments show that the generated nanokernels are of production quality and competitive with state-of-the-art microkernel libraries.
[LG-55] A Deep Learning Density Shaping Model Predictive Gust Load Alleviation Control of a Compliant Wing Subjected to Atmospheric Turbulence
Link: https://arxiv.org/abs/2511.13745
Authors: Seid H. Pourtakdoust, Amir H. Khodabakhsh
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP)
Comments:
Abstract:This study presents a novel deep learning approach aimed at enhancing stochastic Gust Load Alleviation (GLA) specifically for compliant wings. The approach incorporates the concept of smooth wing camber variation, where the camber of the wing’s chord is actively adjusted during flight using a control signal to achieve the desired aerodynamic loading. The proposed method employs a deep learning-based model predictive controller designed for probability density shaping. This controller effectively solves the probability density evolution equation through a custom Physics-Informed Neural Network (PINN) and utilizes Automatic Differentiation for Model Predictive Control (MPC) optimization. Comprehensive numerical simulations were conducted on a compliant wing (CW) model, evaluating the performance of the proposed approach against stochastic gust profiles. The evaluation involved stochastic aerodynamic loads generated from Band-Limited White Noise (BLWN) and Dryden gust models. The evaluations were conducted for two distinct Compliant Chord Fractions (CCF). The results demonstrate the effectiveness of the proposed probability density shaping model predictive control in alleviating stochastic gust load and reducing wing tip deflection.
[LG-56] Blurred Encoding for Trajectory Representation Learning KDD2025
Link: https://arxiv.org/abs/2511.13741
Authors: Silin Zhou, Yao Chen, Shuo Shang, Lisi Chen, Bingsheng He, Ryosuke Shibasaki
Categories: Machine Learning (cs.LG)
Comments: This paper is accepted by KDD 2025 (Feb. Cycle)
Abstract:Trajectory representation learning (TRL) maps trajectories to vector embeddings and facilitates tasks such as trajectory classification and similarity search. State-of-the-art (SOTA) TRL methods transform raw GPS trajectories to grid or road trajectories to capture high-level travel semantics, i.e., regions and roads. However, they lose fine-grained spatial-temporal details as multiple GPS points are grouped into a single grid cell or road segment. To tackle this problem, we propose the BLUrred Encoding method, dubbed BLUE, which gradually reduces the precision of GPS coordinates to create hierarchical patches with multiple levels. The low-level patches are small and preserve fine-grained spatial-temporal details, while the high-level patches are large and capture overall travel patterns. To complement different patch levels with each other, our BLUE is an encoder-decoder model with a pyramid structure. At each patch level, a Transformer is used to learn the trajectory embedding at the current level, while pooling prepares inputs for the higher level in the encoder, and up-resolution provides guidance for the lower level in the decoder. BLUE is trained using the trajectory reconstruction task with the MSE loss. We compare BLUE with 8 SOTA TRL methods on 3 downstream tasks. The results show that BLUE consistently achieves higher accuracy than all baselines, outperforming the best-performing baselines by an average of 30.90%. Our code is available at this https URL.
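One way to picture the hierarchical patches is to truncate coordinates to fewer decimal places and group consecutive points that land in the same cell. The sketch below does exactly that; it is an assumption about how precision reduction could be implemented, with hypothetical function names and toy coordinates, and it is not the BLUE code.

```python
import math


def blur(point, decimals):
    """Truncate a (lat, lon) pair to the given number of decimal places."""
    factor = 10 ** decimals
    return tuple(math.floor(c * factor) / factor for c in point)


def patches(trajectory, decimals):
    """Group consecutive GPS points that share the same blurred coordinates."""
    groups = []
    for point in trajectory:
        key = blur(point, decimals)
        if groups and groups[-1][0] == key:
            groups[-1][1].append(point)
        else:
            groups.append((key, [point]))
    return groups


trajectory = [
    (39.9041, 116.4071),
    (39.9043, 116.4074),
    (39.9052, 116.4079),
    (39.9155, 116.4181),
]
# Fewer decimals give larger cells and therefore fewer, coarser patches.
patch_counts = [len(patches(trajectory, d)) for d in (3, 2, 1)]
```

Low levels keep many small patches with fine detail, while high levels merge them into a few large patches, mirroring the pyramid described in the abstract.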
[LG-57] Extended Physics Informed Neural Network for Hyperbolic Two-Phase Flow in Porous Media
Link: https://arxiv.org/abs/2511.13734
Authors: Saif Ur Rehman, Wajid Yousuf
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract:The accurate solution of nonlinear hyperbolic partial differential equations (PDEs) remains a central challenge in computational science due to the presence of steep gradients, discontinuities, and multiscale structures that make conventional discretization-based solvers computationally demanding. Physics-Informed Neural Networks (PINNs) embed the governing equations into the learning process, enabling mesh-free solution of PDEs, yet they often struggle to capture steep gradients, discontinuities, and complex nonlinear wave interactions. To address these limitations, this study employs the Extended Physics-Informed Neural Network (XPINN) framework to solve the nonlinear Buckley-Leverett equation with a nonconvex flux function, which models immiscible two-phase flow in porous media. The computational domain is dynamically decomposed in space and time into evolving pre-shock and post-shock regions, allowing localized subnetworks to efficiently learn distinct flow behaviors. Coupling between subnetworks is achieved through the Rankine-Hugoniot jump condition, which enforces physically consistent flux continuity across the moving shock interface. Numerical experiments demonstrate that the proposed XPINN approach accurately captures discontinuous saturation fronts and compound wave interactions without requiring artificial diffusion or entropy corrections. Compared to standard PINNs, the XPINN framework achieves superior stability, faster convergence, and enhanced resolution of nonlinear wave dynamics using smaller, domain-specific models with fewer trainable parameters, establishing it as an effective and scalable tool for solving challenging hyperbolic PDEs in multiphase flow problems. The code of this work is available on this http URL.
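The Rankine-Hugoniot coupling mentioned in the abstract can be written down in a few lines. The sketch below assumes the common Buckley-Leverett fractional-flow function with quadratic relative permeabilities, f(S) = S^2 / (S^2 + M (1 - S)^2); that choice, the parameter `viscosity_ratio`, and the function names are assumptions of this example rather than details taken from the paper.

```python
def fractional_flow(s, viscosity_ratio):
    """Buckley-Leverett flux with quadratic relative permeabilities (assumed form)."""
    water = s * s
    oil = (1.0 - s) ** 2
    return water / (water + viscosity_ratio * oil)


def shock_speed(s_left, s_right, viscosity_ratio):
    """Rankine-Hugoniot condition: speed equals the flux jump over the state jump."""
    jump = s_left - s_right
    if jump == 0.0:
        raise ValueError("no discontinuity: left and right states are equal")
    flux_jump = (fractional_flow(s_left, viscosity_ratio)
                 - fractional_flow(s_right, viscosity_ratio))
    return flux_jump / jump


def interface_residual(s_left, s_right, speed, viscosity_ratio):
    """Quantity an interface loss would drive to zero at the moving shock."""
    flux_jump = (fractional_flow(s_left, viscosity_ratio)
                 - fractional_flow(s_right, viscosity_ratio))
    return flux_jump - speed * (s_left - s_right)


speed = shock_speed(0.8, 0.0, viscosity_ratio=1.0)
residual = interface_residual(0.8, 0.0, speed, viscosity_ratio=1.0)
```

In a domain-decomposed network, the pre-shock and post-shock subnetworks would supply `s_left` and `s_right` at the interface, and the squared residual would be added to the training loss.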
[LG-58] Towards a Unified Analysis of Neural Networks in Nonparametric Instrumental Variable Regression: Optimization and Generalization
Link: https://arxiv.org/abs/2511.14710
Authors: Zonghao Chen, Atsushi Nitanda, Arthur Gretton, Taiji Suzuki
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We establish the first global convergence result of neural networks for the two-stage least squares (2SLS) approach in nonparametric instrumental variable regression (NPIV). This is achieved by adopting a lifted perspective through mean-field Langevin dynamics (MFLD). Unlike standard MFLD, however, our setting of 2SLS entails a bilevel optimization problem in the space of probability measures. To address this challenge, we leverage the penalty gradient approach recently developed for bilevel optimization, which formulates bilevel optimization as a Lagrangian problem. This leads to a novel fully first-order algorithm, termed F$^2$BMLD. Apart from the convergence bound, we further provide a generalization bound, revealing an inherent trade-off in the choice of the Lagrange multiplier between optimization and statistical guarantees. Finally, we empirically validate the effectiveness of the proposed method on an offline reinforcement learning benchmark.
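The two-stage structure that makes the problem bilevel is easiest to see in the classical linear, scalar case. The sketch below is that textbook version, not the paper's neural or mean-field method; the data are a hand-built toy in which the instrument `z` is uncorrelated with the confounder `u`.

```python
def ols(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_stage_least_squares(z, x, y):
    """Stage 1: predict x from z. Stage 2: regress y on the prediction."""
    a, b = ols(z, x)
    x_hat = [a * zi + b for zi in z]
    return ols(x_hat, y)


# Toy data: x = z + u and y = 2 x + 3 u, with u acting as a hidden confounder.
z = [0.0, 1.0, 2.0, 3.0]
u = [1.0, -1.0, -1.0, 1.0]
x = [zi + ui for zi, ui in zip(z, u)]
y = [2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]

naive_slope, _ = ols(x, y)                       # biased by the confounder
iv_slope, _ = two_stage_least_squares(z, x, y)   # recovers the causal slope 2
```

The second stage depends on the solution of the first, which is the nesting that the abstract's Lagrangian penalty formulation is designed to handle with first-order updates.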
[LG-59] Doppler Invariant CNN for Signal Classification
Link: https://arxiv.org/abs/2511.14640
Authors: Avi Bagchi, Dwight Hutchenson
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:Radio spectrum monitoring in contested environments motivates the need for reliable automatic signal classification technology. Prior work highlights deep learning as a promising approach, but existing models depend on brute-force Doppler augmentation to achieve real-world generalization, which undermines both training efficiency and interpretability. In this paper, we propose a convolutional neural network (CNN) architecture with complex-valued layers that exploits convolutional shift equivariance in the frequency domain. To establish provable frequency bin shift invariance, we use adaptive polyphase sampling (APS) as pooling layers followed by a global average pooling layer at the end of the network. Using a synthetic dataset of common interference signals, experimental results demonstrate that unlike a vanilla CNN, our model maintains consistent classification accuracy with and without random Doppler shifts despite being trained on no Doppler-shifted examples. Overall, our method establishes an invariance-driven framework for signal classification that offers provable robustness against real-world effects.
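The invariance argument can be checked on a toy signal. The sketch below implements stride-2 adaptive polyphase sampling in its usual form (keep the polyphase component with the larger energy) followed by global average pooling, and compares the result across circular shifts of the input. It is a pure-Python illustration with hypothetical names; no convolutional or complex-valued layers are included.

```python
def circular_shift(signal, k):
    """Rotate a list to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def energy(part):
    """Squared L2 norm; abs() keeps this valid for complex samples."""
    return sum(abs(v) ** 2 for v in part)


def aps_downsample(signal):
    """Stride-2 downsampling that keeps the higher-energy polyphase component."""
    even, odd = signal[0::2], signal[1::2]
    return even if energy(even) >= energy(odd) else odd


def global_average(signal):
    return sum(signal) / len(signal)


x = [1.0, 5.0, 2.0, 7.0, 0.0, 3.0]
pooled = [global_average(aps_downsample(circular_shift(x, k))) for k in range(len(x))]
naive = [global_average(circular_shift(x, k)[0::2]) for k in range(len(x))]
```

Here `pooled` is constant across all shifts, while `naive` (plain stride-2 subsampling) alternates between two values. Shifting an even-length input by one bin swaps the two polyphase components, so selecting by energy picks the same samples either way, provided the energies are not tied.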
[LG-60] Online learning of subgrid-scale models for quasi-geostrophic turbulence in planetary interiors
Link: https://arxiv.org/abs/2511.14581
Authors: Hugo Frezat, Thomas Gastine, Alexandre Fournier
Categories: Fluid Dynamics (physics.flu-dyn); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
Comments: 33 pages, 11 figures, submitted for publication in Journal of Fluid Mechanics
Abstract:The use of machine learning to represent subgrid-scale (SGS) dynamics is now well established in weather forecasting and climate modelling. Recent advances have demonstrated that SGS models trained via “online” end-to-end learning – where the dynamical solver operating on the filtered equations participates in the training – can outperform traditional physics-based approaches. Most studies, however, have focused on idealised periodic domains, neglecting the mechanical boundaries present e.g. in planetary interiors. To address this issue, we consider two-dimensional quasi-geostrophic turbulent flow in an axisymmetric bounded domain that we model using a pseudo-spectral differentiable solver, thereby enabling online learning. We examine three configurations, varying the geometry (between an exponential container and a spherical shell) and the rotation rate. Flow is driven by a prescribed analytical forcing, allowing for precise control over the energy injection scale and an exact estimate of the power input. We evaluate the accuracy of the online-trained SGS model against the reference direct numerical simulation using integral quantities and spectral diagnostics. In all configurations, we show that an SGS model trained on data spanning only one turnover time remains stable and accurate over integrations at least a hundred times longer than the training period. Moreover, we demonstrate the model’s remarkable ability to reproduce slow processes occurring on time scales far exceeding the training duration, such as the inward drift of jets in the spherical shell. These results suggest a promising path towards developing SGS models for planetary and stellar interior dynamics, including dynamo processes.
[LG-61] DeepBlip: Estimating Conditional Average Treatment Effects Over Time
Link: https://arxiv.org/abs/2511.14545
Authors: Haorui Ma, Dennis Frauen, Stefan Feuerriegel
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 42 pages
Abstract:Structural nested mean models (SNMMs) are a principled approach to estimate the treatment effects over time. A particular strength of SNMMs is to break the joint effect of treatment sequences over time into localized, time-specific “blip effects”. This decomposition promotes interpretability through the incremental effects and enables the efficient offline evaluation of optimal treatment policies without re-computation. However, neural frameworks for SNMMs are lacking, as their inherently sequential g-estimation scheme prevents end-to-end, gradient-based training. Here, we propose DeepBlip, the first neural framework for SNMMs, which overcomes this limitation with a novel double optimization trick to enable simultaneous learning of all blip functions. Our DeepBlip seamlessly integrates sequential neural networks like LSTMs or transformers to capture complex temporal dependencies. By design, our method correctly adjusts for time-varying confounding to produce unbiased estimates, and its Neyman-orthogonal loss function ensures robustness to nuisance model misspecification. Finally, we evaluate our DeepBlip across various clinical datasets, where it achieves state-of-the-art performance.
[LG-62] Improved Convergence in Parameter-Agnostic Error Feedback through Momentum
Link: https://arxiv.org/abs/2511.14501
Authors: Abdurakhmon Sadiev, Yury Demidovich, Igor Sokolov, Grigory Malinovsky, Sarit Khirirat, Peter Richtárik
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 50 pages, 12 figures
Abstract:Communication compression is essential for scalable distributed training of modern machine learning models, but it often degrades convergence due to the noise it introduces. Error Feedback (EF) mechanisms are widely adopted to mitigate this issue of distributed compression algorithms. Despite their popularity and training efficiency, existing distributed EF algorithms often require prior knowledge of problem parameters (e.g., smoothness constants) to fine-tune stepsizes. This limits their practical applicability, especially in large-scale neural network training. In this paper, we study normalized error feedback algorithms that combine EF with normalized updates, various momentum variants, and parameter-agnostic, time-varying stepsizes, thus eliminating the need for problem-dependent tuning. We analyze the convergence of these algorithms for minimizing smooth functions, and establish parameter-agnostic complexity bounds that are close to the best-known bounds with carefully-tuned problem-dependent stepsizes. Specifically, we show that normalized EF21 achieves a convergence rate of nearly $O(1/T^{1/4})$ for Polyak’s heavy-ball momentum, $O(1/T^{2/7})$ for Iterative Gradient Transport (IGT), and $O(1/T^{1/3})$ for STORM and Hessian-corrected momentum. Our results hold with decreasing stepsizes and small mini-batches. Finally, our empirical experiments confirm our theoretical insights.
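The two building blocks named in the abstract, the EF21 estimator update and a normalized step, can be sketched on a single worker as follows. The top-k compressor, the function names, and the toy vectors are assumptions of this example; the momentum variants and the stepsize schedule analyzed in the paper are omitted.

```python
import math


def top_k(vector, k):
    """Contractive compressor: keep the k largest-magnitude entries."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def ef21_update(estimate, gradient, k):
    """EF21: transmit only the compressed gap between gradient and estimate."""
    delta = top_k([g - e for g, e in zip(gradient, estimate)], k)
    return [e + d for e, d in zip(estimate, delta)]


def normalized_step(params, direction, stepsize):
    """Move a distance of exactly `stepsize` against the given direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return list(params)
    return [p - stepsize * d / norm for p, d in zip(params, direction)]


# With a fixed gradient, repeated EF21 updates close the gap one entry at a time.
gradient = [1.0, -3.0, 2.0]
estimate = [0.0, 0.0, 0.0]
for _ in range(3):
    estimate = ef21_update(estimate, gradient, k=1)

params = normalized_step([0.0, 0.0], [3.0, 4.0], stepsize=0.5)
```

Because the step length is set by `stepsize` alone, it does not depend on the gradient scale, which is what lets the schedule be chosen without knowing smoothness constants.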
[LG-63] Skewness-Robust Causal Discovery in Location-Scale Noise Models
Link: https://arxiv.org/abs/2511.14441
Authors: Daniel Klippert,Alexander Marx
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:To distinguish Markov equivalent graphs in causal discovery, it is necessary to restrict the structural causal model. Crucially, we need to be able to distinguish cause X from effect Y in bivariate models, that is, distinguish the two graphs X \to Y and Y \to X. Location-scale noise models (LSNMs), in which the effect Y is modeled based on the cause X as Y = f(X) + g(X)N, form a flexible class of models that is general and identifiable in most cases. Estimating these models for arbitrary noise terms N, however, is challenging. Therefore, practical estimators are typically restricted to symmetric distributions, such as the normal distribution. As we showcase in this paper, when N is a skewed random variable, which is likely in real-world domains, the reliability of these approaches decreases. To address this limitation, we propose SkewD, a likelihood-based algorithm for bivariate causal discovery under LSNMs with skewed noise distributions. SkewD extends the usual normal-distribution framework to the skew-normal setting, enabling reliable inference under symmetric and skewed noise. For parameter estimation, we employ a combination of a heuristic search and an expectation conditional maximization algorithm. We evaluate SkewD on novel synthetically generated datasets with skewed noise as well as established benchmark datasets. Throughout our experiments, SkewD exhibits strong performance and, in comparison to prior work, remains robust under high skewness.
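To make the data-generating model Y = f(X) + g(X)N concrete, the following sketch samples from a location-scale noise model with skew-normal noise. It only illustrates the setting, not SkewD itself: the functions `f` and `g`, the shape parameter, and the sample size are hypothetical. The noise uses the standard representation of a skew-normal variate as a weighted sum of a half-normal and a normal draw.

```python
import math
import random

def skew_normal(alpha, rng):
    """Draw one standard skew-normal variate with shape parameter alpha."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return delta * abs(z0) + math.sqrt(1.0 - delta * delta) * z1

def sample_lsnm(n, alpha, seed=0):
    """Sample (x, y) pairs from Y = f(X) + g(X) * N with assumed f and g."""
    rng = random.Random(seed)
    f = lambda x: math.sin(x)           # location function (hypothetical)
    g = lambda x: 0.5 + 0.25 * x * x    # strictly positive scale function (hypothetical)
    pairs, noise = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        eps = skew_normal(alpha, rng)
        pairs.append((x, f(x) + g(x) * eps))
        noise.append(eps)
    return pairs, noise

def sample_skewness(values):
    """Third standardized moment of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

pairs, noise = sample_lsnm(5000, alpha=5.0)
```

Setting `alpha=0.0` recovers symmetric normal noise, which is the case that estimators restricted to the normal distribution are built for.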
[LG-64] Statistically controllable microstructure reconstruction framework for heterogeneous materials using sliced-Wasserstein metric and neural networks
Link: https://arxiv.org/abs/2511.14268
Authors: Zhenchuan Ma,Qizhi Teng,Pengcheng Yan,Lindong Li,Kirill M. Gerke,Marina V. Karsanina,Xiaohai He
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:
Abstract:Heterogeneous porous materials play a crucial role in various engineering systems. Microstructure characterization and reconstruction provide effective means for modeling these materials, which are critical for conducting physical property simulations, structure-property linkage studies, and enhancing their performance across different applications. To achieve superior controllability and applicability with small sample sizes, we propose a statistically controllable microstructure reconstruction framework that integrates neural networks with sliced-Wasserstein metric. Specifically, our approach leverages local pattern distribution for microstructure characterization and employs a controlled sampling strategy to generate target distributions that satisfy given conditional parameters. A neural network-based model establishes the mapping from the input distribution to the target local pattern distribution, enabling microstructure reconstruction. Combinations of sliced-Wasserstein metric and gradient optimization techniques minimize the distance between these distributions, leading to a stable and reliable model. Our method can perform stochastic and controllable reconstruction tasks even with small sample sizes. Additionally, it can generate large-size (e.g. 512 and 1024) 3D microstructures using a chunking strategy. By introducing spatial location masks, our method excels at generating spatially heterogeneous and complex microstructures. We conducted experiments on stochastic reconstruction, controllable reconstruction, heterogeneous reconstruction, and large-size microstructure reconstruction across various materials. Comparative analysis through visualization, statistical measures, and physical property simulations demonstrates the effectiveness, providing new insights and possibilities for research on structure-property linkage and material inverse design.
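For readers unfamiliar with the sliced-Wasserstein metric mentioned here, a minimal Monte Carlo estimator is sketched below: project both sample sets onto random unit directions, sort the one-dimensional projections, and average the squared differences. This is a generic sketch, not the paper's pipeline. It assumes both sets contain the same number of samples, and the number of projections and the toy data are arbitrary.

```python
import numpy as np

def sliced_wasserstein2(x, y, n_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same number of samples n.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions
    px = np.sort(x @ dirs.T, axis=0)   # sorted 1-D projections, shape (n, n_projections)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean((px - py) ** 2))

rng = np.random.default_rng(1)
a = rng.normal(size=(200, 3))
b = a + np.array([2.0, 0.0, 0.0])   # the same cloud shifted along the first axis
```

Because each projection reduces to sorting, the estimate is cheap and differentiable almost everywhere, which is one reason it is commonly paired with gradient-based optimization.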
[LG-65] Causal Discovery on Higher-Order Interactions
Link: https://arxiv.org/abs/2511.14206
Authors: Alessio Zanga,Marco Scutari,Fabio Stella
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 16 pages, 2 figures
Abstract:Causal discovery combines data with knowledge provided by experts to learn the DAG representing the causal relationships between a given set of variables. When data are scarce, bagging is used to measure our confidence in an average DAG obtained by aggregating bootstrapped DAGs. However, the aggregation step has received little attention from the specialized literature: the average DAG is constructed using only the confidence in the individual edges of the bootstrapped DAGs, thus disregarding complex higher-order edge structures. In this paper, we introduce a novel theoretical framework based on higher-order structures and describe a new DAG aggregation algorithm. We perform a simulation study, discussing the advantages and limitations of the proposed approach. Our proposal is both computationally efficient and effective, outperforming state-of-the-art solutions, especially in low sample size regimes and under high dimensionality settings.
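The edge-wise aggregation that this abstract describes as the existing practice can be sketched as follows: compute the fraction of bootstrapped DAGs containing each directed edge and keep the edges above a threshold. This illustrates the baseline only, not the higher-order algorithm the paper proposes. The three adjacency matrices and the 0.5 threshold are made up for the example.

```python
import numpy as np

def edge_confidence(dags):
    """Fraction of bootstrapped DAGs that contain each directed edge."""
    return np.mean(np.stack(dags).astype(float), axis=0)

def threshold_average(dags, threshold=0.5):
    """Edge-wise aggregation: keep edge i -> j if its confidence exceeds the threshold."""
    return (edge_confidence(dags) > threshold).astype(int)

# Three hypothetical bootstrapped DAGs over variables (A, B, C) as adjacency matrices.
dags = [
    np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]]),   # A -> B -> C
    np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]]),   # A -> B, A -> C
    np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]]),   # A -> B -> C
]
avg = threshold_average(dags)
```

Each edge is judged in isolation here, so joint patterns such as "B -> C appears only together with A -> B" are invisible to the aggregation, and in general the thresholded result is not guaranteed to be acyclic.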
[LG-66] Imaging with super-resolution in changing random media
Link: https://arxiv.org/abs/2511.14147
Authors: Alexander Christie,Matan Leibovich,Miguel Moscoso,Alexei Novikov,George Papanicolaou,Chrysoula Tsogka
Subjects: Optics (physics.optics); Machine Learning (cs.LG)
Comments:
Abstract:We develop an imaging algorithm that exploits strong scattering to achieve super-resolution in changing random media. The method processes large and diverse array datasets using sparse dictionary learning, clustering, and multidimensional scaling. Starting from random initializations, the algorithm reliably extracts the unknown medium properties necessary for accurate imaging using back-propagation, \ell_2 or \ell_1 methods. Remarkably, scattering enhances resolution beyond homogeneous medium limits. When abundant data are available, the algorithm allows the realization of super-resolution in imaging.
[LG-67] SCOPE: Spectral Concentration by Distributionally Robust Joint Covariance-Precision Estimation
Link: https://arxiv.org/abs/2511.14146
Authors: Renjie Chen,Viet Anh Nguyen,Huifu Xu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We propose a distributionally robust formulation for simultaneously estimating the covariance matrix and the precision matrix of a random vector. The proposed model minimizes the worst-case weighted sum of the Frobenius loss of the covariance estimator and Stein’s loss of the precision matrix estimator against all distributions from an ambiguity set centered at the nominal distribution. The radius of the ambiguity set is measured via convex spectral divergence. We demonstrate that the proposed distributionally robust estimation model can be reduced to a convex optimization problem, thereby yielding quasi-analytical estimators. The joint estimators are shown to be nonlinear shrinkage estimators. The eigenvalues of the estimators are shrunk nonlinearly towards a positive scalar, where the scalar is determined by the weight coefficient of the loss terms. By tuning the coefficient carefully, the shrinkage corrects the spectral bias of the empirical covariance/precision matrix estimator. By this property, we call the proposed joint estimator the Spectral concentrated COvariance and Precision matrix Estimator (SCOPE). We demonstrate that the shrinkage effect improves the condition number of the estimator. We provide a parameter-tuning scheme that adjusts the shrinkage target and intensity that is asymptotically optimal. Numerical experiments on synthetic and real data show that our shrinkage estimators perform competitively against state-of-the-art estimators in practical applications.
[LG-68] A Patient-Independent Neonatal Seizure Prediction Model Using Reduced Montage EEG and ECG
Link: https://arxiv.org/abs/2511.14110
Authors: Sithmini Ranasingha,Agasthi Haputhanthri,Hansa Marasinghe,Nima Wickramasinghe,Kithmin Wickremasinghe,Jithangi Wanigasinghe,Chamira U. S. Edussooriya,Joshua P. Kulasingham
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures
Abstract:Neonates are highly susceptible to seizures, often leading to short or long-term neurological impairments. However, clinical manifestations of neonatal seizures are subtle and often lead to misdiagnoses. This increases the risk of prolonged, untreated seizure activity and subsequent brain injury. Continuous video electroencephalogram (cEEG) monitoring is the gold standard for seizure detection. However, this is an expensive evaluation that requires expertise and time. In this study, we propose a convolutional neural network-based model for early prediction of neonatal seizures by distinguishing between interictal and preictal states of the EEG. Our model is patient-independent, enabling generalization across multiple subjects, and utilizes mel-frequency cepstral coefficient matrices extracted from multichannel EEG and electrocardiogram (ECG) signals as input features. Trained and validated on the Helsinki neonatal EEG dataset with 10-fold cross-validation, the proposed model achieved an average accuracy of 97.52%, sensitivity of 98.31%, specificity of 96.39%, and F1-score of 97.95%, enabling accurate seizure prediction up to 30 minutes before onset. The inclusion of ECG alongside EEG improved the F1-score by 1.42%, while the incorporation of an attention mechanism yielded an additional 0.5% improvement. To enhance transparency, we incorporated SHapley Additive exPlanations (SHAP) as an explainable artificial intelligence method to interpret the model and provided localization of seizure focus using scalp plots. The overall results demonstrate the model’s potential for minimally supervised deployment in neonatal intensive care units, enabling timely and reliable prediction of neonatal seizures, while demonstrating strong generalization capability across unseen subjects through transfer learning.
[LG-69] Wasserstein Distributionally Robust Nash Equilibrium Seeking with Heterogeneous Data: A Lagrangian Approach
Link: https://arxiv.org/abs/2511.14048
Authors: Zifan Wang,Georgios Pantazis,Sergio Grammatico,Michael M. Zavlanos,Karl H. Johansson
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:We study a class of distributionally robust games where agents are allowed to heterogeneously choose their risk aversion with respect to distributional shifts of the uncertainty. In our formulation, heterogeneous Wasserstein ball constraints on each distribution are enforced through a penalty function leveraging a Lagrangian formulation. We then formulate the distributionally robust Nash equilibrium problem and show that under certain assumptions it is equivalent to a finite-dimensional variational inequality problem with a strongly monotone mapping. We then design an approximate Nash equilibrium seeking algorithm and prove convergence of the average regret to a quantity that diminishes with the number of iterations, thus learning the desired equilibrium up to an a priori specified accuracy. Numerical simulations corroborate our theoretical findings.
[LG-70] Splat Regression Models
Link: https://arxiv.org/abs/2511.14042
Authors: Mara Daniels,Philippe Rigollet
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We introduce a highly expressive class of function approximators called Splat Regression Models. Model outputs are mixtures of heterogeneous and anisotropic bump functions, termed splats, each weighted by an output vector. The power of splat modeling lies in its ability to locally adjust the scale and direction of each splat, achieving both high interpretability and accuracy. Fitting splat models reduces to optimization over the space of mixing measures, which can be implemented using Wasserstein-Fisher-Rao gradient flows. As a byproduct, we recover the popular Gaussian Splatting methodology as a special case, providing a unified theoretical framework for this state-of-the-art technique that clearly disambiguates the inverse problem, the model, and the optimization algorithm. Through numerical experiments, we demonstrate that the resulting models and algorithms constitute a flexible and promising approach for solving diverse approximation, estimation, and inverse problems involving low-dimensional data.
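One way to picture a model whose output is a mixture of anisotropic bumps, each weighted by an output vector, is the evaluation routine below. It is a sketch under assumptions rather than the paper's definition: the bump is taken to be an unnormalized Gaussian, the scale matrices `A_k` are assumed invertible, and all parameter values are invented.

```python
import numpy as np

def splat_model(x, centers, scales, outputs):
    """Evaluate f(x) = sum_k outputs[k] * exp(-0.5 * ||A_k^{-1} (x - mu_k)||^2).

    centers: (K, d) splat locations mu_k
    scales:  (K, d, d) invertible matrices A_k setting each splat's scale and direction
    outputs: (K, m) output vectors
    """
    result = np.zeros(outputs.shape[1])
    for mu, A, v in zip(centers, scales, outputs):
        z = np.linalg.solve(A, x - mu)        # whitened offset A_k^{-1} (x - mu_k)
        result += v * np.exp(-0.5 * z @ z)    # Gaussian bump shape is an assumption
    return result

centers = np.array([[0.0, 0.0], [3.0, 0.0]])
scales = np.array([np.diag([1.0, 0.2]), np.diag([0.5, 0.5])])
outputs = np.array([[1.0, -1.0], [0.0, 2.0]])
value = splat_model(np.array([0.0, 0.0]), centers, scales, outputs)
```

Evaluated at the first center, the first splat contributes its full output vector while the distant second splat contributes almost nothing, which is the locality that makes each component easy to interpret.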
[LG-71] A Brain Wave Encodes a Thousand Tokens: Modeling Inter-Cortical Neural Interactions for Effective EEG-based Emotion Recognition
Link: https://arxiv.org/abs/2511.13954
Authors: Nilay Kumar,Priyansh Bhandari,G. Maragatham
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:
Abstract:Human emotions are difficult to convey through words and are often abstracted in the process; however, electroencephalogram (EEG) signals can offer a more direct lens into emotional brain activity. Recent studies show that deep learning models can process these signals to perform emotion recognition with high accuracy. However, many existing approaches overlook the dynamic interplay between distinct brain regions, which can be crucial to understanding how emotions unfold and evolve over time, potentially aiding in more accurate emotion recognition. To address this, we propose RBTransformer, a Transformer-based neural network architecture that models inter-cortical neural dynamics of the brain in latent space to better capture structured neural interactions for effective EEG-based emotion recognition. First, the EEG signals are converted into Band Differential Entropy (BDE) tokens, which are then passed through Electrode Identity embeddings to retain spatial provenance. These tokens are processed through successive inter-cortical multi-head attention blocks that construct an electrode x electrode attention matrix, allowing the model to learn the inter-cortical neural dependencies. The resulting features are then passed through a classification head to obtain the final prediction. We conducted extensive experiments, specifically under subject-dependent settings, on the SEED, DEAP, and DREAMER datasets, over all three dimensions, Valence, Arousal, and Dominance (for DEAP and DREAMER), under both binary and multi-class classification settings. The results demonstrate that the proposed RBTransformer outperforms all previous state-of-the-art methods across all three datasets, over all three dimensions under both classification settings. The source code is available at: this https URL.
[LG-72] Empirical Likelihood for Random Forests and Ensembles
Link: https://arxiv.org/abs/2511.13934
Authors: Harold D. Chiang,Yukitoshi Matsushita,Taisuke Otsu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST)
Comments: 34 pages, 1 figure
Abstract:We develop an empirical likelihood (EL) framework for random forests and related ensemble methods, providing a likelihood-based approach to quantify their statistical uncertainty. Exploiting the incomplete U-statistic structure inherent in ensemble predictions, we construct an EL statistic that is asymptotically chi-squared when subsampling induced by incompleteness is not overly sparse. Under sparser subsampling regimes, the EL statistic tends to over-cover due to loss of pivotality; we therefore propose a modified EL that restores pivotality through a simple adjustment. Our method retains key properties of EL while remaining computationally efficient. Theory for honest random forests and simulations demonstrate that modified EL achieves accurate coverage and practical reliability relative to existing inference methods.
[LG-73] Uncertainty-Calibrated Prediction of Randomly-Timed Biomarker Trajectories with Conformal Bands
Link: https://arxiv.org/abs/2511.13911
Authors: Vasiliki Tassopoulou,Charis Stamouli,Haochang Shou,George J. Pappas,Christos Davatzikos
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:Despite recent progress in predicting biomarker trajectories from real clinical data, uncertainty in the predictions poses high-stakes risks (e.g., misdiagnosis) that limit their clinical deployment. To enable safe and reliable use of such predictions in healthcare, we introduce a conformal method for uncertainty-calibrated prediction of biomarker trajectories resulting from randomly-timed clinical visits of patients. Our approach extends conformal prediction to the setting of randomly-timed trajectories via a novel nonconformity score that produces prediction bands guaranteed to cover the unknown biomarker trajectories with a user-prescribed probability. We apply our method across a wide range of standard and state-of-the-art predictors for two well-established brain biomarkers of Alzheimer’s disease, using neuroimaging data from real clinical studies. We observe that our conformal prediction bands consistently achieve the desired coverage, while also being tighter than baseline prediction bands. To further account for population heterogeneity, we develop group-conditional conformal bands and test their coverage guarantees across various demographic and clinically relevant subpopulations. Moreover, we demonstrate the clinical utility of our conformal bands in identifying subjects at high risk of progression to Alzheimer’s disease. Specifically, we introduce an uncertainty-calibrated risk score that enables the identification of 17.5% more high-risk subjects compared to standard risk scores, highlighting the value of uncertainty calibration in real-world clinical decision making. Our code is available at this http URL.
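The general recipe behind a conformal prediction band can be sketched with plain split conformal prediction: score each calibration trajectory, take a finite-sample-corrected quantile of the scores, and widen the point prediction by that radius. The abstract states that the paper introduces its own nonconformity score for randomly-timed visits, so the max-absolute-error score used here is only a stand-in assumption, and the calibration data are invented.

```python
import math

def conformal_radius(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf   # too few calibration trajectories for this coverage level
    return sorted(calibration_scores)[rank - 1]

def trajectory_score(observed, predicted):
    """Stand-in nonconformity score: largest absolute error over a subject's visits."""
    return max(abs(o - p) for o, p in zip(observed, predicted))

def prediction_band(predicted, radius):
    """Symmetric band of constant width around the predicted trajectory."""
    return [(p - radius, p + radius) for p in predicted]

# Toy calibration set: (observed, predicted) biomarker values at each subject's own visits.
calibration = [
    ([1.0, 1.2], [1.1, 1.0]),
    ([0.8, 0.9, 1.1], [0.8, 1.0, 1.4]),
    ([2.0], [1.9]),
    ([1.5, 1.4], [1.3, 1.4]),
    ([1.0], [1.05]),
]
scores = [trajectory_score(o, p) for o, p in calibration]
radius = conformal_radius(scores, alpha=0.2)
band = prediction_band([1.0, 1.1], radius)
```

Taking the maximum over a subject's visits lets trajectories with different numbers of visits share one score scale, so the band is meant to cover a whole trajectory at once rather than each visit separately.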
[LG-74] A Disentangled Low-Rank RNN Framework for Uncovering Neural Connectivity and Dynamics
Link: https://arxiv.org/abs/2511.13899
Authors: Chengrui Li,Yunmiao Wang,Yule Wang,Weihan Li,Dieter Jaeger,Anqi Wu
Subjects: Neurons and Cognition (q-bio.NC); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:
Abstract:Low-rank recurrent neural networks (lrRNNs) are a class of models that uncover low-dimensional latent dynamics underlying neural population activity. Although their functional connectivity is low-rank, it lacks disentanglement interpretations, making it difficult to assign distinct computational roles to different latent dimensions. To address this, we propose the Disentangled Recurrent Neural Network (DisRNN), a generative lrRNN framework that assumes group-wise independence among latent dynamics while allowing flexible within-group entanglement. These independent latent groups allow latent dynamics to evolve separately, but are internally rich for complex computation. We reformulate the lrRNN under a variational autoencoder (VAE) framework, enabling us to introduce a partial correlation penalty that encourages disentanglement between groups of latent dimensions. Experiments on synthetic, monkey M1, and mouse voltage imaging data show that DisRNN consistently improves the disentanglement and interpretability of learned neural latent trajectories in low-dimensional space and low-rank connectivity over baseline lrRNNs that do not encourage partial disentanglement.
[LG-75] QUASAR: An Evolutionary Algorithm to Accelerate High-Dimensional Optimization
Link: https://arxiv.org/abs/2511.13843
Authors: Julian Soltes
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Source code found at this https URL
Abstract:High-dimensional numerical optimization presents a persistent challenge. This paper introduces Quasi-Adaptive Search with Asymptotic Reinitialization (QUASAR), an evolutionary algorithm to accelerate convergence in complex, non-differentiable problems afflicted by the curse of dimensionality. Evaluated on the notoriously difficult CEC2017 benchmark suite of 29 functions, QUASAR achieved the lowest overall rank sum (150) using the Friedman test, significantly outperforming L-SHADE (229) and standard DE (305) in the dimension-variant trials. QUASAR also proves computationally efficient, with run times averaging 1.4× faster than DE and 7.8× faster than L-SHADE (p \ll 0.001) in the population-variant trials. Building upon Differential Evolution (DE), QUASAR introduces a highly stochastic architecture to dynamically balance exploration and exploitation. Inspired by the probabilistic behavior of quantum particles in a stellar core, the algorithm implements three primary components that augment standard DE mechanisms: 1) probabilistically selected mutation strategies and scaling factors; 2) rank-based crossover rates; 3) asymptotically decaying reinitialization that leverages a covariance matrix of the best solutions to introduce high-quality genetic diversity. QUASAR’s performance establishes it as an effective, user-friendly optimizer for complex high-dimensional problems.
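Since QUASAR is described as building upon Differential Evolution, a compact version of the classic DE/rand/1/bin loop is sketched below for orientation. It shows only the standard baseline, with none of QUASAR's three additions. The population size, `F`, `CR`, the generation count, and the sphere test function are arbitrary choices for the example.

```python
import random

def differential_evolution(f, bounds, pop_size=20, F=0.5, CR=0.9, generations=100, seed=0):
    """Classic DE/rand/1/bin; returns (best_vector, best_value, initial_best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [f(x) for x in pop]
    initial_best = min(fit)
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct donors, all different from the current individual.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)   # guarantees at least one mutated coordinate
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < CR or j == j_rand:
                    value = pop[a][j] + F * (pop[b][j] - pop[c][j])   # differential mutation
                    value = min(max(value, lo), hi)                  # clip to the search box
                else:
                    value = pop[i][j]                                # binomial crossover keeps parent gene
                trial.append(value)
            trial_fit = f(trial)
            if trial_fit <= fit[i]:       # greedy one-to-one selection
                pop[i], fit[i] = trial, trial_fit
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best], initial_best

sphere = lambda x: sum(v * v for v in x)
best_x, best_f, initial_f = differential_evolution(sphere, [(-5.0, 5.0)] * 10)
```

The greedy selection step means the best fitness in the population never gets worse from one generation to the next, which is the property that variants with randomized strategies or reinitialization have to preserve or deliberately relax.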
[LG-76] CellStream: Dynamical Optimal Transport Informed Embeddings for Reconstructing Cellular Trajectories from Snapshots Data AAAI2026
Link: https://arxiv.org/abs/2511.13786
Authors: Yue Ling,Peiqi Zhang,Zhenyi Zhang,Peijie Zhou
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments: Published as a conference paper at AAAI 2026 (oral)
Abstract:Single-cell RNA sequencing (scRNA-seq), especially temporally resolved datasets, enables genome-wide profiling of gene expression dynamics at single-cell resolution across discrete time points. However, current technologies provide only sparse, static snapshots of cell states and are inherently influenced by technical noise, complicating the inference and representation of continuous transcriptional dynamics. Although embedding methods can reduce dimensionality and mitigate technical noise, the majority of existing approaches typically treat trajectory inference separately from embedding construction, often neglecting temporal structure. To address this challenge, here we introduce CellStream, a novel deep learning framework that jointly learns embedding and cellular dynamics from single-cell snapshot data by integrating an autoencoder with unbalanced dynamical optimal transport. Compared to existing methods, CellStream generates dynamics-informed embeddings that robustly capture temporal developmental processes while maintaining high consistency with the underlying data manifold. We demonstrate CellStream’s effectiveness on both simulated datasets and real scRNA-seq data, including spatial transcriptomics. Our experiments indicate significant quantitative improvements over state-of-the-art methods in representing cellular trajectories with enhanced temporal coherence and reduced noise sensitivity. Overall, CellStream provides a new tool for learning and representing continuous streams from the noisy, static snapshots of single-cell gene expression.
[LG-77] Knowledge vs. Experience: Asymptotic Limits of Impatience in Edge Tenants
Link: https://arxiv.org/abs/2511.13763
Authors: Anthony Kiggundu,Bin Han,Hans D. Schotten
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Submitted to IEEE ICC 2026
Abstract:We study how two information feeds, a closed-form Markov estimator of residual sojourn and an online-trained actor-critic, affect reneging and jockeying in a dual M/M/1 system. Analytically, for unequal service rates and total-time patience, we show that total wait grows linearly so abandonment is inevitable and the probability of a successful jockey vanishes as the backlog approaches infinity. Furthermore, under a mild sub-linear error condition both information models yield the same asymptotic limits (robustness). We empirically validate these limits and quantify finite backlog differences. Our findings show that learned and analytic feeds produce different delays, reneging rates and transient jockeying behavior at practical sizes, but converge to the same asymptotic outcome implied by our theory. The results characterize when value-of-information matters (finite regimes) and when it does not (asymptotics), informing lightweight telemetry and decision-logic design for low-cost, jockeying-aware systems.
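The linear growth of waiting time with backlog can be seen from a textbook fact: in a first-come-first-served M/M/1 queue, a job with n jobs ahead of it expects to spend (n + 1)/μ in the system, where μ is the service rate. The sketch below turns that into a toy stay, jockey, or renege rule. The decision rule and the way patience is compared are simplifying assumptions for illustration, not the paper's estimator or policy.

```python
def expected_sojourn(jobs_ahead, service_rate):
    """Expected time to clear jobs_ahead jobs plus one's own service (M/M/1, FCFS)."""
    return (jobs_ahead + 1) / service_rate

def decide(position, other_queue_len, mu_here, mu_other, patience):
    """Hypothetical tenant rule: stay, jockey to the other queue, or renege."""
    stay = expected_sojourn(position, mu_here)
    move = expected_sojourn(other_queue_len, mu_other)   # joining the back of the other queue
    if min(stay, move) > patience:
        return "renege"
    return "jockey" if move < stay else "stay"

small_backlog = decide(position=4, other_queue_len=1, mu_here=1.0, mu_other=2.0, patience=10.0)
large_backlog = decide(position=100, other_queue_len=100, mu_here=1.0, mu_other=2.0, patience=10.0)
```

With a fixed patience budget, both expected waits eventually exceed it as the two queues grow, so the rule ends in reneging regardless of which queue is faster.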
[LG-78] THD-BAR: Topology Hierarchical Derived Brain Autoregressive Modeling for EEG Generic Representations
Link: https://arxiv.org/abs/2511.13733
Authors: Wenchao Yang,Weidong Yan,Wenkang Liu,Yulan Ma,Yang Li
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:Large-scale pre-trained models hold significant potential for learning universal EEG representations. However, most existing methods, particularly autoregressive (AR) frameworks, primarily rely on straightforward temporal sequencing of multi-channel EEG data, which fails to capture the rich physiological characteristics inherent to EEG signals. Moreover, their time-centered modeling approach also limits the effective representation of the dynamic spatial topology of brain activity. To address these challenges and fully exploit the potential of large-scale EEG models, we propose a novel Topology Hierarchical Derived Brain Autoregressive Modeling (THD-BAR) for EEG generic representations. The core innovation of THD-BAR lies in the introduction of the Brain Topology Hierarchy (BTH), which establishes a multi-scale spatial order for EEG channels. This hierarchical structure enables a redefinition of autoregressive learning as a “next-scale-time prediction” problem, effectively capturing both spatial and temporal dynamics. Based on BTH, we design a Topology-Hierarchical Vector Quantized-Variational Autoencoder (THVQ-VAE) for multi-scale tokenization and develop an enhanced Brain Autoregressive (BAR) module with specialized masking strategies for prediction. Through extensive large-scale pre-training on 17 datasets, followed by rigorous validation on 10 downstream datasets spanning 5 distinct tasks, THD-BAR consistently outperforms existing methods. These results highlight the superior generalization and modeling capabilities of our proposed approach.
[LG-79] Principled Coarse-Grained Acceptance for Speculative Decoding in Speech
Link: https://arxiv.org/abs/2511.13732
Authors: Moran Yanuka,Paul Dixon,Eyal Finkelshtein,Daniel Rotman,Raja Giryes
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments:
Abstract:Speculative decoding accelerates autoregressive speech generation by letting a fast draft model propose tokens that a larger target model verifies. However, for speech LLMs that generate acoustic tokens, exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups. We introduce Principled Coarse-Graining (PCG), which verifies proposals at the level of Acoustic Similarity Groups (ASGs) derived from the target model’s embedding space. By splitting each token’s probability mass across the overlapping groups that contain it, we define an overlap-aware coarse-grained distribution and perform rejection sampling on the resulting group variable. This yields an exactness guarantee at the group level while allowing the accepted draft token to stand in for any member of the group in practice. On LibriTTS, PCG increases acceptance and throughput relative to standard speculative decoding and prior speech-specific relaxations while maintaining intelligibility and speaker similarity. These results suggest acoustically aware, group-level acceptance as a simple and general way to accelerate speech token generation while maintaining speech quality.
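The difference between token-level and group-level acceptance can be illustrated numerically. In standard speculative decoding, a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution. The sketch below compares that with acceptance computed on group probabilities. It assumes disjoint groups for simplicity, whereas the abstract describes overlapping groups with probability mass split across them, and the distributions and grouping are invented.

```python
def token_acceptance(p_target, q_draft, token):
    """Standard speculative decoding: accept a drafted token with probability min(1, p/q)."""
    return min(1.0, p_target[token] / q_draft[token])

def group_acceptance(p_target, q_draft, groups, token):
    """Coarse-grained variant for disjoint groups (a simplification of overlapping ASGs)."""
    group = next(g for g in groups if token in g)
    p_group = sum(p_target[t] for t in group)
    q_group = sum(q_draft[t] for t in group)
    return min(1.0, p_group / q_group)

p_target = [0.1, 0.7, 0.2]   # target model probabilities over three acoustic tokens
q_draft = [0.6, 0.2, 0.2]    # draft model probabilities
groups = [{0, 1}, {2}]       # hypothetical grouping: tokens 0 and 1 sound alike

exact = token_acceptance(p_target, q_draft, 0)
coarse = group_acceptance(p_target, q_draft, groups, 0)
```

Here the draft and target disagree on which of two similar tokens to emit but agree on the group, so exact matching accepts token 0 only about one time in six while group-level matching accepts it almost always.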
[LG-80] GegenbauerNet: Finding the Optimal Compromise in the GNN Flexibility-Stability Trade-off
Link: https://arxiv.org/abs/2511.13730
Authors: Huseyin Goksu
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:Spectral Graph Neural Networks (GNNs) operating in the canonical [-1, 1] domain (like ChebyNet and its adaptive generalization, L-JacobiNet) face a fundamental Flexibility-Stability Trade-off. Our previous work revealed a critical puzzle: the 2-parameter adaptive L-JacobiNet often suffered from high variance and was surprisingly outperformed by the 0-parameter, stabilized-static S-JacobiNet. This suggested that stabilization was more critical than adaptation in this domain. In this paper, we propose GegenbauerNet, a novel GNN filter based on the Gegenbauer polynomials, to find the Optimal Compromise in this trade-off. By enforcing symmetry (alpha=beta) but allowing a single shape parameter (lambda) to be learned, GegenbauerNet limits flexibility (variance) while escaping the fixed bias of S-JacobiNet. We demonstrate that GegenbauerNet (1-parameter) achieves superior performance in the key local filtering regime (K=2 on heterophilic graphs) where overfitting is minimal, validating the hypothesis that a controlled, symmetric degree of freedom is optimal. Furthermore, our comprehensive K-ablation study across homophilic and heterophilic graphs, using 7 diverse datasets, clarifies the domain’s behavior: the fully adaptive L-JacobiNet maintains the highest performance on high-K filtering tasks, showing the value of maximum flexibility when regularization is managed. This study provides crucial design principles for GNN developers, showing that in the [-1, 1] spectral domain, the optimal filter depends critically on the target locality (K) and the acceptable level of design bias.
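A polynomial graph filter in the Gegenbauer basis can be evaluated with the standard three-term recurrence C_0 = 1, C_1(t) = 2λt, and n·C_n(t) = 2t(n + λ − 1)·C_{n−1}(t) − (n + 2λ − 2)·C_{n−2}(t). The sketch below applies that recurrence to an operator whose spectrum lies in [-1, 1]. It is an assumption-laden illustration rather than the GegenbauerNet layer: the coefficients `theta` and the shape parameter `lam` would be learned in a real model, and `lam` is assumed to be positive.

```python
import numpy as np

def gegenbauer_filter(L_tilde, x, theta, lam):
    """Apply sum_k theta[k] * C_k^(lam)(L_tilde) @ x via the three-term recurrence.

    L_tilde: (n, n) operator with spectrum in [-1, 1]; x: (n,) or (n, f) signal.
    Assumes lam > 0 (lam = 0 makes C_1 vanish and needs a rescaled basis).
    """
    prev = x                                  # C_0(L) x = x
    out = theta[0] * prev
    if len(theta) == 1:
        return out
    curr = 2.0 * lam * (L_tilde @ x)          # C_1(L) x = 2 * lam * L x
    out = out + theta[1] * curr
    for n in range(2, len(theta)):
        nxt = (2.0 * (n + lam - 1.0) * (L_tilde @ curr) - (n + 2.0 * lam - 2.0) * prev) / n
        out = out + theta[n] * nxt
        prev, curr = curr, nxt
    return out

# Sanity check on a 1x1 operator: with lam = 0.5 the basis is Legendre, P_2(t) = (3 t^2 - 1) / 2.
t = 0.3
value = gegenbauer_filter(np.array([[t]]), np.array([1.0]), theta=[0.0, 0.0, 1.0], lam=0.5)
```

With `lam = 1.0` the same routine produces Chebyshev polynomials of the second kind, which shows how a single shape parameter moves the filter between familiar bases.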
[LG-81] Feature weighting for data analysis via evolutionary simulation
Link: https://arxiv.org/abs/2511.06454
Authors: Aris Daniilidis,Alberto Domínguez Corella,Philipp Wissgott
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:We analyze an algorithm for assigning weights prior to scalarization in discrete multi-objective problems arising from data analysis. The algorithm evolves the weights (the relevance of features) by a replicator-type dynamic on the standard simplex, with update indices computed from a normalized data matrix. We prove that the resulting sequence converges globally to a unique interior equilibrium, yielding non-degenerate limiting weights. The method, originally inspired by evolutionary game theory, differs from standard weighting schemes in that it is analytically tractable with provable convergence.
Information Retrieval
[IR-0] NeuCLIRBench: A Modern Evaluation Collection for Monolingual Cross-Language and Multilingual Information Retrieval
Link: https://arxiv.org/abs/2511.14758
Authors: Dawn Lawrie,James Mayfield,Eugene Yang,Andrew Yates,Sean MacAvaney,Ronak Pradeep,Scott Miller,Paul McNamee,Luca Soldani
Subjects: Information Retrieval (cs.IR)
Comments: 14 pages, 1 figure
Abstract:To measure advances in retrieval, test collections with relevance judgments that can faithfully distinguish systems are required. This paper presents NeuCLIRBench, an evaluation collection for cross-language and multilingual retrieval. The collection consists of documents written natively in Chinese, Persian, and Russian, as well as those same documents machine translated into English. The collection supports several retrieval scenarios including: monolingual retrieval in English, Chinese, Persian, or Russian; cross-language retrieval with English as the query language and one of the other three languages as the document language; and multilingual retrieval, again with English as the query language and relevant documents in all three languages. NeuCLIRBench combines the TREC NeuCLIR track topics of 2022, 2023, and 2024. The 250,128 judgments across approximately 150 queries for the monolingual and cross-language tasks and 100 queries for multilingual retrieval provide strong statistical discriminatory power to distinguish retrieval approaches. A fusion baseline of strong neural retrieval systems is included with the collection so that developers of reranking algorithms are no longer reliant on BM25 as their first-stage retriever. NeuCLIRBench is publicly available.
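The abstract mentions a fusion baseline without saying how the runs are fused. One widely used option is reciprocal rank fusion (RRF), in which each document scores the sum of 1/(k + rank) over the input rankings. The sketch below shows RRF purely as an example of what a fusion baseline can look like. Whether NeuCLIRBench uses RRF is not stated in the abstract, and the three runs are invented.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; k = 60 is the value commonly used for RRF."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

runs = [
    ["d1", "d2", "d3"],   # hypothetical output of retrieval system A
    ["d2", "d1", "d4"],   # hypothetical output of retrieval system B
    ["d2", "d3", "d1"],   # hypothetical output of retrieval system C
]
fused = reciprocal_rank_fusion(runs)
```

Because RRF uses only ranks, it can combine systems whose raw scores live on incompatible scales, which is convenient when fusing several neural retrievers.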
[IR-1] Jasper-Token-Compression-600M Technical Report
Link: https://arxiv.org/abs/2511.14405
Authors: Dun Zhang, Ziyang Zeng, Yudong Zhou, Shuyang Lu
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 1 figure
Abstract:This technical report presents the training methodology and evaluation results of the open-source Jasper-Token-Compression-600M model, released in November 2025. Building on previous distillation-based recipes from the English Stella and Jasper models, we successfully extend this approach to a bilingual (English and Chinese) domain, further enhancing model performance through the incorporation of contrastive learning. A key innovation of our model is the introduction of a one-dimensional convolution-based token compression module. We dynamically adjust the compression rate during training, enabling the model to learn more robust and efficient compressed text representations. By combining knowledge distillation with token compression techniques, we achieve significant improvements in both embedding quality and inference efficiency. Our model performs with higher efficiency than a traditional 0.6B model while achieving performance comparable to that of an 8B model. For more information on the model release, visit: this https URL.
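To make the idea of convolution-based token compression concrete, here is a minimal pure-Python sketch. It assumes a depthwise, unpadded 1D convolution along the sequence axis whose stride sets the compression rate. The abstract does not describe the actual module, so the function name, kernels, and toy data are all hypothetical.

```python
def compress_tokens(embeddings, kernel, stride):
    """Depthwise 1D convolution along the sequence axis (no padding).

    embeddings: list of token vectors, each a list of floats of equal length.
    kernel: list of weights shared across all embedding dimensions.
    stride: step between windows; a larger stride gives a shorter output.
    """
    width = len(kernel)
    dim = len(embeddings[0])
    compressed = []
    for start in range(0, len(embeddings) - width + 1, stride):
        window = embeddings[start:start + width]
        compressed.append([
            sum(w * token[d] for w, token in zip(kernel, window))
            for d in range(dim)
        ])
    return compressed


# Eight toy 2-dimensional token embeddings: token i is [i, 2*i].
tokens = [[float(i), 2.0 * i] for i in range(8)]
mean_kernel = [0.5, 0.5]  # averaging kernel; a trained model would learn these weights

half = compress_tokens(tokens, mean_kernel, stride=2)    # 8 tokens -> 4
quarter = compress_tokens(tokens, [0.25] * 4, stride=4)  # 8 tokens -> 2
print(len(half), len(quarter))
```

Changing the kernel width and stride between training batches is one simple way to vary the compression rate, in the spirit of the dynamic adjustment the abstract describes.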
[IR-2] Infer As You Train: A Symmetric Paradigm of Masked Generative for Click-Through Rate Prediction
Link: https://arxiv.org/abs/2511.14403
Authors: Moyu Zhang, Yujun Jin, Yun Chen, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Categories: Information Retrieval (cs.IR)
Comments: 4 pages, 4 tables, 1 figure
Abstract:Generative models are increasingly being explored in the click-through rate (CTR) prediction field to overcome the limitations of the conventional discriminative paradigm, which relies on a simple binary classification objective. However, existing generative models typically confine the generative paradigm to the training phase, primarily for representation learning. During online inference, they revert to a standard discriminative paradigm, failing to leverage their powerful generative capabilities to further improve prediction accuracy. This fundamental asymmetry between the training and inference phases prevents the generative paradigm from realizing its full potential. To address this limitation, we propose the Symmetric Masked Generative Paradigm for CTR prediction (SGCTR), a novel framework that establishes symmetry between the training and inference phases. Specifically, after acquiring generative capabilities by learning feature dependencies during training, SGCTR applies the generative capabilities during online inference to iteratively redefine the features of input samples, which mitigates the impact of noisy features and enhances prediction accuracy. Extensive experiments validate the superiority of SGCTR, demonstrating that applying the generative paradigm symmetrically across both training and inference significantly unlocks its power in CTR prediction.
[IR-3] WebRec: Enhancing LLM-based Recommendations with Attention-guided RAG from Web
Link: https://arxiv.org/abs/2511.14182
Authors: Zihuai Zhao, Yujuan Ding, Wenqi Fan, Qing Li
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Recommender systems play a vital role in alleviating information overload and enriching users’ online experience. In the era of large language models (LLMs), LLM-based recommender systems have emerged as a prevalent paradigm for advancing personalized recommendations. Recently, retrieval-augmented generation (RAG) has drawn growing interest to facilitate the recommendation capability of LLMs, incorporating useful information retrieved from external knowledge bases. However, as a rich source of up-to-date information, the web remains under-explored by existing RAG-based recommendations. In particular, unique challenges are posed from two perspectives: one is to generate effective queries for web retrieval, considering the inherent knowledge gap between web search and recommendations; another challenge lies in harnessing online websites that contain substantial noisy content. To tackle these limitations, we propose WebRec, a novel web-based RAG framework, which takes advantage of the reasoning capability of LLMs to interpret recommendation tasks into queries of user preferences that cater to web retrieval. Moreover, given noisy web-retrieved information, where relevant pieces of evidence are scattered far apart, an insightful MP-Head is designed to enhance LLM attentions between distant tokens of relevant information via message passing. Extensive experiments have been conducted to demonstrate the effectiveness of our proposed web-based RAG methods in recommendation scenarios.
[IR-4] TaoSearchEmb: A Multi-Objective Reinforcement Learning Framework for Dense Retrieval in Taobao Search
Link: https://arxiv.org/abs/2511.13885
Authors: Xingxian Liu, Dongshuai Li, Tao Wen, Jiahui Wan, Gui Ling, Fuyu Lv, Dan Ou, Haihong Tang
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Dense retrieval, as the core component of e-commerce search engines, maps user queries and items into a unified semantic space through pre-trained embedding models to enable large-scale real-time semantic retrieval. Despite the rapid advancement of LLMs gradually replacing traditional BERT architectures for embedding, their training paradigms still adhere to BERT-like supervised fine-tuning and hard negative mining strategies. This approach relies on complex offline hard negative sample construction pipelines, which constrain model iteration efficiency and hinder the evolutionary potential of semantic representation capabilities. Besides, existing multi-task learning frameworks face the seesaw effect when simultaneously optimizing semantic relevance and non-relevance objectives. In this paper, we propose Retrieval-GRPO, a multi-objective reinforcement learning-based dense retrieval framework designed to address these challenges. The method eliminates offline hard negative sample construction by dynamically retrieving Top-K candidate products for each query during training, while introducing a relevance LLM as a reward model to generate real-time feedback. Specifically, the retrieval model dynamically optimizes embedding representations through reinforcement learning, with reward signals combining LLM-generated relevance scores, product quality scores, and multi-way exclusivity metrics to achieve multi-objective user preference alignment and real-time error correction. This mechanism not only removes dependency on hard negatives but also mitigates the seesaw effect through collaborative multi-objective optimization, significantly enhancing the model’s semantic generalization capability for complex long-tail queries. Extensive offline and online experiments validate the effectiveness of Retrieval-GRPO, which has been deployed on China’s largest e-commerce platform.
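The reward design described above can be sketched roughly as follows. The abstract names three reward signals (relevance, product quality, multi-way exclusivity) and a GRPO-based method, but gives neither the combination rule nor the exact advantage formula. This example therefore assumes a weighted sum with made-up weights and the usual GRPO-style standardization of rewards within one query's candidate group. All names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def combine_rewards(relevance, quality, exclusivity, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of the three reward signals named in the abstract.

    The weights are hypothetical; the abstract does not say how signals are combined.
    """
    w_rel, w_qual, w_excl = weights
    return w_rel * relevance + w_qual * quality + w_excl * exclusivity


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one query's candidate group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (relevance, quality, exclusivity) scores for the Top-K items of one query.
candidates = [(0.9, 0.5, 0.1), (0.4, 0.8, 0.3), (0.1, 0.2, 0.9), (0.7, 0.6, 0.2)]
rewards = [combine_rewards(*c) for c in candidates]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates with above-average combined reward get positive advantages and are pushed closer to the query embedding, while below-average ones are pushed away. In this sketch that comparison within the group takes the place of mined hard negatives.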

