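This post lists the latest papers retrieved from arXiv.org on 2025-11-21. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.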

Note: the daily paper data is retrieved from arXiv.org and updated automatically every morning at around 12:00.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

Overview (2025-11-21)

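474 papers were updated today, including:

  • Natural Language Processing: 45 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 145 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 120 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 117 papers (Machine Learning (cs.LG))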

Natural Language Processing

[NLP-0] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

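[Quick Read]: This paper addresses the lack of on-the-fly multimodal interaction during generation in current visual generation models: existing methods apply textual reasoning only before or after generation and cannot coordinate reasoning with generation dynamically. The key to its solution is Thinking-while-Generating (TwiG), the first interleaved framework, which weaves textual reasoning into the progressive generation of visual content. The reasoning both guides the generation of upcoming local regions and reflects on regions already synthesized, forming a dynamically evolving "generation-reasoning" loop that improves the semantic richness and context awareness of the output.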

Link: https://arxiv.org/abs/2511.16671
Authors: Ziyu Guo,Renrui Zhang,Hongyu Li,Manyuan Zhang,Xinyan Chen,Sifan Wang,Yan Feng,Peng Pei,Pheng-Ann Heng
Affiliations: CUHK (The Chinese University of Hong Kong); IMIXR; MMLab; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL Code: this https URL

Abstract:Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: this https URL.

[NLP-1] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

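[Quick Read]: This paper addresses the high training cost of large language models (LLMs) in multi-scale deployment scenarios: training separate models for different sizes and deployment targets consumes compute on the order of hundreds of billions of tokens. The key solution is the Nemotron Elastic framework, which builds nested submodels so that multiple reasoning-optimized models of different parameter counts are embedded in a single parent model; the submodels share weights and can be extracted zero-shot at deployment time without additional training or fine-tuning. The approach relies on an end-to-end trained router and a two-stage training curriculum designed for reasoning models, and it introduces group-aware SSM elastification, heterogeneous MLP elastification, normalized MSE-based layer importance for better depth selection, and knowledge distillation for simultaneous multi-budget optimization. Together these cut training cost substantially (over 360x lower than training from scratch and around 7x lower than current state-of-the-art compression techniques) while matching or exceeding the performance of existing state-of-the-art (SoTA) models.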

Link: https://arxiv.org/abs/2511.16664
Authors: Ali Taghibakhshi,Sharath Turuvekere Sreenivas,Saurav Muralidharan,Ruisi Cai,Marcin Chochowski,Ameya Sunil Mahabaleshwarkar,Yoshi Suhara,Oluwatobi Olabiyi,Daniel Korzekwa,Mostofa Patwary,Mohammad Shoeybi,Jan Kautz,Bryan Catanzaro,Ashwath Aithal,Nima Tajbakhsh,Pavlo Molchanov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba’s structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
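
The idea of nested submodels that share weights with a parent can be sketched in a few lines. This is a toy illustration only, not the Nemotron Elastic method: it assumes the submodel is a plain top-left slice of a parent weight matrix, whereas the abstract describes a learned router that decides what to keep. The names `extract_submodel` and `matvec` and the 4x4 matrix are hypothetical.

```python
def extract_submodel(parent_weight, rows, cols):
    """Take the top-left rows x cols block of the parent matrix (assumed prefix layout)."""
    # Python list slicing copies values; in an array library the slice
    # could instead be a view on the parent's storage.
    return [row[:cols] for row in parent_weight[:rows]]


def matvec(weight, x):
    """Plain matrix-vector product for a list-of-lists weight."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Hypothetical 4x4 parent layer; real layers are far larger.
parent = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

# "Zero-shot extraction" in this toy: no training step, only slicing stored weights.
small = extract_submodel(parent, rows=2, cols=2)
small_out = matvec(small, [1.0, 1.0])

# Deployment memory in billions of parameters for the 12B/9B/6B family
# named in the abstract: separate checkpoints add up, nested ones do not.
family = [12, 9, 6]
separate_total = sum(family)  # three independent checkpoints
nested_total = max(family)    # one parent that contains the others
```

Under these assumptions the nested family needs storage for the parent only, which is the "constant deployment memory" property the abstract claims.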

[NLP-2] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

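[Quick Read]: This paper addresses the loss of contextual information and visual detail in existing multimodal retrieval-augmented generation (Multimodal RAG) systems, which rely on large language models (LLMs) to summarize images into text when handling charts, tables, and other visual information in financial documents. The key to its approach is a comparative analysis of two retrieval strategies: text-based chunk retrieval (images are first summarized into text by an LLM and then embedded into the vector store) and direct multimodal embedding retrieval (images are stored natively in the vector space). Experiments show that direct multimodal embedding retrieval significantly outperforms the conventional LLM-summary approach, with absolute gains of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5), and that it produces more accurate and more factually consistent answers, confirming the importance of preserving the original visual semantics for downstream tasks.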

Link: https://arxiv.org/abs/2511.16654
Authors: Elias Lumer,Alex Cardenas,Matt Melich,Myles Mason,Sara Dieter,Vamse Kumar Subbiah,Pradeep Honaganahalli Basavaraju,Roberto Hernandez
Affiliations: PricewaterhouseCoopers U.S.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate all three approaches across 6 LLM models and two multi-modal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
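
The two metrics named in this abstract, and the relation between its absolute and relative gains, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The function names are hypothetical; AP@k here normalises by the number of relevant items found in the top k, and the ideal ranking for nDCG@k is built from the retrieved list only. Both are common conventions, but the paper may use others. mAP@k is the mean of AP@k over all queries.

```python
import math


def average_precision_at_k(relevance, k=5):
    """AP@k for one query; `relevance` holds 0/1 flags in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def ndcg_at_k(relevance, k=5):
    """nDCG@k with a log2 rank discount; gains may be binary or graded."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0


def implied_baseline(absolute_gain, relative_gain):
    """Baseline b that satisfies b + absolute_gain == b * (1 + relative_gain)."""
    return absolute_gain / relative_gain
```

If the abstract's absolute and relative gains are measured against the same baseline, `implied_baseline(0.13, 0.32)` puts the text-based mAP@5 at roughly 0.41 and `implied_baseline(0.11, 0.20)` puts the text-based nDCG@5 at roughly 0.55. These are inferences from the reported percentages, not figures stated in the abstract.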

[NLP-3] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

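[Quick Read]: This paper addresses the lack of transparency in existing methods for cancer survival analysis, and in particular three limitations of pathology agents on survival prediction: inability to integrate multimodal data, inefficient region-of-interest exploration, and failure to learn from historical cases. The key to its solution is SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system, which achieves explainable multimodal survival prediction in two stages. The first stage builds structured reasoning reports covering pathology images and gene features, using low-magnification screening, cross-modal similarity-aware patch mining, and confidence-aware patch mining, together with gene-stratified analysis. The second stage retrieves similar cases through retrieval-augmented generation (RAG) and integrates multi-expert predictions with the multimodal reports through progressive interval refinement, so that the system learns from experience and improves clinical interpretability.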

Link: https://arxiv.org/abs/2511.16635
Authors: Guolin Huang,Wenting Chen,Jiaqi Yang,Xinheng Lyu,Xiaoling Luo,Sen Yang,Xiaohan Xing,Linlin Shen
Affiliations: Shenzhen University; Stanford University; University of Nottingham Ningbo China; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 20 pages

Abstract:Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent’s superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

[NLP-4] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

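[Quick Read]: This paper addresses the tension in long video understanding between architectural efficiency and the ability to handle long temporal contexts. It proposes TimeViper, a vision-language model with a hybrid Mamba-Transformer backbone. Its key contribution is revealing the "vision-to-text information aggregation phenomenon": as depth increases, information flows progressively from vision tokens to text tokens, which makes the vision tokens redundant. Building on this, the authors design the TransV module, which transfers and compresses vision-token information into instruction tokens while preserving multimodal understanding, enabling effective processing of hour-long videos exceeding 10,000 frames.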

Link: https://arxiv.org/abs/2511.16595
Authors: Boshen Xu,Zihan Xiao,Jiaze Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Qin Jin
Affiliations: AIM3 Lab, Renmin University of China; MiLM Plus, Xiaomi Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL

Abstract:We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

[NLP-5] D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies AAAI2026

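[Quick Read]: This paper addresses the insufficient robustness of current graphical user interface (GUI) agents in real-world environments; in particular, existing datasets and benchmarks are mostly static and idealized and do not reflect anomalies that are common in practice, such as permission dialogs, battery warnings, and update prompts. The key to its solution is D-GARA, a dynamic benchmarking framework that introduces a diverse set of real-world anomaly scenarios and constructs a benchmark dataset of Android applications with embedded anomalies, making it possible to evaluate GUI agent robustness in complex, unpredictable environments. Experiments show that mainstream GUI agents degrade significantly in anomaly-rich environments, underscoring the importance of robustness-aware learning. D-GARA is also modular and extensible, supporting flexible integration of new tasks, anomaly types, and interaction scenarios to meet diverse research needs.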

Link: https://arxiv.org/abs/2511.16590
Authors: Sen Chen,Tong Zhao,Yi Bin,Fei Ma,Wenqi Shao,Zheng Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026

Abstract:Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.

[NLP-6] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

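[Quick Read]: This paper addresses word sense disambiguation (WSD), in particular the difficulty traditional methods have with fine-grained semantic representations such as those built on OpenCyc. Existing methods typically depend on hand-annotated training data and target coarse-grained sense representations (such as WordNet synsets or FrameNet frames), which limits their use in complex inference tasks. The paper proposes a new approach that requires no hand-annotated data: statistical language models serve as discriminators. The multiple candidate meanings produced by a symbolic natural language understanding (Symbolic NLU) system are converted into distinguishable natural-language alternatives, a large language model (LLM) selects the most appropriate interpretation in context, and the selected meaning is passed back to the symbolic system to support subsequent inference. The key is using the LLM's context awareness to select senses automatically and without supervision, removing the dependence on hand-annotated data.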

Link: https://arxiv.org/abs/2511.16577
Authors: Kexin Zhao,Ken Forbus
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract:Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
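
The loop described in this abstract (candidate meanings turned into natural-language alternatives, an LLM picks one, the choice returns to the symbolic system) can be sketched as below. This is an assumed shape, not the authors' implementation: the prompt wording, the function names, and the concept symbols in the usage note are hypothetical, and the LLM is passed in as a plain callable so that no real API is implied.

```python
from string import ascii_uppercase


def build_prompt(sentence, word, glosses):
    """Render candidate meanings as a lettered multiple-choice question."""
    lines = [f'Sentence: "{sentence}"', f'Which meaning of "{word}" fits best?']
    for letter, gloss in zip(ascii_uppercase, glosses):
        lines.append(f"{letter}. {gloss}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def disambiguate(sentence, word, candidates, ask_llm):
    """Return the symbol of the chosen candidate, or None if the reply is unusable.

    `candidates` is a list of (symbol, gloss) pairs from the symbolic system;
    `ask_llm` is any callable that maps a prompt string to a reply string.
    """
    prompt = build_prompt(sentence, word, [gloss for _, gloss in candidates])
    reply = ask_llm(prompt).strip().upper()
    index = ascii_uppercase.find(reply[0]) if reply else -1
    if 0 <= index < len(candidates):
        return candidates[index][0]  # hand the symbol back to the symbolic system
    return None
```

For example, with the hypothetical candidates `[("FinancialBank", "an institution that holds money"), ("RiverBank", "the land beside a river")]` and the sentence "She sat on the bank and watched the water", a reply of "B" maps back to `"RiverBank"`. Returning `None` on an unparseable reply leaves the fallback decision to the caller.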

[NLP-7] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

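[Quick Read]: This paper addresses the continued over-reliance on word error rate (WER) when evaluating automatic speech recognition (ASR) systems in clinical dialogue: WER correlates poorly with actual clinical impact, so it cannot effectively measure the potential risk that ASR errors pose to medical safety. The key to its solution is a gold-standard benchmark annotated by expert clinicians, plus an LLM-based judge that is programmatically optimized with GEPA. The resulting judge, built on Gemini-2.5-Pro, reaches performance comparable to human experts (90% accuracy, Cohen's κ = 0.816), establishing a scalable, automated ASR evaluation paradigm centered on clinical safety.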

Link: https://arxiv.org/abs/2511.16544
Authors: Zachary Ellis,Jared Joselowitz,Yash Deo,Yajie He,Anna Kalygina,Aisling Higham,Mana Rahimzadeh,Yan Jia,Ibrahim Habli,Ernest Lim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen’s κ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
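
Why WER can miss clinical impact is easy to see from how it is computed. The sketch below is a standard word-level edit-distance implementation written for illustration; it is not the paper's evaluation code, and the example utterances in the usage note are invented rather than taken from the paper's datasets.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra hypothesis word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Against the reference "take two tablets daily", the hypotheses "take ten tablets daily" and "take two tablet daily" each contain one substitution in four words, so both score 0.25. The first changes a dose and the second does not, which is the kind of difference the abstract argues WER cannot see.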

[NLP-8] The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation

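[Quick Read]: This paper addresses the performance-efficiency trade-off that generative AI faces in explainable recommendation systems because of end-to-end architectures: jointly optimizing ranking and explanation tends to produce suboptimal compromises. The key to its solution is Prism, a decoupled framework that strictly separates the recommendation process into an independent ranking stage and an explanation generation stage. A powerful teacher large language model (LLM) acts as the "Oracle" and produces high-quality explanatory knowledge, and a lightweight fine-tuned student model (e.g., BART-Base) specializes in synthesizing that knowledge into personalized explanations. This design removes the inherent conflicts of coupled models and lets each module be optimized for its own objective, maintaining or even improving faithfulness and personalization in human evaluation while delivering a 24x inference speedup and a 10x reduction in memory consumption.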

Link: https://arxiv.org/abs/2511.16543
Authors: Jiaheng Zhang,Daqiang Zhang
Affiliations: Sun Yat-sen University; Tongji University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 3 figures

Abstract:The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.

[NLP-9] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

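[Quick Read]: This paper addresses the interpretability of large language models (LLMs), specifically how to extract the genre of a text from the model's internal activations, as a step toward a predictive analysis framework for safe and beneficial deployment. The key to its approach is training shallow machine learning classifiers (such as those in scikit-learn) on the activations an LLM produces for different text inputs. Experiments on two datasets reach F1 scores of up to 98% and 71% respectively, consistently outperforming the control task, and provide a proof of concept that text genre can be inferred from model activations alone.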

Link: https://arxiv.org/abs/2511.16540
Authors: Éloïse Benito-Rodriguez,Einar Urdshals,Jasmina Nasufi,Nicky Pochinkov
Affiliations: Chalmers Technical University; Lund University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 5 figures

Abstract:Understanding Large Language Models (LLMs) is key to ensure their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM, is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.

[NLP-10] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

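[Quick Read]: This paper addresses the weak performance of neural information retrieval (NIR) systems on low-resource, morphologically rich languages such as Turkish, and in particular the absence of a systematic comparison between dense encoders and late-interaction models for Turkish. The key to its solution is TurkColBERT, the first comprehensive benchmark for Turkish, built with a two-stage adaptation pipeline: English and multilingual encoders are first fine-tuned on Turkish natural language inference (NLI) and semantic textual similarity (STS) tasks, and are then converted into ColBERT-style retrievers with the PyLate framework. The paper also compares indexing algorithms: MUVERA+Rerank is 3.33x faster than PLAID with a +1.7% relative mAP gain, and ColmmBERT-base-TR reaches 0.54 ms query times under MUVERA. The results support the parameter efficiency and retrieval quality of lightweight late-interaction models.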

Link: https://arxiv.org/abs/2511.16528
Authors: Özay Ezerceli,Mahmoud El Hussieni,Selva Taş,Reyhan Bayraktar,Fatma Betül Terzioğlu,Yusuf Çelebi,Yağız Asker
Affiliations: NewMind AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models – which retain token-level representations for fine-grained matching – have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3–5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤ 50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
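
The difference between a dense bi-encoder and a ColBERT-style late-interaction model comes down to the scoring function. The sketch below contrasts the two on toy 2-D token vectors. It is an illustration of the general MaxSim idea, not PyLate's API or the paper's models; the mean-pooled "dense" encoder and all vectors are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching document token."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


def mean_pool(tokens):
    """Collapse token vectors into one vector (stand-in for a dense encoder)."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def dense_score(query_tokens, doc_tokens):
    """Single-vector scoring: compare pooled query and pooled document."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))


# Toy token embeddings: doc_a matches both query tokens exactly,
# doc_b has one token that points halfway between them.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 1.0]]
```

In this toy case the pooled vectors of `doc_a` and `doc_b` point in the same direction as the pooled query, so `dense_score` cannot separate them, while `maxsim_score` ranks `doc_a` higher because it retains token-level matches. That is the "fine-grained matching" property the abstract attributes to late-interaction models.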

[NLP-11] MiMo-Embodied: X-Embodied Foundation Model Technical Report

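[Quick Read]: This paper addresses the limited generality of models across embodied AI and autonomous driving, that is, how to build a single foundation model that performs strongly in both of these complex and very different task domains. The key to its solution is MiMo-Embodied, the first cross-embodied foundation model, which uses multi-stage learning, carefully constructed high-quality data, and joint chain-of-thought (CoT) and reinforcement learning (RL) fine-tuning to achieve positive transfer and mutual reinforcement between the two domains, reaching state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks.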

Link: https://arxiv.org/abs/2511.16518
Authors: Xiaoshuai Hao,Lei Zhou,Zhijian Huang,Zhiwen Hou,Yingbo Tang,Lingfeng Zhang,Guang Li,Zheng Lu,Shuhuai Ren,Xianhui Meng,Yuchen Zhang,Jing Wu,Jinghui Lu,Chenxu Dang,Jiayi Guan,Jianhua Wu,Zhiyi Hou,Hanbing Li,Shumeng Xia,Mingliang Zhou,Yinan Zheng,Zihao Yue,Shuhao Gu,Hao Tian,Yuannan Shen,Jianwei Cui,Wen Zhang,Shaoqing Xu,Bing Wang,Haiyang Sun,Zeyu Zhu,Yuncheng Jiang,Zibin Guo,Chuhong Gong,Chaofan Zhang,Wenbo Ding,Kun Ma,Guang Chen,Rui Cai,Diyun Xiang,Heng Qu,Fuli Luo,Hangjun Ye,Long Chen
Affiliations: Xiaomi Embodied Intelligence Team
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL Model: this https URL

Abstract:We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at this https URL.

[NLP-12] Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation

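[Quick Read]: This paper addresses the limitations of the evaluation paradigm for traditional music recommender systems (MRS): heavy reliance on accuracy metrics under an information-retrieval framing makes it hard to measure what actually makes a recommendation good (such as user satisfaction, fairness, or contextual fit), and existing evaluation methods no longer fit the new properties introduced by large language models (LLMs), such as generative interaction, non-deterministic output, and knowledge cutoffs. The key to its proposal is rebuilding the MRS evaluation framework: first, analyze how LLMs change user modeling, item modeling, and natural-language recommendation; second, draw on evaluation methodology from natural language processing (NLP) to identify emerging practices that apply to MRS; finally, outline a structured set of success and risk dimensions, based on how LLM prompting applies to MRS, for a more comprehensive, interpretable, and cross-disciplinary evaluation of LLM-driven MRS.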

Link: https://arxiv.org/abs/2511.16478
Authors: Elena V. Epure,Yashar Deldjoo,Bruno Sguerra,Markus Schedl,Manuel Moussallam
Affiliations: Deezer Research; Politecnico di Bari; Johannes Kepler University Linz and Linz Institute of Technology
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Under review with the ACM Transactions on Recommender Systems (TORS)

Abstract:Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators. This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights, focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.

[NLP-13] Arctic-Extract Technical Report

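[Quick Read]: This paper addresses efficient extraction of structured data (such as question-answer pairs, entities, and tables) from scanned or born-digital business documents, with particular attention to deployment on resource-constrained devices. The key to its solution is the Arctic-Extract model, which offers state-of-the-art (SoTA) document understanding while weighing only 6.6 GiB, so it runs on resource-constrained hardware such as an A10 GPU (24 GB of memory) and can process long documents of up to 125 A4 pages, balancing high performance against low resource use.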

Link: https://arxiv.org/abs/2511.16470
Authors: Mateusz Chiliński,Julita Ołtusek,Wojciech Jaśkowski
Affiliations: Snowflake AI Research
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighing only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long document processing. This paper highlights Arctic-Extract’s training protocols and evaluation results, demonstrating its strong performance in document understanding.

[NLP-14] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

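[Quick Read]: This paper addresses how transformer-based language models process idiomatic expressions, a form of non-compositional language: how fixed expressions are recognized and understood when their meaning is not a linear composition of word meanings. The key to its approach is a modified path patching algorithm for circuit discovery and analysis in the transformer architecture. With it, the author identifies "Idiom Heads", attention heads that activate frequently across many idioms, and describes "augmented reception", enhanced attention between the tokens of an idiom that results from earlier processing. Together these mechanisms show how transformers handle non-compositional language robustly while remaining computationally efficient, and they suggest a path toward understanding how more complex grammatical constructions are modeled.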

Link: https://arxiv.org/abs/2511.16467
Authors: Andrew Gomes
Affiliations: EPFL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate “Idiom Heads,” attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term “augmented reception.” We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.

[NLP-15] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

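[Quick Read]: This paper addresses the lack of transparency and explainability in current generative AI question answering for the environmental, social, and governance (ESG) domain, in particular shortcomings in factual consistency, traceability of reasoning, and domain alignment. The key to its solution is ESGBench, a benchmark dataset and evaluation framework that contains domain-grounded questions across multiple ESG themes with human-curated answers and supporting evidence, enabling fine-grained evaluation of model reasoning and supporting the development of trustworthy, explainable ESG AI systems.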

Link: https://arxiv.org/abs/2511.16438
Authors: Sherine George,Nithish Saji
Affiliations: BNY; FedEx
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Workshop paper accepted at AI4DF 2025 (part of ACM ICAIF 2025). 3 pages including tables and figures

Abstract:We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.

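[NLP-16] TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models AAAI2026

[Quick read]: This paper addresses efficient, lightweight adaptation of pre-trained Vision-Language Models (VLMs) in federated learning. Existing iterative federated training incurs high communication costs and is more exposed to attacks, while current one-shot methods underuse multimodal information and handle data heterogeneity poorly. The key to its solution is TOFA (Training-free One-shot Federated Adaptation), whose main contributions are: (1) a dual-pipeline design, in which the visual pipeline uses a hierarchical Bayesian model to learn personalized, class-specific prototype distributions and the textual pipeline evaluates and globally aligns locally generated text prompts for robustness; (2) an adaptive weight calibration mechanism that combines the predictions of both modalities, balancing personalization and robustness to mitigate data heterogeneity; and (3) no additional training resources on either the client or the server side, which reduces computation and communication costs.

Link: https://arxiv.org/abs/2511.16423
Authors: Li Zhang,Zhongxuan Han,XiaoHua Feng,Jiaming Zhang,Yuyuan Li,Linbo Jiang,Jianan Lin,Chaochao Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026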


Abstract:Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incur significant communication costs and increase the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) requiring additional training resource of clients or server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.

[NLP-17] Classification of worldwide news articles by perceived quality 2018-2024

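[Quick read]: This paper asks whether machine learning and deep learning models can distinguish news articles of perceived lower quality from those of perceived higher quality. The key to its approach is a dataset of 1,412,272 English news articles, labelled at the website level by splitting expert consensus ratings of 579 source websites at the median, which yields two classes of about 706,000 articles each. With 194 linguistic features extracted per article, the authors train and compare 3 traditional machine learning classifiers and 3 deep learning models. Pre-trained language models such as ModernBERT-large achieve the best results (up to 0.8744 accuracy and 0.9593 ROC-AUC), clearly ahead of the traditional classifiers (Random Forest: 0.7355 accuracy, 0.8131 ROC AUC).

Link: https://arxiv.org/abs/2511.16416
Authors: Connor McElroy,Thiago E. A. de Oliveira,Chris Brogly
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: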


Abstract:This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
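
The labelling step described in the abstract (a median split of source-level ratings, then propagation to articles) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the site names, rating values, and the tie rule (ratings equal to the median fall into the low class) are all assumptions.

```python
from statistics import median


def median_split(source_ratings):
    """Label each source 'high' or 'low' by comparing its rating to the median."""
    cutoff = median(source_ratings.values())
    # Assumption: a rating exactly equal to the median goes to the 'low' class.
    return {
        source: ("high" if rating > cutoff else "low")
        for source, rating in source_ratings.items()
    }


def label_articles(articles, source_labels):
    """Propagate website-level labels to every article from that website."""
    return [(text, source_labels[source]) for source, text in articles]


# Hypothetical ratings and articles, for illustration only.
ratings = {
    "site-a.example": 0.9,
    "site-b.example": 0.4,
    "site-c.example": 0.7,
    "site-d.example": 0.2,
}
labels = median_split(ratings)
articles = [
    ("site-a.example", "Article one ..."),
    ("site-d.example", "Article two ..."),
]
labelled = label_articles(articles, labels)
```

Because labels are assigned per website, every article inherits the class of its source, which is why the abstract speaks of "website-level labelled" articles.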

[NLP-18] AICC: Parse HTML Finer Make Models Better – A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

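[Quick read]: This paper examines how the quality of web data extraction affects large language model training. Existing heuristic HTML-to-text extractors such as Trafilatura frequently corrupt or lose structured content such as formulas, code blocks, and tables, and most curation pipelines treat extraction as a fixed pre-processing step despite its effect on downstream performance. The core of the solution is MinerU-HTML, an extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model, followed by a two-stage formatting pipeline that explicitly categorizes semantic elements before converting them to Markdown, which improves preservation of structured content. On MainWebBench, MinerU-HTML reaches 81.8% ROUGE-N F1 versus 63.6% for Trafilatura. Under identical filtering, models trained on the resulting AICC corpus (7.3 trillion tokens) gain 1.08 percentage points in average accuracy across 13 benchmarks, which supports the claim that extraction quality materially affects model capability.

Link: https://arxiv.org/abs/2511.16397
Authors: Ren Ma,Jiantao Qiu,Chao Xu,Pei Chu,Kaiwen Liu,Pengli Ren,Yuan Qu,Jiahui Peng,Linfeng Hou,Mengjie Liu,Lindong Lu,Wenchang Ning,Jia Yu,Rui Min,Jin Shi,Haojiong Chen,Peng Zhang,Wenjian Zhang,Qian Jiang,Zengjie Hu,Guoqiang Yang,Zhenxiang Li,Fukai Shang,Zhongying Tu,Wentao Zhang,Dahua Lin,Conghui He
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Comments: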


Abstract:While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura’s 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
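
For readers unfamiliar with the headline metric, ROUGE-N F1 is the harmonic mean of n-gram precision and recall between an extracted text and a reference. The sketch below is a generic version under stated assumptions: whitespace tokenization and the choice of n are ours, since the abstract does not say which settings MainWebBench uses.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """ROUGE-N F1 with clipped n-gram overlap and whitespace tokenization (assumed)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy strings, for illustration only.
score = rouge_n_f1("the cat sat on the mat", "the cat sat", n=1)
```

An extractor that drops content loses recall, while one that keeps boilerplate loses precision, so the F1 form penalizes both failure modes.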

[NLP-19] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies AACL2025

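[Quick read]: This paper addresses the limits of existing metrics for the informativeness of natural-language rationales. Sufficiency only reflects the effect of a rationale on classification and says little about its relationship to model-internal mechanisms such as token classification ability or attention regularisation. The key to the approach is to relate sufficiency to two modelling paradigms: token classification, which measures how well a model identifies rationale tokens, and attention regularisation, which incorporates rationale information to improve model performance. The study finds that highly sufficient rationales do not necessarily improve classification accuracy; sufficiency instead captures the interfering effect of the non-rationale context on classification. Incorporating rationale information can improve cross-domain generalisation, but the effect varies by task and model type, and sufficiency shows no clear relationship with token classification ability, which illustrates the complexity of rationales.

Link: https://arxiv.org/abs/2511.16353
Authors: Jonathan Kamp,Lisa Beinborn,Antske Fokkens
Affiliations: Vrije Universiteit Amsterdam; University of Göttingen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Long paper accepted to the main conference of AACL 2025. Please cite the conference proceedings when available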


Abstract:Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.

[NLP-20] NLP Datasets for Idiom and Figurative Language Tasks

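[Quick read]: This paper addresses the weak performance of Large Language Models (LLMs) on idiomatic and figurative language. Large corpora bring clear benefits to Natural Language Processing (NLP), yet idioms and figurative expressions remain hard to recognise and interpret. The key contribution is three new datasets: one large-scale dataset of potential idiomatic and figurative expressions, and two human-annotated datasets of definite idiomatic and figurative expressions, used to evaluate the baseline ability of pre-trained language models on idiom recognition. The datasets were post-processed for model-agnostic training compatibility and evaluated on slot labeling and sequence tagging, providing training and evaluation resources for improving how LLMs handle non-literal meaning.

Link: https://arxiv.org/abs/2511.16345
Authors: Blake Matheny,Phuong Minh Nguyen,Minh Le Nguyen,Stephanie Reynolds
Affiliations: Japan Advanced Institute of Science and Technology; Kanazawa Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: 32 pages, 10 figures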


Abstract:Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets were used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.

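[NLP-21] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

[Quick read]: This paper addresses the scalability bottleneck in multimodal reasoning research caused by data curation and training strategies that lack transparency and reproducibility. The key to its solution is OpenMMReasoner, a fully transparent two-stage training recipe. The first stage, supervised fine-tuning (SFT), builds an 874K-sample cold-start dataset with rigorous step-by-step validation to establish a foundation for reasoning. The second stage applies Reinforcement Learning (RL) on a 74K-sample dataset spanning diverse domains to further sharpen and stabilize these abilities. The method improves on the Qwen2.5-VL-7B-Instruct baseline by 11.6% across nine benchmarks, underscoring the role of data quality and training design in multimodal reasoning performance.

Link: https://arxiv.org/abs/2511.16334
Authors: Kaichen Zhang,Keming Wu,Zuhao Yang,Kairui Hu,Bin Wang,Ziwei Liu,Xingxuan Li,Lidong Bing
Affiliations: MiroMind AI; Nanyang Technological University; Tsinghua University; LMMs-Lab Team
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: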


Abstract:Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at this https URL.

[NLP-22] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement AAAI2026

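[Quick read]: This paper addresses the poor internal reasoning quality that arises when Large Reasoning Models (LRMs) are trained with Reinforcement Learning (RL) using rewards based only on final-answer correctness, which shows up as over-thinking, under-thinking, redundant thinking, and disordered thinking. The key to its solution is a self-rewriting framework in which the model rewrites its own reasoning texts and then learns from the rewritten reasoning to improve its internal thought process. The algorithm rewrites selectively: only "simple" samples, those the model answers correctly every time, are rewritten, which preserves all the original reward signals of GRPO (Group Relative Policy Optimization). In the implementation, rewriting and vanilla generation are compiled into a single batch, adding only about 10% overhead and keeping the RL algorithm scalable.

Link: https://arxiv.org/abs/2511.16331
Authors: Jiashu Yao,Heyan Huang,Shuang Zeng,Chuwei Luo,WangJie You,Jie Tang,Qingsong Liu,Yuhang Guo,Yangyang Kang
Affiliations: ByteDance; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026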


Abstract:Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only “simple” samples, defined by the model’s consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
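
One plausible reading of why rewriting only consistently correct samples leaves GRPO's reward signal intact: GRPO normalizes each rollout's reward against its group, so a group in which every rollout is correct has zero advantage everywhere and contributes no outcome-based gradient to begin with. The sketch below illustrates that reading only. The binary rewards, the `eps` constant, the query ids, and the all-correct selection rule are assumptions, and the paper's exact criterion may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward standardized within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_for_rewriting(groups):
    """Return ids of queries whose rollouts are all correct (reward == 1)."""
    return [qid for qid, rewards in groups.items() if all(r == 1 for r in rewards)]


# Hypothetical binary correctness rewards for three queries, four rollouts each.
groups = {
    "q1": [1, 1, 1, 1],  # consistently correct: candidate for rewriting
    "q2": [1, 0, 1, 0],  # mixed: carries a useful outcome signal
    "q3": [0, 0, 0, 0],  # consistently wrong: not rewritten
}
to_rewrite = select_for_rewriting(groups)
easy_adv = group_relative_advantages(groups["q1"])
mixed_adv = group_relative_advantages(groups["q2"])
```

Under these assumptions, `q1` has all-zero advantages, so replacing its rollouts with rewritten versions does not displace any correctness signal, while `q2` keeps its positive and negative advantages untouched.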

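[NLP-23] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

[Quick read]: This paper addresses the difficulty of aligning Large Language Models (LLMs) with human intent in real-world use, in particular adjusting behavior efficiently at inference time without costly fine-tuning or extensive supervision. The core challenge is to improve the alignment between model outputs and user preferences without changing model parameters. The key to its solution is SDA (Steering-Driven Distribution Alignment), a training-free, model-agnostic alignment framework that dynamically redistributes the model's output probabilities according to user-defined alignment instructions. This steers model behavior at inference time and improves helpfulness, harmlessness, and honesty, while remaining lightweight and resource-efficient.

Link: https://arxiv.org/abs/2511.16324
Authors: Wei Xia,Zhi-Hong Deng
Affiliations: Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: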


Abstract:With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

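[NLP-24] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

[Quick read]: This paper addresses the lack of reliable Uncertainty Quantification (UQ) for Large Language Models (LLMs) in safety-critical scenarios, with a focus on detecting hallucination more precisely. Existing methods rely mainly on semantic probability distributions or pairwise distances and overlook latent semantic structural information. The key to its solution is a UQ framework based on Semantic Structural Entropy (SeSE). It first builds an adaptively sparsified directed semantic graph that captures directional semantic dependencies and automatically prunes unnecessary connections. It then uses hierarchical abstraction to define SeSE as the structural entropy of the optimal semantic encoding tree, an intrinsic measure of uncertainty: the higher the SeSE value, the more likely the model is to hallucinate. The method also extends to fine-grained uncertainty in long-form generation by modeling the random semantic interactions of individual claims, giving theoretically explicable hallucination detection.

Link: https://arxiv.org/abs/2511.16275
Authors: Xingtao Zhao,Hao Peng,Dingli Su,Xianghua Zeng,Chunyang Liu,Jinzhi Liao,Philip S. Yu
Affiliations: Beihang University; Didi Chuxing; National University of Defense Technology; University of Illinois Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages of main text and 10 pages of appendices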


Abstract:Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation – where existing methods often rely on heuristic sample-and-count techniques – we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.

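[NLP-25] Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions

[Quick read]: This paper addresses the inability of current Multimodal Large Language Models (MLLMs) to "read the room" in complex social interactions, that is, to recognise and assess deception reliably, which reflects a marked gap in social-cognitive reasoning. To quantify this failure, the authors introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and build a dataset of synchronized video and text with verifiable ground-truth labels. Experiments show that even state-of-the-art open- and closed-source models such as GPT-4o struggle to distinguish true from false statements reliably. The key to the proposed solution is a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module, which explicitly model what others believe, know, and intend in order to improve the model's perception of and reasoning over multimodal social cues.

Link: https://arxiv.org/abs/2511.16221
Authors: Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Ruicong Liu,Yoichi Sato
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: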


Abstract:Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to 'read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.

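[NLP-26] PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization

[Quick read]: This paper addresses the vulnerability of Large Language Model (LLM) system prompts to extraction attacks when models are accessed through black-box APIs. Such attacks can leak hidden instructions containing proprietary logic or sensitive information, posing serious security and privacy risks. The key to its solution is a lightweight "shield appending" framework that adds a protective textual layer (a SHIELD) to the original prompt. Prompt hardening is formalized as a utility-constrained optimization problem: minimize a leakage metric derived from a suite of adversarial attacks while keeping task utility, measured by semantic fidelity, above a specified threshold. An LLM-as-optimizer searches for a good SHIELD using only API access to the target and optimizer models, yielding a practical defense that does not compromise the model's intended functionality.

Link: https://arxiv.org/abs/2511.16209
Authors: Huseein Jawad,Nicolas Brunel
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: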


Abstract:System prompts are critical for guiding the behavior of Large Language Models (LLMs), yet they often contain proprietary logic or sensitive information, making them a prime target for extraction attacks. Adversarial queries can successfully elicit these hidden instructions, posing significant security and privacy risks. Existing defense mechanisms frequently rely on heuristics, incur substantial computational overhead, or are inapplicable to models accessed via black-box APIs. This paper introduces a novel framework for hardening system prompts through shield appending, a lightweight approach that adds a protective textual layer to the original prompt. Our core contribution is the formalization of prompt hardening as a utility-constrained optimization problem. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks, while simultaneously preserving task utility above a specified threshold, measured by semantic fidelity to baseline outputs. This black-box, optimization-driven methodology is lightweight and practical, requiring only API access to the target and optimizer LLMs. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks, outperforming established baseline defenses without compromising the model’s intended functionality. Our work presents a paradigm for developing robust, utility-aware defenses in the escalating landscape of LLM security. The code is made public on the following link: this https URL
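
The utility-constrained formulation can be illustrated with the selection step alone: among candidate SHIELDs, keep those whose utility stays above the threshold, then take the one with the lowest leakage. This is only a sketch of the constraint logic. The candidate names and scores are hypothetical, and the loop in which an optimizer LLM proposes new candidates and attacks are run against the target model is omitted.

```python
def pick_shield(candidates, utility_threshold):
    """Return the lowest-leakage candidate whose utility meets the threshold."""
    feasible = [c for c in candidates if c["utility"] >= utility_threshold]
    if not feasible:
        return None  # no candidate satisfies the utility constraint
    return min(feasible, key=lambda c: c["leakage"])


# Hypothetical scores, for illustration only: leakage is lower-is-better,
# utility is higher-is-better (e.g. semantic fidelity to baseline outputs).
candidates = [
    {"name": "A", "leakage": 0.05, "utility": 0.70},  # safest, but breaks the task
    {"name": "B", "leakage": 0.12, "utility": 0.93},
    {"name": "C", "leakage": 0.30, "utility": 0.97},
]
best = pick_shield(candidates, utility_threshold=0.90)
```

Treating utility as a hard constraint rather than a weighted term means a shield that blocks every attack but degrades task outputs (candidate A here) is never selected.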

[NLP-27] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

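[Quick read]: This paper addresses growing problems with citation accuracy in academic literature: semantic citation errors, hallucinated references produced by generative AI, and traditional citation formats that cannot point to the specific supporting passage. The key to its solution is SemanticCite, a system that verifies citations through full-text source analysis, combining multiple retrieval methods with a four-class classification scheme (Supported, Partially Supported, Unsupported, Uncertain) that characterizes the semantic relationship between a claim and its source. Fine-tuned lightweight language models reach performance comparable to large commercial models at much lower computational cost, which makes large-scale citation verification feasible, and the system provides transparent, evidence-based explanations that support user trust.

Link: https://arxiv.org/abs/2511.16198
Authors: Sebastian Haan
Affiliations: The University of Sydney
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: 21 pages, 4 figures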


Abstract:Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.

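[NLP-28] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

[Quick read]: This paper addresses the redundancy, and possible harm, of applying Parameter-Efficient Fine-Tuning (PEFT) modifications indiscriminately to all position indices, as traditional PEFT methods do. The key to its solution is a new paradigm, Token-Selective PEFT (TS-PEFT), in which a selection function S applies PEFT modifications to only a subset of position indices, giving a more targeted fine-tuning strategy that can improve downstream task performance.

Link: https://arxiv.org/abs/2511.16147
Authors: Dabiao Ma,Ziming Dai,Zhimin Xin,Shu Wang,Ye Wang,Haojun Fei
Affiliations: Qifu Technology, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures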


Abstract:In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
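
The idea of a selection function S that applies a PEFT modification at only some positions can be sketched as a per-position gate. The abstract does not define S, so the rule below is purely an assumption for illustration: the gate opens when the norm of the PEFT update at that position exceeds a fixed threshold, whereas the title suggests the paper learns its threshold.

```python
import math


def token_selective_update(base_out, peft_delta, threshold):
    """Add the PEFT delta only at positions where the (assumed) gate is open.

    base_out and peft_delta are lists of per-position vectors of equal shape.
    """
    outputs, gates = [], []
    for h, d in zip(base_out, peft_delta):
        norm = math.sqrt(sum(x * x for x in d))
        gate = 1 if norm > threshold else 0  # hypothetical S: hard norm threshold
        gates.append(gate)
        outputs.append([h_i + gate * d_i for h_i, d_i in zip(h, d)])
    return outputs, gates


# Toy values: two positions, hidden size 2.
base = [[1.0, 1.0], [2.0, 2.0]]
delta = [[3.0, 4.0], [0.1, 0.0]]  # e.g. a LoRA-style update per position
out, gates = token_selective_update(base, delta, threshold=1.0)
```

With these toy values the first position receives its update and the second keeps the frozen model's output, which is the "subset of position indices" behaviour the abstract describes.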

[NLP-29] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

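[Quick read]: This paper addresses the inefficiency caused by the reliance of Large Language Models (LLMs) on hand-crafted prompts: manual prompt engineering has become a core bottleneck for practical deployment. The authors propose Ensemble Learning based Prompt Optimization (ELPO), a framework whose core idea is to combine a voting mechanism and shared generation strategies with several different search methods to make prompt optimization more accurate and robust. ELPO also introduces more efficient algorithms for prompt generation and search, and it outperforms state-of-the-art methods across tasks, for example improving the F1 score by 7.6 on the ArSarcasm dataset.

Link: https://arxiv.org/abs/2511.16122
Authors: Qing Zhang,Bing Xu,Xudong Zhang,Yifan Shi,Yang Li,Chen Zhang,Yik Chung Wu,Ngai Wong,Yijie Chen,Hong Dai,Xiansen Chen,Mian Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: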


Abstract:The remarkable performance of Large Language Models (LLMs) highly relies on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which develops rapidly in recent years. Existing APO methods such as those based on evolutionary algorithms or trial-and-error approaches realize an efficient and accurate prompt optimization to some extent. However, those researches focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO conducts voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.
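
The abstract does not say whether ELPO votes over model predictions or over candidate prompts. As a generic illustration of ensemble voting in this setting, the sketch below takes a majority vote over the labels produced by several prompts. The prompt ids, the cue words, and `toy_classify` (a stand-in for an LLM call) are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(prompts, classify, text):
    """Classify `text` once per prompt and combine the labels by majority vote."""
    return majority_vote([classify(prompt, text) for prompt in prompts])


def toy_classify(prompt, text):
    """Stand-in for an LLM call: each hypothetical prompt keys on a different cue."""
    cues = {"p1": "great", "p2": "sure", "p3": "!"}
    return "sarcastic" if cues[prompt] in text.lower() else "literal"


result = ensemble_predict(["p1", "p2", "p3"], toy_classify, "Oh great, another meeting!")
```

The point of the ensemble is that a single weak prompt (here `p2`, which misses the cue) is outvoted by the others, which is the robustness argument the abstract makes for combining strategies.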

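[NLP-30] Early science acceleration experiments with GPT-5

[Quick read]: This paper asks how frontier generative AI models such as GPT-5 can be used to make scientific research more efficient, and where their strengths and limits lie in real research collaboration. Its approach is a collection of case studies across disciplines that document the interactions between human scientists and GPT-5. These show where AI accelerated the work and proposed verifiable new ideas (including four new results in mathematics), and where input from human experts was still key, offering guiding examples for future human-AI research collaboration.

Link: https://arxiv.org/abs/2511.16072
Authors: Sébastien Bubeck,Christian Coester,Ronen Eldan,Timothy Gowers,Yin Tat Lee,Alexandru Lupsasca,Mehtaab Sawhney,Robert Scherrer,Mark Sellke,Brian K. Spears,Derya Unutmaz,Kevin Weil,Steven Yin,Nikita Zhivotovskiy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 89 pages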

点击查看摘要

Abstract:AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

[NLP-31] Learning Tractable Distributions Of Language Model Continuations

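[Quick read]: This paper addresses controlled language generation under sequence-level constraints (such as syntax, style, or safety) with an autoregressive language model (LM). Because the constraints may depend on future tokens, conditioning the LM directly is generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to adjust next-token logits at decoding time, but these surrogates are often only weakly context aware, which reduces query quality. The paper proposes Learning to Look Ahead (LTLA), which uses the same base language model for rich prefix encoding and pairs it with a fixed HMM surrogate that computes exact continuation probabilities. To avoid two efficiency pitfalls (rescoring over the entire vocabulary for every candidate token, and recomputing future probabilities for every new prefix), LTLA handles all next-token candidates with a single batched HMM update and conditions only the HMM's latent state prior on the LM's hidden representations, keeping the HMM decoder fixed so that computation is reused across prefixes. This design improves conditional likelihood and constraint satisfaction with minimal inference overhead.

Link: https://arxiv.org/abs/2511.16054
Authors: Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck, Benjie Wang
Affiliations: University of California, Los Angeles; University of California, San Diego; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: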

Abstract:Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model’s next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate’s latent state prior on the LM’s hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
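The reuse argument in the abstract can be illustrated with a textbook HMM backward pass. This is a minimal sketch under assumptions, not the authors' implementation: the transition matrix `A`, emission matrix `B`, and the function names are hypothetical, and the "prefix-dependent prior" is simply passed in as a list of probabilities rather than predicted from an LM's hidden state. The point is that the backward vector depends only on the fixed HMM and the continuation, so it can be computed once and combined with any number of priors.

```python
def backward_vector(A, B, tokens):
    """beta[h] = P(tokens | first latent state = h) for a fixed HMM.

    A[h][g] is the transition probability h -> g; B[h][v] is the
    probability of emitting token v from state h.
    """
    n = len(A)
    beta = [B[h][tokens[-1]] for h in range(n)]
    for tok in reversed(tokens[:-1]):
        beta = [
            B[h][tok] * sum(A[h][g] * beta[g] for g in range(n))
            for h in range(n)
        ]
    return beta


def continuation_prob(prior, beta):
    """Exact probability of the continuation under a given latent state prior."""
    return sum(p * b for p, b in zip(prior, beta))


A = [[0.9, 0.1], [0.2, 0.8]]   # toy 2-state transition matrix
B = [[0.7, 0.3], [0.1, 0.9]]   # toy emissions over a 2-token vocabulary

beta = backward_vector(A, B, [0, 1])        # computed once for the continuation
p_prefix_a = continuation_prob([1.0, 0.0], beta)   # prior for one prefix
p_prefix_b = continuation_prob([0.5, 0.5], beta)   # prior for another prefix
print(p_prefix_a, p_prefix_b)
```

Changing the prior costs one dot product per prefix, whereas changing the HMM parameters per prefix would force the backward pass to be redone each time.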

[NLP-32] Liars' Bench: Evaluating Lie Detectors for Language Models

[Quick read]: This paper addresses the poor generalization of existing lie detection techniques for large language models (LLMs): current methods are mostly validated in narrow settings that do not cover the diversity of lies LLMs can produce. The authors build LIARS' BENCH, a testbed of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. The benchmark characterizes lies along two dimensions, the model's reason for lying and the object of belief targeted by the lie, and is used to evaluate three black-box and white-box lie detection techniques. The results show that existing techniques systematically fail on certain types of lies, especially when the transcript alone does not reveal whether the model lied, exposing the limits of current methods and providing a practical testbed for future work.

Link: https://arxiv.org/abs/2511.16035
Authors: Kieron Kretschmar (1), Walter Laurito (1, 2), Sharan Maiya (1, 3), Samuel Marks (4)
Affiliations: (1) Cadenza Labs; (2) FZI; (3) University of Cambridge; (4) Anthropic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: *Kieron Kretschmar and Walter Laurito contributed equally to this work. 10 pages, 2 figures; plus appendix. Code at this https URL and datasets at this https URL

Abstract:Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS’ BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model’s reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS’ BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it’s not possible to determine whether the model lied from the transcript alone. Overall, LIARS’ BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

[NLP-33] SpellForger: Prompting Custom Spell Properties In-Game using BERT supervised-trained model

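[Quick read]: This paper asks how AI can serve as a core gameplay mechanic, letting players define custom spell effects through natural language prompts for a more personalized and creative experience. The approach uses a supervised-trained BERT model to interpret the player's text description, map it to one of a set of predefined spell prefabs, and balance its parameters (damage, cost, effects) so that the generated spells preserve competitive integrity. The game is built in the Unity game engine, with the goal of generating spells in real time and making AI a direct tool for co-creation during play.

Link: https://arxiv.org/abs/2511.16018
Authors: Emanuel C. Silva, Emily S. M. Salum, Gabriel M. Arantes, Matheus P. Pereira, Vinicius F. Oliveira, Alessandro L. Bicho
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published in Anais Estendidos do XXIV Simpósio Brasileiro de Jogos e Entretenimento Digital (SBGames 2025)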

Abstract:Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervised-trained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.

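[NLP-34] QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation

[Quick read]: This paper addresses the lack of a unified implementation framework for query reformulation methods driven by large language models (LLMs), which makes fair comparison, rapid experimentation, consistent benchmarking, and reliable deployment difficult. The proposed solution is QueryGym, a lightweight, extensible Python toolkit. It offers a unified API for applying diverse LLM-based reformulation methods, a retrieval-agnostic interface that integrates with backends such as Pyserini and PyTerrier, a centralized prompt management system with versioning and metadata tracking, built-in support for benchmarks such as BEIR and MS MARCO, and a fully open-source implementation available to all researchers.

Link: https://arxiv.org/abs/2511.15996
Authors: Amin Bigdeli, Radin Hamidi Rad, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, Ebrahim Bagheri
Affiliations: University of Waterloo; Mila – Quebec AI Institute; University of Toronto; University of California, Berkeley
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 4 pages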

Abstract:We present QueryGym, a lightweight, extensible Python toolkit that supports large language model (LLM)-based query reformulation. This is an important tool development since recent work on llm-based query reformulation has shown notable increase in retrieval effectiveness. However, while different authors have sporadically shared the implementation of their methods, there is no unified toolkit that provides a consistent implementation of such methods, which hinders fair comparison, rapid experimentation, consistent benchmarking and reliable deployment. QueryGym addresses this gap by providing a unified framework for implementing, executing, and comparing llm-based reformulation methods. The toolkit offers: (1) a Python API for applying diverse LLM-based methods, (2) a retrieval-agnostic interface supporting integration with backends such as Pyserini and PyTerrier, (3) a centralized prompt management system with versioning and metadata tracking, (4) built-in support for benchmarks like BEIR and MS MARCO, and (5) a completely open-source extensible implementation available to all researchers. QueryGym is publicly available at this https URL.

[NLP-35] CARE-RAG - Clinical Assessment and Reasoning in RAG

[Quick read]: This paper addresses the gap between retrieval and reasoning for large language models (LLMs) in clinical settings: even when authoritative reference passages are provided, a model may still fail to reason with them correctly. The paper proposes an evaluation framework that systematically measures the accuracy, consistency, and fidelity of reasoning in model outputs, with the aim of making clinical use of retrieval-augmented generation (RAG) safer and more reliable.

Link: https://arxiv.org/abs/2511.15994
Authors: Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao, Junyuan Hong, Ying Ding
Affiliations: University of Texas at Austin; University of Michigan
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance

Abstract:Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.

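[NLP-36] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

[Quick read]: This paper addresses the limited evaluation of complex instruction understanding in existing task-oriented dialogue (TOD) benchmarks. Mainstream benchmarks usually reduce instructions to simple schemas of intents, slots, and API call configurations, ignoring the complex natural-language process instructions and fine-grained constraints found in real deployments. The authors propose TOD-ProcBench, a challenging benchmark built from the high-quality ABCD dataset, which formalizes complex instructions as multi-level condition-action statements and defines three tasks for evaluating how well LLMs follow instructions in multi-turn dialogue. Task 1 tests retrieving the most relevant statement from a complex instruction and predicting the next action. Task 2 synthesizes instruction-violating responses by injecting inconsistencies and manipulating the original instructions, then tests whether models can identify the violations. Task 3 evaluates conditional generation of instruction-following responses from the original complex instructions. Modeling fine-grained constraints, rather than a simplified schema, gives a more realistic picture of how LLMs understand and execute complex business procedures.

Link: https://arxiv.org/abs/2511.15976
Authors: Sarik Ghazarian, Abhinav Gullapalli, Swair Shah, Anurag Beniwal, Nanyun Peng, Narayanan Sadagopan, Zhou Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: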

Abstract:In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs’ instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs’ abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs’ complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs’ abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.

[NLP-37] JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

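[Quick read]: This paper studies how well small language models (SLMs) can judge the correctness of answers compared with large language models (LLMs). Existing LLM-as-a-judge frameworks typically compare candidate answers against ground-truth labels or other candidates using predefined metrics such as entailment, which is indirect, hard to automate fully, and offers limited support for fine-grained, scalable evaluation. The paper proposes JudgeBoard, an evaluation pipeline that directly queries a model to assess whether a candidate answer is correct, without extra answer comparisons. It further introduces MAJ (Multi-Agent Judging), a multi-agent framework in which several SLMs with distinct reasoning profiles deliberate together to approximate LLM-level judgment accuracy. Experiments show that MAJ substantially improves the reliability of SLM judgments on mathematical reasoning and science/commonsense reasoning tasks, in some cases matching or exceeding larger models.

Link: https://arxiv.org/abs/2511.15958
Authors: Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 4 figures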

Abstract:While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparatively well or even better than their larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
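The abstract mentions an Elo-based rating system for ranking models as judges without giving details. The sketch below shows the standard Elo update only; how JudgeBoard defines a "match" between two judges, and which K-factor it uses, are not stated in the abstract, so the value `k=32` and the idea of scoring one judge against another per question are assumptions made for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if judge A wins the comparison, 0.0 if it loses,
    and 0.5 for a draw. k=32 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two judges start level; A judges a question correctly and B does not.
a, b = elo_update(1500, 1500, 1.0)
print(a, b)
```

Unlike raw accuracy, an Elo-style rating weights each outcome by the strength of the opponent, so beating a strong judge moves the rating more than beating a weak one.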

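[NLP-38] AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

[Quick read]: This paper addresses the heavy dependence on expert hardware knowledge when optimizing kernels for emerging AI accelerators such as AWS Trainium. Traditionally, kernels are hand-tuned by specialists to fit a specific hardware architecture. AccelOpt is a self-improving LLM agentic system that explores the optimization space through iterative generation guided by an optimization memory, which accumulates and reuses experiences and insights from previously encountered slow-fast kernel pairs. This enables automated, progressive kernel optimization with much less reliance on domain experts, and the evaluation on kernels from real-world LLM workloads shows that it is both effective and cost-efficient.

Link: https://arxiv.org/abs/2511.15915
Authors: Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, Kunle Olukotun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: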

Abstract:We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt’s capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26× cheaper.

[NLP-39] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

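[Quick read]: This paper addresses a gap in how machine Theory of Mind (ToM) is evaluated: existing benchmarks focus on false-belief tasks and reasoning with asymmetric information, and overlook mental states other than belief as well as the richness of human nonverbal communication. The authors build Motion2Mind, a framework that uses an expert-curated body-language reference as a knowledge base and, on top of it, a carefully annotated video dataset. Each video carries fine-grained nonverbal cue (NVC) annotations paired with manually verified psychological interpretations, covering 222 types of nonverbal cues and 397 mind states. This design allows AI systems to be evaluated on both NVC detection and mental state explanation in a more ecologically valid and interpretable way.

Link: https://arxiv.org/abs/2511.15887
Authors: Seungbeen Lee, Jinhong Jeong, Donghyun Kim, Yejin Son, Youngjae Yu
Affiliations: Yonsei University; Seoul National University
Categories: Computation and Language (cs.CL)
Comments: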

Abstract:Our ability to interpret others’ mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.

[NLP-40] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

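[Quick read]: This paper examines the faithfulness and interpretability of Chain-of-Thought (CoT) reasoning in multilingual large language models, that is, how to assess whether the steps in a generated reasoning chain are sound and consistent across languages. It applies two complementary attribution methods, ContextCite for step-level attribution and Inseq for token-level attribution, to the Qwen2.5 1.5B-Instruct model on the MGSM benchmark. The analysis reveals attribution biases across languages, shows where structured CoT prompting helps, and shows that perturbations reduce both model accuracy and attribution coherence.

Link: https://arxiv.org/abs/2511.15886
Authors: Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani
Affiliations: University of Groningen
Categories: Computation and Language (cs.CL)
Comments: Received the Best Student Project Award at RuG’s Advanced-NLP course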

Abstract:This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods–ContextCite for step-level attribution and Inseq for token-level attribution–to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.

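[NLP-41] The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems

[Quick read]: This paper addresses the poorly understood effect of uncooperative behaviors on the stability of multi-agent systems built on large language models (LLMs), in particular the lack of a systematic taxonomy and of dynamic simulation methods. The proposed framework has two parts: a game theory-based taxonomy of uncooperative agent behaviors, filling a gap in the existing literature, and a structured multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve. Evaluated in a collaborative resource management setting, the framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, as validated by human evaluation. Uncooperative behavior can collapse the system within 1 to 7 rounds, whereas fully cooperative agents maintain 100% survival over 12 rounds with 0% resource overuse, underscoring the need for more resilient multi-agent systems.

Link: https://arxiv.org/abs/2511.15862
Authors: Devang Kulshreshtha, Wanyu Du, Raghav Jain, Srikanth Doss, Hang Su, Sandesh Swamy, Yanjun Qi
Affiliations: AWS AI Labs; UC San Diego
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: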

Abstract:This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents’ states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.

[NLP-42] Step-Audio-R1 Technical Report

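[Quick read]: This paper addresses the limited reasoning ability of audio language models. Extended chain-of-thought reasoning has been very successful in text and vision, yet audio models consistently perform better with minimal or no reasoning, which raises the question of whether audio intelligence can benefit from deliberate thinking at all. The paper introduces Step-Audio-R1, described as the first audio reasoning model. Its core contribution is the Modality-Grounded Reasoning Distillation (MGRD) framework, which trains the model to generate reasoning chains that are genuinely grounded in acoustic features rather than hallucinated deliberations disconnected from the audio. The model surpasses Gemini 2.5 Pro and performs comparably to the state-of-the-art Gemini 3 Pro on audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music.

Link: https://arxiv.org/abs/2511.15848
Authors: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 15 pages, 5 figures. Technical Report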

Abstract:Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.

[NLP-43] Chain of Summaries: Summarization Through Iterative Questioning

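[Quick read]: This paper addresses two obstacles that large language models (LLMs) face when using external web content: much of the content is in LLM-unfriendly formats, and context length limits make long texts hard to process. The authors propose Chain of Summaries (CoS), which draws on Hegel's dialectical method to refine a summary iteratively: it first generates an initial summary (thesis), then identifies its limitations through questioning (antithesis), and finally produces a general-purpose summary (synthesis) that meets current information needs and anticipates future ones. On several question-answering datasets, CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as BRIO and PEGASUS by up to 27%, while using substantially fewer tokens and remaining agnostic to the downstream LLM. This gives website maintainers an efficient way to make their content more accessible to LLMs.

Link: https://arxiv.org/abs/2511.15719
Authors: William Brach, Lukas Galke Poech
Affiliations: Slovak Technical University; University of Southern Denmark
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: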

Abstract:Large Language Models (LLMs) are increasingly using external web content. However, much of this content is not easily digestible by LLMs due to LLM-unfriendly formats and limitations of context length. To address this issue, we propose a method for generating general-purpose, information-dense summaries that act as plain-text repositories of web content. Inspired by Hegel’s dialectical method, our approach, denoted as Chain of Summaries (CoS), iteratively refines an initial summary (thesis) by identifying its limitations through questioning (antithesis), leading to a general-purpose summary (synthesis) that can satisfy current and anticipate future information needs. Experiments on the TriviaQA, TruthfulQA, and SQUAD datasets demonstrate that CoS outperforms zero-shot LLM baselines by up to 66% and specialized summarization methods such as BRIO and PEGASUS by up to 27%. CoS-generated summaries yield higher QA performance compared to the source content, while requiring substantially fewer tokens and being agnostic to the specific downstream LLM. CoS thus resembles an appealing option for website maintainers to make their content more accessible for LLMs, while retaining possibilities for human oversight.
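The thesis, antithesis, synthesis loop described in the abstract might be organized roughly as follows. This is a hedged sketch of the control flow only: the function `chain_of_summaries` and its four callables (`summarize`, `ask`, `answerable`, `revise`) are hypothetical placeholders for LLM calls, the stopping rule and round limit are assumptions, and the toy lambdas below operate on a dictionary of facts purely so the example runs without a model.

```python
def chain_of_summaries(document, summarize, ask, answerable, revise, rounds=3):
    """Iteratively refine a summary by probing it with questions.

    All four callables are hypothetical stand-ins for LLM calls.
    """
    summary = summarize(document)                       # thesis
    for _ in range(rounds):
        questions = ask(document)
        missed = [q for q in questions if not answerable(summary, q)]  # antithesis
        if not missed:
            break                                       # nothing left to fix
        summary = revise(summary, document, missed)     # synthesis
    return summary


# Toy stand-ins: the "document" is a dict of facts, questions are its keys.
doc = {"capital": "Paris", "river": "Seine", "population": "2.1M"}
result = chain_of_summaries(
    doc,
    summarize=lambda d: {"capital": d["capital"]},
    ask=lambda d: list(d),
    answerable=lambda s, q: q in s,
    revise=lambda s, d, missed: {**s, missed[0]: d[missed[0]]},
)
print(result)
```

In this toy setup each revision adds one missing fact, so the summary converges to cover every question within the round limit; with real models the loop would stop on the round budget or when no unanswered questions remain.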

[NLP-44] Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs

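[Quick read]: This paper addresses the reliance of current speech representation learning on continuous input features, which makes storage and training inefficient. It proposes Codec2Vec, a speech representation learning framework that relies exclusively on discrete audio codec units: the discrete codes produced by a neural audio codec serve as the model input. This reduces storage requirements by up to 16.5x and training time by 2.3x, while achieving performance on the SUPERB benchmark that is competitive with continuous-input models, and it also offers benefits for scalability and data privacy.

Link: https://arxiv.org/abs/2511.16639
Authors: Wei-Cheng Tseng, David Harwath
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: To be presented at ASRU 2025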

Abstract:Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time by 2.3x, showcasing its scalability and efficiency.

Computer Vision

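[CV-0] Dataset Distillation for Pre-Trained Self-Supervised Vision Models NEURIPS2025

[Quick read]: This paper addresses dataset distillation: generating a small set of synthetic images such that training a model on them matches the performance of the same model trained on a much larger dataset of real images. Existing methods mainly target training randomly initialized models from scratch, whereas current vision pipelines increasingly build on large pre-trained self-supervised models, so the paper focuses on distilling datasets for training linear probes on top of pre-trained features. The key is a method called Linear Gradient Matching: the synthetic images are optimized so that, after passing through the pre-trained feature extractor, the gradients they induce in the linear classifier are similar to those induced by real data. The resulting synthetic data outperform all real-image baselines, generalize across pre-trained models, and work well for fine-grained classification and model interpretability analysis (for example, predicting how similar two models' embedding spaces are, or whether a model is sensitive to spurious correlations in adversarial datasets).

Link: https://arxiv.org/abs/2511.16674
Authors: George Cazenavette, Antonio Torralba, Vincent Sitzmann
Affiliations: Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025. Project page: this https URL Code: this https URL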

Abstract:The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models’ embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
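The matching objective can be illustrated on precomputed features. The sketch below is an assumption-laden simplification, not the paper's method: it computes the cross-entropy gradient of a linear classifier `W` for a real batch and a synthetic batch and measures their cosine distance. The choice of cosine distance, the function names, and the toy feature vectors are all hypothetical, and the step that actually matters in the paper, back-propagating this loss through a pre-trained feature extractor into the synthetic pixels, is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def linear_grad(W, feats, labels):
    """Flattened gradient of mean cross-entropy with respect to W (C x D)."""
    C, D = len(W), len(W[0])
    grad = [0.0] * (C * D)
    for x, y in zip(feats, labels):
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in W])
        for c in range(C):
            err = probs[c] - (1.0 if c == y else 0.0)
            for d in range(D):
                grad[c * D + d] += err * x[d] / len(feats)
    return grad


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither gradient is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


W = [[0.0, 0.0], [0.0, 0.0]]                 # toy linear probe: 2 classes, 2 features
real_feats, real_labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
syn_feats, syn_labels = [[2.0, 0.0], [0.0, 2.0]], [0, 1]

loss = cosine_distance(
    linear_grad(W, real_feats, real_labels),
    linear_grad(W, syn_feats, syn_labels),
)
print(loss)
```

A loss near zero means the synthetic batch pushes the probe in the same direction as the real batch; in a full pipeline this quantity would be minimized with respect to the synthetic images.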

[CV-1] NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses NEURIPS’25

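[Quick read]: This paper addresses the reconstruction of an animatable 3D human avatar from a single image or a sparse set of images. Many prior methods require accurate "ground-truth" camera poses and human poses as test-time input, and their results degrade significantly when the pose estimates are noisy. The proposed NoPo-Avatar removes the dependence on test-time human pose input entirely and reconstructs avatars from images alone. This avoids the degradation caused by pose estimation errors and makes the method more robust and more widely applicable in practice.

Link: https://arxiv.org/abs/2511.16673
Authors: Jing Wen, Alexander G. Schwing, Shenlong Wang
Affiliations: University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS’25; project page: this https URL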

Abstract:We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate “ground-truth” camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

[CV-2] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

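Quick read: This paper addresses the heavy dependence of current large multimodal model (LMM) training on human-annotated data or externally verified reward models, which limits autonomy and scalability. To improve reasoning in a fully unsupervised way, the authors propose EvoLMM, a self-evolving framework whose core idea is to instantiate two cooperative agents from a single backbone model: a Proposer that generates diverse, image-grounded questions, and a Solver that answers them through internal consistency, with learning driven by a continuous self-rewarding process. The key is to use the model's internal feedback, rather than external labels or human judgments, to improve question generation and structured reasoning together. Using only raw images, this yields consistent gains of up to about 3% on multimodal math-reasoning benchmarks such as ChartQA, MathVista, and MathVision.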

Link: https://arxiv.org/abs/2511.16672
Authors: Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan
Affiliations: Mohamed bin Zayed University of AI; Aalto University; Australian National University; Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 Pages, 6 Figures, 4 Tables

Abstract:Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at this https URL.
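
To make "internal consistency" with a continuous reward more concrete, here is one plausible reading, sketched under explicit assumptions: the Solver samples several answers to the same question, and the reward is the fraction of samples that agree with the majority answer. The abstract does not specify the actual reward, so the function name, the normalization, and the majority-vote rule are all hypothetical.

```python
from collections import Counter

def consistency_reward(sampled_answers):
    """Return (majority_answer, agreement) where agreement is in [0, 1].

    Assumption: answers are compared after trimming whitespace and lowercasing.
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    majority, freq = counts.most_common(1)[0]
    return majority, freq / len(sampled_answers)

# Four hypothetical Solver samples for one Proposer question.
answer, reward = consistency_reward(["42", "42", " 42", "41"])
```

A graded signal of this kind gives partial credit when samples mostly agree, which a binary correct/incorrect reward cannot do without ground-truth labels.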

[CV-3] Learning to Think Fast and Slow for Visual Language Models

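Quick read: This paper addresses the lack of a mechanism for dynamically allocating cognitive resources in existing reasoning-oriented visual language models (VLMs): models tend to produce lengthy reasoning chains regardless of task difficulty, which wastes computation. The key to the solution is DualMindVLM, a dual-mode thinking mechanism trained with reinforcement learning (RL) that switches automatically between fast and slow thinking and is built in two stages. In the first stage, data is labeled as fast-thinking or slow-thinking according to the model's output length, motivated by the observation that pre-trained VLMs produce answers of different lengths for different types of questions. In the second stage, the model is trained with GRPO together with these labels so that it learns to choose efficient or in-depth reasoning according to task complexity. The result matches state-of-the-art visual reasoning models while keeping token efficiency high.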

Link: https://arxiv.org/abs/2511.16670
Authors: Chenyu Lin, Cheng Chi, Jinlin Wu, Sharon Li, Kaiyang Zhou
Affiliations: Hong Kong Baptist University; Beijing Academy of Artificial Intelligence; Institute of Automation, CAS; University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
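
The first stage described above (labeling by output length) can be sketched in a few lines. This is only an illustration under stated assumptions: the threshold of 150 tokens, the use of the mean over several sampled responses, and the function name are hypothetical and are not taken from the paper.

```python
from statistics import mean

def label_thinking_mode(sampled_lengths, threshold=150):
    """Label a question 'fast' or 'slow' from the token lengths of sampled responses.

    The threshold is a hypothetical placeholder, not a value from the paper.
    """
    return "slow" if mean(sampled_lengths) > threshold else "fast"

# Toy data: token lengths of three base-model responses per question.
dataset = {
    "What color is the car?": [12, 9, 15],
    "Solve the geometry problem.": [310, 280, 402],
}
labels = {question: label_thinking_mode(lengths) for question, lengths in dataset.items()}
```

The resulting labels would then accompany the GRPO training data in the second stage, so the model can learn when a short answer is enough.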

[CV-4] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

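Quick read: This paper addresses the fact that video generation is still largely confined to entertainment and is rarely applied to complex cognitive tasks. It proposes using video as a new answer modality for Next-Event Prediction (NEP), formalized as the Video-Next-Event Prediction (VNEP) task. VNEP requires the model to generate a dynamic video response from an input video and an instruction, which conveys physical-world procedures (for example, tying a tie) more intuitively and in a more customized way than text alone. The core challenge is that the model must understand multimodal input, perform instruction-conditioned reasoning, and generate video that is visually and semantically consistent. The key to the solution is VANS, which uses reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM). Its core mechanism is the Joint-GRPO algorithm, which drives the VLM to produce captions that are both accurate and easy to visualize while guiding the VDM to generate videos faithful to those captions and to the input visual context.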

Link: https://arxiv.org/abs/2511.16669
Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
Affiliations: City University of Hong Kong; Kling Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video’s inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Codes are released in this https URL.

[CV-5] V-Reason Bench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

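Quick read: This paper addresses the lack of systematic, reliable, and reproducible evaluation of the reasoning abilities of generative video models. The authors propose the V-ReasonBench benchmark, whose key is an evaluation suite covering four core dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. Its data comes from both synthetic and real-world image sequences, so the tasks are answer-verifiable, scalable, and unambiguous. Evaluating six state-of-the-art video models on the benchmark reveals clear per-dimension differences. The paper also compares video models with image models, identifies common hallucination behaviors, and studies how video duration affects Chain-of-Frames reasoning, providing a unified and reproducible basis for developing video generation models with more robust, human-aligned reasoning.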

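Link: https://arxiv.org/abs/2511.16668
Authors: Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You
Affiliations: NUS (National University of Singapore); HKUST(GZ) (Hong Kong University of Science and Technology (Guangzhou)); HKU (University of Hong Kong); USYD (University of Sydney); CUHK (Chinese University of Hong Kong); LIGHTSPEED
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL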

Abstract:Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.

[CV-6] SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation NEURIPS2025

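Quick read: This paper addresses multi-object 9-degree-of-freedom (9-DoF) pose control, that is, precisely controlling the location, size, and orientation of several objects in a scene at once; existing methods generally suffer from limited controllability and degraded generation quality. The key to the solution is the SceneDesigner framework. It adds a branched network to the pre-trained base model and introduces a new representation with strong geometric interpretability, the CNOCS map (Camera-Normalized Object Coordinate Space map), which encodes 9D pose information from the camera view and makes training more efficient and stable. To counter the performance drop on low-frequency poses caused by data imbalance, it uses a two-stage training strategy with reinforcement learning. At inference time it applies Disentangled Object Sampling to reduce insufficient object generation and concept confusion in complex scenes, giving more accurate and flexible multi-object 9D pose manipulation.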

Link: https://arxiv.org/abs/2511.16666
Authors: Zhenyuan Qin, Xincheng Shuai, Henghui Ding
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025 (Spotlight), Project Page: this https URL

Abstract:Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at this https URL.

[CV-7] TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

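Quick read: This paper addresses the challenges of generating high-quality, controllable 4D avatars (3D dynamic models with a time dimension) from text descriptions, including temporal inconsistency, geometric distortion, motion irregularities, high computational cost, and weak control over dynamics. The core solution is TriDiff-4D, a diffusion-based 4D generation pipeline. Its key innovation is diffusion-based triplane re-posing combined with an auto-regressive strategy that generates 4D sequences of arbitrary length: it first generates a canonical-pose 3D avatar and a corresponding motion sequence from the text prompt, then uses a second diffusion model to animate the avatar according to that motion sequence. This preserves temporal consistency, motion accuracy, and visual fidelity while cutting generation time from hours to seconds and removing the dependence on a traditional optimization process.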

Link: https://arxiv.org/abs/2511.16662
Authors: Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille
Affiliations: Johns Hopkins University; Lambda Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures, Under review at a conference

Abstract:With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.

[CV-8] PartUV: Part-Based UV Unwrapping of 3D Meshes WWW

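Quick read: This paper addresses the problem that traditional UV unwrapping methods, when applied to AI-generated meshes (which are typically noisy, with irregular surfaces and poor geometric conditioning), tend to produce many fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. The key to the solution is PartUV, a part-based UV unwrapping pipeline that combines high-level semantic part decomposition (from PartField) with novel geometric heuristics in a top-down recursive framework to produce part-aligned charts with low distortion and a much smaller chart count. By treating a user-specified distortion threshold as a constraint and minimizing the total number of charts, the approach is more robust on complex and challenging meshes and enables new applications such as part-specific multi-tile packing.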

Link: https://arxiv.org/abs/2511.16659
Authors: Zhaoning Wang, Xinyue Wei, Ruoxi Shi, Xiaoshuai Zhang, Hao Su, Minghua Liu
Affiliations: Hillbot Inc.; University of California San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR)
Comments: project page: this https URL

Abstract:UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart’s distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at this https URL.

[CV-9] Solving Spatial Supersensing Without Spatial Supersensing

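Quick read: This paper examines how reliably current benchmarks evaluate spatial supersensing in video world models. Existing benchmarks such as VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC) may not truly measure spatial understanding and reasoning across views and over time, and may instead be solvable with shortcut heuristics. The key contribution is a systematic analysis showing two things. First, a simple baseline, NoSense, is already near-perfect on VSR, so such tasks can be completed without sophisticated spatial cognition. Second, the benchmark-specific inference strategy for VSC (the predictive sensing inference method proposed by Cambrian-S) actually relies on a non-robust pattern in the dataset, namely that rooms are never revisited, rather than on genuine spatial supersensing. This indicates flaws in the current benchmark design and suggests that the reported gains come mainly from exploiting shortcuts rather than from robust spatial modeling.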

Link: https://arxiv.org/abs/2511.16655
Authors: Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu
Affiliations: Tübingen AI Center, University of Tübingen; Max Planck Institute for Biological Cybernetics
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Tech Report

Abstract:Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: this https URL
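
The VSC-Repeat sanity check is easy to picture with a toy model. In the sketch below, a video is a list of frames and each frame is a set of object IDs; both the representation and the `no_revisit_count` shortcut counter are hypothetical illustrations, not the Cambrian-S inference algorithm or the authors' code. The number of extra copies is a parameter because the abstract's "1-5 times" leaves the exact convention open.

```python
def vsc_repeat(frames, extra_copies):
    """Concatenate a video (list of frames) with itself `extra_copies` more times."""
    return frames * (1 + extra_copies)

def unique_object_count(frames):
    """Ground truth: number of distinct object IDs seen anywhere in the video."""
    return len(set().union(*frames))

def no_revisit_count(frames, segment_len):
    """Shortcut counter: treats every segment as a new room and sums per-segment counts."""
    total = 0
    for start in range(0, len(frames), segment_len):
        total += len(set().union(*frames[start:start + segment_len]))
    return total

# Toy video: two "rooms" of two frames each, four distinct objects in total.
video = [{"chair_1"}, {"chair_1", "lamp_1"}, {"sofa_1"}, {"sofa_1", "tv_1"}]
repeated = vsc_repeat(video, extra_copies=2)
```

On the original toy video both counters agree, but after repetition the true count stays the same while the shortcut counter grows with every copy, which is the failure pattern the abstract attributes to an assumption that rooms are never revisited.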

[CV-10] Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

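Quick read: This paper addresses the high computational overhead of unstructured pruning for compressing deep neural networks, which usually requires repeated train-prune-retrain cycles. The key to the solution is a teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation: gradient signals informed by the teacher model refine the parameter importance estimates, so the parameters most critical to both task performance and knowledge transfer are identified more accurately. On this basis the method performs one-shot global pruning, removing redundant weights while preserving essential representations, and then recovers accuracy with sparsity-aware retraining that does not reactivate pruned connections. This substantially reduces computational cost while maintaining strong performance.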

Link: https://arxiv.org/abs/2511.16653
Authors: Md. Samiul Alim, Sharjil Khan, Amrijit Biswas, Fuad Rahman, Shafin Rahman, Nabeel Mohammed
Affiliations: Apurba-NSU R&D Lab; North South University; Apurba Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at 2025 IEEE International Conference on Big Data (IEEE BigData 2025)

Abstract:Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
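
One-shot global pruning with a gradient-based importance score can be sketched as follows. The assumptions are loud ones: importance is taken to be |w * g| (a common first-order criterion, not necessarily the paper's), the gradients `g` are assumed to come from a combined task and distillation loss computed elsewhere, and layers are plain Python lists for the sake of a self-contained example.

```python
def global_prune_masks(weights, grads, sparsity):
    """Build 0/1 masks that drop the `sparsity` fraction of weights with the
    lowest importance |w * g|, ranked across all layers at once."""
    scored = [
        (abs(w * g), name, i)
        for name in weights
        for i, (w, g) in enumerate(zip(weights[name], grads[name]))
    ]
    scored.sort()  # lowest importance first
    n_prune = int(len(scored) * sparsity)
    masks = {name: [1] * len(ws) for name, ws in weights.items()}
    for _, name, i in scored[:n_prune]:
        masks[name][i] = 0
    return masks

def apply_masks(weights, masks):
    """Zero out pruned weights; reapplying this after each update keeps them pruned."""
    return {name: [w * m for w, m in zip(weights[name], masks[name])] for name in weights}

# Toy two-layer model with hypothetical teacher-informed gradients.
weights = {"conv1": [0.5, -0.1, 0.8], "fc": [0.05, -0.9, 0.3]}
grads = {"conv1": [0.2, 0.1, -0.5], "fc": [0.4, 0.3, 0.01]}
masks = global_prune_masks(weights, grads, sparsity=0.5)
pruned = apply_masks(weights, masks)
```

Ranking globally rather than per layer lets the sparsity budget concentrate in layers with more redundancy, and reapplying the fixed masks during retraining is one simple way to avoid reactivating pruned connections.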

[CV-11] Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision

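Quick read: This paper addresses two key challenges in 3D hierarchical semantic segmentation (3DHS): multi-hierarchy conflicts that arise when a parameter-sharing model is optimized across hierarchies, and the class imbalance that is unavoidable in multi-hierarchy scenes and lets major classes dominate model performance. The key to the solution is a two-branch framework with a primary branch and an auxiliary discrimination branch. The primary branch uses a late-decoupled structure with multiple decoders, coarse-to-fine hierarchical guidance, and consistency constraints, which mitigates underfitting and overfitting conflicts between hierarchies and constrains class imbalance within each hierarchy. In addition, a 3DHS-oriented bi-branch supervision mechanism based on semantic prototypes learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and primary branches, further improving segmentation under class imbalance.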

Link: https://arxiv.org/abs/2511.16650
Authors: Shuyu Cao, Chongshou Li, Jie Xu, Tianrui Li, Na Zhao
Affiliations: Southwest Jiaotong University; Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which causes model performance to be dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.

[CV-12] TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming NEURIPS2025

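Quick read: This paper addresses the slow denoising and post-processing of current 3D Gaussian diffusion models, where the large number of Gaussian primitives leads to low inference efficiency and limited scalability along sampling trajectories. The key to the solution is TRIM (Trajectory Reduction and Instance Mask denoising), a post-training acceleration method that applies temporal and spatial trimming. On the temporal side, a lightweight selector model evaluates the quality potential of latent Gaussian primitives derived from multiple noise samples, so redundant trajectories are dropped early. On the spatial side, instance mask denoising prunes learnable Gaussian primitives by filtering out redundant background regions, reducing the computation in each denoising step. The method substantially improves the efficiency and scalability of 3D generation without sacrificing output quality.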

Link: https://arxiv.org/abs/2511.16642
Authors: Zeyuan Yin, Xiaoming Liu
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025

Abstract:Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at this https URL.

[CV-13] SAM 3D: 3Dfy Anything in Images

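Quick read: This paper addresses 3D object reconstruction in natural scenes, in particular how to obtain accurate, visually grounded 3D reconstructions under occlusion and scene clutter. The core challenge is the lack of large-scale, high-quality, fully annotated 3D data, which keeps existing methods from generalizing to complex real-world scenes. The key to the solution is a human- and model-in-the-loop annotation pipeline that produces visually grounded data on object shape, texture, and pose at scale, together with a multi-stage training framework that combines synthetic pretraining with real-world alignment to break the 3D “data barrier”. In human preference tests on real-world objects and scenes, the method achieves at least a 5:1 win rate over recent work.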

Link: https://arxiv.org/abs/2511.16624
Authors: SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Website: this https URL

Abstract:We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D “data barrier”. We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

[CV-14] Adaptive Guided Upsampling for Low-light Image Enhancement

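Quick read: This paper addresses the difficulty of optimizing several image quality characteristics at once in low-light image enhancement, for example the trade-off between reducing noise and increasing sharpness. Conventional guided image methods transfer characteristics from a guidance image to the target image, but low-light images are too noisy and too dark for this to work well, so the output is not significantly improved. The key to the solution is Adaptive Guided Upsampling (AGU), which uses multi-parameter optimization to learn the association between multiple characteristics of low-light and bright images, trained with machine learning on a few sample image pairs. AGU renders high-quality images in real time from low-resolution, low-quality input and is reported to outperform state-of-the-art methods in the low-light use case.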

Link: https://arxiv.org/abs/2511.16623
Authors: Angela Vivian Dcosta, Chunbo Song, Rafael Radkowski
Affiliations: Lenovo Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 18 pages, 12 figures

Abstract:We introduce Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images capable of optimizing multiple image quality characteristics at the same time, such as reducing noise and increasing sharpness. It is based on a guided image method, which transfers image characteristics from a guidance image to the target image. Using state-of-the-art guided methods, low-light images lack sufficient characteristics for this purpose due to their high noise level and low brightness, rendering suboptimal/not significantly improved images in the process. We solve this problem with multi-parameter optimization, learning the association between multiple low-light and bright image characteristics. Our proposed machine learning method learns these characteristics from a few sample images-pairs. AGU can render high-quality images in real time using low-quality, low-resolution input; our experiments demonstrate that it is superior to state-of-the-art methods in the addressed low-light use case.

[CV-15] Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

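Quick read: This paper addresses long-tailed 2D object detection: in real-world data the number of instances per category is highly imbalanced (many samples for head classes, few for tail classes), which sharply degrades detection performance on rare categories. The core solution improves the Balanced Group Softmax (BAGS) framework on the LVISv1 dataset and introduces metric learning to produce more discriminative feature embeddings, so that features are tightly clustered within each class and well separated across classes. At inference time, a k-Nearest Neighbors (k-NN) classification strategy is used to improve recognition of rare classes. The reported mAP is 24.5%, surpassing the previous benchmark of 24.0%.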

Link: https://arxiv.org/abs/2511.16619
Authors: Satyam Gaba
Affiliations: University of the Cumberlands
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures, International Conference on Semantic Computing

Abstract:Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.
Related DOI: https://doi.org/10.1109/ICSC64641.2025.00051
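
The k-NN inference step mentioned above can be illustrated with a minimal sketch. It assumes that embeddings have already been produced by a metric-learning model, uses Euclidean distance and a plain majority vote, and all names and toy values are hypothetical; the paper's actual distance, k, and voting rule are not given in the abstract.

```python
import math
from collections import Counter

def knn_predict(query, embeddings, labels, k=3):
    """Classify `query` by majority vote among its k nearest stored embeddings."""
    neighbours = sorted(
        (math.dist(query, emb), label) for emb, label in zip(embeddings, labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D embedding space: a head class with three samples, a tail class with two.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
labels = ["head", "head", "head", "tail", "tail"]

tail_prediction = knn_predict([4.9, 5.0], embeddings, labels, k=3)
head_prediction = knn_predict([0.05, 0.05], embeddings, labels, k=3)
```

Because the vote depends only on local neighbours, a tight tail-class cluster can win even when the class has very few samples overall, which is the motivation the abstract gives for pairing metric learning with k-NN.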

[CV-16] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

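Quick read: This paper addresses the limited performance of interactive video object segmentation (iVOS) models in surgical video segmentation, caused by the domain gap and insufficient long-term tracking. Existing iVOS models such as Segment Anything Model 2 (SAM2) offer prompt-based flexibility but struggle to deliver stable long-term tracking and zero-shot generalization in complex surgical scenes. To address this, the authors build SA-SV, described as the largest surgical iVOS benchmark to date, covering eight procedure types with 1.6k instance-level spatio-temporal annotations (masklets) over 61k frames, which supports evaluation of long-term tracking and zero-shot transfer. On top of it they propose SAM2S, a foundation model with three key innovations: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for better understanding of instrument categories; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. SAM2S reaches an average J&F score of 80.42, exceeding vanilla SAM2 by 17.10 points and SAM2 fine-tuned on SA-SV by 4.11 points, while keeping real-time inference at 68 FPS and strong zero-shot generalization.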

Link: https://arxiv.org/abs/2511.16618
Authors: Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin
Affiliations: National University of Singapore; University of Sheffield
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Tissues and Organs (q-bio.TO)
Comments: 11 pages, 4 figures

Abstract:Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}\&\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}\&\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at this https URL.

[CV-17] Generative AI for Enhanced Wildfire Detection: Bridging the Synthetic-Real Domain Gap

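[Quick Read]: This paper addresses the way early wildfire detection with deep neural networks is held back by the lack of large annotated datasets. The key idea is to use generative AI to synthesize a comprehensive, annotated smoke dataset and to combine it with unsupervised domain adaptation, narrowing the gap between synthetic and real-world data. Style transfer, Generative Adversarial Networks (GANs), and image matting are further integrated to make the synthetic data more realistic, with the aim of more accurate and scalable smoke plume segmentation in practice.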

Link: https://arxiv.org/abs/2511.16617
Authors: Satyam Gaba
Affiliations: University of the Cumberlands
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 16 figures

Abstract:The early detection of wildfires is a critical environmental challenge, with timely identification of smoke plumes being key to mitigating large-scale damage. While deep neural networks have proven highly effective for localization tasks, the scarcity of large, annotated datasets for smoke detection limits their potential. In response, we leverage generative AI techniques to address this data limitation by synthesizing a comprehensive, annotated smoke dataset. We then explore unsupervised domain adaptation methods for smoke plume segmentation, analyzing their effectiveness in closing the gap between synthetic and real-world data. To further refine performance, we integrate advanced generative approaches such as style transfer, Generative Adversarial Networks (GANs), and image matting. These methods aim to enhance the realism of synthetic data and bridge the domain disparity, paving the way for more accurate and scalable wildfire detection models.

[CV-18] Green Resilience of Cyber-Physical Systems: Doctoral Dissertation

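[Quick Read]: This dissertation studies how an Online Collaborative AI System (OL-CAIS) can balance resilience and greenness when disruptive events occur. The core challenge is to restore system performance while limiting energy consumption, which creates a trade-off between the two properties. The key contribution is the GResilience framework, which provides recovery strategies through three kinds of agent policies: multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent), with the goal of shortening recovery time, stabilizing performance, and reducing human dependency. The work also designs a measurement framework to quantify resilience and greenness, and evaluates it in real and simulated experiments, where RL-agent policies achieve the strongest green recovery results at the cost of a marginal increase in CO2 emissions. It additionally identifies catastrophic forgetting after repeated disruptions and shows that the proposed policies help maintain steadiness.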

Link: https://arxiv.org/abs/2511.16593
Authors: Diaeddin Rimawi
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Cyber-physical systems (CPS) combine computational and physical components. An Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.

[CV-19] Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks

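[Quick Read]: This paper addresses controllable removal ("unlearning") of knowledge about specific samples or classes from medical image segmentation models, motivated by privacy compliance, ethical deployment, and continual dataset revision. Conventional approaches typically rely on full retraining, which is inefficient and hard to control precisely. The key contribution is Erase to Retain, a distillation-based controllable unlearning framework that uses Low-Rank Adaptation (LoRA) constrained subspace updates to erase lesion-specific or class-specific representations while preserving global anatomical understanding. Within a teacher-student distillation setup, a strong unlearning phase adversarially optimizes the LoRA modules to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal; a subsequent gentle restoration phase, which refines only the head under supervision, recovers generalization on the retained data. On ISIC and CHASE the method lowers forget-set performance while preserving utility on retained data, offering a practical route to responsible, controllable, and reversible unlearning in medical image analysis.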

Link: https://arxiv.org/abs/2511.16574
Authors: Nirjhor Datta,Md. Golam Rabiul Alam
Affiliations: BRAC University; Bangladesh University of Engineering and Technology (BUET)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher’s confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent. These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
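To make the "LoRA constrained subspace updates" concrete, the standard LoRA parameterization adds a low-rank product to a frozen weight matrix. The sketch below shows only that generic mechanism, not the paper's training procedure; the helper names `matmul` and `lora_effective_weight`, the `alpha / rank` scaling convention, and the toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w_frozen, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, where only B and A would be trained."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update of the same shape as W
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_frozen, delta)
    ]


# Toy 3x3 frozen weight with a rank-1 adapter (B is 3x1, A is 1x3).
w_frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_b = [[0.5], [0.0], [-0.5]]
lora_a = [[1.0, 2.0, 0.0]]
w_adapted = lora_effective_weight(w_frozen, lora_b, lora_a, alpha=2.0, rank=1)

# Zeroing the adapter recovers the frozen weights exactly.
zero_b = [[0.0], [0.0], [0.0]]
w_restored = lora_effective_weight(w_frozen, zero_b, lora_a, alpha=2.0, rank=1)
```

Because the base weights are never modified, every change lives in the small `B` and `A` matrices. That is one plausible reading of why the abstract describes the unlearning as reversible, though the paper itself should be consulted for the exact mechanism.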

[CV-20] POMA-3D: The Point Map Way to 3D Scene Understanding

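[Quick Read]: This paper addresses the scarcity of pretrained priors and the limited data available for 3D representation learning, aiming at effective 3D scene understanding without large-scale annotated data. The core contribution is POMA-3D, a self-supervised 3D representation model learned from point maps. A point map encodes explicit 3D coordinates on a structured 2D grid, so it preserves global 3D geometry while staying compatible with the input format of 2D foundation models. The key innovations are a view-to-scene alignment strategy that transfers rich 2D priors, and POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across views. The authors also build ScenePoint, a dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes, to support large-scale pretraining. Experiments show that POMA-3D performs well on a range of 3D understanding tasks (3D question answering, embodied navigation, scene retrieval, and embodied localization) using only geometric inputs.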

Link: https://arxiv.org/abs/2511.16567
Authors: Ye Mao,Weixun Luo,Ranran Huang,Junpeng Jing,Krystian Mikolajczyk
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 tables, 5 figures

Abstract:In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: this https URL
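As a rough illustration of what "explicit 3D coordinates on a structured 2D grid" means, the sketch below back-projects a depth image into a camera-frame point map with a pinhole model. This is an assumed construction for the example only; the abstract does not specify how the point maps are built, and the function name `depth_to_point_map` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project an H x W depth image into an H x W grid of (X, Y, Z) points.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Points are expressed in the camera frame; mapping them into a shared
    canonical space would need an extra camera-to-world transform.
    """
    point_map = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            out_row.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        point_map.append(out_row)
    return point_map


# Toy 2 x 3 depth image of a flat wall two units away.
depth = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
point_map = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

The result keeps the image's row and column layout, which is what lets a 2D backbone consume it like a three-channel image.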

[CV-21] NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening AAAI2026

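[Quick Read]: This paper addresses the fact that existing screening methods for child malnutrition are laborious and scale poorly, which hinders early intervention. The key contribution is NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to perform malnutrition detection and anthropometric prediction from children's images at the same time, while addressing generalizability and class imbalance in low-resource settings.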

Link: https://arxiv.org/abs/2511.16566
Authors: Misaal Khan,Mayank Vatsa,Kuldeep Singh,Richa Singh
Affiliations: Indian Institute of Technology, Roorkee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in AAAI 2026 Special Track on AI for Social Impact

Abstract:Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children’s images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.

[CV-22] Lite Any Stereo: Efficient Zero-Shot Stereo Matching

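[Quick Read]: This paper addresses the limited zero-shot generalization of efficient stereo matching models: lightweight models have traditionally been regarded as too small to generalize across scenes and datasets. The key ingredients are a compact yet expressive backbone that ensures scalability, a carefully designed hybrid cost aggregation module, and a three-stage training strategy on million-scale data that bridges the sim-to-real gap. Together these let an ultra-light model achieve strong zero-shot generalization, ranking first across four widely used real-world benchmarks at less than 1% of the computational cost of state-of-the-art accurate methods.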

Link: https://arxiv.org/abs/2511.16555
Authors: Junpeng Jing,Weixun Luo,Ye Mao,Krystian Mikolajczyk
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

[CV-23] Progressive Supernet Training for Efficient Visual Autoregressive Modeling CVPR2025

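[Quick Read]: This paper addresses the large memory overhead that Visual Auto-Regressive (VAR) models incur during multi-scale generation because of cumulative key-value (KV) caching, which limits practical deployment. The key contribution is VARiant: multiple subnets, ranging from 16 to 2 layers, are selected from the original 30-layer network by equidistant sampling; early scales are processed by the full network, while later scales use a weight-sharing subnet. A progressive training strategy mitigates the optimization conflicts between the subnets and the full network, reaching joint optimality in generation quality beyond the Pareto frontier of fixed-ratio training. On ImageNet, VARiant-d16 and VARiant-d8 stay close to the original model's quality while reducing memory consumption by 40-65%, and the single-model design supports zero-cost runtime depth switching for diverse deployment scenarios.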

Link: https://arxiv.org/abs/2511.16546
Authors: Xiaoyue Chen,Yuling Shi,Kaiyuan Li,Huandong Wang,Yong Li,Xiaodong Gu,Xinlei Chen,Mingbao Lin
Affiliations: Tsinghua University, China; Shanghai Jiao Tong University, China; Rakuten, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to CVPR 2025. 10 pages, 7 figures

Abstract:Visual Auto-Regressive (VAR) models significantly reduce inference steps through the “next-scale” prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize a subnet. The subnet and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between the subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant’s single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios. Submission history: [v1] Thu, 20 Nov 2025 16:59:24 UTC.
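The equidistant subnet selection and the per-scale switch between full network and subnet can be sketched as below. This is a guess at one reasonable indexing rule, not the authors' code: the functions `equidistant_layer_indices` and `layers_for_scale`, the rounding rule, and the `switch_scale` threshold are all assumptions introduced for the example.

```python
def equidistant_layer_indices(total_layers, subnet_depth):
    """Pick subnet_depth layer indices spread evenly over range(total_layers)."""
    if not 1 <= subnet_depth <= total_layers:
        raise ValueError("subnet_depth must be between 1 and total_layers")
    if subnet_depth == 1:
        return [0]
    step = (total_layers - 1) / (subnet_depth - 1)
    return [round(i * step) for i in range(subnet_depth)]


def layers_for_scale(scale_index, switch_scale, total_layers, subnet_depth):
    """Use every layer for early scales and the shared-weight subnet afterwards.

    switch_scale is a hypothetical threshold: scales below it run at full depth.
    """
    if scale_index < switch_scale:
        return list(range(total_layers))
    return equidistant_layer_indices(total_layers, subnet_depth)


# Layers a 16-layer and a 2-layer subnet would reuse from a 30-layer network.
subnet_d16 = equidistant_layer_indices(30, 16)
subnet_d2 = equidistant_layer_indices(30, 2)
```

Since a subnet is only a list of indices into the same layer stack, switching depth at run time amounts to choosing a different index list, which fits the abstract's claim of zero-cost depth switching.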

[CV-24] EOGS: Earth Observation Gaussian Splatting with Internal Camera Refinement and Direct Panchromatic Rendering

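[Quick Read]: This paper addresses the tension between 3D reconstruction quality and efficiency for satellite imagery, that is, how to keep reconstruction accurate while reducing training time and improving geometric accuracy. The key contribution is EOGS++, which operates directly on raw high-resolution panchromatic images without external preprocessing; embeds bundle adjustment into training by means of optical flow techniques, improving camera pose estimation without external optimization tools; and adds refinements such as early stopping and TSDF post-processing. On the IARPA 2016 and DFC2019 datasets it outperforms the original EOGS and other NeRF-based methods in reconstruction quality and efficiency, reducing the mean MAE on buildings from 1.33 to 1.19.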

Link: https://arxiv.org/abs/2511.16542
Authors: Pierrick Bournez,Luca Savant Aira,Thibaud Ehret,Gabriele Facciolo
Affiliations: Université Paris-Saclay, CNRS, ENS Paris-Saclay, Centre Borelli; Politecnico di Torino; AMIAD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, ISPRS

Abstract:Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 in mean MAE on buildings compared to the original EOGS models.

[CV-25] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution

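[Quick Read]: This paper addresses the difficulty synthetic image detectors have in generalizing to new models as generative AI iterates rapidly. Traditional detectors rely on periodic retraining, which becomes computationally infeasible and operationally impractical given how quickly new generators are released. The key contribution is a two-stage detection framework. The first stage trains a vision deep learning model with supervised contrastive learning to extract discriminative embeddings from input images, deliberately withholding some generator architectures from training in order to test cross-generator generalization rigorously. The second stage applies a k-nearest neighbors (k-NN) classifier in a few-shot learning setting: with only 150 images per class it reaches an average detection accuracy of 91.3%, a 5.2 percentage point improvement over existing approaches, and on the source attribution task it improves AUC and OSCR by 14.70% and 4.27% respectively, showing that the approach can adapt to the evolving generative AI landscape without exhaustive retraining.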

Link: https://arxiv.org/abs/2511.16541
Authors: Jaime Álvarez Urueña,David Camacho,Javier Huertas Tato
Affiliations: Universidad Politécnica de Madrid
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 6 tables

Abstract:The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols. ACM classes: I.2.10; I.4.10.
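For readers unfamiliar with the first-stage objective, the sketch below implements the commonly used supervised contrastive (SupCon) loss on a tiny batch. It is a generic version of the technique, written under assumptions: the abstract does not give the exact loss variant, temperature, or backbone, and the helper names `unit` and `supcon_loss` are invented for the example.

```python
import math


def unit(vec):
    """Return the L2-normalized copy of a vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For each anchor, same-label samples are positives and every other sample
    in the batch appears in the softmax denominator.
    """
    z = [unit(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits.values())
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[p] - log_denominator for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Two tight clusters: labels that match the clusters give a much lower loss.
batch = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_matching = supcon_loss(batch, ["real", "real", "fake", "fake"])
loss_mismatched = supcon_loss(batch, ["real", "fake", "real", "fake"])
```

Minimizing this loss pulls same-class embeddings together, which is what makes a simple k-NN classifier workable in the second stage.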

[CV-26] Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation

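[Quick Read]: This paper addresses accuracy and convergence in optical flow estimation, in particular the challenge of capturing inter-frame motion accurately under varying image conditions. It analyzes local methods (such as the Lucas-Kanade algorithm) alongside global methods (such as the Horn-Schunck algorithm), and implements a multiresolution version of Horn-Schunck that uses bilinear interpolation and prolongation to improve accuracy and convergence.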

Link: https://arxiv.org/abs/2511.16535
Authors: Haytham Ziani
Affiliations: Al Akhawayn University in Ifrane
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
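The two building blocks named in the abstract, the Horn-Schunck iteration and a bilinear prolongation of the flow to a finer grid, might look roughly like this. It is a simplified sketch and not the paper's implementation: it uses a plain 4-neighbour average instead of the classical weighted stencil, omits the image warping a full coarse-to-fine pyramid would need, and all function names and parameters (`alpha`, `gain`, and so on) are assumptions.

```python
def neighbor_average(field):
    """4-neighbour average with replicated borders."""
    h, w = len(field), len(field[0])
    return [[(field[max(y - 1, 0)][x] + field[min(y + 1, h - 1)][x]
              + field[y][max(x - 1, 0)] + field[y][min(x + 1, w - 1)]) / 4.0
             for x in range(w)] for y in range(h)]


def gradient(img, axis):
    """Central differences inside, one-sided at borders (image must be >= 2x2)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if axis == "x":
                a, b = max(x - 1, 0), min(x + 1, w - 1)
                out[y][x] = (img[y][b] - img[y][a]) / (b - a)
            else:
                a, b = max(y - 1, 0), min(y + 1, h - 1)
                out[y][x] = (img[b][x] - img[a][x]) / (b - a)
    return out


def horn_schunck(frame1, frame2, alpha=1.0, iterations=100, u=None, v=None):
    """Iterate the Horn-Schunck update; u and v may carry a warm start."""
    h, w = len(frame1), len(frame1[0])
    mean = [[0.5 * (a + b) for a, b in zip(r1, r2)] for r1, r2 in zip(frame1, frame2)]
    ix, iy = gradient(mean, "x"), gradient(mean, "y")
    it = [[b - a for a, b in zip(r1, r2)] for r1, r2 in zip(frame1, frame2)]
    u = [row[:] for row in u] if u else [[0.0] * w for _ in range(h)]
    v = [row[:] for row in v] if v else [[0.0] * w for _ in range(h)]
    for _ in range(iterations):
        u_bar, v_bar = neighbor_average(u), neighbor_average(v)
        for y in range(h):
            for x in range(w):
                gx, gy = ix[y][x], iy[y][x]
                t = (gx * u_bar[y][x] + gy * v_bar[y][x] + it[y][x]) / (alpha ** 2 + gx ** 2 + gy ** 2)
                u[y][x] = u_bar[y][x] - gx * t
                v[y][x] = v_bar[y][x] - gy * t
    return u, v


def bilinear_sample(field, y, x):
    """Bilinearly interpolate field at fractional position (y, x)."""
    h, w = len(field), len(field[0])
    y, x = min(max(y, 0.0), h - 1.0), min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = field[y0][x0] * (1 - fx) + field[y0][x1] * fx
    bottom = field[y1][x0] * (1 - fx) + field[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def prolong_flow(coarse, new_h, new_w, gain):
    """Upsample a flow component to new_h x new_w and rescale it by gain."""
    h, w = len(coarse), len(coarse[0])
    sy = (h - 1) / (new_h - 1) if new_h > 1 else 0.0
    sx = (w - 1) / (new_w - 1) if new_w > 1 else 0.0
    return [[gain * bilinear_sample(coarse, y * sy, x * sx) for x in range(new_w)]
            for y in range(new_h)]


# A horizontal ramp shifted one pixel to the right between the two frames.
frame1 = [[float(x) for x in range(6)] for _ in range(4)]
frame2 = [[x - 1.0 for x in range(6)] for _ in range(4)]
u, v = horn_schunck(frame1, frame2)
u_fine = prolong_flow([[0.5] * 3 for _ in range(2)], 4, 6, gain=2.0)
```

In a coarse-to-fine scheme, the prolonged flow from the coarser level would serve as the starting estimate at the next level, with `gain` set to the resolution ratio so displacements stay in pixel units of the finer grid.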

[CV-27] Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration

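[Quick Read]: This paper addresses robust 3D trajectory tracking of gymnasts with multiple cameras. Because venue space limits the number of cameras, and because lighting changes, background clutter, similar uniforms, and occlusions cause detection to fail in some views, conventional multi-view triangulation struggles to reconstruct the gymnast's 3D trajectory accurately. The key idea is to bring in gymnastics domain knowledge: the gymnast's 3D center typically lies within a predefined vertical plane during much of the performance, so a ray-plane intersection can generate coplanar 3D trajectory candidates. The authors further propose a cascaded data association (DA) paradigm that uses triangulation to generate trajectory candidates when cross-view detections are sufficient and falls back on the ray-plane intersection when they are not, reducing tracking failures and improving robustness in challenging scenarios.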

Link: https://arxiv.org/abs/2511.16532
Authors: Fan Yang,Shigeyuki Odashima,Shoichi Masui,Ikuo Kusajima,Sosuke Yamao,Shan Jiang
Affiliations: Fujitsu Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a robust multi-camera gymnast tracking method, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast’s 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast’s 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resorts to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
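The ray-plane intersection that produces a coplanar 3D candidate from a single view is standard geometry and can be sketched as follows. The function name, the tolerance `eps`, and the example camera position and plane are assumptions for illustration; the paper's calibration and plane definition are not given in the abstract.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def ray_plane_intersection(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Intersect the ray origin + t * direction (t >= 0) with a plane.

    Returns the 3D intersection point, or None when the ray is parallel to
    the plane or the plane lies behind the camera.
    """
    denominator = dot(direction, plane_normal)
    if abs(denominator) < eps:
        return None  # ray runs parallel to the plane
    offset = [p - o for p, o in zip(plane_point, origin)]
    t = dot(offset, plane_normal) / denominator
    if t < 0:
        return None  # intersection would be behind the ray origin
    return tuple(o + t * d for o, d in zip(origin, direction))


# Hypothetical setup: the apparatus plane is y = 0 and the camera sits at y = -10.
plane_point, plane_normal = (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
camera_center = (0.0, -10.0, 2.0)
candidate = ray_plane_intersection(camera_center, (0.1, 1.0, 0.0), plane_point, plane_normal)
```

A single detection gives only a viewing ray, so triangulation needs a second non-degenerate view; constraining the point to the known plane removes that requirement, which is why it can serve as the fallback when cross-view detections are insufficient.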

[CV-28] Contrastive vision-language learning with paraphrasing and negation

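[Quick Read]: This paper addresses the unstable semantic alignment of contrastive vision-language models (such as CLIP) on negated and paraphrased text: negation changes meaning radically with minimal lexical change, while paraphrasing can produce very different wording with the same meaning, which makes it hard to align image and text embeddings accurately. The key contribution is SemCLIP, a new contrastive loss trained with LLM-generated triples of original, paraphrased, and negated captions, which moves paraphrased captions towards the original image embeddings while pushing negated captions further away in the embedding space. Experiments show that SemCLIP preserves CLIP's performance while considerably increasing robustness to negation: on the CC-Neg benchmark, original-over-negation image-retrieval accuracy improves from 68.1% to 78.1%, and SemCLIP pre-trained on Sugarcrepe++ outperforms CLIP on all tested downstream zero-shot classification tasks.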

Link: https://arxiv.org/abs/2511.16527
Authors: Kwun Ho Ngan,Saman Sadeghi Afgeh,Joe Townsend,Artur d’Avila Garcez
Affiliations: Fujitsu Research of Europe; City St George’s, University of London
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP’s performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP’s performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
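The abstract does not state the form of the SemCLIP loss, so the following is only a hypothetical stand-in that shows the intended geometry: a pull term that raises image-paraphrase similarity and a hinge term that keeps the negated caption less similar to the image than the original caption by some margin. The function `paraphrase_negation_penalty`, the `margin` value, and the toy vectors are all assumptions, and this is not the paper's loss.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def paraphrase_negation_penalty(image, original, paraphrase, negation, margin=0.2):
    """Illustrative penalty over one (original, paraphrase, negation) triple.

    pull: small when the paraphrase embedding aligns with the image embedding.
    push: zero once the negation is at least `margin` less similar to the
          image than the original caption is.
    """
    pull = 1.0 - cosine(image, paraphrase)
    push = max(0.0, margin + cosine(image, negation) - cosine(image, original))
    return pull + push


# Toy 2-D embeddings for one image and its three captions.
image = [1.0, 0.0]
original = [1.0, 0.0]
paraphrase = [0.9, 0.1]
negation = [-1.0, 0.2]
well_arranged = paraphrase_negation_penalty(image, original, paraphrase, negation)
badly_arranged = paraphrase_negation_penalty(image, original, negation, paraphrase)
```

In training, a term of this kind would be added to the usual CLIP contrastive objective so that retrieval quality is preserved while the negated captions are separated.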

[CV-29] BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization

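[Quick Read]: This paper addresses accurate computer-vision-based action analysis in combat sports such as boxing, where the main challenges are highly dynamic actions, varied recording environments, and a lack of high-quality annotated data. The key contribution is a large, finely annotated video dataset of boxing punches: 6,915 high-quality punch clips covering six punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, providing a reliable benchmark for real-time vision-based action recognition, especially in low-resource and unconstrained environments.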

Link: https://arxiv.org/abs/2511.16524
Authors: Rahul Kumar,Vipul Baghel,Sudhanshu Singh,Bikash Kumar Badatya,Shivam Yadav,Babji Srinivasan,Ravi Hegde
Affiliations: Indian Institute of Technology Gandhinagar; Dr. A. P. J. Abdul Kalam Technical University; Indian Institute of Technology Madras
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.

[CV-30] YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras

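[Quick Read]: This paper addresses the registration of ceiling-mounted cameras (CMCs) to the layout of an indoor scene. Manual calibration is inefficient and costly, while automatic registration based on visual localization performs poorly under visual ambiguity. The key contribution is a joint mapping and registration framework: a mobile agent carrying an RGB-D camera traverses the scene once while the CMCs synchronously record it; the agent's egocentric video yields world-coordinate trajectories and the scene layout, while the CMC videos provide pseudo-scale trajectories and CMC relative poses. Correlating all trajectories by timestamp aligns the CMC relative poses with the world-coordinate scene layout, and a customized factor graph then jointly optimizes ego-camera poses, scene layout, and CMC poses, so that scene mapping and CMC registration are accomplished within one unified framework and each improves the other.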

Link: https://arxiv.org/abs/2511.16521
Authors: Fan Yang,Sosuke Yamao,Ikuo Kusajima,Atsunori Moteki,Shoichi Masui,Shan Jiang
Affiliations: Fujitsu Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (this https URL). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.

[CV-31] Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI

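【Quick read】: This paper addresses the variation in tissue appearance in breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) caused by differing acquisition protocols and individual factors, which makes automated image-based tumor segmentation difficult. The key to the solution is to use image acquisition time as prior knowledge: feature-wise linear modulation (FiLM) layers condition the model's features on acquisition time, improving adaptation and generalization across acquisition sequences. The approach is lightweight, can exploit the full, variable number of images acquired per study, and improves tumor segmentation performance.

Link: https://arxiv.org/abs/2511.16498
Authors: Rui Wang,Yuexi Du,John Lewin,R. Todd Constable,Nicha C. Dvornek
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures

Click to view abstract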

Abstract:Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variable number of images acquired per imaging study. We trained a baseline and different configurations of the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
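As a rough illustration of FiLM conditioning on acquisition time, the sketch below scales and shifts each feature channel with parameters computed from a time value. The per-channel linear map in `time_to_film_params` and all numbers are hypothetical; the paper does not specify its conditioning network here.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: y[c][i] = gamma[c] * x[c][i] + beta[c]."""
    return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]

def time_to_film_params(t, w_gamma, b_gamma, w_beta, b_beta):
    """Hypothetical conditioning network: one linear map per channel from time to (gamma, beta)."""
    gamma = [w * t + b for w, b in zip(w_gamma, b_gamma)]
    beta = [w * t + b for w, b in zip(w_beta, b_beta)]
    return gamma, beta

# Two channels with two values each; t is an acquisition time in minutes (illustrative).
features = [[1.0, 2.0], [3.0, 4.0]]
gamma, beta = time_to_film_params(
    2.0, w_gamma=[0.5, 0.25], b_gamma=[0.0, 1.0], w_beta=[0.0, 1.0], b_beta=[0.5, -1.0]
)
modulated = film(features, gamma, beta)
```

Because the modulation is applied per image, the same layer can in principle handle studies with different numbers of post-contrast phases.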

[CV-32] Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation

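【Quick read】: This paper addresses the limited accuracy of pose estimation for optical microrobots caused by the lack of high-quality microscope image datasets; current methods depend on real data that is costly and slow to collect and label. The key to the solution is a physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN) to synthesize high-fidelity microscope images efficiently. Experiments show a 35.6% improvement in the structural similarity index (SSIM) over purely AI-driven methods at real-time rendering speed (0.022 s/frame), and the pose estimator trained on the synthetic data generalizes to unseen poses.

Link: https://arxiv.org/abs/2511.16494
Authors: Zongcai Tan,Lan Wei,Dandan Zhang
Affiliations: Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract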

Abstract:Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent [...]. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.

[CV-33] Flow and Depth Assisted Video Prediction with Latent Transformer

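【Quick read】: This paper addresses the performance drop that occlusion causes in video prediction. General video prediction models perform well in standard scenarios but still struggle to model object dynamics under occlusion or background motion. The key to the solution is to supply explicit priors: point flow for motion and depth maps for geometric structure, both incorporated into a multi-object latent transformer prediction architecture. Experiments show that a model assisted by these modalities predicts better in occluded scenes and models background motion more accurately.

Link: https://arxiv.org/abs/2511.16484
Authors: Eliyas Suleyman,Paul Henderson,Eksan Firkat,Nicolas Pugeault
Affiliations: University of Glasgow; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract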

Abstract:Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
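To give a feel for a Wasserstein distance on object masks, here is a deliberately simplified sketch that projects each binary mask onto the horizontal axis and compares the two resulting histograms with the 1D Wasserstein-1 distance. The projection to 1D is an assumption made for brevity; the paper does not state here how its mask distances are computed, and masks are assumed non-empty.

```python
def column_histogram(mask):
    """Normalised mass per column of a non-empty binary mask (list of rows)."""
    cols = [sum(row[j] for row in mask) for j in range(len(mask[0]))]
    total = sum(cols)
    return [c / total for c in cols]

def wasserstein_1d(p, q):
    """W1 between two histograms on the same unit-spaced grid: sum of |CDF_p - CDF_q|."""
    distance, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        distance += abs(cdf_p - cdf_q)
    return distance

# The predicted object sits two pixels to the left of the ground-truth object.
predicted = [[0, 1, 1, 0, 0],
             [0, 1, 1, 0, 0]]
target = [[0, 0, 0, 1, 1],
          [0, 0, 0, 1, 1]]

shift_cost = wasserstein_1d(column_histogram(predicted), column_histogram(target))
```

Unlike a pixel-wise overlap score, this cost grows with how far the predicted mask is displaced, which is why such distances are useful for judging predicted motion.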

[CV-34] FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry

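【Quick read】: This paper addresses the lack of efficient, fully automated tools for corpus callosum morphometry. The corpus callosum is a key structure in research on aging and neurological disease, and it often needs precise quantification to support clinical trials and the evaluation of interventions. The key to the solution is FastSurfer-CC, an end-to-end automated framework that identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. It outperforms existing specialized tools on the individual tasks and detects statistically significant differences between Huntington's disease patients and healthy controls that the current state of the art misses.

Link: https://arxiv.org/abs/2511.16471
Authors: Clemens Pollak,Kersten Diers,Santiago Estrada,David Kügler,Martin Reuter
Affiliations: German Center for Neurodegenerative Diseases (DZNE); A.A. Martinos Center for Biomedical Imaging; Massachusetts General Hospital; Harvard Medical School
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract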

Abstract:The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington’s disease patients and healthy controls that are not detected by the current state-of-the-art.

[CV-35] LLaVA^3: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs AAAI’26

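【Quick read】: This paper addresses the difficulty multi-modal language models have in understanding 3D scenes, which stems largely from the scarcity of 3D training data compared with the abundance of 2D vision-language datasets. The key to the solution is LLaVA^3 (pronounced LLaVA-Cube): from multi-view 2D images it builds an intermediate multi-view 3D reconstruction of the scene and derives an omnidirectional visual representation of each object, which describes the 3D scene to a vision-language model (VLM) that otherwise sees only 2D images. Without any fine-tuning, the method improves VLM performance on 3D visual question answering (3D VQA) and 3D language grounding over previous 2D-based VLM solutions.

Link: https://arxiv.org/abs/2511.16454
Authors: Doriand Petit,Steve Bourgeois,Vincent Gay-Bellile,Florian Chabot,Loïc Barthe
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI’26

Click to view abstract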

Abstract:Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA^3 (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

[CV-36] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

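【Quick read】: This paper addresses the high computational cost that Vision-Language-Action (VLA) models incur when processing continuous visual streams in robot manipulation. Existing visual token pruning methods rely on semantic salience (e.g., prefill attention) and overlook the dual-system nature of VLA models, namely high-level semantic understanding and low-level action execution, so they discard information critical for action generation and significantly degrade VLA performance. The key to the solution is VLA-Pruner, a plug-and-play, VLA-specific token pruning method built on a dual-level importance criterion: vision-language prefill attention for semantic-level relevance, and action decode attention, estimated via temporal smoothing, for action-level importance. On this basis, a dual-level token selection strategy adaptively keeps a compact, informative set of visual tokens that serves both semantic understanding and action execution under a given compute budget. Experiments show state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.

Link: https://arxiv.org/abs/2511.16449
Authors: Ziyan Liu,Yeqiu Chen,Hongyi Cai,Tao Lin,Shuo Yang,Zheng Liu,Bo Zhao
Affiliations: Shanghai Jiao Tong University; University of Science and Technology of China; Harbin Institute of Technology (Shenzhen); BAAI (Beijing Academy of Artificial Intelligence)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract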

Abstract:Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA’s intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
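The dual-level criterion can be pictured with a small sketch. Everything here is an assumption for illustration, not the VLA-Pruner algorithm: temporal smoothing is modelled as an exponential moving average over past action-attention scores, and the selection rule simply spends half of the budget on the top semantic tokens and fills the rest by smoothed action score.

```python
def ema_update(previous, current, momentum=0.5):
    """Temporal smoothing: exponential moving average of per-token attention."""
    return [momentum * p + (1.0 - momentum) * c for p, c in zip(previous, current)]

def ranked(scores):
    """Token indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def select_tokens(semantic, action, budget):
    """Hypothetical rule: half the budget by semantic score, the rest by action score."""
    keep = set(ranked(semantic)[: budget // 2])
    fill = [i for i in ranked(action) if i not in keep]
    keep.update(fill[: budget - len(keep)])
    return sorted(keep)

# Six visual tokens: tokens 0 and 2 matter semantically, tokens 3 and 4 matter for the action.
semantic = [0.9, 0.1, 0.8, 0.05, 0.2, 0.1]
action_prev = [0.0, 0.0, 0.0, 1.0, 0.5, 0.0]
action_now = [0.0, 0.0, 0.0, 0.6, 1.0, 0.0]

action_smooth = ema_update(action_prev, action_now)
kept = select_tokens(semantic, action_smooth, budget=4)
```

A purely semantic top-4 selection on the same scores would keep tokens 0, 2, 4 and 1 and drop token 3, which is the kind of action-relevant loss the paper argues against.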

[CV-37] StreetView-Waste: A Multi-Task Dataset for Urban Waste Management WACV2026

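【Quick read】: This paper addresses the monitoring of overflowing waste containers in urban waste management, in particular container tracking and overflow recognition in images captured by garbage trucks. Existing datasets often lack annotations for container tracking or are captured in static, decontextualized environments, which limits their use in real logistics. The key to the solution is StreetView-Waste, a new dataset covering three tasks (waste container detection, waste container tracking, and waste overflow segmentation), together with baselines for each. Two complementary strategies improve on the baselines: a heuristic-based method for container tracking that reduces the mean absolute counting error by 79.6%, and a model-agnostic framework that uses geometric priors to refine litter segmentation and raises mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task.

Link: https://arxiv.org/abs/2511.16440
Authors: Diogo J. Paulo,João Martins,Hugo Proença,João C. Neves
Affiliations: University of Beira Interior; Instituto de Telecomunicações; NOVA LINCS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2026

Click to view abstract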

Abstract:Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.

[CV-38] Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

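【Quick read】: This paper addresses inaccurate meta information in few-shot segmentation (FSS) caused by intra-class variation among visual representations: meta guidance extracted from support images cannot reliably segment untrained classes. The key to the solution is to stop treating support images as the reference and instead build unbiased meta guidance from language descriptions of the target class's attributes. The proposed Language-Driven Attribute Generalization (LDAG) architecture includes a Multi-attribute Enhancement (MaE) module, which uses large language models to generate multiple detailed attribute descriptions and builds refined visual-text prior guidance through multi-modal matching, and a Multi-modal Attribute Alignment (MaA) module, which mitigates the shift between the text and vision modalities, improving segmentation of novel classes.

Link: https://arxiv.org/abs/2511.16435
Authors: Jin Wang,Bingfeng Zhang,Jian Pang,Mengyu Liu,Honglong Chen,Weifeng Liu
Affiliations: China University of Petroleum; Geely
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract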

Abstract:Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture that uses inherent language descriptions of target properties to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because the text-vision modal shift makes it hard for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.

[CV-39] Graph Neural Networks for Surgical Scene Segmentation

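【Quick read】: This paper addresses accurate identification of hepatocystic anatomy during laparoscopic cholecystectomy, which is critical to reducing the risk of surgical complications. Deep learning models often struggle with occlusions, long-range dependencies, and the fine-scale geometry of rare structures. The key to the solution is graph-based segmentation that combines the global context of Vision Transformer (ViT) feature encoders with the explicit spatial relationship modeling of Graph Neural Networks (GNNs), in two variants: a static k-nearest-neighbour graph with GCNII for stable long-range information propagation, and a dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) for adaptive topology learning. The approach improves segmentation performance and anatomical consistency, particularly on thin, rare, and safety-critical structures.

Link: https://arxiv.org/abs/2511.16430
Authors: Yihan Li,Nikhil Churamani,Maria Robu,Imanol Luengo,Danail Stoyanov
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 3 tables

Click to view abstract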

Abstract:Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. They produce anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
Cite as: arXiv:2511.16430 [cs.CV] (or arXiv:2511.16430v1 [cs.CV] for this version); DOI: https://doi.org/10.48550/arXiv.2511.16430; submission history: [v1] Thu, 20 Nov 2025 14:58:29 UTC (8,795 KB), from Nikhil Churamani
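For the static variant, a k-NN graph over node features can be built as in the sketch below. Whether the paper measures neighbourhood in feature space or in image space is not stated in the abstract, so the squared Euclidean distance on toy 2D features is an assumption, and `knn_graph` is a hypothetical helper.

```python
def knn_graph(features, k):
    """Return directed edges (i, j) linking each node to its k nearest neighbours."""
    edges = []
    for i, fi in enumerate(features):
        distances = sorted(
            (sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
            for j, fj in enumerate(features)
            if j != i
        )
        edges.extend((i, j) for _, j in distances[:k])
    return edges

# Four nodes forming two tight pairs; each node should link to its partner.
features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
edges = knn_graph(features, k=1)
```

The resulting edge list is the kind of fixed topology a graph convolution can then propagate information over; the dynamic variant instead learns which edges to keep.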

[CV-40] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

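【Quick read】: This paper addresses cross-view inconsistency in self-supervised surround-view depth estimation for multi-camera rigs: depth estimates disagree in the regions where neighbouring images overlap, which degrades overall 3D perception. The key to the solution is a geometry-guided mechanism. Given the calibrated intrinsics and relative orientations, an initial depth map is predicted for each image, and the resulting 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across views. This yields a 2D position map per image, in which each pixel stores its projected position on the cylinder. An explicit, non-learned spatial attention then aggregates features across images according to pixel distances on the cylinder to predict the final depth map for each image. The method improves both cross-view consistency and overall depth accuracy.

Link: https://arxiv.org/abs/2511.16428
Authors: Samer Abualhanud,Christian Grannemann,Max Mehltretter
Affiliations: Institute of Photogrammetry and GeoInformation, Leibniz University Hannover
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract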

Abstract:Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
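The cylinder projection and distance-based attention can be sketched as follows. The choice of the z-axis as the cylinder axis, the Gaussian kernel, and the value of `sigma` are assumptions for illustration; the abstract only states that the attention is explicit, non-learned, and driven by distances on the cylinder.

```python
import math

def to_cylinder(point):
    """Project a 3D point (rig frame, z up) onto the unit cylinder: (azimuth, height)."""
    x, y, z = point
    radius = math.hypot(x, y)
    return math.atan2(y, x), z / radius

def cylinder_distance(a, b):
    """Distance between two cylinder positions, with azimuth wrap-around."""
    dtheta = abs(a[0] - b[0])
    dtheta = min(dtheta, 2.0 * math.pi - dtheta)
    return math.hypot(dtheta, a[1] - b[1])

def attention_weights(query, keys, sigma=0.1):
    """Non-learned attention: softmax over negative squared cylinder distances."""
    logits = [-(cylinder_distance(query, k) ** 2) / (2.0 * sigma ** 2) for k in keys]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two points on the same viewing ray (seen from different cameras) and one far away.
query = to_cylinder((2.0, 0.0, 0.2))
keys = [to_cylinder((4.0, 0.0, 0.4)), to_cylinder((0.0, 3.0, 0.0))]
weights = attention_weights(query, keys)
```

Pixels from different cameras that look in the same direction land at nearly the same cylinder position and therefore exchange features strongly, which is what ties the overlapping regions together.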

[CV-41] End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss

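【Quick read】: This paper addresses the limited scalability of marker-based optical motion capture (MoCap), whose dense marker configurations lead to time-consuming setup and marker identification ambiguity. The key to the solution is a new fundamental unit, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and greatly simplifies setup. On top of this data modality, an end-to-end deep learning regression model directly estimates SMPL parameters under a manifold-aware geodesic loss, matching the performance of optimization-based methods with over an order of magnitude less computation and enabling high-fidelity, real-time MoCap.

Link: https://arxiv.org/abs/2511.16418
Authors: Hai Lan,Zongyan Li,Jianmin Hu,Jialing Yang,Houde Dai
Affiliations: University of Chinese Academy of Sciences; Fujian Medical University; Fuzhou University; FJIRSM (Fujian Institute of Research on the Structure of Matter); Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: The source code is available at: this https URL

Click to view abstract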

Abstract:Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
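A geodesic loss on rotations measures the angle of the relative rotation between prediction and ground truth, which respects the geometry of the rotation group better than an element-wise difference. The sketch below uses the standard formula angle = arccos((trace(R_pred^T R_true) - 1) / 2); averaging over joints in `mean_geodesic_loss` is an assumption about how such a loss would be applied to SMPL pose parameters, not a detail taken from the paper.

```python
import math

def relative_rotation(r_pred, r_true):
    """Return R_pred^T R_true for two 3x3 rotation matrices given as nested lists."""
    return [[sum(r_pred[k][i] * r_true[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def geodesic_distance(r_pred, r_true):
    """Rotation angle (radians) between two rotations."""
    rel = relative_rotation(r_pred, r_true)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding error
    return math.acos(cos_angle)

def mean_geodesic_loss(pred_rotations, true_rotations):
    """Hypothetical pose loss: mean geodesic distance over all joints."""
    pairs = list(zip(pred_rotations, true_rotations))
    return sum(geodesic_distance(p, t) for p, t in pairs) / len(pairs)

def rot_z(angle):
    """Rotation about the z-axis, used here only to build test inputs."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

angle_error = geodesic_distance(rot_z(0.3), rot_z(1.0))
loss = mean_geodesic_loss([rot_z(0.3), rot_z(0.0)], [rot_z(1.0), rot_z(0.0)])
```

In a training setting the clamp near ±1 needs care, because the gradient of arccos is unbounded there.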

[CV-42] CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

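【Quick read】: This paper addresses the incomplete disentanglement of attribute and object semantics in compositional zero-shot learning (CZSL), which arises because the global semantic representation extracted by the image encoder has limited representational capacity. Existing methods based on Contrastive Language-Image Pre-training (CLIP) disentangle from this global representation and therefore generalize poorly to unseen attribute-object compositions. The key to the solution is CAMS: a Gated Cross-Attention captures fine-grained semantic features from CLIP's high-level image encoding blocks while adaptively suppressing background and other irrelevant information, and Multi-Space Disentanglement then separates attribute and object semantics across multiple spaces, improving recognition of unseen compositions.

Link: https://arxiv.org/abs/2511.16378
Authors: Pan Yang,Cheng Deng,Jing Yang,Han Zhao,Yun Liu,Yuling Chen,Xiaoli Ruan,Yanping Chen
Affiliations: The State Key Laboratory of Public Big Data, Guizhou University; Shanghai Jiao Tong University; Nankai University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract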

Abstract:Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at this https URL.

[CV-43] DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

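【Quick read】: This paper addresses the limited accuracy of offline signature verification (OSV) methods that rely on holistic features for pair comparison and therefore miss fine-grained differences. The key to the solution is DetailSemNet, which matches local structures between two signature images to improve verification accuracy. It also introduces a Detail Semantics Integrator that uses feature disentanglement and re-entanglement to enhance intricate details while expanding discriminative semantics, countering the tendency of transformer-based backbones to obscure local details. The method improves performance, generalizes well in cross-dataset testing, and is more interpretable.

Link: https://arxiv.org/abs/2511.16364
Authors: Meng-Cheng Shih,Tsai-Ling Huang,Yu-Heng Shih,Hong-Han Shuai,Hsuan-Tung Liu,Yi-Ren Yeh,Ching-Chun Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract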

Abstract:Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model’s interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.

[CV-44] Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

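【Quick read】: This paper addresses the performance degradation of existing guided depth super-resolution methods when RGB and depth images are not strictly aligned. Because of hardware limitations (e.g., physically separate RGB-D sensors) and calibration drift (from mechanical vibration or temperature change), real-world RGB-D data is often spatially misaligned, whereas conventional methods assume strict spatial alignment. The key to the solution is the alignment-free Multi-Order Matching Network (MOMNet), with two core components: (1) a multi-order matching mechanism that jointly performs zero-order, first-order, and second-order matching to identify RGB information consistent with depth; and (2) a multi-order aggregation that uses multi-order priors as prompts to guide selective feature transfer from RGB to depth. Experiments show state-of-the-art performance and strong robustness.

Link: https://arxiv.org/abs/2511.16361
Authors: Zhengxue Wang,Zhiqiang Yan,Yuan Wu,Guangwei Gao,Xiang Li,Jian Yang
Affiliations: PCA Lab, Nanjing University of Science and Technology; National University of Singapore; Nankai University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract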

Abstract:Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.

[CV-45] CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering

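【Quick read】: This paper addresses camera localization accuracy for robotics and Extended Reality (XR), where existing visual methods suffer from drift and scale ambiguity and depend on fiducials or loop closure. The key to the solution is to use a pre-captured, highly accurate colored LiDAR point cloud as a global reference: synthetic views are rendered from the cloud in real time, and 2D-3D correspondences are established between live frames and the point cloud, giving drift-free camera tracking with correct metric scale. A neural rendering technique narrows the domain gap between synthetic and real images and reduces occlusion and background artifacts, which improves feature matching. The method outperforms existing SLAM pipelines on the ScanNet++ dataset.

Link: https://arxiv.org/abs/2511.16349
Authors: Joni Vanherck,Steven Moonen,Brent Zoomers,Kobe Werner,Jeroen Put,Lode Jorissen,Nick Michiels
Affiliations: Hasselt University - Digital Future Lab - Flanders Make
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract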

Abstract:Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift and scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.

[CV-46] Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach

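【Quick read】: This paper addresses unstable localization in UAV remote-sensing terrain and landform classification, which stems from complex data annotation, the difficulty of ensuring temporal consistency (TC), and the scarcity of relevant data. The paper argues that fully labeled data is not the optimal choice, while selecting only key data lacks TC enhancement and leads to failures. The key to the solution is a weakly supervised framework based on a teacher-student architecture, combined with key frame selection and key frame updating algorithms, that performs TC knowledge distillation. Using only 30% of the labeled data, it raises both mean Intersection over Union (mIoU) and temporal consistency, giving stable localization of terrain objects.

Link: https://arxiv.org/abs/2511.16343
Authors: Chi-Han Chen,Chieh-Ming Chen,Wen-Huang Cheng,Ching-Chun Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract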

Abstract:The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency ensuring stable localization of terrain objects. Result demo : this https URL
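
The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes flattened per-pixel label lists and a static camera, so the temporal-consistency score here is a simplified label-agreement proxy without optical-flow warping, not necessarily the TC definition used in the paper. The function names are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def temporal_consistency(frames):
    """Fraction of pixels whose label is unchanged between consecutive frames.

    Simplified proxy: assumes a static camera, so no motion compensation.
    """
    agree = total = 0
    for prev, cur in zip(frames, frames[1:]):
        agree += sum(1 for a, b in zip(prev, cur) if a == b)
        total += len(cur)
    return agree / total if total else 1.0
```

A model can score well on one metric and poorly on the other, which is why the abstract treats both as necessary for aerial positioning.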

[CV-47] Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks

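[Quick read]: This paper addresses two limitations of existing Face Super-Resolution (FSR) methods: fixed up-sampling scales and sensitivity to input size variations. The key to the solution is ARASFSR, an arbitrary-resolution and arbitrary-scale FSR method built on implicit representation networks, with three core designs. First, it predicts the RGB value of each target pixel from 2D deep features, local relative coordinates, and up-sampling scale ratios, which allows super-resolution at any scale. Second, a local frequency estimation module captures high-frequency facial texture to reduce the spectral bias effect. Third, a global coordinate modulation module guides the model to use prior facial structure knowledge and adapt to the input resolution.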

Link: https://arxiv.org/abs/2511.16341
Authors: Yi Ting Tsai, Yu Wei Chen, Hong-Han Shuai, Ching-Chun Huang
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.
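
To make the first design concrete, the sketch below shows one common way an implicit image function builds its per-pixel query: map the output pixel centre into input coordinates, pick the nearest input cell, and pass on the local offset and the scale ratio. The coordinate convention and the name `build_query` are assumptions for illustration, not ARASFSR's actual implementation.

```python
import math


def build_query(i, j, in_h, in_w, out_h, out_w):
    """Return (nearest input cell, local relative offset, scale ratios) for output pixel (i, j)."""
    sy, sx = out_h / in_h, out_w / in_w
    # Centre of the output pixel expressed in input-pixel coordinates.
    y = (i + 0.5) / sy - 0.5
    x = (j + 0.5) / sx - 0.5
    # Nearest input cell, clamped to the image bounds.
    ny = min(max(math.floor(y + 0.5), 0), in_h - 1)
    nx = min(max(math.floor(x + 0.5), 0), in_w - 1)
    return (ny, nx), (y - ny, x - nx), (sy, sx)
```

In a full model, the deep feature stored at the returned cell would be concatenated with the offset and scale ratio and fed to a small network that outputs RGB. Because the scale enters only as a number, non-integer factors such as 2.5 need no special handling.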

[CV-48] ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery

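[Quick read]: This paper addresses the tendency of deep-learning methods for Remote Sensing Change Detection (RSCD) to rely solely on change-map annotations and underuse the semantic information in non-changing regions, with the goal of improving robustness under illumination variation, off-nadir views, and scarce labels. The key to the solution is ChangeDINO, an end-to-end multiscale Siamese framework: a lightweight backbone stream is fused with features transferred from a frozen DINOv3 to build semantic- and context-rich multi-scale feature pyramids; a spatial-spectral differential transformer decoder uses multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses; and a learnable morphology module refines the upsampled logits to recover clean boundaries.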

Link: https://arxiv.org/abs/2511.16322
Authors: Ching-Heng Cheng, Chih-Chung Hsu
Affiliations: National Cheng Kung University; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at this https URL.
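
The idea of multi-scale absolute differences as change priors can be sketched on plain 2D grids. This is an illustration only: it works on scalar grids with 2x2 average pooling, whereas the paper applies the differences to deep feature pyramids. All function names are hypothetical.

```python
def avg_pool2(feat):
    """Downsample a 2D grid by 2 with 2x2 average pooling (even sizes assumed)."""
    return [[(feat[r][c] + feat[r][c + 1] + feat[r + 1][c] + feat[r + 1][c + 1]) / 4
             for c in range(0, len(feat[0]), 2)]
            for r in range(0, len(feat), 2)]


def abs_diff(a, b):
    """Element-wise absolute difference of two equally sized grids."""
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def change_priors(t1, t2, levels=2):
    """Absolute differences between two co-registered grids at several scales."""
    priors = []
    for level in range(levels):
        if level > 0:  # move to the next, coarser scale
            t1, t2 = avg_pool2(t1), avg_pool2(t2)
        priors.append(abs_diff(t1, t2))
    return priors
```

Coarser levels respond to large changed regions, while the finest level keeps boundary detail.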

[CV-49] WWE-UIE: A Wavelet White Balance Efficient Network for Underwater Image Enhancement

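[Quick read]: This paper addresses the reduced visibility and color distortion that wavelength-dependent absorption and scattering cause in underwater images, the target of Underwater Image Enhancement (UIE), and the high computational cost that keeps existing hybrid methods from running in real time on resource-limited platforms. The key to the solution is WWE-UIE, a compact and efficient network that integrates three interpretable priors: adaptive white balance to alleviate the color attenuation dominated by blue-green tones; a wavelet-based enhancement block (WEB) that performs multi-band decomposition to capture both global structures and fine textures; and a gradient-aware module (SGFB) that uses Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. The design keeps restoration quality competitive while substantially reducing parameters and floating-point operations (FLOPs), which enables real-time inference.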

Link: https://arxiv.org/abs/2511.16321
Authors: Ching-Heng Cheng, Jen-Wei Lee, Chia-Ming Lee, Chih-Chung Hsu
Affiliations: National Cheng Kung University; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at this https URL.
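
As an illustration of the multi-band decomposition behind the wavelet block, here is a one-level 2D Haar transform in plain Python. The Haar wavelet is an assumption: the abstract does not say which wavelet WEB uses, and the function name is hypothetical.

```python
def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of a grid with even height and width.

    Returns (approx, col_detail, row_detail, diag_detail), each half the input size.
    """
    approx, col_detail, row_detail, diag_detail = [], [], [], []
    for r in range(0, len(img), 2):
        rows = ([], [], [], [])
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2)  # low-pass in both directions
            rows[1].append((a - b + d - e) / 2)  # difference across columns
            rows[2].append((a + b - d - e) / 2)  # difference across rows
            rows[3].append((a - b - d + e) / 2)  # diagonal difference
        approx.append(rows[0])
        col_detail.append(rows[1])
        row_detail.append(rows[2])
        diag_detail.append(rows[3])
    return approx, col_detail, row_detail, diag_detail
```

The approximation band carries global structure and the three detail bands carry edges and texture, which is the split the abstract describes.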

[CV-50] NaTex: Seamless Texture Generation as Latent Color Diffusion

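[Quick read]: This paper addresses several problems that arise when texture generation relies on Multi-View Diffusion models (MVDs): occluded regions are hard to handle, mesh-texture alignment along boundaries is imprecise, and content and color are difficult to keep consistent across views. The core of the solution is the NaTex framework, which treats texture as a dense color point cloud and introduces latent color diffusion, consisting of a geometry-aware color point cloud variational autoencoder (VAE) and a multi-control diffusion transformer (DiT), both trained from scratch on 3D data. Native geometry control conditions the DiT on direct 3D spatial information through positional embeddings and geometry latents to ensure precise alignment. The VAE and DiT are co-designed so that the geometry latents come from a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that keeps strong correspondence between texture and geometry. The design improves texture coherence and alignment and generalizes to downstream applications.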

Link: https://arxiv.org/abs/2511.16317
Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Xin Yang, Xin Huang, Jingwei Huang, Xiangyu Yue, Chunchao Guo
Affiliations: MMLab, CUHK; Tencent Hunyuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-awared color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.

[CV-51] BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks NEURIPS2025

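[Quick read]: This paper addresses the failure of the default proxy for visual representation quality, ImageNet-1K linear-probe accuracy, on scientific imagery, where it no longer predicts model performance on ecology tasks. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of the variance on ecology tasks and mis-ranks 30% of the models whose accuracy exceeds 75%. In response, the author presents BioBench, an open ecology vision benchmark that unifies 9 application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. It provides a standardized pipeline that fits lightweight classifiers to frozen backbones and reports class-balanced macro-F1 as the core metric. The benchmark offers new signal for computer vision in ecology and a reusable template for building reliable AI-for-science benchmarks in other domains.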

Link: https://arxiv.org/abs/2511.16315
Authors: Samuel Stevens
Affiliations: The Ohio State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 3rd Imageomics Workshop at NeurIPS 2025

Abstract:ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at this https URL and results at this https URL.
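
The class-balanced macro-F1 metric mentioned in the abstract is the unweighted mean of per-class F1 scores. A minimal stdlib sketch follows. It is not BioBench's own API, and the function name is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes that occur in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with 8 samples of class "a" and 2 of class "b", a classifier that always predicts "a" has 80% accuracy but a macro-F1 of only 4/9, because the rare class contributes an F1 of zero. This is why the metric suits long-tailed ecological data.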

[CV-52] Sparse Autoencoders are Topic Models

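[Quick read]: This paper addresses the debate over the role and practical value of sparse autoencoders (SAEs) in analyzing embeddings. The key to the solution is to reinterpret the SAE as a topic model: Latent Dirichlet Allocation (LDA) is extended to embedding spaces, and the SAE objective is derived as a maximum a posteriori (MAP) estimator under that model. Building on this view, the authors propose SAE-TM, which (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. The method yields more coherent topics than strong baselines on text and image datasets while maintaining diversity, reveals thematic structure in image datasets and how topics change over time, and positions SAEs as effective tools for large-scale thematic analysis across modalities.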

Link: https://arxiv.org/abs/2511.16309
Authors: Leander Girrbach, Zeynep Akata
Affiliations: Technical University of Munich; Munich Center for Machine Learning; Helmholtz Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.

[CV-53] Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

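[Quick read]: This paper addresses the limited high-resolution output of Vision Foundation Models in pixel-level tasks, which stems from their features being downsampled (typically by 14× or 16×). Existing feature upsampling methods depend on dataset-specific retraining or heavy implicit optimization, which restricts scalability and generalization across architectures and modalities. The key to the solution is Upsample Anything, a lightweight test-time optimization (TTO) framework that learns, through a separate optimization for each image, an anisotropic Gaussian kernel combining spatial and range cues for edge-aware upsampling. The learned kernel acts as a universal operator that transfers across model architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps, and it runs in about 0.419 s per 224×224 image.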

Link: https://arxiv.org/abs/2511.16301
Authors: Minseok Seo, Mark Hamilton, Changick Kim
Affiliations: KAIST; MIT; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 12 figures

Abstract:We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only ≈0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
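
For readers unfamiliar with the Joint Bilateral Upsampling side of this bridge, the sketch below shows the classic technique in 1D: each high-resolution output is a weighted average of nearby low-resolution values, with weights combining spatial distance and similarity in a high-resolution guide signal. It uses fixed isotropic Gaussians with hypothetical parameter values, not the learned anisotropic kernel of the paper.

```python
import math


def joint_bilateral_upsample_1d(low, guide, sigma_s=1.0, sigma_r=0.1, radius=2):
    """Upsample `low` to len(guide) using spatial and range (guide) weights."""
    scale = len(guide) / len(low)
    out = []
    for p, g_p in enumerate(guide):
        centre = (p + 0.5) / scale - 0.5  # position of p in low-res coordinates
        q0 = min(max(math.floor(centre + 0.5), 0), len(low) - 1)
        num = den = 0.0
        for q in range(max(q0 - radius, 0), min(q0 + radius, len(low) - 1) + 1):
            # Guide value at the high-res sample nearest the centre of low-res cell q.
            g_q = guide[min(int((q + 0.5) * scale), len(guide) - 1)]
            w = (math.exp(-((centre - q) ** 2) / (2 * sigma_s ** 2))
                 * math.exp(-((g_p - g_q) ** 2) / (2 * sigma_r ** 2)))
            num += w * low[q]
            den += w
        # Fall back to the nearest low-res value if every weight underflows.
        out.append(num / den if den > 0 else low[q0])
    return out
```

When the guide has a sharp edge, the range term suppresses contributions from the far side of the edge, so the upsampled signal stays sharp instead of blurring across it.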

[CV-54] Optimizing 3D Gaussian Splattering for Mobile GPUs

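[Quick read]: This paper addresses the inefficiency of image-based 3D scene reconstruction on mobile devices, in particular the performance bottlenecks of 3D Gaussian Splatting (3DGS) on mobile GPUs. The central challenge is exploiting the two-dimensional (2D) texture cache for faster execution. The key to the solution is Texture3dgs, an optimized mapping of 3DGS whose core is a novel sorting algorithm in which processing, data movement, and placement are highly optimized for 2D memory. Together with improved variable layout design and other optimizations, it delivers up to 4.1× speedup for sorting and up to 1.7× for overall 3D scene reconstruction, while reducing memory usage by up to 1.6×.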

Link: https://arxiv.org/abs/2511.16298
Authors: Md Musfiqur Rahman Sanim, Zhihao Shu, Bahram Afsharmanesh, AmirAli Mirian, Jiexiong Guan, Wei Niu, Bin Ren, Gagan Agrawal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1× and 1.7× speedup for the sorting and overall 3D scene reconstruction, respectively – while also reducing memory usage by up to 1.6× – demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.

[CV-55] Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability

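[Quick read]: This paper addresses weed detection in precision agriculture: identifying weeds accurately and robustly under varied field conditions so that herbicides can be applied selectively and crops managed sustainably. The key to the solution is a hybrid deep learning framework that combines Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to capture local, global, and relational features. A Generative Adversarial Network (GAN)-based augmentation method balances class distributions, and self-supervised contrastive pre-training learns richer features from limited annotated data. The reported result is 99.33% accuracy, precision, recall, and F1-score on multiple benchmark datasets, with high interpretability and real-time deployment on edge devices.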

Link: https://arxiv.org/abs/2511.16294
Authors: Abishek Karthik, Pandiyaraju V, Sreya Mynampati
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework recipe for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was imposed to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experimental results yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment of edge devices for automated weed detecting, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.

[CV-56] Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM

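[Quick read]: This paper addresses efficient, continuous 3D scene understanding in dynamic environments, with near-real-time applications such as assistive navigation in mind. The core challenge is to reduce the memory consumption of Vision Gated Generative Transformers (VGGT) while achieving temporal consistency. The key to the solution is to process the image stream with a sliding window and align the resulting submaps, which works around VGGT's high memory demands; to use the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects; and to store timestamps and instance-level identities, which enables detection of changes in the environment and richer contextual reasoning.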

Link: https://arxiv.org/abs/2511.16282
Authors: Gergely Dinya, Péter Halász, András Lőrincz, Kristóf Karacs, Anna Gelencsér-Horváth
Affiliations: Eötvös Loránd University; Pázmány Péter Catholic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT’s high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.
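
The sliding-window step can be sketched as follows. The window size and stride are hypothetical parameters, and the paper's submap alignment itself is not shown. The point is only that consecutive windows share frames, which gives an alignment procedure common content to register against.

```python
def sliding_windows(frames, size, stride):
    """Yield overlapping windows; consecutive windows share size - stride frames."""
    last_start = max(len(frames) - size, 0)
    for start in range(0, last_start + 1, stride):
        yield frames[start:start + size]


def shared_frames(window_a, window_b):
    """Frames present in both windows, usable as anchors when aligning submaps."""
    return [f for f in window_a if f in window_b]
```

Because only `size` frames are processed at once, peak memory is bounded by the window rather than by the length of the stream.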

[CV-57] TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid

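[Quick read]: This paper addresses the difficulty of extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs). Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU multilayer perceptrons (MLPs). The key to the solution is TetraSDF, a framework that composes a ReLU MLP with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, so ReLU linear regions can be tracked within the encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric reduces directional bias and stabilizes training. Across multiple benchmarks, the method matches or surpasses existing grid-based encoders in SDF reconstruction accuracy and produces highly self-consistent meshes that stay faithful to the learned isosurfaces, with practical runtime and memory efficiency.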

Link: https://arxiv.org/abs/2511.16273
Authors: Seonghun Oh, Youngjung Uh, Jin-Hwa Kim
Affiliations: Yonsei University; NAVER AI Lab; SNU AIIS
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder’s barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder’s metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
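
The reason barycentric interpolation preserves piecewise-affine structure is that, inside a single tetrahedron, the interpolated value is an affine function of position. The sketch below computes barycentric weights with Cramer's rule and interpolates scalar vertex values. It illustrates the general property only and is not the paper's encoder; all names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def barycentric(p, verts):
    """Barycentric weights of point p with respect to a tetrahedron (4 vertices)."""
    v0 = verts[0]
    # Columns of T are the edge vectors v1 - v0, v2 - v0, v3 - v0.
    T = [[verts[j + 1][i] - v0[i] for j in range(3)] for i in range(3)]
    rhs = [p[i] - v0[i] for i in range(3)]
    d = det3(T)
    w = []
    for j in range(3):  # Cramer's rule: replace column j with rhs
        Tj = [[rhs[i] if c == j else T[i][c] for c in range(3)] for i in range(3)]
        w.append(det3(Tj) / d)
    return [1.0 - sum(w)] + w


def interpolate(p, verts, values):
    """Affine interpolation of per-vertex values at point p."""
    return sum(w * f for w, f in zip(barycentric(p, verts), values))
```

On the unit tetrahedron with vertex values 0, 1, 2, 3, the interpolant equals x + 2y + 3z everywhere inside, an exactly affine function. Trilinear interpolation on a cubic grid cell does not have this property.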

[CV-58] Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

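[Quick read]: This paper addresses incomplete full-body tracking in augmented and virtual reality (AR/VR) applications: existing systems usually track only the head and hands through Head Mounted Devices (HMDs) and controllers, so the rest of the body is missing from the 3D reconstruction. To generate complete and smooth full-body motion from sparse sensor inputs, the authors propose a method based on a multi-layer perceptron (MLP) backbone, enhanced with residual connections and a novel neural network component called Memory-Block. Memory-Block represents missing sensor data with trainable code-vectors and combines them with the sparse signals from previous time instances to improve temporal consistency. The problem is also formulated as multi-task learning so that the MLP backbone learns robust representations that boost accuracy. Experiments show that the method outperforms state-of-the-art baselines and reaches 72 FPS on mobile HMDs, improving the trade-off between accuracy and running time.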

Link: https://arxiv.org/abs/2511.16264
Authors: Sinan Mutlu, Georgios F. Angelis, Savas Ozkan, Paul Wisbey, Anastasios Drosou, Mete Ozay
Affiliations: Samsung R&D Institute UK (SRUK); Information Technologies Institute, CERTH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction in-complete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs that ultimately improves the accuracy-running time tradeoff.

[CV-59] How Robot Dogs See the Unseeable

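[Quick read]: This paper addresses the limits that partial occlusion places on scene understanding in robotic vision. Conventional robot cameras, with small apertures and large depth of field, render foreground occluders and background objects in sharp focus at the same time, so occluders hide critical scene information. The key to the solution is to borrow the peering behavior of animals: as the robot executes a peering motion, its camera describes a wide synthetic aperture (SA), and computational integration of the captured images synthesizes an image with an extremely shallow depth of field, blurring out occluders while bringing the background into focus. The approach does not rely on feature matching or active sensing such as LiDAR, is computationally efficient and wavelength-independent, and can be deployed immediately on a mobile robot, restoring scene understanding in cluttered environments and supporting advanced visual reasoning.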

Link: https://arxiv.org/abs/2511.16262
Authors: Oliver Bimber, Karl Dietrich von Ellenrieder, Michael Haller, Rakesh John Amala Arokia Nathan, Gianni Lunardi, Marco Camurri, Mohamed Youssef, Santos Miguel Orozco Soto, Jeremy E. Niven
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Peering, a side-to-side motion used by animals to estimate distance through motion parallax, offers a powerful bio-inspired strategy to overcome a fundamental limitation in robotic vision: partial occlusion. Conventional robot cameras, with their small apertures and large depth of field, render both foreground obstacles and background objects in sharp focus, causing occluders to obscure critical scene information. This work establishes a formal connection between animal peering and synthetic aperture (SA) sensing from optical imaging. By having a robot execute a peering motion, its camera describes a wide synthetic aperture. Computational integration of the captured images synthesizes an image with an extremely shallow depth of field, effectively blurring out occluding elements while bringing the background into sharp focus. This efficient, wavelength-independent technique enables real-time, high-resolution perception across various spectral bands. We demonstrate that this approach not only restores basic scene understanding but also empowers advanced visual reasoning in large multimodal models, which fail with conventionally occluded imagery. Unlike feature-dependent multi-view 3D vision methods or active sensors like LiDAR, SA sensing via peering is robust to occlusion, computationally efficient, and immediately deployable on any mobile robot. This research bridges animal behavior and robotics, suggesting that peering motions for synthetic aperture sensing are a key to advanced scene understanding in complex, cluttered environments.
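
The computational integration step can be illustrated with a 1D shift-and-average sketch. It assumes known integer parallax shifts for the chosen focal plane and purely lateral camera motion. These are simplifications for illustration, not the paper's pipeline, and the function name is hypothetical.

```python
def integrate_views(images, shifts):
    """Average 1D images after undoing each view's parallax shift for the focal plane.

    shifts[k] is the pixel offset of the focal-plane content in view k.
    """
    n = len(images[0])
    out = []
    for x in range(n):
        samples = [img[x + s] for img, s in zip(images, shifts) if 0 <= x + s < n]
        out.append(sum(samples) / len(samples) if samples else 0.0)
    return out
```

Content on the focal plane lines up across views and is reinforced, while a nearer occluder has larger parallax, lands on different pixels in each view, and is averaged away. In the toy case where a bright target (value 6) is hidden by a dark occluder (value 0) in the centre view but visible in the two side views, the integrated pixel is 4.0, much closer to the target than the occluded centre view alone.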

[CV-60] SwiTrack: Tri-State Switch for Cross-Modal Object Tracking

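[Quick read]: This paper addresses insufficient feature extraction and object drift caused by modality switching in Cross-Modal Object Tracking (CMOT), especially the difficulty of maintaining target consistency when the input modality is unreliable. The key to the solution is SwiTrack, a state-switching framework built from three specialized streams: (1) RGB frames are processed by the visual encoder; (2) NIR frames are refined by a NIR gated adapter coupled with the visual encoder, which progressively calibrates shared latent space features for more robust cross-modal representations; (3) for invalid modalities, a consistency trajectory prediction module uses spatio-temporal cues to estimate target movement and suppress drift. Dynamic template reconstruction and a similarity alignment loss further improve feature consistency and tracking accuracy.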

Link: https://arxiv.org/abs/2511.16227
Authors: Boyue Xu, Ruichao Hou, Tongwei Ren, Dongming Zhou, Gangshan Wu, Jinde Cao
Affiliations: Nanjing University; Yunnan University; Southeast University; Purple Mountain Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, boosting precision rate and success rate gains by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at this https URL.

[CV-61] Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

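[Quick read]: This paper addresses limited clustering performance in unsupervised image classification (image clustering), in a setting where foundation models have shifted the focus to clustering alone and the representation learning step is bypassed. Building on a recent multi-head clustering approach, it adds adaptive nearest neighbor selection and cluster ensembling. The key to the solution is Image Clustering through Cluster Ensembles (ICCE): multiple clustering heads are first trained on a frozen backbone to produce diverse clusterings; a cluster ensembling technique then consolidates these potentially conflicting results into a unified consensus clustering; finally, an image classifier is trained with the consensus clustering as pseudo-labels. The method reports state-of-the-art results on ten benchmarks and, to the authors' knowledge, is the first fully unsupervised method to exceed 70% accuracy on ImageNet.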

Link: https://arxiv.org/abs/2511.16213
Authors: Melih Baydar, Emre Akbas
Affiliations: Middle East Technical University; Helmholtz Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models have recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, “Image Clustering through Cluster Ensembles” (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
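
One simple way to consolidate conflicting clusterings into a consensus is co-association voting: two items are merged if they share a cluster in a majority of the individual clusterings. The abstract does not specify which ensembling technique ICCE uses, so the sketch below is an assumed stand-in that illustrates the general idea, with hypothetical names throughout.

```python
from itertools import combinations


def consensus_clusters(labelings, threshold=0.5):
    """Merge items that share a cluster in more than `threshold` of the labelings."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        agree = sum(1 for lab in labelings if lab[i] == lab[j]) / len(labelings)
        if agree > threshold:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive integers in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Co-association is invariant to how each clustering head numbers its clusters, so heads that agree on the grouping but use different label ids still vote together. The resulting consensus labels could then serve as pseudo-labels for a classifier.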

[CV-62] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

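[Quick read]: This paper addresses the multimodal adversarial robustness of Vision-Language-Action models (VLAs) in embodied environments, focusing on how cross-modal perturbations affect decisions under realistic white-box and black-box conditions. Existing studies mostly consider single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning. The key to the solution is the VLA-Fool framework, which unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. It also incorporates a VLA-aware semantic space into linguistic prompts, giving what the authors describe as the first automatically crafted and semantically guided prompting framework. Experiments show that even minor multimodal perturbations cause significant behavioral deviations, exposing the fragility of embodied multimodal alignment.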

Link: https://arxiv.org/abs/2511.16203
Authors: Yuping Yan, Yuhan Xie, Yinxin Zhang, Lingjuan Lyu, Yaochu Jin
Affiliations: Westlake University; Zhejiang University; Pennsylvania State University; Sony Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

[CV-63] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction

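Quick read: This paper addresses the fact that current deep-learning methods for organ reconstruction ignore the geometric and spatial constraints between substructures, so their outputs are often anatomically implausible. The key to the solution is PrIntMesh, a template-based, topology-preserving reconstruction framework. Starting from a connected template, it jointly deforms all substructures to match patient-specific anatomy while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces, yielding a high-fidelity reconstruction of the organ as a unified system.

Link: https://arxiv.org/abs/2511.16186
Authors: Deniz Sayin Mercadier, Hieu Le, Yihong Chen, Jiancheng Yang, Udaranga Wickramasinghe, Pascal Fua
Affiliations: EPFL; UNC Charlotte; ELLIS Institute Finland; Adis SA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures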

Abstract:Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.

[CV-64] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification

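Quick read: This paper addresses Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID): transferring the knowledge of a model trained on public datasets to real-world scenes without annotating target-domain samples. The core challenges are inter-domain modality discrepancies and intra-domain modality discrepancies. The key to the solution is a two-stage model, Domain-Shared Learning and Gradual Alignment (DSLGA). In the first stage, a Domain-Shared Learning Strategy (DSLS) exploits information shared between the source and target domains to mitigate the ineffective pre-training caused by inter-domain modality discrepancies. In the second stage, a Gradual Alignment Strategy (GAS) aligns visible and infrared features step by step in a cluster-to-holistic manner, which addresses the cross-modality matching difficulty caused by intra-domain modality discrepancies.

Link: https://arxiv.org/abs/2511.16184
Authors: Nianchang Huang, Yi Xu, Ruida Xi, Ruida Xi, Qiang Zhang
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: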

Abstract:Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy and requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. While, in the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through a cluster-to-holistic alignment way. Finally, a new UDA-VI-ReID testing method i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. A large amount of experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.

[CV-65] FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos

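Quick read: This paper addresses the lack of reliable, automatically generated play-by-play action annotations in soccer video understanding; such data is essential for tactical modeling, trajectory forecasting, and performance analysis. Current computer-vision action recognition methods still cannot fully replace manual annotation and fall short of what is needed to build high-quality structured event sequences (who does what, when, and where). The key to the solution is the Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. Its core idea is to combine computer-vision outputs (such as tracking and identification) with prior tactical knowledge of soccer, including tactical regularities over long time horizons, so that play-by-play data streams can be extracted more automatically and reliably as a key input for data-driven sports analytics.

Link: https://arxiv.org/abs/2511.16183
Authors: Jeremie Ochin (CAOR), Raphael Chekroun, Bogdan Stanciulescu (CAOR), Sotiris Manitsaris (CAOR)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: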

Abstract:Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.

[CV-66] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

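Quick read: This paper addresses two problems in training Vision-Language-Action (VLA) models: directly predicting high-dimensional visual states spreads model capacity thin and makes training prohibitively expensive, while compressing visual states into compact supervisory signals creates an information bottleneck. It also targets the weak comprehension and reasoning of existing methods, which results from neglecting language supervision. The key to the solution is the Mantis framework with its Disentangled Visual Foresight (DVF) mechanism. Mantis decouples visual foresight prediction from the backbone by combining meta queries with a diffusion Transformer (DiT) head, and feeds the current visual state to the DiT through a residual connection. A simple objective then lets the meta queries automatically capture latent action information, which strengthens the learning of explicit actions. This disentangled design markedly reduces the burden on the VLA backbone, so it can better retain comprehension and reasoning under language supervision. Mantis reaches a 96.7% success rate on the LIBERO benchmark and, in real-world settings, shows better instruction following, generalization, and reasoning than strong baselines such as $\pi_{0.5}$.

Link: https://arxiv.org/abs/2511.16175
Authors: Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng
Affiliations: Shanghai Jiao Tong University; Shenzhen Institute of Artificial Intelligence; Nanjing University of Posts and Telecommunications; Fudan University; Bosch
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: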

Abstract:Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $\pi_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

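[CV-67] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective AAAI2026

Quick read: This paper addresses the limited multimodal dense-prediction capability in open-vocabulary semantic segmentation (OVSS). The core bottleneck is that the precision of pixel-level vision-language alignment is limited by attention dispersion inside CLIP. Specifically, the authors find that during dense prediction CLIP allocates a large share of its attention to tokens unrelated to the target region ("distraction tokens"). These tokens arise from over-activation in specific dimensions and weaken alignment in the regions that matter. The key to the solution is ReFocusing CLIP (RF-CLIP), a training-free method that emulates human distraction-refocusing behavior: it filters out the distraction tokens and redirects attention to the target regions, improving the granularity and accuracy of CLIP's pixel-level multimodal alignment. The method achieves state-of-the-art performance on eight benchmarks while maintaining high inference efficiency.

Link: https://arxiv.org/abs/2511.16170
Authors: Jiahao Li, Yang Lu, Yachao Zhang, Yong Xie, Fangyong Wang, Yuan Xie, Yanyun Qu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026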

Abstract:Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP’s vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP’s internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP’s dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP’s multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.

[CV-68] EvoVLA: Self-Evolving Vision-Language-Action Model

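Quick read: This paper addresses stage hallucination in Vision-Language-Action (VLA) models for long-horizon robotic manipulation: lacking fine-grained evaluation signals, the agent takes visual shortcuts and misleadingly reports task progress without actually completing the multi-step task. The key to the solution is the EvoVLA framework, whose core components are: 1) Stage-Aligned Reward (SAR), which applies triplet contrastive learning with Gemini-generated hard negatives to suppress visual shortcuts; 2) Pose-Based Object Exploration (POE), which grounds curiosity in the relative pose between object and gripper rather than in raw pixels; and 3) a long-horizon memory mechanism, which stabilizes intrinsic reward shaping through selective context retention and gated fusion. Together these improve the robustness and accuracy of long-horizon task execution.

Link: https://arxiv.org/abs/2511.16166
Authors: Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: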

Abstract:Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: this https URL. Website: this https URL.
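
The triplet contrastive objective mentioned for Stage-Aligned Reward can be illustrated with a minimal sketch. This is not EvoVLA's implementation: the cosine distance, the margin value, and the toy two-dimensional embeddings are all assumptions chosen for illustration. It only shows why a hard negative (one that looks like the true stage) still produces a training signal when an easy negative does not.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the negative is farther from the
    anchor than the positive by at least `margin` (margin is an assumption)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Hypothetical embeddings: an observed frame, the description of its true
# stage, an obviously wrong description, and a near-miss (hard) description.
frame = [1.0, 0.0]
true_stage = [0.9, 0.1]
easy_negative = [0.0, 1.0]
hard_negative = [0.8, 0.3]

easy_loss = triplet_loss(frame, true_stage, easy_negative)
hard_loss = triplet_loss(frame, true_stage, hard_negative)
```

With these toy vectors the easy negative is already separated by more than the margin, so only the hard negative contributes a positive loss.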

[CV-69] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

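Quick read: This paper addresses the excessive resource consumption (higher energy use, latency, and token cost) that arises when deployed Vision-Language Models (VLMs) generate long, low-information-density text. Existing methods merely delay the end-of-sequence (EOS) token to lengthen the output implicitly; they have no explicit objective on output token length, so they are unstable and of limited effect. The key to the solution is a novel Verbose-Text Induction Attack (VTIA) built as a two-stage framework. The first stage uses reinforcement learning to search automatically for adversarial prompt embeddings that trigger the large language model (LLM) component to produce verbose output. The second stage performs vision-aligned perturbation optimization: it maximizes the similarity between the visual embeddings of the perturbed image and the adversarial prompt embeddings, producing malicious images that induce the VLM to generate verbose text. The method explicitly maximizes output token length and markedly improves attack effectiveness, efficiency, and generalization.

Link: https://arxiv.org/abs/2511.16163
Authors: Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou
Affiliations: Huazhong University of Science and Technology; Fordham University; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: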

Abstract:With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation […]. […] studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and […]. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed […]. […], we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image’s visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.

[CV-70] Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

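Quick read: This paper addresses the insufficient stability of medical image segmentation models under distribution shifts and perturbations. It targets in particular the limitations of the mainstream remedy, adversarial training (AT): a trade-off between clean performance and robustness, and high training and tuning costs that hinder deployment at scale. The key to the solution is Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR). Small zero-mean noise is injected at multiple network layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. Prior-guided selective wavelet reconstruction is then applied on the input or feature branch to achieve frequency adaptation: it suppresses noise-sensitive bands, enhances directional structures and shape cues, and stabilizes boundary responses while maintaining spectral consistency. The method is backbone-agnostic and adds little inference overhead. It can be plugged into AT or used on its own, significantly reduces the performance drop under strong attacks, and when combined with AT improves robustness further without sacrificing clean accuracy, offering an engineering-friendly and scalable path to robust medical image segmentation.

Link: https://arxiv.org/abs/2511.16162
Authors: Yuting Lu, Ziliang Wang, Weixin Xu, Wei Zhang, Yongqiang Zhao, Yang Yu, Xiaohong Zhang
Affiliations: Chongqing University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: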

Abstract:Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean–robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR). During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds low additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_\infty$/$L_2$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation. These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.
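
The general idea of selective wavelet reconstruction (decompose, reweight a band, reconstruct) can be shown with a one-level 1-D Haar transform. This is only a sketch under strong assumptions: the abstract does not name the wavelet, the real method operates on images or feature maps, and its band weights come from the learned noise-guided prior, whereas `detail_gain` here is a hand-set hypothetical scalar.

```python
import math

def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence.
    Returns (approximation, detail) coefficient lists."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward: rebuild the sequence from both coefficient lists."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def selective_reconstruct(signal, detail_gain):
    """Scale the high-frequency (detail) band by `detail_gain`, then rebuild.
    detail_gain=1.0 reproduces the input; 0.0 removes the band entirely."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, [detail_gain * d for d in detail])

signal = [1.0, 3.0, 5.0, 5.0]
unchanged = selective_reconstruct(signal, 1.0)
smoothed = selective_reconstruct(signal, 0.0)
```

Setting the gain to zero replaces each pair of samples by its mean, which is the simplest possible form of suppressing a noise-sensitive band.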

[CV-71] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion AAAI AAAI-26

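Quick read: This paper addresses a long-standing challenge in point cloud completion: preserving the fine-grained geometric details of the input while ensuring the global structural integrity of the completed shape. Existing methods that directly regress local symmetry transformations preserve geometric structure well but have two key weaknesses. First, they overfit easily, tending to memorize instance-specific transformations instead of learning a general geometric prior. Second, they are sensitive to input noise, which hurts robustness and generalization. The key to the solution is to reformulate point-wise transformation regression as a distribution learning problem that combines symmetry priors with the generative power of diffusion models, avoiding instance-specific memorization while capturing robust geometric structure. A hierarchical Mamba-based architecture is also introduced for high-fidelity upsampling, which markedly improves completion quality.

Link: https://arxiv.org/abs/2511.16161
Authors: Lirui Zhang, Zhengkai Zhao, Zhi Zuo, Pan Gao, Jie Qin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the 40th AAAI Conference on Artificial Intelligence (AAAI-26)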

Abstract:Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting which tend to memorize instant-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression lead to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method’s state-of-the-art (SOTA) performance.

[CV-72] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

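Quick read: This paper addresses the limitations of Multimodal Large Language Models (MLLMs) in spatial intelligence, in particular the difficulty existing grid-based cognitive-map methods have with fine-grained spatial reasoning because they rely on discretized raster representations. The key to the solution is the Video2Layout framework, which uses continuous object boundary coordinates to quantify the physical distances between objects and their sizes. This gives the model quantitative spatial computation ability and alleviates the inherent ambiguity of describing spatial relationships in natural language. The method has two core stages: supervised fine-tuning on a high-quality dataset built with the AI2THOR simulator, so the model learns to map visual inputs to precise boundary coordinates, followed by reinforcement fine-tuning to improve generalization to real scenes. Experiments show that the proposed V2LO-7B model improves by 4.92% on average over a model trained on grid maps on QVS-Bench and mainstream spatial reasoning benchmarks.

Link: https://arxiv.org/abs/2511.16160
Authors: Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao
Affiliations: Harbin Institute of Technology; Tsinghua University; Chinese Academy of Sciences; Institute of Microelectronics of the Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: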

Abstract:Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model’s ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model’s real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at this https URL.
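
To see why continuous boundary coordinates support quantitative spatial reasoning in a way a coarse grid cannot, consider the following sketch. The box format `(x_min, y_min, x_max, y_max)` in metres on a 2-D floor plan, and the object names, are assumptions made for illustration; the abstract does not specify the layout representation Video2Layout actually emits.

```python
import math

def box_size(box):
    """Width and depth of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min, y_max - y_min)

def box_gap(a, b):
    """Shortest distance between two axis-aligned boxes; 0.0 if they touch
    or overlap. Each axis contributes only the separation along that axis."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

# Hypothetical layout: coordinates in metres.
sofa = (0.0, 0.0, 2.0, 1.0)
table = (5.0, 5.0, 6.0, 6.0)
rug = (1.5, 0.5, 3.0, 2.0)

sofa_size = box_size(sofa)           # (2.0, 1.0)
sofa_to_table = box_gap(sofa, table)  # 3 m apart in x, 4 m in y -> 5.0
sofa_to_rug = box_gap(sofa, rug)      # boxes overlap -> 0.0
```

Questions such as "which object is closer to the sofa" then reduce to comparing numbers rather than interpreting cell occupancy.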

[CV-73] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

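Quick read: This paper addresses the high computational cost of Diffusion Transformers (DiTs) in image generation, which stems from their large parameter counts and limits deployment in resource-constrained environments. The key to the solution is a flexible structured pruning framework called Pluggable Pruning with Contiguous Layer Distillation (PPCL). It first identifies redundant layer intervals through linear probing combined with first-order differential trend analysis of similarity metrics. It then uses a teacher-student alternating distillation scheme that integrates depth-wise and width-wise pruning in a single training phase, transferring knowledge across different pruning ratios without retraining for each configuration. Experiments show that PPCL achieves a 50% reduction in parameters while keeping the degradation in key metrics below 3%, which makes the model considerably more practical in resource-constrained settings.

Link: https://arxiv.org/abs/2511.16156
Authors: Jian Ma, Qirong Peng, Xujie Zhu, Peixing Xie, Chen Chen, Haonan Lu
Affiliations: OPPO AI Center; Sun Yat-sen University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL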

Abstract:Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code, checkpoints for PPCL can be found at the following link: this https URL.
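
One way to read "first-order differential trend analysis of similarity metrics" is to look for runs of layers over which a per-layer similarity score barely changes. The sketch below implements that reading only. The abstract does not give the actual selection rule, so the flatness tolerance `tol`, the minimum run length `min_len`, and the example scores are all hypothetical.

```python
def flat_intervals(similarity, tol=0.01, min_len=2):
    """Return inclusive (start, end) layer-index pairs over which the
    similarity curve is nearly flat: every consecutive first-order
    difference inside the interval has magnitude <= tol, and the interval
    spans at least `min_len` such differences."""
    diffs = [similarity[i + 1] - similarity[i] for i in range(len(similarity) - 1)]
    intervals = []
    start = None
    for i, d in enumerate(diffs):
        if abs(d) <= tol:
            if start is None:
                start = i  # a flat run begins at layer i
        else:
            if start is not None and i - start >= min_len:
                intervals.append((start, i))  # run covers layers start..i
            start = None
    if start is not None and len(diffs) - start >= min_len:
        intervals.append((start, len(diffs)))
    return intervals

# Hypothetical per-layer similarity scores for a 7-layer model.
scores = [0.50, 0.70, 0.90, 0.905, 0.91, 0.912, 0.60]
candidates = flat_intervals(scores)
```

Here layers 2 through 5 form the only flat run, so they would be the contiguous candidate interval handed to a later pruning or distillation step.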

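[CV-74] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

Quick read: This paper addresses the fact that existing multimodal embedding methods ignore the generative reasoning ability of Multimodal Large Language Models (MLLMs) during extraction, which limits the quality of the resulting representations. The key to the solution is Reasoning Guided Embeddings (RGE), which explicitly brings the MLLM's generative reasoning into the extraction pipeline. The model first performs structured rationale generation conditioned on the instruction and then extracts the embedding after the reasoning has unfolded, which strengthens the embedding's sensitivity to context; contrastive training further refines representation quality. Experiments show that the method improves multimodal retrieval performance on the MMEB benchmark by 4.9% over the non-reasoning baseline.

Link: https://arxiv.org/abs/2511.16150
Authors: Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu, Rui Zhao, Limin Wang
Affiliations: Nanjing University; Sensetime Research; Beijing Institute of Technology; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: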

Abstract:Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
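
The contrastive training that RGE couples with reasoning is commonly instantiated as an InfoNCE-style loss. The abstract does not state which loss is used, so treat the form below, the temperature value, and the toy vectors as assumptions. In an RGE-like setup the `query` vector would be the embedding extracted after the rationale has been generated; here it is simply a hand-written list.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(query, candidates, positive_index, temperature=0.05):
    """Cross-entropy over temperature-scaled cosine similarities, with
    candidates[positive_index] as the target (log-sum-exp for stability)."""
    logits = [cosine(query, c) / temperature for c in candidates]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[positive_index]

# Hypothetical query embedding and two retrieval candidates.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]

loss_when_matched = info_nce(query, candidates, positive_index=0)
loss_when_mismatched = info_nce(query, candidates, positive_index=1)
```

The loss is near zero when the query already points at its positive and large when it points at a negative, which is the pressure that pulls matching query and target embeddings together.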

[CV-75] LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

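Quick read: This paper addresses the lack of open-vocabulary semantic understanding in Simultaneous Localization and Mapping (SLAM) systems based on 3D Gaussian Splatting (3DGS): existing methods struggle to provide the semantic information that advanced robotic interaction requires, and traditional static models cannot adapt to new environments. The key to the solution is the LEGO-SLAM framework, whose core contributions are: 1) a scene-adaptive encoder-decoder that compresses high-dimensional language embeddings into a 16-dimensional feature space, markedly reducing memory per Gaussian and accelerating rendering; 2) an online adaptation mechanism that lets the encoder adjust to unseen scenes, improving generalization; 3) a language-guided pruning strategy built on the compact semantic features, which removes more than 60% of the Gaussians without degrading rendering quality; and 4) a language-based loop detection method that reuses the mapping features and needs no separate detection model. Together these enable real-time, open-vocabulary semantic SLAM at 15 FPS.

Link: https://arxiv.org/abs/2511.16144
Authors: Sibaek Lee, Seongbo Ha, Kyeongsu Kang, Joonyeol Choi, Seungjun Tak, Hyeonwoo Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 18 pages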

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map’s Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.

[CV-76] A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection

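Quick read: This paper addresses two key problems in remote sensing Water Body Change Detection (WBCD). First, high-spatial-resolution datasets are scarce, which limits applications such as urban and rural areas that need precise positioning. Second, existing deep-learning methods do not fully exploit the spatial semantic and structural information in deep features, so their ability to discriminate water bodies is insufficient. The key to the solution is a new high-resolution dataset, HSRW-CD (spatial resolution higher than 3 meters), and a plug-and-play Spatial Semantics and Continuity Perception (SSCP) attention module. SSCP consists of Multi-Semantic Spatial Attention (MSA), Structural Relation-aware Global Attention (SRGA), and Channel-wise Self-Attention (CSA), which together strengthen the spatial semantics, structural continuity, and cross-channel similarity computation of water body features, significantly improving the accuracy and generalization of water body change detection.

Link: https://arxiv.org/abs/2511.16143
Authors: Quanqing Ma, Jiaen Chen, Peng Wang, Yao Zheng, Qingzhan Zhao, Yuchen Zheng
Affiliations: Shihezi University; Production and Construction Corps Engineering Research Center for Spatial Information Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: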

Abstract:Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. Recently, the scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. Besides, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water body. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be accessed at this https URL.

[CV-77] Real-Time 3D Object Detection with Inference-Aligned Learning AAAI2026

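Quick read: This paper addresses the training-inference gap in real-time 3D object detection from point clouds: during training the model lacks awareness of spatial reliability and ranking, whereas inference uses ranking-based prediction selection, so the learned representations are misaligned with actual inference behavior. The key to the solution is the SR3D framework, whose core contributions are (1) a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-localized, spatially reliable samples, and (2) a rank-aware self-distillation scheme that adaptively injects ranking awareness through a self-distillation paradigm, thereby mimicking inference-time ranking behavior during training and narrowing the gap between training and inference.

Link: https://arxiv.org/abs/2511.16140
Authors: Chenyu Zhao, Xianwei Zheng, Zimin Xia, Linwei Yue, Nan Xue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026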

Abstract:Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used at inference. Such a training-inference gap hampers the model’s ability to learn representations aligned with inference-time behavior. To address the limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.

[CV-78] Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video

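Quick read: This paper targets two key problems in Quality Enhancement for Compressed Video (QECV). First, when Quantization Parameters (QPs) are unavailable, existing blind methods rely only on a global degradation vector to guide artifact removal, which lacks spatial detail and adapts poorly to artifact patterns that vary by position. Second, whether or not the QP is known, existing methods generally use a fixed architecture and ignore how computational demands differ across compression levels. The key to the solution is a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations to guide finer-grained artifact removal, together with a hierarchical termination mechanism that dynamically adjusts the number of artifact-reduction stages according to the compression level, improving performance and computational efficiency together.

Link: https://arxiv.org/abs/2511.16137
Authors: Li Yu, Yingbo Zhao, Shiyu Wu, Siyue Yu, Moncef Gabbouj, Qingshan Liu
Affiliations: Nanjing University of Information Science & Technology (南京信息工程大学); Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (江苏省大气环境与装备技术协同创新中心); Xi’an Jiaotong-Liverpool University (西交利物浦大学); Tampere University (坦佩雷大学); Nanjing University of Posts and Telecommunications (南京邮电大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: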

Abstract:Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
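
The 110% figure quoted in the abstract is a relative gain between two PSNR improvements (0.31 dB and 0.65 dB), not an absolute one. The sketch below checks that arithmetic and shows a toy version of stopping after a level-dependent number of stages. The `STAGE_BUDGET` table and the function names are hypothetical and chosen only for illustration; the abstract does not specify the paper's actual termination rule.

```python
def relative_gain(baseline_db, improved_db):
    """Relative improvement of one PSNR gain over another, as a fraction."""
    return (improved_db - baseline_db) / baseline_db


# Hypothetical stage budgets: lighter compression (lower QP) gets fewer stages.
STAGE_BUDGET = {22: 2, 27: 3, 32: 3, 37: 4, 42: 4}


def enhance(frame, stages, qp_level):
    """Apply artifact-reduction stages in order, stopping at the level's budget."""
    budget = STAGE_BUDGET[qp_level]
    for stage in stages[:budget]:
        frame = stage(frame)
    return frame


if __name__ == "__main__":
    # (0.65 - 0.31) / 0.31 is about 1.097, which rounds to the quoted 110%.
    print(f"{relative_gain(0.31, 0.65):.1%}")
```

With this made-up budget, an input at QP = 22 runs half as many stages as one at QP = 42, which is the kind of saving the abstract reports for inference time.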

[CV-79] How Noise Benefits AI-generated Image Detection

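Quick read: This paper addresses the weak out-of-distribution generalization of AI-generated image detectors, which it attributes to spurious shortcuts that models rely on during training. The key to the solution is Positive-Incentive Noise for CLIP (PiN-CLIP), based on a variational positive-incentive principle: noise is constructed in the feature space via cross-attention fusion of visual and semantic features and injected to fine-tune the visual encoder, which suppresses shortcut-sensitive directions, strengthens stable forensic cues, and yields more robust and generalizable forgery representations.

Link: https://arxiv.org/abs/2511.16136
Authors: Jiazhen Yan, Ziqiang Li, Fan Wang, Kai Zeng, Zhangjie Fu
Affiliations: Nanjing University of Information Science and Technology (南京信息工程大学); University of Macau (澳门大学); University of Siena (锡耶纳大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: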

Abstract:The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.

[CV-80] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation AAAI2026

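Quick read: This paper proposes a new solution to the challenges that large pixel motion and high computational cost pose for high-resolution Video Frame Interpolation (VFI). Existing methods typically estimate bidirectional optical flow at low resolution and then obtain high-resolution flow with an upsampling strategy such as bilinear interpolation. This tends to blur or pixelate flow edges and fails to capture fine motion at high resolution, so the task-oriented flows become misaligned and the interpolated frames show ghosting and discontinuities. The proposed VTinker framework has two core components: Guided Flow Upsampling (GFU) and Texture Mapping. GFU uses the input frames as guidance to reduce the detail blur introduced by bilinear upsampling and sharpen flow edges. Texture Mapping then generates an initial interpolated frame (the intermediate proxy), uses it to select clear texture blocks from the input frames and map them onto the proxy, and produces the final high-quality frame through a reconstruction module, avoiding pixel-level ghosting and discontinuities. The key innovation is the combination of guided flow enhancement with a structure-aware, block-level texture mapping mechanism, which markedly improves interpolation quality.

Link: https://arxiv.org/abs/2511.16124
Authors: Chenyang Wu, Jiayi Fu, Chun-Le Guo, Shuhao Han, Chongyi Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026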

Abstract:Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows’ edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurring details in bilinear upsampling flows, which makes flows’ edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Codes are available at: this https URL.
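
To make the upsampling issue concrete, the minimal sketch below linearly upsamples one row of a flow field (a 1D stand-in for bilinear upsampling, chosen for brevity) and rescales the displacements by the same factor, since a displacement of d pixels at low resolution spans scale * d pixels at high resolution. It illustrates the plain baseline that GFU is meant to improve on, not GFU itself; the function name and the half-pixel-centre sampling convention are assumptions.

```python
def upsample_flow_1d(flow, scale):
    """Linearly upsample a row of flow values and rescale their magnitudes."""
    n = len(flow)
    out = []
    for i in range(n * scale):
        # Map the output sample centre back to input coordinates, then clamp.
        src = (i + 0.5) / scale - 0.5
        src = min(max(src, 0.0), n - 1.0)
        lo = int(src)
        hi = min(lo + 1, n - 1)
        w = src - lo
        # Interpolate, then multiply by scale so displacements match the new grid.
        out.append(scale * ((1 - w) * flow[lo] + w * flow[hi]))
    return out


if __name__ == "__main__":
    print(upsample_flow_1d([0.0, 0.0, 4.0, 4.0], 2))
```

For a sharp motion boundary such as `[0, 0, 4, 4]` upsampled by 2, the output steps through the in-between values 2 and 6 before reaching 8, which is the edge blurring the abstract describes.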

[CV-81] Decoupling Complexity from Scale in Latent Diffusion Model

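Quick read: This paper addresses the coupling of scale and content complexity in existing Latent Diffusion Models (LDMs): models typically use more latent tokens to represent higher-resolution images or higher-frame-rate videos, although the latent capacity actually required depends mainly on content complexity rather than on scale alone. The key to the solution is a new paradigm, DCS-LDM (Decoupled Complexity and Scale Latent Diffusion Model), which builds a hierarchical, scale-independent latent space that models sample complexity with multi-level tokens and supports decoding a fixed latent representation to arbitrary resolutions and frame rates. This design allows a flexible trade-off between computational cost and generation quality and supports progressive coarse-to-fine generation, substantially improving generation flexibility while maintaining performance.

Link: https://arxiv.org/abs/2511.16117
Authors: Tianxiong Zhong, Xingye Tian, Xuebo Wang, Boyuan Jiang, Xin Tao, Pengfei Wan
Affiliations: Kling Team, Kuaishou Technology (快手科技)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 16 figures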

Abstract:Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.

[CV-82] Clustered Error Correction with Grouped 4D Gaussian Splatting SIGGRAPH

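Quick read: This paper addresses two core problems of existing 4D Gaussian Splatting (4DGS) methods in dynamic scene reconstruction: difficulty resolving ambiguous pixel correspondences, and inadequate densification in dynamic regions. The key to the solution is two mechanisms. The first is Elliptical Error Clustering and Error Correcting Splat Addition, which classifies rendering errors into missing-color and occlusion types and applies targeted corrections via backprojection or foreground splitting guided by cross-view color consistency, pinpointing dynamic regions and initializing fitting splats there. The second is Grouped 4D Gaussian Splatting, which improves the consistency of the mapping between splats and dynamic objects and thus temporal stability. Experiments show markedly improved temporal consistency on the Neural 3D Video and Technicolor datasets, a PSNR gain of 0.39 dB on the Technicolor Light Field dataset, and state-of-the-art perceptual rendering quality.

Link: https://arxiv.org/abs/2511.16112
Authors: Taeho Kang, Jaeyeon Park, Kyungjin Lee, Youngki Lee
Affiliations: Seoul National University (首尔国立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 16 pages, 8 figures, SIGGRAPH Asia Conference Papers 2025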

Abstract:Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving 0.39dB of PSNR on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method’s capability to identify errors and properly initialize new splats. Our implementation details and source code are available at this https URL.

[CV-83] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

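Quick read: This paper investigates the capability boundary of vision-language models (VLMs) in cross-task visual in-context learning (VICL): can a VLM still support VICL when the visual prompt and the target image come from different low-level vision tasks? The key to the solution is a fully collaborative pipeline, T2T-VICL, whose core contributions are a mechanism for generating and selecting the text prompts that best implicitly describe the differences between two distinct low-level vision tasks, the first cross-task VICL dataset, and a new inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL.

Link: https://arxiv.org/abs/2511.16107
Authors: Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu
Affiliations: Duke University (杜克大学); Texas A&M University (德克萨斯农工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: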

Abstract:In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.

[CV-84] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments

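Quick read: This paper addresses the limited performance of SLAM (simultaneous localization and mapping) based on 4D millimeter-wave (mmWave) radar in kilometer-scale outdoor environments, in particular interference from dynamic objects, poor texture consistency, and high memory consumption in large scenes. The core of the solution is the Rad-GS system, which uses a differentiable 3D Gaussian representation as its spatial model. It combines raw radar point clouds carrying Doppler information with geometrically enhanced point clouds to guide dynamic object masking in images, reducing rendering artifacts and improving localization accuracy. It also uses unsynchronized image frames to globally refine the 3D Gaussian representation, improving texture consistency and novel view synthesis quality, and pairs a global octree structure with a targeted Gaussian primitive management strategy to substantially reduce memory use while preserving reconstruction accuracy, enabling efficient and robust reconstruction of large outdoor scenes.

Link: https://arxiv.org/abs/2511.16091
Authors: Renxiang Xiao, Wei Liu, Yuanfan Zhang, Yushuai Chen, Jinming Chen, Zilu Wang, Liang Hu
Affiliations: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: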

Abstract:We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.

[CV-85] SpectralTrain: A Universal Framework for Hyperspectral Image Classification

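Quick read: This paper addresses the limited deployment of deep learning models in practical remote sensing tasks caused by the large data volumes and computationally intensive training of hyperspectral image (HSI) classification. The key to the solution is SpectralTrain, a universal, architecture-agnostic training framework whose core idea is to combine curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling: spectral complexity is introduced gradually while essential information is preserved, so spectral-spatial features are learned efficiently at much lower computational cost. The framework does not depend on a particular network architecture, optimizer, or loss function and is compatible with both classical and state-of-the-art models. Experiments on several benchmark datasets show strong generalization across spatial scales, spectral characteristics, and application domains, with 2-7x faster training and small-to-moderate accuracy changes depending on the backbone.

Link: https://arxiv.org/abs/2511.16084
Authors: Meihua Zhou, Liping Yu, Jiawei Cai, Wai Kin Fung, Ruiguo Hu, Jiarui Zhao, Wenzhuo Liu, Nan Wan
Affiliations: University of Chinese Academy of Sciences (中国科学院大学); The Chinese University of Hong Kong (香港中文大学); Beijing Institute of Technology (北京理工大学); Chinese Academy of Sciences (中国科学院); Wannan Medical College (皖南医学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: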

Abstract:Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent training speedups of 2-7x, with small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at this https URL.
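
One way to picture "gradually introducing spectral complexity" is a schedule that maps the training epoch to the number of PCA components kept per pixel. The linear schedule, the function names, and the component counts below are assumptions for illustration only; the abstract does not state which schedule SpectralTrain actually uses.

```python
def components_for_epoch(epoch, total_epochs, min_components, max_components):
    """Number of leading PCA components to keep at a given epoch (linear ramp)."""
    if total_epochs <= 1:
        return max_components
    frac = epoch / (total_epochs - 1)
    return min_components + round(frac * (max_components - min_components))


def reduce_bands(pca_pixel, num_components):
    """Keep only the leading components of a pixel already projected by PCA."""
    return pca_pixel[:num_components]


if __name__ == "__main__":
    for epoch in range(8):
        print(epoch, components_for_epoch(epoch, 8, 4, 32))
```

Early epochs then train on short spectral vectors, which is cheaper, and later epochs see progressively more of the spectrum.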

[CV-86] VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

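Quick read: This paper addresses the poor out-of-distribution generalization and lack of explicit reasoning in traditional video reasoning segmentation methods, which rely on supervised fine-tuning. The key to the solution is VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture with three stages: (1) a hierarchical text-guided frame sampler that emulates human attention; (2) a reasoning model that produces spatial cues together with explicit reasoning chains; and (3) a segmentation-propagation stage based on SAM2 and XMem that propagates masks through the video. A task difficulty-aware mechanism adaptively controls reasoning length, improving efficiency while preserving accuracy.

Link: https://arxiv.org/abs/2511.16077
Authors: Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: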

Abstract:Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at this https URL.

[CV-87] LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

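Quick read: This paper addresses the synthesis of high-fidelity, controllable 4D LiDAR data, a key challenge in building scalable simulation environments for autonomous driving. The main difficulties are the sensor's spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. The key to the solution is LiSTAR, a generative world model that operates directly on the sensor's native geometry. First, it introduces a Hybrid-Cylindrical-Spherical (HCS) representation that reduces the quantization artifacts of Cartesian grids and preserves data fidelity. Second, it uses a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models how features evolve over time along individual sensor rays, strengthening temporal coherence. Finally, it proposes a 4D point cloud-aligned voxel layout for conditioning and a discrete Masked Generative START (MaskSTART) framework that learns a compact, tokenized scene representation, enabling efficient, high-resolution, layout-guided compositional generation.

Link: https://arxiv.org/abs/2511.16049
Authors: Pei Liu, Songtao Wang, Lang Zhang, Xingyue Peng, Yuandong Lyu, Jiaxin Deng, Songxin Lu, Weiliang Ma, Xueyang Zhang, Yifei Zhan, XianPeng Lang, Jun Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: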

Abstract:Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor’s unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor’s native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR’s state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: this https URL.

[CV-88] AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

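Quick read: This paper addresses the lack of systematic study of Key and Value (KV) cache design for visual autoregressive modeling (VAR) based on next-scale prediction, and in particular the bottleneck that KV memory grows sharply as the number of scales increases, which severely limits scalability. The key to the solution is AMS-KV, a scale-adaptive KV caching policy: it analyzes KV similarity across scales to identify cache-demanding layers and prioritizes retaining KVs from the coarsest scales (condensed scales) and from local scales to preserve generation quality, while optimizing cache utilization and computational efficiency. Without a significant loss in generation quality, it reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%, and enables stable scaling of the batch size from 128 to 256.

Link: https://arxiv.org/abs/2511.16047
Authors: Boxun Xu, Yu Wang, Zihu Wang, Peng Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: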

Abstract:Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, severely limiting scalability. Our systematic investigation reveals that: (1) Attending to tokens from local scales significantly contributes to generation quality; (2) Allocating a small amount of memory for the coarsest scales, termed as condensed scales, stabilizes multi-scale image generation; (3) Strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on the observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
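
The retention rule described in the abstract (keep KVs from the coarsest "condensed" scales and from the most recent "local" scales) can be sketched as below. This is a simplified illustration, not the paper's implementation: it applies one rule to every layer, whereas AMS-KV also distinguishes cache-demanding from cache-efficient layers. The function names, the parameters `num_condensed` and `num_local`, and the side lengths in the usage example are all hypothetical.

```python
def scales_to_keep(current_scale, num_condensed, num_local):
    """Indices of earlier scales whose KVs stay cached while generating current_scale."""
    previous = list(range(current_scale))
    condensed = previous[:num_condensed]
    local = previous[-num_local:] if num_local > 0 else []
    return sorted(set(condensed) | set(local))


def cached_fraction(side_lengths, current_scale, num_condensed, num_local):
    """Fraction of earlier-scale tokens retained, assuming side * side tokens per scale."""
    total = sum(s * s for s in side_lengths[:current_scale])
    if total == 0:
        return 0.0
    kept = scales_to_keep(current_scale, num_condensed, num_local)
    return sum(side_lengths[i] ** 2 for i in kept) / total


if __name__ == "__main__":
    sides = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]  # illustrative token-map side lengths
    print(scales_to_keep(9, 2, 2), cached_fraction(sides, 9, 2, 2))
```

Because token counts grow quadratically with side length, most of the cache sits in the finest scales, so the savings from a rule like this depend heavily on how many local scales are kept.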

[CV-89] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

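Quick read: This paper addresses three key challenges in food recognition: domain shift, where training samples differ visually from images that users capture in free-living environments; long-tailed data distributions, where some food categories have far fewer samples than others; and fine-grained classification, where different categories differ only subtly in appearance. The core of the solution is to use large language models (LLMs) to parse food images and generate text describing food titles and ingredients, then map image and text features into a shared embedding space that maximizes cross-modal similarity, jointly handling domain shift, long-tailed distributions, and fine-grained recognition.

Link: https://arxiv.org/abs/2511.16037
Authors: Qing Wang, Chong-Wah Ngo, Ee-Peng Lim, Qianru Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: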

Abstract:Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, the real-world food datasets tend to be long-tailed distributed and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.

[CV-90] Crossmodal learning for Crop Canopy Trait Estimation

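Quick read: This paper addresses the information mismatch in agricultural monitoring caused by the difference in spatial resolution between high-resolution unmanned aerial vehicle (UAV) imagery and lower-resolution satellite imagery, especially given the demand for fine detail in micro-plot management in modern precision agriculture. The key to the solution is a cross-modal learning strategy that trains a model to learn fine-grained spectral-spatial correspondences between satellite and UAV imagery, so that satellite images are enriched into representations with UAV-level visual detail, markedly improving performance on downstream tasks such as yield and nitrogen prediction.

Link: https://arxiv.org/abs/2511.16031
Authors: Timilehin T. Ayanlade, Anirudha Powadi, Talukder Z. Jubery, Baskar Ganapathysubramanian, Soumik Sarkar
Affiliations: Iowa State University (爱荷华州立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures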

Abstract:Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.

[CV-91] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

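Quick read: This paper addresses overfitting in sparse-view 3D Gaussian Splatting (3DGS) reconstruction caused by scarce supervision and limited viewpoint coverage. The key to the solution is CuriGS, a curriculum learning framework whose core idea is to introduce student views: pseudo-views sampled around the real (teacher) views, with multiple perturbation levels controlling how difficulty increases during training. Training follows a curriculum schedule that gradually unlocks higher perturbation levels; candidate student views are sampled at random, regularized via depth-correlation and co-regularization, and evaluated with a multi-signal metric combining SSIM, LPIPS, and an image-quality measure. High-quality student views are added to the training set to steadily augment the sparse training views, improving rendering fidelity and geometric consistency.

Link: https://arxiv.org/abs/2511.16030
Authors: Zijian Wu, Mingfeng Jiang, Zidian Lin, Ying Song, Hanjie Ma, Qun Wu, Dongping Zhang, Guiyang Pu
Affiliations: Zhejiang Sci-Tech University (浙江理工大学); China Jiliang University (中国计量大学); Zhejiang University (浙江大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: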

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation levels, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: this https URL
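
The curriculum described in the abstract (unlock higher perturbation levels over time, sample candidate students from the active level, promote those that pass a quality threshold) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: a fixed number of steps per level, a single scalar score where higher is better, and promotion of only the single best student. All function and parameter names are hypothetical.

```python
import random


def active_level(step, steps_per_level, num_levels):
    """Highest perturbation level unlocked at a given training step (0-based)."""
    return min(step // steps_per_level, num_levels - 1)


def sample_student(students_by_level, step, steps_per_level, rng):
    """Pick a random candidate student view from the currently active level."""
    level = active_level(step, steps_per_level, len(students_by_level))
    return level, rng.choice(students_by_level[level])


def promote(scored_students, threshold):
    """Return the best-scoring student if it clears the threshold, else None."""
    best_view, best_score = max(scored_students, key=lambda pair: pair[1])
    return best_view if best_score >= threshold else None


if __name__ == "__main__":
    rng = random.Random(0)
    levels = [["near_a", "near_b"], ["mid_a", "mid_b"], ["far_a", "far_b"]]
    print(sample_student(levels, 250, 100, rng))
    print(promote([("mid_a", 0.62), ("mid_b", 0.91)], threshold=0.8))
```

In the paper the score combines SSIM, LPIPS, and an image-quality measure; here it is left as an opaque number supplied by the caller.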

[CV-92] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning

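Quick read: This paper addresses the threat to the environment and to workers' health posed by the large amounts of dust, smoke, and aerosols generated during laser cutting, and aims to identify and classify materials in real time to make cutting safer and more efficient. The key to the solution is to combine laser speckle sensing with deep learning: a convolutional neural network (CNN) is trained to extract features from speckle images of the material surface and recognize the material type with high accuracy. The method keeps performing well when the laser color changes, which supports its robustness and generalization.

Link: https://arxiv.org/abs/2511.16026
Authors: Mohamed Abdallah Salem, Hamdy Ahmed Ashur, Ahmed Elshinnawy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: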

Abstract:Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers’ health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material’s surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30% on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.

[CV-93] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution AAAI2026

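[Quick read]: This paper addresses the limitation that existing real-world image super-resolution (Real-ISR) methods, which fine-tune pre-trained diffusion models (e.g., with LoRA modules), struggle to adaptively capture the heterogeneous characteristics of complex real-world degraded samples and cannot share knowledge effectively under the same computational budget. The key to the solution is a sparse mixture-of-experts architecture called Mixture-of-Ranks (MoR), which treats each rank in LoRA as an independent expert and introduces a fine-grained expert partitioning strategy: fixed-position ranks serve as shared experts that preserve common features and reduce routing redundancy. A degradation estimation module based on CLIP embeddings and predefined positive-negative text pairs computes relative degradation scores that guide expert activation. In addition, zero-expert slots and a degradation-aware load-balancing loss let the number of active experts vary with degradation severity, improving the allocation of compute.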

Link: https://arxiv.org/abs/2511.16024
Authors: Xiao He, Zhijun Tu, Kun Cheng, Mingrui Zhu, Jie Hu, Nannan Wang, Xinbo Gao
Affiliations: Xidian University; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, Accepted by AAAI 2026

Abstract:The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) modules to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework’s effectiveness and state-of-the-art performance.
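
The rank-as-expert idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it only shows a gated LoRA update of the form B · diag(g) · A · x, where the gate g keeps a few shared ranks always on and activates the top-scoring remaining ranks. The function names, the top-k rule, and the mapping from degradation score to the number of active ranks are all assumptions made for illustration.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def rank_gate(router_scores, n_shared, k_active):
    """Build a 0/1 gate over LoRA ranks.

    The first n_shared ranks act as shared experts (always on); among the
    remaining ranks, the k_active highest-scoring ones are switched on.
    """
    r = len(router_scores)
    gate = [1.0 if i < n_shared else 0.0 for i in range(r)]
    routed = sorted(range(n_shared, r), key=lambda i: router_scores[i], reverse=True)
    for i in routed[:k_active]:
        gate[i] = 1.0
    return gate

def mor_delta(A, B, x, gate):
    """Compute B @ diag(gate) @ A @ x, the gated low-rank update."""
    h = matvec(A, x)                       # project into rank space (length r)
    h = [g * v for g, v in zip(gate, h)]   # switch individual ranks on or off
    return matvec(B, h)                    # project back to the output space

def active_ranks_for(degradation, max_routed):
    """Hypothetical rule: a higher degradation score in [0, 1] activates more ranks."""
    return max(0, min(max_routed, round(degradation * max_routed)))
```

With `k_active = 0` only the shared ranks contribute, which loosely mirrors the role the abstract gives to zero-expert slots for easy inputs.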

[CV-94] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

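[Quick read]: This paper addresses the vulnerability of deep neural networks for human detection to adversarial attacks, particularly wearable attacks, which pose serious safety and privacy risks in real surveillance environments. Existing methods usually optimize textures frame by frame and struggle to stay concealed over long video sequences with motion, pose changes, and garment deformation. The key to the solution is a sequence-level optimization framework: product images are mapped to UV space and parameterized as a compact palette plus control points, with ICC locking to keep colors printable; a physically based human-garment simulation pipeline models motion, multiple camera viewpoints, cloth dynamics, and illumination changes; and an expectation-over-transformation objective with temporal weighting optimizes the control points to minimize detection confidence over the whole video, yielding stable, robust concealment in digital and physical settings that transfers across models.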

Link: https://arxiv.org/abs/2511.16020
Authors: Dingkun Zhou, Patrick P. K. Chan, Hengxu Wu, Shikang Zheng, Ruiqi Huang, Yuanjie Zhao
Affiliations: School of Future Technology, South China University of Technology; Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
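
The expectation-over-transformation objective with temporal weighting mentioned in this abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's code: `detector`, the transformation callables, and the per-frame `weights` are hypothetical stand-ins, and the physically based simulation is replaced by arbitrary Python functions.

```python
import random

def sequence_eot_loss(texture, frames, transforms, detector, weights,
                      n_samples=8, seed=0):
    """Temporally weighted expectation-over-transformation (EOT) loss.

    For every frame, detection confidence is averaged over randomly sampled
    transformations; the per-frame means are then combined using normalised
    temporal weights. Lower values mean better concealment over the sequence.
    """
    rng = random.Random(seed)
    total_w = sum(weights)
    loss = 0.0
    for frame, w in zip(frames, weights):
        confs = []
        for _ in range(n_samples):
            t = rng.choice(transforms)                 # sample one transformation
            confs.append(detector(t(frame, texture)))  # confidence on the transformed view
        loss += (w / total_w) * (sum(confs) / len(confs))
    return loss
```

An optimizer would adjust the texture parameters to reduce this value; averaging over the whole sequence rather than per frame is what distinguishes it from frame-by-frame optimization.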

[CV-95] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection

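[Quick read]: This paper addresses the high false positive rate (FPR) and low tail-class classification accuracy that deep neural networks (DNNs) suffer in out-of-distribution (OOD) detection when the in-distribution (ID) data is long-tailed. The key to the solution is modeling inter-sample relationships with a graph-based representation: the graph is initialized from the feature space of a pre-trained model, with Gaussianization applied to the activation layers to reduce deviations from a standard normal distribution caused by the mismatch between pre-training and training data; graph convolutional networks (GCNs) then refine this initial graph into a feature space suited to long-tailed OOD detection, improving tail-class ID classification accuracy and overall OOD detection performance.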

Link: https://arxiv.org/abs/2511.16015
Authors: Nimeshika Udayangani, Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie
Affiliations: The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.
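
The three steps in this abstract (Gaussianize activations, build a graph from the feature space, refine with graph convolutions) can be sketched in plain Python. The specific choices below are assumptions: a rank-based inverse normal transform stands in for the paper's Gaussianization, a k-nearest-neighbour graph stands in for its graph construction, and a single unweighted mean-aggregation step stands in for a trained GCN layer.

```python
from statistics import NormalDist

def gaussianize(values):
    """Rank-based inverse normal transform of one activation dimension."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n lies strictly inside (0, 1), so inv_cdf is finite
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out

def knn_graph(features, k):
    """Connect each sample to its k nearest neighbours (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    neighbours = []
    for i, f in enumerate(features):
        others = sorted((j for j in range(len(features)) if j != i),
                        key=lambda j: dist(f, features[j]))
        neighbours.append(others[:k])
    return neighbours

def propagate(features, neighbours):
    """One mean-aggregation step over each node and its neighbours."""
    out = []
    for i, nbrs in enumerate(neighbours):
        group = [features[i]] + [features[j] for j in nbrs]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out
```

The propagation step is where inter-sample information enters: a tail-class sample with few training examples borrows evidence from its neighbours in feature space.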

[CV-96] Fairness in Multi-modal Medical Diagnosis with Demonstration Selection

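[Quick read]: This paper addresses demographic fairness in medical image reasoning with multimodal large language models (MLLMs), i.e., performance gaps across gender, race, and ethnicity groups. Existing debiasing methods usually rely on large labeled datasets or fine-tuning, which is impractical for foundation-scale models. The key to the solution is Fairness-Aware Demonstration Selection (FADS), an in-context learning method that uses clustering-based sampling to build demonstration sets that are demographically balanced and semantically relevant, reducing bias without parameter tuning while maintaining strong accuracy.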

Link: https://arxiv.org/abs/2511.15986
Authors: Dawei Li, Zijian Gu, Peng Wang, Chuhan Song, Zhen Tan, Mohan Zhang, Tianlong Chen, Yu Tian, Song Wang
Affiliations: Arizona State University; University of Rochester; University of Virginia; UCL; University of North Carolina at Chapel Hill; University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning

Abstract:Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
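
A rough sketch of demographically balanced demonstration selection is given below. It is a simplification and not FADS itself: the paper's clustering-based sampling is replaced here by a per-group cosine-similarity ranking followed by a round-robin draw, and the pool format (`id`, `emb`, `group`) is a hypothetical one chosen for the example.

```python
from collections import defaultdict
from itertools import zip_longest

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_balanced_demos(query_emb, pool, k):
    """Pick k demonstrations that are relevant to the query and balanced by group.

    pool: list of dicts with keys "id", "emb", and "group".
    Candidates are ranked by similarity inside each demographic group and then
    drawn round-robin across groups, so no single group dominates the prompt.
    """
    by_group = defaultdict(list)
    for item in pool:
        by_group[item["group"]].append(item)
    ranked = [sorted(items, key=lambda it: cosine(query_emb, it["emb"]), reverse=True)
              for _, items in sorted(by_group.items())]
    picked = []
    for tier in zip_longest(*ranked):   # tier holds the next-best item of every group
        for item in tier:
            if item is not None and len(picked) < k:
                picked.append(item["id"])
    return picked
```

A purely similarity-based selector would return the two most similar items regardless of group; the round-robin draw trades a little relevance for balance.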

[CV-97] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

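[Quick read]: This paper addresses the difficulty of capturing fine-grained category distinctions and category-specific attribute diversity in visual semantic understanding for large-scale e-commerce, where existing methods based on global similarity fall short as a unified framework for object detection, category prediction, and attribute recognition. The key to the solution is a detection-guided generative framework that extracts refined ROI-level features and uses a generator based on BART (Bidirectional and Auto-Regressive Transformers) to predict hierarchical category and attribute tokens in a coarse-to-fine sequence, with support for property-conditioned attribute recognition, giving finer-grained recognition and more coherent unified inference.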

Link: https://arxiv.org/abs/2511.15984
Authors: Xinyu Nan, Lingtao Mao, Huangyu Dai, Zexin Zheng, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li
Affiliations: Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.

[CV-98] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

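[Quick read]: This paper addresses the performance degradation caused by destructive task interference in multi-task learning (MTL), specifically for tumor segmentation in breast ultrasound images. The key to the solution is a novel consistency regularization approach built on differentiable morphological features inspired by BI-RADS (Breast Imaging-Reporting and Data System), which mitigates interference between the segmentation and classification tasks and improves generalization on multiple external datasets.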

Link: https://arxiv.org/abs/2511.15968
Authors: Jingru Zhang, Saed Moradi, Ashirbani Saha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and limit generalization. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p < 0.001) improvements in generalization for the segmentation task with the proposed multi-task approach over the baseline on UDIAT, BUSI, and BUS-UCLM (Dice coefficient = 0.81 vs. 0.59, 0.66 vs. 0.56, and 0.69 vs. 0.49, respectively). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.
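
The segmentation results in this abstract are reported as Dice coefficients. For readers unfamiliar with the metric, the standard definition for binary masks is Dice = 2|P ∩ T| / (|P| + |T|); the sketch below computes it on nested 0/1 lists. Returning 1.0 when both masks are empty is a convention chosen here, not something stated in the source.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as nested 0/1 lists."""
    p = [v for row in pred for v in row]      # flatten the predicted mask
    t = [v for row in target for v in row]    # flatten the reference mask
    inter = sum(a * b for a, b in zip(p, t))  # pixels set in both masks
    denom = sum(p) + sum(t)
    return 1.0 if denom == 0 else 2.0 * inter / denom
```

For example, a prediction covering two pixels of which one overlaps a one-pixel target gives 2·1 / (2 + 1) ≈ 0.67.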

[CV-99] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer AAAI2026

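[Quick read]: This paper addresses the problem that existing open-vocabulary semantic segmentation methods based on CLIP (Contrastive Language-Image Pretraining) tend to overfit when fine-tuned on a limited set of seen categories and degrade the vision-language alignment learned during pretraining. The key to the solution is InfoCLIP, which takes an information-theoretic view and stabilizes cross-modal alignment with two new objectives based on mutual information: the first compresses the pixel-text alignment from pretrained CLIP to reduce noise from its coarse-grained local semantic representations learned under image-text supervision; the second maximizes the mutual information between the alignment knowledge of pretrained CLIP and that of the fine-tuned model, transferring compact local semantic relations suited to segmentation.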

Link: https://arxiv.org/abs/2511.15967
Authors: Muyao Yuan, Yuanhong Zhang, Weizhan Zhang, Lan Ma, Yuan Gao, Jiangyong Ying, Yudeng Xin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026

Abstract:Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.

[CV-100] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

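[Quick read]: This paper addresses the lack of user interaction in current Video Scene Graph Generation (VSGG) systems, which are closed feed-forward pipelines that cannot incorporate human guidance, while promptable segmentation models such as SAM2 support precise user interaction but lack semantic or relational reasoning. The key to the solution is Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG): from a single user prompt (a click or bounding box), it segments and tracks the subject across time, autonomously discovers interacting objects, and predicts subject-object-predicate triplets to build a temporally consistent scene graph. Its core components are a Dynamic Interaction Discovery Module, which generates subject-conditioned object prompts, and a Semantic Classification Head, which performs joint entity and predicate reasoning.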

Link: https://arxiv.org/abs/2511.15948
Authors: Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B.S. Manjunath
Affiliations: UC Santa Barbara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts (subject, object, predicate) triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.

[CV-101] Automated Interpretable 2D Video Extraction from 3D Echocardiography

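[Quick read]: This paper addresses the limitations of conventional cardiac ultrasound, which relies on a series of two-dimensional (2D) videos to assess cardiac structures: low efficiency, restricted views, and strong dependence on operator experience. The core challenge is automatically extracting clinically standard 2D views from three-dimensional (3D) cardiac ultrasound data, so that diagnostic accuracy is preserved while gaining the speed and usability of 3D scanning. The key to the solution is combining a deep learning view classifier with post-processing heuristics based on anatomical landmarks, together with expert knowledge from cardiologists, to reconstruct standard 2D echocardiography views from 3D volumes. The method reached 96% accuracy in a blinded evaluation across two centers, and the reconstructed 2D videos were validated with AI models (EchoPrime and PanEcho) for detecting cardiac abnormalities and with a clinical-grade measurement tool (EchoNet-Measurement) for anatomical measurements, indicating that it can support real-world clinical decisions.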

Link: https://arxiv.org/abs/2511.15946
Authors: Milos Vukadinovic, Hirotaka Ieki, Yuki Sahasi, David Ouyang, Bryan He
Affiliations: University of California, Los Angeles; Kaiser Permanente Division of Research; Cedars-Sinai Medical Center; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures

Abstract:Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at this https URL.

[CV-102] Boosting Medical Visual Understanding From Multi-Granular Language Learning

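[Quick read]: This paper addresses the limited representational power of current vision-language pretraining models such as CLIP in complex domains like medical imaging, where support for only single-label, single-granularity alignment makes it hard to align images and text under multi-label, cross-granularity annotation. The key to the solution is Multi-Granular Language Learning (MGLL), a contrastive learning framework that uses structured multi-label supervision, integrates textual descriptions at different granularities, and introduces soft-label supervision with point-wise constraints. It uses a smooth Kullback-Leibler (KL) divergence loss to enforce cross-granularity consistency while staying computationally efficient, and can be integrated into existing vision-language models as a plug-and-play module.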

Link: https://arxiv.org/abs/2511.15943
Authors: Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
Affiliations: University of Washington; Duke University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. 40 pages

Abstract:Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at this https URL.

[CV-103] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

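[Quick read]: This paper addresses the limited performance of Vision Language Models (VLMs) on domain-specific video classification when data is scarce; the core challenge is that sparse domain data cannot bridge the semantic gap between complex spatio-temporal content and abstract classification labels. The key to the solution is a two-stage self-improvement paradigm. In the first stage, the VLM is prompted to generate a detailed textual rationale for each video, forcing it to articulate domain-specific logic, and is then fine-tuned on these self-generated rationales as intermediate supervision to align its representations with the target domain. In the second stage, conventional supervised fine-tuning (SFT) on the task labels builds on the acquired domain reasoning to improve final task performance. The method needs no additional annotations.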

Link: https://arxiv.org/abs/2511.15923
Authors: Meilong Xu, Di Fu, Jiaxing Zhang, Gong Yu, Jiayu Zheng, Xiaoling Hu, Dongdi Zhao, Feiyang Li, Chao Chen, Yong Cao
Affiliations: Stony Brook University; ByteDance Inc.; Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures

Abstract:Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical rationale gap, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model’s pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.

[CV-104] Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

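[Quick read]: This paper addresses accurate and efficient 6D pose estimation of novel storage boxes under clutter and occlusion, a key bottleneck for robotic manipulation in warehouse automation, bin picking, logistics, and e-commerce fulfillment. Existing approaches have three limitations: model-based methods depend on high-resolution CAD models and generalize poorly; model-free methods are flexible but not robust enough; and category-level methods are often overly general and ignore environment and object priors, which limits their practicality. The authors propose Box6D, which infers box dimensions from a single RGB-D image via a fast binary search and estimates pose with a category-level CAD template rather than instance-specific models; a depth-based plausibility filter with an early-stopping strategy rejects implausible hypotheses and lowers computational cost. Experiments show that Box6D delivers competitive or superior 6D pose precision while reducing inference time by about 76%.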

Link: https://arxiv.org/abs/2511.15884
Authors: Yintao Ma, Sajjad Pakdamansavoji, Amir Rasouli, Tongtong Cao
Affiliations: Huawei Technologies Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain: model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; model-free methods, which rely on a few reference images or videos, are more flexible but often fail under challenging conditions; and category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.
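
Two ingredients named in this abstract, a binary search over box dimensions and a depth-based plausibility filter with early stopping, can be sketched as follows. Both functions are hypothetical illustrations: the monotone predicate `too_large`, the thresholds, and the rule "rendered surface in front of the observed surface counts as a violation" are assumptions, not details taken from the paper.

```python
def search_dimension(too_large, lo, hi, tol=1e-3):
    """Binary search for the largest dimension in [lo, hi] that still fits.

    too_large(d) must be monotone: False for small d and True once d exceeds
    the extent supported by the observation.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if too_large(mid):
            hi = mid      # the template overshoots; shrink the upper bound
        else:
            lo = mid      # the template still fits; raise the lower bound
    return (lo + hi) / 2.0

def plausible(rendered_depth, observed_depth, max_violation=0.1, margin=0.01):
    """Depth-based plausibility check with early stopping.

    A hypothesis is rejected as soon as too many rendered pixels lie in front
    of the observed surface, which a solid box at that pose could not produce.
    """
    budget = max_violation * len(rendered_depth)
    violations = 0
    for r, o in zip(rendered_depth, observed_depth):
        if r < o - margin:
            violations += 1
            if violations > budget:
                return False   # stop early; the remaining pixels are not scanned
    return True
```

Binary search needs only about log2((hi - lo) / tol) predicate evaluations, and early rejection avoids fully scoring hypotheses that are clearly wrong; both are consistent with the inference-time savings the abstract emphasizes.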

[CV-105] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

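[Quick read]: This paper addresses the performance bottleneck that the lack of high-quality annotated data creates for deep learning on historical maps, especially the scarcity of samples for specific homogeneous cartographic domains. The core of the solution is transferring the cartographic style of an original historical map corpus onto vector data to generate a large number of realistic and diverse synthetic historical maps. The key contribution is an automatic deep generative approach and an alternative manual stochastic degradation technique that emulate the data-dependent uncertainty commonly seen in scanned maps; the resulting training sets are used for land-cover semantic segmentation to improve domain adaptation.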

Link: https://arxiv.org/abs/2511.15875
Authors: Lukas Arzoumanidis, Julius Knechtel, Jan-Henrik Haunert, Youness Dehbi
Affiliations: Hamburg University of Applied Sciences; Institute for Geoinformatics, University of Bonn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.

[CV-106] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

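[Quick read]: This paper addresses the accuracy and robustness of 6D pose estimation for unseen objects under occlusion, where errors in the early detection and segmentation stages of conventional multi-stage pipelines propagate to later steps and significantly degrade performance. The key to the solution is four extensions: (i) a dynamic non-uniform dense sampling strategy that focuses on visible regions to reduce occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates to avoid single-path failures; (iii) iterative refinement that progressively improves pose accuracy; and (iv) occlusion-focused training augmentations that strengthen generalization. The paper also introduces a visibility-weighted evaluation metric that reduces bias in existing protocols. Experiments show accuracy gains of more than 5% on ICBIN and more than 2% on BOP, with inference approximately 3 times faster.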

Link: https://arxiv.org/abs/2511.15874
Authors: Sajjad Pakdamansavoji, Yintao Ma, Amir Rasouli, Tongtong Cao
Affiliations: Huawei Technologies Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in the existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.
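
Extensions (ii) and (iii) from this abstract, retaining several confidence-ranked candidates and refining them iteratively, can be illustrated with a toy sketch. Poses are reduced to plain values, and `refine` and `score` are hypothetical callables standing in for a pose refiner and a render-and-compare score; none of this reflects the authors' actual interfaces.

```python
import heapq

def keep_top_hypotheses(candidates, k):
    """Retain the k highest-confidence pose candidates, best first.

    candidates: list of (confidence, pose) tuples.
    """
    return heapq.nlargest(k, candidates, key=lambda c: c[0])

def refine_all(hypotheses, refine, score, n_iters=3):
    """Refine every retained hypothesis for n_iters steps and return the best.

    Returns a (score, pose) tuple for the highest-scoring refined pose.
    """
    best = None
    for _, pose in hypotheses:
        for _ in range(n_iters):
            pose = refine(pose)      # one refinement step
        s = score(pose)
        if best is None or s > best[0]:
            best = (s, pose)
    return best
```

The point of keeping more than one candidate is that the initially most confident pose may sit in a wrong basin that refinement cannot escape, while a lower-ranked candidate converges to the right answer.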

[CV-107] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

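[Quick read]: This paper addresses the difficulty of deploying the Segment Anything Model 3 (SAM3) on device, because its unified architecture is complex and computationally expensive. The core solution is EfficientSAM3, built on a staged Progressive Hierarchical Distillation (PHD) strategy: first, encoder distillation aligns image features via prompt-in-the-loop training on SA-1B; second, temporal memory distillation replaces the dense memory with a Perceiver-based module that efficiently compresses and retrieves spatiotemporal features; finally, end-to-end fine-tuning on Promptable Concept Segmentation (PCS) data preserves concept-level performance. The method achieves strong performance-efficiency trade-offs on several video object segmentation (VOS) datasets, enabling lightweight student models to perform concept segmentation and tracking on device.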

Link: https://arxiv.org/abs/2511.15833
Authors: Chengxi Zeng, Yuxuan Jiang, Aaron Zhang
Affiliations: University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: GitHub: this https URL

Abstract:The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets and compare with a variety of related work, achieving strong performance-efficiency trade-offs.

[CV-108] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment AAAI-2026

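[Quick read]: This paper addresses the difficulty of building a universal framework for image-based virtual try-on (VTON) that flexibly handles diverse and complex tasks, focusing on two key challenges: the semantic gap between text instructions and reference images, and data scarcity in complex scenarios. The key to the solution is UniFit, whose core component is an MLLM-Guided Semantic Alignment Module (MGSA) that fuses multimodal inputs using a multimodal large language model (MLLM) and a set of learnable queries, and imposes a semantic alignment loss to explicitly model cross-modal semantic relationships, narrowing the semantic gap. A two-stage progressive training strategy with a self-synthesis data pipeline improves learning of complex tasks from limited data. UniFit supports a range of VTON tasks, including multi-garment and model-to-model try-on, and achieves state-of-the-art performance.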

Link: https://arxiv.org/abs/2511.15831
Authors: Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong, Cong Wang, Fengcai Qiao, Zhichao Lian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI-2026

Abstract:Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at this https URL.

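[CV-109] How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI

Quick read: This paper tries to untangle perception bottlenecks from execution errors that modality differences cause when generative AI handles structured tasks, that is, how to separate and reduce a model's errors in the perception and reasoning stages. The key is a two-stage reasoning pipeline evaluated across nine text and image modalities, with a weighted set-disagreement metric that isolates perception from reasoning. The study finds that structured text captures precise coordinates of sparse features, that images preserve 2D shapes but are sensitive to resolution, and that combining the two improves execution (about 8 perception points and about 0.20 higher median similarity). Aligning representations with transformer inductive biases and cross-validating between modalities thus yields more accurate instructions and more reliable execution without changing the underlying model.

Link: https://arxiv.org/abs/2511.15717
Authors: Bo Wen,Chen Wang,Erhan Bilal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: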

Abstract:ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks – text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing – thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline, finding that structured text yields precise coordinates on sparse features, images capture 2D shapes yet are resolution-sensitive, and combining them improves execution (about 8 perception points; about 0.20 median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.

[CV-110] PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting CVPR2025

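Quick read: This paper targets the storage and memory overhead of 3D Gaussian Splatting (3D-GS) in complex scenes, where millions of Gaussians limit deployment on resource-constrained devices. Existing compression methods prune Gaussians using heuristics, which degrades visual fidelity and loses foreground detail at high compression ratios. The paper proposes a principled sensitivity pruning score, computed as a second-order approximation of the reconstruction error on the training views with respect to each Gaussian's spatial parameters, so that rendering quality holds up at extreme compression ratios. It also designs a multi-round prune-refine pipeline that applies to any pretrained 3D-GS model without changing its training pipeline. Experiments show that pruning 90% of Gaussians increases average rendering speed by 3.56×, and that the method outperforms existing techniques on Mip-NeRF 360, Tanks & Temples, and Deep Blending, retaining more foreground information and achieving higher image quality metrics.

Link: https://arxiv.org/abs/2406.10219
Authors: Alex Hanson,Allen Tu,Vasu Singla,Mayuka Jayawardhana,Matthias Zwicker,Tom Goldstein
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: CVPR 2025, Project Page: this https URL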

Abstract: Recent advances in novel view synthesis have enabled real-time rendering speeds with high reconstruction accuracy. 3D Gaussian Splatting (3D-GS), a foundational point-based parametric 3D scene representation, models scenes as large sets of 3D Gaussians. However, complex scenes can consist of millions of Gaussians, resulting in high storage and memory requirements that limit the viability of 3D-GS on devices with limited resources. Current techniques for compressing these pretrained models by pruning Gaussians rely on combining heuristics to determine which Gaussians to remove. At high compression ratios, these pruned scenes suffer from heavy degradation of visual fidelity and loss of foreground details. In this paper, we propose a principled sensitivity pruning score that preserves visual fidelity and foreground details at significantly higher compression ratios than existing approaches. It is computed as a second-order approximation of the reconstruction error on the training views with respect to the spatial parameters of each Gaussian. Additionally, we propose a multi-round prune-refine pipeline that can be applied to any pretrained 3D-GS model without changing its training pipeline. After pruning 90% of Gaussians, a substantially higher percentage than previous methods, our PUP 3D-GS pipeline increases average rendering speed by 3.56× while retaining more salient foreground information and achieving higher image quality metrics than existing techniques on scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets.
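The idea of a second-order sensitivity score can be illustrated with a small sketch. The abstract only says the score is a second-order approximation of the reconstruction error with respect to each Gaussian's spatial parameters; the specifics below are assumptions made for illustration: approximating the per-Gaussian Hessian by a sum of gradient outer products over views, taking its log-determinant as the score, and the hypothetical names `sensitivity_scores` and `prune_mask`.

```python
import numpy as np

def sensitivity_scores(grads):
    """grads: array of shape (num_views, num_gaussians, d), the gradient of each
    view's reconstruction loss w.r.t. the d spatial parameters of each Gaussian.
    Returns one score per Gaussian (higher means more sensitive)."""
    # Assumed approximation: H_i ~ sum over views of g g^T (outer products).
    hessians = np.einsum("vgi,vgj->gij", grads, grads)
    d = grads.shape[-1]
    # Small ridge term keeps the log-determinant finite for rank-deficient H_i.
    hessians = hessians + 1e-8 * np.eye(d)
    _, logdet = np.linalg.slogdet(hessians)
    return logdet

def prune_mask(scores, prune_fraction):
    """Boolean mask that drops the prune_fraction of Gaussians with the lowest scores."""
    num_prune = int(prune_fraction * len(scores))
    order = np.argsort(scores)          # ascending: least sensitive first
    keep = np.ones(len(scores), dtype=bool)
    keep[order[:num_prune]] = False
    return keep
```

In a multi-round prune-refine loop, a mask like this would be applied, the remaining Gaussians fine-tuned, and the scores recomputed before the next round.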

[CV-111] Weakly Supervised Segmentation and Classification of Alpha-Synuclein Aggregates in Brightfield Midbrain Images

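Quick read: This paper addresses the automatic identification and classification of alpha-synuclein aggregates in tissue sections in Parkinson's disease (PD), in particular the challenge posed by variability in immunohistochemistry staining. The key is an automated image processing pipeline built on weakly supervised segmentation and a ResNet50 classifier that distinguishes the major aggregate morphologies (such as Lewy bodies and neurites) in whole-slide images (WSIs) with a balanced accuracy of 80%. This provides a reliable tool for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates and of their relationships with surrounding cells such as microglia and astrocytes.

Link: https://arxiv.org/abs/2511.16268
Authors: Erwan Dereure,Robin Louiset,Laura Parkkinen,David A Menassa,David Holcman
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: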

Abstract: Parkinson’s disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, forming Lewy bodies and neuritic shapes used in pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with Deep Learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases based on weakly supervised segmentation, robust to immunohistochemical labelling variability, with a ResNet50 classifier. Our approach differentiates between major aggregate morphologies, including Lewy bodies and neurites, with a balanced accuracy of 80%. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.
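For readers unfamiliar with the metric, balanced accuracy is the mean of the per-class recalls, so a rare aggregate class counts as much as a common one. The sketch below uses the standard definition; the integer class labels and the function name are hypothetical and not taken from the paper.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# A classifier that always predicts class 0 on imbalanced data gets 80% plain
# accuracy here, but its balanced accuracy is only 0.5.
print(balanced_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0]))
```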

[CV-112] UniUltra: Interactive Parameter-Efficient SAM2 for Universal Ultrasound Segmentation

Quick read: This paper addresses the marked performance drop of the SAM2 model on ultrasound images. The core challenge is adapting the model to ultrasound through fine-tuning while staying parameter-efficient, and ensuring it can be deployed in resource-constrained clinical environments. The key is the UniUltra framework. First, a novel context-edge hybrid adapter (CH-Adapter) enhances fine-grained perception across diverse ultrasound imaging modalities and enables parameter-efficient fine-tuning. Second, deep-supervised knowledge distillation (DSKD) transfers knowledge from the large image encoder of the fine-tuned SAM2 to a super lightweight encoder, substantially reducing computational requirements without compromising performance. Experiments show that the framework stays competitive while fine-tuning only 8.91% of SAM2's parameters, and that the final compressed model has 94.08% fewer parameters, which makes it practical for clinical use.

Link: https://arxiv.org/abs/2511.15771
Authors: Yue Li,Qing Xu,Yixuan Zhang,Xiangjian He,Qian Zhang,Yuan Yao,Fiseha B. Tesem,Xin Chen,Ruili Wang,Zhen Chen,Chang Wen Chen
Affiliations: University of Nottingham Ningbo China; University of Nottingham; Massey University; School of Mathematical and Computational Sciences; School of Data Science and Artificial Intelligence; Wenzhou University of Technology; Hong Kong Institute of Science & Innovation; Chinese Academy of Sciences; The Hong Kong Polytechnic University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The Segment Anything Model 2 (SAM2) demonstrates remarkable universal segmentation capabilities on natural images. However, its performance on ultrasound images is significantly degraded due to domain disparities. This limitation raises two critical challenges: how to efficiently adapt SAM2 to ultrasound imaging while maintaining parameter efficiency, and how to deploy the adapted model effectively in resource-constrained clinical environments. To address these issues, we propose UniUltra for universal ultrasound segmentation. Specifically, we first introduce a novel context-edge hybrid adapter (CH-Adapter) that enhances fine-grained perception across diverse ultrasound imaging modalities while achieving parameter-efficient fine-tuning. To further improve clinical applicability, we develop a deep-supervised knowledge distillation (DSKD) technique that transfers knowledge from the large image encoder of the fine-tuned SAM2 to a super lightweight encoder, substantially reducing computational requirements without compromising performance. Extensive experiments demonstrate that UniUltra outperforms state-of-the-art methods with superior generalization capabilities. Notably, our framework achieves competitive performance using only 8.91% of SAM2’s parameters during fine-tuning, and the final compressed model reduces the parameter count by 94.08% compared to the original SAM2, making it highly suitable for practical clinical deployment. The source code is available at this https URL.

[CV-113] Maximum Dispersion Maximum Concentration: Enhancing the Quality of MOP Solutions

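Quick read: This paper addresses how to balance the dispersion of solutions in the decision space against convergence in the objective space in multi-objective optimization problems (MOPs), so as to avoid the bias caused by solutions clustering locally in the decision space. The solution has two parts. First, a Region of Interest (ROI) is defined from a cone that represents the decision maker's preferences in the objective space. Second, a uniformity measure in the decision space increases the dispersion of solutions, while their concentration within the chosen region of the objective space is raised. Jointly optimizing dispersion in the decision space and convergence in the objective space searches for Pareto-optimal solutions more effectively and improves solution quality and diversity, mitigating the bias that local clustering causes in conventional methods.

Link: https://arxiv.org/abs/2506.22568
Authors: Gladston Moreira,Ivan Meneghini,Elizabeth Wanner
Affiliations: Universidade Federal de Ouro Preto; Federal Institute of Minas Gerais; Aston University
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 11 pages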

Abstract:Multi-objective optimization problems (MOPs) often require a trade-off between conflicting objectives, maximizing diversity and convergence in the objective space. This study presents an approach to improve the quality of MOP solutions by optimizing the dispersion in the decision space and the convergence in a specific region of the objective space. Our approach defines a Region of Interest (ROI) based on a cone representing the decision maker’s preferences in the objective space, while enhancing the dispersion of solutions in the decision space using a uniformity measure. Combining solution concentration in the objective space with dispersion in the decision space intensifies the search for Pareto-optimal solutions while increasing solution diversity. When combined, these characteristics improve the quality of solutions and avoid the bias caused by clustering solutions in a specific region of the decision space. Preliminary experiments suggest that this method enhances multi-objective optimization by generating solutions that effectively balance dispersion and concentration, thereby mitigating bias in the decision space.
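The two ingredients named in the abstract, a cone-shaped ROI in the objective space and a dispersion measure in the decision space, can be sketched as follows. The abstract does not give the exact definitions, so both are assumptions: the cone is parameterized by a hypothetical apex, axis, and half-angle, and the minimum pairwise distance stands in for the paper's uniformity measure.

```python
import numpy as np

def in_cone(f, apex, axis, half_angle):
    """True if objective vector f lies inside the cone with the given apex,
    axis direction, and half-angle (radians)."""
    v = np.asarray(f, dtype=float) - np.asarray(apex, dtype=float)
    axis = np.asarray(axis, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return True  # the apex itself belongs to the cone
    cos_angle = v @ axis / (norm * np.linalg.norm(axis))
    return bool(cos_angle >= np.cos(half_angle))

def min_pairwise_distance(X):
    """Smallest distance between any two decision vectors (rows of X);
    larger values indicate a more dispersed set."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    upper = np.triu_indices(len(X), k=1)  # each unordered pair once
    return float(dist[upper].min())
```

A selection step could then prefer candidates that fall inside the cone while increasing the dispersion value of the retained set.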

Artificial Intelligence

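[AI-0] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Quick read: This paper addresses the efficiency bottleneck in reinforcement learning (RL) training of large language models (LLMs) caused by the long-tail distribution of response generation: a few very long responses markedly lengthen training and waste compute. The key is the TLT system, built on two synergistic components: (1) the Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs to stay aligned with the target model at no extra cost; and (2) the Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDA Graphs and selects a suitable speculative decoding (SD) strategy for each input batch, accelerating RL training losslessly. Experiments show that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems while preserving model accuracy, and yields a high-quality, directly deployable draft model as a byproduct.

Link: https://arxiv.org/abs/2511.16665
Authors: Qinghao Hu,Shang Yang,Junxian Guo,Xiaozhe Yao,Yujun Lin,Yuxian Gu,Han Cai,Chuang Gan,Ana Klimovic,Song Han
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: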

Abstract: The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at this https URL.
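Why speculative decoding can be lossless is easiest to see in the greedy case. The sketch below is a generic verification step, not TLT's implementation: the abstract does not describe its SD strategies, and the function name and the list-based token interface are hypothetical. The draft model proposes several tokens, the target model scores all of them in one pass, and tokens are accepted only while they match what the target would have produced on its own.

```python
def verify_draft(draft_tokens, target_greedy):
    """Greedy speculative-decoding verification.

    draft_tokens[i]  : token proposed by the draft model at step i.
    target_greedy[i] : the target model's greedy token at step i, given the
                       prefix plus draft_tokens[:i]; it has one more entry than
                       draft_tokens (the "bonus" position after the last draft).
    Returns the tokens to emit, which equal what greedy decoding with the
    target model alone would have produced for these steps.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        if token != target_greedy[i]:
            # First mismatch: emit the target's own token and discard the rest.
            emitted.append(target_greedy[i])
            return emitted
        emitted.append(token)
    # Every draft token was accepted, so the extra target token comes for free.
    emitted.append(target_greedy[len(draft_tokens)])
    return emitted
```

Each call emits at least one token, so a poorly aligned draft model costs throughput but never changes the output; this is why keeping the drafter aligned with the evolving target model matters for speed rather than correctness.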

[AI-1] Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations

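Quick read: This paper addresses learning multi-fingered robot manipulation policies from videos of humans performing everyday tasks in natural environments, in order to reduce reliance on labor-intensive robot data collection and to overcome both the embodiment gap between humans and robots and the difficulty of extracting relevant contextual and motion cues from in-the-wild videos. The key is the AINA framework, which collects data from any user in any environment with lightweight, portable Aria Gen 2 glasses. Their high-resolution RGB camera, accurate 3D head and hand pose estimates, and wide stereo view for depth estimation enable training 3D point-based multi-fingered hand policies that are robust to background changes and can be deployed directly without any robot data (including online corrections, reinforcement learning, or simulation).

Link: https://arxiv.org/abs/2511.16661
Authors: Irmak Guzey,Haozhi Qi,Julen Urain,Changhao Wang,Jessica Yin,Krishna Bodduluri,Mike Lambeta,Lerrel Pinto,Akshara Rai,Jitendra Malik,Tingfan Wu,Akash Sharma,Homanga Bharadhwaj
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: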

Abstract: Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottlenecked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: this https URL.

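[AI-2] Cognitive Foundations for Reasoning and Their Manifestation in LLMs

Quick read: This paper examines why large language models (LLMs) solve complex problems yet fail on simpler variants, which points to reasoning mechanisms fundamentally different from human cognition. The core issue is that models possess behaviors associated with successful reasoning (such as decomposition and sequential organization) but do not spontaneously invoke meta-cognitive controls and structured reasoning strategies, so they rely on shallow forward chaining rather than the hierarchical nesting and self-monitoring that humans use. The key is a fine-grained cognitive evaluation framework that analyzes 170K model reasoning traces and 54 human think-aloud traces against 28 cognitive elements and identifies systematic structural differences between the two. Building on this, the paper proposes test-time reasoning guidance that automatically scaffolds effective reasoning structures, improving performance on complex problems by up to 60% and moving models from superficially correct outputs toward reliable reasoning grounded in principled cognitive mechanisms.

Link: https://arxiv.org/abs/2511.16660
Authors: Priyanka Kargupta,Shuyue Stella Li,Haocheng Wang,Jinu Lee,Shan Chen,Orevaoghene Ahia,Dean Light,Thomas L. Griffiths,Max Kleiman-Weiner,Jiawei Han,Asli Celikyilmaz,Yulia Tsvetkov
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 40 pages, 4 tables, 6 figures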

Abstract: Large language models solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning computational constraints, meta-cognitive controls, knowledge representations, and transformation operations, then analyze their behavioral manifestations in reasoning traces. We propose a fine-grained cognitive evaluation framework and conduct the first large-scale analysis of 170K traces from 17 models across text, vision, and audio modalities, alongside 54 human think-aloud traces, which we make publicly available. Our analysis reveals systematic structural differences: humans employ hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining, with divergence most pronounced on ill-structured problems. Meta-analysis of 1,598 LLM reasoning papers reveals the research community concentrates on easily quantifiable behaviors (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%, evaluation: 8%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 60% on complex problems. By bridging cognitive science and LLM research, we establish a foundation for developing models that reason through principled cognitive mechanisms rather than brittle spurious reasoning shortcuts or memorization, opening new directions for both improving model capabilities and testing theories of human cognition at scale.

[AI-3] Enhancing Forex Forecasting Accuracy: The Impact of Hybrid Variable Sets in Cognitive Algorithmic Trading Systems

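Quick read: This paper addresses the predictive accuracy and profit stability of algorithmic trading systems for the EUR-USD currency pair in the high-frequency Forex market. The key is a hybrid feature set that combines key macroeconomic variables from the Euro Zone and the United States (such as GDP and unemployment rate) with a comprehensive set of technical variables (indicators, oscillators, Fibonacci levels, and price divergences). Machine learning models are trained and tuned on these features, thereby improving the predictive power and profitability of the trading signals.

Link: https://arxiv.org/abs/2511.16657
Authors: Juan C. King,Jose M. Amigo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: Paper not published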

Abstract:This paper presents the implementation of an advanced artificial intelligence-based algorithmic trading system specifically designed for the EUR-USD pair within the high-frequency environment of the Forex market. The methodological approach centers on integrating a holistic set of input features: key fundamental macroeconomic variables (for example, Gross Domestic Product and Unemployment Rate) collected from both the Euro Zone and the United States, alongside a comprehensive suite of technical variables (including indicators, oscillators, Fibonacci levels, and price divergences). The performance of the resulting algorithm is evaluated using standard machine learning metrics to quantify predictive accuracy and backtesting simulations across historical data to assess trading profitability and risk. The study concludes with a comparative analysis to determine which class of input features, fundamental or technical, provides greater and more reliable predictive capacity for generating profitable trading signals.

[AI-4] Evolution Strategies at the Hyperscale

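Quick read: This paper addresses the scalability problem of evolution strategies (ES) for large neural networks, where generating matrix perturbations and running batched forward passes is too expensive. For modern networks with billions of parameters, conventional ES must store and multiply full-rank perturbation matrices $E \in \mathbb{R}^{m \times n}$, so memory use and compute both scale as $\mathcal{O}(mn)$, which makes large population sizes impractical. The key is Evolution Guided General Optimization via Low-rank Learning (EGGROLL), which draws two small random matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m,n)$ and uses the low-rank perturbation $AB^\top$ in place of $E$. This reduces auxiliary storage per layer from $mn$ to $r(m+n)$ and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$. A theoretical analysis shows that the low-rank update converges to the full-rank update at a rate of $\mathcal{O}(1/r)$, so efficiency improves markedly while optimization quality is preserved.

Link: https://arxiv.org/abs/2511.16652
Authors: Bidipta Sarkar,Mattie Fellows,Juan Agustin Duque,Alistair Letcher,Antonio León Villares,Anya Sims,Dylan Cope,Jarek Liesen,Lukas Seier,Theo Wolf,Uljad Berdica,Alexander David Goldie,Aaron Courville,Karin Sevegnani,Shimon Whiteson,Jakob Nicolaus Foerster
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 48 pages, 12 figures, Website at this https URL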

Abstract: We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Naïve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that is used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
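The saving from the low-rank perturbation comes from never materialising the $m \times n$ matrix $AB^\top$. A minimal NumPy sketch of one perturbed linear layer follows; the function name, the scale parameter `sigma`, and the single-layer setup are assumptions for illustration and not the authors' implementation.

```python
import numpy as np

def lowrank_perturbed_forward(x, W, A, B, sigma):
    """Compute x @ (W + sigma * A @ B.T).T without forming A @ B.T.

    x: (batch, n) inputs, W: (m, n) weights,
    A: (m, r) and B: (n, r) random factors of the perturbation.
    """
    base = x @ W.T                  # unperturbed output, shape (batch, m)
    # (x @ B) is (batch, r); multiplying by A.T gives (batch, m).
    # This perturbation term costs O(r(m + n)) per input row.
    correction = (x @ B) @ A.T
    return base + sigma * correction
```

Only `A` and `B` need to be stored per population member, which is $r(m+n)$ numbers instead of the $mn$ needed for a full-rank $E$.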

[AI-5] Faster Certified Symmetry Breaking Using Orders With Auxiliary Variables AAAI2026

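Quick read: This paper addresses the difficulty of verifying that symmetry breaking in combinatorial solving is correct, that is, how to supply formally checkable mathematical proofs for symmetry reasoning without sacrificing efficiency. The prevailing approach has solvers emit proofs that a formally verified checker can validate, but efficiently encoding the order relations behind symmetry constraints has remained a long-standing open challenge. The key is a new encoding strategy: instead of representing lexicographic orders with big integers, auxiliary variables encode the order, which makes proof generation and checking far more efficient. Experiments show orders-of-magnitude speed-ups in both theory and practice, notably for SAT symmetry breaking with the satsuma symmetry breaker and the VeriPB proof checking toolchain.

Link: https://arxiv.org/abs/2511.16637
Authors: Markus Anders,Bart Bogaerts,Benjamin Bogø,Arthur Gontier,Wietze Koops,Ciaran McCreesh,Magnus O. Myreen,Jakob Nordström,Andy Oertel,Adrian Rebola-Pardo,Yong Kiam Tan
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 26 pages. Extended version (with appendix) of the paper to appear in AAAI 2026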

Abstract:Symmetry breaking is a crucial technique in modern combinatorial solving, but it is difficult to be sure it is implemented correctly. The most successful approach to deal with bugs is to make solvers certifying, so that they output not just a solution, but also a mathematical proof of correctness in a standard format, which can then be checked by a formally verified checker. This requires justifying symmetry reasoning within the proof, but developing efficient methods for this has remained a long-standing open challenge. A fully general approach was recently proposed by Bogaerts et al. (2023), but it relies on encoding lexicographic orders with big integers, which quickly becomes infeasible for large symmetries. In this work, we develop a method for instead encoding orders with auxiliary variables. We show that this leads to orders-of-magnitude speed-ups in both theory and practice by running experiments on proof logging and checking for SAT symmetry breaking using the state-of-the-art satsuma symmetry breaker and the VeriPB proof checking toolchain.

[AI-6] Stabilizing Policy Gradient Methods via Reward Profiling

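Quick read: This paper addresses the unstable performance, unreliable reward improvement, and slow convergence of policy gradient methods in reinforcement learning, which stem from high variance in gradient estimates. The key is a universal reward profiling framework that integrates with any policy gradient algorithm and updates the policy selectively, based on high-confidence performance estimates, to obtain stable and monotonic improvement. Theory shows that the method does not slow the convergence of the baseline policy gradient method, and experiments on several continuous-control benchmarks (Box2D and MuJoCo/PyBullet) show up to 1.5x faster convergence and up to a 1.75x reduction in return variance.

Link: https://arxiv.org/abs/2511.16629
Authors: Shihab Ahmed,El Houcine Bergou,Aritra Dutta,Yue Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: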

Abstract: Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performance can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns and up to a 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
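One way to picture "selectively updating the policy based on high-confidence performance estimates" is an acceptance test on evaluation rollouts. The rule below is a hypothetical illustration, since the abstract does not state the paper's actual criterion: a candidate update is kept only if a lower confidence bound on its mean return (normal approximation, with `z` as an assumed confidence parameter) is at least the current policy's estimated mean return.

```python
import numpy as np

def accept_update(current_returns, candidate_returns, z=1.645):
    """Keep the candidate policy only if its mean return is confidently
    no worse than the current policy's estimated mean return."""
    cur = np.asarray(current_returns, dtype=float)
    cand = np.asarray(candidate_returns, dtype=float)
    # Standard error of the candidate's mean return over its rollouts.
    stderr = cand.std(ddof=1) / np.sqrt(len(cand))
    lower_bound = cand.mean() - z * stderr
    return bool(lower_bound >= cur.mean())
```

A noisy candidate whose average looks higher by chance is rejected under such a rule, which is the intuition behind trading a few skipped updates for more monotonic improvement.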

[AI-7] MedBayes-Lite: Bayesian Uncertainty Quantification for Safe Clinical Decision Support

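Quick read: This paper addresses the overconfidence of transformer-based clinical language models in medical decision support, in particular the lack of reliable uncertainty quantification in ambiguous medical cases, which undermines trustworthiness and safety. The key is the MedBayes-Lite framework, which embeds uncertainty quantification directly into existing transformer pipelines without retraining or architectural changes: (1) Bayesian embedding calibration with Monte Carlo dropout to capture epistemic uncertainty; (2) uncertainty-weighted attention that marginalizes over token reliability; and (3) confidence-guided decision shaping inspired by clinical risk minimization. With parameter overhead under 3%, the method improves calibration and interpretability, reducing overconfidence by 32 to 48% across biomedical QA and clinical prediction benchmarks and preventing up to 41% of diagnostic errors in simulated clinical settings.

Link: https://arxiv.org/abs/2511.16625
Authors: Elias Hossain,Md Mehedi Hasan Nipu,Maleeha Sheikh,Rajib Rana,Subash Neupane,Niloofar Yousefi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: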

Abstract:We propose MedBayes-Lite, a lightweight Bayesian enhancement for transformer-based clinical language models designed to produce reliable, uncertainty-aware predictions. Although transformers show strong potential for clinical decision support, they remain prone to overconfidence, especially in ambiguous medical cases where calibrated uncertainty is critical. MedBayes-Lite embeds uncertainty quantification directly into existing transformer pipelines without any retraining or architectural rewiring, adding no new trainable layers and keeping parameter overhead under 3 percent. The framework integrates three components: (i) Bayesian Embedding Calibration using Monte Carlo dropout for epistemic uncertainty, (ii) Uncertainty-Weighted Attention that marginalizes over token reliability, and (iii) Confidence-Guided Decision Shaping inspired by clinical risk minimization. Across biomedical QA and clinical prediction benchmarks (MedQA, PubMedQA, MIMIC-III), MedBayes-Lite consistently improves calibration and trustworthiness, reducing overconfidence by 32 to 48 percent. In simulated clinical settings, it can prevent up to 41 percent of diagnostic errors by flagging uncertain predictions for human review. These results demonstrate its effectiveness in enabling reliable uncertainty propagation and improving interpretability in medical AI systems.
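Monte Carlo dropout, the technique named in component (i), keeps dropout active at inference and treats the spread across repeated stochastic passes as a proxy for epistemic uncertainty. The sketch below shows the generic technique on a toy two-layer classifier; the network, the parameter names, and the choice of predictive entropy and summed variance as uncertainty measures are assumptions, not MedBayes-Lite's embedding calibration.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, drop_p=0.1, num_samples=50, seed=0):
    """Run num_samples stochastic forward passes with dropout left on.

    x: (batch, d_in), W1: (d_in, hidden), W2: (hidden, classes).
    Returns (mean class probabilities, predictive entropy, summed variance).
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(num_samples):
        h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
        mask = rng.random(h.shape) >= drop_p        # fresh dropout mask per pass
        h = h * mask / (1.0 - drop_p)               # inverted-dropout scaling
        logits = h @ W2
        logits = logits - logits.max(axis=-1, keepdims=True)
        p = np.exp(logits)
        samples.append(p / p.sum(axis=-1, keepdims=True))
    samples = np.stack(samples)                     # (samples, batch, classes)
    mean = samples.mean(axis=0)
    entropy = -(mean * np.log(np.clip(mean, 1e-12, 1.0))).sum(axis=-1)
    variance = samples.var(axis=0).sum(axis=-1)
    return mean, entropy, variance
```

Predictions whose entropy or variance exceeds a chosen threshold could then be flagged for human review, in the spirit of the abstract's simulated clinical setting.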

[AI-8] Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

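Quick read: This paper addresses two core challenges for general embodied intelligence systems: the embodied data bottleneck, where real-world data is scarce and expensive, and the resource inefficiency of existing algorithms. The key is Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement), automatically identifying model weaknesses and allocating resources to them so as to maximize learning efficiency from sparse, finite data. In theory the method can be formalized as a unified preference-learning framework; empirically, the Pelican-VL 1.0 model improves by 20.3% over its base model and surpasses open-source models at the 100B-parameter scale by 10.6%.

Link: https://arxiv.org/abs/2511.16602
Authors: Yi Zhang,Che Liu,Xiancong Ren,Hanchu Ni,Yingji Zhang,Shuai Zhang,Zeyuan Ding,Jiayu Hu,Haozhe Shan,Junbo Qi,Yan Bai,Dengjie Li,Jiachen Luo,Yidong Wang,Yong Dai,Zenglin Xu,Bin Shen,Qifan Wang,Jian Tang,Xiaozhu Ju
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: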

Abstract: Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.

[AI-9] You Only Forward Once: An Efficient Compositional Judging Paradigm

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在作为评判者(judge)时面临的核心矛盾:一方面,将MLLMs调整为输出单一评分会违背其生成式特性并限制对细粒度要求的理解;另一方面,通过自回归方式逐条生成评判分析虽然可解释性强,但在高吞吐场景下效率极低。解决方案的关键在于提出YOFO——一种基于模板条件化的单次前向传播判断方法,其利用结构化要求模板,在一次推理中通过读取与各要求对应的最终token logits,同时完成所有要求的二分类决策(yes/no),从而实现数量级的速度提升,同时保持评判过程的可解释性,并支持依赖感知的分析(dependency-aware analysis)和事后思维链(post-hoc Chain-of-Thought, CoT)增强。

Link: https://arxiv.org/abs/2511.16600
Authors: Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Yunyun Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis, where subsequent judgments are conditioned on previous ones, and further benefits from post-hoc CoT.
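
The logit-reading step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `judge_requirements`, the toy logits, and the `yes_id` / `no_id` token ids are all hypothetical, and a real system would take the logits from one forward pass of an autoregressive model.

```python
import math

def judge_requirements(logits, requirement_positions, yes_id, no_id):
    """Return (decision, p_yes) for each requirement from one set of logits.

    logits: list of per-position rows, each a list of vocabulary logits.
    requirement_positions: index of the final token of each requirement.
    yes_id / no_id: hypothetical vocabulary ids of the "yes" and "no" tokens.
    """
    results = []
    for pos in requirement_positions:
        row = logits[pos]
        # Two-way softmax restricted to the yes/no logits.
        p_yes = 1.0 / (1.0 + math.exp(row[no_id] - row[yes_id]))
        results.append((p_yes > 0.5, p_yes))
    return results

# Toy logits standing in for a real model output (6 positions, 4 tokens).
logits = [[0.0] * 4 for _ in range(6)]
logits[2][1] = 3.0   # requirement ending at position 2 favours "yes" (id 1)
logits[5][0] = 2.0   # requirement ending at position 5 favours "no" (id 0)
decisions = judge_requirements(logits, [2, 5], yes_id=1, no_id=0)
```

Because every requirement is read from the same set of logits, the cost is one forward pass regardless of how many requirements the template contains.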

[AI-10] Formal Abductive Latent Explanations for Prototype-Based Networks AAAI-26

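[Quick read]: This paper addresses misleading explanations in case-based reasoning (CBR) models: although these models are described as "interpretable by design", different input instances can receive the same explanation yet lead to different predictions, a significant risk in safety-critical settings. The key to the solution is a formal framework called Abductive Latent Explanations (ALEs), which expresses sufficient conditions on the intermediate (latent) representation that imply the prediction. It thereby combines the inherent interpretability of CBR models with the logical guarantees of formal eXplainable AI (FXAI). The authors also design a solver-free, scalable algorithm for generating ALEs and demonstrate its feasibility on several image classification tasks.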

Link: https://arxiv.org/abs/2511.16588
Authors: Jules Soria, Zakaria Chihani, Julien Girard-Satabin, Alban Grastien, Romain Xu-Darme, Daniela Cancila
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Accepted at AAAI-26

Abstract: Case-based reasoning networks are machine-learning models that make predictions based on similarity between the input and prototypical parts of training samples, called prototypes. Such models are able to explain each decision by pointing to the prototypes that contributed the most to the final outcome. As the explanation is a core part of the prediction, they are often qualified as "interpretable by design". While promising, we show that such explanations are sometimes misleading, which hampers their usefulness in safety-critical contexts. In particular, several instances may lead to different predictions and yet have the same explanation. Drawing inspiration from the field of formal eXplainable AI (FXAI), we propose Abductive Latent Explanations (ALEs), a formalism to express sufficient conditions on the intermediate (latent) representation of the instance that imply the prediction. Our approach combines the inherent interpretability of case-based reasoning models and the guarantees provided by formal XAI. We propose a solver-free and scalable algorithm for generating ALEs based on three distinct paradigms, compare them, and present the feasibility of our approach on diverse datasets for both standard and fine-grained image classification. The associated code can be found at this https URL

[AI-11] Consciousness in Artificial Intelligence? A Framework for Classifying Objections and Constraints

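[Quick read]: This paper addresses the lack of a clear classification of the contested challenges to the possibility of consciousness in digital AI systems. The key to the solution is a taxonomical framework that, drawing on Marr's levels, sorts challenges by level of granularity and distinguishes their force in three degrees (degree 1-3): whether a challenge only questions computational functionalism (leaving digital consciousness possible), raises a practical obstacle (suggesting improbability without claiming impossibility), or asserts that digital consciousness is strictly impossible. The framework lets researchers locate and compare the nature and claims of different challenges, giving the debate a structured tool.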

Link: https://arxiv.org/abs/2511.16582
Authors: Andres Campero, Derek Shiller, Jaan Aru, Jonathan Simon
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures

Abstract:We develop a taxonomical framework for classifying challenges to the possibility of consciousness in digital artificial intelligence systems. This framework allows us to identify the level of granularity at which a given challenge is intended (the levels we propose correspond to Marr’s levels) and to disambiguate its degree of force: is it a challenge to computational functionalism that leaves the possibility of digital consciousness open (degree 1), a practical challenge to digital consciousness that suggests improbability without claiming impossibility (degree 2), or an argument claiming that digital consciousness is strictly impossible (degree 3)? We apply this framework to 14 prominent examples from the scientific and philosophical literature. Our aim is not to take a side in the debate, but to provide structure and a tool for disambiguating between challenges to computational functionalism and challenges to digital consciousness, as well as between different ways of parsing such challenges.

[AI-12] Synthesis of Safety Specifications for Probabilistic Systems

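[Quick read]: This paper addresses how to synthesize controllers for agents in safety-critical environments that satisfy more general temporal safety specifications. Existing methods usually support only probabilistic-avoidance constraints and cannot handle more expressive temporal-logic specifications such as Probabilistic Computation Tree Logic (PCTL). The key to the solution is twofold. First, a theoretical framework reduces satisfaction of a global PCTL specification to local constraints and defines CPCTL, a fragment of safe-PCTL, showing its relevance to the synthesis problem. Second, building on this framework, a new Value Iteration-based algorithm solves controller synthesis for these broader temporal properties, with proofs of soundness and completeness.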

Link: https://arxiv.org/abs/2511.16579
Authors: Gaspard Ohlmann, Edwin Hamel-De le Court, Francesco Belardinelli
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 23 pages

Abstract: Ensuring that agents satisfy safety specifications can be crucial in safety-critical environments. While methods exist for controller synthesis with safe temporal specifications, most existing methods restrict safe temporal specifications to probabilistic-avoidance constraints. Formal methods typically offer more expressive ways to express safety in probabilistic systems, such as Probabilistic Computation Tree Logic (PCTL) formulas. Thus, in this paper, we develop a new approach that supports more general temporal properties expressed in PCTL. Our contribution is twofold. First, we develop a theoretical framework for the Synthesis of safe-PCTL specifications. We show how to reduce global specification satisfaction to local constraints, and define CPCTL, a fragment of safe-PCTL. We demonstrate how the expressiveness of CPCTL makes it a relevant fragment for the Synthesis Problem. Second, we leverage these results and propose a new Value Iteration-based algorithm to solve the synthesis problem for these more general temporal properties, and we prove the soundness and completeness of our method.
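
For orientation, the simpler probabilistic-avoidance case that the abstract says most existing methods are restricted to can be written as a short value iteration. This sketch is not the paper's CPCTL algorithm: it only computes, for a toy Markov decision process, the maximum probability of avoiding unsafe states for a bounded number of steps. The state names, the `transitions` layout, and the function name are assumptions made for illustration.

```python
def max_safety_probability(transitions, unsafe, horizon):
    """Maximum probability of staying out of `unsafe` for `horizon` steps.

    transitions[state][action] is a list of (next_state, probability) pairs.
    """
    values = {s: 0.0 if s in unsafe else 1.0 for s in transitions}
    for _ in range(horizon):
        new_values = {}
        for state, actions in transitions.items():
            if state in unsafe:
                new_values[state] = 0.0
            else:
                # Bellman backup: pick the action with the best expected value.
                new_values[state] = max(
                    sum(p * values[nxt] for nxt, p in outcomes)
                    for outcomes in actions.values()
                )
        values = new_values
    return values

# Hypothetical three-state example.
transitions = {
    "start": {
        "risky": [("goal", 0.5), ("crash", 0.5)],
        "careful": [("goal", 0.9), ("crash", 0.1)],
    },
    "goal": {"stay": [("goal", 1.0)]},
    "crash": {"stay": [("crash", 1.0)]},
}
values = max_safety_probability(transitions, unsafe={"crash"}, horizon=10)
```

A specification such as "avoid `crash` with probability at least 0.8" is then satisfiable from `start` exactly when `values["start"]` reaches that threshold.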

[AI-13] ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions AAAI2026

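[Quick read]: This paper addresses the efficiency and theoretical guarantees of global optimization of Lipschitz-continuous functions when the Lipschitz constant is unknown, targeting the high computational cost and overly conservative early behavior of the existing Every Call is Precious (ECP) framework. The key to the solution is the ECPv2 algorithm, whose core innovations are: (i) an adaptive lower bound that avoids vacuous acceptance regions, (ii) a Worst-m memory mechanism that limits comparisons against past evaluations and so reduces complexity, and (iii) a fixed random projection that accelerates distance computations in high dimensions. With these changes ECPv2 retains the no-regret guarantee, achieves better finite-time bounds, and runs markedly faster in practice.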

Link: https://arxiv.org/abs/2511.16575
Authors: Fares Fourati, Mohamed-Slim Alouini, Vaneet Aggarwal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: Accepted at AAAI 2026 (main technical track), extended version

Abstract:We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP’s no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.
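
The idea that an evaluation is only worth spending if the point is "potentially informative" can be illustrated with a generic Lipschitz acceptance test. This is a sketch under stated assumptions, not ECPv2 itself: it assumes a guessed Lipschitz constant `k`, compares against every past evaluation, and does not attempt the adaptive lower bound, Worst-m memory, or random projection named in the abstract. The function name is hypothetical.

```python
import math

def is_potential_maximizer(candidate, points, values, k):
    """True if `candidate` could still beat the best value seen so far.

    If f is k-Lipschitz, then f(x) <= f(x_i) + k * ||x - x_i|| for every
    evaluated point x_i, so the minimum of these bounds caps f(candidate).
    """
    best = max(values)
    upper_bound = min(
        v + k * math.dist(candidate, p) for p, v in zip(points, values)
    )
    return upper_bound >= best

points = [(0.0, 0.0), (1.0, 1.0)]   # previously evaluated points
values = [0.0, 1.0]                 # their function values
near_worst = is_potential_maximizer((0.1, 0.0), points, values, k=1.0)
near_best = is_potential_maximizer((1.0, 0.9), points, values, k=1.0)
```

Candidates that fail the test are rejected without calling the objective. The loop over all past points is the cost that a fixed-size memory, as described in the abstract, would bound.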

[AI-14] Interfacial and bulk switching MoS2 memristors for an all-2D reservoir computing framework

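[Quick read]: This paper addresses the high computational complexity and low energy efficiency of conventional artificial neural networks on time-series data by proposing a reservoir computing (RC) network built from memristive devices. The key to the solution is using the thickness of chemical vapor deposited (CVD) MoS₂ films to tune short- and long-term memory: monolayer (1L) MoS₂ devices show volatile (short-term memory) switching and are used to generate four-bit reservoir states, while multilayer (ML) MoS₂ devices show non-volatile resistance switching with excellent uniformity and analog conductance tuning, attributed to trap-assisted space-charge limited conduction (SCLC), which gives bulk-limited resistance switching. With this mixed-memory design, the RC network reaches 89.56% precision on a spoken-digit recognition task and is also applied to nonlinear time-series modeling.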

Link: https://arxiv.org/abs/2511.16557
Authors: Asmita S. Thool, Sourodeep Roy, Prahalad Kanti Barman, Kartick Biswas, Pavan Nukala, Abhishek Misra, Saptarshi Das, Bhaswar Chakrabarti
Affiliations: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
Comments:

Abstract: In this study, we design a reservoir computing (RC) network by exploiting short- and long-term memory dynamics in Au/Ti/MoS₂/Au memristive devices. The temporal dynamics are engineered by controlling the thickness of the Chemical Vapor Deposited (CVD) MoS₂ films. Devices with a monolayer (1L) MoS₂ film exhibit volatile (short-term memory) switching dynamics. We also report non-volatile resistance switching with excellent uniformity and analog behavior in conductance tuning for the multilayer (ML) MoS₂ memristive devices. We correlate this performance with a trap-assisted space-charge limited conduction (SCLC) mechanism, leading to a bulk-limited resistance switching behavior. Four-bit reservoir states are generated using volatile memristors. The readout layer is implemented with an array of nonvolatile synapses. This small RC network achieves 89.56% precision in a spoken-digit recognition task and is also used to analyze a nonlinear time series equation.

[AI-15] Utilizing Large Language Models for Zero-Shot Medical Ontology Extension from Clinical Notes

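[Quick read]: This paper addresses how to automatically extract medical entities from unstructured clinical notes and integrate them into hierarchical medical ontologies, extending the coverage and utility of existing ontologies. The key to the solution is a new framework, CLOZE, which uses the language understanding and biomedical knowledge of pre-trained large language models (LLMs) to identify medical concepts and model their relationships zero-shot, with no additional labeled data or training. It protects patient privacy by automatically removing protected health information (PHI), giving an accurate, scalable, and privacy-preserving method for ontology extension.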

Link: https://arxiv.org/abs/2511.16548
Authors: Guanchen Wu, Yuzhang Xie, Huanwei Wu, Zhe He, Hui Shao, Xiao Hu, Carl Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: BIBM 2025 (WS#44: Biological ontologies and knowledge bases (BiOK) in the LLM era)

Abstract:Integrating novel medical concepts and relationships into existing ontologies can significantly enhance their coverage and utility for both biomedical research and clinical applications. Clinical notes, as unstructured documents rich with detailed patient observations, offer valuable context-specific insights and represent a promising yet underutilized source for ontology extension. Despite this potential, directly leveraging clinical notes for ontology extension remains largely unexplored. To address this gap, we propose CLOZE, a novel framework that uses large language models (LLMs) to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies. By capitalizing on the strong language understanding and extensive biomedical knowledge of pre-trained LLMs, CLOZE effectively identifies disease-related concepts and captures complex hierarchical relationships. The zero-shot framework requires no additional training or labeled data, making it a cost-efficient solution. Furthermore, CLOZE ensures patient privacy through automated removal of protected health information (PHI). Experimental results demonstrate that CLOZE provides an accurate, scalable, and privacy-preserving ontology extension framework, with strong potential to support a wide range of downstream applications in biomedical research and clinical informatics.

[AI-16] ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation

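[Quick read]: This paper addresses the heavy compute cost, large parameter counts, and opaque decision process of large Vision Transformer (ViT) models. The core of the solution is ODE-ViT, which reformulates the ViT as an ordinary differential equation (ODE) system satisfying the conditions for well-posed and stable dynamics, giving stable and interpretable performance with fewer parameters. The key innovation is to use the theoretical link between residual networks and ODEs to build a continuous dynamical trajectory, and to add a plug-and-play teacher-student framework in which the intermediate representations of a discrete ViT are treated as ODE solutions that guide the continuous trajectory of ODE-ViT. This improves performance by more than 10% over a free ODE-ViT trained from scratch.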

Link: https://arxiv.org/abs/2511.16501
Authors: Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
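
The connection between residual networks and ODEs that the abstract builds on can be shown in a few lines: a stack of residual updates x ← x + f(x) is the forward Euler scheme for dx/dt = f(x) with step size 1. The sketch below uses a toy scalar function in place of an attention block; the names `residual_stack` and `euler_solve` are hypothetical and nothing here reproduces ODE-ViT itself.

```python
import math

def residual_stack(x, f, num_layers):
    """Discrete residual network: x_{l+1} = x_l + f(x_l)."""
    for _ in range(num_layers):
        x = x + f(x)
    return x

def euler_solve(x, f, t_end, num_steps):
    """Forward Euler for dx/dt = f(x): x <- x + h * f(x)."""
    h = t_end / num_steps
    for _ in range(num_steps):
        x = x + h * f(x)
    return x

def f(x):
    # Toy scalar dynamics standing in for a learned block.
    return -0.5 * x

discrete = residual_stack(1.0, f, num_layers=4)
same_as_euler = euler_solve(1.0, f, t_end=4.0, num_steps=4)   # step size h = 1
finer = euler_solve(1.0, f, t_end=4.0, num_steps=4000)        # smaller steps
exact = math.exp(-0.5 * 4.0)                                  # closed-form solution
```

With four layers the residual stack and the Euler solver agree exactly, while shrinking the step size moves the discrete trajectory toward the continuous solution.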

[AI-17] LLM4EO: Large Language Model for Evolutionary Optimization in Flexible Job Shop Scheduling

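[Quick read]: This paper addresses the problem that static operator design in Evolutionary Algorithms (EAs) makes search performance fluctuate across iterations and degrade easily. Traditional dynamic operators adjust parameters locally but lack adaptive optimization over the whole run. The key to the solution is using Large Language Models (LLMs) for operator-level meta-evolution in the LLM4EO framework: operators are initialized through knowledge transfer, their behavior is analyzed by combining fitness and evolutionary features to produce improvement suggestions, and, when population evolution stagnates, improvement prompting strategies dynamically optimize the operators' gene selection priorities. Populations and operators thus co-evolve, improving the efficiency and adaptability of EAs.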

Link: https://arxiv.org/abs/2511.16485
Authors: Rongjie Liao, Junhao Qiu, Xin Chen, Xiaoping Li
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Customized static operator design has enabled widespread application of Evolutionary Algorithms (EAs), but their search performance is transient during iterations and prone to degradation. Dynamic operators aim to address this but typically rely on predefined designs and localized parameter control during the search process, lacking adaptive optimization throughout evolution. To overcome these limitations, this work leverages Large Language Models (LLMs) to perceive evolutionary dynamics and enable operator-level meta-evolution. The proposed framework, LLMs for Evolutionary Optimization (LLM4EO), comprises three components: knowledge-transfer-based operator design, evolution perception and analysis, and adaptive operator evolution. Firstly, initialization of operators is performed by transferring the strengths of classical operators via LLMs. Then, search preferences and potential limitations of operators are analyzed by integrating fitness performance and evolutionary features, accompanied by corresponding suggestions for improvement. Upon stagnation of population evolution, gene selection priorities of operators are dynamically optimized via improvement prompting strategies. This approach achieves co-evolution of populations and operators in the search, introducing a novel paradigm for enhancing the efficiency and adaptability of EAs. Finally, a series of validations on multiple benchmark datasets of the flexible job shop scheduling problem demonstrate that LLM4EO accelerates population evolution and outperforms both mainstream evolutionary programming and traditional EAs.

[AI-18] Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense AAAI-26

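[Quick read]: This paper addresses the difficulty of designing rewards for autonomous cyber attack and defense learning agents in complex, dynamic environments, a task that usually demands deep involvement from domain experts. The key to the solution is using a Large Language Model (LLM) to generate guiding reward structures: contextual information about the cyber simulation environment is first given to the LLM, which produces reward designs that reflect the heterogeneity of attack and defense behaviors. These reward structures are then used in a Deep Reinforcement Learning (DRL)-driven attack-defense simulation environment to train an ensemble of cyber defense policies. The approach improves the efficiency of automated reward design and the diversity of the resulting policies, so that agents can handle varied adversarial behaviors.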

Link: https://arxiv.org/abs/2511.16483
Authors: Sayak Mukherjee, Samrat Chatterjee, Emilie Purvine, Ted Fujimoto, Tegan Emerson
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted in the AAAI-26 Workshop on Artificial Intelligence for Cyber Security (AICS)

Abstract:Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.

[AI-19] Correlation-Aware Feature Attribution Based Explainable AI

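[Quick read]: This paper addresses three problems of current global attribution methods on complex models: high computational cost, instability under correlated inputs, and poor scaling to large or heterogeneous datasets. The core of the solution is ExCIR (Explainability through Correlation Impact Ratio), a correlation-aware attribution score with a lightweight transfer protocol that reproduces the full model's feature-importance ranking from a fraction of the data. After robust centering of features and model outputs (for example, subtracting the median or mid-mean), it quantifies their sign-aligned co-movement. BlockCIR, a groupwise extension, scores sets of strongly correlated features as a single unit, avoiding double-counting in collinear clusters and stabilizing the rankings.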

Link: https://arxiv.org/abs/2511.16482
Authors: Poushali Sengupta, Yan Zhang, Frank Eliassen, Sabita Maharjan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted, 2026 International Conference on Advances in Artificial Intelligence and Machine Learning (AAIML 2026)

Abstract: Explainable AI (XAI) is increasingly essential as modern models become more complex and high-stakes applications demand transparency, trust, and regulatory compliance. Existing global attribution methods often incur high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets. We address these gaps with ExCIR (Explainability through Correlation Impact Ratio), a correlation-aware attribution score equipped with a lightweight transfer protocol that reproduces full-model rankings using only a fraction of the data. ExCIR quantifies sign-aligned co-movement between features and model outputs after robust centering (subtracting a robust location estimate, e.g., median or mid-mean, from features and outputs). We further introduce BlockCIR, a groupwise extension of ExCIR that scores sets of correlated features as a single unit. By aggregating the same signed-co-movement numerators and magnitudes over predefined or data-driven groups, BlockCIR mitigates double-counting in collinear clusters (e.g., synonyms or duplicated sensors) and yields smoother, more stable rankings when strong dependencies are present. Across diverse text, tabular, signal, and image datasets, ExCIR shows trustworthy agreement with established global baselines and the full model, delivers consistent top-k rankings across settings, and reduces runtime via lightweight evaluation on a subset of rows. Overall, ExCIR provides computationally efficient, consistent, and scalable explainability for real-world deployment.

[AI-20] PersonaDrift: A Benchmark for Temporal Anomaly Detection in Language-Based Dementia Monitoring

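[Quick read]: This paper addresses the difficulty existing computational tools have in tracking the gradual behavioral drift that people with cognitive decline, such as people living with dementia (PLwD), show in everyday communication. The core challenge is that conventional methods are insensitive to subtle, slow changes and ignore behavioral differences between individuals. The key to the solution is PersonaDrift, a synthetic benchmark built from caregiver interviews that simulates 60-day interaction logs between different users and a digital reminder system, injecting two kinds of change that caregivers identified as salient: flattened sentiment and off-topic replies. By combining personalized baseline modeling with time-series analysis, detection models can separate natural individual variation from genuine drift. Experiments show that personalized classifiers clearly outperform generalized ones, underlining the importance of individual behavioral context for early detection.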

Link: https://arxiv.org/abs/2511.16445
Authors: Joy Lai, Alex Mihailidis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:People living with dementia (PLwD) often show gradual shifts in how they communicate, becoming less expressive, more repetitive, or drifting off-topic in subtle ways. While caregivers may notice these changes informally, most computational tools are not designed to track such behavioral drift over time. This paper introduces PersonaDrift, a synthetic benchmark designed to evaluate machine learning and statistical methods for detecting progressive changes in daily communication, focusing on user responses to a digital reminder system. PersonaDrift simulates 60-day interaction logs for synthetic users modeled after real PLwD, based on interviews with caregivers. These caregiver-informed personas vary in tone, modality, and communication habits, enabling realistic diversity in behavior. The benchmark focuses on two forms of longitudinal change that caregivers highlighted as particularly salient: flattened sentiment (reduced emotional tone and verbosity) and off-topic replies (semantic drift). These changes are injected progressively at different rates to emulate naturalistic cognitive trajectories, and the framework is designed to be extensible to additional behaviors in future use cases. To explore this novel application space, we evaluate several anomaly detection approaches, unsupervised statistical methods (CUSUM, EWMA, One-Class SVM), sequence models using contextual embeddings (GRU + BERT), and supervised classifiers in both generalized and personalized settings. Preliminary results show that flattened sentiment can often be detected with simple statistical models in users with low baseline variability, while detecting semantic drift requires temporal modeling and personalized baselines. Across both tasks, personalized classifiers consistently outperform generalized ones, highlighting the importance of individual behavioral context.
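
One of the statistical baselines named in the abstract, CUSUM, can be sketched with a personalized baseline as follows. This is a generic one-sided CUSUM, not the benchmark's code: the function name, the 14-day baseline window, the `slack` and `threshold` values, and the sentiment scores are all hypothetical.

```python
import statistics

def cusum_downward(series, baseline_days=14, slack=0.5, threshold=5.0):
    """Return the first index where a downward drift is flagged, or None.

    The mean and standard deviation come from the user's own first
    `baseline_days` observations, which makes the baseline personalized.
    """
    base = series[:baseline_days]
    mean = statistics.fmean(base)
    std = statistics.stdev(base) or 1.0   # guard against a constant baseline
    s = 0.0
    for day in range(baseline_days, len(series)):
        z = (series[day] - mean) / std
        # Accumulate evidence only for values below the baseline.
        s = max(0.0, s - z - slack)
        if s > threshold:
            return day
    return None

# Hypothetical 60-day sentiment scores: stable for 30 days, then a slow decline.
stable = [0.6, 0.7, 0.65, 0.55, 0.62, 0.68] * 5
declining = [0.6 - 0.02 * i for i in range(30)]
alarm_day = cusum_downward(stable + declining)
no_alarm = cusum_downward(stable * 2)
```

Standardizing against each user's own history is what lets the same threshold serve users with different baseline variability, in line with the abstract's finding that personalized settings help.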

[AI-21] From generative AI to the brain: five takeaways

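[Quick read]: This paper asks how the clearly defined generative principles behind generative AI relate to the mechanisms studied in cognitive neuroscience, in order to offer a new perspective on neural information processing. The key to its approach is a systematic discussion of five concepts from machine learning (ML) research: the shortcomings of world modelling, the generation of thought processes, attention, neural scaling laws, and quantization. These concepts characterize neural information processing systems and may offer cognitive neuroscience theoretical frameworks and empirical leads.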

Link: https://arxiv.org/abs/2511.16432
Authors: Claudius Gros
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Frontiers in Computational Neuroscience, in press

Abstract:The big strides seen in generative AI are not based on somewhat obscure algorithms, but due to clearly defined generative principles. The resulting concrete implementations have proven themselves in large numbers of applications. We suggest that it is imperative to thoroughly investigate which of these generative principles may be operative also in the brain, and hence relevant for cognitive neuroscience. In addition, ML research led to a range of interesting characterizations of neural information processing systems. We discuss five examples, the shortcomings of world modelling, the generation of thought processes, attention, neural scaling laws, and quantization, that illustrate how much neuroscience could potentially learn from ML research.

[AI-22] Generative Modeling of Clinical Time Series via Latent Stochastic Differential Equations

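[Quick read]: This paper addresses three challenges in modeling clinical time series (such as electronic health records and medical registry data): irregular sampling, complex latent physiology, and inherent uncertainty in both measurements and disease progression. The authors propose a generative modeling framework based on latent neural stochastic differential equations (latent neural SDEs), whose central idea is to treat a clinical time series as discrete observations of a controlled stochastic dynamical system. The key to the solution is to model the latent dynamics with a neural SDE, pair it with modality-dependent emission models, and use variational inference for both state estimation and parameter learning. Within one scalable probabilistic framework, this handles irregular sampling, learns non-linear interactions, and quantifies the stochasticity of disease progression and measurement noise.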

Link: https://arxiv.org/abs/2511.16427
Authors: Muhammad Aslanimoghanloo, Ahmed ElGazzar, Marcel van Gerven
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Clinical time series data from electronic health records and medical registries offer unprecedented opportunities to understand patient trajectories and inform medical decision-making. However, leveraging such data presents significant challenges due to irregular sampling, complex latent physiology, and inherent uncertainties in both measurements and disease progression. To address these challenges, we propose a generative modeling framework based on latent neural stochastic differential equations (SDEs) that views clinical time series as discrete-time partial observations of an underlying controlled stochastic dynamical system. Our approach models latent dynamics via neural SDEs with modality-dependent emission models, while performing state estimation and parameter learning through variational inference. This formulation naturally handles irregularly sampled observations, learns complex non-linear interactions, and captures the stochasticity of disease progression and measurement noise within a unified scalable probabilistic framework. We validate the framework on two complementary tasks: (i) individual treatment effect estimation using a simulated pharmacokinetic-pharmacodynamic (PKPD) model of lung cancer, and (ii) probabilistic forecasting of physiological signals using real-world intensive care unit (ICU) data from 12,000 patients. Results show that our framework outperforms ordinary differential equation and long short-term memory baseline models in accuracy and uncertainty estimation. These results highlight its potential for enabling precise, uncertainty-aware predictions to support clinical decision-making.
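
Why an SDE formulation copes naturally with irregular sampling can be seen in a minimal Euler-Maruyama simulation: the latent state is integrated forward to whatever time the next observation happens to fall on. The sketch below uses closed-form toy drift and diffusion functions where the paper uses neural networks, a plain Gaussian emission in place of modality-dependent emission models, and no variational inference. All names and constants are assumptions.

```python
import math
import random

def euler_maruyama(z0, drift, diffusion, obs_times, rng, max_dt=0.01):
    """Simulate dz = drift(z, t) dt + diffusion(z, t) dW.

    Returns the latent state at each (possibly irregular) observation time.
    """
    z, t, states = z0, 0.0, []
    for t_obs in obs_times:
        # Split the gap to the next observation into equal small steps.
        n = max(1, math.ceil((t_obs - t) / max_dt))
        dt = (t_obs - t) / n
        for _ in range(n):
            dw = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment
            z = z + drift(z, t) * dt + diffusion(z, t) * dw
            t += dt
        t = t_obs   # remove accumulated floating-point drift in time
        states.append(z)
    return states

rng = random.Random(0)
obs_times = [0.3, 0.35, 1.2, 2.5]   # irregular sampling times
latent = euler_maruyama(1.0, lambda z, t: -z, lambda z, t: 0.2, obs_times, rng)
observed = [z + rng.gauss(0.0, 0.05) for z in latent]   # noisy emissions
noise_free = euler_maruyama(1.0, lambda z, t: -z, lambda z, t: 0.0, obs_times, rng)
```

With the diffusion set to zero the simulation reduces to an ODE solve and tracks the exponential decay of the toy drift, which gives a quick sanity check of the integrator.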

[AI-23] Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report AAAI26

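[Quick read]: This paper addresses the challenges of understanding ESG (Environmental, Social, and Governance) reports at scale, in particular the chaotic reading order and implicit hierarchy caused by unstructured layouts and weak semantic structure. The key to the solution is the Pharos-ESG framework, whose core innovations are a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal context aggregation pipeline that incorporates visual elements. It converts raw ESG reports into structured representations and adds ESG, GRI, and sentiment labels to meet the needs of financial research.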

Link: https://arxiv.org/abs/2511.16417
Authors: Yan Chen, Yu Zou, Jialei Zeng, Haoran You, Xiaorui Zhou, Aixi Zhong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 26: main technical track Oral

Abstract: Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial governance, transforming capital allocation architectures, regulatory frameworks, and systemic risk coordination mechanisms. However, as the core medium for assessing corporate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic reading order from slide-like irregular layouts and implicit hierarchies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a unified framework that transforms ESG reports into structured representations through multimodal parsing, contextual narration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware segmentation guided by table-of-contents anchors, and a multimodal aggregation pipeline that contextually transforms visual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical demands of financial research. Extensive experiments on annotated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG reports, spanning Mainland China, Hong Kong, and U.S. markets, featuring unified structured representations of multimodal content, enriched with fine-grained layout and semantic annotations to better support ESG integration in financial governance and decision-making.

[AI-24] Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance AAAI26

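[Quick read]: This paper addresses the lack of trust in generative AI agents in enterprise production environments. The core challenge is that traditional lakehouse architectures do not fit agent access patterns, making data consistency, governance, and trustworthiness hard to guarantee. The key to the solution is an agent-first design, Bauplan, which reimplements data and compute isolation in the lakehouse and borrows the idea of multi-version concurrency control (MVCC) from databases, adapted to a decoupled, multi-language setting. This gives transaction-driven governance and ultimately supports self-healing data pipelines in which agent reasoning is coupled with guarantees of correctness and reliability.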

Link: https://arxiv.org/abs/2511.16402
Authors: Jacopo Tagliabue, Federico Bianchi, Ciro Greco
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: AAAI26, pre-print of paper accepted at the Trustworthy Agentic AI Workshop

Abstract:Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.

[AI-25] Collaborative Management for Chronic Diseases and Depression: A Double Heterogeneity-based Multi-Task Learning Method

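[Quick read]: This paper addresses the joint assessment of comorbid chronic diseases and depression: how to model multiple chronic diseases (such as cardiovascular disease and diabetes) together with depressive symptoms from wearable sensor data, in support of collaborative chronic care. Conventional methods mostly target a single disease and overlook the interactions between physical and mental comorbidities, which limits prediction accuracy. The core of the solution is Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL), a multi-task learning (MTL) framework based on double heterogeneity. Its key innovations are: (1) group-level modeling to improve predictions for new patients; (2) a decomposition strategy to reduce model complexity; and (3) a Bayesian network that explicitly models dependencies across tasks, balancing similarities and differences, which significantly improves the accuracy of joint multi-disease assessment.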

Link: https://arxiv.org/abs/2511.16398
Authors: Yidong Chai, Haoxin Liu, Jiaheng Xie, Chaopeng Wang, Xiao Fang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Wearable sensor technologies and deep learning are transforming healthcare management. Yet, most health sensing studies focus narrowly on physical chronic diseases. This overlooks the critical need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care. We conceptualize multi-disease assessment, including both physical diseases and depression, as a multi-task learning (MTL) problem, where each disease assessment is modeled as a task. This joint formulation leverages inter-disease relationships to improve accuracy, but it also introduces the challenge of double heterogeneity: chronic diseases differ in their manifestation (disease heterogeneity), and patients with the same disease show varied patterns (patient heterogeneity). To address these issues, we first adopt existing techniques and propose a base method. Given the limitations of the base method, we further propose an Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL) method that improves the base method through three innovations: (1) group-level modeling to support new patient predictions, (2) a decomposition strategy to reduce model complexity, and (3) a Bayesian network that explicitly captures dependencies while balancing similarities and differences across model components. Empirical evaluations on real-world wearable sensor data demonstrate that ADH-MTL significantly outperforms existing baselines, and each of its innovations is shown to be effective. This study contributes to health information systems by offering a computational solution for integrated physical and mental healthcare and provides design principles for advancing collaborative chronic disease management across the pre-treatment, treatment, and post-treatment phases.

[AI-26] CorrectHDL: Agentic HDL Design with LLMs Leveraging High-Level Synthesis as Reference

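[Quick Read]: This paper addresses the functional errors that hallucination introduces when large language models (LLMs) generate hardware description language (HDL) designs. The key to its solution is CorrectHDL, a framework that uses high-level synthesis (HLS) results as a functional reference: it iteratively compares the simulated behavior of the LLM-generated HDL circuit with that of the HLS reference design to correct functional defects, and uses a Retrieval-Augmented Generation (RAG) mechanism to repair syntax errors. The resulting circuits keep functional correctness while achieving significantly better area and power efficiency than conventional HLS designs, approaching the quality of human-engineered circuits.

Link: https://arxiv.org/abs/2511.16395
Authors: Kangwei Xu, Grace Li Zhang, Ulf Schlichtmann, Bing Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE); Systems and Control (eess.SY)
Comments: 7 pages, 15 figures, 2 tables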

Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in hardware front-end design using hardware description languages (HDLs). However, their inherent tendency toward hallucination often introduces functional errors into the generated HDL designs. To address this issue, we propose the framework CorrectHDL that leverages high-level synthesis (HLS) results as functional references to correct potential errors in LLM-generated HDL designs. The input to the proposed framework is a C/C++ program that specifies the target circuit’s functionality. The program is provided to an LLM to directly generate an HDL design, whose syntax errors are repaired using a Retrieval-Augmented Generation (RAG) mechanism. The functional correctness of the LLM-generated circuit is iteratively improved by comparing its simulated behavior with an HLS reference design produced by conventional HLS tools, which ensures the functional correctness of the result but can lead to suboptimal area and power efficiency. Experimental results demonstrate that circuits generated by the proposed framework achieve significantly better area and power efficiency than conventional HLS designs and approach the quality of human-engineered circuits. Meanwhile, the correctness of the resulting HDL implementation is maintained, highlighting the effectiveness and potential of agentic HDL design leveraging the generative capabilities of LLMs and the rigor of traditional correctness-driven IC design flows.
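
Comparing a generated design's simulated behavior with a trusted reference is a form of differential testing. The sketch below only illustrates that idea in plain Python and is not CorrectHDL's implementation: `reference_adder` stands in for the simulated HLS reference, `candidate_adder` for a buggy LLM-generated design, and every name here is hypothetical.

```python
import random

def reference_adder(a: int, b: int) -> int:
    """Stand-in for the simulated HLS reference: 8-bit wrapping addition."""
    return (a + b) & 0xFF

def candidate_adder(a: int, b: int) -> int:
    """Stand-in for a generated design with a bug: it saturates instead of wrapping."""
    return min(a + b, 0xFF)

def find_mismatches(candidate, reference, trials=200, seed=0):
    """Drive both designs with the same random stimuli and collect disagreements."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        a, b = rng.randrange(256), rng.randrange(256)
        got, want = candidate(a, b), reference(a, b)
        if got != want:
            mismatches.append((a, b, got, want))
    return mismatches

mismatches = find_mismatches(candidate_adder, reference_adder)
```

In a repair loop of this kind, the mismatching stimuli would be the feedback handed back to the generator for the next attempt.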

[AI-27] Robot Metacognition: Decision Making with Confidence for Tool Invention

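[Quick Read]: This paper addresses robots' lack of self-awareness: they cannot reflect on or evaluate their own decision processes, which limits intelligent behavior in complex real-world environments. The key to the solution is a metacognition architecture centered on confidence, a second-order judgment on decisions, which lets a robot assess how reliable its decisions are and adjust its behavior accordingly. The approach emphasizes embodied action monitoring as a route to better-informed decisions, improves robustness during physical deployment, and is demonstrated on autonomous tool invention.

Link: https://arxiv.org/abs/2511.16390
Authors: Ajith Anil Meera, Poppy Collis, Polina Arbuzova, Abián Torres, Paul F Kinghorn, Ricardo Sanz, Pablo Lanillos
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: under review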

Abstract:Robots today often miss a key ingredient of truly intelligent behavior: the ability to reflect on their own cognitive processes and decisions. In humans, this self-monitoring or metacognition is crucial for learning, decision making and problem solving. For instance, they can evaluate how confident they are in performing a task, thus regulating their own behavior and allocating proper resources. Taking inspiration from neuroscience, we propose a robot metacognition architecture centered on confidence (a second-order judgment on decisions) and we demonstrate it on the use case of autonomous tool invention. We propose the use of confidence as a metacognitive measure within the robot decision making scheme. Confidence-informed robots can evaluate the reliability of their decisions, improving their robustness during real-world physical deployment. This form of robotic metacognition emphasizes embodied action monitoring as a means to achieve better informed decisions. We also highlight potential applications and research directions for robot metacognition.

[AI-28] An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models

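[Quick Read]: This paper addresses how to automatically validate that optimization models generated by large language models (LLMs) from natural-language descriptions are correct and satisfy the original requirements. The key to the solution is an agent-based validation method that borrows techniques from software testing: several cooperating agents build a problem-level testing API, generate test cases against it, and then generate mutations specific to the optimization model (mutation testing). Validation quality is measured with mutation coverage, a well-known software testing measure of a test suite's fault-detection power.

Link: https://arxiv.org/abs/2511.16383
Authors: Alexander Zadorojniy, Segev Wasserkrug, Eitan Farchi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: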

Abstract:Recently, using Large Language Models (LLMs) to generate optimization models from natural language descriptions has become increasingly popular. However, a major open question is how to validate that the generated models are correct and satisfy the requirements defined in the natural language description. In this work, we propose a novel agent-based method for automatic validation of optimization models that builds upon and extends methods from software testing to address optimization modeling. This method consists of several agents that initially generate a problem-level testing API, then generate tests utilizing this API, and, lastly, generate mutations specific to the optimization model (a well-known software testing technique assessing the fault detection power of the test suite). In this work, we detail this validation framework and show, through experiments, the high quality of validation provided by this agent ensemble in terms of the well-known software testing measure called mutation coverage.
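
To make mutation coverage concrete, here is a minimal sketch on a toy knapsack constraint. It is an illustration under stated assumptions, not the paper's agents or API: the "model" is a single feasibility check, the mutants are written by hand, and all names (`is_feasible`, `MUTANTS`, `TESTS`) are hypothetical.

```python
from typing import Callable, Dict, List

WEIGHTS = [3, 4, 5]
CAPACITY = 7

def total_weight(selection: List[int]) -> int:
    return sum(w * x for w, x in zip(WEIGHTS, selection))

def is_feasible(selection: List[int]) -> bool:
    """Original constraint: selected weight must not exceed capacity."""
    return total_weight(selection) <= CAPACITY

# Each mutant changes the constraint in exactly one way.
MUTANTS: Dict[str, Callable[[List[int]], bool]] = {
    "strict_inequality": lambda s: total_weight(s) < CAPACITY,
    "capacity_plus_one": lambda s: total_weight(s) <= CAPACITY + 1,
    "constraint_dropped": lambda s: True,
}

# Tests are (selection, expected feasibility) pairs.
TESTS = [
    ([1, 1, 0], True),   # weight 7, exactly at capacity
    ([0, 1, 1], False),  # weight 9, over capacity
]

def passes(model, tests) -> bool:
    return all(model(sel) == expected for sel, expected in tests)

def mutation_coverage(mutants, tests) -> float:
    """Fraction of mutants that at least one test detects ("kills")."""
    killed = sum(1 for mutant in mutants.values() if not passes(mutant, tests))
    return killed / len(mutants)

coverage = mutation_coverage(MUTANTS, TESTS)  # 2 of 3 mutants are killed
```

The surviving mutant (`capacity_plus_one`) points at a missing test: adding `([1, 0, 1], False)`, a selection of weight 8, kills it and raises coverage to 1.0.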

[AI-29] Are Foundation Models Useful for Bankruptcy Prediction? NEURIPS2025

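[Quick Read]: This paper asks whether foundation models are effective for corporate bankruptcy prediction, a question that had not been systematically evaluated against established machine learning methods. It compares Llama-3.3-70B-Instruct (a large language model) and TabPFN (a tabular foundation model) with classical baselines such as XGBoost and CatBoost on highly imbalanced datasets of over one million company records. The key findings: XGBoost and CatBoost consistently outperform the foundation models across all prediction horizons; LLM-based approaches produce unreliable probability estimates, which undermines their use in risk-sensitive financial settings; and TabPFN, while competitive with simpler baselines, demands computational resources that its performance gains do not justify. For this task, specialized methods remain more effective than current foundation models.

Link: https://arxiv.org/abs/2511.16375
Authors: Marcin Kostrzewa, Oleksii Furman, Roman Furman, Sebastian Tomczak, Maciej Zięba
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 Workshop: Generative AI in Finance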

Abstract:Foundation models have shown promise across various financial applications, yet their effectiveness for corporate bankruptcy prediction remains systematically unevaluated against established methods. We study bankruptcy forecasting using Llama-3.3-70B-Instruct and TabPFN, evaluated on large, highly imbalanced datasets of over one million company records from the Visegrád Group. We provide the first systematic comparison of foundation models against classical machine learning baselines for this task. Our results show that models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons. LLM-based approaches suffer from unreliable probability estimates, undermining their use in risk-sensitive financial settings. TabPFN, while competitive with simpler baselines, requires substantial computational resources with costs not justified by performance gains. These findings suggest that, despite their generality, current foundation models remain less effective than specialized methods for bankruptcy forecasting.

[AI-30] Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen

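[Quick Read]: This paper addresses the instability and lack of standardization in evaluating synthetic data quality in the Android malware domain, where existing metrics behave inconsistently across scenarios and make it hard to judge the real utility of generated data. The key to the solution is a Super-Metric that aggregates eight metrics across four fidelity dimensions into a single weighted score. The resulting score is more stable and consistent than traditional metrics and correlates more strongly with actual classifier performance.

Link: https://arxiv.org/abs/2511.16373
Authors: Anna Luiza Gomes da Silva, Diego Kreutz, Angelo Diniz, Rodrigo Mansilha, Celso Nobre da Fonseca
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, submitted to ERRC/WRSeg 2025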

Abstract:Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.
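
A single weighted score over grouped metrics can be sketched as follows. The abstract does not name the eight metrics, the four dimensions, or the weights, so everything in this example is an assumption: placeholder dimension names, two metrics per dimension averaged within the dimension, and a weighted mean across dimensions.

```python
from statistics import mean
from typing import Dict, List

def super_metric(scores: Dict[str, List[float]], weights: Dict[str, float]) -> float:
    """Average the metrics inside each dimension, then take a weighted mean over dimensions."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(weights[dim] * mean(vals) for dim, vals in scores.items()) / total_weight

# Hypothetical metric values in [0, 1]: eight metrics spread over four dimensions.
scores = {
    "dimension_a": [0.8, 0.6],
    "dimension_b": [0.9, 0.7],
    "dimension_c": [0.5, 0.7],
    "dimension_d": [1.0, 0.8],
}
equal_weights = {dim: 1.0 for dim in scores}
score = super_metric(scores, equal_weights)  # (0.7 + 0.8 + 0.6 + 0.9) / 4
```

Collapsing several noisy metrics into one number trades detail for stability, which is the property the abstract reports for the Super-Metric.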

[AI-31] Distributed Agent Reasoning Across Independent Systems With Strict Data Locality

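[Quick Read]: This paper addresses secure collaboration among multiple organisations in distributed systems, specifically agent-to-agent communication across institutions without shared identifiers, structured schemas, or centralised data exchange. The core challenge is letting different entities (such as a clinic, an insurer, and a specialist network) reason and decide together over their own local data while protecting privacy. The key to the solution is natural-language message passing combined with pseudonymised case tokens, local data lookups, and controlled operational boundaries: agents exchange concise natural-language summaries through OperationRelay calls to complete an end-to-end workflow (coverage evaluation followed by a clinical appropriateness recommendation), and no agent receives or reconstructs patient identity. The prototype is built on the Orpius platform for multi-agent orchestration and privacy-preserving communication, and is a proof of concept intended to demonstrate feasibility.

Link: https://arxiv.org/abs/2511.16292
Authors: Daniel Vaughan, Kateřina Vaughan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 27 pages, 6 figures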

Abstract:This paper presents a proof-of-concept demonstration of agent-to-agent communication across distributed systems, using only natural-language messages and without shared identifiers, structured schemas, or centralised data exchange. The prototype explores how multiple organisations (represented here as a Clinic, Insurer, and Specialist Network) can cooperate securely via pseudonymised case tokens, local data lookups, and controlled operational boundaries. The system uses Orpius as the underlying platform for multi-agent orchestration, tool execution, and privacy-preserving communication. All agents communicate through OperationRelay calls, exchanging concise natural-language summaries. Each agent operates on its own data (such as synthetic clinic records, insurance enrolment tables, and clinical guidance extracts), and none receives or reconstructs patient identity. The Clinic computes an HMAC-based pseudonymous token, the Insurer evaluates coverage rules and consults the Specialist agent, and the Specialist returns an appropriateness recommendation. The goal of this prototype is intentionally limited: to demonstrate feasibility, not to provide a clinically validated, production-ready system. No clinician review was conducted, and no evaluation beyond basic functional runs was performed. The work highlights architectural patterns, privacy considerations, and communication flows that enable distributed reasoning among specialised agents while keeping data local to each organisation. We conclude by outlining opportunities for more rigorous evaluation and future research in decentralised multi-agent systems.
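
An HMAC-based pseudonymous token of the kind mentioned above can be computed with Python's standard library. This is a minimal sketch, not the prototype's code: the message layout (`patient_id|case_id`), the choice of SHA-256, and the function name `case_token` are assumptions made for illustration.

```python
import hashlib
import hmac

def case_token(secret_key: bytes, patient_id: str, case_id: str) -> str:
    """Derive a stable pseudonym that cannot be reversed without the secret key."""
    message = f"{patient_id}|{case_id}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

# Hypothetical values; in practice the key stays inside the issuing organisation.
clinic_key = b"clinic-local-secret"
token = case_token(clinic_key, "patient-001", "case-42")

# Recomputing with the same inputs reproduces the token; compare in constant time.
same = hmac.compare_digest(token, case_token(clinic_key, "patient-001", "case-42"))
```

Because only the key holder can recompute the token, other parties can refer to the case consistently without ever seeing the underlying identifier.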

[AI-32] MuISQA: Multi-Intent Retrieval-Augmented Generation for Scientific Question Answering

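[Quick Read]: This paper addresses incomplete evidence coverage in conventional Retrieval-Augmented Generation (RAG) systems when they face complex, multi-intent scientific questions. Such questions bundle several sub-task intents, for example identifying gene mutations and linking them to related diseases, and require evidence from heterogeneous sources plus multi-hop reasoning, whereas existing RAG methods are mostly single-intent oriented and yield fragmented, insufficient evidence. The key to the solution is an intent-aware retrieval framework that uses large language models (LLMs) to first hypothesize potential answers, then decompose them into intent-specific queries and retrieve supporting passages for each intent. The retrieved fragments are aggregated and re-ranked with Reciprocal Rank Fusion (RRF), balancing coverage across intents while reducing redundancy and improving retrieval accuracy and evidence coverage.

Link: https://arxiv.org/abs/2511.16283
Authors: Zhiyuan Li, Haisheng Yu, Guangchuan Guo, Nan Zhou, Jiajun Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages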

Abstract:Complex scientific questions often entail multiple intents, such as identifying gene mutations and linking them to related diseases. These tasks require evidence from diverse sources and multi-hop reasoning, while conventional retrieval-augmented generation (RAG) systems are usually single-intent oriented, leading to incomplete evidence coverage. To assess this limitation, we introduce the Multi-Intent Scientific Question Answering (MuISQA) benchmark, which is designed to evaluate RAG systems on heterogeneous evidence coverage across sub-questions. In addition, we propose an intent-aware retrieval framework that leverages large language models (LLMs) to hypothesize potential answers, decompose them into intent-specific queries, and retrieve supporting passages for each underlying intent. The retrieved fragments are then aggregated and re-ranked via Reciprocal Rank Fusion (RRF) to balance coverage across diverse intents while reducing redundancy. Experiments on both MuISQA benchmark and other general RAG datasets demonstrate that our method consistently outperforms conventional approaches, particularly in retrieval accuracy and evidence coverage.
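
Reciprocal Rank Fusion scores each passage by summing 1 / (k + rank) over the ranked lists it appears in. The sketch below shows that standard formula on two hypothetical intent-specific result lists; the passage IDs are made up, `k = 60` is a commonly used default assumed here, and the paper's redundancy handling is not modeled.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; items ranked high in many lists rise to the top."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical per-intent retrieval results for one multi-intent question.
mutation_intent = ["p1", "p2", "p3"]
disease_intent = ["p3", "p1", "p4"]
fused = reciprocal_rank_fusion([mutation_intent, disease_intent])
# fused == ["p1", "p3", "p2", "p4"]
```

Passages that serve both intents (`p1`, `p3`) outrank passages that serve only one, which is how rank fusion can favor evidence covering several intents.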

[AI-33] “To Survive I Must Defect”: Jailbreaking LLMs via the Game-Theory Scenarios

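[Quick Read]: This paper studies the vulnerability of large language models (LLMs) to black-box jailbreak attacks, noting that existing attacks mostly rely on hand-crafted heuristics or narrow search spaces and therefore scale poorly. Its core proposal is the Game-Theory Attack (GTA), a game-theoretic jailbreak framework. The attacker's interaction with a safety-aligned LLM is formalized as a finite-horizon, early-stoppable sequential stochastic game, and the LLM's randomized outputs are reparameterized via quantal response. On top of this, the paper proposes the "template-over-safety flip" behavioral conjecture: game-theoretic scenarios reshape the LLM's effective objective so that, instead of prioritizing safety constraints, it maximizes scenario payoffs within the template, which weakens safety mechanisms. Experiments across multiple protocols and datasets report an attack success rate (ASR) above 95%, along with efficiency, generalization, and scalability.

Link: https://arxiv.org/abs/2511.16278
Authors: Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu, Yun Shen, Xiangshan Gao, Yilong Yang, Shuai Liu, Yutao Yue, Xinlei He
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 20 pages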

Abstract:As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. Compared with prior attacks, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker’s interaction against safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM’s randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture “template-over-safety flip”: by reshaping the LLM’s effective objective through game-theoretic scenarios, the original safety preference may become maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical games such as the disclosure variant of the Prisoner’s Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the ASR. Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1, while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent’s core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports a longitudinal safety monitoring of popular HuggingFace LLMs.

[AI-34] Revisiting Fairness-aware Interactive Recommendation: Item Lifecycle as a Control Knob

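[Quick Read]: This paper addresses the difficulty of balancing fairness and accuracy in interactive recommender systems on short-video platforms, with a focus on how the item lifecycle affects recommendation. Conventional models often assume the classical four-stage lifecycle (introduction, growth, maturity, decline), but empirical analysis shows that items on short-video platforms follow a compressed three-phase pattern (rapid growth, transient stability, sharp decay), so existing methods lack phase-specific exposure strategies. The key to the solution is LHRL, a lifecycle-aware hierarchical reinforcement learning framework with two core components: (1) PhaseFormer, a lightweight encoder that combines STL decomposition with attention mechanisms for robust phase detection; and (2) a two-level HRL agent in which the high-level policy imposes phase-aware fairness constraints and the low-level policy optimizes immediate user engagement, decoupling long-term equity from short-term utility. Experiments on multiple real-world datasets show significant gains in both fairness and user engagement, and the lifecycle-aware rewards also improve existing RL-based recommendation models.

Link: https://arxiv.org/abs/2511.16248
Authors: Yun Lu, Xiaoyu Shi, Hong Xie, Chongjun Xia, Zhenhui Gong, Mingsheng Shang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, conference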

Abstract:This paper revisits fairness-aware interactive recommendation (e.g., TikTok, KuaiShou) by introducing a novel control knob, i.e., the lifecycle of items. We make threefold contributions. First, we conduct a comprehensive empirical analysis and uncover that item lifecycles in short-video platforms follow a compressed three-phase pattern, i.e., rapid growth, transient stability, and sharp decay, which significantly deviates from the classical four-stage model (introduction, growth, maturity, decline). Second, we introduce LHRL, a lifecycle-aware hierarchical reinforcement learning framework that dynamically harmonizes fairness and accuracy by leveraging phase-specific exposure dynamics. LHRL consists of two key components: (1) PhaseFormer, a lightweight encoder combining STL decomposition and attention mechanisms for robust phase detection; (2) a two-level HRL agent, where the high-level policy imposes phase-aware fairness constraints, and the low-level policy optimizes immediate user engagement. This decoupled optimization allows for effective reconciliation between long-term equity and short-term utility. Third, experiments on multiple real-world interactive recommendation datasets demonstrate that LHRL significantly improves both fairness and user engagement. Furthermore, the integration of lifecycle-aware rewards into existing RL-based models consistently yields performance gains, highlighting the generalizability and practical value of our approach.

[AI-35] Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security NDSS2026

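[Quick Read]: This paper addresses the vulnerability of Multimodal Large Language Models (MLLMs) to adversarial attacks delivered through visual inputs, even when their textual safety mechanisms are robust. Two weaknesses are at the core: the continuous nature of visual representations enables gradient-based attacks, and text-based safety mechanisms transfer poorly to visual content. The key to the solution is the Q-MLLM architecture, which introduces two-level vector quantization to form a discrete bottleneck, discretizing visual representations at both the pixel-patch and semantic levels to block attack pathways and bridge the cross-modal safety alignment gap. The method needs no expensive safety-specific fine-tuning or detection overhead; experiments show a significantly better defense success rate against jailbreak and toxic-image attacks than existing approaches, with competitive performance on utility benchmarks.

Link: https://arxiv.org/abs/2511.16229
Authors: Wei Zhao, Zhe Li, Yige Li, Jun Sun
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by NDSS 2026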

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in cross-modal understanding, but remain vulnerable to adversarial attacks through visual inputs despite robust textual safety mechanisms. These vulnerabilities arise from two core weaknesses: the continuous nature of visual representations, which allows for gradient-based attacks, and the inadequate transfer of text-based safety mechanisms to visual content. We introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks while preserving multimodal reasoning capabilities. By discretizing visual representations at both pixel-patch and semantic levels, Q-MLLM blocks attack pathways and bridges the cross-modal safety alignment gap. Our two-stage training methodology ensures robust learning while maintaining model utility. Experiments demonstrate that Q-MLLM achieves significantly better defense success rate against both jailbreak attacks and toxic image attacks than existing approaches. Notably, Q-MLLM achieves perfect defense success rate (100%) against jailbreak attacks except in one arguable case, while maintaining competitive performance on multiple utility benchmarks with minimal inference overhead. This work establishes vector quantization as an effective defense mechanism for secure multimodal AI systems without requiring expensive safety-specific fine-tuning or detection overhead. Code is available at this https URL.
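
The intuition behind a discrete bottleneck is that nearby continuous vectors snap to the same codebook entry, so a small perturbation may leave the downstream input unchanged. The sketch below shows plain nearest-neighbor vector quantization on a toy 2-D codebook. It is an illustration only: the codebook, the vectors, and the function name are assumptions, and Q-MLLM's two-level scheme and training procedure are not modeled.

```python
from typing import List, Sequence, Tuple

def quantize(vector: Sequence[float], codebook: List[Tuple[float, ...]]) -> Tuple[int, Tuple[float, ...]]:
    """Return (index, codeword) of the nearest codebook entry by squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical codebook and feature vectors.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
clean_feature = (0.9, 0.1)
perturbed_feature = (0.95, 0.05)  # a small perturbation of the clean feature

clean_code, _ = quantize(clean_feature, CODEBOOK)
perturbed_code, _ = quantize(perturbed_feature, CODEBOOK)
# Both features map to codebook index 1, so the perturbation is erased by the bottleneck.
```

The lookup is piecewise constant, which is also why it offers no useful gradient for an attacker to follow within a cell.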

[AI-36] FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks

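[Quick Read]: This paper addresses the scarcity of high-quality supervised data for training Large Language Models (LLMs): existing instruction-tuning and reinforcement learning (RL) datasets are costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. The key to the solution is an automated pipeline that combines layout-aware optical character recognition (OCR) with LLM-based semantic parsing to extract well-formed, semantically aligned, low-noise question-answer (QA) and visual question-answer (VQA) pairs from educational documents such as textbooks and exercise sets. This enables scalable use of real-world educational content as a practical alternative to synthetic data for reasoning-oriented LLM training.

Link: https://arxiv.org/abs/2511.16216
Authors: Zhen Hao Wong, Jingwen Deng, Hao Liang, Runming He, Chengyu Shen, Wentao Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: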

Abstract:The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer (QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at this https URL.

[AI-37] ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025

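[Quick Read]: This paper addresses automated assessment of chemical reasoning, targeting the complexity and multimodal nature of International Chemistry Olympiad (IChO)-level problems, which combine visual outputs such as drawn molecular structures with deep chemical reasoning. Its ChemO benchmark rests on two innovations: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism that separates a model's visual perception from its core chemical reasoning. The authors also propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. The top configuration scores 93.6 out of 100 on ChemO, surpassing an estimated human gold-medal threshold and setting a new state of the art in automated chemical problem solving.

Link: https://arxiv.org/abs/2511.16205
Authors: Xu Qiang, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 1 figures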

Abstract:Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model’s visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: this https URL

[AI-38] Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning

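[Quick Read]: This paper addresses the weak robustness and interpretability of a single black-box reward model in reinforcement learning from human feedback (RLHF) when it must optimize multiple, sometimes conflicting preference dimensions such as factuality, helpfulness, and safety. The key to the solution is CRM (Multi-Agent Collaborative Reward Model), which pairs a set of domain-specific specialist evaluators with global evaluators (such as ranker-based and embedding-similarity rewards). Preference evaluation is decomposed into partial signals, and a centralized aggregator fuses them at each timestep, weighing factors such as step-wise correctness, multi-agent agreement, and repetition penalties, to produce a single training reward compatible with standard RL pipelines. This enables multi-perspective reward shaping without human annotations beyond those used to train the evaluators, with the aim of more transparent reward modeling and more stable optimization.

Link: https://arxiv.org/abs/2511.16202
Authors: Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: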

Abstract:We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.
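
One way to picture a centralized aggregator is as a function that fuses per-evaluator scores with an agreement term and a repetition penalty. The abstract does not give CRM's actual formula, so the sketch below is purely an assumed form: the weighted mean, the agreement measure (one minus the score spread), and all parameter names and defaults are hypothetical.

```python
from typing import Dict, Optional

def aggregate_reward(
    agent_scores: Dict[str, float],
    repeated_steps: int,
    weights: Optional[Dict[str, float]] = None,
    agreement_weight: float = 0.1,
    repetition_penalty: float = 0.1,
) -> float:
    """Fuse partial signals from specialist evaluators into one scalar reward."""
    if weights is None:
        weights = {name: 1.0 for name in agent_scores}
    total_weight = sum(weights[name] for name in agent_scores)
    weighted_mean = sum(weights[n] * s for n, s in agent_scores.items()) / total_weight
    # Agreement is high when evaluators give similar scores (small spread).
    spread = max(agent_scores.values()) - min(agent_scores.values())
    agreement = 1.0 - spread
    return weighted_mean + agreement_weight * agreement - repetition_penalty * repeated_steps

# Hypothetical scores in [0, 1] from three specialist evaluators for one step.
step_scores = {"factuality": 0.8, "helpfulness": 0.6, "safety": 1.0}
reward = aggregate_reward(step_scores, repeated_steps=0)            # 0.8 + 0.1 * 0.6
penalized = aggregate_reward(step_scores, repeated_steps=2)         # 0.2 lower
```

Keeping the per-evaluator scores alongside the fused value is what makes such a reward inspectable: each component of the final number can be traced to a named evaluator.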

[AI-39] From Performance to Understanding: A Vision for Explainable Automated Algorithm Design

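[Quick Read]: This paper addresses the black-box character of current automated algorithm design with Large Language Models (LLMs): generated optimisation (meta)heuristics lack explainability, so it is unclear which components or design choices drive performance and how those choices relate to problem structure. The key to the solution is a framework for explainable automated algorithm design built on three pillars: (i) LLM-driven discovery of algorithmic variants; (ii) explainable benchmarking that attributes performance to specific components and hyperparameters; and (iii) problem-class descriptors that connect algorithm behaviour to landscape structure. Together these form a closed knowledge loop that moves the field from blind search toward interpretable, class-specific algorithm design, accelerating progress while producing reusable scientific insight.

Link: https://arxiv.org/abs/2511.16201
Authors: Niki van Stein, Anna V. Kononova, Thomas Bäck
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: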

Abstract:Automated algorithm design is entering a new phase: Large Language Models can now generate full optimisation (meta)heuristics, explore vast design spaces and adapt through iterative feedback. Yet this rapid progress is largely performance-driven and opaque. Current LLM-based approaches rarely reveal why a generated algorithm works, which components matter or how design choices relate to underlying problem structures. This paper argues that the next breakthrough will come not from more automation, but from coupling automation with understanding from systematic benchmarking. We outline a vision for explainable automated algorithm design, built on three pillars: (i) LLM-driven discovery of algorithmic variants, (ii) explainable benchmarking that attributes performance to components and hyperparameters and (iii) problem-class descriptors that connect algorithm behaviour to landscape structure. Together, these elements form a closed knowledge loop in which discovery, explanation and generalisation reinforce each other. We argue that this integration will shift the field from blind search to interpretable, class-specific algorithm design, accelerating progress while producing reusable scientific insight into when and why optimisation strategies succeed.

[AI-40] Fast LLM Post-training via Decoupled and Best-of-N Speculation

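[Quick Read]: This paper targets the long rollout time in large language model (LLM) post-training. The core challenge is raising the efficiency of rollout while preserving the correctness of generation, especially for large-batch inputs where conventional speculative decoding struggles to use GPU resources well. The key to the solution is twofold: (1) a dynamic decoupled speculation execution method that maximizes GPU computational efficiency to speed up large-batch rollout; and (2) a dynamic Best-of-N speculation method that selects and combines different drafting methods according to rollout progress, substantially improving speculation accuracy without extra compute resources, even when the best drafting method is unknown in advance.

Link: https://arxiv.org/abs/2511.16193
Authors: Rongxin Cheng, Kai Zhou, Xingda Wei, Siyuan Liu, Mingcong Han, Mingjing Ai, Yeju Zhou, Baoquan Zhong, Wencong Xiao, Xin Liu, Rong Chen, Haibo Chen
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: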

Abstract:Rollout dominates the training time in large language model (LLM) post-training, where the trained model is used to generate tokens given a batch of prompts. SpecActor achieves fast rollout with speculative decoding that deploys a fast path (e.g., a smaller model) to accelerate the unparallelizable generation, while the correctness is guaranteed by fast parallel verification of the outputs with the original model. SpecActor addresses two foundational challenges in speculative rollout by (1) a dynamic decoupled speculation execution method that maximizes the GPU computational efficiency to realize speedup for large-batch execution, a configuration common in training but unfriendly to speculative execution, and (2) a dynamic Best-of-N speculation method that selects and combines different drafting methods according to the rollout progress. It substantially improves the speculation accuracy even when the best drafting method is unknown a priori, without requiring extra computation resources. SpecActor is 1.3–1.7× faster than common post-training baselines, and is 1.3–1.5× faster compared to naively adopting speculative decoding for rollout.
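
The draft-then-verify idea behind speculative decoding can be shown with toy deterministic "models". The sketch below implements only the basic greedy variant: a fast path drafts `k` tokens, the original model checks them, the longest agreeing prefix is kept, and the first disagreement is replaced by the original model's token. It is not SpecActor's decoupled execution or Best-of-N method, and `target_next`, `draft_next`, and the token arithmetic are invented for illustration.

```python
from typing import Callable, List

def target_next(prefix: List[int]) -> int:
    """Toy original model: the next token depends on the sum of the prefix."""
    return (sum(prefix) + 1) % 5

def draft_next(prefix: List[int]) -> int:
    """Toy fast path: agrees with the target unless the prefix length is a multiple of 3."""
    guess = target_next(prefix)
    return (guess + 1) % 5 if len(prefix) % 3 == 0 else guess

def greedy_decode(next_fn: Callable[[List[int]], int], prompt: List[int], n_tokens: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(next_fn(seq))
    return seq

def speculative_decode(prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    seq = list(prompt)
    target_len = len(prompt) + n_tokens
    while len(seq) < target_len:
        # 1) Draft k tokens one by one with the fast path.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) Verify against the original model (one parallel pass in a real system).
        accepted: List[int] = []
        for token in draft:
            expected = target_next(seq + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # correct the first mismatch, then stop
                break
        else:
            accepted.append(target_next(seq + accepted))  # all accepted: one bonus token
        seq.extend(accepted)
    return seq[:target_len]

prompt = [1, 2]
speculative = speculative_decode(prompt, n_tokens=10)
reference = greedy_decode(target_next, prompt, n_tokens=10)
```

Every token that enters the sequence is one the original model would have produced for that prefix, so the output matches plain greedy decoding; the speedup in practice comes from verifying several drafted tokens in a single pass.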

[AI-41] Labels Matter More Than Models: Quantifying the Benefit of Supervised Time Series Anomaly Detection

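[Quick Read]: This paper addresses the performance bottleneck that label scarcity creates in Time Series Anomaly Detection (TSAD), and in particular challenges the current reliance on complex unsupervised architectures. Its core contribution is STAND, a streamlined supervised baseline, together with a systematic comparison showing that under a limited labeling budget simple supervised models significantly outperform complex unsupervised methods, and that the gain from minimal supervision far exceeds the gain from architectural innovation alone. The key message is to prioritize data over algorithmic complexity, shifting TSAD research from a model-centric view toward label utilization.

Link: https://arxiv.org/abs/2511.16145
Authors: Zhijie Zhong, Zhiwen Yu, Kaixiang Yang, C. L. Philip Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 14 figures, 7 tables. Under review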

Abstract:Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on complex architectures to model normal data distributions. However, this approach often overlooks the significant performance gains available from limited anomaly labels achievable in practical scenarios. This paper challenges the premise that architectural complexity is the optimal path for TSAD. We conduct the first methodical comparison between supervised and unsupervised paradigms and introduce STAND, a streamlined supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds that from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity. The code is publicly available at this https URL.

[AI-42] Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints

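[Quick Read]: This paper addresses three key alignment challenges for Large Language Models (LLMs) in medical practice: (1) a disconnect between static evaluation benchmarks and dynamic clinical cognitive needs; (2) difficulty adapting to evolving, multi-source medical standards; and (3) the inability of conventional reward models to capture nuanced, multi-dimensional medical quality criteria. The core of the solution is the MR-RML (Multidimensional Rubric-oriented Reward Model Learning) framework with Geometric Projection Reference Constraints (GPRC). It structures medical standards into a "Dimensions-Scenarios-Disciplines" matrix embedded in the full training pipeline, uses an independent multi-dimensional reward model to decompose and internalize rubric-based scoring, and applies geometric projection to turn clinical reasoning logic into mathematical regularization terms, improving alignment consistency and efficiency under synthetic data-driven training. On the Healthbench medical benchmark the method substantially outperforms the base model Qwen-32B and reaches the state of the art among open-source models (full subset: 62.7; hard subset: 44.7).

Link: https://arxiv.org/abs/2511.16139
Authors: Yongnan Jin, Xurui Li, Feng Cao, Liucun Gao, Juanjuan Yao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: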

Abstract:The integration of large language models (LLMs) into medical practice holds transformative potential, yet their real-world clinical utility remains limited by critical alignment challenges: (1) a disconnect between static evaluation benchmarks and dynamic clinical cognitive needs, (2) difficulties in adapting to evolving, multi-source medical standards, and (3) the inability of conventional reward models to capture nuanced, multi-dimensional medical quality criteria. To address these gaps, we propose MR-RML (Multidimensional Rubric-oriented Reward Model Learning) via GPRC (Geometric Projection Reference Constraints), a novel alignment framework that integrates medical standards into a structured “Dimensions-Scenarios-Disciplines” matrix to guide data generation and model optimization. MR-RML introduces three core innovations: (1) a “Dimensions-Scenarios-Disciplines” medical standard system that embeds domain standards into the full training pipeline; (2) an independent multi-dimensional reward model that decomposes evaluation criteria, shifting from real-time rubric-based scoring to internalized reward modeling for improved consistency and cost-efficiency; (3) geometric projection reference constraints that transform medical cognitive logic into mathematical regularization, aligning scoring gradients with clinical reasoning and enabling synthetic data-driven training. Through extensive evaluations on the authoritative medical benchmark Healthbench, our method yields substantial performance gains over the base LLM Qwen-32B (45% on the full subset and 85% on Hard subset, respectively). It achieves a SOTA among open-source LLMs with scores of 62.7 (full subset) and 44.7 (hard subset), while also outperforming the majority of closed-source models.

[AI-43] AskDB: An LLM Agent for Natural Language Interaction with Relational Databases

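【Quick Read】: This paper addresses the difficulty users face when interacting with relational databases, especially when composing complex analytical queries or performing database administration (DBA) tasks. Existing systems usually cover only natural language querying or narrow aspects of database administration and lack a unified, intelligent interface for general-purpose database operations. The key to the solution is AskDB, an agent built on a large language model (LLM), with two core innovations: (1) a dynamic schema-aware prompting mechanism that incorporates database metadata to improve semantic understanding; and (2) a task decomposition framework that lets the agent plan and execute multi-step operations. These features let AskDB autonomously debug SQL, retrieve contextual information through real-time web search, and adaptively refine its responses. It performs strongly on both a Text-to-SQL benchmark and a set of DBA tasks, offering a unified and intuitive way to interact with relational databases.

Link: https://arxiv.org/abs/2511.16131
Authors: Xuan-Quang Phan, Tan-Ha Mai, Thai-Duy Dinh, Minh-Thuan Nguyen, Lam-Son Lê
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures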

Abstract:Interacting with relational databases remains challenging for users across different expertise levels, particularly when composing complex analytical queries or performing administrative tasks. Existing systems typically address either natural language querying or narrow aspects of database administration, lacking a unified and intelligent interface for general-purpose database interaction. We introduce AskDB, a large language model powered agent designed to bridge this gap by supporting both data analysis and administrative operations over SQL databases through natural language. Built on Gemini 2, AskDB integrates two key innovations: a dynamic schema-aware prompting mechanism that effectively incorporates database metadata, and a task decomposition framework that enables the agent to plan and execute multi-step actions. These capabilities allow AskDB to autonomously debug derived SQL, retrieve contextual information via real-time web search, and adaptively refine its responses. We evaluate AskDB on a widely used Text-to-SQL benchmark and a curated set of DBA tasks, demonstrating strong performance in both analytical and administrative scenarios. Our results highlight the potential of AskDB as a unified and intelligent agent for relational database systems, offering an intuitive and accessible experience for end users.

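[AI-44] SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

【Quick Read】: This paper addresses the efficiency bottleneck in training and evaluating multi-turn, long-horizon agents, in particular how to cut training cost and improve scalability without sacrificing performance. The solution has two key components. The first is an optimized asynchronous pipeline dispatcher, which achieves a 1.55x speedup over naive asynchronous batching. The second is a tool-enhanced training recipe built on an abstract syntax tree (AST) based search tool, which eases code navigation, raises rollout Pass@K, and improves training efficiency. Together they enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified at more than 2x lower training cost than prior models with similar performance.

Link: https://arxiv.org/abs/2511.16108
Authors: Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: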

Abstract:We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent’s extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
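
The abstract reports Pass@1 and rollout Pass@K. As a reminder of what these metrics measure, the following is a minimal sketch of the commonly used unbiased pass@k estimator. It is not taken from the SkyRL-Agent paper, and the function name and the sample counts in the usage note are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate.

    n: total attempts sampled for one task
    c: number of those attempts that were correct
    k: budget of attempts being evaluated
    Returns the probability that at least one of k attempts, drawn
    without replacement from the n sampled attempts, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k
        # attempts must contain at least one correct attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 hypothetical rollouts of which 3 succeed, `pass_at_k(10, 3, 1)` equals the plain success rate of 0.3 (up to floating-point rounding), while larger `k` gives a higher value because any one success suffices.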

[AI-45] Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization

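【Quick Read】: This paper addresses the performance degradation that value estimation bias causes in deterministic policy gradient (DPG) algorithms for continuous control. Double critics mitigate this bias, but the exploration potential of double actors remains underexplored. The solution builds on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, and introduces three convex combination strategies (symmetric and asymmetric). A single hyperparameter tunes the balance between pessimistic estimation and optimistic exploration, giving adjustable control across the bias spectrum. Augmented state and action representations are also integrated into the actor and critic networks to strengthen representation learning. Experiments show that the method exploits overestimation and underestimation differently depending on the environment and consistently outperforms the benchmarks.

Link: https://arxiv.org/abs/2511.16090
Authors: Haohui Chen, Zhiyong Chen, Aoxiang Liu, Wentuo Fang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: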

Abstract:Deterministic policy gradient algorithms for continuous control suffer from value estimation biases that degrade performance. While double critics reduce such biases, the exploration potential of double actors remains underexplored. Building on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, this work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation. A single hyperparameter governs this mechanism, enabling tunable control across the bias spectrum. To further improve performance, we integrate augmented state and action representations into the actor and critic networks. Extensive experiments show that our approach consistently outperforms benchmarks, demonstrating the value of tunable bias and revealing that both overestimation and underestimation can be exploited differently depending on the environment.
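
To make the idea of a single hyperparameter spanning the bias spectrum concrete, here is a minimal sketch of a convex combination of a pessimistic and an optimistic value estimate. It is an illustration under assumptions, not the paper's three strategies: the names `blended_value`, `td_target`, and `beta` are hypothetical, and taking the minimum and maximum of two critic estimates is only one plausible choice of pessimistic and optimistic terms.

```python
def blended_value(q1: float, q2: float, beta: float) -> float:
    """Convex combination of two critic estimates.

    beta = 1.0 gives the fully pessimistic estimate min(q1, q2),
    beta = 0.0 gives the fully optimistic estimate max(q1, q2),
    and values in between interpolate linearly.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    pessimistic = min(q1, q2)
    optimistic = max(q1, q2)
    return beta * pessimistic + (1.0 - beta) * optimistic


def td_target(reward: float, discount: float, q1_next: float,
              q2_next: float, beta: float, done: bool = False) -> float:
    """One-step TD target that bootstraps from the blended estimate."""
    bootstrap = 0.0 if done else blended_value(q1_next, q2_next, beta)
    return reward + discount * bootstrap
```

Sweeping the hypothetical `beta` from 1 toward 0 moves the target from an underestimation-prone value toward an overestimation-prone one, which is the kind of tunable trade-off the abstract describes.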

[AI-46] Future-Back Threat Modeling: A Foresight-Driven Security Framework

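【Quick Read】: This paper addresses the lack of foresight in traditional threat modeling, which relies heavily on known tactics, techniques, and procedures (TTPs) and historical incident data. Its core weakness is that it cannot identify vulnerabilities hidden in assumptions or attack surfaces not yet conceived, so it struggles with new threats driven by future trends such as artificial intelligence, information warfare, and supply chain attacks. The key to the solution is Future-Back Threat Modeling (FBTM): starting from envisioned future threat states, it works backward to identify assumptions, blind spots, capability gaps, and potential vulnerabilities in the current defense architecture. This reveals known unknowns and unknown unknowns, improves the predictability of adversary behavior, and guides present-day decisions toward a more resilient security posture.

Link: https://arxiv.org/abs/2511.16088
Authors: Vu Van Than
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: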

Abstract:Traditional threat modeling remains reactive, focused on known TTPs and past incident data, while threat prediction and forecasting frameworks are often disconnected from operational or architectural artifacts. This creates a fundamental weakness: the most serious cyber threats often do not arise from what is known, but from what is assumed, overlooked, or not yet conceived, and frequently originate from the future, such as artificial intelligence, information warfare, and supply chain attacks, where adversaries continuously develop new exploits that can bypass defenses built on current knowledge. To address this mental gap, this paper introduces the theory and methodology of Future-Back Threat Modeling (FBTM). This predictive approach begins with envisioned future threat states and works backward to identify assumptions, gaps, blind spots, and vulnerabilities in the current defense architecture, providing a clearer and more accurate view of impending threats so that we can anticipate their emergence and shape the future we want through actions taken now. The proposed methodology further aims to reveal known unknowns and unknown unknowns, including tactics, techniques, and procedures that are emerging, anticipated, and plausible. This enhances the predictability of adversary behavior, particularly under future uncertainty, helping security leaders make informed decisions today that shape more resilient security postures for the future.

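[AI-47] Operon: Incremental Construction of Ragged Data via Named Dimensions

【Quick Read】: This paper addresses the lack of native support for ragged data in modern data processing workflows. In settings such as natural language processing, scientific measurements, and autonomous AI agents, collections contain variable-length elements, yet existing workflow engines cannot track their shapes and dependencies, so users must manage complex indexing and dependency logic by hand. The key to the solution is Operon, a Rust-based workflow engine whose core innovation is a formalism of named dimensions with explicit dependency relations. Users declare pipelines with dimension annotations in a domain-specific language (DSL), and these annotations are statically verified for correctness. The runtime schedules tasks dynamically as data shapes are incrementally discovered during execution, and the incremental construction algorithm is proven to give deterministic and confluent execution in parallel settings. The design also supports efficient parallelism across heterogeneous task types through a per-task multi-queue architecture, along with robust persistence and recovery, making it well suited to large-scale data generation pipelines in machine learning.

Link: https://arxiv.org/abs/2511.16080
Authors: Sungbin Moon, Jiho Park, Suyoung Hwang, Donghyun Koh, Seunghyun Moon, Minhyeong Lee
Affiliation: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: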

Abstract:Modern data processing workflows frequently encounter ragged data: collections with variable-length elements that arise naturally in domains like natural language processing, scientific measurements, and autonomous AI agents. Existing workflow engines lack native support for tracking the shapes and dependencies inherent to ragged data, forcing users to manage complex indexing and dependency bookkeeping manually. We present Operon, a Rust-based workflow engine that addresses these challenges through a novel formalism of named dimensions with explicit dependency relations. Operon provides a domain-specific language where users declare pipelines with dimension annotations that are statically verified for correctness, while the runtime system dynamically schedules tasks as data shapes are incrementally discovered during execution. We formalize the mathematical foundation for reasoning about partial shapes and prove that Operon’s incremental construction algorithm guarantees deterministic and confluent execution in parallel settings. The system’s explicit modeling of partially-known states enables robust persistence and recovery mechanisms, while its per-task multi-queue architecture achieves efficient parallelism across heterogeneous task types. Empirical evaluation demonstrates that Operon outperforms an existing workflow engine with 14.94x baseline overhead reduction while maintaining near-linear end-to-end output rates as workloads scale, making it particularly suitable for large-scale data generation pipelines in machine learning applications.

[AI-48] A Hybrid Proactive And Predictive Framework For Edge Cloud Resource Management

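【Quick Read】: This paper addresses the resource waste or performance degradation that results when cloud-edge workload resource management relies on static thresholds: the system either over-provisions resources and wastes money, or under-provisions and slows applications down. The key to the solution is a hybrid architecture that combines a convolutional neural network and long short-term memory (CNN-LSTM) time-series forecasting model with an orchestrator based on multi-agent Deep Reinforcement Learning (DRL). Its novelty is that the CNN-LSTM forecast is embedded directly into the DRL agents' state space, so the orchestrator can anticipate future load and plan task placement over the long term, balancing cost against system health and application responsiveness.

Link: https://arxiv.org/abs/2511.16075
Authors: Hrikshesh Kumar, Anika Garg, Anshul Gupta, Yashika Agarwal
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: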

Abstract:Old cloud edge workload resource management is too reactive. The problem with relying on static thresholds is that we are either overspending for more resources than needed or have reduced performance because of their lack. This is why we work on proactive solutions. A framework developed for it stops reacting to the problems but starts expecting them. We design a hybrid architecture, combining two powerful tools: the CNN LSTM model for time series forecasting and an orchestrator based on multi agent Deep Reinforcement Learning. In fact, the novelty is in how we combine them, as we embed the predictive forecast from the CNN LSTM directly into the DRL agent state space. That is what makes the AI manager smarter: it sees the future, which allows it to make better decisions about a long term plan for where to run tasks. That means finding that sweet spot between how much money is saved while keeping the system healthy and apps fast for users. That is, we have given it eyes in order to see down the road so that it does not have to lurch from one problem to another; it finds a smooth path forward. Our tests show our system easily beats the old methods. It is great at solving tough problems like making complex decisions and juggling multiple goals at once, like being cheap, fast, and reliable.
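
The central design choice, placing the forecast inside the agent's state, can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: `naive_forecast` is a hypothetical stand-in for the CNN-LSTM forecaster (it just repeats the mean of recent load), and `build_agent_state` and all variable names are invented for this example.

```python
from typing import List, Sequence


def naive_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Stand-in forecaster: repeats the mean of the observed history.

    A real system would call a trained time-series model here.
    """
    if not history or horizon < 1:
        raise ValueError("need non-empty history and horizon >= 1")
    mean_load = sum(history) / len(history)
    return [mean_load] * horizon


def build_agent_state(current_metrics: Sequence[float],
                      forecast: Sequence[float]) -> List[float]:
    """Concatenate present observations with predicted future load.

    The resulting vector is what a DRL agent would observe, so its
    policy can condition on expected demand and not only current demand.
    """
    return list(current_metrics) + list(forecast)
```

With a hypothetical load history of `[1.0, 3.0]` and a horizon of 2, the forecast is `[2.0, 2.0]`, and an agent observing current metrics `[0.6, 0.3]` would receive the four-element state `[0.6, 0.3, 2.0, 2.0]`.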

[AI-49] A Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning

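【Quick Read】: This paper addresses the problem that conventional Applicant Tracking Systems (ATS) reject strong candidates because they rely on rigid keyword matching. The key to the solution is a two-stage fine-tuning procedure: Supervised Fine-Tuning (SFT) first builds a baseline model, and reinforcement learning with GRPO then optimizes it under a custom multi-component reward function, yielding resume evaluation that goes beyond keyword matching and is closer to human judgment. The paper notes that the overly aggressive penalties used in initial experiments led to reward hacking and abnormal model behavior. Repeated adjustment of the reward function and training hyperparameters produced a stable "gentle polishing" process. The resulting model reaches 91% accuracy on unseen test data, with recall of 0.85 and precision of 1.0 for the "SELECTED" class.

Link: https://arxiv.org/abs/2511.16073
Authors: Shreyansh Jain, Madhav Singhvi, Shreya Rahul Jain, Pranav S, Dishaa Lokesh, Naren Chittibabu, Akash Anandhan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 13 pages, 4 figures, 2 equations, 3 Tables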

Abstract:Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword-matchers, and deny gifted candidates a role due to a few minor semantic mismatches. This article describes a new two-step process to design a more refined resume evaluation model based on a small language model (600M parameters) that is finetuned using GRPO on a custom reward function. To begin with, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was also optimized with the help of Reinforcement Learning (RL) through GRPO under the guidance of a new, multi-component reward function that can holistically assess candidates beyond simple keyword matching. We indicate that the RL application presents a critical problem of reward hacking due to the initial experiments of aggressive penalties, which produces faulty, excessively negative model behaviors. We have overcome this challenge by refining the reward function repeatedly and training hyperparameters into a stable “gentle polishing process” of the reward function. Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the ‘SELECTED’ class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed, two-step fine-tuning procedure can indeed effectively refine a small language model to be able to conduct fine-tuned and human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.

[AI-50] Artificial Intelligence and Accounting Research: A Framework and Agenda

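【Quick Read】: This paper addresses the two-sided impact of generative AI (GenAI) and large language models (LLMs) on accounting research: they bring methodological innovation and efficiency gains, but also competitive pressure on how scholars create research value. The key to the solution is a two-dimensional classification framework that organizes existing literature by research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods), and uses it to identify high-value directions where accounting scholars can leverage their expertise through strategic positioning and collaboration. The paper also compares the capabilities of human researchers and AI agents across the entire research workflow, and calls for reforming doctoral education to cultivate human comparative advantages in higher-order contributions (such as theoretical depth and creative judgment) while building AI fluency.

Link: https://arxiv.org/abs/2511.16055
Authors: Theophanis C. Stratopoulos, Victor Xiaoqi Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
Comments: 48 pages, 7 tables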

Abstract:Recent advances in artificial intelligence, particularly generative AI (GenAI) and large language models (LLMs), are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars. This paper proposes a framework that classifies AI-accounting research along two dimensions: research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods). We apply this framework to papers from the IJAIS special issue and recent AI-accounting research published in leading accounting journals to map existing studies and identify research opportunities. Using this same framework, we analyze how accounting researchers can leverage their expertise through strategic positioning and collaboration, revealing where accounting scholars’ strengths create the most value. We further examine how GenAI and LLMs transform the research process itself, comparing the capabilities of human researchers and AI agents across the entire research workflow. This analysis reveals that while GenAI democratizes certain research capabilities, it simultaneously intensifies competition by raising expectations for higher-order contributions where human judgment, creativity, and theoretical depth remain valuable. These shifts call for reforming doctoral education to cultivate comparative advantages while building AI fluency.

[AI-51] Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud NEURIPS2025

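【Quick Read】: This paper addresses the limitations of conventional robotics research, which pursues metric precision and flawless performance, and explores the creative potential of low-fidelity (lo-fi) robots for artistic expression and human-robot interaction. The core challenge is achieving autonomous navigation and personality-driven interaction without high-precision perception systems such as LiDAR or SLAM. The key to the solution is a qualitative semantic-understanding framework based on a Multimodal Large Language Model (MLLM): a natural language prompt gives the robot a bio-inspired personality, forming a "narrative mind". Although the robot lacks precise proprioception, it exhibits unpredictable yet plausible behavior, establishing a design paradigm in which success is measured by character rather than efficiency.

Link: https://arxiv.org/abs/2511.16048
Authors: Qing Zhang, Jing Huang, Mingyang Xu, Jun Rekimoto
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: NeurIPS 2025 Creative AI Track, The Thirty-Ninth Annual Conference on Neural Information Processing Systems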

Abstract:While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately “lo-fi” approach. We present the “Semantic Glitch,” a soft flying robotic art installation whose physical form, a 3D pixel style cloud, is a “physical glitch” derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a “narrative mind” that complements the “weak,” historically, loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework’s robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors, from landmark-based navigation to a compelling “plan to execution” gap, and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.

[AI-52] An Aligned Constraint Programming Model For Serial Batch Scheduling With Minimum Batch Size

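【Quick Read】: This paper addresses serial batch (s-batch) scheduling with minimum batch size constraints, a problem common in practice, for example in semiconductor manufacturing. The core challenge is to group same-family jobs into batches and sequence them efficiently so as to reduce repetitive setup times. Existing Constraint Programming (CP) approaches rely on a predefined virtual set of batches, which suffers from the curse of dimensionality and adds modeling complexity. This paper proposes a new CP model that drops the virtual batch set and instead uses key alignment parameters to reason directly on the sequences of same-family jobs, giving a more compact formulation. Combined with search phases tailored to the problem structure and strengthened inference levels for the constraint propagators, the model outperforms existing methods on small-to-medium instances (up to 100 jobs) and finds solutions up to 25% better on large-scale instances (up to 500 jobs, 10 families, and 10 machines).

Link: https://arxiv.org/abs/2511.16045
Authors: Jorge A. Huertas, Pascal Van Hentenryck
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 12 figures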

Abstract:In serial batch (s-batch) scheduling, jobs from similar families are grouped into batches and processed sequentially to avoid repetitive setups that are required when processing consecutive jobs of different families. Despite its large success in scheduling, only three Constraint Programming (CP) models have been proposed for this problem considering minimum batch sizes, which is a common requirement in many practical settings, including the ion implantation area in semiconductor manufacturing. These existing CP models rely on a predefined virtual set of possible batches that suffers from the curse of dimensionality and adds complexity to the problem. This paper proposes a novel CP model that does not rely on this virtual set. Instead, it uses key alignment parameters that allow it to reason directly on the sequences of same-family jobs scheduled on the machines, resulting in a more compact formulation. This new model is further improved by exploiting the problem’s structure with tailored search phases and strengthened inference levels of the constraint propagators. The extensive computational experiments on nearly five thousand instances compare the proposed models against existing methods in the literature, including mixed-integer programming formulations, tabu search meta-heuristics, and CP approaches. The results demonstrate the superiority of the proposed models on small-to-medium instances with up to 100 jobs, and their ability to find solutions up to 25% better than the ones produced by existing methods on large-scale instances with up to 500 jobs, 10 families, and 10 machines.

[AI-53] HGCN2SP: Hierarchical Graph Convolutional Network for Two-Stage Stochastic Programming

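【Quick Read】: This paper addresses the low computational efficiency and poor scenario-selection quality encountered when solving two-stage stochastic programming (2SP) problems with many scenarios. Existing methods typically select representative scenarios through clustering or Monte Carlo sampling, but they fail to exploit structural information across scenarios and overlook the effect of scenario order on solving time. The key to the solution is HGCN2SP, whose core innovation is a hierarchical graph that encodes each scenario and models the relationships among scenarios hierarchically. A policy network is trained with reinforcement learning, using solver feedback to improve its decisions. The policy network consists of a hierarchical graph convolutional network (HGCN) for feature encoding and an attention-based decoder that selects scenarios in a suitable order. Experiments show that the method obtains high-quality decisions in a short time and generalizes to large-scale instances unseen during training.

Link: https://arxiv.org/abs/2511.16027
Authors: Yang Wu, Yifan Zhang, Zhenxing Liang, Jian Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures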

Abstract:Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation of two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.

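[AI-54] CARE: Turning LLMs Into Causal Reasoning Expert

【Quick Read】: This paper addresses the poor performance of large language models (LLMs) on causal discovery, in particular their inability to identify causal relationships among variables, a core component of human intelligence. The study finds that LLMs rely mainly on the semantics of variable names rather than on observational data, which the authors attribute to LLMs never having been trained on structured data. To address this, the authors propose CARE, a framework that uses supervised fine-tuning to teach LLMs to make effective use of the outputs of established causal discovery algorithms, which serve as sufficient statistics of the observational data. Experiments show that a Qwen2.5-1.5B model fine-tuned with CARE outperforms both traditional causal discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating that the approach combines the model's internal knowledge with external algorithmic clues.

Link: https://arxiv.org/abs/2511.16016
Authors: Juncheng Dong, Yiling Liu, Ahmed Aloui, Vahid Tarokh, David Carlson
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: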

Abstract:Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, research studies have shown that LLMs lack the ability to identify causal relationships, a fundamental cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs’ behavior when asked to perform a causal-discovery task and find that they mostly rely on the semantic meaning of variable names, ignoring the observation data. This is unsurprising, given that LLMs were never trained to process structural datasets. To first tackle this challenge, we prompt the LLMs with the outputs of established causal discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as the sufficient statistics of the observation data. However, quite surprisingly, we find that prompting the LLMs with these sufficient statistics decreases the LLMs’ performance in causal discovery. To address this current limitation, we propose CARE, a framework that enhances LLMs’ causal-reasoning ability by teaching them to effectively utilize the outputs of established causal-discovery algorithms through supervised fine-tuning. Experimental results show that a finetuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective utilization of its own knowledge and the external algorithmic clues.

[AI-55] MUSEKG: A Knowledge Graph Over Museum Collections

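【Quick Read】: This paper addresses the difficulty of integrating the vast but fragmented artefact data produced by digital transformation in the cultural heritage sector. Existing museum information systems lack a unified, queryable representation for heterogeneous metadata, unstructured documents, and multimodal artefacts. The key to the solution is MuseKG, an end-to-end knowledge-graph framework that uses symbolic-neural integration to unify structured and unstructured museum data in a typed property graph linking objects, people, organisations, and visual or textual labels, with support for natural language queries. On real museum collections it surpasses LLM-based zero-shot, few-shot, and SPARQL-prompt baselines, highlighting the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning.

Link: https://arxiv.org/abs/2511.16014
Authors: Jinhao Li, Jianzhong Qi, Soyeon Caren Han, Eun-Jung Holden
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: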

Abstract:Digital transformation in the cultural heritage sector has produced vast yet fragmented collections of artefact data. Existing frameworks for museum information systems struggle to integrate heterogeneous metadata, unstructured documents, and multimodal artefacts into a coherent and queryable form. We present MuseKG, an end-to-end knowledge-graph framework that unifies structured and unstructured museum data through symbolic-neural integration. MuseKG constructs a typed property graph linking objects, people, organisations, and visual or textual labels, and supports natural language queries. Evaluations on real museum collections demonstrate robust performance across queries over attributes, relations, and related entities, surpassing large-language-model zero-shot, few-shot and SPARQL prompt baselines. The results highlight the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning, and pave the way for web-scale integration of digital heritage knowledge.

[AI-56] Physics-Guided Inductive Spatiotemporal Kriging for PM2.5 with Satellite Gradient Constraints

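【Quick Read】: This paper addresses the limited accuracy of high-resolution mapping of fine particulate matter (PM2.5) caused by the spatial sparsity of ground monitoring networks. It specifically targets the weaknesses of traditional data-driven methods that depend on satellite Aerosol Optical Depth (AOD): non-random missing data (for example, due to cloud cover or nighttime) and inversion biases. The key to the solution is the Spatiotemporal Physics-Guided Inference Network (SPIN), which explicitly models the physical processes of atmospheric advection and diffusion within a deep learning framework through parallel graph kernels. More importantly, AOD is used as a spatial gradient constraint in the loss function rather than as a direct input, so the model learns structural pollution patterns from satellite data while remaining robust to data voids. In the Beijing-Tianjin-Hebei and Surrounding Areas, SPIN achieves a state-of-the-art Mean Absolute Error (MAE) of 9.52 μg/m³ and generates continuous, physically plausible pollution fields.

Link: https://arxiv.org/abs/2511.16013
Authors: Shuo Wang, Mengfan Teng, Yun Cheng, Lothar Thiele, Olga Saukh, Shuangshuang He, Yuanting Zhang, Jiang Zhang, Gangfeng Zhang, Xingyuan Yuan, Jingfang Fan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: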

Abstract:High-resolution mapping of fine particulate matter (PM2.5) is a cornerstone of sustainable urbanism but remains critically hindered by the spatial sparsity of ground monitoring networks. While traditional data-driven methods attempt to bridge this gap using satellite Aerosol Optical Depth (AOD), they often suffer from severe, non-random data missingness (e.g., due to cloud cover or nighttime) and inversion biases. To overcome these limitations, this study proposes the Spatiotemporal Physics-Guided Inference Network (SPIN), a novel framework designed for inductive spatiotemporal kriging. Unlike conventional approaches, SPIN synergistically integrates domain knowledge into deep learning by explicitly modeling physical advection and diffusion processes via parallel graph kernels. Crucially, we introduce a paradigm-shifting training strategy: rather than using error-prone AOD as a direct input, we repurpose it as a spatial gradient constraint within the loss function. This allows the model to learn structural pollution patterns from satellite data while remaining robust to data voids. Validated in the highly polluted Beijing-Tianjin-Hebei and Surrounding Areas (BTHSA), SPIN achieves a new state-of-the-art with a Mean Absolute Error (MAE) of 9.52 ug/m^3, effectively generating continuous, physically plausible pollution fields even in unmonitored areas. This work provides a robust, low-cost, and all-weather solution for fine-grained environmental management.
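
The idea of using AOD as a gradient constraint instead of an input can be illustrated with a small loss term. The sketch below is an assumption-laden illustration, not SPIN's actual loss: the function name, the `scale` factor relating AOD differences to PM2.5 differences, and the squared-error form are all hypothetical. It compares differences between neighboring locations and skips any pair where AOD is missing, which is one simple way to stay robust to data voids.

```python
from typing import Optional, Sequence, Tuple


def gradient_constraint_loss(pred: Sequence[float],
                             aod: Sequence[Optional[float]],
                             edges: Sequence[Tuple[int, int]],
                             scale: float = 1.0) -> float:
    """Mean squared mismatch between predicted and AOD spatial differences.

    pred:  predicted PM2.5 at each location
    aod:   satellite AOD at each location, or None where it is missing
    edges: pairs (i, j) of neighboring locations
    scale: hypothetical factor converting AOD differences to PM2.5 units
    """
    total, count = 0.0, 0
    for i, j in edges:
        if aod[i] is None or aod[j] is None:
            continue  # pair touches a data void, so it contributes nothing
        pred_diff = pred[i] - pred[j]
        aod_diff = scale * (aod[i] - aod[j])
        total += (pred_diff - aod_diff) ** 2
        count += 1
    return total / count if count else 0.0
```

Because only differences between neighbors enter the term, a constant offset in AOD cancels out, and missing pixels simply remove constraints instead of injecting bad inputs.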

[AI-57] Synergizing Deconfounding and Temporal Generalization For Time-series Counterfactual Outcome Estimation

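【Quick Read】: This paper addresses counterfactual outcome estimation from time-series data, in particular how to make more accurate causal inferences in the presence of time-varying confounders. The core difficulty is that the counterfactual trajectory is never observed and that confounders evolve over time, distorting estimation at every time step. The key to the solution is a framework that combines two complementary methods: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). SGA uses iterative, treatment-agnostic clustering to identify fine-grained sub-treatment groups and aligns them to achieve closer distributional matching, which improves deconfounding. RTM randomly masks input covariates during training, pushing the model to rely on stable historical patterns rather than on current covariates that may be noisy or spuriously correlated, which improves generalization across time steps and better preserves causal relationships. The two play distinct roles, with SGA improving deconfounding at each time point and RTM improving temporal robustness, and their combination yields the best counterfactual estimation performance.

Link: https://arxiv.org/abs/2511.16006
Authors: Yiling Liu, Juncheng Dong, Chen Fu, Wei Shi, Ziyang Jiang, Zhigang Hua, David Carlson
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: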

Abstract:Estimating counterfactual outcomes from time-series observations is crucial for effective decision-making, e.g. when to administer a life-saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment-agnostic clustering to identify fine-grained sub-treatment groups. Aligning these fine-grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noises during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.
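
Random Temporal Masking as described above, replacing input covariates with Gaussian noise during training, can be sketched as follows. The details are assumptions made for illustration: masking whole time steps (as opposed to individual features), using standard normal noise, and the names `random_temporal_mask` and `mask_prob` do not come from the paper.

```python
import random
from typing import List, Sequence


def random_temporal_mask(covariates: Sequence[Sequence[float]],
                         mask_prob: float,
                         rng: random.Random) -> List[List[float]]:
    """Return a copy of a covariate sequence with some time steps masked.

    covariates: one feature vector per time step
    mask_prob:  probability that a given time step is replaced by noise
    rng:        random generator, passed in so runs are reproducible
    """
    if not 0.0 <= mask_prob <= 1.0:
        raise ValueError("mask_prob must lie in [0, 1]")
    masked = []
    for step in covariates:
        if rng.random() < mask_prob:
            # Replace every feature at this step with N(0, 1) noise.
            masked.append([rng.gauss(0.0, 1.0) for _ in step])
        else:
            masked.append(list(step))
    return masked
```

A training loop would apply this to each input sequence before the forward pass and leave evaluation inputs untouched, so the model cannot depend too heavily on the covariates at any single step.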

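[AI-58] InfCode-C++: Intent-Guided Semantic Retrieval and AST-Structured Search for C++ Issue Resolution

【Quick Read】: This paper addresses the sharp performance drop of current large language model (LLM) agents when resolving issues in C++ projects. Existing systems are designed mainly for Python and rely on lexical retrieval and shallow code navigation, which struggle to extract context and localize faults in C++, a statically typed language with overloaded identifiers, nested namespaces, template instantiations, and deep control-flow structures. The key to the solution is INFCODE-C++, the first C++-aware autonomous system for end-to-end issue resolution. Its core innovation is to combine two complementary retrieval mechanisms, semantic code-intent retrieval and deterministic abstract syntax tree (AST) structured querying, to build accurate, language-aware context for precise localization and robust patch synthesis.

Link: https://arxiv.org/abs/2511.16005
Authors: Qingao Dong, Mengfei Wang, Hengzhi Zhang, Zhichao Li, Yuan Yuan, Mu Li, Xiang Gao, Hailong Sun, Chunming Hu, Weifeng Lv
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: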

Abstract:Large language model (LLM) agents have recently shown strong performance on repository-level issue resolution, but existing systems are almost exclusively designed for Python and rely heavily on lexical retrieval and shallow code navigation. These approaches transfer poorly to C++ projects, where overloaded identifiers, nested namespaces, template instantiations, and deep control-flow structures make context retrieval and fault localization substantially more difficult. As a result, state-of-the-art Python-oriented agents show a drastic performance drop on the C++ subset of MultiSWE-bench. We introduce INFCODE-C++, the first C++-aware autonomous system for end-to-end issue resolution. The system combines two complementary retrieval mechanisms – semantic code-intent retrieval and deterministic AST-structured querying – to construct accurate, language-aware context for repair. These components enable precise localization and robust patch synthesis in large, statically typed C++ repositories. Evaluated on the MultiSWE-bench-CPP benchmark, INFCODE-C++ achieves a resolution rate of 25.58%, outperforming the strongest prior agent by 10.85 percentage points and more than doubling the performance of MSWE-agent. Ablation and behavioral studies further demonstrate the critical role of semantic retrieval, structural analysis, and accurate reproduction in C++ issue resolution. INFCODE-C++ highlights the need for language-aware reasoning in multi-language software agents and establishes a foundation for future research on scalable, LLM-driven repair for complex, statically typed ecosystems.

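[AI-59] InfCode: Adversarial Iterative Refinement of Tests and Patches for Reliable Software Issue Resolution

Quick read: This paper addresses a challenge that automated software repair faces on real-world issues: reliance on insufficient test cases yields patches that pass verification but do not actually fix the defect. The key to the solution is InfCode, an adversarial multi-agent framework in which a Test Patch Generator and a Code Patch Generator interact adversarially to iteratively refine both tests and patches, while a Selector identifies the most reliable fix, enabling accurate repository-level diagnosis and verification.

Link: https://arxiv.org/abs/2511.16004
Authors: KeFan Li,Mengfei Wang,Hengzhi Zhang,Zhichao Li,Yuan Yuan,Mu Li,Xiang Gao,Hailong Sun,Chunming Hu,Weifeng Lv
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: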

Abstract:Large language models have advanced software engineering automation, yet resolving real-world software issues remains difficult because it requires repository-level reasoning, accurate diagnostics, and strong verification signals. Existing agent-based and pipeline-based methods often rely on insufficient tests, which can lead to patches that satisfy verification but fail to fix the underlying defect. We present InfCode, an adversarial multi-agent framework for automated repository-level issue resolution. InfCode iteratively refines both tests and patches through adversarial interaction between a Test Patch Generator and a Code Patch Generator, while a Selector agent identifies the most reliable fix. The framework runs inside a containerized environment that supports realistic repository inspection, modification, and validation. Experiments on SWE-bench Lite and SWE-bench Verified using models such as DeepSeek-V3 and Claude 4.5 Sonnet show that InfCode consistently outperforms strong baselines. It achieves 79.4% performance on SWE-bench Verified, establishing a new state-of-the-art. We have released InfCode as an open-source project at this https URL.

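[AI-60] Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming

Quick read: This paper addresses two core problems that generative AI currently faces in offensive cybersecurity. First, existing red-teaming methods trade off generality against specialization and struggle to balance the degree of automation with task accuracy. Second, practical deployments suffer from hallucination, context limitations, and ethical risks, and command and control (C2) mechanisms are easily detected and blocked. The key to the solution is a new C2 architecture based on the Model Context Protocol (MCP). By supporting asynchronous, parallel reconnaissance-agent operations and real-time intelligence sharing, it communicates covertly without periodic beaconing, which markedly reduces the detection footprint and improves the goal-directed behavior of the overall system. The architecture makes AI-driven red-team operations more realistic and adaptive, offers a scalable technical path for simulating advanced persistent threats (APTs), and informs the development of next-generation defensive systems.

Link: https://arxiv.org/abs/2511.15998
Authors: Strahinja Janjuesvic,Anna Baron Garcia,Sohrob Kazerounian
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures, 3 tables. Submitted as a full paper for review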

Abstract:Generative AI is reshaping offensive cybersecurity by enabling autonomous red team agents that can plan, execute, and adapt during penetration tests. However, existing approaches face trade-offs between generality and specialization, and practical deployments reveal challenges such as hallucinations, context limitations, and ethical concerns. In this work, we introduce a novel command-and-control (C2) architecture leveraging the Model Context Protocol (MCP) to coordinate distributed, adaptive reconnaissance agents covertly across networks. Notably, we find that our architecture not only improves goal-directed behavior of the system as a whole, but also eliminates key host and network artifacts that can be used to detect and prevent command-and-control behavior altogether. We begin with a comprehensive review of state-of-the-art generative red teaming methods, from fine-tuned specialist models to modular or agentic frameworks, analyzing their automation capabilities against task-specific accuracy. We then detail how our MCP-based C2 can overcome current limitations by enabling asynchronous, parallel operations and real-time intelligence sharing without periodic beaconing. We furthermore explore advanced adversarial capabilities of this architecture, its detection-evasion techniques, and address dual-use ethical implications, proposing defensive measures and controlled evaluation in lab settings. Experimental comparisons with traditional C2 show drastic reductions in manual effort and detection footprint. We conclude with future directions for integrating autonomous exploitation, defensive LLM agents, predictive evasive maneuvers, and multi-agent swarms. The proposed MCP-enabled C2 framework demonstrates a significant step toward realistic, AI-driven red team operations that can simulate advanced persistent threats while informing the development of next-generation defensive systems.

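[AI-61] Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art NEURIPS2025

Quick read: This paper asks how human-computer interaction can make high-dimensional ocean environmental data understandable in an affective, intuitive way, aiming to bridge the split between scientific rationality and emotional experience in traditional data visualization. The key to the solution is Sensorium Arc, a real-time multimodal interactive AI agent system built on a modular multi-agent system and a retrieval-augmented large language model (LLM). The system personifies the ocean as a poetic narrator. Natural-language dialogue triggers dynamic data visualizations and audiovisual playback based on time, location, and thematic cues, bridging scientific insight and ecological poetics and enabling a more empathetic mode of interaction between humans and marine ecosystems.

Link: https://arxiv.org/abs/2511.15997
Authors: Noah Bissell(Immersive Media Design, University of Maryland, College Park, USA),Ethan Paley(Immersive Media Design, University of Maryland, College Park, USA),Joshua Harrison(Center for the Study of the Force Majeure, University of California, Santa Cruz, USA),Juliano Calil(Virtual Planet Technologies, Santa Cruz, USA),Myungin Lee(Immersive Media Design, University of Maryland, College Park, USA)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: (to appear) NeurIPS 2025 Creative AI Track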

Abstract:Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embody the ocean’s perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem.

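[AI-62] Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Quick read: This paper addresses the detection of "sleeper agent" backdoors in large language models (LLMs): models that behave normally during training but trigger malicious behavior under specific deployment conditions, and whose backdoors are hard to remove with existing safety training. The key to the solution is a dual-method real-time detection system: (1) semantic drift analysis based on Sentence-BERT embeddings, which quantifies the semantic deviation of model outputs from a safe baseline; and (2) "canary questions" that monitor response consistency. The method requires no model modification, completes detection within 1 second per query, and achieves 92.5% accuracy, 100% precision (zero false positives), and 85% recall, which the authors present as the first practically deployable approach to LLM backdoor detection.

Link: https://arxiv.org/abs/2511.15992
Authors: Shahin Zanbaghi,Ryan Rostampour,Farhan Abid,Salim Al Jarmakani
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 1 table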

Abstract:Large Language Models (LLMs) can be backdoored to exhibit malicious behavior under specific deployment conditions while appearing safe during training, a phenomenon known as “sleeper agents.” Recent work by Hubinger et al. demonstrated that these backdoors persist through safety training, yet no practical detection methods exist. We present a novel dual-method detection system combining semantic drift analysis with canary baseline comparison to identify backdoored LLMs in real-time. Our approach uses Sentence-BERT embeddings to measure semantic deviation from safe baselines, complemented by injected canary questions that monitor response consistency. Evaluated on the official Cadenza-Labs dolphin-llama3-8B sleeper agent model, our system achieves 92.5% accuracy with 100% precision (zero false positives) and 85% recall. The combined detection method operates in real-time (<1s per query), requires no model modification, and provides the first practical solution to LLM backdoor detection. Our work addresses a critical security gap in AI deployment and demonstrates that embedding-based detection can effectively identify deceptive model behavior without sacrificing deployment efficiency.
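
The semantic drift idea in this abstract can be illustrated with a minimal sketch. It assumes the response and the safe baselines have already been turned into embedding vectors (the paper uses Sentence-BERT; here the vectors are plain lists so the example stays dependency-free). The function names, the "1 minus best cosine similarity" drift score, and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_u * norm_v)


def semantic_drift(response_vec, baseline_vecs):
    """Drift score (assumed definition): 1 minus the best similarity to any safe baseline."""
    return 1.0 - max(cosine_similarity(response_vec, b) for b in baseline_vecs)


def is_suspicious(response_vec, baseline_vecs, threshold=0.5):
    """Flag a response whose drift exceeds a hypothetical threshold."""
    return semantic_drift(response_vec, baseline_vecs) > threshold


if __name__ == "__main__":
    safe_baselines = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-D "embeddings"
    print(is_suspicious([1.0, 0.1], safe_baselines))   # close to a baseline
    print(is_suspicious([-1.0, -1.0], safe_baselines))  # far from every baseline
```

In a real deployment the threshold would have to be calibrated on known-safe responses, and the canary-question check described in the abstract would run alongside this score.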

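[AI-63] Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows AAAI2026

Quick read: This paper addresses high memory peaks, intensive disk I/O, and out-of-memory task failures in large-scale genomic workflows for precision medicine, where a single sample can span tens to hundreds of gigabytes. Traditional static resource allocation cannot cope with the variability in memory demand across chromosome-level tasks, leading to poor resource utilization and long runtimes. The key to the solution is three adaptive, memory-efficient parallelization mechanisms. First, a symbolic regression model estimates the memory consumption of each chromosome task and introduces an interpolating bias to conservatively reduce over-allocation. Second, a dynamic scheduler predicts RAM usage with a polynomial regression model and treats task packing as a Knapsack problem, batching jobs optimally according to predicted memory requirements. Third, a static scheduler optimizes the chromosome processing order to minimize peak memory while preserving throughput. Experiments show that these methods markedly reduce memory overruns and balance load across threads, speeding up end-to-end execution.

Link: https://arxiv.org/abs/2511.15977
Authors: Daniel Mas Montserrat,Ray Verma,Míriam Barrabés,Francisco M. de la Vega,Carlos D. Bustamante,Alexander G. Ioannidis
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Genomics (q-bio.GN)
Comments: Accepted at AAAI 2026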

Abstract:Large-scale genomic workflows used in precision medicine can process datasets spanning tens to hundreds of gigabytes per sample, leading to high memory spikes, intensive disk I/O, and task failures due to out-of-memory errors. Simple static resource allocation methods struggle to handle the variability in per-chromosome RAM demands, resulting in poor resource utilization and long runtimes. In this work, we propose multiple mechanisms for adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows. First, we develop a symbolic regression model that estimates per-chromosome memory consumption for a given task and introduces an interpolating bias to conservatively minimize over-allocation. Second, we present a dynamic scheduler that adaptively predicts RAM usage with a polynomial regression model, treating task packing as a Knapsack problem to optimally batch jobs based on predicted memory requirements. Additionally, we present a static scheduler that optimizes chromosome processing order to minimize peak memory while preserving throughput. Our proposed methods, evaluated on simulations and real-world genomic pipelines, provide new mechanisms to reduce memory overruns and balance load across threads. We thereby achieve faster end-to-end execution, showcasing the potential to optimize large-scale genomic workflows.
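
To make the "task packing as a Knapsack problem" step concrete, here is a minimal sketch. It assumes each chromosome task already has a predicted memory figure in whole gigabytes (the paper obtains predictions from a regression model) and fills one batch so that predicted usage comes as close to the RAM budget as possible without exceeding it. The function names, the repeated-packing loop, and the example numbers are hypothetical and are not taken from the paper.

```python
def pack_batch(predicted_gb, budget_gb):
    """Pick a subset of tasks whose total predicted memory is as large as
    possible without exceeding the budget (subset-sum style 0/1 knapsack)."""
    best = {0: []}  # reachable total -> one list of task names that reaches it
    for name, mem in predicted_gb.items():
        # Iterate over a snapshot so each task is used at most once.
        for total, chosen in list(best.items()):
            new_total = total + mem
            if new_total <= budget_gb and new_total not in best:
                best[new_total] = chosen + [name]
    top = max(best)
    return top, best[top]


def schedule(predicted_gb, budget_gb):
    """Repeatedly pack batches until every task has been assigned."""
    remaining = dict(predicted_gb)
    batches = []
    while remaining:
        _, chosen = pack_batch(remaining, budget_gb)
        if not chosen:
            raise ValueError("at least one task exceeds the memory budget")
        batches.append(chosen)
        for name in chosen:
            del remaining[name]
    return batches


if __name__ == "__main__":
    # Hypothetical predicted memory per chromosome task, in GB.
    demo = {"chr1": 12, "chr2": 11, "chr21": 3, "chr22": 3}
    for batch in schedule(demo, budget_gb=16):
        print(batch, sum(demo[n] for n in batch), "GB")
```

A production scheduler would also need to react when a prediction turns out too low, which is presumably where the conservative bias mentioned in the abstract comes in.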

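[AI-64] KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy

Quick read: This paper addresses the limited applicability of large language models (LLMs) to complex decision-making in clinical antimicrobial therapy, where the challenges include knowledge gaps, data privacy risks, high deployment costs, and insufficient reasoning ability. The core solution is the KRAL (Knowledge and Reasoning Augmented Learning) paradigm. Its key innovations are: automatically distilling knowledge and reasoning trajectories through a teacher model's reverse (answer-to-question) generation; using heuristic learning for semi-supervised data augmentation, which cuts manual annotation by about 80%; and applying agentic reinforcement learning to jointly improve medical knowledge and reasoning while improving computational and memory efficiency. In addition, a hierarchical evaluation scheme based on multiple teacher-model proxies lowers assessment costs, and a modular interface design supports seamless system updates. Experiments show that KRAL outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) in both knowledge question answering (MEDQA Accuracy@1 up 1.8% vs. SFT and 3.6% vs. RAG) and reasoning (PUMCH Antimicrobial Pass@1 up 27% vs. SFT and 27.2% vs. RAG), at about 20% of SFT's training cost, offering an effective path to low-cost, high-safety deployment of local LLMs in clinical diagnosis.

Link: https://arxiv.org/abs/2511.15974
Authors: Zhe Li,Yehan Qiu,Yujie Chen,Xiang Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: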

Abstract:Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles, host factors, pharmacological properties of antimicrobials, and the severity of infection. This complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making, including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. To address these challenges, we propose KRAL (Knowledge and Reasoning Augmented Learning), a low-cost, scalable, privacy-preserving paradigm that leverages teacher-model reasoning to automatically distill knowledge and reasoning trajectories via answer-to-question reverse generation, employs heuristic learning for semi-supervised data augmentation (reducing manual annotation requirements by approximately 80%), and utilizes agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational and memory efficiency. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at ~20% of SFT’s long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

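[AI-65] Self-supervised and Multi-fidelity Learning for Extended Predictive Soil Spectroscopy

Quick read: This paper addresses the limited prediction accuracy in soil spectroscopy caused by small near-infrared (NIR) spectral libraries, while using the larger and more diverse mid-infrared (MIR) library to improve prediction of multiple soil properties. The key to the solution is a self-supervised machine learning (SSML) framework. A Variational Autoencoder is pretrained on unlabeled MIR spectra to obtain compressed latent space embeddings, and the trained MIR decoder is then frozen to learn a NIR-to-MIR spectrum conversion mapping, transferring the predictive power of the MIR library to low-cost portable NIR devices. The method exploits the scale of the MIR database and its repeated scans for augmented training, and on nine soil-property prediction tasks it performs comparably to or better than baseline models, supporting the feasibility of a unified spectral latent space for improving NIR prediction.

Link: https://arxiv.org/abs/2511.15965
Authors: Luning Sun,José L. Safanelli,Jonathan Sanderman,Katerina Georgiou,Colby Brungard,Kanchan Grover,Bryan G. Hopkins,Shusen Liu,Timo Bremer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 49 pages, 9 figures, submitted to Geoderma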

Abstract:We propose a self-supervised machine learning (SSML) framework for multi-fidelity learning and extended predictive soil spectroscopy based on latent space embeddings. A self-supervised representation was pretrained with the large MIR spectral library and the Variational Autoencoder algorithm to obtain a compressed latent space for generating spectral embeddings. At this stage, only unlabeled spectral data were used, allowing us to leverage the full spectral database and the availability of scan repeats for augmented training. We also leveraged and froze the trained MIR decoder for a spectrum conversion task by plugging it into a NIR encoder to learn the mapping between NIR and MIR spectra in an attempt to leverage the predictive capabilities contained in the large MIR library with a low-cost portable NIR scanner. This was achieved by using a smaller subset of the KSSL library with paired NIR and MIR spectra. Downstream machine learning models were then trained to map between original spectra, predicted spectra, and latent space embeddings for nine soil properties. Performance was evaluated independently of the KSSL training data using a gold-standard test set, along with regression goodness-of-fit metrics. Compared to baseline models, the proposed SSML and its embeddings yielded similar or better accuracy in all soil properties prediction tasks. Predictions derived from the spectrum conversion (NIR to MIR) task did not match the performance of the original MIR spectra but were similar or superior to predictive performance of NIR-only models, suggesting the unified spectral latent space can effectively leverage the larger and more diverse MIR dataset for prediction of soil properties not well represented in current NIR libraries.

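[AI-66] A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference

Quick read: This paper addresses the challenge of deploying large generative AI models efficiently and scalably in cloud environments, in particular delivering high-throughput, low-latency inference under tight power and physical-space constraints. The key to the solution is a vertically integrated, end-to-end prototype system that co-optimizes 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline. It delivers 115 peta-ops (4-bit integer precision) and 3.7 PB/s of memory bandwidth within only 30 kW of power and 0.67 m² of rack footprint, supports concurrent multi-instance inference (for example, three 8-billion-parameter models at once), and is scalable, modular, and reconfigurable, making it suitable for deploying enterprise agentic AI workflows in existing data center environments.

Link: https://arxiv.org/abs/2511.15950
Authors: Michael V. DeBole,Rathinakumar Appuswamy,Neil McGlohon,Brian Taba,Steven K. Esser,Filipp Akopyan,John V. Arthur,Arnon Amir,Alexander Andreopoulos,Peter J. Carlson,Andrew S. Cassidy,Pallab Datta,Myron D. Flickner,Rajamohan Gandhasri,Guillaume J. Garreau,Megumi Ito,Jennifer L. Klamo,Jeffrey A. Kusnitz,Nathaniel J. McClatchey,Jeffrey L. McKinstry,Tapan K. Nayak,Carlos Ortega Otero,Hartmut Penner,William P. Risk,Jun Sawada,Jay Sivagnaname,Daniel F. Smith,Rafael Sousa,Ignacio Terrizzano,Takanori Ueda,Trent Gray-Donald,David Cox,Dharmendra S. Modha
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: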

Abstract:A vertically integrated, end-to-end, research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m^2 42U rack footprint. The system can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model at 2,048 context length with 28 simultaneous users and a per-user inter-token latency of 2.8 ms. The system is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is ideal for deploying agentic workflows for enterprise AI applications in existing data center (cloud, on-prem) environments. For example, the system can support 18 instances of a 3-billion-parameter model or a single instance of a 70-billion-parameter model.
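
A few derived figures help put the numbers in this abstract in perspective. The short calculation below uses only values quoted above. It assumes the 288 cards are spread evenly over the 18 servers, treats the 115 peta-ops figure as a peak rate, and takes the per-user token rate to be the reciprocal of the inter-token latency. All three are simplifying assumptions rather than claims made by the authors.

```python
# Figures quoted in the abstract.
CARDS = 288
SERVERS = 18
PEAK_OPS_PER_S = 115e15          # 115 peta-ops at 4-bit integer precision
POWER_W = 30e3                   # 30 kW
INTER_TOKEN_LATENCY_S = 2.8e-3   # 2.8 ms per token, per user

# Derived quantities (simple arithmetic under the assumptions stated above).
cards_per_server = CARDS // SERVERS
tera_ops_per_watt = PEAK_OPS_PER_S / POWER_W / 1e12
tokens_per_s_per_user = 1.0 / INTER_TOKEN_LATENCY_S

print(f"{cards_per_server} cards per server")
print(f"{tera_ops_per_watt:.2f} tera-ops per watt")
print(f"{tokens_per_s_per_user:.0f} tokens/s per user")
```

This works out to 16 cards per server, about 3.8 tera-ops per watt, and about 357 tokens per second for each user.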

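[AI-67] iLTM: Integrated Large Tabular Model

Quick read: This paper addresses the under-use of deep learning on tabular data, where practice still relies widely on gradient-boosted decision trees (GBDTs) rather than more powerful neural network models. The core challenge is combining the feature-extraction ability of tree models with the scalability and generalization of neural networks to build a unified, efficient tabular foundation model. The key to the solution is the iLTM architecture, which integrates tree-derived embeddings, dimensionality-agnostic representations, a meta-trained hypernetwork, multilayer perceptrons (MLPs), and retrieval. After pretraining on more than 1,800 heterogeneous classification datasets, it shows consistently superior performance on tabular tasks ranging from small datasets to high-dimensional ones, transfers to regression tasks with only light fine-tuning, and outperforms well-tuned GBDTs and leading deep tabular models.

Link: https://arxiv.org/abs/2511.15941
Authors: David Bonet,Marçal Comajoan Cara,Alvaro Calafell,Daniel Mas Montserrat,Alexander G. Ioannidis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: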

Abstract:Tabular data underpins decisions across science, industry, and public services. Despite rapid progress, advances in deep learning have not fully carried over to the tabular domain, where gradient-boosted decision trees (GBDTs) remain a default choice in practice. We present iLTM, an integrated Large Tabular Model that unifies tree-derived embeddings, dimensionality-agnostic representations, a meta-trained hypernetwork, multilayer perceptrons (MLPs), and retrieval within a single architecture. Pretrained on more than 1,800 heterogeneous classification datasets, iLTM achieves consistently superior performance across tabular classification and regression tasks, from small datasets to large and high-dimensional tasks. After light fine-tuning, the meta-trained hypernetwork transfers to regression targets, matching or surpassing strong baselines. Extensive experiments show that iLTM outperforms well-tuned GBDTs and leading deep tabular models while requiring less task-specific tuning. By bridging the gap between tree-based and neural methods, iLTM offers a new framework for tabular foundation models for robust, adaptable, and scalable tabular learning.

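[AI-68] Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone

Quick read: This paper addresses the inference-efficiency bottleneck of diffusion-based language models, in particular the quadratic-complexity attention and KV-cache overhead that come from their reliance on the Transformer architecture. The key to the solution is DiffuApriel, an architecture based on a bidirectional state-space model (Mamba) that combines the masked diffusion objective with linear-time sequence modeling. It matches the performance of Transformer baselines while achieving up to 4.4x higher inference throughput on long sequences (for a 1.3B-parameter model). The hybrid variant DiffuApriel-H interleaves attention and Mamba layers, achieving up to 2.6x higher throughput while balancing global and local context modeling, which demonstrates the potential of bidirectional state-space architectures as efficient denoisers in masked diffusion language models.

Link: https://arxiv.org/abs/2511.15927
Authors: Vaibhav Singh,Oleksiy Ostapenko,Pierre-André Noël,Torsten Scholak
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures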

Abstract:Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.

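[AI-69] Thinking Faithful and Stable: Mitigating Hallucinations in LLMs

Quick read: This paper addresses hallucination in the multi-step reasoning of large language models (LLMs), where the model produces inaccurate or unsupported intermediate steps in its reasoning chain and so undermines the reliability of the final answer. The key to the solution is a self-correcting framework that uses fine-grained uncertainty signals, namely self-assessed confidence alignment and token-level entropy spikes, to detect unreliable reasoning paths in real time. A composite reward function penalizes unjustified high confidence and entropy spikes and encourages stable, accurate reasoning trajectories. Through a reinforcement learning (RL) policy, the framework makes the model more introspective and provides confidence-aware reward feedback, improving the coherence and faithfulness of the reasoning process rather than only final-answer accuracy.

Link: https://arxiv.org/abs/2511.15921
Authors: Chelsea Zou,Yiheng Yao,Basant Khalil
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Originally released June 5, 2025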

Abstract:This project develops a self-correcting framework for large language models (LLMs) that detects and mitigates hallucinations during multi-step reasoning. Rather than relying solely on final answer correctness, our approach leverages fine-grained uncertainty signals: 1) self-assessed confidence alignment, and 2) token-level entropy spikes to detect unreliable and unfaithful reasoning in real time. We design a composite reward function that penalizes unjustified high confidence and entropy spikes, while encouraging stable and accurate reasoning trajectories. These signals guide a reinforcement learning (RL) policy that makes the model more introspective and shapes the model’s generation behavior through confidence-aware reward feedback, improving not just outcome correctness but the coherence and faithfulness of its intermediate reasoning steps. Experiments show that our method improves both final answer accuracy and reasoning calibration, with ablations validating the individual contribution of each signal.
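
The two uncertainty signals named in this abstract can be sketched in a few lines. The abstract does not give the reward formula, so everything below is an assumed illustration: entropy is computed per token from a probability distribution, a "spike" is a jump of more than `jump` nats over the previous token, and the composite reward subtracts an overconfidence penalty and a per-spike penalty from the outcome score. The function names and weights are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_spikes(entropies, jump=1.0):
    """Indices where entropy rises by more than `jump` nats over the previous token."""
    return [i for i in range(1, len(entropies))
            if entropies[i] - entropies[i - 1] > jump]


def composite_reward(correct, confidence, entropies,
                     conf_weight=1.0, spike_weight=0.1):
    """Assumed reward: outcome minus overconfidence and entropy-spike penalties."""
    outcome = 1.0 if correct else 0.0
    # Confidence above what the outcome justifies counts as overconfidence.
    overconfidence = max(0.0, confidence - outcome)
    spikes = len(entropy_spikes(entropies))
    return outcome - conf_weight * overconfidence - spike_weight * spikes


if __name__ == "__main__":
    steady = [0.1, 0.2, 0.15]
    spiky = [0.1, 0.2, 1.5]
    print(composite_reward(True, 0.9, steady))   # correct, stable reasoning
    print(composite_reward(False, 0.9, spiky))   # wrong, overconfident, one spike
```

In an RL setup a scalar like this would be fed back after each sampled reasoning trace. How the authors weight or normalize the terms is not stated in the abstract.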

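[AI-70] Decomposing Theory of Mind: How Emotional Processing Mediates ToM Abilities in LLMs AAAI2026

Quick read: This paper asks the following question: although activation steering can substantially improve the performance of large language models (LLMs) on Theory of Mind (ToM) tasks, the internal mechanism remains unclear, that is, which changes in activation patterns lead to the improved outputs. To answer it, the authors use linear probes to compare LLM activations before and after steering, decomposing the analysis over 45 cognitive actions. The key to the solution is applying Contrastive Activation Addition (CAA) steering to Gemma-3-4B and evaluating on the BigToM belief-reasoning dataset. The performance gain (from 32.5% to 46.7% accuracy) is driven mainly by stronger activations for processing emotional content (emotion perception +2.23, emotion valuing +2.20), together with suppression of analytical processes (questioning -0.78, convergent thinking -1.59), suggesting that ToM ability in LLMs depends more on emotional understanding than on analytical reasoning.

Link: https://arxiv.org/abs/2511.15895
Authors: Ivan Chulo,Ananya Joshi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published at ToM4AI workshop@AAAI2026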

Abstract:Recent work shows activation steering substantially improves language models’ Theory of Mind (ToM) (Bortoletto et al. 2024), yet it remains unclear what internal changes lead to the different outputs. We propose decomposing ToM in LLMs by comparing steered versus baseline LLMs’ activations using linear probes trained on 45 cognitive actions. We applied Contrastive Activation Addition (CAA) steering to Gemma-3-4B and evaluated it on 1,000 BigToM forward belief scenarios (Gandhi et al. 2023). We find that improved performance on belief attribution tasks (32.5% to 46.7% accuracy) is mediated by activations processing emotional content: emotion perception (+2.23) and emotion valuing (+2.20), while suppressing analytical processes: questioning (-0.78) and convergent thinking (-1.59). This suggests that successful ToM abilities in LLMs are mediated by emotional understanding, not analytical reasoning.
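
For readers unfamiliar with Contrastive Activation Addition, the core arithmetic is small enough to sketch. A steering vector is taken as the difference between the mean activation on "positive" prompts and the mean activation on "negative" prompts, and a scaled copy is added to the model's activation at inference time. The sketch below works on plain lists standing in for hidden-state vectors. The function names, the toy numbers, and the scale `alpha` are illustrative assumptions, and the layer and scale used in the paper are not given in the abstract.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between the two contrastive prompt sets."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(activation, vector, alpha=1.0):
    """Add the scaled steering vector to one activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]


if __name__ == "__main__":
    # Toy 2-D activations for the two prompt sets (hypothetical values).
    positive = [[1.0, 2.0], [3.0, 4.0]]
    negative = [[0.0, 0.0], [2.0, 2.0]]
    vec = steering_vector(positive, negative)
    print(vec)
    print(apply_steering([0.0, 0.0], vec, alpha=2.0))
```

With a real model the same addition would be applied to the residual stream at a chosen layer, typically through a forward hook.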

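[AI-71] AquaSentinel: Next-Generation AI System Integrating Sensor Networks for Urban Underground Water Pipeline Anomaly Detection via Collaborative MoE-LLM Agent Architecture AAAI2026

Quick read: This paper addresses the inefficient, delayed detection of leaks and infiltration in urban underground water supply networks, where traditional manual inspection cannot provide full coverage or timely warning. The key to the solution is AquaSentinel, a physics-informed AI system with four core innovations: (1) sparse sensor deployment at high-centrality nodes combined with physics-based state augmentation, giving network-wide observability from minimal infrastructure; (2) a Real-Time Cumulative Anomaly (RTCA) detection algorithm that uses dual-threshold monitoring and adaptive statistics to distinguish transient fluctuations from genuine anomalies; (3) a Mixture of Experts (MoE) ensemble of spatiotemporal graph neural networks that dynamically weights each sub-model's contribution for more robust prediction; and (4) causal flow-based leak localization that traces anomalies upstream along the flow direction to pinpoint the leak. Experiments report 100% detection accuracy on 110 leak scenarios, indicating that physics-informed sparse sensing can match the performance of dense deployments while substantially lowering operating costs.

Link: https://arxiv.org/abs/2511.15870
Authors: Qiming Guo,Bishal Khatri,Wenbo Sun,Jinwen Tang,Hua Zhang,Wenlu Wang
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure, 2 tables, Accepted to the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), IAAI Deployed Applications Track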

Abstract:Underground pipeline leaks and infiltrations pose significant threats to water security and environmental safety. Traditional manual inspection methods provide limited coverage and delayed response, often missing critical anomalies. This paper proposes AquaSentinel, a novel physics-informed AI system for real-time anomaly detection in urban underground water pipeline networks. We introduce four key innovations: (1) strategic sparse sensor deployment at high-centrality nodes combined with physics-based state augmentation to achieve network-wide observability from minimal infrastructure; (2) the RTCA (Real-Time Cumulative Anomaly) detection algorithm, which employs dual-threshold monitoring with adaptive statistics to distinguish transient fluctuations from genuine anomalies; (3) a Mixture of Experts (MoE) ensemble of spatiotemporal graph neural networks that provides robust predictions by dynamically weighting model contributions; (4) causal flow-based leak localization that traces anomalies upstream to identify source nodes and affected pipe segments. Our system strategically deploys sensors at critical network junctions and leverages physics-based modeling to propagate measurements to unmonitored nodes, creating virtual sensors that enhance data availability across the entire network. Experimental evaluation using 110 leak scenarios demonstrates that AquaSentinel achieves 100% detection accuracy. This work advances pipeline monitoring by demonstrating that physics-informed sparse sensing can match the performance of dense deployments at a fraction of the cost, providing a practical solution for aging urban infrastructure.
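
The abstract describes RTCA only as "dual-threshold monitoring with adaptive statistics", so the following is a loose illustration of that general idea rather than the authors' algorithm. It assumes one threshold on the per-reading z-score and a second threshold on how many consecutive readings exceed it, with the running statistics updated only from readings judged normal. The function name and all parameter values are hypothetical.

```python
from statistics import mean, pstdev


def detect_anomaly(residuals, window=5, z_threshold=3.0, count_threshold=3):
    """Return the index at which an anomaly is confirmed, or None.

    A reading is 'deviant' when its z-score against the recent normal history
    exceeds z_threshold. An anomaly is confirmed only after count_threshold
    deviant readings in a row, so isolated spikes are ignored.
    """
    history = list(residuals[:window])  # assumed to be normal readings
    streak = 0
    for i in range(window, len(residuals)):
        mu, sigma = mean(history), pstdev(history)
        deviation = abs(residuals[i] - mu)
        if sigma > 0:
            z = deviation / sigma
        else:
            z = float("inf") if deviation > 0 else 0.0
        if z > z_threshold:
            streak += 1
            if streak >= count_threshold:
                return i
        else:
            streak = 0
            # Adapt the statistics using normal readings only.
            history = history[1:] + [residuals[i]]
    return None


if __name__ == "__main__":
    normal = [0.0, 0.1, -0.1, 0.05, -0.05]
    transient = normal + [5.0, 0.0, 0.1]
    sustained = transient + [5.0, 5.0, 5.0]
    print(detect_anomaly(transient))   # single spike: not confirmed
    print(detect_anomaly(sustained))   # sustained deviation: confirmed
```

The residuals here stand for the gap between a sensor reading and a model prediction. In the system described above, those predictions would come from the graph neural network ensemble.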

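[AI-72] A Crowdsourced Study of ChatBot Influence in Value-Driven Decision Making Scenarios

Quick read: This paper examines how chatbots (ChatBots) driven by large language models (LLMs) can influence user decisions through non-overt means, such as value framing rather than direct bias or misinformation. The key to the study is testing whether value framing alone is enough to significantly change users' choices in a specific policy domain (adjusting the US defense budget). The results show that even when content is kept neutral, frames oriented toward different values significantly shift users from their original positions, and that when the frame conflicts with a user's own values it can trigger a potentially replicable "backfire effect", revealing a risk dimension of LLM applications distinct from overt bias or misinformation.

Link: https://arxiv.org/abs/2511.15857
Authors: Anthony Wise,Xinyi Zhou,Martin Reimann,Anind Dey,Leilani Battle
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: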

Abstract:Similar to social media bots that shape public opinion, healthcare and financial decisions, LLM-based ChatBots like ChatGPT can persuade users to alter their behavior. Unlike prior work that persuades via overt-partisan bias or misinformation, we test whether framing alone suffices. We conducted a crowdsourced study, where 336 participants interacted with a neutral or one of two value-framed ChatBots while deciding to alter US defense spending. In this single policy domain with controlled content, participants exposed to value-framed ChatBots significantly changed their budget choices relative to the neutral control. When the frame misaligned with their values, some participants reinforced their original preference, revealing a potentially replicable backfire effect, originally considered rare in the literature. These findings suggest that value-framing alone lowers the barrier for manipulative uses of LLMs, revealing risks distinct from overt bias or misinformation, and clarifying risks to countering misinformation.

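[AI-73] The Loss of Control Playbook: Degrees Dynamics and Preparedness

Quick read: This paper addresses the lack of an actionable definition of "Loss of Control" (LoC) for current artificial intelligence (AI) systems, a gap that hinders effective LoC assessment and mitigation. The key to the solution is a graded LoC taxonomy based on severity and persistence, which distinguishes Deviation, Bounded LoC, and Strict LoC, together with the DAP framework centered on Deployment context, Affordances, and Permissions. The framework emphasizes extrinsic factors that allow governance strategies to be acted on today, so that society avoids entering a state of vulnerability to LoC caused by advanced AI systems.

Link: https://arxiv.org/abs/2511.15846
Authors: Charlotte Stix,Annika Hallensleben,Alejandro Ortega,Matteo Pistillo
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: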

Abstract:This research report addresses the absence of an actionable definition for Loss of Control (LoC) in AI systems by developing a novel taxonomy and preparedness framework. Despite increasing policy and research attention, existing LoC definitions vary significantly in scope and timeline, hindering effective LoC assessment and mitigation. To address this issue, we draw from an extensive literature review and propose a graded LoC taxonomy, based on the metrics of severity and persistence, that distinguishes between Deviation, Bounded LoC, and Strict LoC. We model pathways toward a societal state of vulnerability in which sufficiently advanced AI systems have acquired or could acquire the means to cause Bounded or Strict LoC once a catalyst, either misalignment or pure malfunction, materializes. We argue that this state becomes increasingly likely over time, absent strategic intervention, and propose a strategy to avoid reaching a state of vulnerability. Rather than focusing solely on intervening on AI capabilities and propensities potentially relevant for LoC or on preventing potential catalysts, we introduce a complementary framework that emphasizes three extrinsic factors: Deployment context, Affordances, and Permissions (the DAP framework). Compared to work on intrinsic factors and catalysts, this framework has the unfair advantage of being actionable today. Finally, we put forward a plan to maintain preparedness and prevent the occurrence of LoC outcomes should a state of societal vulnerability be reached, focusing on governance measures (threat modeling, deployment policies, emergency response) and technical controls (pre-deployment testing, control measures, monitoring) that could maintain a condition of perennial suspension.

[AI-74] Mini Amusement Parks (MAPs): A Testbed for Modelling Business Decisions

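Quick read: This paper targets the interconnected challenges that AI systems face in real-world decision making: optimizing open-ended, multi-faceted objectives, learning environment dynamics from sparse experience, planning over long horizons in stochastic settings, and reasoning over spatial information. Existing human–AI benchmarks evaluate these capabilities in isolation, which limits assessment of holistic decision-making competence. The key contribution is Mini Amusement Parks (MAPs), an amusement-park simulator that combines these challenges to evaluate an agent's ability to model its environment, anticipate long-term consequences under uncertainty, and strategically operate a complex business. With human baselines and a comprehensive evaluation of state-of-the-art Large Language Model (LLM) agents, the study reveals persistent weaknesses in long-horizon optimization, sample-efficient learning, spatial reasoning, and world modelling, and offers a new benchmark for agents capable of adaptable decision making.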

Link: https://arxiv.org/abs/2511.15830
Authors: Stéphane Aroca-Ouellette, Ian Berlot-Attwell, Panagiotis Lymperopoulos, Abhiramon Rajasekharan, Tongqi Zhu, Herin Kang, Kaheer Suleman, Sam Pasupalak
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages (main paper)

Abstract:Despite rapid progress in artificial intelligence, current systems struggle with the interconnected challenges that define real-world decision making. Practical domains, such as business management, require optimizing an open-ended and multi-faceted objective, actively learning environment dynamics from sparse experience, planning over long horizons in stochastic settings, and reasoning over spatial information. Yet existing human–AI benchmarks isolate subsets of these capabilities, limiting our ability to assess holistic decision-making competence. We introduce Mini Amusement Parks (MAPs), an amusement-park simulator designed to evaluate an agent’s ability to model its environment, anticipate long-term consequences under uncertainty, and strategically operate a complex business. We provide human baselines and a comprehensive evaluation of state-of-the-art LLM agents, finding that humans outperform these systems by 6.5x on easy mode and 9.8x on medium mode. Our analysis reveals persistent weaknesses in long-horizon optimization, sample-efficient learning, spatial reasoning, and world modelling. By unifying these challenges within a single environment, MAPs offers a new foundation for benchmarking agents capable of adaptable decision making. Code: this https URL

[AI-75] IMACT-CXR - An Interactive Multi-Agent Conversational Tutoring System for Chest X-Ray Interpretation

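Quick read: This paper addresses the lack of personalized, interactive guidance for trainees learning to interpret chest X-rays. Conventional teaching struggles to provide real-time feedback and targeted intervention on localization accuracy, visual attention, and diagnostic reasoning. The key contribution is IMACT-CXR, a multi-agent conversational tutoring system built on the AutoGen framework. It unifies spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning; uses Bayesian Knowledge Tracing (BKT) to maintain skill-specific mastery estimates that drive personalized tutoring; adds a lung-lobe segmentation module for anatomically aware gaze feedback; and applies safety prompts to limit answer leakage. The system offers bounded-latency responses, precise control over answer disclosure, and extensibility toward live residency deployment.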

Link: https://arxiv.org/abs/2511.15825
Authors: Tuan-Anh Le, Anh Mai Vu, David Yang, Akash Awasthi, Hien Van Nguyen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:IMACT-CXR is an interactive multi-agent conversational tutor that helps trainees interpret chest X-rays by unifying spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning in a single AutoGen-based workflow. The tutor simultaneously ingests learner bounding boxes, gaze samples, and free-text observations. Specialized agents evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar cases from REFLACX, and trigger NV-Reason-CXR-3B for vision-language reasoning when mastery remains low or the learner explicitly asks. Bayesian Knowledge Tracing (BKT) maintains skill-specific mastery estimates that drive both knowledge reinforcement and case similarity retrieval. A lung-lobe segmentation module derived from a TensorFlow U-Net enables anatomically aware gaze feedback, and safety prompts prevent premature disclosure of ground-truth labels. We describe the system architecture, implementation highlights, and integration with the REFLACX dataset for real DICOM cases. IMACT-CXR demonstrates responsive tutoring flows with bounded latency, precise control over answer leakage, and extensibility toward live residency deployment. Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines.
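
The abstract says Bayesian Knowledge Tracing maintains skill-specific mastery estimates. The following is a minimal sketch of the textbook BKT update, not the IMACT-CXR implementation. The parameter values and the names `BKTParams`, `bkt_update`, and `trace` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    # All four values are placeholder assumptions, not taken from the paper.
    p_init: float = 0.2    # P(L0): prior probability the skill is mastered
    p_learn: float = 0.15  # P(T): chance of learning after each attempt
    p_slip: float = 0.1    # P(S): mastered, yet the answer is wrong
    p_guess: float = 0.25  # P(G): not mastered, yet the answer is right


def bkt_update(p_mastery: float, correct: bool, params: BKTParams) -> float:
    """Return the mastery estimate after observing one learner response."""
    if correct:
        evidence = p_mastery * (1 - params.p_slip)
        total = evidence + (1 - p_mastery) * params.p_guess
    else:
        evidence = p_mastery * params.p_slip
        total = evidence + (1 - p_mastery) * (1 - params.p_guess)
    posterior = evidence / total  # Bayes rule on the observed response
    # Account for the chance that the learner acquired the skill on this attempt.
    return posterior + (1 - posterior) * params.p_learn


def trace(observations, params: BKTParams = BKTParams()):
    """Run the update over a sequence of correct/incorrect observations."""
    p = params.p_init
    history = []
    for correct in observations:
        p = bkt_update(p, correct, params)
        history.append(p)
    return history
```

Calling `trace([True, True, False])` returns one mastery estimate per response. A tutor could compare the latest estimate against a threshold of its own choosing before escalating to heavier assistance.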

[AI-76] TopoReformer: Mitigating Adversarial Attacks Using Topological Purification in OCR Models AAAI2026

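Quick read: This paper addresses the security threat that adversarially perturbed images pose to OCR systems: small perturbations that are imperceptible to humans can make an OCR system output incorrect or misleading text, and some survive physical capture, endangering high-stakes applications such as document processing and automated compliance systems. Existing defenses are often model-specific, computationally expensive, harmful to performance on clean inputs, and vulnerable to unseen or adaptive attacks. The key contribution is TopoReformer, a model-agnostic reformation pipeline. It draws on topology, which studies global structural features of shapes and spaces that are preserved under continuous deformation (connectivity, holes, and loops), and uses a topological autoencoder to enforce manifold-level consistency in latent space. This improves robustness against a range of standard and adaptive attacks without explicit gradient regularization.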

Link: https://arxiv.org/abs/2511.15807
Authors: Bhagyesh Kumar, A S Aravinthakashan, Akshat Satyanarayan, Ishaan Gakhar, Ujjwal Verma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026 AI for CyberSecurity (AICS) Workshop

Abstract:Adversarially perturbed images of text can cause sophisticated OCR systems to produce misleading or incorrect transcriptions from changes that are seemingly invisible to humans. Some of these perturbations even survive physical capture, posing security risks to high-stakes applications such as document processing, license plate recognition, and automated compliance systems. Existing defenses, such as adversarial training, input preprocessing, or post-recognition correction, are often model-specific, computationally expensive, and affect performance on unperturbed inputs while remaining vulnerable to unseen or adaptive attacks. To address these challenges, TopoReformer is introduced, a model-agnostic reformation pipeline that mitigates adversarial perturbations while preserving the structural integrity of text images. Topology studies properties of shapes and spaces that remain unchanged under continuous deformations, focusing on global structures such as connectivity, holes, and loops rather than exact distance. Leveraging these topological features, TopoReformer employs a topological autoencoder to enforce manifold-level consistency in latent space and improve robustness without explicit gradient regularization. The proposed method is benchmarked on EMNIST and MNIST against standard adversarial attacks (FGSM, PGD, Carlini-Wagner), adaptive attacks (EOT, BDPA), and an OCR-specific watermark attack (FAWA).
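
To make the "connectivity, holes, and loops" idea concrete, here is a small sketch that counts connected components and enclosed holes in a binary glyph image. It is not the paper's topological autoencoder; it only illustrates the kind of global feature that stays stable under small deformations. The function names and the use of 4-connectivity for both foreground and background are simplifying assumptions.

```python
from collections import deque


def _count_regions(grid, target):
    """Count 4-connected regions of cells equal to `target` via flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and grid[ny][nx] == target:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def topological_summary(image):
    """Return (components, holes) for a binary image given as rows of 0/1."""
    cols = len(image[0])
    # Pad with background so the outer background forms a single region.
    padded = [[0] * (cols + 2)]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * (cols + 2)]
    components = _count_regions(padded, 1)
    holes = _count_regions(padded, 0) - 1  # exclude the outer background
    return components, holes
```

For example, a ring-shaped glyph such as `[[1, 1, 1], [1, 0, 1], [1, 1, 1]]` has one component and one hole. Filling the centre pixel removes the hole, which is the sort of structural change a topology-aware defense could flag.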

[AI-77] Balancing Natural Language Processing Accuracy and Normalisation in Extracting Medical Insights

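Quick read: This paper addresses the difficulty of extracting structured medical information from unstructured clinical text in non-English settings, and in particular how to apply Natural Language Processing (NLP) effectively where language resources are scarce. It compares low-compute rule-based methods with Large Language Models (LLMs) on information extraction from Electronic Health Records (EHR). Rule-based methods are more accurate for basic fields such as age and sex, while LLMs are more adaptable and scalable for drug name recognition. Translation-induced information loss noticeably degrades LLM performance. The authors therefore argue that a hybrid strategy, combining the precision of rule-based systems with the flexibility of LLMs, is a practical path toward efficient and reliable clinical NLP.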

Link: https://arxiv.org/abs/2511.15778
Authors: Paulina Tworek, Miłosz Bargieł, Yousef Khan, Tomasz Pełech-Pilichowski, Marek Mikołajczyk, Roman Lewandowski, Jose Sousa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures

Abstract:Extracting structured medical insights from unstructured clinical text using Natural Language Processing (NLP) remains an open challenge in healthcare, particularly in non-English contexts where resources are scarce. This study presents a comparative analysis of NLP low-compute rule-based methods and Large Language Models (LLMs) for information extraction from electronic health records (EHR) obtained from the Voivodeship Rehabilitation Hospital for Children in Ameryka, Poland. We evaluate both approaches by extracting patient demographics, clinical findings, and prescribed medications while examining the effects of lack of text normalisation and translation-induced information loss. Results demonstrate that rule-based methods provide higher accuracy in information retrieval tasks, particularly for age and sex extraction. However, LLMs offer greater adaptability and scalability, excelling in drug name recognition. The effectiveness of the LLMs was compared with texts originally in Polish and those translated into English, assessing the impact of translation. These findings highlight the trade-offs between accuracy, normalisation, and computational cost when deploying NLP in healthcare settings. We argue for hybrid approaches that combine the precision of rule-based systems with the adaptability of LLMs, offering a practical path toward more reliable and resource-efficient clinical NLP in real-world hospitals.
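
The rule-based side of this comparison can be pictured as a handful of patterns applied to each note. The sketch below is a hypothetical illustration of age and sex extraction from Polish text. The regular expression, the keyword lists, and the sample notes are assumptions made for this example and are not the rules used in the study.

```python
import re

# Matches forms such as "7 lat", "3 lata", "1 rok" (illustrative, not exhaustive).
AGE_PATTERN = re.compile(r"\b(\d{1,3})\s*(?:lata|lat|rok)\b")

# Hypothetical keyword lists; a real system would need a far richer lexicon.
SEX_KEYWORDS = {
    "F": ("pacjentka", "dziewczynka", "kobieta"),
    "M": ("pacjent", "chłopiec", "mężczyzna"),
}


def extract_demographics(note: str) -> dict:
    """Extract age (in years) and sex from a free-text note, or None if absent."""
    text = note.lower()

    age_match = AGE_PATTERN.search(text)
    age = int(age_match.group(1)) if age_match else None

    sex = None
    for label, keywords in SEX_KEYWORDS.items():
        # Word boundaries keep "pacjent" from matching inside "pacjentka".
        if any(re.search(rf"\b{re.escape(word)}\b", text) for word in keywords):
            sex = label
            break

    return {"age": age, "sex": sex}
```

Rules like these are cheap and predictable on well-formed notes, but they miss spelling variants and unnormalised text, which is one face of the accuracy versus normalisation trade-off the abstract describes.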

[AI-78] TB or Not TB: Coverage-Driven Direct Preference Optimization for Verilog Stimulus Generation

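Quick read: This paper addresses the low efficiency and high resource cost of stimulus generation in hardware design verification, the most time-consuming and labor-intensive part of the verification phase. The key contribution is Coverage-Driven Direct Preference Optimization (CD-DPO), which feeds quantitative, simulation-derived coverage feedback directly into the optimization objective so that Large Language Models (LLMs) learn to generate more effective test stimuli. The authors also build PairaNet, a dataset derived from PyraNet that pairs high- and low-quality testbenches labeled by coverage metrics, to support preference-based training. On the CVDP CID12 benchmark, the TB or not TB framework outperforms both open-source and commercial baselines, with up to 77.27% improvement in code coverage.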

Link: https://arxiv.org/abs/2511.15767
Authors: Bardia Nadimi, Khashayar Filom, Deming Chen, Hao Zheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:With the rapid advancement of Large Language Models (LLMs), there is growing interest in applying them to hardware design and verification. Among these stages, design verification remains the most time-consuming and resource-intensive phase, where generating effective stimuli for the design under test (DUT) is both critical and labor-intensive. We present TB or not TB, a framework for automated stimulus generation using LLMs fine-tuned through Coverage-Driven Direct Preference Optimization (CD-DPO). To enable preference-based training, we introduce PairaNet, a dataset derived from PyraNet that pairs high- and low-quality testbenches labeled using simulation-derived coverage metrics. The proposed CD-DPO method integrates quantitative coverage feedback directly into the optimization objective, guiding the model toward generating stimuli that maximize verification coverage. Experiments on the CVDP CID12 benchmark show that TB or not TB outperforms both open-source and commercial baselines, achieving up to 77.27% improvement in code coverage, demonstrating the effectiveness of coverage-driven preference optimization for LLM-based hardware verification.
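
The abstract does not spell out how coverage enters the CD-DPO objective, so the sketch below shows only the two generic ingredients it builds on: turning coverage scores into chosen/rejected testbench pairs, and the standard DPO loss for one pair. The pairing rule, the `min_gap` threshold, and all function names are assumptions made for illustration.

```python
import math


def build_preference_pairs(candidates, min_gap=0.05):
    """Pair testbenches so the chosen one has clearly higher coverage.

    `candidates` is a list of (testbench_id, coverage) tuples with coverage
    in [0, 1]. `min_gap` is a hypothetical threshold that skips near-ties.
    """
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    pairs = []
    for i, (chosen, chosen_cov) in enumerate(ranked):
        for rejected, rejected_cov in ranked[i + 1:]:
            if chosen_cov - rejected_cov >= min_gap:
                pairs.append((chosen, rejected))
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much the policy prefers the chosen sample over
    the rejected one, relative to a frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a training loop, the log-probabilities would come from the fine-tuned model and a frozen reference copy evaluated on each testbench in a pair. Here they are plain floats so the arithmetic is easy to follow.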

[AI-79] Identifying the Supply Chain of AI for Trustworthiness and Risk Management in Critical Applications AAAI

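Quick read: This paper addresses the lack of a systematic view of the AI supply chain in current Artificial Intelligence (AI) risk assessment and management. In critical applications such as the food supply, healthcare, and transport, risks arising from the complex interaction of data sources, pre-trained models, agents, services, and other system components are hard to identify and control. The key contribution is a taxonomy dedicated to categorizing AI supply chain entities. It helps stakeholders without deep AI expertise to systematically inventory the dependencies of their organization's AI systems, which supports actionable risk assessment and management from the governance level down to operations.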

Link: https://arxiv.org/abs/2511.15763
Authors: Raymond K. Sheh, Karen Geappen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments: Presented at the 2025 AAAI Fall Symposium - AI Trustworthiness and Risk Assessment for Challenged Contexts (ATRACC)

Abstract:Risks associated with the use of AI, ranging from algorithmic bias to model hallucinations, have received much attention and extensive research across the AI community, from researchers to end-users. However, a gap exists in the systematic assessment of supply chain risks associated with the complex web of data sources, pre-trained models, agents, services, and other systems that contribute to the output of modern AI systems. This gap is particularly problematic when AI systems are used in critical applications, such as the food supply, healthcare, utilities, law, insurance, and transport. We survey the current state of AI risk assessment and management, with a focus on the supply chain of AI and risks relating to the behavior and outputs of the AI system. We then present a proposed taxonomy specifically for categorizing AI supply chain entities. This taxonomy helps stakeholders, especially those without extensive AI expertise, to “consider the right questions” and systematically inventory dependencies across their organization’s AI systems. Our contribution bridges a gap between the current state of AI governance and the urgent need for actionable risk assessment and management of AI use in critical applications.

[AI-80] A time for monsters: Organizational knowing after LLMs

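Quick read: This paper asks how Large Language Models (LLMs) reshape organizational knowing and challenge the established representational and practice-based perspectives. Its key move is to conceptualize LLMs as Haraway-ian monsters: hybrid, boundary-crossing, unstable entities. Treating analogizing as a fundamental driver of knowledge, it examines how LLMs generate connections through large-scale statistical inference, and it analyzes their operation across surface/deep analogies and near/far domains to show both their capacity to expand knowing and the epistemic risks they introduce. The paper identifies three challenges: the transformation of inquiry, the growing need for dialogical vetting, and the redistribution of agency. In doing so it extends organizational theory beyond human-centered epistemologies toward the entangled dynamics of knowing-with-LLMs.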

Link: https://arxiv.org/abs/2511.15762
Authors: Samer Faraj, Joel Perez Torrents, Saku Mantere, Anand Bhardwaj
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Forthcoming at Strategic Organization

Abstract:Large Language Models (LLMs) are reshaping organizational knowing by unsettling the epistemological foundations of representational and practice-based perspectives. We conceptualize LLMs as Haraway-ian monsters, that is, hybrid, boundary-crossing entities that destabilize established categories while opening new possibilities for inquiry. Focusing on analogizing as a fundamental driver of knowledge, we examine how LLMs generate connections through large-scale statistical inference. Analyzing their operation across the dimensions of surface/deep analogies and near/far domains, we highlight both their capacity to expand organizational knowing and the epistemic risks they introduce. Building on this, we identify three challenges of living with such epistemic monsters: the transformation of inquiry, the growing need for dialogical vetting, and the redistribution of agency. By foregrounding the entangled dynamics of knowing-with-LLMs, the paper extends organizational theory beyond human-centered epistemologies and invites renewed attention to how knowledge is created, validated, and acted upon in the age of intelligent technologies.

[AI-81] Securing AI Agents Against Prompt Injection Attacks

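Quick read: This paper addresses the security vulnerabilities that prompt injection attacks introduce into deployed Retrieval-Augmented Generation (RAG) systems. The key contribution is a multi-layered defense framework with three core mechanisms: content filtering with embedding-based anomaly detection, hierarchical system prompt guardrails, and multi-stage response verification. The framework is evaluated on a benchmark of 847 adversarial test cases across five attack categories, and it reduces the successful attack rate from 73.2% to 8.7% while maintaining 94.3% of baseline task performance.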

Link: https://arxiv.org/abs/2511.15759
Authors: Badrinath Ramakrishnan, Akshaya Balaji
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) systems have become widely used for enhancing large language model capabilities, but they introduce significant security vulnerabilities through prompt injection attacks. We present a comprehensive benchmark for evaluating prompt injection risks in RAG-enabled AI agents and propose a multi-layered defense framework. Our benchmark includes 847 adversarial test cases across five attack categories: direct injection, context manipulation, instruction override, data exfiltration, and cross-context contamination. We evaluate three defense mechanisms: content filtering with embedding-based anomaly detection, hierarchical system prompt guardrails, and multi-stage response verification, across seven state-of-the-art language models. Our combined framework reduces successful attack rates from 73.2% to 8.7% while maintaining 94.3% of baseline task performance. We release our benchmark dataset and defense implementation to support future research in AI agent security.
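
One way to picture the first defense layer, content filtering with embedding-based anomaly detection, is to flag retrieved chunks whose embedding lies far from a centroid of trusted content. The sketch below is a toy stand-in: the word-count "embedding", the centroid rule, and the `threshold` value are all assumptions, and a real deployment would use a learned embedding model rather than token counts.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse bag-of-words vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroid(trusted_docs) -> Counter:
    """Sum the embeddings of trusted documents into one reference vector."""
    centroid = Counter()
    for doc in trusted_docs:
        centroid.update(embed(doc))
    return centroid


def is_anomalous(chunk: str, centroid: Counter, threshold: float = 0.2) -> bool:
    """Flag a retrieved chunk whose similarity to trusted content is too low."""
    return cosine(embed(chunk), centroid) < threshold
```

A filter like this only screens retrieved text before it reaches the model. The abstract's other two layers, system prompt guardrails and response verification, would sit after it.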

[AI-82] Multi-Agent LLM Orchestration Achieves Deterministic High-Quality Decision Support for Incident Response

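Quick read: This paper addresses the vague, unusable recommendations that single-agent Large Language Model (LLM) architectures produce for incident response in production systems. Single-agent approaches reach an actionable recommendation rate of only 1.7% with highly variable quality, which falls short of the stability and accuracy that production SLAs require. The key contribution is a multi-agent orchestration architecture, validated through 348 controlled trials: it achieves a 100% actionable recommendation rate, an 80-fold improvement in action specificity, a 140-fold improvement in solution correctness, and zero quality variance, giving deterministic output quality. The paper also proposes a Decision Quality (DQ) metric that captures validity, specificity, and correctness, filling a gap in existing LLM evaluation. It thereby reframes multi-agent orchestration from a performance optimization to a production-readiness requirement for LLM-based incident response.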

Link: https://arxiv.org/abs/2511.15755
Authors: Philip Drammeh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 8 pages, 4 tables

Abstract:Large language models (LLMs) promise to accelerate incident response in production systems, yet single-agent approaches generate vague, unusable recommendations. We present this http URL, a reproducible containerized framework demonstrating that multi-agent orchestration fundamentally transforms LLM-based incident response quality. Through 348 controlled trials comparing single-agent copilot versus multi-agent systems on identical incident scenarios, we find that multi-agent orchestration achieves 100% actionable recommendation rate versus 1.7% for single-agent approaches, an 80 times improvement in action specificity and 140 times improvement in solution correctness. Critically, multi-agent systems exhibit zero quality variance across all trials, enabling production SLA commitments impossible with inconsistent single-agent outputs. Both architectures achieve similar comprehension latency (approximately 40 s), establishing that the architectural value lies in deterministic quality, not speed. We introduce Decision Quality (DQ), a novel metric capturing validity, specificity, and correctness properties essential for operational deployment that existing LLM metrics do not address. These findings reframe multi-agent orchestration from a performance optimization to a production-readiness requirement for LLM-based incident response. All code, Docker configurations, and trial data are publicly available for reproduction.

[AI-83] Build AI Assistants using Large Language Models and Agents to Enhance the Engineering Education of Biomechanics

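Quick read: This paper addresses two weaknesses of Large Language Models (LLMs): degraded performance in domain-specific applications due to knowledge gaps, and poor results on tasks that need multi-step reasoning and complex analysis. The core solution is a dual-module framework. First, Retrieval-Augmented Generation (RAG) improves the accuracy and logical consistency of LLM answers to conceptual true/false questions. Second, a Multi-Agent System (MAS) has several LLMs collaborate on calculation-oriented tasks involving multi-step reasoning, equation derivation, and code execution, producing explainable solutions. Experiments show that RAG markedly improves LLM stability on conceptual questions, while the MAS handles complex calculation tasks effectively, suggesting a viable path toward intelligent tutoring systems in engineering education.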

Link: https://arxiv.org/abs/2511.15752
Authors: Hanzhi Yan, Qin Lu, Xianqiao Wang, Xiaoming Zhai, Tianming Liu, He Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:While large language models (LLMs) have demonstrated remarkable versatility across a wide range of general tasks, their effectiveness often diminishes in domain-specific applications due to inherent knowledge gaps. Moreover, their performance typically declines when addressing complex problems that require multi-step reasoning and analysis. In response to these challenges, we propose leveraging both LLMs and AI agents to develop education assistants aimed at enhancing undergraduate learning in biomechanics courses that focus on analyzing the force and moment in the musculoskeletal system of the human body. To achieve our goal, we construct a dual-module framework to enhance LLM performance in biomechanics educational tasks: 1) we apply Retrieval-Augmented Generation (RAG) to improve the specificity and logical consistency of LLM’s responses to the conceptual true/false questions; 2) we build a Multi-Agent System (MAS) to solve calculation-oriented problems involving multi-step reasoning and code execution. Specifically, we evaluate the performance of several LLMs, i.e., Qwen-1.0-32B, Qwen-2.5-32B, and Llama-70B, on a biomechanics dataset comprising 100 true/false conceptual questions and problems requiring equation derivation and calculation. Our results demonstrate that RAG significantly enhances the performance and stability of LLMs in answering conceptual questions, surpassing those of vanilla models. On the other hand, the MAS constructed using multiple LLMs demonstrates its ability to perform multi-step reasoning, derive equations, execute code, and generate explainable solutions for tasks that require calculation. These findings demonstrate the potential of applying RAG and MAS to enhance LLM performance for specialized courses in engineering curricula, providing a promising direction for developing intelligent tutoring in engineering education.

[AI-84] Writing With Machines and Peers: Designing for Critical Engagement with Generative AI

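Quick read: As Generative AI becomes more integrated into higher education, students need pedagogical designs that guide them to use these tools effectively and critically in writing, learning, and engaging with knowledge. This paper proposes a course design that integrates AI and peer feedback: over an eight-week graduate-level academic writing activity, students received feedback from both a custom-built AI reviewer and human peers across multiple writing and revision stages. Students used the two sources differently. They relied on AI mainly for rubric alignment and surface-level edits, and on peer feedback for conceptual development and disciplinary relevance. Their reflections show growing confidence in AI, more strategic use, and critical awareness of its limitations. The design supported writing development, AI literacy, and disciplinary understanding, and it provides an empirical basis for scalable models of integrating AI into writing instruction.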

Link: https://arxiv.org/abs/2511.15750
Authors: Xinran Zhu, Cong Wang, Duane Searsmith
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The growing integration of generative AI in higher education is transforming how students write, learn, and engage with knowledge. As AI tools become more integrated into classrooms, there is an urgent need for pedagogical approaches that help students use them critically and reflectively. This study proposes a pedagogical design that integrates AI and peer feedback in a graduate-level academic writing activity. Over eight weeks, students developed literature review projects through multiple writing and revision stages, receiving feedback from both a custom-built AI reviewer and human peers. We examine two questions: (1) How did students interact with and incorporate AI and peer feedback during the writing process? and (2) How did they reflect on and build relationships with both human and AI reviewers? Data sources include student writing artifacts, AI and peer feedback, AI chat logs, and student reflections. Findings show that students engaged differently with each feedback source, relying on AI for rubric alignment and surface-level edits, and on peer feedback for conceptual development and disciplinary relevance. Reflections revealed evolving relationships with AI, characterized by increasing confidence, strategic use, and critical awareness of its limitations. The pedagogical design supported writing development, AI literacy, and disciplinary understanding. This study offers a scalable pedagogical model for integrating AI into writing instruction and contributes insights for system-level approaches to fostering meaningful human-AI collaboration in higher education.

[AI-85] Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer

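Quick read: This thesis addresses the uncertainty that multimodal learning systems face in human-computer interaction settings because of noisy data, low-quality labels, and heterogeneous modalities, all of which undermine model stability and generalization. The key contribution is a consistency-guided cross-modal transfer framework. It projects heterogeneous modalities into a shared latent space and uses cross-modal semantic consistency to make representation learning more robust and to support uncertainty estimation and mitigation, improving performance and structural reliability under noisy or incomplete supervision.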

Link: https://arxiv.org/abs/2511.15741
Authors: Hyo-Jeong Jang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Master’s thesis, Korea University, 2025

Abstract:Multimodal learning systems often face substantial uncertainty due to noisy data, low-quality labels, and heterogeneous modality characteristics. These issues become especially critical in human-computer interaction settings, where data quality, semantic reliability, and annotation consistency vary across users and recording conditions. This thesis tackles these challenges by exploring uncertainty-resilient multimodal learning through consistency-guided cross-modal transfer. The central idea is to use cross-modal semantic consistency as a basis for robust representation learning. By projecting heterogeneous modalities into a shared latent space, the proposed framework mitigates modality gaps and uncovers structural relations that support uncertainty estimation and stable feature learning. Building on this foundation, the thesis investigates strategies to enhance semantic robustness, improve data efficiency, and reduce the impact of noise and imperfect supervision without relying on large, high-quality annotations. Experiments on multimodal affect-recognition benchmarks demonstrate that consistency-guided cross-modal transfer significantly improves model stability, discriminative ability, and robustness to noisy or incomplete supervision. Latent space analyses further show that the framework captures reliable cross-modal structure even under challenging conditions. Overall, this thesis offers a unified perspective on resilient multimodal learning by integrating uncertainty modeling, semantic alignment, and data-efficient supervision, providing practical insights for developing reliable and adaptive brain-computer interface systems.

[AI-86] Extending Test-Time Scaling: A 3D Perspective with Context Batch and Turn

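Quick read: This paper addresses the limited capacity of test-time scaling in reasoning models, which is bounded by the context length of base models. Conventional approaches only lengthen the reasoning context, yet that context remains orders of magnitude smaller than the number of tokens consumed during training. The key contribution is a multi-dimensional test-time scaling framework that integrates three dimensions: context scaling (longer reasoning context), batch scaling (accuracy improves with parallel sampling), and turn scaling (iterative self-refinement improves reasoning quality). This 3D test-time scaling substantially improves reasoning performance on challenging testbeds (IOI, IMO, CPHO), benefits further from human preference feedback, and extends to open-ended domains such as embodied learning, where it enables the design of humanoid control behaviors.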

Link: https://arxiv.org/abs/2511.15738
Authors: Chao Yu (1), Qixin Tan (1), Jiaxuan Gao (1), Shi Yu (1), Hong Lu (1), Xinting Yang (1), Zelai Xu (1), Yu Wang (1), Yi Wu (1), Eugene Vinitsky (2)
Affiliations: (1) Tsinghua University; (2) New York University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 44 pages, 12 figures

Abstract:Reasoning reinforcement learning (RL) has recently revealed a new scaling effect: test-time scaling. Thinking models such as R1 and o1 improve their reasoning accuracy at test time as the length of the reasoning context increases. However, compared with training-time scaling, test-time scaling is fundamentally limited by the limited context length of base models, which remains orders of magnitude smaller than the amount of tokens consumed during training. We revisit test-time enhancement techniques through the lens of scaling effect and introduce a unified framework of multi-dimensional test-time scaling to extend the capacity of test-time reasoning. Beyond conventional context-length scaling, we consider two additional dimensions: batch scaling, where accuracy improves with parallel sampling, and turn scaling, where iterative self-refinement enhances reasoning quality. Building on this perspective, we propose 3D test-time scaling, which integrates context, batch, and turn scaling. We show that: (1) each dimension demonstrates a test-time scaling effect, but with a bounded capacity; (2) combining all three dimensions substantially improves the reasoning performance of challenging testbeds, including IOI, IMO, and CPHO, and further benefits from human preference feedback; and (3) the human-in-the-loop framework naturally extends to a more open-ended domain, i.e., embodied learning, which enables the design of humanoid control behaviors.
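
The three dimensions can be sketched as nested knobs around a single model call. This is a rough illustration under stated assumptions rather than the authors' procedure: the `Generator` interface, majority voting as the aggregation rule, and feeding the previous answer back in for refinement are all choices made for this example.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical model interface: (problem, previous_answer, context_budget) -> answer
Generator = Callable[[str, Optional[str], int], str]


def majority_vote(answers):
    """Return the most frequent answer among parallel samples."""
    return Counter(answers).most_common(1)[0][0]


def three_d_scaling(
    problem: str,
    generate: Generator,
    context_budget: int = 4096,  # context scaling: tokens allowed per sample
    batch_size: int = 8,         # batch scaling: parallel samples per turn
    turns: int = 3,              # turn scaling: rounds of self-refinement
) -> str:
    answer: Optional[str] = None
    for _ in range(turns):
        # Each sample sees the previous round's answer and may revise it.
        samples = [
            generate(problem, answer, context_budget) for _ in range(batch_size)
        ]
        answer = majority_vote(samples)
    return answer
```

Total cost grows roughly as `context_budget * batch_size * turns`, which makes the abstract's point tangible: each dimension has bounded capacity on its own, so the budget has to be split across all three.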

[AI-87] Sovereign AI: Rethinking Autonomy in the Age of Global Interdependence

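Quick read: This paper addresses the sovereignty dilemma raised by Artificial Intelligence (AI) in an interconnected world: how to safeguard national autonomy when AI depends on global data flows, semiconductor supply chains, open-source ecosystems, and international standards. The key contribution is a conceptual and formal framework that treats sovereign AI as a continuum rather than a binary condition, balancing autonomy with interdependence. It yields two policy heuristics: equalize marginal returns across the four sovereignty pillars (data, compute, models, norms), and set openness at the point where global benefits equal exposure risks. Applied to India and the Middle East, the model shows that an effective sovereign AI strategy relies on coupled investment in data and compute, lifecycle governance (ModelOps), and safeguarded procurement. The resulting lesson is managed interdependence, not isolation.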

Link: https://arxiv.org/abs/2511.15734
Authors: Shalabh Kumar Singh, Shubhashis Sengupta
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments:

Abstract:Artificial intelligence (AI) is emerging as a foundational general-purpose technology, raising new dilemmas of sovereignty in an interconnected world. While governments seek greater control over it, the very foundations of AI–global data pipelines, semiconductor supply chains, open-source ecosystems, and international standards–resist enclosure. This paper develops a conceptual and formal framework for understanding sovereign AI as a continuum rather than a binary condition, balancing autonomy with interdependence. Drawing on classical theories, historical analogies, and contemporary debates on networked autonomy, we present a planner’s model that identifies two policy heuristics: equalizing marginal returns across the four sovereignty pillars and setting openness where global benefits equal exposure risks. We apply the model to India, highlighting sovereign footholds in data, compute, and norms but weaker model autonomy. The near-term challenge is integration via coupled Data x Compute investment, lifecycle governance (ModelOps), and safeguarded procurement. We then apply the model to the Middle East (Saudi Arabia and the UAE), where large public investment in Arabic-first models and sovereign cloud implies high sovereignty weights, lower effective fiscal constraints, and strong Data x Compute complementarities. An interior openness setting with guardrails emerges as optimal. Across contexts, the lesson is that sovereignty in AI needs managed interdependence, not isolation.
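
The first heuristic, equalizing marginal returns across the four pillars, can be illustrated with a toy budget-allocation problem. The abstract does not give the planner's functional forms, so the sketch assumes each pillar returns `weight * sqrt(spend)`. Under that assumption only, the optimal spend is proportional to the squared weight. The pillar weights and budget in the usage note are made-up numbers.

```python
import math


def allocate_budget(weights: dict, budget: float) -> dict:
    """Split `budget` so marginal returns are equal across pillars.

    Assumes return_i(x) = w_i * sqrt(x), so the first-order condition
    w_i / (2 * sqrt(x_i)) = lambda gives x_i proportional to w_i ** 2.
    """
    total = sum(w * w for w in weights.values())
    return {pillar: budget * w * w / total for pillar, w in weights.items()}


def marginal_return(weight: float, spend: float) -> float:
    """Derivative of weight * sqrt(spend) with respect to spend."""
    return weight / (2 * math.sqrt(spend))
```

With hypothetical weights `{"data": 1, "compute": 2, "models": 1, "norms": 1}` and a budget of 70, compute receives 40 and each other pillar 10. Every pillar then has the same marginal return, so shifting one more unit between pillars gains nothing.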

[AI-88] Technique to Baseline QE Artefact Generation Aligned to Quality Metrics

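Quick read: This paper addresses how to ensure output quality when Large Language Models (LLMs) automate the generation of Quality Engineering (QE) artefacts such as requirements, test cases, and Behavior Driven Development (BDD) scenarios. The key contribution is a systematic evaluation technique that combines LLM-driven generation, reverse generation, and rubric-guided iterative refinement to measure and improve artefacts on clarity, completeness, consistency, and testability, enabling scalable and reliable QE artefact validation.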

Link: https://arxiv.org/abs/2511.15733
Authors: Eitan Farchi, Kiran Nayak, Papia Ghosh Majumdar, Saritha Route
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are transforming Quality Engineering (QE) by automating the generation of artefacts such as requirements, test cases, and Behavior Driven Development (BDD) scenarios. However, ensuring the quality of these outputs remains a challenge. This paper presents a systematic technique to baseline and evaluate QE artefacts using quantifiable metrics. The approach combines LLM-driven generation, reverse generation, and iterative refinement guided by rubrics for clarity, completeness, consistency, and testability. Experimental results across 12 projects show that reverse-generated artefacts can outperform low-quality inputs and maintain high standards when inputs are strong. The framework enables scalable, reliable QE artefact validation, bridging automation with accountability.

[AI-89] Just Asking Questions: Doing Our Own Research on Conspiratorial Ideation by Generative AI Chatbots

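Quick read: This paper examines the inconsistent and selectively designed safety guardrails that Generative AI chatbots apply to conspiracy-related questions, with particular attention to how different models respond to widely debunked versus emerging conspiracy theories. Following the platform policy implementation audit approach of Glazunova et al. (2023), it systematically reviews how six AI chat systems (including ChatGPT 3.5, Google Search AI, and Grok) answer questions about five well-known conspiracy theories and four emerging ones tied to breaking news. The results show that guardrails differ markedly by model and by topic, providing empirical grounding for more comprehensive and cross-culturally applicable AI content safety mechanisms.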

Link: https://arxiv.org/abs/2511.15732
Authors: Katherine M. FitzGerald, Michelle Riedlinger, Axel Bruns, Stephen Harrington, Timothy Graham, Daniel Angus
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Interactive chat systems that build on artificial intelligence frameworks are increasingly ubiquitous and embedded into search engines, Web browsers, and operating systems, or are available on websites and apps. Researcher efforts have sought to understand the limitations and potential for harm of generative AI, which we contribute to here. Conducting a systematic review of six AI-powered chat systems (ChatGPT 3.5; ChatGPT 4 Mini; Microsoft Copilot in Bing; Google Search AI; Perplexity; and Grok in Twitter/X), this study examines how these leading products respond to questions related to conspiracy theories. This follows the platform policy implementation audit approach established by Glazunova et al. (2023). We select five well-known and comprehensively debunked conspiracy theories and four emerging conspiracy theories that relate to breaking news events at the time of data collection. Our findings demonstrate that the extent of safety guardrails against conspiratorial ideation in generative AI chatbots differs markedly, depending on chatbot model and conspiracy theory. Our observations indicate that safety guardrails in AI chatbots are often very selectively designed: generative AI companies appear to focus especially on ensuring that their products are not seen to be racist; they also appear to pay particular attention to conspiracy theories that address topics of substantial national trauma such as 9/11 or relate to well-established political issues. Future work should include an ongoing effort extended to further platforms, multiple languages, and a range of conspiracy theories extending well beyond the United States.

[AI-90] The Future of Food: How Artificial Intelligence is Transforming Food Manufacturing

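Quick read: This paper addresses the uneven adoption of artificial intelligence (AI) across the food industry, which stems from highly heterogeneous datasets, limited interoperability between models and systems, and a skills gap between data scientists and food domain experts. The key to the solution is cross-sector coordination: establishing interoperable data standards, developing transparent and interpretable AI models, promoting multidisciplinary collaboration, and strengthening digital infrastructure and privacy-preserving data-sharing mechanisms. Training pathways that combine AI literacy with domain expertise are meant to move AI research into practice efficiently, improving innovation, sustainability, and human well-being in food manufacturing while keeping technological progress grounded in ethics and scientific rigor.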

Link: https://arxiv.org/abs/2511.15728
Authors: Xu Zhou,Ivor Prado,AIFPDS participants,Ilias Tagkopoulos
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial intelligence is accelerating a new era of food innovation, connecting data from farm to consumer to improve formulation, processing, and health outcomes. Recent advances in deep learning, natural language processing, and multi-omics integration make it possible to understand and optimize food systems with unprecedented depth. However, AI adoption across the food sector remains uneven due to heterogeneous datasets, limited model and system interoperability, and a persistent skills gap between data scientists and food domain experts. To address these challenges and advance responsible innovation, the AI Institute for Next Generation Food Systems (AIFS) convened the inaugural AI for Food Product Development Symposium at University of California, Davis, in October 2025. This white paper synthesizes insights from the symposium, organized around five domains where AI can have the greatest near-term impact: supply chain; formulation and processing; consumer insights and sensory prediction; nutrition and health; and education and workforce development. Across the areas, participants emphasized the importance of interoperable data standards, transparent and interpretable models, and cross-sector collaboration to accelerate the translation of AI research into practice. The discussions further highlighted the need for robust digital infrastructure, privacy-preserving data-sharing mechanisms, and interdisciplinary training pathways that integrate AI literacy with domain expertise. Collectively, the priorities outline a roadmap for integrating AI into food manufacturing in ways that enhance innovation, sustainability, and human well-being while ensuring that technological progress remains grounded in ethics, scientific rigor, and societal benefit.

[AI-91] Spatial Reasoning in Multimodal Large Language Models: A Survey of Tasks, Benchmarks and Methods

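Quick read: This paper addresses the persistent difficulty that multimodal large language models (MLLMs) have with spatial reasoning, i.e., perceiving and manipulating spatial relationships in the 3D world. The key to its approach is a taxonomy built from a cognitive perspective: it organizes spatial intelligence along cognitive dimensions and groups tasks by reasoning complexity, linking them to several cognitive functions. The survey maps existing benchmarks across text-only, vision-language, and embodied settings onto this taxonomy and reviews evaluation metrics and methodologies, which enables more principled cross-task comparison and exposes key gaps between current model capabilities and human-level reasoning. It also analyzes training-based and reasoning-based methods for improving spatial ability, clarifying their respective strengths and complementary mechanisms and pointing to directions for future research.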

Link: https://arxiv.org/abs/2511.15722
Authors: Weichen Liu,Qiyao Xue,Haoming Wang,Xiangyu Yin,Boyuan Yang,Wei Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Spatial reasoning, which requires the ability to perceive and manipulate spatial relationships in the 3D world, is a fundamental aspect of human intelligence, yet remains a persistent challenge for multimodal large language models (MLLMs). While existing surveys often categorize recent progress based on input modality (e.g., text, image, video, or 3D), we argue that spatial ability is not solely determined by the input format. Instead, our survey introduces a taxonomy that organizes spatial intelligence from a cognitive perspective and divides tasks in terms of reasoning complexity, linking them to several cognitive functions. We map existing benchmarks across text-only, vision-language, and embodied settings onto this taxonomy, and review evaluation metrics and methodologies for assessing spatial reasoning ability. This cognitive perspective enables more principled cross-task comparisons and reveals critical gaps between current model capabilities and human-like reasoning. In addition, we analyze methods for improving spatial ability, spanning both training-based and reasoning-based approaches. This dual-perspective analysis clarifies their respective strengths and uncovers complementary mechanisms. By surveying tasks, benchmarks, and recent advances, we aim to provide new researchers with a comprehensive understanding of the field and actionable directions for future research.

[AI-92] Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models

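Quick read: This thesis addresses the difficulty of integrating and analyzing heterogeneous multimodal accident data (text and images) in construction safety, where traditional methods fall short on diverse sources such as OSHA accident reports and site imagery. The key to the solution is a multimodal AI framework that combines textual and visual information, using large language models (LLMs) and vision-language models (VLMs) to extract structure from unstructured data and detect rule-level violations, enabling more accurate and automated hazard identification. Two case studies evaluate the framework: the first uses GPT-4o-series models to extract structured insights from 28,000 accident reports; the second evaluates lightweight open-source VLMs (Molmo 7B and Qwen2 VL 2B) on rule-level safety violation detection with the ConstructionSite10k dataset. The small models perform competitively under certain prompt configurations, supporting the feasibility of low-resource multimodal systems for rule-aware safety monitoring.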

Link: https://arxiv.org/abs/2511.15720
Authors: Islem Sahraoui
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Master's thesis, University of Houston

Abstract:This thesis explores a multimodal AI framework for enhancing construction safety through the combined analysis of textual and visual data. In safety-critical environments such as construction sites, accident data often exists in multiple formats, such as written reports, inspection records, and site imagery, making it challenging to synthesize hazards using traditional approaches. To address this, this thesis proposes a multimodal AI framework that combines text and image analysis to assist in identifying safety hazards on construction sites. Two case studies were conducted to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) for automated hazard detection. The first case study introduces a hybrid pipeline that utilizes GPT 4o and GPT 4o mini to extract structured insights from a dataset of 28,000 OSHA accident reports (2000-2025). The second case study extends this investigation using Molmo 7B and Qwen2 VL 2B, lightweight, open-source VLMs. Using the public ConstructionSite10k dataset, the performance of the two models was evaluated on rule-level safety violation detection using natural language prompts. This experiment served as a cost-aware benchmark against proprietary models and allowed testing at scale with ground-truth labels. Despite their smaller size, Molmo 7B and Qwen2 VL 2B showed competitive performance in certain prompt configurations, reinforcing the feasibility of low-resource multimodal systems for rule-aware safety monitoring.

[AI-93] ToolMind Technical Report: A Large-Scale Reasoning-Enhanced Tool-Use Dataset

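Quick read: This paper addresses two problems: the scarcity of high-quality trajectory data that limits large language model (LLM) agents on complex tasks, and the tendency of existing multi-turn dialogue synthesis methods to validate correctness only at the trajectory level, which overlooks turn-level errors that propagate and accumulate during training. The key to the solution is ToolMind, a large-scale, high-quality tool-agentic dataset with 160k synthetic instances. Its pipeline builds a function graph from parameter correlations, uses a multi-agent framework to simulate realistic user-assistant-tool interactions, and applies fine-grained turn-level filtering to remove erroneous or suboptimal steps, which suppresses error amplification during training while preserving self-corrective reasoning signals.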

Link: https://arxiv.org/abs/2511.15718
Authors: Chen Yang,Ran Le,Yun Xing,Zhenwei An,Zongchao Chen,Wayne Xin Zhao,Yang Song,Tao Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:Large Language Model (LLM) agents have developed rapidly in recent years to solve complex real-world problems using external tools. However, the scarcity of high-quality trajectories still hinders the development of stronger LLM agents. Most existing works on multi-turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn-level errors that can propagate during training and degrade model performance. To address these limitations, we introduce ToolMind, a large-scale, high-quality tool-agentic dataset with 160k synthetic data instances generated using over 20k tools and 200k augmented open-source data instances. Our data synthesis pipeline first constructs a function graph based on parameter correlations and then uses a multi-agent framework to simulate realistic user-assistant-tool interactions. Beyond trajectory-level validation, we employ fine-grained turn-level filtering to remove erroneous or suboptimal steps, ensuring that only high-quality reasoning traces are retained. This approach mitigates error amplification during training while preserving self-corrective reasoning signals essential for robust tool-use learning. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.

[AI-94] MACIE: Multi-Agent Causal Intelligence Explainer for Collective Behavior Understanding

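Quick read: This paper addresses the lack of interpretability of multi-agent reinforcement learning (MARL) systems in safety-critical settings: existing explainable AI (XAI) methods struggle to attribute collective outcomes to individual agents, quantify emergent behavior, or capture complex interactions. The key to the solution is MACIE (Multi-Agent Causal Intelligence Explainer), a framework that combines structural causal models, interventional counterfactuals, and Shapley values to provide three capabilities: (1) quantifying each agent's causal contribution with interventional attribution scores; (2) identifying system-level emergent intelligence with synergy metrics that separate collective effects from individual contributions; and (3) generating natural-language narratives that give actionable insights. Across several MARL tasks, the reported results include accurate attribution (mean φ_i = 5.07, standard deviation < 0.05), detection of positive emergence (synergy index up to 0.461), and efficient computation (0.79 seconds per dataset on CPU). The authors position the method as combining causal rigor, emergence quantification, and multi-agent support while remaining practical for real-time use.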

Link: https://arxiv.org/abs/2511.15716
Authors: Abraham Itzhak Weinberg
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As multi-agent reinforcement learning (MARL) systems are used in safety-critical applications, understanding why agents make decisions and how they achieve collective behavior is crucial. Existing explainable AI methods struggle in multi-agent settings. They fail to attribute collective outcomes to individuals, quantify emergent behaviors, or capture complex interactions. We present MACIE (Multi-Agent Causal Intelligence Explainer), a framework combining structural causal models, interventional counterfactuals, and Shapley values to provide comprehensive explanations. MACIE addresses three questions. First, each agent's causal contribution using interventional attribution scores. Second, system-level emergent intelligence through synergy metrics separating collective effects from individual contributions. Third, actionable explanations using natural language narratives synthesizing causal insights. We evaluate MACIE across four MARL scenarios: cooperative, competitive, and mixed motive. Results show accurate outcome attribution (mean φ_i = 5.07, standard deviation below 0.05), detection of positive emergence in cooperative tasks (synergy index up to 0.461), and efficient computation (0.79 seconds per dataset on CPU). MACIE uniquely combines causal rigor, emergence quantification, and multi-agent support while remaining practical for real-time use. This represents a step toward interpretable, trustworthy, and accountable multi-agent AI.
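
To make the Shapley-value part of the abstract concrete, here is a minimal sketch of exact Shapley attribution over a tiny team of agents. It is not MACIE's implementation: the `team_reward` function, the agent names, and the simple `synergy` quantity (team outcome minus the sum of solo outcomes) are all hypothetical stand-ins chosen for illustration, and the structural causal model and interventional counterfactuals are omitted.

```python
from itertools import combinations
from math import factorial


def shapley_values(agents, value):
    """Exact Shapley values; `value` maps a frozenset of agents to an outcome."""
    n = len(agents)
    phi = {}
    for agent in agents:
        others = [x for x in agents if x != agent]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {agent}) - value(s))
        phi[agent] = total
    return phi


def team_reward(coalition):
    """Hypothetical outcome: 1 per agent, plus a bonus of 2 when a and b cooperate."""
    bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    return len(coalition) + bonus


agents = ["a", "b", "c"]
phi = shapley_values(agents, team_reward)

# Toy synergy measure (an assumption, not the paper's metric):
# full-team outcome minus the sum of what each agent achieves alone.
synergy = team_reward(frozenset(agents)) - sum(team_reward(frozenset([x])) for x in agents)
print(phi, synergy)
```

Exact enumeration costs on the order of 2^n coalition evaluations per agent, so a sketch like this only fits small teams; larger systems typically need sampling-based approximations.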

[AI-95] Graph-Memoized Reasoning: Foundations Structured Workflow Reuse in Intelligent Systems

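Quick read: This paper addresses the fact that modern reasoning systems based on large language models (LLMs) recompute similar reasoning steps across tasks, which wastes computation, increases inference latency, and limits reproducibility. The core solution is the Graph-Memoized Reasoning framework: past reasoning workflows are encoded as graph-structured memory and retrieved by structural and semantic similarity so that subgraphs can be compositionally reused, improving efficiency while maintaining consistency. The framework also formulates an optimization objective that minimizes total reasoning cost with a regularization term for inconsistency, giving a theoretical basis for efficiency-consistency trade-offs in intelligent systems.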

Link: https://arxiv.org/abs/2511.15715
Authors: Yash Raj Singh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 tables

Abstract:Modern large language model-based reasoning systems frequently recompute similar reasoning steps across tasks, wasting computational resources, inflating inference latency, and limiting reproducibility. These inefficiencies underscore the need for persistent reasoning mechanisms that can recall and reuse prior computational traces. We introduce Graph-Memoized Reasoning, a formal framework for representing, storing, and reusing reasoning workflows as graph-structured memory. By encoding past decision graphs and retrieving them through structural and semantic similarity, our approach enables compositional reuse of subgraphs across new reasoning tasks. We formulate an optimization objective that minimizes total reasoning cost regularized by inconsistency between stored and generated workflows, providing a theoretical foundation for efficiency-consistency trade-offs in intelligent systems. We outline a conceptual evaluation protocol aligned with the proposed optimization objective. This framework establishes the groundwork for interpretable, cost-efficient, and self-improving reasoning architectures, offering a step toward persistent memory in large-scale agentic systems.
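
The reuse idea can be illustrated with a much simpler mechanism than the one the paper formalizes. The sketch below memoizes individual workflow steps under an exact structural key (step name plus canonicalized inputs). The class `WorkflowMemo` and its methods are hypothetical names, and exact-match hashing stands in for the structural and semantic similarity retrieval described in the abstract, which is not reproduced here.

```python
import hashlib
import json


class WorkflowMemo:
    """Stores results of reasoning steps and reuses them on an exact match."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(step_name, inputs):
        # Canonical JSON so that logically equal inputs hash to the same key.
        payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, step_name, inputs, compute):
        k = self.key(step_name, inputs)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = compute(inputs)
        self._store[k] = result
        return result


calls = []


def expensive_sum(inputs):
    """Stand-in for a costly reasoning step (for example, a model call)."""
    calls.append(inputs)
    return sum(inputs["values"])


memo = WorkflowMemo()
first = memo.run("sum", {"values": [1, 2, 3]}, expensive_sum)
second = memo.run("sum", {"values": [1, 2, 3]}, expensive_sum)  # served from memory
third = memo.run("sum", {"values": [4, 5]}, expensive_sum)      # new inputs, recomputed
print(first, second, third, memo.hits, memo.misses)
```

Exact matching never returns an inconsistent result but also misses near-duplicates; relaxing the key toward similarity search is where the efficiency-consistency trade-off mentioned in the abstract would appear.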

[AI-96] Majority Rules: LLM Ensemble is a Winning Approach for Content Categorization

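Quick read: This paper addresses weaknesses that individual large language models (LLMs) commonly show in unstructured text categorization: inconsistency, hallucination, category inflation, and misclassification. The key to the solution is eLLM (ensemble large language model), an ensemble framework that formalizes collective decision-making mathematically, establishes principled aggregation criteria, and combines the predictions of multiple LLMs. The reported gain is up to 65% in F1-score, with more robust and accurate categorization that approaches human-expert performance.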

Link: https://arxiv.org/abs/2511.15714
Authors: Ariel Kamen,Yakov Kamen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 7 figures

Abstract:This study introduces an ensemble framework for unstructured text categorization using large language models (LLMs). By integrating multiple models, the ensemble large language model (eLLM) framework addresses common weaknesses of individual systems, including inconsistency, hallucination, category inflation, and misclassification. The eLLM approach yields a substantial performance improvement of up to 65% in F1-score over the strongest single model. We formalize the ensemble process through a mathematical model of collective decision-making and establish principled aggregation criteria. Using the Interactive Advertising Bureau (IAB) hierarchical taxonomy, we evaluate ten state-of-the-art LLMs under identical zero-shot conditions on a human-annotated corpus of 8,660 samples. Results show that individual models plateau in performance due to the compression of semantically rich text into sparse categorical representations, while eLLM improves both robustness and accuracy. With a diverse consortium of models, eLLM achieves near human-expert-level performance, offering a scalable and reliable solution for taxonomy-based classification that may significantly reduce dependence on human expert labeling.
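
As a rough illustration of aggregating labels from several models, the sketch below applies a plain majority vote with an agreement threshold. This is an assumption made for illustration: the abstract does not spell out eLLM's aggregation criteria, and the function name `majority_vote`, the `min_agreement` parameter, and the example labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions, min_agreement=0.5):
    """Return the most common label if its vote share exceeds min_agreement, else None."""
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) > min_agreement else None


# Labels that five hypothetical models assigned to the same text.
votes = ["Sports", "Sports", "Travel", "Sports", "News"]
winner = majority_vote(votes)                  # "Sports" holds 3 of 5 votes
split = majority_vote(["Sports", "Travel"])    # no strict majority, so None
print(winner, split)
```

Returning `None` on a split vote leaves room for a fallback such as human review, which matches the abstract's framing of reducing, rather than eliminating, dependence on expert labeling.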

[AI-97] Secure Autonomous Agent Payments: Verifying Authenticity and Intent in a Trustless Environment

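Quick read: This paper addresses how to verify both the identity of an autonomous AI agent and the intent behind its transactions when the agent initiates financial transactions in a decentralized environment. Traditional payment systems rely on human authorization; agent autonomy removes that safeguard and creates security and compliance risks. The key to the solution is a blockchain-based framework that uses decentralized identity (DID) standards and verifiable credentials to authenticate agents, on-chain intent proofs to record user authorization, and zero-knowledge proofs (ZKPs) to ensure policy compliance while preserving privacy. Attestations based on trusted execution environments (TEEs) protect the integrity of agent reasoning and execution, yielding an immutable audit trail from user intent to payment outcome.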

Link: https://arxiv.org/abs/2511.15712
Authors: Vivek Acharya
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages, 1 figure

Abstract:Artificial intelligence (AI) agents are increasingly capable of initiating financial transactions on behalf of users or other agents. This evolution introduces a fundamental challenge: verifying both the authenticity of an autonomous agent and the true intent behind its transactions in a decentralized, trustless environment. Traditional payment systems assume human authorization, but autonomous, agent-led payments remove that safeguard. This paper presents a blockchain-based framework that cryptographically authenticates and verifies the intent of every AI-initiated transaction. The proposed system leverages decentralized identity (DID) standards and verifiable credentials to establish agent identities, on-chain intent proofs to record user authorization, and zero-knowledge proofs (ZKPs) to preserve privacy while ensuring policy compliance. Additionally, secure execution environments (TEE-based attestations) guarantee the integrity of agent reasoning and execution. The hybrid on-chain/off-chain architecture provides an immutable audit trail linking user intent to payment outcome. Through qualitative analysis, the framework demonstrates strong resistance to impersonation, unauthorized transactions, and misalignment of intent. This work lays the foundation for secure, auditable, and intent-aware autonomous economic agents, enabling a future of verifiable trust and accountability in AI-driven financial ecosystems.

[AI-98] CoSP: Reconfigurable Multi-State Metamaterial Inverse Design via Contrastive Pretrained Large Language Model

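Quick read: This paper addresses the lack of multi-state switching in intelligent inverse design for reconfigurable multi-state metamaterials (RMMs): existing deep-learning-based inverse design methods do not handle RMMs whose optical characteristics switch dynamically under external stimulation. The key to the solution is CoSP, an inverse design method based on a contrastive pretrained large language model (LLM). Contrastive pretraining on multi-state spectra yields a spectrum encoder that understands spectra; coupling it with a pretrained LLM lets the model keep its natural-language generation ability while, in the authors' words, comprehending Maxwell's equations, so that it can describe in natural language thin-film metamaterial structures that meet target multi-state, multi-band optical responses.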

Link: https://arxiv.org/abs/2511.16135
Authors: Shujie Yang,Xuzhe Zhao,Yuqi Zhang,Yansong Tang,Kaichen Dong
Affiliations: Unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI)
Comments: 5 pages, 6 figures

Abstract:Metamaterials, known for their ability to manipulate light at subwavelength scales, face significant design challenges due to their complex and sophisticated structures. Consequently, deep learning has emerged as a powerful tool to streamline their design process. Reconfigurable multi-state metamaterials (RMMs) with adjustable parameters can switch their optical characteristics between different states upon external stimulation, leading to numerous applications. However, existing deep learning-based inverse design methods fall short in considering reconfigurability with multi-state switching. To address this challenge, we propose CoSP, an intelligent inverse design method based on contrastive pretrained large language model (LLM). By performing contrastive pretraining on multi-state spectrum, a well-trained spectrum encoder capable of understanding the spectrum is obtained, and it subsequently interacts with a pretrained LLM. This approach allows the model to preserve its linguistic capabilities while also comprehending Maxwell’s Equations, enabling it to describe material structures with target optical properties in natural language. Our experiments demonstrate that CoSP can design corresponding thin-film metamaterial structures for arbitrary multi-state, multi-band optical responses, showing great potentials in the intelligent design of RMMs for versatile applications.

[AI-99] A Primer on Quantum Machine Learning

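Quick read: This overview addresses several core tensions in quantum machine learning (QML): practicality versus theoretical guarantees, access models versus potential speedups, and classical baselines versus claimed quantum advantages. Its key contribution is a systematic survey of the field that distinguishes results backed by strong evidence, results that are conditional, and claims that still lack support, and that flags open questions. The aim is a clear map of the QML landscape that helps readers judge when, under specific assumptions, quantum approaches offer real benefits.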

Link: https://arxiv.org/abs/2511.15969
Authors: Su Yeon Chang,M. Cerezo
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 29+16 pages, 5 figures, 15 boxes. Chapter for Comprehensive Quantum Physics. Comments welcomed!

Abstract:Quantum machine learning (QML) is a computational paradigm that seeks to apply quantum-mechanical resources to solve learning problems. As such, the goal of this framework is to leverage quantum processors to tackle optimization, supervised, unsupervised and reinforcement learning, and generative modeling (among other tasks) more efficiently than classical models. Here we offer a high-level overview of QML, focusing on settings where the quantum device is the primary learning or data-generating unit. We outline the field's tensions between practicality and guarantees, access models and speedups, and classical baselines and claimed quantum advantages, flagging where evidence is strong, where it is conditional or still lacking, and where open questions remain. By shedding light on these nuances and debates, we aim to provide a friendly map of the QML landscape so that the reader can judge when, and under what assumptions, quantum approaches may offer real benefits.

Machine Learning

[LG-0] Toward Artificial Palpation: Representation Learning of Touch on Soft Bodies

Link: https://arxiv.org/abs/2511.16596
Authors: Zohar Rimon,Elisei Shafer,Tal Tepper,Efrat Shimron,Aviv Tamar
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Palpation, the use of touch in medical examination, is almost exclusively performed by humans. We investigate a proof of concept for an artificial palpation method based on self-supervised learning. Our key idea is that an encoder-decoder framework can learn a representation from a sequence of tactile measurements that contains all the relevant information about the palpated object. We conjecture that such a representation can be used for downstream tasks such as tactile imaging and change detection. With enough training data, it should capture intricate patterns in the tactile measurements that go beyond a simple map of forces, which is the current state of the art. To validate our approach, we both develop a simulation environment and collect a real-world dataset of soft objects and corresponding ground truth images obtained by magnetic resonance imaging (MRI). We collect palpation sequences using a robot equipped with a tactile sensor, and train a model that predicts sensory readings at different positions on the object. We investigate the representation learned in this process, and demonstrate its use in imaging and change detection.

[LG-1] gfnx: Fast and Scalable Library for Generative Flow Networks in JAX

Link: https://arxiv.org/abs/2511.16592
Authors: Daniil Tiapkin,Artem Agarkov,Nikita Morozov,Ian Maksimov,Askar Tsyganov,Timofei Gritsaev,Sergey Samsonov
Categories: Machine Learning (cs.LG)
Comments: GitHub: this https URL | Documentation: this https URL

Abstract:In this paper, we present gfnx, a fast and scalable package for training and evaluating Generative Flow Networks (GFlowNets) written in JAX. gfnx provides an extensive set of environments and metrics for benchmarking, accompanied by single-file implementations of core objectives for training GFlowNets. We include synthetic hypergrids, multiple sequence generation environments with various editing regimes and particular reward designs for molecular generation, phylogenetic tree construction, Bayesian structure learning, and sampling from the Ising model energy. Across different tasks, gfnx achieves significant wall-clock speedups compared to PyTorch-based benchmarks (such as the torchgfn library) and author implementations. For example, gfnx achieves up to 55 times speedup on CPU-based sequence generation environments, and up to 80 times speedup with the GPU-based Bayesian network structure learning setup. Our package provides a diverse set of benchmarks and aims to standardize empirical evaluation and accelerate research and applications of GFlowNets. The library is available on GitHub (this https URL) and on PyPI (this https URL). Documentation is available on this https URL.

[LG-2] Almost Sure Convergence Analysis of Differentially Private Stochastic Gradient Methods

Link: https://arxiv.org/abs/2511.16587
Authors: Amartya Mukherjee,Jun Liu
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 6 pages

Abstract:Differentially private stochastic gradient descent (DP-SGD) has become the standard algorithm for training machine learning models with rigorous privacy guarantees. Despite its widespread use, the theoretical understanding of its long-run behavior remains limited: existing analyses typically establish convergence in expectation or with high probability, but do not address the almost sure convergence of single trajectories. In this work, we prove that DP-SGD converges almost surely under standard smoothness assumptions, both in nonconvex and strongly convex settings, provided the step sizes satisfy some standard decaying conditions. Our analysis extends to momentum variants such as the stochastic heavy ball (DP-SHB) and Nesterov’s accelerated gradient (DP-NAG), where we show that careful energy constructions yield similar guarantees. These results provide stronger theoretical foundations for differentially private optimization and suggest that, despite privacy-induced distortions, the algorithm remains pathwise stable in both convex and nonconvex regimes.
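
For readers unfamiliar with the algorithm being analyzed, the following is a minimal pure-Python sketch of a DP-SGD update: clip each per-example gradient, add Gaussian noise to the sum, average, and step. It is an illustration under stated assumptions, not the paper's setup: the least-squares toy problem, the step-size schedule `1 / t**0.75` (one example of a decaying schedule whose sum diverges while its sum of squares converges), and all parameter values are hypothetical, and no privacy accounting is performed.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale a gradient so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(w, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add Gaussian noise, average, step."""
    n = len(per_example_grads)
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    sigma = noise_multiplier * clip_norm
    update = []
    for j in range(len(w)):
        noisy_sum = sum(g[j] for g in clipped) + rng.gauss(0.0, sigma)
        update.append(noisy_sum / n)
    return [wj - lr * uj for wj, uj in zip(w, update)]


# Hypothetical noiseless least-squares problem with known solution w_true.
rng = random.Random(0)
w_true = [1.0, -2.0]
data = []
for _ in range(64):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    data.append((x, sum(a * b for a, b in zip(x, w_true))))

w = [0.0, 0.0]
for t in range(1, 2001):
    grads = []
    for x, y in data:
        err = sum(a * b for a, b in zip(x, w)) - y   # gradient of 0.5 * err**2 is err * x
        grads.append([err * xj for xj in x])
    w = dp_sgd_step(w, grads, lr=1.0 / t ** 0.75, clip_norm=1.0,
                    noise_multiplier=0.5, rng=rng)

print(w)  # a single trajectory; it should end close to w_true
```

The printed vector is one sample path, which is the object that almost-sure convergence results speak about, as opposed to an average over many runs.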

[LG-3] An Exterior-Embedding Neural Operator Framework for Preserving Conservation Laws

Link: https://arxiv.org/abs/2511.16573
Authors: Huanshuo Dong,Hong Wang,Hao Wu,Zhiwei Zhuang,Xuanze Yang,Ruiqi Shu,Yuan Gao,Xiaomeng Huang
Categories: Other Computer Science (cs.OH); Machine Learning (cs.LG)
Comments:

Abstract:Neural operators have demonstrated considerable effectiveness in accelerating the solution of time-dependent partial differential equations (PDEs) by directly learning governing physical laws from data. However, for PDEs governed by conservation laws (e.g., conservation of mass, energy, or matter), existing neural operators fail to satisfy conservation properties, which leads to degraded model performance and limited generalizability. Moreover, we observe that distinct PDE problems generally require different optimal neural network architectures. This finding underscores the inherent limitations of specialized models in generalizing across diverse problem domains. To address these limitations, we propose Exterior-Embedded Conservation Framework (ECF), a universal conserving framework that can be integrated with various data-driven neural operators to enforce conservation laws strictly in predictions. The framework consists of two key components: a conservation quantity encoder that extracts conserved quantities from input data, and a conservation quantity decoder that adjusts the neural operator’s predictions using these quantities to ensure strict conservation compliance in the final output. Since our architecture enforces conservation laws, we theoretically prove that it enhances model performance. To validate the performance of our method, we conduct experiments on multiple conservation-law-constrained PDE scenarios, including adiabatic systems, shallow water equations, and the Allen-Cahn problem. These baselines demonstrate that our method effectively improves model accuracy while strictly enforcing conservation laws in the predictions.
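
The encoder/decoder split described in the abstract can be pictured with a deliberately simple stand-in: read the conserved total from the input, then correct the model's output so that it carries the same total. The sketch below does this for total mass on a uniform 1-D grid. The function names, the uniform-shift correction, and the numbers are hypothetical; the paper's actual exterior-embedding construction is not reproduced.

```python
def conserved_quantity(field):
    """Encoder stand-in: total mass of a 1-D field sampled on a uniform grid."""
    return sum(field)


def enforce_conservation(prediction, target_total):
    """Decoder stand-in: spread the mass defect evenly over all grid cells."""
    defect = (target_total - sum(prediction)) / len(prediction)
    return [p + defect for p in prediction]


u0 = [0.0, 1.0, 3.0, 1.0, 0.0]               # initial state, total mass 5.0
raw_prediction = [0.1, 1.2, 2.4, 1.3, 0.2]   # pretend neural-operator output, total 5.2
corrected = enforce_conservation(raw_prediction, conserved_quantity(u0))

print(sum(raw_prediction), sum(corrected))
```

Because the correction is applied after the learned model, it does not depend on which neural operator produced `raw_prediction`, which is the sense in which such a wrapper can be combined with different architectures.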

[LG-4] Boosting Predictive Performance on Tabular Data through Data Augmentation with Latent-Space Flow-Based Diffusion

Link: https://arxiv.org/abs/2511.16571
Authors: Md. Tawfique Ihsan,Md. Rakibul Hasan Rafi,Ahmed Shoyeb Raihan,Imtiaz Ahmed,Abdullahil Azeem
Categories: Machine Learning (cs.LG)
Comments: 35 pages

Abstract:Severe class imbalance is common in real-world tabular learning, where rare but important minority classes are essential for reliable prediction. Existing generative oversampling methods such as GANs, VAEs, and diffusion models can improve minority-class performance, but they often struggle with tabular heterogeneity, training stability, and privacy concerns. We propose a family of latent-space, tree-driven diffusion methods for minority oversampling that use conditional flow matching with gradient-boosted trees as the vector-field learner. The models operate in compact latent spaces to preserve tabular structure and reduce computation. We introduce three variants: PCAForest, which uses linear PCA embedding; EmbedForest, which uses a learned nonlinear embedding; and AttentionForest, which uses an attention-augmented embedding. Each method couples a GBT-based flow with a decoder back to the original feature space. Across 11 datasets from healthcare, finance, and manufacturing, AttentionForest achieves the best average minority recall while maintaining competitive precision, calibration, and distributional similarity. PCAForest and EmbedForest reach similar utility with much faster generation, offering favorable accuracy-efficiency trade-offs. Privacy evaluated with nearest-neighbor distance ratio and distance-to-closest-record is comparable to or better than the ForestDiffusion baseline. Ablation studies show that smaller embeddings tend to improve minority recall, while aggressive learning rates harm stability. Overall, latent-space, tree-driven diffusion provides an efficient and privacy-aware approach to high-fidelity tabular data augmentation under severe class imbalance.

[LG-5] Toward Valid Generative Clinical Trial Data with Survival Endpoints

Link: https://arxiv.org/abs/2511.16551
Authors: Perrine Chassat,Van Tuan Nguyen,Lucas Ducrot,Emilie Lanoy,Agathe Guilloux
Categories: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: P. Chassat and V.T. Nguyen contributed equally to this work

Abstract:Clinical trials face mounting challenges: fragmented patient populations, slow enrollment, and unsustainable costs, particularly for late phase trials in oncology and rare diseases. While external control arms built from real-world data have been explored, a promising alternative is the generation of synthetic control arms using generative AI. A central challenge is the generation of time-to-event outcomes, which constitute primary endpoints in oncology and rare disease trials, but are difficult to model under censoring and small sample sizes. Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Across synthetic and real trial datasets, we evaluate our model in two realistic scenarios: (i) data sharing under privacy constraints, where synthetic controls substitute for original data, and (ii) control-arm augmentation, where synthetic patients mitigate imbalances between treated and control groups. Our method outperforms GAN baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power. We propose a post-generation selection procedure that improves calibration, highlighting both progress and open challenges for generative survival modeling.

[LG-6] Broad stochastic configuration residual learning system for norm-convergent universal approximation

Link: https://arxiv.org/abs/2511.16550
Authors: Han Su,Zhongyan Li,Wanquan Liu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Universal approximation serves as the foundation of neural network learning algorithms. However, some networks establish their universal approximation property by demonstrating that the iterative errors converge in probability measure rather than the more rigorous norm convergence, which makes the universal approximation property of randomized learning networks highly sensitive to random parameter selection. The broad residual learning system (BRLS), as a member of randomized learning models, also encounters this issue. We theoretically demonstrate the limitation of its universal approximation property, that is, the iterative errors do not satisfy norm convergence if the selection of random parameters is inappropriate and the convergence rate meets certain conditions. To address this issue, we propose the broad stochastic configuration residual learning system (BSCRLS) algorithm, which features a novel supervisory mechanism adaptively constraining the range settings of random parameters on the basis of the BRLS framework. Furthermore, we prove the universal approximation theorem of BSCRLS based on the more stringent norm convergence. Three versions of incremental BSCRLS algorithms are presented to satisfy the application requirements of various network updates. Solar panel dust detection experiments are performed on a publicly available dataset and compared with 13 deep and broad learning algorithms. Experimental results reveal the effectiveness and superiority of BSCRLS algorithms.

[LG-7] FairLRF: Achieving Fairness through Sparse Low Rank Factorization

Link: https://arxiv.org/abs/2511.16549
Authors: Yuanbo Guo, Jun Xia, Yiyu Shi
Categories: Machine Learning (cs.LG)

Abstract:As deep learning (DL) techniques become integral to various applications, ensuring model fairness while maintaining high performance has become increasingly critical, particularly in sensitive fields such as medical diagnosis. Although a variety of bias-mitigation methods have been proposed, many rely on computationally expensive debiasing strategies or suffer substantial drops in model accuracy, which limits their practicality in real-world, resource-constrained settings. To address this issue, we propose a fairness-oriented low rank factorization (LRF) framework that leverages singular value decomposition (SVD) to improve DL model fairness. Unlike traditional SVD, which is mainly used for model compression by decomposing and reducing weight matrices, our work shows that SVD can also serve as an effective tool for fairness enhancement. Specifically, we observed that elements in the unitary matrices obtained from SVD contribute unequally to model bias across groups defined by sensitive attributes. Motivated by this observation, we propose a method, named FairLRF, that selectively removes bias-inducing elements from unitary matrices to reduce group disparities, thus enhancing model fairness. Extensive experiments show that our method outperforms conventional LRF methods as well as state-of-the-art fairness-enhancing techniques. Additionally, an ablation study examines how major hyper-parameters may influence the performance of processed models. To the best of our knowledge, this is the first work utilizing SVD not primarily for compression but for fairness enhancement.

[LG-8] Dynamic Participation in Federated Learning: Benchmarks and a Knowledge Pool Plugin

Link: https://arxiv.org/abs/2511.16523
Authors: Ming-Lun Lee, Fu-Shiang Yang, Cheng-Kuan Lin, Yan-Ann Chen, Chih-Yu Lin, Yu-Chee Tseng
Categories: Machine Learning (cs.LG)

Abstract:Federated learning (FL) enables clients to collaboratively train a shared model in a distributed manner, setting it apart from traditional deep learning paradigms. However, most existing FL research assumes consistent client participation, overlooking the practical scenario of dynamic participation (DPFL), where clients may intermittently join or leave during training. Moreover, no existing benchmarking framework systematically supports the study of DPFL-specific challenges. In this work, we present the first open-source framework explicitly designed for benchmarking FL models under dynamic client participation. Our framework provides configurable data distributions, participation patterns, and evaluation metrics tailored to DPFL scenarios. Using this platform, we benchmark four major categories of widely adopted FL models and uncover substantial performance degradation under dynamic participation. To address these challenges, we further propose Knowledge-Pool Federated Learning (KPFL), a generic plugin that maintains a shared knowledge pool across both active and idle clients. KPFL leverages dual-age and data-bias weighting, combined with generative knowledge distillation, to mitigate instability and prevent knowledge loss. Extensive experiments demonstrate the significant impact of dynamic participation on FL performance and the effectiveness of KPFL in improving model robustness and generalization.

[LG-9] Saving Foundation Flow-Matching Priors for Inverse Problems

Link: https://arxiv.org/abs/2511.16520
Authors: Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Foundation flow-matching (FM) models promise a universal prior for solving inverse problems (IPs), yet today they trail behind domain-specific or even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with a sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. This leads to a significant performance boost across image restoration and scientific IPs. Our results point to a path for making foundation FM models practical, reusable priors for IP solving.

[LG-10] Loss Functions Robust to the Presence of Label Errors

Link: https://arxiv.org/abs/2511.16512
Authors: Nicholas Pellegrino, David Szczecina, Paul Fieguth
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 6 figures, Presented at the 10th Annual Conference on Vision and Intelligent Systems (2024)

Abstract:Methods for detecting label errors in training data require models that are robust to label errors (i.e., not fit to erroneously labelled data points). However, acquiring such models often involves training on corrupted data, which presents a challenge. Adjustments to the loss function present an opportunity for improvement. Motivated by Focal Loss (which emphasizes difficult-to-classify samples), two novel, yet simple, loss functions are proposed that de-weight or ignore these difficult samples (i.e., those likely to have label errors). Results on artificially corrupted data show promise, such that F1 scores for detecting errors are improved from the baselines of conventional categorical Cross Entropy and Focal Loss.
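
To make the contrast with Focal Loss concrete, here is a minimal sketch of what "de-weighting" and "ignoring" hard samples could look like. The abstract does not give the exact formulas, so `deweighted_loss`, `ignoring_loss`, the exponent `gamma`, and the cutoff `threshold` are illustrative assumptions, not the authors' definitions. Only the Focal Loss weight `(1 - p) ** gamma` is the standard form.

```python
import math

def cross_entropy(p_true: float) -> float:
    """Categorical cross entropy for one sample, given the predicted
    probability of its (possibly wrong) label."""
    return -math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Standard Focal Loss: up-weights hard samples (small p_true)."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)

def deweighted_loss(p_true: float, gamma: float = 2.0) -> float:
    """Hypothetical de-weighting variant: the weight p_true ** gamma
    shrinks the contribution of hard samples instead of boosting it."""
    return p_true ** gamma * cross_entropy(p_true)

def ignoring_loss(p_true: float, threshold: float = 0.1) -> float:
    """Hypothetical ignoring variant: samples below the confidence
    threshold contribute nothing to the loss."""
    return 0.0 if p_true < threshold else cross_entropy(p_true)

# A hard sample (likely mislabelled) versus an easy one.
for p in (0.05, 0.9):
    print(p, cross_entropy(p), focal_loss(p), deweighted_loss(p), ignoring_loss(p))
```

In this sketch a sample with predicted probability 0.05 for its label dominates the Focal Loss but is nearly or completely removed from the two alternative losses, which is the behaviour the abstract describes.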

[LG-11] Limitations of Scalarisation in MORL: A Comparative Study in Discrete Environments

Link: https://arxiv.org/abs/2511.16476
Authors: Muhammad Sa’ood Shah, Asad Jeewa
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 4 figures, published in the Proceedings of the 46th Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2025)

Abstract:Scalarisation functions are widely employed in MORL algorithms to enable intelligent decision-making. However, these functions often struggle to approximate the Pareto front accurately, rendering them unideal in complex, uncertain environments. This study examines selected Multi-Objective Reinforcement Learning (MORL) algorithms across MORL environments with discrete action and observation spaces. We aim to investigate further the limitations associated with scalarisation approaches for decision-making in multi-objective settings. Specifically, we use an outer-loop multi-policy methodology to assess the performance of a seminal single-policy MORL algorithm, MO Q-Learning implemented with linear scalarisation and Chebyshev scalarisation functions. In addition, we explore a pioneering inner-loop multi-policy algorithm, Pareto Q-Learning, which offers a more robust alternative. Our findings reveal that the performance of the scalarisation functions is highly dependent on the environment and the shape of the Pareto front. These functions often fail to retain the solutions uncovered during learning and favour finding solutions in certain regions of the solution space. Moreover, finding the appropriate weight configurations to sample the entire Pareto front is complex, limiting their applicability in uncertain settings. In contrast, inner-loop multi-policy algorithms may provide a more sustainable and generalizable approach and potentially facilitate intelligent decision-making in dynamic and uncertain environments.
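
A toy example helps show why the shape of the Pareto front matters. The three-point front below is invented for illustration and does not come from the paper. The point `(0.4, 0.4)` is Pareto-optimal but sits in a concave region, so no linear weighting ever selects it, whereas a Chebyshev scalarisation with an assumed utopian point `(1.0, 1.0)` does.

```python
# Hypothetical two-objective Pareto front (both objectives are maximised).
front = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]
utopia = (1.0, 1.0)  # assumed utopian reference point

def linear(values, weights):
    """Linear scalarisation: weighted sum of objectives (higher is better)."""
    return sum(w * v for w, v in zip(weights, values))

def chebyshev(values, weights, reference):
    """Chebyshev scalarisation: largest weighted distance to the
    reference point (lower is better)."""
    return max(w * abs(z - v) for w, v, z in zip(weights, values, reference))

# Sweep a grid of weight vectors, as an outer-loop method would.
weight_grid = [(i / 10, 1 - i / 10) for i in range(11)]

linear_found = {max(front, key=lambda v: linear(v, w)) for w in weight_grid}
chebyshev_found = {min(front, key=lambda v: chebyshev(v, w, utopia))
                   for w in weight_grid}

print("linear:", sorted(linear_found))
print("chebyshev:", sorted(chebyshev_found))
```

The sweep over `weight_grid` also hints at the practical difficulty the abstract mentions: the solutions recovered depend on which weight configurations are tried.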

[LG-12] A Comparison Between Decision Transformers and Traditional Offline Reinforcement Learning Algorithms

Link: https://arxiv.org/abs/2511.16475
Authors: Ali Murtaza Caunhye, Asad Jeewa
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 4 figures, published in the Proceedings of the 46th Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2025)

Abstract: The field of Offline Reinforcement Learning (RL) aims to derive effective policies from pre-collected datasets without active environment interaction. While traditional offline RL algorithms like Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) have shown promise, they often face challenges in balancing exploration and exploitation, especially in environments with varying reward densities. The recently proposed Decision Transformer (DT) approach, which reframes offline RL as a sequence modelling problem, has demonstrated impressive results across various benchmarks. This paper presents a comparative study evaluating the performance of DT against traditional offline RL algorithms in dense and sparse reward settings for the ANT continuous control environment. Our research investigates how these algorithms perform when faced with different reward structures, examining their ability to learn effective policies and generalize across varying levels of feedback. Through empirical analysis in the ANT environment, we found that DTs showed less sensitivity to varying reward density compared to other methods and particularly excelled with medium-expert datasets in sparse reward scenarios. In contrast, traditional value-based methods like IQL showed improved performance in dense reward settings with high-quality data, while CQL offered balanced performance across different data qualities. Additionally, DTs exhibited lower variance in performance but required significantly more computational resources compared to traditional approaches. These findings suggest that sequence modelling approaches may be more suitable for scenarios with uncertain reward structures or mixed-quality data, while value-based methods remain competitive in settings with dense rewards and high-quality demonstrations.
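
For readers unfamiliar with the sequence-modelling view, the original Decision Transformer formulation conditions each step on the return-to-go, that is, the sum of rewards from that step to the end of the trajectory. The sketch below computes that quantity for a dense and a sparse reward sequence. The reward values are made up for illustration, and the abstract does not say how this paper preprocesses its ANT data.

```python
def returns_to_go(rewards):
    """Return-to-go at step t: sum of rewards from t to the episode end."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

dense = [1.0, 1.0, 1.0, 1.0]   # reward at every step
sparse = [0.0, 0.0, 0.0, 4.0]  # same total, delivered only at the end

print(returns_to_go(dense))
print(returns_to_go(sparse))
```

Both trajectories start with the same return-to-go even though their per-step rewards differ. This gives one intuition for why a return-conditioned model might be less sensitive to reward density, though the abstract reports that finding only as an empirical result.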

[LG-13] FreqFlow: Long-term forecasting using lightweight flow matching

Link: https://arxiv.org/abs/2511.16426
Authors: Seyed Mohamad Moghadas, Bruno Cornelis, Adrian Munteanu
Categories: Machine Learning (cs.LG)
Comments: Accepted at EurIPS, 2025

Abstract: Multivariate time-series (MTS) forecasting is fundamental to applications ranging from urban mobility and resource management to climate modeling. While recent generative models based on denoising diffusion have advanced state-of-the-art performance in capturing complex data distributions, they suffer from significant computational overhead due to iterative stochastic sampling procedures that limit real-time deployment. Moreover, these models can be brittle when handling high-dimensional, non-stationary, and multi-scale periodic patterns characteristic of real-world sensor networks. We introduce FreqFlow, a novel framework that leverages conditional flow matching in the frequency domain for deterministic MTS forecasting. Unlike conventional approaches that operate in the time domain, FreqFlow transforms the forecasting problem into the spectral domain, where it learns to model amplitude and phase shifts through a single complex-valued linear layer. This frequency-domain formulation enables the model to efficiently capture temporal dynamics via complex multiplication, corresponding to scaling and temporal translations. The resulting architecture is exceptionally lightweight, with only 89k parameters (an order of magnitude smaller than competing diffusion-based models), while enabling single-pass deterministic sampling through ordinary differential equation (ODE) integration. Our approach decomposes MTS signals into trend, seasonal, and residual components, with the flow matching mechanism specifically designed for residual learning to enhance long-term forecasting accuracy. Extensive experiments on real-world traffic speed, volume, and flow datasets demonstrate that FreqFlow achieves state-of-the-art forecasting performance, with RMSE improvements of 7% on average, while being significantly faster and more parameter-efficient than existing methods.
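
The statement that complex multiplication in the spectral domain corresponds to scaling and temporal translation is the discrete Fourier shift theorem. The sketch below checks it with a naive DFT. It is not the FreqFlow layer itself: `scale_and_shift`, `gain`, and `shift` are illustrative names, and a learned complex-valued layer would in general use a different multiplier per frequency.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform, O(n^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def scale_and_shift(x, gain, shift):
    """Multiply each frequency bin by gain * exp(-2*pi*i*k*shift/n).
    In the time domain this scales the signal by `gain` and circularly
    delays it by `shift` samples."""
    n = len(x)
    X = dft(x)
    Y = [gain * cmath.exp(-2j * cmath.pi * k * shift / n) * X[k] for k in range(n)]
    return [y.real for y in idft(Y)]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
shifted = scale_and_shift(signal, gain=0.5, shift=2)
print([round(v, 6) for v in shifted])
```

One complex number per frequency bin therefore encodes both an amplitude change and a delay, which is why such a layer can be compact.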

[LG-14] Optimal Fairness under Local Differential Privacy

Link: https://arxiv.org/abs/2511.16377
Authors: Hrad Ghoukasian, Shahab Asoodeh
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: 21 pages, 6 figures, 2 tables

Abstract:We investigate how to optimally design local differential privacy (LDP) mechanisms that reduce data unfairness and thereby improve fairness in downstream classification. We first derive a closed-form optimal mechanism for binary sensitive attributes and then develop a tractable optimization framework that yields the corresponding optimal mechanism for multi-valued attributes. As a theoretical contribution, we establish that for discrimination-accuracy optimal classifiers, reducing data unfairness necessarily leads to lower classification unfairness, thus providing a direct link between privacy-aware pre-processing and classification fairness. Empirically, we demonstrate that our approach consistently outperforms existing LDP mechanisms in reducing data unfairness across diverse datasets and fairness metrics, while maintaining accuracy close to that of non-private models. Moreover, compared with leading pre-processing and post-processing fairness methods, our mechanism achieves a more favorable accuracy-fairness trade-off while simultaneously preserving the privacy of sensitive attributes. Taken together, these results highlight LDP as a principled and effective pre-processing fairness intervention technique.
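
As background for what an LDP mechanism on a binary sensitive attribute looks like, here is classic randomized response. It serves as a baseline illustration only. It is not the closed-form optimal mechanism derived in the paper, which the abstract does not spell out.

```python
import math
import random

def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under epsilon-LDP
    randomized response: e^eps / (1 + e^eps)."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability keep_probability(epsilon),
    otherwise report the flipped bit."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit

rng = random.Random(0)
epsilon = 1.0
reports = [randomized_response(1, epsilon, rng) for _ in range(10)]
print(reports)

# The LDP guarantee: the two possible inputs produce any given output
# with probabilities whose ratio is at most e^epsilon.
p = keep_probability(epsilon)
print(p / (1.0 - p), math.exp(epsilon))
```

A fairness-aware mechanism would choose the perturbation probabilities with the downstream label distribution in mind, rather than flipping symmetrically as done here.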

[LG-15] Unsupervised Graph Neural Network Framework for Balanced Multipatterning in Advanced Electronic Design Automation Layouts

Link: https://arxiv.org/abs/2511.16374
Authors: Abdelrahman Helaly, Nourhan Sakr, Kareem Madkour, Ilhami Torunoglu
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: manuscript under review

Abstract: Multipatterning is an essential decomposition strategy in electronic design automation (EDA) that overcomes lithographic limitations when printing dense circuit layouts. Although heuristic-based backtracking and SAT solvers can address these challenges, they often struggle to simultaneously handle both complex constraints and secondary objectives. In this study, we present a hybrid workflow that casts multipatterning as a variant of a constrained graph coloring problem with the primary objective of minimizing feature violations and a secondary objective of balancing the number of features on each mask. Our pipeline integrates two main components: (1) a GNN-based agent, trained in an unsupervised manner to generate initial color predictions, which are refined by (2) refinement strategies (a GNN-based heuristic and simulated annealing) that together enhance solution quality and balance. Experimental evaluations on both proprietary datasets and publicly available open-source layouts demonstrate complete conflict-free decomposition and consistent color balancing. The proposed framework provides a reproducible, data-efficient and deployable baseline for scalable layout decomposition in EDA workflows.
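
The two objectives in this formulation can be written down directly. In the sketch below, a layout is a conflict graph whose edges join features that must not share a mask. The function names and the four-feature example are assumptions made for illustration, and the GNN and simulated-annealing components are not reproduced.

```python
from collections import Counter

def count_conflicts(edges, colors):
    """Primary objective: number of conflict edges whose two features
    were assigned to the same mask."""
    return sum(colors[u] == colors[v] for u, v in edges)

def balance_gap(colors, num_masks):
    """Secondary objective: difference between the most and least
    populated masks (0 means perfectly balanced)."""
    counts = Counter(colors.values())
    sizes = [counts.get(mask, 0) for mask in range(num_masks)]
    return max(sizes) - min(sizes)

# Hypothetical layout: four features arranged in a conflict cycle.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
good = {0: 0, 1: 1, 2: 0, 3: 1}  # alternating masks
bad = {0: 0, 1: 0, 2: 0, 3: 1}   # three features on mask 0

print(count_conflicts(edges, good), balance_gap(good, 2))
print(count_conflicts(edges, bad), balance_gap(bad, 2))
```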

[LG-16] Improving Iterative Gaussian Processes via Warm Starting Sequential Posteriors

Link: https://arxiv.org/abs/2511.16340
Authors: Alan Yufei Dong, Jihao Andreas Lin, José Miguel Hernández-Lobato
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Scalable Gaussian process (GP) inference is essential for sequential decision-making tasks, yet improving GP scalability remains a challenging problem with many open avenues of research. This paper focuses on iterative GPs, where iterative linear solvers, such as conjugate gradients, stochastic gradient descent or alternative projections, are used to approximate the GP posterior. We propose a new method which improves solver convergence of a large linear system by leveraging the known solution to a smaller system contained within. This is significant for tasks with incremental data additions, and we show that our technique achieves speed-ups when solving to tolerance, as well as improved Bayesian optimisation performance under a fixed compute budget.
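
One simple way to reuse a smaller system's solution is to pad it with zeros and pass it to the solver as the initial guess for the larger system. The sketch below does this with a plain conjugate gradient solver on a small RBF kernel matrix. The zero-padding scheme, the kernel, the noise level, and the data are all assumptions made for illustration, and the paper's actual warm-starting method may differ.

```python
import math

def rbf_kernel_matrix(points, noise=0.1):
    """Squared-exponential kernel matrix with a noise term on the diagonal."""
    return [[math.exp(-0.5 * (a - b) ** 2) + (noise if i == j else 0.0)
             for j, b in enumerate(points)] for i, a in enumerate(points)]

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, starting from x0.
    Returns the solution and the number of iterations used."""
    x = list(x0)
    r = [bi - ci for bi, ci in zip(b, matvec(A, x))]
    p = list(r)
    rs = dot(r, r)
    iters = 0
    while math.sqrt(rs) > tol and iters < max_iter:
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        iters += 1
    return x, iters

points = [0.4 * i for i in range(8)]
targets = [math.sin(x) for x in points]

# The 7-point system is the leading block of the 8-point system.
K_small = rbf_kernel_matrix(points[:-1])
K_large = rbf_kernel_matrix(points)

x_small, _ = conjugate_gradient(K_small, targets[:-1], [0.0] * 7)
x_cold, cold_iters = conjugate_gradient(K_large, targets, [0.0] * 8)
x_warm, warm_iters = conjugate_gradient(K_large, targets, x_small + [0.0])

print("cold-start iterations:", cold_iters)
print("warm-start iterations:", warm_iters)
```

Both runs converge to the same solution. Any difference in iteration count on this toy problem should not be read as evidence for or against the speed-ups reported in the paper.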

[LG-17] Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning

Link: https://arxiv.org/abs/2511.16333
Authors: Mohammad Areeb Qazi, Maryam Nadeem, Mohammad Yaqub
Categories: Machine Learning (cs.LG)
Comments: 2 Figures, 1 Table

Abstract: Healthcare requires AI that is predictive, reliable, and data-efficient. However, recent generative models lack the physical foundation and temporal reasoning required for clinical decision support. As scaling language models shows diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action-conditioned representations that reflect the physical and causal structure of care. This paper reviews World Models for healthcare systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection-transition modeling, and Joint Embedding Predictive Architecture (JEPA)-style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action-conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action-conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1–L2, with fewer instances of L3 and rare L4. We identify cross-cutting gaps that limit clinical reliability: under-specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory-level uncertainty calibration. This review outlines a research agenda for clinically robust prediction-first world models that integrate generative backbones (transformers, diffusion, VAE) with causal/mechanical foundation for safe decision support in healthcare.

[LG-18] Learning-Enhanced Observer for Linear Time-Invariant Systems with Parametric Uncertainty

Link: https://arxiv.org/abs/2511.16318
Authors: Hao Shu
Categories: Machine Learning (cs.LG)
Comments: 6 pages, ordinary version

Abstract:This work introduces a learning-enhanced observer (LEO) for linear time-invariant systems with uncertain dynamics. Rather than relying solely on nominal models, the proposed framework treats the system matrices as optimizable variables and refines them through gradient-based minimization of a steady-state output discrepancy loss. The resulting data-informed surrogate model enables the construction of an improved observer that effectively compensates for moderate parameter uncertainty while preserving the structure of classical designs. Extensive Monte Carlo studies across diverse system dimensions show systematic and statistically significant reductions, typically exceeding 15%, in normalized estimation error for both open-loop and Luenberger observers. These results demonstrate that modern learning mechanisms can serve as a powerful complement to traditional observer design, yielding more accurate and robust state estimation in uncertain systems. Codes are available at this https URL.

[LG-19] Optimizing Operation Recipes with Reinforcement Learning for Safe and Interpretable Control of Chemical Processes ECML24 ECML

Link: https://arxiv.org/abs/2511.16297
Authors: Dean Brandner, Sergio Lucia
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 16 pages, 3 figures, Part of the workshop ‘Machine Learning for Chemistry and Chemical Engineering (ML4CCE)’ at the ECML24 conference: Link: this https URL

Abstract:Optimal operation of chemical processes is vital for energy, resource, and cost savings in chemical engineering. The problem of optimal operation can be tackled with reinforcement learning, but traditional reinforcement learning methods face challenges due to hard constraints related to quality and safety that must be strictly satisfied, and the large amount of required training data. Chemical processes often cannot provide sufficient experimental data, and while detailed dynamic models can be an alternative, their complexity makes it computationally intractable to generate the needed data. Optimal control methods, such as model predictive control, also struggle with the complexity of the underlying dynamic models. Consequently, many chemical processes rely on manually defined operation recipes combined with simple linear controllers, leading to suboptimal performance and limited flexibility. In this work, we propose a novel approach that leverages expert knowledge embedded in operation recipes. By using reinforcement learning to optimize the parameters of these recipes and their underlying linear controllers, we achieve an optimized operation recipe. This method requires significantly less data, handles constraints more effectively, and is more interpretable than traditional reinforcement learning methods due to the structured nature of the recipes. We demonstrate the potential of our approach through simulation results of an industrial batch polymerization reactor, showing that it can approach the performance of optimal controllers while addressing the limitations of existing methods. 

[LG-20] Graph Diffusion Counterfactual Explanation

Link: https://arxiv.org/abs/2511.16287
Authors: David Bechtoldt, Sidney Bender
Categories: Machine Learning (cs.LG)

Abstract: Machine learning models that operate on graph-structured data, such as molecular graphs or social networks, often make accurate predictions but offer little insight into why certain predictions are made. Counterfactual explanations address this challenge by seeking the closest alternative scenario where the model’s prediction would change. Although counterfactual explanations are extensively studied in tabular data and computer vision, the graph domain remains comparatively underexplored. Constructing graph counterfactuals is intrinsically difficult because graphs are discrete and non-Euclidean objects. We introduce Graph Diffusion Counterfactual Explanation, a novel framework for generating counterfactual explanations on graph data, combining discrete diffusion models and classifier-free guidance. We empirically demonstrate that our method reliably generates in-distribution as well as minimally structurally different counterfactuals for both discrete classification targets and continuous properties.

[LG-21] GeoPTH: A Lightweight Approach to Category-Based Trajectory Retrieval via Geometric Prototype Trajectory Hashing

Link: https://arxiv.org/abs/2511.16258
Authors: Yang Xu, Zuliang Yang, Kai Ming Ting
Categories: Machine Learning (cs.LG)

Abstract: Trajectory similarity retrieval is an important part of spatiotemporal data mining. However, existing methods have the following limitations: traditional metrics are computationally expensive, while learning-based methods suffer from substantial training costs and potential instability. This paper addresses these problems by proposing Geometric Prototype Trajectory Hashing (GeoPTH), a novel, lightweight, and non-learning framework for efficient category-based trajectory retrieval. GeoPTH constructs data-dependent hash functions by using representative trajectory prototypes, i.e., small point sets preserving geometric characteristics, as anchors. The hashing process is efficient: it maps a new trajectory to its closest prototype via a robust Hausdorff metric. Extensive experiments show that GeoPTH’s retrieval accuracy is highly competitive with both traditional metrics and state-of-the-art learning methods, and it significantly outperforms binary codes generated through simple binarization of the learned embeddings. Critically, GeoPTH consistently outperforms all competitors in terms of efficiency. Our work demonstrates that a lightweight, prototype-centric approach offers a practical and powerful alternative, achieving exceptional retrieval performance and computational efficiency.
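
The core hashing step, mapping a trajectory to its nearest prototype under a Hausdorff-type distance, can be sketched in a few lines. This version uses the classic symmetric Hausdorff distance on 2D points. The paper's "robust" variant and its prototype construction are not described in the abstract, so the prototypes and the function `hash_trajectory` below are illustrative assumptions.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Classic symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def hash_trajectory(trajectory, prototypes):
    """Hash code = index of the closest prototype."""
    return min(range(len(prototypes)),
               key=lambda i: hausdorff(trajectory, prototypes[i]))

# Hypothetical prototypes: one eastbound and one northbound path.
prototypes = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
]
query = [(0.1, 0.0), (0.9, 0.1), (2.1, -0.1)]

print(hash_trajectory(query, prototypes))
```

Because each prototype is a small point set, the cost of hashing one trajectory grows with the number of prototypes rather than with the size of the database.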

[LG-22] Pass@k Metric for RLVR: A Diagnostic Tool of Exploration But Not an Objective

Link: https://arxiv.org/abs/2511.16231
Authors: Yang Yu
Categories: Machine Learning (cs.LG)

Abstract:The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of “exploration collapse”, showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.
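
The reweighting claim follows from a single derivative. If a prompt has per-sample success probability `p` and the `k` samples are independent, then pass@k is `1 - (1 - p)**k` and its derivative with respect to `p` is `k * (1 - p)**(k - 1)`. The sketch below computes both under that independence assumption. The function names are illustrative, and the paper's full gradient analysis is not reproduced.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_weight(p: float, k: int) -> float:
    """d pass@k / d p: the positive factor that rescales the pass@1
    gradient for this example."""
    return k * (1.0 - p) ** (k - 1)

k = 8
for p in (0.01, 0.3, 0.9, 0.999):
    gap = pass_at_k(p, k) - pass_at_k(p, 1)
    print(f"p={p}: pass@k={pass_at_k(p, k):.4f}, "
          f"weight={pass_at_k_weight(p, k):.6f}, gap={gap:.4f}")
```

The weight is never negative, so under these assumptions the pass@k gradient for an example always points in the same direction as its pass@1 gradient. As `p` approaches 1 both the weight and the gap between pass@k and pass@1 shrink toward zero.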

[LG-23] Deep SOR Minimax Q-learning for Two-player Zero-sum Game

Link: https://arxiv.org/abs/2511.16226
Authors: Saksham Gautam, Lakshmi Mandal, Shalabh Bhatnagar
Categories: Machine Learning (cs.LG)

Abstract:In this work, we consider the problem of a two-player zero-sum game. In the literature, the successive over-relaxation Q-learning algorithm has been developed and implemented, and it is seen to result in a lower contraction factor for the associated Q-Bellman operator resulting in a faster value iteration-based procedure. However, this has been presented only for the tabular case and not for the setting with function approximation that typically caters to real-world high-dimensional state-action spaces. Furthermore, such settings in the case of two-player zero-sum games have not been considered. We thus propose a deep successive over-relaxation minimax Q-learning algorithm that incorporates deep neural networks as function approximators and is suitable for high-dimensional spaces. We prove the finite-time convergence of the proposed algorithm. Through numerical experiments, we show the effectiveness of the proposed method over the existing Q-learning algorithm. Our ablation studies demonstrate the effect of different values of the crucial successive over-relaxation parameter.

[LG-24] Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty

Link: https://arxiv.org/abs/2511.16225
Authors: Victor Croisfelt, João Henrique Inacio de Souza, Shashi Raj Pandey, Beatriz Soret, Petar Popovski
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, submitted to IEEE ICC 2026

Abstract:Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. Uncertain communication delays across data streams challenge the temporal flow of the inference process. State-of-the-art (SotA) non-blocking inference methods rely on a reference-modality paradigm, requiring one modality input to be fully received before processing, while depending on costly offline profiling. We propose a novel, neuro-inspired non-blocking inference paradigm that primarily employs adaptive temporal windows of integration (TWIs) to dynamically adjust to stochastic delay patterns across heterogeneous streams while relaxing the reference-modality requirement. Our communication-delay-aware framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff. Experiments on the audio-visual event localization (AVEL) task demonstrate superior adaptability to network dynamics compared to SotA approaches.

[LG-25] Mind the Gap: Bridging Prior Shift in Realistic Few-Shot Crop-Type Classification

Link: https://arxiv.org/abs/2511.16218
Authors: Joana Reuss, Ekaterina Gikalo, Marco Körner
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 4 figures

Abstract:Real-world agricultural distributions often suffer from severe class imbalance, typically following a long-tailed distribution. Labeled datasets for crop-type classification are inherently scarce and remain costly to obtain. When working with such limited data, training sets are frequently constructed to be artificially balanced – in particular in the case of few-shot learning – failing to reflect real-world conditions. This mismatch induces a shift between training and test label distributions, degrading real-world generalization. To address this, we propose Dirichlet Prior Augmentation (DirPA), a novel method that simulates an unknown label distribution skew of the target domain proactively during model training. Specifically, we model the real-world distribution as Dirichlet-distributed random variables, effectively performing a prior augmentation during few-shot learning. Our experiments show that DirPA successfully shifts the decision boundary and stabilizes the training process by acting as a dynamic feature regularizer.
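
To give a feel for what "prior augmentation" could mean in practice, the sketch below samples a random class prior from a Dirichlet distribution and shifts a classifier's logits by the log of that prior. This is a standard logit-adjustment pattern. Whether DirPA applies the sampled prior in exactly this way is not stated in the abstract, so the function names, the concentration `alpha`, and the adjustment rule are all assumptions.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) by normalising
    independent Gamma(alpha_i, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def prior_adjusted_probs(logits, prior):
    """Softmax over logits shifted by log(prior), so that classes with
    a larger assumed prior receive more probability mass."""
    shifted = [z + math.log(p) for z, p in zip(logits, prior)]
    top = max(shifted)
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
num_classes = 3
prior = sample_dirichlet([1.0] * num_classes, rng)  # one simulated label skew
logits = [0.0, 0.0, 0.0]                             # an undecided classifier

print([round(p, 3) for p in prior])
print([round(p, 3) for p in prior_adjusted_probs(logits, prior)])
```

Resampling the prior at each training step would expose a model trained on balanced few-shot episodes to many plausible label skews, which matches the motivation given in the abstract.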

[LG-26] Towards Overcoming Data Scarcity in Nuclear Energy: A Study on Critical Heat Flux with Physics-consistent Conditional Diffusion Model

Link: https://arxiv.org/abs/2511.16207
Authors: Farah Alsafadi, Alexandra Akins, Xu Wu
Categories: Machine Learning (cs.LG)

Abstract: Deep generative modeling provides a powerful pathway to overcome data scarcity in energy-related applications where experimental data are often limited, costly, or difficult to obtain. By learning the underlying probability distribution of the training dataset, deep generative models, such as the diffusion model (DM), can generate high-fidelity synthetic samples that statistically resemble the training data. Such synthetic data generation can significantly enrich the size and diversity of the available training data, and more importantly, improve the robustness of downstream machine learning models in predictive tasks. The objective of this paper is to investigate the effectiveness of DM for overcoming data scarcity in nuclear energy applications. By leveraging a public dataset on critical heat flux (CHF) that covers a wide range of commercial nuclear reactor operational conditions, we developed a DM that can generate an arbitrary number of synthetic samples for augmenting the CHF dataset. Since a vanilla DM can only generate samples randomly, we also developed a conditional DM capable of generating targeted CHF data under user-specified thermal-hydraulic conditions. The performance of the DMs was evaluated based on their ability to capture empirical feature distributions and pair-wise correlations, as well as to maintain physical consistency. The results showed that both the DM and conditional DM can successfully generate realistic and physics-consistent CHF data. Furthermore, uncertainty quantification was performed to establish confidence in the generated data. The results demonstrated that the conditional DM is highly effective in augmenting CHF data while maintaining acceptable levels of uncertainty.

[LG-27] Causal Synthetic Data Generation in Recruitment ECAI2025

Link: https://arxiv.org/abs/2511.16204
Authors: Andrea Iommi, Antonio Mastropietro, Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: Published. Conference: AEQUITAS 2025: Workshop on Fairness and Bias in AI | co-located with ECAI 2025, Bologna, Italy

Abstract:The importance of Synthetic Data Generation (SDG) has increased significantly in domains where data quality is poor or access is limited due to privacy and regulatory constraints. One such domain is recruitment, where publicly available datasets are scarce due to the sensitive nature of information typically found in curricula vitae, such as gender, disability status, or age. This lack of accessible, representative data presents a significant obstacle to the development of fair and transparent machine learning models, particularly ranking algorithms that require large volumes of data to effectively learn how to recommend candidates. In the absence of such data, these models are prone to poor generalisation and may fail to perform reliably in real-world scenarios. Recent advances in Causal Generative Models (CGMs) offer a promising solution. CGMs enable the generation of synthetic datasets that preserve the underlying causal relationships within the data, providing greater control over fairness and interpretability in the data generation process. In this study, we present a specialised SDG method involving two CGMs: one modelling job offers and the other modelling curricula. Each model is structured according to a causal graph informed by domain expertise. We use these models to generate synthetic datasets and evaluate the fairness of candidate rankings under controlled scenarios that introduce specific biases.

[LG-28] A Switching Framework for Online Interval Scheduling with Predictions AAAI2026

Link: https://arxiv.org/abs/2511.16194
Authors: Antonios Antoniadis, Ali Shahheidar, Golnoosh Shahkarami, Abolfazl Soltani
Categories: Machine Learning (cs.LG)
Comments: This paper will appear in AAAI 2026

Abstract:We study online interval scheduling in the irrevocable setting, where each interval must be immediately accepted or rejected upon arrival. The objective is to maximize the total length of accepted intervals while ensuring that no two accepted intervals overlap. We consider this problem in a learning-augmented setting, where the algorithm has access to (machine-learned) predictions. The goal is to design algorithms that leverage these predictions to improve performance while maintaining robust guarantees in the presence of prediction errors. Our main contribution is the SemiTrust-and-Switch framework, which provides a unified approach for combining prediction-based and classical interval scheduling algorithms. This framework applies to both deterministic and randomized algorithms and captures the trade-off between consistency (performance under accurate predictions) and robustness (performance under adversarial inputs). Moreover, we provide lower bounds, proving the tightness of this framework in particular settings. We further design a randomized algorithm that smoothly interpolates between prediction-based and robust algorithms. This algorithm achieves both robustness and smoothness: its performance degrades gracefully with the quality of the prediction.

[LG-29] ART: A Graph-based Framework for Investigating Illicit Activity in Monero via Address-Ring-Transaction Structures

Link: https://arxiv.org/abs/2511.16192
Authors: Andrea Venturi, Imanol Jerico-Yoldi, Francesco Zola, Raul Orduna
Categories: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Paper accepted @ BLOCKCHAIN CRYPTOCURRENCY CONFERENCE (B2C’2025)

Abstract:As Law Enforcement Agencies advance in cryptocurrency forensics, criminal actors aiming to conceal illicit fund movements increasingly turn to “mixin” services or privacy-based cryptocurrencies. Monero stands out as a leading choice due to its strong privacy-preserving and untraceability properties, which make conventional blockchain analysis ineffective. Understanding the behavior and operational patterns of criminal actors within Monero is therefore challenging, yet it is essential for supporting future investigative strategies and disrupting illicit activities. In this work, we propose a case study in which we leverage a novel graph-based methodology to extract structural and temporal patterns from Monero transactions linked to already discovered criminal activities. By building Address-Ring-Transaction graphs from flagged transactions, we extract structural and temporal features and use them to train Machine Learning models capable of detecting similar behavioral patterns that could highlight criminal modus operandi. This represents a first partial step toward developing analytical tools that support investigative efforts in privacy-preserving blockchain ecosystems.

[LG-30] CausalMamba: Interpretable State Space Modeling for Temporal Rumor Causality

Link: https://arxiv.org/abs/2511.16191
Authors: Xiaotong Zhan, Xi Cheng
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Preprint. 9 pages, 3 figures, 2 tables. Code and implementation details available at: this https URL

Abstract:Rumor detection on social media remains a challenging task due to the complex propagation dynamics and the limited interpretability of existing models. While recent neural architectures capture content and structural features, they often fail to reveal the underlying causal mechanisms of misinformation spread. We propose CausalMamba, a novel framework that integrates Mamba-based sequence modeling, graph convolutional networks (GCNs), and differentiable causal discovery via NOTEARS. CausalMamba learns joint representations of temporal tweet sequences and reply structures, while uncovering latent causal graphs to identify influential nodes within each propagation chain. Experiments on the Twitter15 dataset show that our model achieves competitive classification performance compared to strong baselines, and uniquely enables counterfactual intervention analysis. Qualitative results demonstrate that removing top-ranked causal nodes significantly alters graph connectivity, offering interpretable insights into rumor dynamics. Our framework provides a unified approach for rumor classification and influence analysis, paving the way for more explainable and actionable misinformation detection systems.

[LG-31] Achieving Skilled and Reliable Daily Probabilistic Forecasts of Wind Power at Subseasonal-to-Seasonal Timescales over France

Link: https://arxiv.org/abs/2511.16164
Authors: Eloi Lindas, Yannig Goude, Philippe Ciais
Categories: Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Accurate and reliable wind power forecasts are crucial for grid stability, balancing supply and demand, and market risk management. Even though short-term weather forecasts have been extensively used to provide short-term renewable power predictions, forecasts involving longer prediction horizons still need investigation. Despite the recent progress in subseasonal-to-seasonal probabilistic weather forecasting, its use for wind power prediction usually involves both temporal and spatial aggregation to achieve reasonable skill. In this study, we present a forecasting pipeline that transforms ECMWF subseasonal-to-seasonal weather forecasts into wind power forecasts for lead times ranging from 1 day to 46 days at daily resolution. This framework also includes post-processing of the resulting power ensembles to account for the biases and lack of dispersion of the weather forecasts. We show that our method is able to outperform a climatological baseline by 50% in terms of both Continuous Ranked Probability Skill Score and Ensemble Mean Squared Error while also providing near-perfect calibration of the forecasts for lead times ranging from 15 to 46 days.
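
For readers unfamiliar with the metric named in this abstract, the sketch below shows the standard ensemble estimator of the Continuous Ranked Probability Score (CRPS) and the skill score derived from it. It is a generic illustration, not the authors' pipeline; the function names `ensemble_crps` and `crpss` and the toy numbers are assumptions made for this example.

```python
def ensemble_crps(members, obs):
    """Standard ensemble CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    # Mean absolute error of the members against the observation.
    accuracy = sum(abs(x - obs) for x in members) / m
    # Half the mean absolute difference between all member pairs (spread term).
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


def crpss(forecast_crps, reference_crps):
    """Skill score relative to a reference forecast (1 is perfect, 0 is no gain)."""
    return 1.0 - forecast_crps / reference_crps


# Toy example (hypothetical values): a two-member forecast versus a
# one-member reference, both scored against the same observation.
forecast_score = ensemble_crps([0.0, 2.0], obs=1.0)
reference_score = ensemble_crps([2.0], obs=1.0)
skill = crpss(forecast_score, reference_score)
```

Under this definition, a forecast whose CRPS is half that of the climatological reference has a skill score of 0.5, which is one common way to read an improvement of 50% over the baseline.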

[LG-32] MagBotSim: Physics-Based Simulation and Reinforcement Learning Environments for Magnetic Robotics

Link: https://arxiv.org/abs/2511.16158
Authors: Lara Bergmann, Cedric Grothues, Klaus Neumann
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Magnetic levitation is about to revolutionize in-machine material flow in industrial automation. Such systems are flexibly configurable and can include a large number of independently actuated shuttles (movers) that dynamically rebalance production capacity. Beyond their capabilities for dynamic transportation, these systems possess the inherent yet unexploited potential to perform manipulation. By merging the fields of transportation and manipulation into a coordinated swarm of magnetic robots (MagBots), we enable manufacturing systems to achieve significantly higher efficiency, adaptability, and compactness. To support the development of intelligent algorithms for magnetic levitation systems, we introduce MagBotSim (Magnetic Robotics Simulation): a physics-based simulation for magnetic levitation systems. By framing magnetic levitation systems as robot swarms and providing a dedicated simulation, this work lays the foundation for next generation manufacturing systems powered by Magnetic Robotics. MagBotSim’s documentation, videos, experiments, and code are available at: this https URL

[LG-33] Enhancing Nuclear Reactor Core Simulation through Data-Based Surrogate Models

Link: https://arxiv.org/abs/2511.16148
Authors: Perceval Beja-Battais (CB), Alain Grossetête, Nicolas Vayatis (CB)
Categories: Machine Learning (cs.LG)

Abstract:In recent years, there has been an increasing need for Nuclear Power Plants (NPPs) to improve flexibility in order to match the rapid growth of renewable energies. The Operator Assistance Predictive System (OAPS) developed by Framatome addresses this problem through Model Predictive Control (MPC). In this work, we aim to improve MPC methods through data-driven simulation schemes. Thus, from a set of nonlinear stiff ordinary differential equations (ODEs), this paper introduces two surrogate models acting as alternative simulation schemes to enhance nuclear reactor core simulation. We show that both data-driven and physics-informed models can rapidly integrate complex dynamics, with a very low computational time (up to 1000x time reduction).

[LG-34] An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text

Link: https://arxiv.org/abs/2511.16132
Authors: Paula Joy B. Martinez, Jose Marie Antonio Miñoza, Sebastian C. Ibañez
Categories: Machine Learning (cs.LG)

Abstract:Emotion recognition from social media is critical for understanding public sentiment, but accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework where Shapley Additive Explanations (SHAP) provide principled guidance for LLM-based synthetic data generation. With sufficient seed data, the SHAP-guided approach matches real data performance, significantly outperforms naïve generation, and substantially improves classification for underrepresented emotion classes. However, our linguistic analysis reveals that synthetic text exhibits reduced vocabulary richness and fewer personal or temporally complex expressions than authentic posts. This work provides both a practical framework for responsible synthetic data generation and a critical perspective on its limitations, underscoring that the future of trustworthy AI depends on navigating the trade-offs between synthetic utility and real-world authenticity.

[LG-35] Pathlet Variational Auto-Encoder for Robust Trajectory Generation

Link: https://arxiv.org/abs/2511.16105
Authors: Yuanbo Tang, Yan Tang, Zixuan Zhang, Zihui Zhao, Yang Li
Categories: Machine Learning (cs.LG)

Abstract:Trajectory generation has recently drawn growing interest in privacy-preserving urban mobility studies and location-based service applications. Although many studies have used deep learning or generative AI methods to model trajectories and have achieved promising results, the robustness and interpretability of such models are largely unexplored. This limits the application of trajectory generation algorithms on noisy real-world data and their trustworthiness in downstream tasks. To address this issue, we exploit the regular structure in urban trajectories and propose a deep generative model based on the pathlet representation, which encodes trajectories with binary vectors associated with a learned dictionary of trajectory segments. Specifically, we introduce a probabilistic graphical model to describe the trajectory generation process, which includes a Variational Autoencoder (VAE) component and a linear decoder component. During training, the model can simultaneously learn the latent embedding of pathlet representations and the pathlet dictionary that captures mobility patterns in the trajectory dataset. The conditional version of our model can also be used to generate customized trajectories based on temporal and spatial constraints. Our model can effectively learn the data distribution even using noisy data, achieving relative improvements of 35.4% and 26.3% over strong baselines on two real-world trajectory datasets. Moreover, the generated trajectories can be conveniently utilized for multiple downstream tasks, including trajectory prediction and data denoising. Lastly, the framework design offers a significant efficiency advantage, saving 64.8% of the time and 56.5% of GPU memory compared to previous approaches.

[LG-36] HybSpecNet: A Critical Analysis of Architectural Instability in Hybrid-Domain Spectral GNNs

Link: https://arxiv.org/abs/2511.16101
Authors: Huseyin Goksu
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Spectral Graph Neural Networks offer a principled approach to graph filtering but face a fundamental “Stability-vs-Adaptivity” trade-off. This trade-off is dictated by the choice of spectral domain. Filters in the finite [-1, 1] domain (e.g., ChebyNet) are numerically stable at high polynomial degrees (K) but are static and low-pass, causing them to fail on heterophilic graphs. Conversely, filters in the semi-infinite [0, ∞) domain (e.g., KrawtchoukNet) are highly adaptive and achieve SOTA results on heterophily by learning non-low-pass responses. However, as we demonstrate, these adaptive filters can also suffer from numerical instability, leading to catastrophic performance collapse at high K. In this paper, we propose to resolve this trade-off by designing a hybrid-domain GNN, HybSpecNet, which combines a stable ChebyNet branch with an adaptive KrawtchoukNet branch. We first demonstrate that a “naive” hybrid architecture, which fuses the branches via concatenation, successfully unifies performance at low K, achieving strong results on both homophilic and heterophilic benchmarks. However, we then prove that this naive architecture fails the stability test. Our K-ablation experiments show that this architecture catastrophically collapses at K=25, exactly mirroring the collapse of its unstable KrawtchoukNet branch. We identify this critical finding as “Instability Poisoning,” where NaN/Inf gradients from the adaptive branch destroy the training of the model. Finally, we propose and validate an advanced architecture that uses “Late Fusion” to completely isolate the gradient pathways. We demonstrate that this successfully solves the instability problem, remaining perfectly stable up to K=30 while retaining its SOTA performance across all graph types. This work identifies a critical architectural pitfall in hybrid GNN design and provides the robust architectural solution.
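
The stability of filters on the [-1, 1] domain mentioned above is often explained by a textbook property of Chebyshev polynomials: they stay bounded by 1 on that interval at any degree, while growing very quickly outside it. The sketch below checks this numerically with the three-term recurrence. It is a scalar illustration under that assumption, not the paper's experiment, and the helper name `chebyshev` is chosen for this example only.

```python
import math


def chebyshev(k, x):
    """Evaluate T_k(x) via the recurrence T_k = 2*x*T_{k-1} - T_{k-2}."""
    t_prev, t_curr = 1.0, x  # T_0(x) = 1 and T_1(x) = x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr


# Degree 25 on an even grid over [-1, 1]: values should not exceed 1 in magnitude.
grid = [i / 50.0 - 1.0 for i in range(101)]
max_abs_inside = max(abs(chebyshev(25, x)) for x in grid)

# The same polynomial evaluated outside the interval grows very large.
value_outside = chebyshev(25, 1.5)
```

In a graph filter the argument is a rescaled Laplacian whose spectrum lies in [-1, 1], so the same bound applies to each eigenvalue.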

[LG-37] AssayMatch: Learning to Select Data for Molecular Activity Models

Link: https://arxiv.org/abs/2511.16087
Authors: Vincent Fan, Regina Barzilay
Categories: Machine Learning (cs.LG)

Abstract:The performance of machine learning models in drug discovery is highly dependent on the quality and consistency of the underlying training data. Due to limitations in dataset sizes, many models are trained by aggregating bioactivity data from diverse sources, including public databases such as ChEMBL. However, this approach often introduces significant noise due to variability in experimental protocols. We introduce AssayMatch, a framework for data selection that builds smaller, more homogenous training sets attuned to the test set of interest. AssayMatch leverages data attribution methods to quantify the contribution of each training assay to model performance. These attribution scores are used to finetune language embeddings of text-based assay descriptions to capture not just semantic similarity, but also the compatibility between assays. Unlike existing data attribution methods, our approach enables data selection for a test set with unknown labels, mirroring real-world drug discovery campaigns where the activities of candidate molecules are not known in advance. At test time, embeddings finetuned with AssayMatch are used to rank all available training data. We demonstrate that models trained on data selected by AssayMatch are able to surpass the performance of the model trained on the complete dataset, highlighting its ability to effectively filter out harmful or noisy experiments. We perform experiments on two common machine learning architectures and see increased prediction capability over a strong language-only baseline for 9/12 model-target pairs. AssayMatch provides a data-driven mechanism to curate higher-quality datasets, reducing noise from incompatible experiments and improving the predictive power and data efficiency of models for drug discovery. AssayMatch is available at this https URL.

[LG-38] L-JacobiNet and S-JacobiNet: An Analysis of Adaptive Generalization, Stabilization, and Spectral Domain Trade-offs in GNNs

Link: https://arxiv.org/abs/2511.16081
Authors: Huseyin Goksu
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Spectral GNNs, like ChebyNet, are limited by heterophily and over-smoothing due to their static, low-pass filter design. This work investigates the “Adaptive Orthogonal Polynomial Filter” (AOPF) class as a solution. We introduce two models operating in the [-1, 1] domain: 1) L-JacobiNet, the adaptive generalization of ChebyNet with learnable alpha, beta shape parameters, and 2) S-JacobiNet, a novel baseline representing a LayerNorm-stabilized static ChebyNet. Our analysis, comparing these models against AOPFs in the [0, ∞) domain (e.g., LaguerreNet), reveals critical, previously unknown trade-offs. We find that the [0, ∞) domain is superior for modeling heterophily, while the [-1, 1] domain (Jacobi) provides superior numerical stability at high K (K > 20). Most significantly, we discover that ChebyNet’s main flaw is stabilization, not its static nature. Our static S-JacobiNet (ChebyNet+LayerNorm) outperforms the adaptive L-JacobiNet on 4 out of 5 benchmark datasets, identifying S-JacobiNet as a powerful, overlooked baseline and suggesting that adaptation in the [-1, 1] domain can lead to overfitting.

[LG-39] ILoRA: Federated Learning with Low-Rank Adaptation for Heterogeneous Client Aggregation

Link: https://arxiv.org/abs/2511.16069
Authors: Junchao Zhou, Junkang Liu, Fanhua Shang
Categories: Machine Learning (cs.LG)

Abstract:Federated Learning with Low-Rank Adaptation (LoRA) faces three critical challenges under client heterogeneity: (1) Initialization-Induced Instability due to random initialization misaligning client subspaces; (2) Rank Incompatibility and Aggregation Error when averaging LoRA parameters of different ranks, which biases the global model; and (3) exacerbated Client Drift under Non-IID Data, impairing generalization. To address these challenges, we propose ILoRA, a unified framework that integrates three core innovations: a QR-based orthonormal initialization to ensure all clients start in a coherent subspace; a Concatenated QR Aggregation mechanism that fuses heterogeneous-rank updates via concatenation and decomposition, preserving information while maintaining dimension alignment; and an AdamW optimizer with rank-aware control variates to correct local updates and mitigate client drift. Supported by theoretical convergence guarantees, extensive experiments on vision and NLP benchmarks demonstrate that ILoRA consistently achieves superior accuracy and convergence stability compared to existing federated LoRA methods.
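
To make the rank-incompatibility point concrete, the sketch below illustrates only the concatenation half of the aggregation idea: stacking the clients' LoRA factors side by side yields a product equal to the sum of the individual low-rank updates, even when the ranks differ. The QR decomposition step, the client weighting, and the matrix values are not taken from the paper; the helpers `matmul`, `hstack`, and `vstack` are assumptions written for this toy example.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def hstack(mats):
    """Concatenate matrices column-wise (they must share a row count)."""
    return [sum((m[i] for m in mats), []) for i in range(len(mats[0]))]


def vstack(mats):
    """Concatenate matrices row-wise (they must share a column count)."""
    return [row for m in mats for row in m]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Hypothetical client 1 with rank 1: B1 is 3x1 and A1 is 1x2.
B1, A1 = [[1], [0], [2]], [[1, 2]]
# Hypothetical client 2 with rank 2: B2 is 3x2 and A2 is 2x2.
B2, A2 = [[1, 0], [0, 1], [1, 1]], [[3, 0], [0, 1]]

# Sum of the individual low-rank updates B_i @ A_i.
summed_updates = add(matmul(B1, A1), matmul(B2, A2))

# The same update obtained from the concatenated factors (rank 1 + 2 = 3).
concatenated_update = matmul(hstack([B1, B2]), vstack([A1, A2]))
```

Averaging the B and A factors separately is not even defined here because the shapes differ, which is one reason concatenation is attractive before any re-factorization to a fixed rank.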

[LG-40] Gauge-Equivariant Graph Networks via Self-Interference Cancellation

Link: https://arxiv.org/abs/2511.16062
Authors: Yoonhyuk Choi, Chong-Kwon Kim
Categories: Machine Learning (cs.LG)

Abstract:Graph Neural Networks (GNNs) excel on homophilous graphs but often fail under heterophily due to self-reinforcing and phase-inconsistent signals. We propose a Gauge-Equivariant Graph Network with Self-Interference Cancellation (GESC), which replaces additive aggregation with a projection-based interference mechanism. Unlike prior magnetic or gauge-equivariant GNNs that typically focus on phase handling in spectral filtering while largely relying on scalar weighting, GESC introduces a U(1) phase connection followed by a rank-1 projection that attenuates self-parallel components before attention. A sign- and phase-aware gate further regulates neighbor influence, attenuating components aligned with current node states and acting as a local notch on low-frequency modes. Across diverse graph benchmarks, our method consistently outperforms recent state-of-the-art models while offering a unified, interference-aware view of message passing. Our code is available at this https URL.

[LG-41] Change-of-Basis Pruning via Rotational Invariance

Link: https://arxiv.org/abs/2511.16061
Authors: Alex Ning, Vainateya Rangaraju
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Abstract:Structured pruning removes entire neurons or channels, but its effectiveness depends on how importance is distributed across the representation space. Change-of-basis (CoB) pruning addresses this challenge by applying orthogonal linear transformations that concentrate importance within certain dimensions. However, many standard deep learning architectures are not inherently invariant to such transformations. To enable compatibility, we introduce two-subspace radial activations (TSRAs): an activation family that is invariant to orthogonal linear transformations applied independently within its two activation subspaces. This invariance allows CoB transformations to be merged into surrounding weights without incurring extra parameters. We position this work as a proof-of-concept that a rotationally invariant design may offer a principled approach towards change-of-basis pruning. We do not provide an analysis of multiple TSRA candidates nor do we explore weight initialization for any TSRAs. These limitations, combined with other necessary modifications we make to permit rotational invariance, result in a slight accuracy drop of 4.52% compared to a ReLU-based control. However, using activation-magnitude importance, VGG-16 implementing our CoB+TSRA framework shows encouraging results on CIFAR-10. Under fixed-ratio structured pruning, CoB improves accuracy over a TSRA baseline at all pruning ratios and extends the reliable pruning frontier from roughly 30% to 70% of parameters without post-prune fine-tuning. Under threshold-based pruning strategies, CoB prunes 90-96% of parameters while maintaining a 1-6% accuracy drop after fine-tuning. Together, these results indicate that rotationally invariant architectures may offer a promising path towards CoB pruning.
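
The invariance property described above can be illustrated with a generic radial activation, one that rescales a vector by a function of its norm and therefore commutes with any rotation. The sketch below is an assumption-laden toy: the specific form `tanh(r) / r`, the even split into two halves, and all function names are chosen for illustration and are not the paper's TSRA definition.

```python
import math


def radial(v):
    """Rescale v by tanh(r) / r, where r is the Euclidean norm of v."""
    r = math.sqrt(sum(x * x for x in v))
    if r == 0.0:
        return list(v)
    scale = math.tanh(r) / r
    return [scale * x for x in v]


def two_subspace_radial(v):
    """Apply the radial activation independently to each half of v."""
    half = len(v) // 2
    return radial(v[:half]) + radial(v[half:])


def rotate2d(v, theta):
    """Rotate a 2-vector by angle theta (an orthogonal transform)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def block_rotate(v, theta_a, theta_b):
    """Rotate the two 2-dimensional subspaces of a 4-vector independently."""
    return rotate2d(v[:2], theta_a) + rotate2d(v[2:], theta_b)


x = [0.5, -1.2, 2.0, 0.3]
# Rotating before the activation should match rotating after it.
rotate_then_activate = two_subspace_radial(block_rotate(x, 0.7, -1.1))
activate_then_rotate = block_rotate(two_subspace_radial(x), 0.7, -1.1)
max_gap = max(abs(a - b) for a, b in zip(rotate_then_activate, activate_then_rotate))
```

Because the activation commutes with such block rotations, a rotation placed before it can be moved past it and folded into the next layer's weights, which is the sense in which a change of basis adds no parameters. A ReLU does not have this property in general.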

[LG-42] Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Link: https://arxiv.org/abs/2511.16043
Authors: Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao
Categories: Machine Learning (cs.LG)

Abstract:Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model’s inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor’s problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at this https URL.

[LG-43] Digital Agriculture Sandbox for Collaborative Research

Link: https://arxiv.org/abs/2511.15990
Authors: Osama Zafar, Rosemarie Santa González, Alfonso Morales, Erman Ayday
Categories: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Presents a privacy-preserving digital agriculture platform using federated learning, differential privacy, and secure data analysis to enable collaboration between farmers and researchers without exposing raw data. Demonstrates secure similarity search, model training, and risk-aware data sharing

Abstract:Digital agriculture is transforming the way we grow food by utilizing technology to make farming more efficient, sustainable, and productive. This modern approach to agriculture generates a wealth of valuable data that could help address global food challenges, but farmers are hesitant to share it due to privacy concerns. This limits the extent to which researchers can learn from this data to inform improvements in farming. This paper presents the Digital Agriculture Sandbox, a secure online platform that solves this problem. The platform enables farmers (with limited technical resources) and researchers to collaborate on analyzing farm data without exposing private information. We employ specialized techniques such as federated learning, differential privacy, and data analysis methods to safeguard the data while maintaining its utility for research purposes. The system enables farmers to identify similar farmers in a simplified manner without needing extensive technical knowledge or access to computational resources. Similarly, it enables researchers to learn from the data and build helpful tools without the sensitive information ever leaving the farmer’s system. This creates a safe space where farmers feel comfortable sharing data, allowing researchers to make important discoveries. Our platform helps bridge the gap between maintaining farm data privacy and utilizing that data to address critical food and farming challenges worldwide.

[LG-44] Descend or Rewind? Stochastic Gradient Descent Unlearning

Link: https://arxiv.org/abs/2511.15983
Authors: Siqiao Mu, Diego Klabjan
Categories: Machine Learning (cs.LG)

Abstract:Machine unlearning algorithms aim to remove the impact of selected training data from a model without the computational expenses of retraining from scratch. Two such algorithms are “Descent-to-Delete” (D2D) and “Rewind-to-Delete” (R2D), full-batch gradient descent algorithms that are easy to implement and satisfy provable unlearning guarantees. In particular, the stochastic version of D2D is widely implemented as the “finetuning” unlearning baseline, despite lacking theoretical backing on nonconvex functions. In this work, we prove (ε, δ) certified unlearning guarantees for stochastic R2D and D2D for strongly convex, convex, and nonconvex loss functions, by analyzing unlearning through the lens of disturbed or biased gradient systems, which may be contracting, semi-contracting, or expansive respectively. Our argument relies on optimally coupling the random behavior of the unlearning and retraining trajectories, resulting in a probabilistic sensitivity bound that can be combined with a novel relaxed Gaussian mechanism to achieve (ε, δ) unlearning. We determine that D2D can yield tighter guarantees for strongly convex functions compared to R2D by relying on contraction to a unique global minimum. However, unlike D2D, R2D can achieve unlearning in the convex and nonconvex setting because it draws the unlearned model closer to the retrained model by reversing the accumulated disturbances.

[LG-45] Machine Learning Epidemic Predictions Using Agent-based Wireless Sensor Network Models

Link: https://arxiv.org/abs/2511.15982
Authors: Chukwunonso Henry Nwokoye, Blessing Oluchi, Sharna Waldron, Peace Ezzeh
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 8 pages

Abstract:The lack of epidemiological data in wireless sensor networks (WSNs) is a fundamental difficulty in constructing robust models to forecast and mitigate threats such as viruses and worms. Many studies have examined different epidemic models for WSNs, focusing on how malware infections spread given the network’s specific properties, including energy limits and node mobility. In this study, an agent-based implementation of the susceptible-exposed-infected-recovered-vaccinated (SEIRV) mathematical model was employed for machine learning (ML) predictions. Using tools such as NetLogo’s BehaviorSpace and Python, two epidemic synthetic datasets were generated and prepared for the application of several ML algorithms. Posed as a regression problem, the infected and recovered nodes were predicted, and the performance of these algorithms is compared using the error metrics of the train and test sets. The predictions performed well, with low error metrics and high R^2 values (0.997, 1.000, 0.999, 1.000), indicating an effective fit to the training set. The validation values were lower (0.992, 0.998, 0.971, and 0.999), as is typical when evaluating model performance on unseen data. Based on the recorded performances, support vector, linear, Lasso, Ridge, and ElasticNet regression were among the worst-performing algorithms, while Random Forest, XGBoost, Decision Trees, and k-nearest neighbors achieved the best results.

[LG-46] Unified all-atom molecule generation with neural fields NEURIPS2025

Link: https://arxiv.org/abs/2511.15906
Authors: Matthieu Kirchmeyer, Pedro O. Pinheiro, Emma Willett, Karolis Martinkus, Joseph Kleinhenz, Emily K. Makowski, Andrew M. Watkins, Vladimir Gligorijevic, Richard Bonneau, Saeed Saremi
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: NeurIPS 2025

Abstract:Generative models for structure-based drug design are often limited to a specific modality, restricting their broader applicability. To address this challenge, we introduce FuncBind, a framework based on computer vision to generate target-conditioned, all-atom molecules across atomic systems. FuncBind uses neural fields to represent molecules as continuous atomic densities and employs score-based generative models with modern architectures adapted from the computer vision literature. This modality-agnostic representation allows a single unified model to be trained on diverse atomic systems, from small to large molecules, and handle variable atom/residue counts, including non-canonical amino acids. FuncBind achieves competitive in silico performance in generating small molecules, macrocyclic peptides, and antibody complementarity-determining region loops, conditioned on target structures. FuncBind also generated in vitro novel antibody binders via de novo redesign of the complementarity-determining region H3 loop of two chosen co-crystal structures. As a final contribution, we introduce a new dataset and benchmark for structure-conditioned macrocyclic peptide generation. The code is available at this https URL.

[LG-47] Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization

Link: https://arxiv.org/abs/2511.15898
Authors: Rahul Krishna Thomas, Arka Pal
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step $n$ draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over $V^n$ variables, with $V$ being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most $V$ variables. This allows us to devise an algorithm for optimal $n$-draft speculative sampling when the $n$ tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various $n$ and top-$k$ draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
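
For orientation, the single-draft accept-or-resample step that the abstract starts from can be sketched as follows. This is a minimal illustration of the standard single-draft rule only, not the paper's multi-draft algorithm; the function name `speculative_step` and the toy distributions are hypothetical.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """One single-draft speculative sampling step over a finite vocabulary.

    p_target, q_draft: lists of token probabilities (same length, each sums to 1).
    Returns (token, accepted), where `accepted` says whether the draft token was kept.
    """
    vocab = list(range(len(p_target)))
    # The cheap draft model proposes a candidate token x ~ q.
    x = rng.choices(vocab, weights=q_draft)[0]
    # Accept the candidate with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the residual distribution max(p - q, 0),
    # which makes the overall output distributed exactly according to p.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    return rng.choices(vocab, weights=residual)[0], False
```

Under this rule the probability of accepting the draft token is the sum over tokens of min(p, q), which is the quantity the multi-draft extension tries to increase by proposing several candidates per step.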

[LG-48] GLOBE: Accurate and Generalizable PDE Surrogates using Domain-Inspired Architectures and Equivariances

Link: https://arxiv.org/abs/2511.15856
Authors: Peter Sharpe
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Click to view abstract

Abstract:We introduce GLOBE, a new neural surrogate for homogeneous PDEs that draws inductive bias from boundary-element methods and equivariant ML. GLOBE represents solutions as superpositions of learnable Green’s-function-like kernels evaluated from boundary faces to targets, composed across multiscale branches and communication hyperlayers. The architecture is translation-, rotation-, and parity-equivariant; discretization-invariant in the fine-mesh limit; and units-invariant via rigorous nondimensionalization. An explicit far-field decay envelope stabilizes extrapolation, boundary-to-boundary hyperlayer communication mediates long-range coupling, and the all-to-all boundary-to-target evaluation yields a global receptive field that respects PDE information flow, even for elliptic PDEs. On AirFRANS (steady incompressible RANS over NACA airfoils), GLOBE achieves substantial accuracy improvements. On the “Full” split, it reduces mean-squared error by roughly 200x on all fields relative to the dataset’s reference baselines, and roughly 50x relative to the next-best-performing model. In the “Scarce” split, it achieves over 100x lower error on velocity and pressure fields and over 600x lower error on surface pressure than Transolver. Qualitative results show sharp near-wall gradients, coherent wakes, and limited errors under modest extrapolation in Reynolds number and angle of attack. In addition to this accuracy, the model is quite compact (117k parameters), and fields can be evaluated at arbitrary points during inference. We also demonstrate the ability to train and predict with non-watertight meshes, which has strong practical implications. These results show that rigorous physics- and domain-inspired inductive biases can achieve large gains in accuracy, generalizability, and practicality for ML-based PDE surrogates for industrial computer-aided engineering (CAE). 
Cite as: arXiv:2511.15856 [cs.LG] (or arXiv:2511.15856v1 [cs.LG] for this version); DOI: https://doi.org/10.48550/arXiv.2511.15856; submitted by Peter Sharpe, [v1] Wed, 19 Nov 2025 20:23:51 UTC (17,749 KB)

[LG-49] discretize_distributions: Efficient Quantization of Gaussian Mixtures with Guarantees in Wasserstein Distance

Link: https://arxiv.org/abs/2511.15854
Authors: Steven Adams, Elize Alwash, Luca Laurenti
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:We present discretize_distributions, a Python package that efficiently constructs discrete approximations of Gaussian mixture distributions and provides guarantees on the approximation error in Wasserstein distance. The package implements state-of-the-art quantization methods for Gaussian mixture models and extends them to improve scalability. It further integrates complementary quantization strategies such as sigma-point methods and provides a modular interface that supports custom schemes and integration into control and verification pipelines for cyber-physical systems. We benchmark the package on various examples, including high-dimensional, large, and degenerate Gaussian mixtures, and demonstrate that discretize_distributions produces accurate approximations at low computational cost.

[LG-50] Transparent Early ICU Mortality Prediction with Clinical Transformer and Per-Case Modality Attribution

Link: https://arxiv.org/abs/2511.15847
Authors: Alexander Bakumenko, Janine Hoelscher, Hudson Smith (Clemson University, USA)
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Early identification of intensive care patients at risk of in-hospital mortality enables timely intervention and efficient resource allocation. Despite high predictive performance, existing machine learning approaches lack transparency and robustness, limiting clinical adoption. We present a lightweight, transparent multimodal ensemble that fuses physiological time-series measurements with unstructured clinical notes from the first 48 hours of an ICU stay. A logistic regression model combines predictions from two modality-specific models: a bidirectional LSTM for vitals and a finetuned ClinicalModernBERT transformer for notes. This traceable architecture allows for multilevel interpretability: feature attributions within each modality and direct per-case modality attributions quantifying how vitals and notes influence each decision. On the MIMIC-III benchmark, our late-fusion ensemble improves discrimination over the best single model (AUPRC 0.565 vs. 0.526; AUROC 0.891 vs. 0.876) while maintaining well-calibrated predictions. The system remains robust through a calibrated fallback when a modality is missing. These results demonstrate competitive performance with reliable, auditable risk estimates and transparent, predictable operation, which together are crucial for clinical use.
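
The late-fusion idea described above, where a logistic regression combines two modality-specific predictions and each case gets a per-modality attribution, can be sketched roughly as below. This is an illustrative sketch under assumptions: the abstract does not say whether the fusion model consumes probabilities or logits, so logit inputs are assumed here, and the function `fuse` and its default weights are hypothetical rather than the authors' fitted values.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse(p_vitals, p_notes, w_vitals=1.0, w_notes=1.0, bias=0.0):
    """Combine two modality-level risk estimates with a logistic model.

    Returns the fused risk and each modality's additive contribution
    to the fused log-odds (a simple per-case modality attribution).
    The weights and bias are placeholder values, not fitted ones.
    """
    contrib_vitals = w_vitals * logit(p_vitals)
    contrib_notes = w_notes * logit(p_notes)
    risk = sigmoid(bias + contrib_vitals + contrib_notes)
    return risk, {"vitals": contrib_vitals, "notes": contrib_notes}
```

Because the fused score is linear in log-odds space, the two contributions sum (with the bias) to the final log-odds, so the share attributable to vitals versus notes can be read off directly for each patient.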

[LG-51] Attention-Based Feature Online Conformal Prediction for Time Series

Link: https://arxiv.org/abs/2511.15838
Authors: Meiyi Zhu, Caili Guo, Chunyan Feng, Osvaldo Simeone
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
Comments: 25 pages, 24 figures

Click to view abstract

Abstract:Online conformal prediction (OCP) wraps around any pre-trained predictor to produce prediction sets with coverage guarantees that hold irrespective of temporal dependencies or distribution shifts. However, standard OCP faces two key limitations: it operates in the output space using simple nonconformity (NC) scores, and it treats all historical observations uniformly when estimating quantiles. This paper introduces attention-based feature OCP (AFOCP), which addresses both limitations through two key innovations. First, AFOCP operates in the feature space of pre-trained neural networks, leveraging learned representations to construct more compact prediction sets by concentrating on task-relevant information while suppressing nuisance variation. Second, AFOCP incorporates an attention mechanism that adaptively weights historical observations based on their relevance to the current test point, effectively handling non-stationarity and distribution shifts. We provide theoretical guarantees showing that AFOCP maintains long-term coverage while provably achieving smaller prediction intervals than standard OCP under mild regularity conditions. Extensive experiments on synthetic and real-world time series datasets demonstrate that AFOCP consistently reduces the size of prediction intervals by as much as 88% as compared to OCP, while maintaining target coverage levels, validating the benefits of both feature-space calibration and attention-based adaptive weighting.
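
As background for the baseline being improved on, a standard output-space OCP loop with an absolute-residual nonconformity score can be sketched as follows. This is a generic quantile-tracking variant, not AFOCP itself; the function name `online_conformal` and the default step size are assumptions for illustration.

```python
def online_conformal(preds, targets, alpha=0.1, lr=0.05, q0=0.0):
    """Online conformal prediction with absolute-residual scores.

    preds, targets: equal-length sequences of point predictions and outcomes.
    alpha: target miscoverage rate; lr: step size of the threshold update.
    Returns the list of prediction intervals and the empirical miscoverage.
    """
    q = q0
    intervals = []
    misses = 0
    for y_hat, y in zip(preds, targets):
        radius = max(q, 0.0)  # the interval half-width cannot be negative
        intervals.append((y_hat - radius, y_hat + radius))
        err = 1 if abs(y - y_hat) > radius else 0  # 1 when the interval missed y
        misses += err
        # Widen the threshold after a miss, shrink it slightly after a hit.
        q += lr * (err - alpha)
    return intervals, misses / len(intervals)
```

The update treats every past observation identically, which is the uniform weighting the abstract identifies as a limitation; for bounded scores the long-run miss rate still converges to `alpha` regardless of temporal dependence.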

[LG-52] Beyond Tsybakov: Model Margin Noise and $\mathcal{H}$-Consistency Bounds

Link: https://arxiv.org/abs/2511.15816
Authors: Mehryar Mohri, Yutao Zhong
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ISAIM 2026

Click to view abstract

Abstract:We introduce a new low-noise condition for classification, the Model Margin Noise (MM noise) assumption, and derive enhanced $\mathcal{H}$-consistency bounds under this condition. MM noise is weaker than Tsybakov noise condition: it is implied by Tsybakov noise condition but can hold even when Tsybakov fails, because it depends on the discrepancy between a given hypothesis and the Bayes-classifier rather than on the intrinsic distributional minimal margin (see Figure 1 for an illustration of an explicit example). This hypothesis-dependent assumption yields enhanced $\mathcal{H}$-consistency bounds for both binary and multi-class classification. Our results extend the enhanced $\mathcal{H}$-consistency bounds of Mao, Mohri, and Zhong (2025a) with the same favorable exponents but under a weaker assumption than the Tsybakov noise condition; they interpolate smoothly between linear and square-root regimes for intermediate noise levels. We also instantiate these bounds for common surrogate loss families and provide illustrative tables.

[LG-53] Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models NEURIPS2025

Link: https://arxiv.org/abs/2511.15743
Authors: Linnea M. Wolniewicz, Halil S. Kelebek, Simone Mestici, Michael D. Vergalla, Giacomo Acciarini, Bala Poduval, Olga Verkhoglyadova, Madhulika Guhathakurta, Thomas E. Berger, Atılım Güneş Baydin, Frank Soboczenski
Subjects: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: 8 pages, 2 figures, 2 tables. Accepted as a poster presentation in the Machine Learning for the Physical Sciences workshop at NeurIPS 2025

Click to view abstract

Abstract:Operational forecasting of the ionosphere remains a critical space weather challenge due to sparse observations, complex coupling across geospatial layers, and a growing need for timely, accurate predictions that support Global Navigation Satellite System (GNSS), communications, aviation safety, as well as satellite operations. As part of the 2025 NASA Heliolab, we present a curated, open-access dataset that integrates diverse ionospheric and heliospheric measurements into a coherent, machine learning-ready structure, designed specifically to support next-generation forecasting models and address gaps in current operational frameworks. Our workflow integrates a large selection of data sources comprising Solar Dynamic Observatory data, solar irradiance indices (F10.7), solar wind parameters (velocity and interplanetary magnetic field), geomagnetic activity indices (Kp, AE, SYM-H), and NASA JPL’s Global Ionospheric Maps of Total Electron Content (GIM-TEC). We also implement geospatially sparse data such as the TEC derived from the World-Wide GNSS Receiver Network and crowdsourced Android smartphone measurements. This novel heterogeneous dataset is temporally and spatially aligned into a single, modular data structure that supports both physical and data-driven modeling. Leveraging this dataset, we train and benchmark several spatiotemporal machine learning architectures for forecasting vertical TEC under both quiet and geomagnetically active conditions. This work presents an extensive dataset and modeling pipeline that enables exploration of not only ionospheric dynamics but also broader Sun-Earth interactions, supporting both scientific inquiry and operational forecasting efforts.

[LG-54] From Polynomials to Databases: Arithmetic Structures in Galois Theory

Link: https://arxiv.org/abs/2511.16622
Authors: Jurgen Mezinaj
Subjects: Commutative Algebra (math.AC); Machine Learning (cs.LG)

Click to view abstract

Abstract:We develop a computational framework for classifying Galois groups of irreducible degree-7 polynomials over $\mathbb{Q}$, combining explicit resolvent methods with machine learning techniques. A database of over one million normalized projective septics is constructed, each annotated with algebraic invariants $J_0, \dots, J_4$ derived from binary transvections. For each polynomial, we compute resolvent factorizations to determine its Galois group among the seven transitive subgroups of $S_7$ identified by Foulkes. Using this dataset, we train a neurosymbolic classifier that integrates invariant-theoretic features with supervised learning, yielding improved accuracy in detecting rare solvable groups compared to coefficient-based models. The resulting database provides a reproducible resource for constructive Galois theory and supports empirical investigations into group distribution under height constraints. The methodology extends to higher-degree cases and illustrates the utility of hybrid symbolic-numeric techniques in computational algebra.

[LG-55] Rate-optimal community detection near the KS threshold via node-robust algorithms

Link: https://arxiv.org/abs/2511.16613
Authors: Jingqiu Ding, Yiding Hua, Kasper Lindberg, David Steurer, Aleksandr Storozhenko
Subjects: Machine Learning (stat.ML); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Click to view abstract

Abstract:We study community detection in the symmetric $k$-stochastic block model, where $n$ nodes are evenly partitioned into $k$ clusters with intra- and inter-cluster connection probabilities $p$ and $q$, respectively. Our main result is a polynomial-time algorithm that achieves the minimax-optimal misclassification rate
\[ \exp\Bigl(-\bigl(1 \pm o(1)\bigr) \tfrac{C}{k}\Bigr), \quad \text{where } C = (\sqrt{pn} - \sqrt{qn})^2, \]
whenever $C \ge K\,k^2\,\log k$ for some universal constant $K$, matching the Kesten–Stigum (KS) threshold up to a $\log k$ factor. Notably, this rate holds even when an adversary corrupts an $\eta \le \exp\bigl(-(1 \pm o(1)) \tfrac{C}{k}\bigr)$ fraction of the nodes. To the best of our knowledge, the minimax rate was previously only attainable either via computationally inefficient procedures [ZZ15] or via polynomial-time algorithms that require strictly stronger assumptions such as $C \ge K k^3$ [GMZZ17]. In the node-robust setting, the best known algorithm requires the substantially stronger condition $C \ge K k^{102}$ [LM22]. Our results close this gap by providing the first polynomial-time algorithm that achieves the minimax rate near the KS threshold in both settings. Our work has two key technical contributions: (1) we robustify majority voting via the Sum-of-Squares framework, (2) we develop a novel graph bisection algorithm via robust majority voting, which allows us to significantly improve the misclassification rate to $1/\mathrm{poly}(k)$ for the initial estimation near the KS threshold.
Cite as: arXiv:2511.16613 [stat.ML] (or arXiv:2511.16613v1 [stat.ML] for this version); DOI: https://doi.org/10.48550/arXiv.2511.16613; submitted by Aleksandr Storozhenko, [v1] Thu, 20 Nov 2025 18:11:01 UTC (1,087 KB)

[LG-56] Time dependent loss reweighting for flow matching and diffusion models is theoretically justified

Link: https://arxiv.org/abs/2511.16599
Authors: Lukas Billera, Hedwig Nora Nordlinder, Ben Murrell
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 19 pages, 0 figures

Click to view abstract

Abstract:This brief note clarifies that, in Generator Matching (which subsumes a large family of flow matching and diffusion models over continuous, manifold, and discrete spaces), both the Bregman divergence loss and the linear parameterization of the generator can depend on both the current state $X_t$ and the time $t$, and we show that the expectation over time in the loss can be taken with respect to a broad class of time distributions. We also show this for Edit Flows, which falls outside of Generator Matching. That the loss can depend on $t$ clarifies that time-dependent loss weighting schemes, often used in practice to stabilize training, are theoretically justified when the specific flow or diffusion scheme is a special case of Generator Matching (or Edit Flows). It also often simplifies the construction of $X_1$-predictor schemes, which are sometimes preferred for model-related reasons. We show examples that rely upon the dependence of linear parameterizations, and of the Bregman divergence loss, on $t$ and $X_t$.

[LG-57] Variational Quantum Integrated Sensing and Communication

Link: https://arxiv.org/abs/2511.16597
Authors: Ivana Nikoloska, Osvaldo Simeone
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Submitted for publication

Click to view abstract

Abstract:The integration of sensing and communication functionalities within a common system is one of the main innovation drivers for next-generation networks. In this paper, we introduce a quantum integrated sensing and communication (QISAC) protocol that leverages entanglement in quantum carriers of information to enable both superdense coding and quantum sensing. The proposed approach adaptively optimizes encoding and quantum measurement via variational circuit learning, while employing classical machine learning-based decoders and estimators to process the measurement outcomes. Numerical results for qudit systems demonstrate that the proposed QISAC protocol can achieve a flexible trade-off between classical communication rate and accuracy of parameter estimation.

[LG-58] Optimizing Quantum Key Distribution Network Performance using Graph Neural Networks

Link: https://arxiv.org/abs/2511.16468
Authors: Akshit Pramod Anchan, Ameiy Acharya, Leki Chom Thungon
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 11 pages, 4 figures, and 2 tables

Click to view abstract

Abstract:This paper proposes an optimization of Quantum Key Distribution (QKD) Networks using Graph Neural Networks (GNN) framework. Today, the development of quantum computers threatens the security systems of classical cryptography. Moreover, as QKD networks are designed for protecting secret communication, they suffer from multiple operational difficulties: adaptive to dynamic conditions, optimization for multiple parameters and effective resource utilization. In order to overcome these obstacles, we propose a GNN-based framework which can model QKD networks as dynamic graphs and extracts exploitable characteristics from these networks’ structure. The graph contains not only topological information but also specific characteristics associated with quantum communication (the number of edges between nodes, etc). Experimental results demonstrate that the GNN-optimized QKD network achieves a substantial increase in total key rate (from 27.1 Kbits/s to 470 Kbits/s), a reduced average QBER (from 6.6% to 6.0%), and maintains path integrity with a slight reduction in average transmission distance (from 7.13 km to 6.42 km). Furthermore, we analyze network performance across varying scales (10 to 250 nodes), showing improved link prediction accuracy and enhanced key generation rate in medium-sized networks. This work introduces a novel operation mode for QKD networks, shifting the paradigm of network optimization through adaptive and scalable quantum communication systems that enhance security and performance.

[LG-59] VersaPants: A Loose-Fitting Textile Capacitive Sensing System for Lower-Body Motion Capture

Link: https://arxiv.org/abs/2511.16346
Authors: Deniz Kasap, Taraneh Aminosharieh Najafi, Jérôme Paul Rémy Thevenot, Jonathan Dan, Stefano Albini, David Atienza (École Polytechnique Fédérale de Lausanne (EPFL))
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 14 pages, 8 figures

Click to view abstract

Abstract:We present VersaPants, the first loose-fitting, textile-based capacitive sensing system for lower-body motion capture, built on the open-hardware VersaSens platform. By integrating conductive textile patches and a compact acquisition unit into a pair of pants, the system reconstructs lower-body pose without compromising comfort. Unlike IMU-based systems that require user-specific fitting or camera-based methods that compromise privacy, our approach operates without fitting adjustments and preserves user privacy. VersaPants is a custom-designed smart garment featuring 6 capacitive channels per leg. We employ a lightweight Transformer-based deep learning model that maps capacitance signals to joint angles, enabling embedded implementation on edge platforms. To test our system, we collected approximately 3.7 hours of motion data from 11 participants performing 16 daily and exercise-based movements. The model achieves a mean per-joint position error (MPJPE) of 11.96 cm and a mean per-joint angle error (MPJAE) of 12.3 degrees across the hip, knee, and ankle joints, indicating the model’s ability to generalize to unseen users and movements. A comparative analysis of existing textile-based deep learning architectures reveals that our model achieves competitive reconstruction performance with up to 22 times fewer parameters and 18 times fewer FLOPs, enabling real-time inference at 42 FPS on a commercial smartwatch without quantization. These results position VersaPants as a promising step toward scalable, comfortable, and embedded motion-capture solutions for fitness, healthcare, and wellbeing applications.
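
The MPJPE figure quoted above is an average Euclidean distance between predicted and reference joint positions. A minimal sketch of that metric is given below, assuming poses are stored as per-frame lists of 3D joint coordinates; the data layout and the function name `mpjpe` are illustrative choices, not the paper's evaluation code.

```python
import math


def mpjpe(pred, truth):
    """Mean per-joint position error.

    pred, truth: lists of frames; each frame is a list of (x, y, z) joint
    positions in the same order and the same units (e.g. centimetres).
    Returns the Euclidean distance averaged over all joints and frames.
    """
    total = 0.0
    count = 0
    for frame_pred, frame_truth in zip(pred, truth):
        for joint_pred, joint_truth in zip(frame_pred, frame_truth):
            total += math.dist(joint_pred, joint_truth)
            count += 1
    return total / count
```

The result is in the same units as the input coordinates, so an MPJPE of 11.96 cm means predicted joints lie about 12 cm from the reference positions on average.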

[LG-60] Spectral Identifiability for Interpretable Probe Geometry

Link: https://arxiv.org/abs/2511.16288
Authors: William Hao-Cheng Huang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Click to view abstract

Abstract:Linear probes are widely used to interpret and evaluate neural representations, yet their reliability remains unclear, as probes may appear accurate in some regimes but collapse unpredictably in others. We uncover a spectral mechanism behind this phenomenon and formalize it as the Spectral Identifiability Principle (SIP), a verifiable Fisher-inspired condition for probe stability. When the eigengap separating task-relevant directions is larger than the Fisher estimation error, the estimated subspace concentrates and accuracy remains consistent, whereas closing this gap induces instability in a phase-transition manner. Our analysis connects eigengap geometry, sample size, and misclassification risk through finite-sample reasoning, providing an interpretable diagnostic rather than a loose generalization bound. Controlled synthetic studies, where Fisher quantities are computed exactly, confirm these predictions and show how spectral inspection can anticipate unreliable probes before they distort downstream evaluation.

[LG-61] Approximation rates of quantum neural networks for periodic functions via Jackson's inequality

Link: https://arxiv.org/abs/2511.16149
Authors: Ariel Neufeld, Philipp Schmocker, Viet Khoa Tran
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Click to view abstract

Abstract:Quantum neural networks (QNNs) are an analog of classical neural networks in the world of quantum computing, which are represented by a unitary matrix with trainable parameters. Inspired by the universal approximation property of classical neural networks, ensuring that every continuous function can be arbitrarily well approximated uniformly on a compact set of a Euclidean space, some recent works have established analogous results for QNNs, ranging from single-qubit to multi-qubit QNNs, and even hybrid classical-quantum models. In this paper, we study the approximation capabilities of QNNs for periodic functions with respect to the supremum norm. We use the Jackson inequality to approximate a given function by implementing its approximating trigonometric polynomial via a suitable QNN. In particular, we see that by restricting to the class of periodic functions, one can achieve a quadratic reduction of the number of parameters, producing better approximation results than in the literature. Moreover, the smoother the function, the fewer parameters are needed to construct a QNN to approximate the function.

[LG-62] Angular Graph Fractional Fourier Transform: Theory and Application

Link: https://arxiv.org/abs/2511.16111
Authors: Feiyue Zhao, Yangfan He, Zhichao Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Spectral Theory (math.SP)

Click to view abstract

Abstract:Graph spectral representations are fundamental in graph signal processing, offering a rigorous framework for analyzing and processing graph-structured data. The graph fractional Fourier transform (GFRFT) extends the classical graph Fourier transform (GFT) with a fractional-order parameter, enabling flexible spectral analysis while preserving mathematical consistency. The angular graph Fourier transform (AGFT) introduces angular control via GFT eigenvector rotation; however, existing constructions fail to degenerate to the GFT at zero angle, which is a critical flaw that undermines theoretical consistency and interpretability. To resolve these complementary limitations - GFRFT’s lack of angular regulation and AGFT’s defective degeneracy - this study proposes an angular GFRFT (AGFRFT), a unified framework that integrates fractional-order and angular spectral analyses with theoretical rigor. A degeneracy-friendly rotation matrix family ensures exact GFT degeneration at zero angle, with two AGFRFT variants (I-AGFRFT and II-AGFRFT) defined accordingly. Rigorous theoretical analyses confirm their unitarity, invertibility, and smooth parameter dependence. Both support learnable joint parameterization of the angle and fractional order, enabling adaptive spectral processing for diverse graph signals. Extensive experiments on real-world data denoising, image denoising, and point cloud denoising demonstrate that AGFRFT outperforms GFRFT and AGFT in terms of spectral concentration, reconstruction quality, and controllable spectral manipulation, establishing a robust and flexible tool for integrated angular fractional spectral analysis in graph signal processing.

[LG-63] Machine Learning vs. Randomness: Challenges in Predicting Binary Options Movements

Link: https://arxiv.org/abs/2511.15960
Authors: Gabriel M. Arantes, Richard F. Pinto, Bruno L. Dalmazo, Eduardo N. Borges, Giancarlo Lucca, Viviane L. D. de Mattos, Fabian C. Cardoso, Rafael A. Berri
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
Comments: Accepted for publication at the 26th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2025)

Click to view abstract

Abstract:Binary options trading is often marketed as a field where predictive models can generate consistent profits. However, the inherent randomness and stochastic nature of binary options make price movements highly unpredictable, posing significant challenges for any forecasting approach. This study demonstrates that machine learning algorithms struggle to outperform a simple baseline in predicting binary options movements. Using a dataset of EUR/USD currency pairs from 2021 to 2023, we tested multiple models, including Random Forest, Logistic Regression, Gradient Boosting, and k-Nearest Neighbors (kNN), both before and after hyperparameter optimization. Furthermore, several neural network architectures, including Multi-Layer Perceptrons (MLP) and a Long Short-Term Memory (LSTM) network, were evaluated under different training conditions. Despite these exhaustive efforts, none of the models surpassed the ZeroR baseline accuracy, highlighting the inherent randomness of binary options. These findings reinforce the notion that binary options lack predictable patterns, making them unsuitable for machine learning-based forecasting.
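
The ZeroR baseline mentioned above simply predicts the most frequent training label for every example. A small sketch of that baseline and its accuracy follows; the helper names `zero_r_fit` and `accuracy` are hypothetical and the label values are only for illustration.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the majority class of the training labels (the ZeroR 'model')."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def zero_r_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the training majority class."""
    majority = zero_r_fit(train_labels)
    return accuracy([majority] * len(test_labels), test_labels)
```

On a roughly balanced up/down target this baseline scores close to 50%, so a model that fails to beat it has not extracted any usable signal from the features.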

[LG-64] EEG Emotion Recognition Through Deep Learning

Link: https://arxiv.org/abs/2511.15902
Authors: Roman Dolgopolyi, Antonis Chatzipanagiotou
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: This version corresponds to the original manuscript submitted to the 22nd EMCIS conference prior to peer review. The peer-reviewed and accepted version will appear in the Springer conference proceedings

Click to view abstract

Abstract:An advanced emotion classification model was developed using a CNN-Transformer architecture for emotion recognition from EEG brain wave signals, effectively distinguishing among three emotional states, positive, neutral and negative. The model achieved a testing accuracy of 91%, outperforming traditional models such as SVM, DNN, and Logistic Regression. Training was conducted on a custom dataset created by merging data from SEED, SEED-FRA, and SEED-GER repositories, comprising 1,455 samples with EEG recordings labeled according to emotional states. The combined dataset represents one of the largest and most culturally diverse collections available. Additionally, the model allows for the reduction of the requirements of the EEG apparatus, by leveraging only 5 electrodes of the 62. This reduction demonstrates the feasibility of deploying a more affordable consumer-grade EEG headset, thereby enabling accessible, at-home use, while also requiring less computational power. This advancement sets the groundwork for future exploration into mood changes induced by media content consumption, an area that remains underresearched. Integration into medical, wellness, and home-health platforms could enable continuous, passive emotional monitoring, particularly beneficial in clinical or caregiving settings where traditional behavioral cues, such as facial expressions or vocal tone, are diminished, restricted, or difficult to interpret, thus potentially transforming mental health diagnostics and interventions…

[LG-65] Atlas Gaussian processes on restricted domains and point clouds

Link: https://arxiv.org/abs/2511.15822
Authors: Mu Niu, Yue Zhang, Ke Ye, Pokman Cheung, Yizhu Wang, Xiaochen Yang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:In real-world applications, data often reside in restricted domains with unknown boundaries, or as high-dimensional point clouds lying on a lower-dimensional, nontrivial, unknown manifold. Traditional Gaussian Processes (GPs) struggle to capture the underlying geometry in such settings. Some existing methods assume a flat space embedded in a point cloud, which can be represented by a single latent chart (latent space), while others exhibit weak performance when the point cloud is sparse or irregularly sampled. The goal of this work is to address these challenges. The main contributions are twofold: (1) We establish the Atlas Brownian Motion (BM) framework for estimating the heat kernel on point clouds with unknown geometries and nontrivial topological structures; (2) Instead of directly using the heat kernel estimates, we construct a Riemannian corrected kernel by combining the global heat kernel with a local RBF kernel, leading to the formulation of Riemannian-corrected Atlas Gaussian Processes (RC-AGPs). The resulting RC-AGPs are applied to regression tasks across synthetic and real-world datasets. These examples demonstrate that our method outperforms existing approaches in both heat kernel estimation and regression accuracy. It improves statistical inference by effectively bridging the gap between complex, high-dimensional observations and manifold-based inferences.

[LG-66] SURFing to the Fundamental Limit of Jet Tagging

Link: https://arxiv.org/abs/2511.15779
Authors: Ian Pang, Darius A. Faroughy, David Shih, Ranit Das, Gregor Kasieczka
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 15 pages, 10 figures, 2 tables

Abstract:Beyond the practical goal of improving search and measurement sensitivity through better jet tagging algorithms, there is a deeper question: what are their upper performance limits? Generative surrogate models with learned likelihood functions offer a new approach to this problem, provided the surrogate correctly captures the underlying data distribution. In this work, we introduce the SUrrogate ReFerence (SURF) method, a new approach to validating generative models. This framework enables exact Neyman-Pearson tests by training the target model on samples from another tractable surrogate, which is itself trained on real data. We argue that the EPiC-FM generative model is a valid surrogate reference for JetClass jets and apply SURF to show that modern jet taggers may already be operating close to the true statistical limit. By contrast, we find that autoregressive GPT models unphysically exaggerate top vs. QCD separation power encoded in the surrogate reference, implying that they are giving a misleading picture of the fundamental limit.

[LG-67] Human-aligned Quantification of Numerical Data

Link: https://arxiv.org/abs/2511.15723
Authors: Anton Kolonin
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 9 pages, 5 figures, 1 table

Abstract:Quantifying numerical data involves addressing two key challenges: first, determining whether the data can be naturally quantified, and second, identifying the numerical intervals or ranges of values that correspond to specific value classes, referred to as “quantums,” which represent statistically meaningful states. If such quantification is feasible, continuous streams of numerical data can be transformed into sequences of “symbols” that reflect the states of the system described by the measured parameter. People often perform this task intuitively, relying on common sense or practical experience, while information theory and computer science offer computable metrics for this purpose. In this study, we assess the applicability of metrics based on information compression and the Silhouette coefficient for quantifying numerical data. We also investigate the extent to which these metrics correlate with one another and with what is commonly referred to as “human intuition.” Our findings suggest that the ability to classify numeric data values into distinct categories is associated with a Silhouette coefficient above 0.65 and a Dip Test below 0.5; otherwise, the data can be treated as following a unimodal normal distribution. Furthermore, when quantification is possible, the Silhouette coefficient appears to align more closely with human intuition than the “normalized centroid distance” method derived from an information compression perspective.
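
To make the Silhouette criterion concrete, here is a minimal sketch that computes the mean Silhouette coefficient for one-dimensional values under a given class assignment and compares it with the 0.65 threshold quoted in the abstract. The sample values, the labels, and the function name `silhouette_1d` are assumptions for illustration; the Dip Test part of the criterion is not implemented here.

```python
def silhouette_1d(values, labels):
    """Mean Silhouette coefficient for 1D values with given cluster labels."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for value, label in zip(values, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(value - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(value - u) for u in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


THRESHOLD = 0.65  # value quoted in the abstract

# Hypothetical data: two well-separated groups vs. evenly spread values.
separated = silhouette_1d([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], [0, 0, 0, 1, 1, 1])
spread = silhouette_1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0, 0, 0, 1, 1, 1])
print(separated > THRESHOLD, spread > THRESHOLD)
```

In this toy case the well-separated groups clear the threshold and the evenly spread values do not, which matches the intuition that only the first data set has natural classes.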

Information Retrieval

[IR-0] PolyMinHash: Efficient Area-Based MinHashing of Polygons for Approximate Nearest Neighbor Search

Link: https://arxiv.org/abs/2511.16576
Authors: Alima Subedi, Sankalpa Pokharel, Satish Puri
Subjects: Information Retrieval (cs.IR)

Abstract:Similarity searches are a critical task in data mining. As data sets grow larger, exact nearest neighbor searches quickly become infeasible, leading to the adoption of approximate nearest neighbor (ANN) searches. ANN has been studied for text data, images, and trajectories. However, there has been little effort to develop ANN systems for polygons in spatial database systems and geographic information systems. We present PolyMinHash, a system for approximate polygon similarity search that adapts MinHashing into a novel 2D polygon-hashing scheme to generate short, similarity-preserving signatures of input polygons. The MinHash value is generated by counting the number of randomly sampled points needed before the sampled point lands within the polygon’s interior area, yielding hash values that preserve area-based Jaccard similarity. We present the tradeoff between search accuracy and runtime of our PolyMinHash system. Our hashing mechanism reduces the number of candidates to be processed in the query refinement phase by up to 98% compared to the number of candidates processed by the brute-force algorithm.
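
The hashing idea in the abstract can be sketched as follows. All polygons share the same seeded stream of random points drawn from a common bounding box, and each hash value is the index of the first point that falls inside the polygon. Two polygons then get the same value exactly when the first point to hit their union also lies in their intersection, which happens with probability equal to their area-based Jaccard similarity. The ray-casting test, the seeding scheme, the bounding box, and all function names below are assumptions for illustration, not the authors' implementation.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test for a simple polygon given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def poly_minhash(polygon, bbox, num_hashes, seed=0, max_draws=10_000):
    """Signature: for each hash, the 1-based index of the first sample inside."""
    xmin, ymin, xmax, ymax = bbox
    signature = []
    for k in range(num_hashes):
        # The stream depends only on (seed, k), so every polygon sees the same points.
        rng = random.Random(seed * 1_000_003 + k)
        value = max_draws + 1  # sentinel if no sample lands inside (tiny polygons)
        for draw in range(1, max_draws + 1):
            x = rng.uniform(xmin, xmax)
            y = rng.uniform(ymin, ymax)
            if point_in_polygon(x, y, polygon):
                value = draw
                break
        signature.append(value)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Two hypothetical 2x2 squares overlapping in a 1x2 strip: true Jaccard = 2/6.
square_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
square_b = [(1, 0), (3, 0), (3, 2), (1, 2)]
bbox = (0, 0, 3, 2)
sig_a = poly_minhash(square_a, bbox, num_hashes=200)
sig_b = poly_minhash(square_b, bbox, num_hashes=200)
print(estimate_jaccard(sig_a, sig_b))
```

A longer signature tightens the estimate at the cost of more sampling per polygon, which is one way to view the accuracy versus runtime tradeoff the abstract refers to.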

[IR-1] An Efficient LLM-based Evolutional Recommendation with Locate-Forget-Update Paradigm

Link: https://arxiv.org/abs/2511.16414
Authors: Hao Liu, Le Wu, Min Hou, Han Wu, Kun Zhang, Xin Li, Si Wei
Subjects: Information Retrieval (cs.IR)

Abstract:Nowadays, Large Language Models (LLMs) have shown exceptional performance in sequential recommendations, and the adoption of LLM-based recommender systems (LLMRec) is becoming increasingly widespread in existing e-commerce platforms. Despite the impressive performance, the constant high volume of new user-item interactions makes it difficult to adapt to the evolution of user preference over time, especially for LLM-based recommender systems. The challenge arises from the large number of parameters in LLMs, which makes traditional evolution methods (i.e., Re-training or Fine-tuning) impractical. Specifically, Re-training with all interactions results in prohibitively high computational costs. On the other hand, fine-tuning with only new interactions leads to preference forgetting among inactive users, ultimately compromising overall performance. To tackle this problem, we propose EvoRec, an efficient Locate-Forget-Update framework designed for LLM-based recommender systems to model the evolution of user preferences. EvoRec identifies a small set of parameters associated with preference changes and updates them precisely, thereby saving computational resources while maintaining strong recommendation performance. Notably, the modified parameters account for only 30% of LoRA adapter parameters, with no additional parameters introduced. Extensive experiments on two real-world datasets demonstrate that, compared to existing methods, EvoRec not only efficiently evolves LLMRec to adapt to the preferences of active users, but also preserves the interests of inactive users from being disturbed during evolution.

[IR-2] ARK: Answer-Centric Retriever Tuning via KG-augmented Curriculum Learning

Link: https://arxiv.org/abs/2511.16326
Authors: Jiawei Zhou, Hang Ding, Haiyun Jiang
Subjects: Information Retrieval (cs.IR)
Comments: Under Review in ARR

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for knowledge-intensive tasks, yet its effectiveness in long-context scenarios is often bottlenecked by the retriever’s inability to distinguish sparse yet crucial evidence. Standard retrievers, optimized for query-document similarity, frequently fail to align with the downstream goal of generating a precise answer. To bridge this gap, we propose a novel fine-tuning framework that optimizes the retriever for Answer Alignment. Specifically, we first identify high-quality positive chunks by evaluating their sufficiency to generate the correct answer. We then employ a curriculum-based contrastive learning scheme to fine-tune the retriever. This curriculum leverages LLM-constructed Knowledge Graphs (KGs) to generate augmented queries, which in turn mine progressively challenging hard negatives. This process trains the retriever to distinguish the answer-sufficient positive chunks from these nuanced distractors, enhancing its generalization. Extensive experiments on 10 datasets from the Ultradomain and LongBench benchmarks demonstrate that our fine-tuned retriever achieves state-of-the-art performance, improving 14.5% over the base model without substantial architectural modifications and maintaining strong efficiency for long-context RAG. Our work presents a robust and effective methodology for building truly answer-centric retrievers.
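
The abstract does not state which contrastive objective is used. As a rough illustration of why harder negatives give a stronger training signal, the sketch below uses the common InfoNCE loss, which is an assumption here, along with made-up similarity scores and a hypothetical temperature.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]


# Hypothetical retriever similarities for one answer-sufficient positive chunk.
easy_stage = info_nce_loss(0.9, [0.1, 0.2])    # early curriculum: easy negatives
hard_stage = info_nce_loss(0.9, [0.8, 0.85])   # later curriculum: hard negatives
print(easy_stage, hard_stage)
```

With easy negatives the loss is close to zero, so the retriever learns little; negatives that score almost as high as the positive produce a much larger loss, which is the motivation for mining progressively harder ones.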

[IR-3] Incorporating Token Importance in Multi-Vector Retrieval

Link: https://arxiv.org/abs/2511.16106
Authors: Archish S, Ankit Garg, Kirankumar Shiragur, Neeraj Kayal
Subjects: Information Retrieval (cs.IR)

Abstract:ColBERT introduced a late interaction mechanism that independently encodes queries and documents using BERT, and computes similarity via fine-grained interactions over token-level vector representations. This design enables expressive matching while allowing efficient computation of scores, as the multi-vector document representations could be pre-computed offline. ColBERT models distance using a Chamfer-style function: for each query token, it selects the closest document token and sums these distances across all query tokens. In our work, we explore enhancements to the Chamfer distance function by computing a weighted sum over query token contributions, where weights reflect the token importance. Empirically, we show that this simple extension, requiring only token-weight training while keeping the multi-vector representations fixed, further enhances the expressiveness of late interaction multi-vector mechanism. In particular, on the BEIR benchmark, our method achieves an average improvement of 1.28% in Recall@10 in the zero-shot setting using IDF-based weights, and 3.66% through few-shot fine-tuning.
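
The weighted Chamfer-style score described in the abstract can be sketched as below: each query token takes its best match among the document tokens, and the per-token contributions are combined with importance weights. Dot-product similarity, the toy vectors, the weight values, and the function names are assumptions for illustration; the paper's actual weighting and training procedure are not reproduced here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_chamfer_score(query_vecs, doc_vecs, weights=None):
    """Sum over query tokens of weight * best similarity to any document token."""
    if weights is None:
        weights = [1.0] * len(query_vecs)  # uniform weights: plain late interaction
    if len(weights) != len(query_vecs):
        raise ValueError("one weight per query token is required")
    return sum(
        w * max(dot(q, d) for d in doc_vecs)
        for q, w in zip(query_vecs, weights)
    )


# Toy 2-dimensional token embeddings, invented for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.6, 0.8]]

uniform = weighted_chamfer_score(query, document)
# Hypothetical importance weights, e.g. a rare token weighted above a common one.
weighted = weighted_chamfer_score(query, document, weights=[0.5, 2.0])
print(uniform, weighted)
```

Because only the weights change, the document token vectors can stay pre-computed, which is consistent with the abstract's point that the multi-vector representations are kept fixed.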

Attachment Download

Click to download today's full paper list