This post lists the latest papers retrieved from Arxiv.org on 2025-12-17, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR.
Note: paper data is fetched from Arxiv.org and refreshed automatically around 12:00 each day. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-12-17)
486 papers were updated today, including:
- Natural Language Processing: 48 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 157 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 135 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 134 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
[Quick Read]: This paper addresses the lack of reliable benchmarks and systematic training recipes for evaluating and optimizing multimodal large language models (MLLMs) on video temporal grounding (VTG). Existing VTG benchmarks suffer from unreliable annotations and noisy training data, which distort performance evaluation and hinder progress. The solution works along two dimensions. First, a high-quality data foundation: TimeLens-Bench, a rigorously re-annotated multi-benchmark evaluation suite that corrects prior annotation errors, and TimeLens-100K, a large-scale high-quality training dataset. Second, effective and efficient algorithmic design principles, including interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) training paradigm, and carefully designed RLVR training recipes. Together these yield the TimeLens model family, which achieves state-of-the-art VTG performance among open-source models and even surpasses some proprietary models such as GPT-5 and Gemini-2.5-Flash.
Link: https://arxiv.org/abs/2512.14698
Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
Affiliations: Nanjing University; ARC Lab, Tencent PCG; Shanghai AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Project Page: this https URL
Abstract:This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs with state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All codes, data, and models will be released to facilitate future research.
[NLP-1] MMGR: Multi-Modal Generative Reasoning
[Quick Read]: This paper addresses the fact that video foundation models, while visually realistic and temporally coherent, often fail to model physical, logical, and spatial constraints correctly, making them unreliable as world simulators. Existing metrics such as Frechet Video Distance (FVD) capture only perceptual quality and overlook reasoning failures such as causality violations, physics violations, and loss of global consistency. The key contribution is MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework built on five reasoning abilities (Physical, Logical, 3D Spatial, 2D Spatial, and Temporal) and spanning three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). Its fine-grained metrics require holistic correctness across both video and image generation, exposing substantial reasoning gaps in current models and providing a diagnostic benchmark and a path toward reasoning-aware generative world models.
Link: https://arxiv.org/abs/2512.14691
Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: work in progress
Abstract:Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
[NLP-2] Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization
[Quick Read]: This paper addresses the lack of multimodally aligned data for emotion-aware spoken dialogue summarization, i.e., datasets that align raw conversational audio with factual summaries, emotion-rich summaries, and paralinguistic cues such as speaker age, gender, and emotion. The key contribution is Spoken DialogSum, the first such corpus, built in two stages: an LLM first rewrites DialogSum scripts with natural fillers and back-channels and tags each utterance with emotion, pitch, and speaking rate; an expressive text-to-speech (TTS) engine then synthesizes speech consistent with these tags. The corpus comprises 13,460 emotion-diverse dialogues, each paired with a factual and an emotion-focused summary. Baselines show that an end-to-end Audio-LLM improves emotional-summary ROUGE-L by 28% relative to a cascaded ASR+LLM system, confirming the value of modeling the speech signal directly.
Link: https://arxiv.org/abs/2512.14687
Authors: Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 12 pages, 2 figures
Abstract:Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at this https URL. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
[NLP-3] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
[Quick Read]: This paper targets the high inference latency of autoregressive (AR) decoding in transformer-based large models. Existing diffusion large language models (dLLMs) that adapt AR models for parallel decoding achieve limited speedup because the masked data distribution seen in post-training deviates from the pretraining distribution, and because their bidirectional attention conflicts with the causal prior learned during pretraining. The key idea, Jacobi Forcing, is a progressive distillation paradigm in which the model is trained on its own parallel-decoding trajectories, smoothly turning an AR model into an efficient parallel decoder while preserving its pretrained causal-inference property. The resulting Jacobi Forcing Models achieve a 3.8x wall-clock speedup, and multi-block decoding with rejection recycling raises the accepted tokens per iteration by up to 4.5x, for nearly 4x lower overall inference latency.
Link: https://arxiv.org/abs/2512.14681
Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
Affiliations: UC San Diego; Shanghai Jiao Tong University; Snowflake
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models’ trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at this https URL.
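For readers unfamiliar with the Jacobi decoding that Jacobi Forcing distills from, the sketch below shows the basic fixed-point iteration: a whole block of draft tokens is refined in parallel by a causal LM until it stops changing, at which point it equals the greedy autoregressive output. This is a minimal illustration, not the authors' code; the Hugging Face-style `model(seq).logits` interface, the block length, and the pad-style initialization are assumptions.

```python
# Minimal sketch of greedy Jacobi (fixed-point) parallel decoding.
import torch

@torch.no_grad()
def jacobi_decode_block(model, prompt_ids, block_len=16, max_iters=16):
    """Refine a whole block of draft tokens in parallel until the greedy
    fixed point is reached (it then equals the autoregressive output)."""
    # Initialize the draft block arbitrarily (here: repeat the last prompt token).
    draft = prompt_ids[:, -1:].repeat(1, block_len)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, draft], dim=1)
        logits = model(seq).logits
        # Greedy prediction for every draft position, computed in one forward pass.
        new_draft = logits[:, prompt_ids.size(1) - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_draft, draft):  # fixed point reached
            break
        draft = new_draft
    return draft
```

Each iteration fixes at least the first not-yet-correct token, so the loop terminates within `block_len` iterations in the worst case; the speedup comes from many tokens converging per iteration.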
[NLP-4] TiME: Tiny Monolingual Encoders for Efficient NLP Pipelines
[Quick Read]: This paper tackles the bottlenecks of large general-purpose language models in efficiency-critical NLP applications: slow inference, high energy consumption, and difficulty deploying on resource-constrained devices. The solution is to train tiny monolingual encoder models (TiME) with modern techniques such as distillation, achieving a better trade-off between benchmark performance on one hand and throughput, latency, and energy consumption on the other, while also supporting low-resource languages. Key findings include that monolingual students can be distilled from multilingual teachers, and that students with absolute positional embeddings can be distilled from teachers with relative positional embeddings, enabling efficient, scalable, and sustainable NLP deployments.
Link: https://arxiv.org/abs/2512.14645
Authors: David Schulmeister, Valentin Hartmann, Lars Klein, Robert West
Affiliations: EPFL; Timely Learning
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.
[NLP-5] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
[Quick Read]: This paper addresses the shortage of high-quality, large-scale, challenging benchmarks for evaluating image-text integrated understanding in Japanese. The authors introduce JMMMU-Pro, an image-based Japanese multi-discipline multimodal understanding benchmark, together with Vibe Benchmark Construction, the scalable method used to build it. The key idea is to have an image generation model (e.g., Nano Banana Pro) produce candidate visual questions, with humans verifying the outputs and regenerating with adjusted prompts when necessary. This yields a high-quality, low-cost Japanese multimodal QA benchmark with diverse backgrounds and layouts that tests LMMs' integrated visual-textual understanding far more rigorously; all open-source LMMs evaluated struggle substantially on it.
Link: https://arxiv.org/abs/2512.14620
Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
Affiliations: The University of Tokyo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro’s highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
[NLP-6] Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer
[Quick Read]: This paper addresses the challenges Nepali poses for natural language processing (NLP): complex grammar, agglutinative morphology, and scarce high-quality corpora, with prior work mostly limited to basic encoder architectures that fall short of Nepali-specific text generation needs. The solution is a GPT-2-based Nepali language model trained with several GPT-3-inspired strategies, including optimized learning-rate schedules, batch scaling, and architectural refinements; a custom 16k Byte-Pair Encoding (BPE) tokenizer trained exclusively on Nepali text for more consistent segmentation and better input representation; and FlashAttention to reduce memory usage and stabilize training. After pretraining, the model reaches a validation loss of 3.081982 and a perplexity of 21.80, and can generate coherent Nepali news-style text.
Link: https://arxiv.org/abs/2512.14585
Authors: Adarsha Shrestha, Basanta Pokharel, Binit Shrestha, Smriti Adhikari, Dinesh Gothe
Affiliations: Khwopa College of Engineering
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress
Abstract:Nepali, a low-resource language spoken by over 32 million people, continues to face challenges in natural language processing (NLP) due to its complex grammar, agglutinative morphology, and limited availability of high-quality corpora. Most efforts to date have centered on basic encoder architectures; they remain insufficient for Nepali-specific text generation. This study presents a GPT-2-based Nepali language model trained using several training strategies inspired by GPT-3, including optimized learning rate schedules, batch scaling, and architectural refinements. A custom 16k Byte-Pair Encoding (BPE) tokenizer was trained exclusively on Nepali text to ensure more consistent segmentation and improved input representation. The model was pretrained on a combined dataset comprising a 10.75GB cleaned NepBERTa corpus and additional web-scraped Nepali news articles. FlashAttention was integrated to reduce memory usage and stabilize training. After two epochs, the model achieved a training loss of 3.168177, a validation loss of 3.081982, and a final perplexity of 21.80, demonstrating its capability to generate coherent Nepali news-style text.
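As a concrete illustration of the tokenizer step, the sketch below trains a 16k-vocabulary BPE tokenizer with the Hugging Face `tokenizers` library. The corpus path, special tokens, and whitespace pre-tokenization are illustrative assumptions, not the paper's exact setup (GPT-2-style models typically use byte-level BPE).

```python
# Minimal sketch: train a 16k BPE tokenizer on a Nepali text corpus.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=16_000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
)
# "nepali_corpus.txt" is a hypothetical path to the cleaned training text.
tokenizer.train(files=["nepali_corpus.txt"], trainer=trainer)
tokenizer.save("nepali_bpe_16k.json")

# Sanity check: encode a Devanagari sentence and inspect the merges.
print(tokenizer.encode("नेपाली भाषा").tokens)
```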
[NLP-7] Low-Resource High-Impact: Building Corpora for Inclusive Language Technologies LREC2026
[Quick Read]: This tutorial addresses the equity and social-impact gaps in NLP for multilingual and low-resource languages, especially the challenges of data scarcity and cultural variance. Its key contribution is a practical end-to-end toolkit covering data collection, web crawling, parallel sentence mining, machine translation, and downstream tasks such as text classification and multimodal reasoning, with an emphasis on fair, reproducible, community-informed development toward more inclusive and sustainable language technologies.
Link: https://arxiv.org/abs/2512.14576
Authors: Ekaterina Artemova, Laurie Burchell, Daryna Dementieva, Shu Okabe, Mariya Shmatova, Pedro Ortiz Suarez
Affiliations: Toloka AI; Common Crawl Foundation; TUM
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Tutorial is accepted to LREC2026
Abstract:This tutorial (this https URL) is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages – from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
[NLP-8] PolyPersona: Persona-Grounded LLM for Synthetic Survey Responses
[Quick Read]: This paper addresses how to efficiently generate persona-conditioned synthetic survey responses for controlled instruction tuning and systematic multi-domain evaluation. The key contribution is the PolyPersona framework: compact chat models are instruction-tuned under a resource-adaptive setup with parameter-efficient LoRA adapters and 4-bit quantization, while a dialogue-based data pipeline explicitly preserves persona cues to keep generated responses behaviorally consistent. Across BLEU, ROUGE, BERTScore, and survey-specific metrics, small models such as TinyLlama 1.1B and Phi-2 match 7B-8B baselines, demonstrating that small language models can generate reliable, coherent synthetic survey data.
Link: https://arxiv.org/abs/2512.14562
Authors: Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad, Saima Rafi, Rohith Tangudu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE BigData 2025 - LLMs4ALL
Abstract:This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment. Experimental results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.
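A minimal sketch of the training setup the abstract describes, LoRA adapters on a 4-bit-quantized compact chat model, using `transformers`, `bitsandbytes`, and `peft`. The model choice, rank, and target modules are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch: attach trainable LoRA adapters to a 4-bit-quantized base model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative compact chat model
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed adapter placement
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```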
[NLP-9] Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis
[Quick Read]: This paper addresses the unclear level of agreement between generative AI and human raters in automatic essay scoring (AES), where empirical findings remain mixed and no systematic synthesis exists. Following the PRISMA 2020 guidelines, the authors synthesize 65 published and unpublished studies from January 2022 to August 2025, quantifying LLM-human agreement (e.g., quadratic weighted kappa, Pearson correlation, Spearman's rho), identifying factors that drive the substantial variability across studies, and flagging methodological gaps such as the lack of standardized reporting, thereby charting actionable directions for future research.
Link: https://arxiv.org/abs/2512.14561
Authors: Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim, Yali Feng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: This manuscript is under review as a book chapter
Abstract:Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman’s rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.
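For reference, the three agreement indices named in the abstract can be computed in a few lines; the sketch below uses scikit-learn and SciPy on illustrative scores.

```python
# Minimal sketch: the agreement indices used across the surveyed studies.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr, spearmanr

human = [3, 4, 2, 5, 3, 4, 1, 4]  # illustrative human essay scores
llm   = [3, 3, 2, 4, 4, 4, 2, 5]  # illustrative LLM essay scores

qwk = cohen_kappa_score(human, llm, weights="quadratic")  # Quadratic Weighted Kappa
r, _ = pearsonr(human, llm)
rho, _ = spearmanr(human, llm)
print(f"QWK={qwk:.3f}  Pearson r={r:.3f}  Spearman rho={rho:.3f}")
```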
[NLP-10] VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
[Quick Read]: This paper addresses the difficulty of evaluating large language models (LLMs) in the Vietnamese legal domain, where the complexity, hierarchical organization, and frequent revision of Vietnamese legislation make legal understanding hard to measure. The key contribution is the Vietnamese Legal Benchmark (VLegal-Bench), the first systematic benchmark for this setting. Informed by Bloom's cognitive taxonomy, it covers multiple levels of legal understanding across practical scenarios, including general legal Q&A, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving, with 10,450 samples annotated and cross-validated by legal experts to ensure grounding in authoritative legal documents and alignment with real-world legal assistant workflows, providing a standardized, transparent, cognitively informed evaluation framework.
Link: https://arxiv.org/abs/2512.14554
Authors: Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu
Affiliations: CMC OpenAI, Viet Nam; CMC OpenAI & Griffith University, Australia; CMC OpenAI & HUST, Viet Nam; SRH University Heidelberg, Germany
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom’s cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems.
[NLP-11] Dual Language Models: Balancing Training Efficiency and Overfitting Resilience
[Quick Read]: This paper addresses the trade-off between training efficiency and overfitting resilience under a single training objective: autoregressive modeling trains efficiently but overfits easily, while masked-diffusion models resist overfitting but train less efficiently. The key idea is to jointly optimize both objectives without any architectural modifications, getting the best of both worlds. Training and evaluating 50 language models under varying levels of data repetition, the authors show that combining the two objectives is optimal in every evaluated setting, whichever downstream performance (autoregressive or masked-diffusion) is targeted, and that the optimal mixing ratio is similar in both cases.
Link: https://arxiv.org/abs/2512.14549
Authors: David Samuel, Lucas Georges Gabriel Charpentier
Affiliations: University of Oslo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.
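A minimal sketch of what a dual-objective training step could look like: the same weights receive a weighted sum of an autoregressive loss and a masked-diffusion loss. The `model(tokens, causal=...)` interface, the uniform noise level, and the 0.5 default weight are assumptions for illustration; the paper derives the optimal ratio empirically.

```python
# Minimal sketch: one training step mixing AR and masked-diffusion objectives.
import torch
import torch.nn.functional as F

def dual_objective_loss(model, tokens, mask_id, ar_weight=0.5):
    # Autoregressive objective: predict token t+1 from tokens <= t (causal mask).
    ar_logits = model(tokens, causal=True)
    ar_loss = F.cross_entropy(
        ar_logits[:, :-1].flatten(0, 1), tokens[:, 1:].flatten()
    )

    # Masked-diffusion objective: mask a random fraction, predict the originals
    # with bidirectional attention over the corrupted sequence.
    t = torch.rand(tokens.size(0), 1, device=tokens.device)  # noise level per sample
    mask = torch.rand_like(tokens, dtype=torch.float) < t
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    mdm_logits = model(corrupted, causal=False)
    mdm_loss = F.cross_entropy(mdm_logits[mask], tokens[mask])

    return ar_weight * ar_loss + (1.0 - ar_weight) * mdm_loss
```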
[NLP-12] VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
[Quick Read]: This paper addresses the prohibitive memory costs of rapidly scaling large language models (LLMs); existing parameter-efficient methods such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, so they are bounded by the representational ceiling of the base model. The key contribution is VersatileFFN, a feed-forward network (FFN) that flexibly reuses parameters along both width and depth within a fixed parameter budget. It has two pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without adding parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gate dynamically balances the two, steering easy tokens through the wide route and hard tokens through the deep one; since both paths share the same parameters, all added capacity comes from computation rather than memory.
Link: https://arxiv.org/abs/2512.14531
Authors: Ying Nie, Kai Han, Hongguang Li, Hang Zhou, Tianyu Guo, Enhua Wu, Xinghao Chen, Yunhe Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering “easy” tokens through the efficient width-wise route and allocating deeper iterative refinement to “hard” tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code will be available at this https URL.
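The sketch below illustrates the dual-path idea in PyTorch: one shared FFN is reused width-wise (a soft mixture over chunks of its hidden dimension, standing in for sub-experts) and depth-wise (recursive application), blended by a per-token difficulty gate. The routing and gating details are simplified guesses based on the summary above, not the authors' design; `d_ff` must be divisible by `n_sub`.

```python
# Minimal sketch of a width-and-depth parameter-reuse FFN with a difficulty gate.
import torch
import torch.nn as nn

class VersatileFFNSketch(nn.Module):
    def __init__(self, d_model, d_ff, n_sub=4, depth=2):
        super().__init__()
        self.up, self.down, self.act = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model), nn.GELU()
        self.sub_router = nn.Linear(d_model, n_sub)  # weights over hidden chunks
        self.gate = nn.Linear(d_model, 1)            # difficulty-aware gate
        self.n_sub, self.depth = n_sub, depth

    def shared_ffn(self, x, chunk_weights=None):
        h = self.act(self.up(x))                     # (..., d_ff)
        if chunk_weights is not None:                # scale each hidden chunk ("sub-expert")
            h = h * chunk_weights.repeat_interleave(h.size(-1) // self.n_sub, dim=-1)
        return self.down(h)

    def forward(self, x):
        # Width path: soft mixture over chunks of the single shared FFN.
        w = torch.softmax(self.sub_router(x), dim=-1)
        wide = self.shared_ffn(x, w)
        # Depth path: apply the same FFN recursively with residuals.
        deep = x
        for _ in range(self.depth):
            deep = deep + self.shared_ffn(deep)
        g = torch.sigmoid(self.gate(x))              # per-token difficulty in (0, 1)
        return (1 - g) * wide + g * deep
```

Both paths read the same `up`/`down` weights, so the extra capacity costs compute, not memory, which is the core claim of the summary above.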
[NLP-13] Linguists should learn to love speech-based deep learning models
[Quick Read]: This commentary asks how to build an effective bridge between generative AI and linguistic theory. Futrell and Mahowald's framework usefully connects technology-oriented deep learning systems with explanation-oriented linguistic theories, but, the authors argue, its focus on generative text-based LLMs fundamentally limits the exchange, since many interesting questions about human language (speech, prosody, and other non-written phenomena) fall outside what written text captures. They therefore argue that audio-based deep learning models can and should play a crucial role.
Link: https://arxiv.org/abs/2512.14506
Authors: Marianne de Heer Kloots, Paul Boersma, Willem Zuidema
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
Comments: Commentary on Futrell, R., Mahowald, K. arXiv:2501.17047 (in press). How Linguistics Learned to Stop Worrying and Love the Language Models. Behavioural and Brain Sciences
Abstract:Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article’s focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.
[NLP-14] RecGPT-V2 Technical Report
[Quick Read]: This paper addresses four core problems in applying generative AI to recommender systems: computational inefficiency and cognitive redundancy, insufficient explanation diversity, weak generalization under supervised learning, and outcome-focused evaluation that fails to match human preferences. RecGPT-V2 answers with four innovations: (1) a Hierarchical Multi-Agent System restructures intent reasoning and, combined with Hybrid Representation Inference that compresses user-behavior contexts, cuts GPU consumption by 60% while raising exclusive recall; (2) a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity; (3) constrained reinforcement learning mitigates multi-reward conflicts, improving tag prediction and explanation acceptance; and (4) an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning that aligns better with human preferences. Online A/B tests on Taobao show significant gains across metrics, establishing both the technical feasibility and the commercial value of LLM-powered intent reasoning at industrial scale.
Link: https://arxiv.org/abs/2512.14503
Authors: Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen Chen, Wenjun Yang, Yujie Luo, Yuning Jiang, Zhujin Gao, Bo Zheng, Binbin Cao, Changfa Wu, Dixuan Wang, Han Wu, Haoyi Hu, Kewei Zhu, Lang Tian, Lin Yang, Qiqi Huang, Siqi Yang, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Xiaowei Huang, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yujin Yuan, Yuliang Yan, Zile Zhou
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards. To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.
[NLP-15] C-ing Clearly: Enhanced Binary Code Explanations using C code
[Quick Read]: This paper addresses the weaker performance of large language models (LLMs) on low-level programming languages such as assembly, which limits binary code summarization and vulnerability detection. The key contribution is C-ing Clearly, a synthetic data generation method that leverages the corresponding C code to enhance an LLM's understanding of assembly; fine-tuning on data generated this way yields consistent gains across LLM families and model sizes.
Link: https://arxiv.org/abs/2512.14500
Authors: Teodor Poncu, Ioana Pintilie, Marius Dragoi, Dragos Tantaru, Florin Brad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures
Abstract:Large Language Models (LLMs) typically excel at coding tasks involving high-level programming languages, as opposed to lower-level programming languages, such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM’s understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance for binary code summarization and vulnerability detection. Our approach demonstrates consistent gains across different LLM families and model sizes.
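The data-generation idea, pairing each assembly snippet with the C source it came from, can be prototyped with an off-the-shelf compiler. Below is a minimal sketch using `gcc -S`; the paths, optimization flag, and pairing format are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch: produce (C source, assembly) pairs for fine-tuning data.
import pathlib
import subprocess
import tempfile

def c_to_assembly(c_source: str, opt: str = "-O1") -> str:
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "snippet.c"
        asm = pathlib.Path(tmp) / "snippet.s"
        src.write_text(c_source)
        # -S stops after compilation, emitting assembly instead of an object file.
        subprocess.run(["gcc", "-S", opt, "-o", str(asm), str(src)], check=True)
        return asm.read_text()

c_code = "int add(int a, int b) { return a + b; }"
pair = {"c": c_code, "asm": c_to_assembly(c_code)}
print(pair["asm"])
```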
[NLP-16] SASQ: Static Activation Scaling for Quantization-Aware Training in Large Language Models
[Quick Read]: This paper addresses the deployment bottleneck created by model sizes outgrowing GPU memory. Existing quantization options carry sharp trade-offs: dynamic quantization preserves accuracy but adds computational overhead and is hard to deploy on edge devices, static quantization sacrifices accuracy, and existing quantization-aware training (QAT) methods incur weight-training costs. The key idea, SASQ, is a lightweight QAT framework that optimizes only the activation quantization factors, leaving pretrained weights untouched, enabling static inference with high accuracy and efficient deployment. By adaptively truncating some outliers, SASQ reduces quantization difficulty while preserving the distributional characteristics of activations; on LLaMA2-7B it outperforms state-of-the-art quantization schemes and even the original FP16 model (e.g., 4.7% lower perplexity on WikiText2).
Link: https://arxiv.org/abs/2512.14481
Authors: Shizhuo Mao, Song Chen, Yi Kang
Affiliations: University of Science and Technology of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) excel at natural language tasks but face deployment challenges due to their growing size outpacing GPU memory advancements. Model quantization mitigates this issue by lowering weight and activation precision, but existing solutions face fundamental trade-offs: dynamic quantization incurs high computational overhead and poses deployment challenges on edge devices, while static quantization sacrifices accuracy. Existing approaches of quantization-aware training (QAT) further suffer from weight training costs. We propose SASQ: a lightweight QAT framework specifically tailored for activation quantization factors. SASQ exclusively optimizes only the quantization factors (without changing pre-trained weights), enabling static inference with high accuracy while maintaining deployment efficiency. SASQ adaptively truncates some outliers, thereby reducing the difficulty of quantization while preserving the distributional characteristics of the activations. SASQ not only surpasses existing SOTA quantization schemes but also outperforms the corresponding FP16 models. On LLaMA2-7B, it achieves 5.2% lower perplexity than QuaRot and 4.7% lower perplexity than the FP16 model on WikiText2.
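The sketch below illustrates the kind of object SASQ trains: a static, learnable scale for fake-quantizing activations, optimized with a straight-through estimator while the pretrained weights stay frozen. The bit-width, per-tensor granularity, and initialization are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch: a static activation quantizer with a learnable scale.
import torch
import torch.nn as nn

class StaticActQuant(nn.Module):
    def __init__(self, init_scale=1.0, n_bits=8):
        super().__init__()
        # Only this scale is trained; it fixes the clipping range used at inference.
        self.log_scale = nn.Parameter(torch.tensor(float(init_scale)).log())
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x):
        scale = self.log_scale.exp()
        # Fake-quantize: clip outliers, round to the integer grid, dequantize.
        q = torch.clamp(x / scale, -self.qmax - 1, self.qmax)
        q = q + (q.round() - q).detach()  # straight-through estimator for the round
        return q * scale
```

Because the scale is fixed after training, inference needs no per-batch statistics, which is what makes the scheme static and edge-deployable.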
[NLP-17] Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models
[Quick Read]: This paper asks how document packing, the common practice of concatenating multiple documents during large-scale LLM training for computational efficiency, affects latent multi-hop reasoning, a question not previously studied systematically. The key approach is to compare different packing strategies and run an ablation study identifying the factors behind packing's advantages, revealing that packing can improve performance over training on individual documents, at the cost of more compute, and yielding practical insights for optimizing model development.
Link: https://arxiv.org/abs/2512.14427
Authors: Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar
Affiliations: Chandar Research Lab; Mila - Quebec AI Institute; Université de Montréal; Microsoft Research; Polytechnique Montréal; Canada CIFAR AI Chair; FAIR, Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models’ capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
[NLP-18] RePo: Language Models with Context Re-Positioning
[Quick Read]: This paper addresses the extraneous cognitive load imposed by rigid, linear positional indices in current large language models (LLMs), which consumes working-memory resources during in-context learning and hurts processing of complex or long-range dependencies. The key idea, RePo, is a differentiable context re-positioning mechanism: a learned function f_\phi assigns token positions that capture the intrinsic dependencies of the input context rather than relying on pre-defined integer indices. Continual pre-training on an OLMo-2 1B backbone shows significant gains on tasks with noisy contexts, structured data, and longer contexts, while staying competitive on general short-context tasks.
Link: https://arxiv.org/abs/2512.14391
Authors: Huayang Li, Tianyu Zhao, Richard Sproat
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, f_\phi , to assign token positions that capture contextual dependencies, rather than replying on pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocate higher attention to distant but relevant information, assign positions in dense and non-linear space, and capture the intrinsic structure of the input context. Our code is available at this https URL.
[NLP-19] Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring
[Quick Read]: This paper addresses the inefficiency of language reasoning models (LRMs), which over-generate verification and reflection steps. The key contribution is Step-Tagging, a lightweight sentence-classifier framework that annotates the type of each reasoning step in real time as the LRM generates it, built on ReasonType, a new taxonomy of reasoning steps. Online monitoring of the counts of specific step types yields interpretable early-stopping criteria that cut token usage by 20-50% while maintaining accuracy comparable to standard generation, with the largest gains on computation-heavy tasks.
Link: https://arxiv.org/abs/2512.14332
Authors: Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher
Affiliations: IBM Research Europe; Trinity College Dublin; ADAPT Research Centre
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The field of Language Reasoning Models (LRMs) has been very active over the past few years with advances in training and inference techniques enabling LRMs to reason longer, and more accurately. However, a growing body of studies show that LRMs are still inefficient, over-generating verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence-classifier enabling real-time annotation of the type of reasoning steps that an LRM is generating. To monitor reasoning behaviors, we introduced ReasonType: a novel taxonomy of reasoning steps. Building on this framework, we demonstrated that online monitoring of the count of specific steps can produce effective interpretable early stopping criteria of LRM inferences. We evaluate the Step-tagging framework on three open-source reasoning models across standard benchmark datasets: MATH500, GSM8K, AIME and non-mathematical tasks (GPQA and MMLU-Pro). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation, with largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs, and a new tool to study behaviors of LRMs.
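A minimal sketch of the monitoring loop described above: each generated step is tagged and generation halts once a chosen step type exceeds a budget. The keyword heuristic stands in for the paper's trained sentence classifier, and the tag names are only loosely modeled on the ReasonType taxonomy.

```python
# Minimal sketch: count step types online and stop when verifications pile up.
from collections import Counter

def classify_step(step: str) -> str:
    # Stand-in for the paper's sentence classifier: a crude keyword heuristic.
    if any(w in step.lower() for w in ("check", "verify", "wait")):
        return "verification"
    return "deduction"

def generate_with_step_monitor(stream_steps, max_verifications=3):
    counts, kept = Counter(), []
    for step in stream_steps:            # yields one reasoning step at a time
        tag = classify_step(step)
        counts[tag] += 1
        kept.append(step)
        if counts["verification"] > max_verifications:
            break                        # interpretable early-stopping criterion
    return kept, counts

steps = ["Compute 3*4 = 12.", "Wait, let me verify: 3*4 is indeed 12.",
         "Check the sum again.", "Verify once more.", "Double-check the result."]
print(generate_with_step_monitor(steps, max_verifications=2))
```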
[NLP-20] Inflation Attitudes of Large Language Models
[Quick Read]: This paper investigates whether large language models (LLMs) can form inflation perceptions and expectations from macroeconomic price signals, and how well they simulate human inflation attitudes. The key design is quasi-experimental: GPT-3.5-turbo's training cut-off in September 2021 means it has no knowledge of the subsequent UK inflation surge, creating a no-foreknowledge test bed. Model outputs are compared with household survey data (mimicking the information set and demographics of the Bank of England's Inflation Attitudes Survey) and official statistics, and a novel Shapley value decomposition suited to synthetic surveys quantifies how prompt content drives outputs. GPT tracks aggregate projections and official statistics at short horizons and replicates key regularities of household inflation perceptions across income, housing tenure, and social class, including a human-like heightened sensitivity to food inflation, but it lacks a consistent model of consumer price inflation.
Link: https://arxiv.org/abs/2512.14306
Authors: Nikoleta Anesti, Edward Hill, Andreas Joseph
Affiliations: Bank of England
Subjects: Computation and Language (cs.CL); Econometrics (econ.EM)
Comments: 41 pages, 11 figures
Abstract:This paper investigates the ability of Large Language Models (LLMs), specifically GPT-3.5-turbo (GPT), to form inflation perceptions and expectations based on macroeconomic price signals. We compare the LLM’s output to household survey data and official statistics, mimicking the information set and demographic characteristics of the Bank of England’s Inflation Attitudes Survey (IAS). Our quasi-experimental design exploits the timing of GPT’s training cut-off in September 2021 which means it has no knowledge of the subsequent UK inflation surge. We find that GPT tracks aggregate survey projections and official statistics at short horizons. At a disaggregated level, GPT replicates key empirical regularities of households’ inflation perceptions, particularly for income, housing tenure, and social class. A novel Shapley value decomposition of LLM outputs suited for the synthetic survey setting provides well-defined insights into the drivers of model outputs linked to prompt content. We find that GPT demonstrates a heightened sensitivity to food inflation information similar to that of human respondents. However, we also find that it lacks a consistent model of consumer price inflation. More generally, our approach could be used to evaluate the behaviour of LLMs for use in the social sciences, to compare different models, or to assist in survey design.
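For small numbers of prompt components, a Shapley decomposition like the one the paper applies can be computed exactly with the textbook formula, as sketched below. `model_output`, which maps a subset of prompt components to a scalar (e.g., the model's stated inflation expectation), is a hypothetical stand-in for a call to the LLM.

```python
# Minimal sketch: exact Shapley values over a small set of prompt components.
from itertools import combinations
from math import factorial

def shapley_values(components, model_output):
    """components: hashable items (e.g. 'food prices', 'energy prices').
    model_output: callable mapping a set of components to a scalar."""
    n = len(components)
    values = {}
    for c in components:
        rest = [x for x in components if x != c]
        phi = 0.0
        for k in range(n):
            for subset in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (model_output(set(subset) | {c})
                                 - model_output(set(subset)))
        values[c] = phi
    return values
```

Exact computation enumerates all 2^n subsets, so it is only feasible for the handful of prompt fields a synthetic survey uses; larger feature sets would need sampling approximations.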
[NLP-21] SPARQL-LLM: Real-Time SPARQL Query Generation from Natural Language Questions
[Quick Read]: This paper addresses the limitations of current LLM-based SPARQL generation, which mostly optimizes response accuracy over a single source while ignoring production-critical criteria such as federated querying over distributed knowledge graphs, runtime, and generation cost. The key contribution is SPARQL-LLM, an open-source, triplestore-agnostic architecture that drives query generation with lightweight metadata, with dedicated components for metadata indexing, prompt building, and query generation and execution. On a state-of-the-art multilingual benchmark plus questions from three of the most prevalent bioinformatics knowledge graphs, it raises the F1 score by 24%, handles complex federated queries, runs up to 36x faster than competing systems, and costs at most $0.01 per question, making real-time, low-cost text-to-SPARQL applications practical.
Link: https://arxiv.org/abs/2512.14277
Authors: Panayiotis Smeros, Vincent Emonet, Ruijie Wang, Ana-Claudia Sima, Tarcisio Mendes de Farias
Affiliations: SIB Swiss Institute of Bioinformatics; University of Zurich
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 8 figures, 1 table. Under Review
Abstract:The advent of large language models is contributing to the emergence of novel approaches that promise to better tackle the challenge of generating structured queries, such as SPARQL queries, from natural language. However, these new approaches mostly focus on response accuracy over a single source while ignoring other evaluation criteria, such as federated query capability over distributed data stores, as well as runtime and cost to generate SPARQL queries. Consequently, they are often not production-ready or easy to deploy over (potentially federated) knowledge graphs with good accuracy. To mitigate these issues, in this paper, we extend our previous work and describe and systematically evaluate SPARQL-LLM, an open-source and triplestore-agnostic approach, powered by lightweight metadata, that generates SPARQL queries from natural language text. First, we describe its architecture, which consists of dedicated components for metadata indexing, prompt building, and query generation and execution. Then, we evaluate it based on a state-of-the-art challenge with multilingual questions, and a collection of questions from three of the most prevalent knowledge graphs within the field of bioinformatics. Our results demonstrate a substantial increase of 24% in the F1 Score on the state-of-the-art challenge, adaptability to high-resource languages such as English and Spanish, as well as ability to form complex and federated bioinformatics queries. Furthermore, we show that SPARQL-LLM is up to 36x faster than other systems participating in the challenge, while costing a maximum of $0.01 per question, making it suitable for real-time, low-cost text-to-SPARQL applications. One such application deployed over real-world decentralized knowledge graphs can be found at this https URL.
[NLP-22] From Context to EDUs: Faithful and Structured Context Compression via Elementary Discourse Unit Decomposition
[Quick Read]: This paper addresses the high computational cost and noise of long contexts for large language models (LLMs), and the fact that existing compression techniques either break local coherence (discrete token removal) or suffer positional bias and closed-source-API incompatibility (implicit latent encoding). The key contribution, the EDU-based Context Compressor, is an explicit structure-then-select framework: LingoEDU first transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) anchored strictly to source indices, which eliminates hallucination; a lightweight ranking module then selects query-relevant sub-trees for linearization, preserving both global structure and fine-grained detail. The method sets state-of-the-art structural prediction accuracy, outperforms frontier LLMs at lower cost, and improves downstream tasks from long-context QA to complex Deep Search scenarios.
Link: https://arxiv.org/abs/2512.14244
Authors: Yiqing Zhou, Yu Lei, Shuzheng Si, Qingyan Sun, Wei Wang, Yifei Wu, Hao Wen, Gang Chen, Fanchao Qi, Maosong Sun
Affiliations: DeepLang AI; Department of Computer Science and Technology, Tsinghua University; Beijing University of Posts and Telecommunications; Beijing Jiaotong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Managing extensive context remains a critical bottleneck for Large Language Models (LLMs), particularly in applications like long-document question answering and autonomous agents where lengthy inputs incur high computational costs and introduce noise. Existing compression techniques often disrupt local coherence through discrete token removal or rely on implicit latent encoding that suffers from positional bias and incompatibility with closed-source APIs. To address these limitations, we introduce the EDU-based Context Compressor, a novel explicit compression framework designed to preserve both global structure and fine-grained details. Our approach reformulates context compression as a structure-then-select process. First, our LingoEDU transforms linear text into a structural relation tree of Elementary Discourse Units (EDUs) which are anchored strictly to source indices to eliminate hallucination. Second, a lightweight ranking module selects query-relevant sub-trees for linearization. To rigorously evaluate structural understanding, we release StructBench, a manually annotated dataset of 248 diverse documents. Empirical results demonstrate that our method achieves state-of-the-art structural prediction accuracy and significantly outperforms frontier LLMs while reducing costs. Furthermore, our structure-aware compression substantially enhances performance across downstream tasks ranging from long-context tasks to complex Deep Search scenarios.
[NLP-23] wo CFG Nahuatl for automatic corpora expansion
【速读】: 该论文旨在解决纳瓦特语(Nawatl)因数字资源稀缺而导致的大型语言模型(LLMs)训练语料库几乎不存在的问题。其解决方案的关键在于构建两种新的无上下文语法(Context-Free Grammars, CFG),并通过生成模式合成大量句法有效的纳瓦特语人工句子,从而显著扩展语料库,用于学习非上下文嵌入(non-contextual embeddings)。实验结果表明,相较于仅使用原始语料库的方法,该扩展策略在句子语义相似性任务中提升了性能,且经济型嵌入方法往往优于部分大型语言模型。
链接: https://arxiv.org/abs/2512.14239
作者: Juan-José Guzmán-Landa,Juan-Manuel Torres-Moreno,Miguel Figueroa-Saavedra,Ligia Quintana-Torres,Graham Ranger Martha-Lorena Avendaño-Garrido
机构: Avignon Université (阿维尼翁大学); Universidad Veracruzana (韦拉克鲁斯大学)
类目: Computation and Language (cs.CL)
备注: 15 pages, 5 figures, 8 tables
Abstract:The aim of this article is to introduce two Context-Free Grammars (CFG) for Nawatl corpora expansion. Nawatl is an Amerindian language (a National Language of Mexico) of the π-language type, i.e. a language with few digital resources. For this reason the corpora available for the learning of Large Language Models (LLMs) are virtually non-existent, posing a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non-contextual embeddings. For this objective, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand the Nawatl corpus significantly and subsequently to use it to learn embeddings and to evaluate their relevance in a sentence semantic similarity task. The results show an improvement compared to the results obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.
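Generating sentences from a CFG is straightforward with NLTK, as the minimal sketch below shows. The toy grammar and its small Nahuatl-flavored lexicon are illustrative only; the paper's two grammars are far richer.

```python
# Minimal sketch: enumerate sentences from a toy CFG with NLTK.
import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'in' | 'se'
N -> 'kalli' | 'siwatl'
V -> 'kitta'
""")

# Emit the first 10 syntactically valid sentences licensed by the grammar.
for sentence in generate(grammar, n=10):
    print(" ".join(sentence))
```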
[NLP-24] Ladder Up Memory Down: Low-Cost Fine-Tuning With Side Nets
[Quick Read]: This paper addresses the GPU-memory limits of fine-tuning large language models (LLMs) on commodity hardware: parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce trainable parameters but still incur high peak memory from the backward pass through the full model. The key move is to revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network; it matches QLoRA's compute scaling slope while cutting peak memory by 50%, enabling 7B-parameter models to be fine-tuned on a single 12 GB consumer GPU with 2k-token contexts and no gradient checkpointing, at accuracy competitive with QLoRA. The paper also introduces xLadder, a depth-extended variant that adds cross-connections to increase effective depth and shorten chains of thought at a fixed parameter count.
Link: https://arxiv.org/abs/2512.14237
Authors: Estelle Zheng (LORIA, ALE), Nathan Cerisara (LORIA), Sébastien Warichet (ALE), Emmanuel Helbert (ALE), Christophe Cerisara (SYNALP, LORIA)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing, conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.
[NLP-25] A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs AACL2025
[Quick Read]: This paper addresses the NLP challenge of translating standard Bengali into regional dialects, which data scarcity and linguistic variation make especially hard. The key contribution is a comparison of two novel RAG pipelines: a transcript-based pipeline using large dialect contexts from audio transcripts, and a more effective structured pipeline built on local_dialect:standard_bengali sentence pairs. Evaluated across six dialects and multiple LLMs with BLEU, ChrF, WER, and BERTScore, the sentence-pair pipeline consistently wins, cutting Word Error Rate (WER) from 76% to 55% for the Chittagong dialect, and it lets smaller models (e.g., Llama-3.1-8B) outperform much larger ones (e.g., GPT-OSS-120B), showing that a well-designed retrieval strategy can matter more than model size. The result is an effective, fine-tuning-free recipe for low-resource dialect translation.
Link: https://arxiv.org/abs/2512.14179
Authors: K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Affiliations: BRAC University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to the Second Workshop on Bangla Language Processing (BLP) at IJCNLP-AACL 2025. 14 pages, 9 figures, 6 tables
Abstract:Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
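The retrieval step of the sentence-pair pipeline can be sketched as follows: embed the standard-Bengali input, retrieve the nearest local_dialect:standard_bengali pairs, and splice them into the prompt as few-shot examples. The embedding model, placeholder pairs, and prompt format below are all illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: retrieve dialect/standard sentence pairs for the prompt.
from sentence_transformers import SentenceTransformer, util

# Placeholder pairs; the real corpus holds (local_dialect, standard_bengali) pairs.
pairs = [
    ("dialect sentence 1", "standard sentence 1"),
    ("dialect sentence 2", "standard sentence 2"),
]
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_emb = encoder.encode([std for _, std in pairs], convert_to_tensor=True)

def build_prompt(query: str, k: int = 2) -> str:
    q_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
    examples = "\n".join(
        f"{pairs[h['corpus_id']][1]} -> {pairs[h['corpus_id']][0]}" for h in hits
    )
    return f"Examples:\n{examples}\nTranslate to the dialect: {query}"

print(build_prompt("standard sentence 1"))
```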
[NLP-26] Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents
[Quick Read]: This paper addresses the mismatch between agentic LLM workflows, which alternate local computation with calls to external services such as Web APIs, and the per-segment scheduling granularity of existing inference systems such as vLLM, which cannot minimize end-to-end latency (the global Job Completion Time, JCT) over a request's full lifecycle. The key contribution is Astraea, a serving engine whose state-aware hierarchical scheduling algorithm combines a request's history with predictions of its future, dynamically classifies requests as I/O- or compute-intensive, and balances efficiency and fairness with an enhanced HRRN policy; an adaptive KV cache manager handles agent state during I/O waits based on system memory pressure. Astraea reduces average JCT by up to 25.5% over baselines and stays robust and stable under high load across model scales.
Link: https://arxiv.org/abs/2512.14142
Authors: Hongqiu Ni, Jiabao Zhang, Guopeng Li, Zilong Wang, Ruiqi Wu, Chi Zhang, Haisheng Tan
Affiliations: University of Science and Technology of China; Hefei University of Technology
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 8 figures
Abstract:Large Language Models (LLMs) are increasingly being deployed as intelligent agents. Their multi-stage workflows, which alternate between local computation and calls to external network services like Web APIs, introduce a mismatch in their execution pattern and the scheduling granularity of existing inference systems such as vLLM. Existing systems typically focus on per-segment optimization which prevents them from minimizing the end-to-end latency of the complete agentic workflow, i.e., the global Job Completion Time (JCT) over the entire request lifecycle. To address this limitation, we propose Astraea, a service engine designed to shift the optimization from local segments to the global request lifecycle. Astraea employs a state-aware, hierarchical scheduling algorithm that integrates a request’s historical state with future predictions. It dynamically classifies requests by their I/O and compute intensive nature and uses an enhanced HRRN policy to balance efficiency and fairness. Astraea also implements an adaptive KV cache manager that intelligently handles the agent state during I/O waits based on the system memory pressure. Extensive experiments show that Astraea reduces average JCT by up to 25.5% compared to baseline methods. Moreover, our approach demonstrates strong robustness and stability under high load across various model scales.
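For context, the classical Highest Response Ratio Next (HRRN) rule that Astraea's scheduler enhances ranks waiting requests by (waiting time + expected service time) / expected service time, which favors short jobs without starving long-waiting ones. A minimal sketch, with the request fields as assumptions:

```python
# Minimal sketch: pick the next request by Highest Response Ratio Next.
import time
from dataclasses import dataclass

@dataclass
class Request:
    arrival: float       # time the request entered the queue
    est_service: float   # predicted remaining compute time (seconds)

def pick_next(queue: list[Request]) -> Request:
    now = time.monotonic()

    def response_ratio(r: Request) -> float:
        wait = now - r.arrival
        return (wait + r.est_service) / r.est_service

    # Ratio starts at 1 and grows with waiting time, so no request starves.
    return max(queue, key=response_ratio)
```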
[NLP-27] CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models
[Quick Read]: This paper addresses the reasoning bias, task drift, hallucination, overconfidence, and memory decay that degrade LLM accuracy and coherence over extended multi-turn interactions. The key contribution is CogMem, a cognitively inspired, memory-augmented architecture with three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs a concise, task-relevant context at each turn. On TurnBench, this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains.
Link: https://arxiv.org/abs/2512.14118
Authors: Yiran Zhang, Jincheng Hu, Mark Dras, Usman Naseem
Affiliations: Macquarie University; Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments: under review
Abstract:Large language models (LLMs) excel at single-turn reasoning but often lose accuracy and coherence over extended, multi-turn interactions. Recent evaluations such as TurnBench highlight recurring failure modes-reasoning bias, task drift, hallucination, overconfidence, and memory decay. Current approaches typically append full conversational histories, causing unbounded context growth, higher computational costs, and degraded reasoning efficiency. We introduce CogMem, a cognitively inspired, memory-augmented LLM architecture that supports sustained iterative reasoning through structured, persistent memory. CogMem incorporates three layers: a Long-Term Memory (LTM) that consolidates cross-session reasoning strategies; a Direct Access (DA) memory that maintains session-level notes and retrieves relevant long-term memories; and a Focus of Attention (FoA) mechanism that dynamically reconstructs concise, task-relevant context at each turn. Experiments on TurnBench show that this layered design mitigates reasoning failures, controls context growth, and improves consistency across extended reasoning chains, moving toward more reliable, human-like reasoning in LLMs.
[NLP-28] Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study
[Quick Read]: This paper addresses how to model cross-linguistic differences in backchannel timing, i.e., predicting when a listener produces short acknowledgment tokens (such as "yeah" or "right") in Japanese, English, and Chinese. The key contribution is a Transformer-based, frame-level multilingual continuous backchannel prediction model jointly trained with auxiliary tasks on roughly 300 hours of dyadic conversations. The model matches or surpasses monolingual baselines, capturing both language-universal cues (e.g., silence duration, prosodic variation) and language-specific timing patterns; multilingual training encourages shared yet adaptable representations, and the trained model runs in real time with CPU-only inference.
Link: https://arxiv.org/abs/2512.14085
Authors: Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Taiga Mori, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Affiliations: Kyoto University
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Comments: This paper has been accepted for presentation at the International Workshop on Spoken Dialogue Systems Technology 2026 (IWSDS 2026) and represents the author's version of the work
Abstract:We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into a real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally-aware spoken dialogue systems.
[NLP-29] A Unified Sparse Attention via Multi-Granularity Compression
Quick Read: This paper addresses the efficiency bottleneck of large language models (LLMs) on long-context understanding and reasoning, caused by the quadratic growth of self-attention computation with sequence length. Existing sparse attention methods alleviate the problem but face trade-offs: training-based methods are costly and hard to apply as general acceleration plugins, while inference-time methods sacrifice efficiency or cross-modal generality. The key to the solution is the UniSparse mechanism, which introduces the abstraction of "composite tokens" (compact representations that aggregate multi-granularity contextual information) and dynamically builds a sparse attention structure on top of them, combining multi-granularity compression with block-level selection for efficient, hardware-friendly GPU execution. Across modalities and tasks the method clearly outperforms state-of-the-art sparse attention schemes, retaining at least 99% of full-attention accuracy while running up to 2.61 times faster than FlashAttention.
Link: https://arxiv.org/abs/2512.14082
Authors: Siran Liu,Zane Cao,Yongchao He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens–compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and block-level selection, enabling efficient and hardware-friendly execution on GPU. Across multiple modalities and tasks ranging from synthetic benchmarks to real-world applications, UniSparse consistently surpasses state-of-the-art sparse attention methods (e.g., MInference, XAttention, FlexPrefill) in both accuracy and efficiency, achieving ≥ 99% of full-attention accuracy and up to 2.61× faster attention computation than FlashAttention.
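As an illustration of the block-level selection ingredient the abstract names, here is a minimal sketch assuming mean-pooled block summaries as the coarse granularity and a per-query top-k over blocks; it is a generic pattern, not the UniSparse implementation.

```python
# A minimal sketch of block-level sparse attention: keys are compressed
# into per-block summaries, and each query attends only to the tokens of
# its top-k scoring blocks. Written for clarity, not GPU efficiency.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, topk=4):
    # q, k, v: (seq, dim); assumes seq is divisible by `block`
    seq, dim = k.shape
    n_blocks = seq // block
    k_blocks = k.view(n_blocks, block, dim)
    v_blocks = v.view(n_blocks, block, dim)
    summaries = k_blocks.mean(dim=1)             # coarse-granularity compression
    scores = q @ summaries.T                     # (seq, n_blocks)
    sel = scores.topk(topk, dim=-1).indices      # blocks kept per query
    out = torch.zeros_like(q)
    for i in range(q.shape[0]):                  # per-query gather, for clarity
        ks = k_blocks[sel[i]].reshape(-1, dim)   # (topk*block, dim)
        vs = v_blocks[sel[i]].reshape(-1, dim)
        attn = F.softmax(q[i] @ ks.T / dim ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out

q = k = v = torch.randn(256, 32)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([256, 32])
```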
[NLP-30] Grammar Search for Multi-Agent Systems
Quick Read: This paper addresses the limited efficiency and interpretability of automatic search for multi-agent systems (MAS). Existing approaches mostly rely on LLM-driven free-form generative search over the code space, which is flexible but costly, logically complex, and hard to parse. The key to the proposed structured framework is to explore the same code space with a fixed, composable set of simple components, building agents in a modular way; this keeps compute costs low while improving search efficiency and system interpretability. Experiments show the method outperforms prior LLM-based approaches on four of five benchmarks across the two domains studied, mathematical reasoning and question answering.
Link: https://arxiv.org/abs/2512.14079
Authors: Mayank Singh,Vikas Yadav,Shiva Krishna Reddy Malay,Shravan Nayak,Sai Rajeswar,Sathwik Tejaswi Madhusudhan,Eduardo Blanco
Affiliations: University of Arizona; ServiceNow; Mila - Quebec AI Institute
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract:Automatic search for Multi-Agent Systems has recently emerged as a key focus in agentic AI research. Several prior approaches have relied on LLM-based free-form search over the code space. In this work, we propose a more structured framework that explores the same space through a fixed set of simple, composable components. We show that, despite lacking the generative flexibility of LLMs during the candidate generation stage, our method outperforms prior approaches on four out of five benchmarks across two domains: mathematics and question answering. Furthermore, our method offers additional advantages, including a more cost-efficient search process and the generation of modular, interpretable multi-agent systems with simpler logic.
[NLP-31] Efficient-DLM: From Autoregressive to Diffusion Language Models and Beyond in Speed
Quick Read: This paper addresses the lower learning efficiency of diffusion language models (dLMs) trained from scratch compared with autoregressive (AR) language models, focusing on how to efficiently convert pretrained AR models into dLMs that combine high generation speed with high task accuracy. The solution rests on two key improvements. First, a continuous pretraining strategy with a block-wise attention pattern that remains causal across blocks and bidirectional within each block, which better preserves the pretrained AR weight distribution while supporting KV caching, a win-win for accuracy and efficiency. Second, a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training, narrowing the gap between training- and test-time masking distributions (uniform vs. highly left-to-right) and improving generalization. Together these form the basis of the Efficient-DLM family, which substantially outperforms existing AR models and dLMs on multiple benchmarks.
Link: https://arxiv.org/abs/2512.14067
Authors: Yonggan Fu,Lexington Whalen,Zhifan Ye,Xin Dong,Shizhe Diao,Jingyu Liu,Chengyue Wu,Hao Zhang,Enze Xie,Song Han,Maksim Khadkevich,Jan Kautz,Yingyan Celine Lin,Pavlo Molchanov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models’ task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models’ weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs’ attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
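The two mechanisms described in the abstract are easy to state in code. The sketch below is our illustration under stated assumptions (block size, probability range), not the paper's implementation: a block-wise mask that is causal across blocks but bidirectional within each block, and a masking schedule whose probability grows with token position.

```python
# A minimal sketch of (1) the block-wise attention mask and (2) the
# position-dependent masking schedule from the abstract.
import torch

def blockwise_mask(seq_len, block):
    # mask[i, j] = True means position i may attend to position j
    blk = torch.arange(seq_len) // block
    return blk[:, None] >= blk[None, :]   # same or earlier block only

def position_dependent_mask_probs(seq_len, p_min=0.1, p_max=0.9):
    # later tokens are masked more often, mimicking left-to-right decoding
    t = torch.linspace(0.0, 1.0, seq_len)
    return p_min + (p_max - p_min) * t

mask = blockwise_mask(8, block=4)
print(mask.int())  # bidirectional within each 4-token block, causal across blocks
probs = position_dependent_mask_probs(8)
print(probs, torch.bernoulli(probs).bool())  # sampled mask positions
```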
[NLP-32] What Affects the Effective Depth of Large Language Models?
Quick Read: This paper examines the underuse of "effective depth" as large language models (LLMs) scale: although layer counts grow, the fraction of layers actually used for meaningful computation does not rise accordingly, so performance gains diminish. The key contribution is a systematic analysis of how effective depth varies with model scale, training type, and task difficulty, finding that models underuse available depth in every condition: long-CoT models show no increase in effective depth over their base models, and models do not dynamically engage more layers on harder tasks. These findings point to research opportunities in raising the layer utilization of LLMs, including better architecture design, model pruning, and early exiting.
Link: https://arxiv.org/abs/2512.14064
Authors: Yi Hu,Cai Zhou,Muhan Zhang
Affiliations: Peking University; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish with added layers. Prior work introduces the concept of “effective depth”, arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the model behavior of Qwen-2.5 family (1.5B-32B) and find that while the number of effective layers grows with model size, the effective depth ratio remains stable. Besides, comparisons between base and corresponding long-CoT models show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Furthermore, evaluations across tasks of varying difficulty indicate that models do not dynamically use more layers for harder problems. Our results suggest that current LLMs underuse available depth across scales, training paradigms and tasks of varying difficulties, pointing out research opportunities on increasing the layer utilization rate of LLMs, model pruning, and early exiting. Our code is released at this https URL.
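One simple way to probe layer utilization in the spirit of this analysis is to compare hidden states before and after each layer. The sketch below does this with Hugging Face Transformers; the model name and the cosine-similarity criterion are our assumptions, not the paper's exact metric.

```python
# A minimal probe: if a layer's output is nearly identical to its input,
# that layer contributes little per-token computation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-1.5B"  # illustrative; any causal LM with hidden states
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    hs = model(**inputs, output_hidden_states=True).hidden_states  # L+1 tensors

for i in range(1, len(hs)):
    sim = torch.nn.functional.cosine_similarity(
        hs[i - 1].flatten(1), hs[i].flatten(1)).mean().item()
    print(f"layer {i:2d}: cos(input, output) = {sim:.4f}")
# layers with similarity close to 1.0 barely change the residual stream
```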
[NLP-33] HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Quick Read: This paper addresses the high computation and memory cost of deploying multimodal large language models (MLLMs) on edge devices, in particular the latency and memory bottleneck of standard Vision Transformer (ViT) encoders on high-resolution images. The key to the HyperVL architecture is two techniques that enable efficient inference: a Visual Resolution Compressor (VRC) that adaptively predicts the optimal encoding resolution to eliminate redundant computation, and Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders in a unified framework and supports dynamic switching between visual branches under a shared LLM, substantially reducing on-device latency and power consumption while maintaining strong performance.
Link: https://arxiv.org/abs/2512.14052
Authors: HyperAI Team:Yuchen Liu,Kaiyang Han,Zhiqiang Xia,Yuhang Dong,Chen Song,Kangyu Tang,Jiaming Xu,Xiushi Feng,WenXuan Yu,Li Peng,Mingyang Wang,Kai Wang,Changpeng Yang,Yang Li,Haoyu Lu,Hao Wang,Bingna Xu,Guangyao Liu,Long Huang,Kaibin Guo,Jinyang Wu,Dan Wu,Hongzhen Wang,Peng Zhou,Shuai Nie,Shande Wang,Runyu Shi,Ying Huang
Affiliations: Xiaomi Corporation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Technical report of Xiaomi HyperAI Team
Abstract:Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution images. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
[NLP-34] Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models
Quick Read: This paper addresses the difficulty traditional entity extraction methods have in maintaining both semantic integrity and structural consistency for nested and overlapping entities. The key to the solution is a structure-aware decoding method built on large language models: a candidate span generation mechanism and structured attention modeling jointly model entity boundaries, hierarchical relationships, and cross-dependencies; hierarchical structural constraints are imposed during decoding, and a classification loss is jointly optimized with a structural consistency loss, improving recognition accuracy and structural stability in scenarios with multi-entity co-occurrence and long-sentence dependencies.
Link: https://arxiv.org/abs/2512.13980
Authors: Zhimin Qiu,Di Wu,Feng Liu,Chenrui Hu,Yuxiao Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper proposes a structure-aware decoding method based on large language models to address the difficulty of traditional approaches in maintaining both semantic integrity and structural consistency in nested and overlapping entity extraction tasks. The method introduces a candidate span generation mechanism and structured attention modeling to achieve unified modeling of entity boundaries, hierarchical relationships, and cross-dependencies. The model first uses a pretrained language model to obtain context-aware semantic representations, then captures multi-granular entity span features through candidate representation combinations, and introduces hierarchical structural constraints during decoding to ensure consistency between semantics and structure. To enhance stability in complex scenarios, the model jointly optimizes classification loss and structural consistency loss, maintaining high recognition accuracy under multi-entity co-occurrence and long-sentence dependency conditions. Experiments conducted on the ACE 2005 dataset demonstrate significant improvements in Accuracy, Precision, Recall, and F1-Score, particularly in nested and overlapping entity recognition, where the model shows stronger boundary localization and structural modeling capability. This study verifies the effectiveness of structure-aware decoding in complex semantic extraction tasks, provides a new perspective for developing language models with hierarchical understanding, and establishes a methodological foundation for high-precision information extraction.
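The candidate span mechanism can be illustrated with a simple enumerator: every span up to a maximum width becomes a candidate that a scorer may classify, which is what lets nested and overlapping entities coexist. The sketch below is a generic illustration with made-up tokens, not the paper's model.

```python
# A minimal sketch of candidate span generation for nested/overlapping
# entity extraction; the width limit is an illustrative hyperparameter.
def candidate_spans(tokens, max_width=4):
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_width, len(tokens)) + 1):
            spans.append((start, end, tokens[start:end]))
    return spans

toks = ["Acme", "Bank", "of", "Springfield", "announced"]
for s, e, words in candidate_spans(toks, max_width=4):
    print(s, e, " ".join(words))
# nested candidates such as "Bank" and "Bank of Springfield" coexist,
# so overlapping entities can be scored independently.
```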
[NLP-35] Olmo 3
Quick Read: This paper targets the shortcomings of current open-source language models on long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. The key to the solution is releasing the complete, reproducible model flow, covering every stage and dependency from training data to model checkpoints, together with Olmo 3 Think 32B, the strongest fully open reasoning model released to date, markedly improving open models' performance and usability on complex tasks.
Link: https://arxiv.org/abs/2512.13961
Authors: Team Olmo:Allyson Ettinger,Amanda Bertsch,Bailey Kuehl,David Graham,David Heineman,Dirk Groeneveld,Faeze Brahman,Finbarr Timbers,Hamish Ivison,Jacob Morrison,Jake Poznanski,Kyle Lo,Luca Soldaini,Matt Jordan,Mayee Chen,Michael Noukhovitch,Nathan Lambert,Pete Walsh,Pradeep Dasigi,Robert Berry,Saumya Malik,Saurabh Shah,Scott Geng,Shane Arora,Shashank Gupta,Taira Anderson,Teng Xiao,Tyler Murray,Tyler Romero,Victoria Graf,Akari Asai,Akshita Bhagia,Alexander Wettig,Alisa Liu,Aman Rangapur,Chloe Anastasiades,Costa Huang,Dustin Schwenk,Harsh Trivedi,Ian Magnusson,Jaron Lochner,Jiacheng Liu,Lester James V. Miranda,Maarten Sap,Malia Morgan,Michael Schmitz,Michal Guerquin,Michael Wilson,Regan Huff,Ronan Le Bras,Rui Xin,Rulin Shao,Sam Skjonsberg,Shannon Zejiang Shen,Shuyue Stella Li,Tucker Wilde,Valentina Pyatkin,Will Merrill,Yapei Chang,Yuling Gu,Zhiyuan Zeng,Ashish Sabharwal,Luke Zettlemoyer,Pang Wei Koh,Ali Farhadi,Noah A. Smith,Hannaneh Hajishirzi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.
[NLP-36] Generative AI for Video Translation: A Scalable Architecture for Multilingual Video Conferencing
Quick Read: This paper addresses two system-level obstacles to deploying cascaded generative AI pipelines in real-time applications such as video translation: the cumulative latency of sequential model inference, and the quadratic O(N²) computational complexity of multi-user video conferencing, which prevents scaling. The key to the solution is a practical system-level framework with two core mechanisms: a turn-taking mechanism that reduces the multi-user complexity from quadratic to linear O(N), and a segmented processing protocol that bounds inference latency for a perceptually real-time experience. The design is validated on several hardware tiers (commodity, cloud, and enterprise GPUs), achieving real-time throughput (τ ≥ 1.0), and a subjective user study confirms that a predictable initial processing delay is acceptable, providing a viable path for scalable real-time generative AI in multilingual communication platforms.
Link: https://arxiv.org/abs/2512.13904
Authors: Amirkia Rafiei Oskooei,Eren Caglar,Ibrahim Sahin,Ayse Kayabay,Mehmet S. Aktas
Affiliations: Yildiz Technical University
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted manuscript. Published in Applied Sciences, 2025
Abstract:The real-time deployment of cascaded generative AI pipelines for applications like video translation is constrained by significant system-level challenges. These include the cumulative latency of sequential model inference and the quadratic (O(N²)) computational complexity that renders multi-user video conferencing applications unscalable. This paper proposes and evaluates a practical system-level framework designed to mitigate these critical bottlenecks. The proposed architecture incorporates a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios, and a segmented processing protocol to manage inference latency for a perceptually real-time experience. We implement a proof-of-concept pipeline and conduct a rigorous performance analysis across a multi-tiered hardware setup, including commodity (NVIDIA RTX 4060), cloud (NVIDIA T4), and enterprise (NVIDIA A100) GPUs. Our objective evaluation demonstrates that the system achieves real-time throughput (τ ≥ 1.0) on modern hardware. A subjective user study further validates the approach, showing that a predictable, initial processing delay is highly acceptable to users in exchange for a smooth, uninterrupted playback experience. The work presents a validated, end-to-end system design that offers a practical roadmap for deploying scalable, real-time generative AI applications in multilingual communication platforms.
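The complexity claim is easy to check with a stream count: without turn-taking, every speaker-listener pair needs its own translated stream; with turn-taking, only the active speaker is translated for the other participants. A minimal back-of-the-envelope sketch:

```python
# Concurrent translation streams needed for N participants.
def streams_all_speaking(n):
    return n * (n - 1)   # every speaker translated for every listener: O(N^2)

def streams_turn_taking(n):
    return n - 1         # one active speaker at a time: O(N)

for n in (2, 4, 8, 16):
    print(n, streams_all_speaking(n), streams_turn_taking(n))
# e.g. at N=16: 240 concurrent pipelines vs. 15
```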
[NLP-37] Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs
Quick Read: This paper targets the low information utilization and wasted inference-time compute of long-context large language models (LLMs) in practice: existing inference-time scaling strategies based on generating thinking tokens show rapidly diminishing returns on long inputs and fail to extract the key contextual signals. The key to the solution is fine-tuning on the given input context via targeted gradient updates, which provably overcomes the score dilution inherent to static self-attention and markedly improves long-context utilization. Experiments show consistent, substantial gains across long-context benchmarks, indicating that spending inference-time compute on context-specific optimization is more effective than simply producing more thinking tokens.
Link: https://arxiv.org/abs/2512.13898
Authors: Rachit Bansal,Aston Zhang,Rishabh Tiwari,Lovish Madaan,Sai Surya Duvvuri,Devvrit Khatri,David Brandfonbrener,David Alvarez-Melis,Prajjwal Bhargava,Mihir Sanjay Kale,Samy Jelassi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.
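A generic version of "a small amount of context-specific training" looks like the sketch below: a few causal-LM gradient steps on the context before answering. This is our reading of the general recipe rather than the paper's exact method, and the model name, learning rate, and step count are placeholders.

```python
# A minimal sketch of test-time training on the given context: absorb the
# long document into the weights with a handful of gradient steps, then
# answer the query with the adapted model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-4B"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

context = "..."  # the long document the question is about
ids = tok(context, return_tensors="pt", truncation=True).input_ids

model.train()
for _ in range(4):  # a few steps is the point: cheap, targeted adaptation
    loss = model(ids, labels=ids).loss  # next-token loss on the context
    loss.backward()
    opt.step()
    opt.zero_grad()

model.eval()  # now decode the answer as usual, with adapted weights
```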
[NLP-38] FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
Quick Read: This paper addresses the scarcity of high-quality annotated data for multilingual named entity recognition (NER), which limits model performance especially for low-resource languages; existing resources rely on manual annotation or unsystematic synthetic generation and do not scale across languages and scripts. The key to the solution is FiNERweb, an automated dataset-creation pipeline based on the teacher-student paradigm: regression models identify NER-relevant passages, which multilingual large language models then annotate, yielding roughly 225k passages with 235k distinct entity labels across 91 languages and 25 scripts. The approach cuts data-acquisition cost (19x less training data than strong baselines) while achieving comparable or better zero-shot transfer, and LLM-as-a-judge evaluation confirms high annotation faithfulness (3.99/5) and completeness (4.05/5). Because evaluating with target-language labels instead of English ones drops performance by 0.02 to 0.09 F1, the dataset is released with both English labels and translated target-language labels, improving generalization in genuinely multilingual settings.
Link: https://arxiv.org/abs/2512.13884
Authors: Jonas Golde,Patrick Haller,Alan Akbik
Affiliations: Humboldt Universität zu Berlin
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when evaluated using target language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.
[NLP-39] EvoLattice: Persistent Internal-Population Evolution through Multi-Alternative Quality-Diversity Graph Representations for LLM-Guided Program Discovery
Quick Read: This paper addresses limitations of current LLM-based evolution of programs and multi-agent systems, where overwrite-based mutations lose information about candidates, suffer destructive edits, and explore a structurally brittle search space. The core of the EvoLattice framework is to represent an entire candidate population in a single directed acyclic graph (DAG) whose nodes store multiple persistent alternatives, with every valid path through the graph corresponding to an executable candidate. This structure expresses a combinatorial search space without duplicating structure and supports fine-grained alternative-level evaluation, scoring each alternative across all paths it appears in; the resulting dense, data-driven feedback signal guides LLM-based mutation, recombination, and pruning while preserving successful components. A deterministic self-repair mechanism guarantees structural correctness, and the framework extends naturally to agent evolution, yielding more stable trajectories, greater expressivity, and stronger improvements on program synthesis tasks.
Link: https://arxiv.org/abs/2512.13857
Authors: Kamer Ali Yuksel
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Large language models (LLMs) are increasingly used to evolve programs and multi-agent systems, yet most existing approaches rely on overwrite-based mutations that maintain only a single candidate at a time. Such methods discard useful variants, suffer from destructive edits, and explore a brittle search space prone to structural failure. We introduce EvoLattice, a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct executable candidate, yielding a large combinatorial search space without duplicating structure. EvoLattice enables fine-grained alternative-level evaluation by scoring each alternative across all paths in which it appears, producing statistics that reveal how local design choices affect global performance. These statistics provide a dense, data-driven feedback signal for LLM-guided mutation, recombination, and pruning, while preserving successful components. Structural correctness is guaranteed by a deterministic self-repair mechanism that enforces acyclicity and dependency consistency independently of the LLM. EvoLattice naturally extends to agent evolution by interpreting alternatives as prompt fragments or sub-agent behaviors. Across program synthesis (proxy and optimizer meta-learning), EvoLattice yields more stable evolution, greater expressivity, and stronger improvement trajectories than prior LLM-guided methods. The resulting dynamics resemble quality-diversity optimization, emerging implicitly from EvoLattice’s internal multi-alternative representation rather than an explicit external archive.
[NLP-40] Mitigating Catastrophic Forgetting in Mathematical Reasoning Finetuning through Mixed Training
Quick Read: This paper addresses catastrophic forgetting when large language models are finetuned for specialized tasks such as mathematical reasoning: gains on the target task come at severe cost to previously learned general capabilities. For example, finetuning Flan-T5-Base on the DeepMind Mathematics dataset alone raises mathematical reasoning accuracy from 3.1% to 12.0% but collapses MultiNLI accuracy from 81.0% to 16.5%, showing that specialized training degrades general ability. The key to the solution is mixed training: interleaving mathematical and NLI examples so that a small amount of general-task data regularizes the model's memory. Experiments show that even a 1:15 mixing ratio (6.2% NLI) prevents forgetting while matching math-only mathematical performance, demonstrating that specialization need not sacrifice general capabilities.
Link: https://arxiv.org/abs/2512.13706
Authors: John Graham Reynolds
Affiliations: The University of Texas at Austin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 11 pages, 2 figures. Code available at this https URL . Models available at this https URL
Abstract:When finetuning large language models for specialized tasks such as mathematical reasoning, models exhibit catastrophic forgetting, losing previously learned capabilities. We investigate this by finetuning Flan-T5-Base (250M parameters) on the DeepMind Mathematics dataset and measuring forgetting on MultiNLI. Math-only training improves mathematical accuracy from 3.1% to 12.0% but causes NLI accuracy to collapse from 81.0% to 16.5%–a 64.5 percentage point drop occurring within the first 1,000 training steps. We propose mixed training strategies that interleave mathematical and NLI examples during training. Our results demonstrate that mixed training completely eliminates catastrophic forgetting while maintaining equivalent mathematical performance: the balanced 1:1 ratio achieves 12.0% math accuracy (matching math-only) while preserving 86.2% NLI accuracy. We systematically explore mixing ratios from 1:1 to 15:1, finding that even minimal NLI exposure (6.2%) provides effective regularization. These findings demonstrate that specialization need not require forgetting general capabilities, with implications for scaling to larger models where mixed training may confer additional benefits beyond forgetting prevention.
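The mixing strategy reduces to interleaving one general-task example after every `ratio` specialized examples; at 15:1 the NLI share works out to 3/48, about 6.2%, matching the minimal exposure the abstract reports. A minimal sketch with placeholder data:

```python
# Ratio-controlled mixed training stream: yield an NLI replay example
# after every `ratio` math examples. Dataset contents are placeholders.
from itertools import cycle

def mixed_stream(math_examples, nli_examples, ratio=15):
    nli = cycle(nli_examples)  # reuse the smaller replay set as needed
    for i, ex in enumerate(math_examples, start=1):
        yield ex
        if i % ratio == 0:
            yield next(nli)

math_data = [f"math-{i}" for i in range(45)]
nli_data = [f"nli-{i}" for i in range(3)]
batch = list(mixed_stream(math_data, nli_data, ratio=15))
print(len(batch), batch[14:17])  # 48 items; an NLI example follows every 15th math example
```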
[NLP-41] Leveraging LLMs for Structured Data Extraction from Unstructured Patient Records
Quick Read: This paper addresses the time- and resource-intensive nature of manual chart review in clinical research, which requires experts to extract complex information from unstructured electronic health record (EHR) narratives. The key to the solution is a secure, modular framework that uses locally deployed large language models (LLMs) on HIPAA-compliant compute infrastructure to automatically extract structured features from clinical notes; it integrates retrieval-augmented generation (RAG) and structured-response methods into a widely deployable, scalable containerized system serving diverse clinical domains. In evaluation, the framework achieves high accuracy across multiple medical characteristics and identifies annotation errors missed in manual review, substantially reducing review burden and improving consistency of data capture.
Link: https://arxiv.org/abs/2512.13700
Authors: Mitchell A. Klusty,Elizabeth C. Solie,Caroline N. Leach,W. Vaiden Logan,Lynnet E. Richey,John C. Gensel,David P. Szczykutowicz,Bryan C. McLellan,Emily B. Collier,Samuel E. Armstrong,V.K. Cody Bumgardner
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, 2 tables, submitted to AMIA 2026 Informatics Summit
Abstract:Manual chart review remains an extremely time-consuming and resource-intensive component of clinical research, requiring experts to extract often complex information from unstructured electronic health record (EHR) narratives. We present a secure, modular framework for automated structured feature extraction from clinical notes leveraging locally deployed large language models (LLMs) on institutionally approved, Health Insurance Portability and Accountability Act (HIPAA)-compliant compute infrastructure. This system integrates retrieval augmented generation (RAG) and structured response methods of LLMs into a widely deployable and scalable container to provide feature extraction for diverse clinical domains. In evaluation, the framework achieved high accuracy across multiple medical characteristics present in large bodies of patient notes when compared against an expert-annotated dataset and identified several annotation errors missed in manual review. This framework demonstrates the potential of LLM systems to reduce the burden of manual chart review through automated extraction and increase consistency in data capture, accelerating clinical research.
[NLP-42] Writing in Symbiosis: Mapping Human Creative Agency in the AI Era NEURIPS2025
Quick Read: This paper asks how the coevolution of humans and AI in creative writing, driven by the proliferation of large language models (LLMs), is reshaping authors' practice and stylistic diversity, and whether a universal stylistic homogenization is taking place. The key to the solution is a large-scale longitudinal corpus spanning the pre- and post-LLM eras, which reveals a pattern of "Dual-Track Evolution": thematic convergence around AI-related topics coupled with structured stylistic differentiation, manifested in three author adaptation patterns (moving toward AI style, moving away from it, or remaining stylistically stable while engaging with AI-related themes). The resulting "Creative Archetype Map" provides an empirical basis for understanding how human agency evolves in human-AI co-creation, with implications for preserving diversity in AI-assisted writing and for detection strategies.
Link: https://arxiv.org/abs/2512.13697
Authors: Vivan Doshi,Mengyuan Li
Affiliations: University of Southern California
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Advances in Neural Information Processing Systems (NeurIPS 2025)
Abstract:The proliferation of Large Language Models (LLMs) raises a critical question about what it means to be human when we share an increasingly symbiotic relationship with persuasive and creative machines. This paper examines patterns of human-AI coevolution in creative writing, investigating how human craft and agency are adapting alongside machine capabilities. We challenge the prevailing notion of stylistic homogenization by examining diverse patterns in longitudinal writing data. Using a large-scale corpus spanning the pre- and post-LLM era, we observe patterns suggestive of a “Dual-Track Evolution”: thematic convergence around AI-related topics, coupled with structured stylistic differentiation. Our analysis reveals three emergent adaptation patterns: authors showing increased similarity to AI style, those exhibiting decreased similarity, and those maintaining stylistic stability while engaging with AI-related themes. This Creative Archetype Map illuminates how authorship is coevolving with AI, contributing to discussions about human-AI collaboration, detection challenges, and the preservation of creative diversity.
[NLP-43] MultiBanAbs: A Comprehensive Multi-Domain Bangla Abstractive Text Summarization Dataset
Quick Read: This paper addresses the narrowness and poor adaptability of existing Bangla text summarization datasets, in particular the difficulty current models have with the diverse real-world sources of Bangla text (blogs, social media, and news). The key to the solution is a novel multi-domain abstractive summarization dataset of more than 54,000 Bangla articles with summaries collected from multiple sources (including the Cinegolpo blog and the Samakal newspaper), covering varied writing styles and domains, which substantially improves generalization and practical relevance; strong baselines built with deep learning and transfer learning models (LSTM, BanglaT5-small, and MTS-small) establish it as an important benchmark resource for Bangla natural language processing (NLP) research.
Link: https://arxiv.org/abs/2511.19317
Authors: Md. Tanzim Ferdous,Naeem Ahsan Chowdhury,Prithwiraj Bhattacharjee
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This study developed a new Bangla abstractive summarization dataset to generate concise summaries of Bangla articles from diverse sources. Most existing studies in this field have concentrated on news articles, where journalists usually follow a fixed writing style. While such approaches are effective in limited contexts, they often fail to adapt to the varied nature of real-world Bangla texts. In today’s digital era, a massive amount of Bangla content is continuously produced across blogs, newspapers, and social media. This creates a pressing need for summarization systems that can reduce information overload and help readers understand content more quickly. To address this challenge, we developed a dataset of over 54,000 Bangla articles and summaries collected from multiple sources, including blogs such as Cinegolpo and newspapers such as Samakal and The Business Standard. Unlike single-domain resources, our dataset spans multiple domains and writing styles. It offers greater adaptability and practical relevance. To establish strong baselines, we trained and evaluated this dataset using several deep learning and transfer learning models, including LSTM, BanglaT5-small, and MTS-small. The results highlight its potential as a benchmark for future research in Bangla natural language processing. This dataset provides a solid foundation for building robust summarization systems and helps expand NLP resources for low-resource languages.
[NLP-44] Segmental Attention Decoding With Long Form Acoustic Encodings
Quick Read: This paper addresses the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented speech implicitly learn absolute frame positions by exploiting the limited acoustic context beyond segment boundaries, but these positional cues vanish when decoding long continuous audio, and the permutation invariance of keys and values in cross-attention leaves the model unable to order the acoustic encodings. The key to the solution is four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment; (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding; (3) segment concatenation to cover the diverse segmentations needed during training; and (4) semantic segmentation to align AED-decoded segments with training labels. Together these close the accuracy gap between continuous and segmented acoustic encodings, enabling stable use of auto-regressive decoders on long-form speech.
Link: https://arxiv.org/abs/2512.14652
Authors: Pawel Swietojanski,Xinwei Li,Mingbin Xu,Takaaki Hori,Dogan Can,Xiaodan Zhuang
Affiliations: Apple
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: 5 pages, 1 figure
Abstract:We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses ability to order acoustic encodings due to permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
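Modification (1) can be sketched as follows: give the acoustic keys and values of each decoded segment sinusoidal encodings of their absolute frame positions, so ordering survives long-form decoding. This is our illustration of the idea, not the paper's implementation; the encoding scheme and dimensions are assumptions.

```python
# A minimal sketch of injecting absolute positional information into the
# acoustic memory that the decoder cross-attends to.
import torch

def sinusoidal_pe(n_pos, d):
    pos = torch.arange(n_pos, dtype=torch.float)[:, None]
    i = torch.arange(0, d, 2, dtype=torch.float)[None, :]
    angles = pos / torch.pow(10000.0, i / d)
    pe = torch.zeros(n_pos, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

def position_aware_memory(acoustic_enc, seg_start):
    # acoustic_enc: (frames, d) encoder output for the current segment
    # seg_start: absolute index of the segment's first frame in the recording
    frames, d = acoustic_enc.shape
    pe = sinusoidal_pe(seg_start + frames, d)[seg_start:]
    return acoustic_enc + pe  # keys/values now carry absolute frame positions

enc = torch.randn(300, 256)
mem = position_aware_memory(enc, seg_start=12000)
print(mem.shape)  # (300, 256)
```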
[NLP-45] Scalable Frameworks for Real-World Audio-Visual Speech Recognition
Quick Read: This paper addresses the performance degradation of audio-visual speech recognition (AVSR) systems in real-world environments, caused mainly by unpredictable acoustic noise and visual interference. The key to the solution is a hierarchical, systematic approach that builds robustness and scalability at three levels: at the representation level, a unified model learns multimodal features inherently robust to diverse real-world corruptions; at the architecture level, adaptive resource allocation efficiently scales model capacity and reliably exploits multimodal inputs; and at the system level, modular integration with large-scale foundation models leverages their strong cognitive and generative capabilities to maximize final recognition accuracy.
Link: https://arxiv.org/abs/2512.14083
Authors: Sungnyun Kim
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: PhD Dissertation
Abstract:The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving the robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system’s functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.
[NLP-46] Hierarchical Multi-agent Large Language Model Reasoning for Autonomous Functional Materials Discovery
Quick Read: This paper addresses the limited autonomy of current AI in scientific exploration: most methods automate procedural tasks without engaging in scientific reasoning, constraining autonomous discovery of new knowledge. The key to the solution is Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), a framework in which large language models (LLMs) autonomously design, execute, and interpret atomistic simulations, with multi-agent collaboration strategies (peer review, triage-ranking, and triage-forms) driving rational exploration of chemical processes. The approach reduces the number of density functional theory (DFT) simulations required by up to 90% while keeping decisions chemically grounded rather than the product of stochastic sampling or semantic bias.
Link: https://arxiv.org/abs/2512.13930
Authors: Samuel Rothfarb,Megan C. Davis,Ivana Matanovic,Baikun Li,Edward F. Holby,Wilton J.M. Kort-Kamp
Affiliations: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Keywords: Multi-agent reasoning; Large language models; Active learning; AI-driven simulation; Materials discovery; Density functional theory; Surface chemistry
Abstract:Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher-level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi-agent approaches: peer review, triage-ranking, and triage-forms. Across two chemical applications, CO adsorption on Cu-surface transition metal (M) adatoms and on M-N-C catalysts, reasoning-driven exploration reduces required atomistic simulations by up to 90% relative to trial-and-error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi-agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.
[NLP-47] Shakespeare Entropy and Educated Monkeys
Quick Read: The question this paper takes up is how to dramatically shorten the time a random text generator needs to produce a specific target text (such as a passage of Shakespeare). The classical observation is that a monkey typing at random would eventually produce the complete works of Shakespeare, but over a time far exceeding the age of the universe; this paper points out that if the monkey's output is constrained to "statistically typical" text, that is, sequences obeying the statistics of natural language, the time to produce a given passage shrinks enormously. The key lies in the information-theoretic notions of entropy and typical sequences: constraining the generation space lets the random process reach the target far more efficiently, reducing the time for a specific phrase from astronomical figures to the order of tens of thousands of years, still far beyond anything practical.
Link: https://arxiv.org/abs/2512.11880
Authors: Ioannis Kontoyiannis
Affiliations: Unknown
Subjects: History and Overview (math.HO); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments:
Abstract:It has often been said, correctly, that a monkey forever randomly typing on a keyboard would eventually produce the complete works of William Shakespeare. Almost just as often it has been pointed out that this "eventually" is well beyond any conceivably relevant time frame. We point out that an educated monkey that still types at random but is constrained to only write "statistically typical" text, would produce any given passage in a much shorter time. Information theory gives a very simple way to estimate that time. For example, Shakespeare's phrase, Better three hours too soon than a minute too late, from The Merry Wives of Windsor, would take the educated monkey only 73 thousand years to produce, compared to the beyond-astronomical 2.7 × 10^63 years for the randomly typing one. Despite the obvious improvement, it would still take the educated monkey an unimaginably long 10^42,277 years to produce all of Hamlet.
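The information-theoretic estimate behind these numbers fits in a few lines; the following LaTeX snippet states it under generic assumptions (alphabet size |A|, entropy rate H bits per character, typing rate r characters per second) without reproducing the paper's exact constants.

```latex
% A back-of-the-envelope version of the estimate, under the stated
% assumptions, not the paper's derivation.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
A uniformly random typist produces a new length-$n$ window with every
keystroke, each matching an $n$-character target phrase with probability
about $|A|^{-n}$, so the expected waiting time is roughly $|A|^{n}/r$
seconds. If typing is instead constrained to statistically typical text
with entropy rate $H$ bits per character, there are only about $2^{nH}$
typical strings of length $n$, and the waiting time drops to
\[
  T_{\mathrm{typical}} \approx \frac{2^{nH}}{r}
  \quad\text{versus}\quad
  T_{\mathrm{uniform}} \approx \frac{|A|^{n}}{r}
  = \frac{2^{\,n\log_2 |A|}}{r},
\]
an exponential saving of about $2^{\,n(\log_2 |A| - H)}$, which is huge
because $H \ll \log_2 |A|$ for English.
\end{document}
```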
Computer Vision
[CV-0] MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Quick Read: This paper addresses content consistency over long contexts in streaming video generation, whose core challenge is designing an efficient memory mechanism for cross-frame coherence. Existing methods maintain memory by compressing historical frames with predefined strategies, which cannot adapt to the different historical cues each video chunk needs. The key to the MemFlow framework is twofold: before generating each chunk, the memory bank is dynamically updated by retrieving the historical frames most relevant to the chunk's text prompt, preserving narrative coherence; and during generation, only the memory tokens most relevant to each query are activated in the attention layers, keeping the method efficient (only a 7.9% speed reduction versus a memory-free baseline) and compatible with any streaming video generation model that uses a KV cache.
Link: https://arxiv.org/abs/2512.14699
Authors: Sihui Ji,Xi Chen,Shuai Yang,Xin Tao,Pengfei Wan,Hengshuang Zhao
Affiliations: HKU; Kling Team, Kuaishou Technology; HKUST(GZ)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:The core challenge for streaming video generation is maintaining content consistency over long contexts, which places high demands on the memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the most relevant historical frames with the text prompt of this chunk. This design enables narrative coherence even if a new event happens or the scenario switches in future frames. In addition, during generation, we only activate the most relevant tokens in the memory bank for each query in the attention layers, which effectively preserves generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden (7.9% speed reduction compared with the memory-free baseline) and keeps compatibility with any streaming video generation model with KV cache.
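The retrieval step can be sketched as prompt-conditioned top-k selection over frame embeddings; the embedding models and bank size below are assumptions, not details from the paper.

```python
# A minimal sketch of prompt-conditioned memory retrieval: before
# generating a chunk, keep only the k historical frames whose embeddings
# best match the chunk's text prompt.
import torch
import torch.nn.functional as F

def update_memory_bank(frame_embs, prompt_emb, k=8):
    # frame_embs: (n_frames, d) embeddings of all frames generated so far
    # prompt_emb: (d,) embedding of the text prompt for the coming chunk
    sims = F.cosine_similarity(frame_embs, prompt_emb[None, :], dim=-1)
    top = sims.topk(min(k, frame_embs.shape[0])).indices
    return frame_embs[top], top  # memory bank for the next chunk

frames = torch.randn(120, 512)   # e.g. CLIP-like frame features (assumed)
prompt = torch.randn(512)        # e.g. text encoder output (assumed)
bank, idx = update_memory_bank(frames, prompt, k=8)
print(bank.shape, idx.tolist())  # (8, 512) and the retained frame indices
```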
[CV-1] Spherical Leech Quantization for Visual Tokenization and Generation
Quick Read: This paper addresses the limitations of non-parametric quantization in training stability and the reconstruction-compression tradeoff, in particular the auxiliary loss terms needed to train lookup-free variants such as BSQ (Binary Spherical Quantization). The key to the solution is a unified view of different non-parametric quantization methods through the geometry of lattice coding, and a spherical quantization method based on the Leech lattice (Λ₂₄), dubbed Spherical Leech Quantization (Λ₂₄-SQ). Exploiting the Leech lattice's high symmetry and even distribution on the hypersphere, the method markedly simplifies training while improving reconstruction quality and compression efficiency: on image tokenization and compression it outperforms BSQ, the best prior method, while consuming slightly fewer bits, and it extends to state-of-the-art autoregressive image generation frameworks.
Link: https://arxiv.org/abs/2512.14697
Authors: Yue Zhao,Hanwen Jiang,Zhenlin Xu,Chutong Yang,Ehsan Adeli,Philipp Krähenbühl
Affiliations: UT Austin; Stanford University; Adobe Research; Mistral AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Tech report; project page: this https URL
Abstract:Non-parametric quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ( \Lambda_24 -SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
[CV-2] CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
Quick Read: This paper addresses the joint recovery of simulatable human motion and scene geometry from monocular video; existing methods either rely on data-driven priors with no physics in the loop or recover noisy, artifact-ridden geometry that causes motion-tracking policies with scene interactions to fail. The key to the solution is threefold: first, a simple clustering pipeline over depth, normals, and optical flow fits planar primitives to obtain convex, clean, simulation-ready scene geometry; second, human-scene contact modeling recovers geometry occluded during interaction (e.g., reconstructing an occluded chair seat from the person's posture); finally, a reinforcement-learning humanoid controller driven by the reconstructions ensures that human and scene are physically plausible. The method cuts motion-tracking failure rates on human-centric video benchmarks from 55.2% to 6.9% while delivering 43% faster RL simulation throughput, and it is validated on in-the-wild videos (casual captures, Internet videos, and even Sora-generated videos), substantially advancing real-to-sim applications in robotics and AR/VR.
Link: https://arxiv.org/abs/2512.14696
Authors: Zihan Wang,Jiashun Wang,Jeff Tan,Yiwen Zhao,Jessica Hodgins,Shubham Tulsiani,Deva Ramanan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: Project page: this https URL
Abstract:We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP’s ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
[CV-3] Native and Compact Structured Latents for 3D Generation MICRO
Quick Read: This paper addresses the limited realism of current 3D generative modeling, which stems from existing representations' difficulty in capturing complex topology and fine-grained appearance. The core of the solution is O-Voxel (omni-voxel), a new sparse voxel structure that jointly encodes geometry and appearance, supports arbitrary topology (open, non-manifold, and fully enclosed surfaces), and represents physically based rendering parameters beyond texture color. A Sparse Compression VAE built on O-Voxel achieves a high spatial compression rate and a compact latent space, and 4B-parameter flow-matching models trained on it keep inference efficient while markedly improving the geometric accuracy and material quality of generated assets.
Link: https://arxiv.org/abs/2512.14692
Authors: Jianfeng Xiang,Xiaoxue Chen,Sicheng Xu,Ruicheng Wang,Zelong Lv,Yu Deng,Hongyuan Zhu,Yue Dong,Hao Zhao,Nicholas Jing Yuan,Jiaolong Yang
Affiliations: Tsinghua University; Microsoft Research; USTC; Microsoft AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
[CV-4] VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image NEURIPS2025 MICRO WWW
Quick Read: This paper tackles two core problems: capturing the subtle expression details of real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. The key to the solution is to leverage the motion latent of VASA-1 and translate it effectively to 3D by devising a 3D head model conditioned on that latent; the model is customized to the single input image through an optimization framework that uses numerous video frames synthesized from the input, with a variety of training losses that are robust to artifacts and limited pose coverage in the generated data.
Link: https://arxiv.org/abs/2512.14677
Authors: Sicheng Xu,Guojun Chen,Jiaolong Yang,Yizhong Zhang,Yu Deng,Steve Lin,Baining Guo
Affiliations: Microsoft Research Asia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 paper. Project webpage: this https URL
Abstract:We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.
[CV-5] ART: Articulated Reconstruction Transformer
Quick Read: This paper addresses the reconstruction of complete 3D articulated objects from sparse multi-state RGB images; existing methods either depend on slow, fragile optimization over cross-state correspondences or are category-specific and fail to generalize. The key to ART (Articulated Reconstruction Transformer) is to treat articulated objects as assemblies of rigid parts and cast reconstruction as part-based prediction: a newly designed transformer architecture maps the sparse image inputs to a set of learnable part slots, from which it jointly decodes a unified per-part representation, including 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and directly exportable for simulation, and after training on a large-scale, diverse dataset with per-part supervision, ART substantially outperforms existing baselines across benchmarks, setting a new state of the art for articulated object reconstruction from images.
Link: https://arxiv.org/abs/2512.14671
Authors: Zizhang Li,Cheng Zhang,Zhengqin Li,Henry Howard-Jenkins,Zhaoyang Lv,Chen Geng,Jiajun Wu,Richard Newcombe,Jakob Engel,Zhao Dong
Affiliations: Stanford University; Reality Labs Research, Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:We introduce ART, Articulated Reconstruction Transformer – a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
[CV-6] EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Quick Read: This paper addresses the poor adaptability that supervised finetuning (SFT) imposes on current Vision-Language-Action (VLA) models for robotic manipulation: SFT requires hundreds of task-specific demonstrations, rigidly memorizes trajectories, and cannot adapt when deployment conditions deviate from training. Toward truly continually adaptive embodied intelligence, EVOLVE-VLA replaces the oracle reward unavailable at test time with a learned progress estimator that provides dense autonomous feedback; the key technical contribution is two mechanisms that "tame" this inherently noisy signal: an accumulative progress estimation mechanism that smooths noisy point-wise estimates, and a progressive horizon extension strategy that lets the policy evolve gradually. The framework yields substantial gains on long-horizon tasks (+8.6%) and one-shot learning (+22.0%), and enables cross-task generalization (20.8% success on unseen tasks versus 0% for pure SFT).
Link: https://arxiv.org/abs/2512.14666
Authors: Zechen Bai,Chen Gao,Mike Zheng Shou
Affiliations: Show Lab; National University of Singapore
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages
Abstract:Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned progress estimator providing dense feedback, and critically, we design our framework to ``tame’’ this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism smoothing noisy point-wise estimates, and (2) a progressive horizon extension strategy enabling gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6% on long-horizon tasks, +22.0% in 1-shot learning, and enables cross-task generalization – achieving 20.8% success on unseen tasks without task-specific demonstrations training (vs. 0% for pure SFT). Qualitative analysis reveals emergent capabilities absent in demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvements.
[CV-7] Enhancing Visual Sentiment Analysis via Semiotic Isotopy-Guided Dataset Construction
Quick Read: This paper addresses the weak generalization of visual sentiment analysis (VSA) models caused by the wide variability of emotionally salient images and the shortage of annotated data; the core challenges are building large-scale, diverse VSA datasets and developing methods that localize the emotionally relevant elements of an image. The key to the solution is integrating the semiotic isotopy concept into the dataset construction process, producing a new, larger dataset that is more emotionally diverse and steers models toward emotionally relevant combinations of image elements. Empirically, models trained on the resulting dataset consistently outperform those trained on the original collections across major VSA benchmarks, with clearly improved cross-dataset generalization.
Link: https://arxiv.org/abs/2512.14665
Authors: Marco Blanchini,Giovanna Maria Dimitri,Benedetta Tondi,Tarcisio Lancioni,Mauro Barni
Affiliations: University of Siena; University of Milan (Statale)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficient data to capture this variability comprehensively. Key obstacles include building large-scale VSA datasets and developing effective methodologies that enable algorithms to identify emotionally significant elements within an image. These challenges are reflected in the limited generalization performance of VSA algorithms and models when trained and tested across different datasets. Starting from a pool of existing data collections, our approach enables the creation of a new larger dataset that not only contains a wider variety of images than the original ones, but also permits training new models with improved capability to focus on emotionally relevant combinations of image elements. This is achieved through the integration of the semiotic isotopy concept within the dataset creation process, providing deeper insights into the emotional content of images. Empirical evaluations show that models trained on a dataset generated with our method consistently outperform those trained on the original data collections, achieving superior generalization across major VSA benchmarks
[CV-8] ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
Quick Read: This paper addresses the limitations of multimodal large language models (MLLMs) on mathematical reasoning that arise from reasoning over text with only a single static image, ignoring the human habit of dynamically consulting the figure and verifying intermediate propositions step by step. The key to the ViRC framework is a "Reason Chunking" mechanism that structures the multimodal mathematical chain of thought (CoT) into consecutive Critical Reasoning Units (CRUs): within each CRU, textual coherence supports verification of intermediate propositions, while visual information is integrated across units to generate subsequent propositions, mimicking expert problem-solving patterns. Combined with the CRUX dataset and a progressive, cognitively inspired training strategy (Instructional SFT, Practice SFT, and Strategic RL), this markedly improves structured reasoning and the use of visual information, giving the resulting ViRC-7B model an 18.8% average gain over baselines across multiple math benchmarks.
Link: https://arxiv.org/abs/2512.14654
Authors: Lihong Wang,Liangqi Li,Weiwei Feng,Jiamin Wu,Changtao Miao,Tieru Wu,Rui Ma,Bo Zhang,Zhe Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL
Abstract:CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at this https URL.
zh
[CV-9] Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-guided Subtyping and Lesion-Wise Model Ensemble MICCAI
【速读】:该论文旨在解决多参数磁共振成像(multi-parametric MRI)中脑肿瘤分割的鲁棒性和泛化性难题,尤其针对成人与儿童不同类型脑肿瘤差异显著带来的挑战。其解决方案的关键在于构建一个灵活、模块化且可适应的处理流程:通过选择并组合最先进的分割模型,在训练前后引入肿瘤类型和病灶特异性的预处理与后处理策略;利用从MRI中提取的放射组学特征(radiomic features)辅助识别肿瘤亚型,从而实现更均衡的训练数据分布;同时采用定制化的病灶级别性能指标来动态调整集成模型中各子模型的权重,并优化后处理步骤,使整个工作流能够根据每个病例的具体情况自动调优。该方法在BraTS测试集上达到了与顶级算法相当的性能,验证了病变感知式处理和模型选择策略的有效性,且不依赖于特定网络架构,具备临床定量测量潜力,支持诊断与预后评估。
链接: https://arxiv.org/abs/2512.14648
作者: Daniel Capellán-Martín,Abhijeet Parida,Zhifan Jiang,Nishad Kulkarni,Krithika Iyer,Austin Tapp,Syed Muhammad Anwar,María J. Ledesma-Carbayo,Marius George Linguraru
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 12 pages, 5 figures, 3 tables. Algorithm presented at MICCAI BraTS 2025
Abstract:Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring a more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations yet without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.
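摘要中“按病灶级指标决定各模型在集成中的权重”这一步可以用一个很小的示意来说明。下面的 Python 草图基于假设:以各模型在验证集上的病灶级 Dice 归一化后作为权重,对概率图加权平均(权重构造与二值化阈值均为本文示意,并非论文原始实现):

```python
import numpy as np

def lesionwise_ensemble(prob_maps, lesion_scores):
    """病灶级加权集成示意(权重构造方式为假设):按各模型在验证集上的
    病灶级指标(如 lesion-wise Dice)归一化成权重,对概率图加权平均。"""
    w = np.asarray(lesion_scores, dtype=float)
    w = w / w.sum()
    fused = sum(wi * p for wi, p in zip(w, prob_maps))
    return (fused > 0.5).astype(np.uint8)   # 二值化后再接病灶级后处理

# 三个分割模型的概率图(示意体数据 64^3)与它们的病灶级 Dice
probs = [np.random.rand(64, 64, 64) for _ in range(3)]
mask = lesionwise_ensemble(probs, lesion_scores=[0.78, 0.82, 0.74])
```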
zh
[CV-10] A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images
【速读】:该论文旨在解决淋巴瘤亚型诊断中依赖多模态检测手段(如免疫组化、流式细胞术和分子遗传学检测)所导致的设备成本高、专业人员需求大及诊疗延迟的问题。其解决方案的关键在于构建首个多中心淋巴瘤基准数据集,涵盖四种常见亚型及健康对照组织,并系统评估五种病理基础模型(H-optimus-1、H0-mini、Virchow2、UNI2、Titan)与两种注意力机制(AB-MIL)和Transformer架构(TransMIL)的多实例学习聚合方法在三种放大倍数(10x、20x、40x)下的表现。结果显示,在分布内测试集中模型平衡准确率均超过80%,且40x分辨率已足够,无需更高分辨率或跨倍数聚合;但分布外测试集性能显著下降至约60%,凸显了当前模型泛化能力的不足,为未来研究指明方向。
链接: https://arxiv.org/abs/2512.14640
作者: Rao Muhammad Umer,Daniel Sens,Jonathan Noll,Christian Matek,Lukas Wolfseher,Rainer Spang,Ralf Huss,Johannes Raffler,Sarah Reinke,Wolfram Klapper,Katja Steiger,Kristina Schwamborn,Carsten Marr
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages
Abstract:Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.
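摘要中评测的 AB-MIL 聚合器,其核心是对 patch 特征做注意力加权池化。下面给出一个最小的 PyTorch 示意(特征维度、隐藏层大小等均为假设,仅用于说明机制,并非论文官方代码):

```python
import torch
import torch.nn as nn

class ABMILPooling(nn.Module):
    """注意力池化示意:把 N 个 patch 特征聚合为单个切片级表示。"""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),          # 每个实例一个注意力打分
        )

    def forward(self, patches):            # patches: (N, dim)
        scores = self.attn(patches)        # (N, 1)
        weights = torch.softmax(scores, dim=0)
        return (weights * patches).sum(0)  # (dim,) 切片级特征

# 用法:基础模型提取的 patch 特征 -> 切片级特征 -> 线性分类头
feats = torch.randn(1000, 768)             # 假设 1000 个 patch
slide_emb = ABMILPooling()(feats)
logits = nn.Linear(768, 5)(slide_emb)      # 4 个淋巴瘤亚型 + 健康对照
```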
zh
[CV-11] AMD-HookNet++: Evolution of AMD-HookNet with Hybrid CNN-Transformer Feature Enhancement for Glacier Calving Front Segmentation
【速读】:该论文旨在解决冰川前缘(calving front)分割中因卷积神经网络(CNN)局部性与平移不变性导致的长程依赖建模能力不足的问题,从而提升冰川边界识别的精度和连续性。其核心解决方案是提出一种新型混合CNN-Transformer特征增强方法AMD-HookNet++,通过双分支结构实现全局上下文与局部细节的协同建模:其中基于Transformer的上下文分支用于捕捉长距离依赖关系以提供大尺度语义信息,而CNN目标分支则保留局部细节;进一步设计增强的空间-通道注意力模块以动态调节两分支间的token交互,并引入像素级对比深度监督机制,融合像素级度量学习优化模型训练。实验表明,该方法在CaFFe数据集上达到78.2的IoU和1,318 m的HD95指标,显著优于现有方法,且生成更平滑的冰川前缘轮廓,有效缓解纯Transformer方法常见的锯齿状边缘问题。
链接: https://arxiv.org/abs/2512.14639
作者: Fei Wu,Marcel Dreier,Nora Gourmelon,Sebastian Wind,Jianlin Zhang,Thorsten Seehaus,Matthias Braun,Andreas Maier,Vincent Christlein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The dynamics of glaciers and ice shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet first introduced a pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet, the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restrict the model’s ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel advanced hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch to capture long-range dependencies, which provides global contextual information in a larger view, and a CNN-based target branch to preserve local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module to foster interactions between the hybrid CNN-Transformer branches through dynamically adjusting the token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision to optimize our hybrid model by integrating pixelwise metric learning into glacier segmentation. Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2 and a HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the issue of jagged edges typically seen in pure Transformer-based approaches.
zh
[CV-12] Distill Video Datasets into Images
【速读】:该论文旨在解决视频数据集蒸馏(video set distillation)中存在的性能不佳问题,其核心挑战在于视频的时序维度引入了大量可学习参数,导致优化困难且收敛性差。解决方案的关键在于提出单帧视频蒸馏框架(Single-Frame Video set Distillation, SFVD),该方法利用“单帧通常足以捕捉视频判别语义”的观察,将每类视频蒸馏为高信息量的代表性帧,并通过可微插值将其转化为视频序列进行匹配;同时限制更新仅作用于这些帧以提升优化效率,并在匹配过程中结合真实视频采样与通道重塑层以增强时序信息建模能力,从而显著优于现有方法,在多个基准上实现最高达5.3%的性能提升。
链接: https://arxiv.org/abs/2512.14621
作者: Zhenghao Zhao,Haoxuan Wang,Kai Wang,Yuzhang Shang,Yuan Hong,Yan Yan
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); National University of Singapore (新加坡国立大学); University of Central Florida (中佛罗里达大学); University of Connecticut (康涅狄格大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with real videos sampled from the original dataset during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF, thereby offering a more effective solution.
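SFVD 的关键是“每类只学习一帧,再经可微插值展开成视频参与匹配”,因此梯度只回传到单帧参数。下面的草图演示这一思路;其中随时间缓慢缩放的变换和占位损失都是本文的假设,并非论文的实际插值与匹配目标:

```python
import torch
import torch.nn.functional as F

# 每类蒸馏一帧:frame 是唯一可学习参数(示意尺寸 3x112x112)
frame = torch.randn(3, 112, 112, requires_grad=True)

def frame_to_clip(frame, T=16, scale=1.05):
    """可微插值示意:对单帧做逐步缩放再重采样,得到 T 帧伪视频。"""
    clips = []
    for t in range(T):
        s = 1.0 + (scale - 1.0) * t / (T - 1)       # 随时间缓慢变化的缩放
        h, w = frame.shape[-2:]
        resized = F.interpolate(frame[None], scale_factor=s,
                                mode="bilinear", align_corners=False)[0]
        clips.append(F.interpolate(resized[None], size=(h, w),
                                   mode="bilinear", align_corners=False)[0])
    return torch.stack(clips)                        # (T, 3, H, W)

clip = frame_to_clip(frame)          # 与真实视频在特征空间做匹配
loss = clip.mean()                   # 占位:实际为分布/梯度匹配损失
loss.backward()                      # 梯度只更新这一帧
print(frame.grad.shape)
```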
zh
[CV-13] WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
【速读】:该论文旨在解决当前视频生成模型在实时交互与长期几何一致性之间存在的权衡问题,即如何在保证高帧率(如24 FPS)的同时维持长时间序列的结构稳定性,避免因记忆衰减导致的误差累积。其解决方案的关键在于三个核心创新:1)采用双动作表示(Dual Action Representation)实现对用户键盘和鼠标输入的鲁棒控制;2)提出重构上下文记忆(Reconstituted Context Memory),通过动态重建历史帧并利用时间重框架(temporal reframing)保持远距离关键帧的可访问性,缓解记忆衰减;3)设计一种面向记忆感知模型的新型蒸馏方法——上下文强制(Context Forcing),通过教师-学生架构对齐记忆上下文,使学生模型在保持实时速度的同时具备利用长程信息的能力,从而显著提升生成视频的长期一致性与泛化性能。
链接: https://arxiv.org/abs/2512.14614
作者: Wenqiang Sun,Haiyu Zhang,Haoyuan Wang,Junta Wu,Zehan Wang,Zhenwei Wang,Yunhong Wang,Jun Zhang,Tengfei Wang,Chunchao Guo
机构: Hong Kong University of Science and Technology (香港科技大学); Beihang University (北京航空航天大学); Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: project page: this https URL , demo: this https URL
Abstract:This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user’s keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student’s capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: this https URL and this https URL.
zh
[CV-14] FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos
【速读】:该论文旨在解决深度伪造视频检测中跨域泛化能力差的问题,即现有方法依赖特定篡改特征,在面对未知伪造技术时性能显著下降。其关键解决方案是提出FakeRadar框架,通过引入伪造异常点探测(Forgery Outlier Probing)和异常引导三重训练(Outlier-Guided Tri-Training)机制:前者利用预训练模型(如CLIP)识别真实视频、已知伪造与未知篡改之间的分布差异,通过动态子簇建模和簇条件异常样本生成,模拟超出已知伪造类型的新型伪造特征;后者则基于异常驱动的对比学习和异常条件交叉熵损失优化检测器,使其能够有效区分真实、伪造及异常样本,从而提升对新兴伪造技术的鲁棒性。
链接: https://arxiv.org/abs/2512.14601
作者: Zhaolun Li,Jichang Li,Yinqi Cai,Junye Chen,Xiaonan Luo,Guanbin Li,Rushi Lan
机构: Guilin University of Electronic Technology (桂林电子科技大学); Pengcheng Laboratory (鹏城实验室); Sun Yat-sen University (中山大学); Guangxi Key Laboratory of Image and Graphic Intelligent Processing (广西图像图形智能处理重点实验室); Guangdong Key Laboratory of Big Data Analysis and Processing (广东省大数据分析与处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.
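簇条件异常样本生成可以理解为:把样本沿“子簇中心指向样本”的方向外推到估计边界之外,以模拟未知伪造特征。下面是一个假设形式的最小示意(外推系数 beta、子簇数量均为示意,并非论文参数):

```python
import torch

def synthesize_outliers(feats, centers, assign, beta=1.5):
    """簇条件异常样本生成示意(假设形式):沿"簇心->样本"方向
    把样本外推到子簇边界之外。beta>1 控制外推强度。"""
    c = centers[assign]               # (N, D) 每个样本所属子簇的中心
    return c + beta * (feats - c)     # 边界外侧的合成异常点

# 用法:CLIP 特征 + 动态子簇(如 k-means)
feats = torch.randn(512, 768)
centers = torch.randn(8, 768)         # 8 个子簇中心(示意)
assign = torch.randint(0, 8, (512,))
outliers = synthesize_outliers(feats, centers, assign)
# 随后以 真实/伪造/异常 三类监督训练检测器(即三重训练)
```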
zh
[CV-15] TUMTraf EMOT: Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios
【速读】:该论文旨在解决智能交通系统(Intelligent Transportation Systems, ITS)中基于帧的摄像头在低光照和高速运动条件下性能下降的问题。其解决方案的关键在于引入一种面向事件相机(event camera)的初步基准数据集,该数据集覆盖车辆与行人检测与跟踪任务,并构建了一个基于专用特征提取器的“检测后跟踪”(tracking-by-detection)基准,从而有效利用事件相机所具备的低延迟、高动态范围和高时间分辨率优势,提升复杂场景下的多目标跟踪性能。
链接: https://arxiv.org/abs/2512.14595
作者: Mengyu Li,Xingcheng Zhou,Guang Chen,Alois Knoll,Hu Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures
Abstract:In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.
zh
[CV-16] LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction
【速读】:该论文旨在解决多模态生存预测中因病理全切片图像(WSIs)和基因组数据维度高、冗余性强而导致难以提取判别性特征以及跨模态对齐困难的问题,同时指出仅使用简单的生存随访标签不足以有效监督这一复杂任务。其解决方案的关键在于提出一种由大语言模型(LLM)驱动的知识增强多模态模型(KEMM),通过引入两个核心组件:一是由病理学家提供的专家报告经LLM精炼后的临床聚焦诊断语句,可提供潜在的生存差异信息;二是由LLM生成的预后背景知识(PBK),涵盖不同癌症类型的预后信息,从而增强模型的生存预测能力。为有效利用这些知识,论文进一步设计了知识增强的跨模态注意力模块(KECM),该模块能引导网络关注来自冗余模态中的判别性和生存相关特征,显著提升预测性能。
链接: https://arxiv.org/abs/2512.14594
作者: Chenyu Zhao,Yingxue Xu,Fengtao Zhou,Yihui Wang,Hao Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by a large language model (LLM), offer succinct and clinically focused diagnostic statements. This information may typically suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by the LLM, provides valuable prognostic background knowledge on different cancer types, which also enhances survival prediction. To leverage this knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.
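KECM 注意力模块的思想是:用文本知识嵌入作为 Query,在高度冗余的 WSI patch 特征中检索与生存相关的部分。以下 PyTorch 草图仅示意这种交互方式,结构与维度均为本文假设,并非官方实现:

```python
import torch
import torch.nn as nn

class KECMAttention(nn.Module):
    """知识增强跨模态注意力示意:知识嵌入为 Query,patch 特征为 Key/Value。"""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, knowledge, patches):
        # knowledge: (B, K, dim) 报告/PBK 嵌入; patches: (B, N, dim) patch 特征
        out, _ = self.attn(query=knowledge, key=patches, value=patches)
        return out.mean(dim=1)         # (B, dim) 聚合后的生存预测特征

risk_head = nn.Linear(512, 1)          # 输出风险分数(示意)
feat = KECMAttention()(torch.randn(2, 4, 512), torch.randn(2, 3000, 512))
risk = risk_head(feat)
```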
zh
[CV-17] FoodLogAthl-218: Constructing a Real-World Food Image Dataset Using Dietary Management Applications
【速读】:该论文旨在解决当前食物图像分类模型训练数据与真实用户饮食记录之间存在的偏差问题。现有公开数据集多依赖网络爬取图像,其内容和分布难以反映用户实际拍摄的餐食照片特征,导致模型在真实场景下的泛化能力受限。解决方案的关键在于构建一个基于真实用户提交的餐食照片数据集——FoodLogAthl-218,该数据集包含6,925张图像、218类食物及14,349个边界框,并附带丰富的元数据(如用餐时间、匿名用户ID和餐次上下文)。不同于传统“先定义类别再收集图像”的方式,该方法从用户上传的原始照片出发,事后标注标签,从而实现更自然的类内多样性、贴近现实的食物频率分布以及未经修饰的日常餐食图像。此外,论文还提出了两个针对该数据集设计的新任务:增量微调协议和上下文感知分类任务,以更好地利用时间序列和多菜品共存的特性,提升模型在个性化膳食管理中的实用性。
链接: https://arxiv.org/abs/2512.14574
作者: Mitsuki Watanabe,Sosuke Amano,Kiyoharu Aizawa,Yoko Yamakata
机构: The University of Tokyo (东京大学); foo.log Inc. (foo.log 公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Food image classification models are crucial for dietary management applications because they reduce the burden of manual meal logging. However, most publicly available datasets for training such models rely on web-crawled images, which often differ from users’ real-world meal photos. In this work, we present FoodLogAthl-218, a food image dataset constructed from real-world meal records collected through the dietary management application FoodLog Athl. The dataset contains 6,925 images across 218 food categories, with a total of 14,349 bounding boxes. Rich metadata, including meal date and time, anonymized user IDs, and meal-level context, accompany each image. Unlike conventional datasets-where a predefined class set guides web-based image collection-our data begins with user-submitted photos, and labels are applied afterward. This yields greater intra-class diversity, a natural frequency distribution of meal types, and casual, unfiltered images intended for personal use rather than public sharing. In addition to (1) a standard classification benchmark, we introduce two FoodLog-specific tasks: (2) an incremental fine-tuning protocol that follows the temporal stream of users’ logs, and (3) a context-aware classification task where each image contains multiple dishes, and the model must classify each dish by leveraging the overall meal context. We evaluate these tasks using large multimodal models (LMMs). The dataset is publicly available at this https URL.
zh
[CV-18] CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer
【速读】:该论文旨在解决图像检索-based跨视角地理定位(Image Retrieval-based Cross-View Geo-localization, IRCVGL)中因视角差异导致的语义与几何不匹配问题,现有方法通常依赖于鲁棒的全局表征或隐式特征对齐,难以建模关键的空间对应关系。解决方案的关键在于提出一种显式对应感知的特征精炼框架CLNet,其核心由三个可学习且互补的模块构成:神经对应图(Neural Correspondence Map, NCM)通过潜在对应场实现跨视角特征的空间对齐;非线性嵌入转换器(Nonlinear Embedding Converter, NEC)利用基于MLP的变换映射不同视角下的特征;全局特征重校准(Global Feature Recalibration, GFR)模块则根据学习到的空间线索重新加权重要特征通道。CLNet能够联合捕捉高层语义信息与细粒度空间对齐,从而在四个公开基准数据集(CVUSA、CVACT、VIGOR和University-1652)上实现最优性能,并具备更强的可解释性与泛化能力。
链接: https://arxiv.org/abs/2512.14560
作者: Xianwei Cao,Dou Quan,Shuang Wang,Ning Huyan,Wei Wang,Yunan Li,Licheng Jiao
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures
Abstract:Image retrieval-based cross-view geo-localization (IRCVGL) aims to match images captured from significantly different viewpoints, such as satellite and street-level images. Existing methods predominantly rely on learning robust global representations or implicit feature alignment, which often fail to model explicit spatial correspondences crucial for accurate localization. In this work, we propose a novel correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules: a Neural Correspondence Map (NCM) that spatially aligns cross-view features via latent correspondence fields; a Nonlinear Embedding Converter (NEC) that remaps features across perspectives using an MLP-based transformation; and a Global Feature Recalibration (GFR) module that reweights informative feature channels guided by learned spatial cues. The proposed CLNet can jointly capture both high-level semantics and fine-grained alignments. Extensive experiments on four public benchmarks, CVUSA, CVACT, VIGOR, and University-1652, demonstrate that our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.
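三个模块中,GFR(全局特征重校准)最容易用代码说明:依据学习到的线索对特征通道重新加权。下面给出一个 squeeze-and-excitation 风格的假设性示意;论文中的 GFR 还由空间线索引导,此处从简:

```python
import torch
import torch.nn as nn

class GlobalFeatureRecalibration(nn.Module):
    """GFR 模块示意:对通道按重要性重标定(具体结构为本文假设)。"""
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        w = self.gate(x).view(x.size(0), -1, 1, 1)  # 每通道一个门控权重
        return x * w

x = torch.randn(2, 256, 32, 32)
print(GlobalFeatureRecalibration()(x).shape)       # torch.Size([2, 256, 32, 32])
```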
zh
[CV-19] TAT: Task-Adaptive Transformer for All-in-One Medical Image Restoration MICCAI2025
【速读】:该论文旨在解决多任务医学图像恢复(Medical Image Restoration, MedIR)中因模态和退化类型差异导致的两个关键问题:任务干扰(task interference)和任务不平衡(task imbalance)。任务干扰表现为不同任务在共享参数上产生冲突的梯度更新方向,而任务不平衡则源于各任务学习难度不均所引发的优化失衡。解决方案的核心在于提出一种任务自适应Transformer(Task-adaptive Transformer, TAT),其关键创新包括:一是引入任务自适应权重生成策略,为每个任务动态生成特定的权重参数,从而消除共享参数上的梯度冲突;二是设计任务自适应损失平衡策略,根据各任务的学习难度动态调整损失权重,避免某些任务主导训练过程或被忽略。实验表明,TAT在PET合成、CT去噪和MRI超分辨率三个MedIR任务中均达到当前最优性能。
链接: https://arxiv.org/abs/2512.14550
作者: Zhiwen Yang,Jiaju Zhang,Yang Yi,Jian Liang,Bingzheng Wei,Yan Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by MICCAI 2025
Abstract:Medical image restoration (MedIR) aims to recover high-quality medical images from their low-quality counterparts. Recent advancements in MedIR have focused on All-in-One models capable of simultaneously addressing multiple different MedIR tasks. However, due to significant differences in both modality and degradation types, using a shared model for these diverse tasks requires careful consideration of two critical inter-task relationships: task interference, which occurs when conflicting gradient update directions arise across tasks on the same parameter, and task imbalance, which refers to uneven optimization caused by varying learning difficulties inherent to each task. To address these challenges, we propose a task-adaptive Transformer (TAT), a novel framework that dynamically adapts to different tasks through two key innovations. First, a task-adaptive weight generation strategy is introduced to mitigate task interference by generating task-specific weight parameters for each task, thereby eliminating potential gradient conflicts on shared weight parameters. Second, a task-adaptive loss balancing strategy is introduced to dynamically adjust loss weights based on task-specific learning difficulties, preventing task domination or undertraining. Extensive experiments demonstrate that our proposed TAT achieves state-of-the-art performance in three MedIR tasks (PET synthesis, CT denoising, and MRI super-resolution), both in task-specific and All-in-One settings. Code is available at this https URL.
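任务自适应损失平衡的一种常见做法是:用各任务损失的滑动平均估计学习难度,难度越大权重越高。以下示意用 softmax 归一化权重,具体公式为本文假设,并非论文给出的形式:

```python
import torch

def balance_losses(losses, ema, momentum=0.9, temp=1.0):
    """任务自适应损失平衡示意(假设形式):以损失滑动平均衡量难度,
    难度越大权重越高,避免某一任务主导训练或欠训练。"""
    with torch.no_grad():
        for k, v in losses.items():
            ema[k] = momentum * ema.get(k, v.item()) + (1 - momentum) * v.item()
        diffs = torch.tensor([ema[k] for k in losses])
        weights = torch.softmax(diffs / temp, dim=0)   # 难度 -> 权重
    total = sum(w * losses[k] for w, k in zip(weights, losses))
    return total, ema

losses = {"pet": torch.tensor(0.8, requires_grad=True),
          "ct": torch.tensor(0.3, requires_grad=True),
          "mri": torch.tensor(0.5, requires_grad=True)}
total, ema = balance_losses(losses, ema={})
total.backward()
```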
zh
[CV-20] HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion CVPR2025
【速读】:该论文旨在解决当前基于扩散模型的身份保留人像生成(identity-preserved portrait generation, IPG)方法在使用同一身份的多张参考图像时,生成人像质量较低且难以精确控制面部属性的问题。其解决方案的关键在于提出HiFi-Portrait框架,通过引入人脸精修模块(face refiner)和关键点生成器(landmark generator)以获取细粒度的多脸特征与3D感知的人脸关键点(包含参考身份和目标属性信息),并设计HiFi-Net来融合多脸特征并与关键点对齐,从而提升身份保真度和面部可控性;此外,还构建了一个基于身份的自动化数据集构建流程用于训练,使方法在零样本场景下仍能实现高质量、高可控性的生成效果。
链接: https://arxiv.org/abs/2512.14542
作者: Yifang Xu,Benxiang Zhai,Yunzhuo Sun,Ming Li,Yang Li,Sidan Du
机构: Nanjing University (南京大学); Dalian University of Technology (大连理工大学); Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
Abstract:Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.
zh
[CV-21] CAPRMIL: Context-Aware Patch Representations for Multiple Instance Learning
【速读】:该论文旨在解决在数字病理学中,由于全切片图像(Whole Slide Images, WSI)的巨像素尺度和像素级标注稀缺,导致基于弱监督的深度学习模型训练困难的问题。现有主流方法多采用多实例学习(Multiple Instance Learning, MIL)框架进行滑片级别建模,但其依赖复杂的注意力机制进行实例聚合,计算开销大且参数冗余。解决方案的关键在于提出一种新型的、与聚合器无关的框架CAPRMIL:通过冻结的局部特征提取器生成丰富的上下文感知补丁嵌入(context-aware patch embeddings),并将其投影到少量全局语境/形态感知令牌(global context/morphology-aware tokens)上,再利用多头自注意力机制注入全局信息,从而以线性复杂度实现高效相关性学习。此设计将相关性学习从聚合模块中解耦,显著降低模型参数量(减少48%-92.8%)、推理FLOPs(降低52%-99%),同时保持甚至超越当前最优MIL方法的性能,展现出更高的可扩展性和计算效率。
链接: https://arxiv.org/abs/2512.14540
作者: Andreas Lolos,Theofilos Christodoulou,Aris L. Moustakas,Stergios Christodoulidis,Maria Vakalopoulou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 12 Figures, 4 Tables
Abstract:In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by proceedings in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features – extracted using a frozen patch encoder – into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at this https URL
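CAPRMIL 之所以对 bag 大小保持线性复杂度,是因为 N 个 patch 只与 K 个全局 token(K 远小于 N)做交叉注意力,token 之间才做自注意力。下面的草图演示这一结构,模块组织方式与维度均为本文假设:

```python
import torch
import torch.nn as nn

class ContextTokenizer(nn.Module):
    """语境 token 示意:patch -> K 个全局 token -> 回注 patch,复杂度对 N 线性。"""
    def __init__(self, dim=512, num_tokens=16, heads=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches):                      # (B, N, dim)
        B = patches.size(0)
        q = self.tokens.expand(B, -1, -1)            # (B, K, dim)
        ctx, _ = self.cross(q, patches, patches)     # O(N*K):patch -> token
        ctx, _ = self.mhsa(ctx, ctx, ctx)            # O(K^2):token 间交互
        back, _ = self.cross(patches, ctx, ctx)      # O(N*K):token -> patch
        return patches + back                        # 语境感知的 patch 嵌入

out = ContextTokenizer()(torch.randn(1, 10000, 512))
print(out.shape)   # 之后接简单的 Mean MIL 聚合即可
```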
zh
[CV-22] DASP: Self-supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors
【速读】:该论文旨在解决夜间单目深度估计(monocular depth estimation)性能显著下降的问题,其核心挑战在于低光照导致纹理缺失以及动态物体引起的模糊区域。解决方案的关键在于提出一种名为DASP的自监督框架,该框架通过引入时空先验(spatiotemporal priors)来增强模型对夜间场景的理解能力。具体而言,DASP包含对抗分支与自监督分支:对抗分支设计了四个时空先验学习块(SPLB),其中融合了基于正交差分的时空学习模块(STLM)和带全局轴向注意力的局部非对称卷积模块(ASLM),用于提取运动相关变化与多尺度结构信息;自监督分支则提出3D一致性投影损失(3D consistency projection loss),将目标帧与源帧投影至共享3D空间并计算其3D差异,从而优化三维结构一致性和白天先验。这一机制有效提升了在纹理缺失和动态模糊区域的深度估计精度。
链接: https://arxiv.org/abs/2512.14536
作者: Yiheng Huang,Junhong Chen,Anqi Ning,Zhanhong Liang,Nick Michiels,Luc Claesen,Wenyin Liu
机构: Guangdong University of Technology (广东工业大学); Hasselt University (哈塞尔特大学); Flanders Make (弗拉芒制造); Shantou University (汕头大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures
Abstract:Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.
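3D 一致性投影损失的核心一步是用相机内参把深度图反投影到 3D 空间:P = D * K^{-1} [u, v, 1]^T。下面给出一个忽略位姿变换与遮挡处理的最小示意(L1 差异形式为本文假设):

```python
import torch

def backproject(depth, K_inv):
    """用相机内参把深度图反投影为 3D 点云:P = D * K^{-1} [u, v, 1]^T。"""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)
    return (K_inv @ pix) * depth.reshape(1, -1)      # (3, H*W)

def consistency_3d_loss(depth_t, depth_s, K_inv):
    """3D 一致性投影损失示意:两帧投影到共享 3D 空间后取 L1 差异;
    实际实现还需相机位姿变换与遮挡处理,此处省略。"""
    return (backproject(depth_t, K_inv) - backproject(depth_s, K_inv)).abs().mean()

K = torch.tensor([[500., 0., 64.], [0., 500., 64.], [0., 0., 1.]])
loss = consistency_3d_loss(torch.rand(128, 128) + 1,
                           torch.rand(128, 128) + 1, torch.inverse(K))
```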
zh
[CV-23] Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency
【速读】:该论文旨在解决当前视网膜基础模型受限于精心筛选的研究数据集、缺乏真实临床场景背景,且需针对每个应用场景进行大量任务特定优化的问题,从而限制了其在资源匮乏环境中的部署效率。解决方案的关键在于直接从真实世界医疗实践中提取“临床原生智能(clinical native intelligence)”,即利用大规模远程会诊项目中积累的485,980张彩色眼底照片及其对应诊断报告的自然对齐关系进行训练,构建出无需额外标注即可泛化到多种临床场景的视网膜基础模型ReVision。该方法显著提升了模型在零样本疾病检测和最小适应条件下的性能表现,并实现了跨机构、跨模态和系统性健康预测任务的有效迁移。
链接: https://arxiv.org/abs/2512.14499
作者: Jia Guo,Jiawei Du,Shengzhu Yang,Shuai Lu,Wenquan Cheng,Kaiwen Zhang,Yihua Sun,Chuhong Yang,Weihang Zhang,Fang Chen,Yilan Wu,Lie Ju,Guochen Ning,Longfei Ma,Huiping Yao,Jinyuan Wang,Peilun Shi,Yukun Zhou,Jie Xu,Pearse A. Keane,Hanruo Liu,Hongen Liao,Ningli Wang,Huiqi Li
机构: Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学); Capital Medical University (首都医科大学); Shanghai Jiaotong University (上海交通大学); University College London (伦敦大学学院); The Chinese University of Hong Kong (香港中文大学); Ruijin Hospital, Shanghai Jiao Tong University School of Medicine (上海交通大学医学院瑞金医院); Henan Provincial People’s Hospital (河南省人民医院); Henan Academy of Innovations in Medical Science (河南省医学科学院); Beijing Tongren Hospital (北京同仁医院); Beijing Visual Science and Translational Eye Research Institute (北京视觉科学与转化眼科研究所); Institute of Ophthalmology (眼科研究所); NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust (英国国家卫生研究院 Moorfields 眼科医院生物医学研究中心); UCL Hawkes Institute (UCL 霍克斯研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVision enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision’s zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.
zh
[CV-24] SignIT: A Comprehensive Dataset and Multimodal Analysis for Italian Sign Language Recognition
【速读】:该论文旨在解决意大利手语(Italian Sign Language, LIS)识别任务中的数据稀缺与模型性能瓶颈问题。其解决方案的关键在于构建了一个大规模、精细化标注的LIS识别基准数据集SignIT,包含644个视频(共计3.33小时),涵盖94个独立的手势类别(分为动物、食物、颜色、情绪和家庭五大类),并同步提取了用户的手部、面部和身体二维关键点(2D keypoints)。通过该数据集,作者系统评估了多种前沿模型在融合RGB帧与时空关键点信息时的表现,揭示了当前模型在处理复杂LIS场景时的局限性,为后续研究提供了可复现的基准与改进方向。
链接: https://arxiv.org/abs/2512.14489
作者: Alessia Micieli,Giovanni Maria Farinella,Francesco Ragusa
机构: LIVE@IPLab, Department of Mathematics and Computer Science - University of Catania, Italy; Next Vision s.r.l., Spin-off of the University of Catania, Italy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this work we present SignIT, a new dataset to study the task of Italian Sign Language (LIS) recognition. The dataset is composed of 644 videos covering 3.33 hours. We manually annotated videos considering a taxonomy of 94 distinct sign classes belonging to 5 macro-categories: Animals, Food, Colors, Emotions and Family. We also extracted 2D keypoints related to the hands, face and body of the users. With the dataset, we propose a benchmark for the sign recognition task, adopting several state-of-the-art models and showing how temporal information, 2D keypoints and RGB frames can influence the performance of these models. Results show the limitations of these models on this challenging LIS dataset. We release data and annotations at the following link: this https URL.
zh
[CV-25] SuperCLIP: CLIP with Simple Classification Supervision NEURIPS2025
【速读】:该论文旨在解决对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)模型在处理长且详细的文本描述时,对文本中细粒度语义信息利用不足的问题。这一局限源于CLIP仅优化全局图像-文本相似性,缺乏对词元级别(token-level)的监督信号,从而限制了其在视觉-文本对齐上的精细程度。解决方案的关键在于提出SuperCLIP框架,通过在视觉编码器中添加一个轻量级线性层,引入基于分类的监督机制,以充分利用文本中的词元级线索来增强视觉-文本对齐能力;该方法仅增加0.077%的总浮点运算次数(FLOPs),且无需额外标注数据,即可显著提升零样本分类、图像-文本检索以及纯视觉任务的性能,并缓解小批量训练下的性能下降问题。
链接: https://arxiv.org/abs/2512.14480
作者: Weiheng Zhao,Zilong Huang,Jiashi Feng,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025. Code: this https URL
Abstract:Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP’s training objective, which optimizes only global image-text similarity and overlooks token-level supervision - limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment - with just a 0.077% increase in total FLOPs, and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP’s ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP’s small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.
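SuperCLIP 只给视觉编码器加一个线性层,用词元级多标签分类补充 CLIP 的全局对比目标。下面的草图把两种损失放在一起;token_targets 的构造方式(“词 j 是否出现在图 i 的描述中”)是本文的假设性简化:

```python
import torch
import torch.nn.functional as F

def superclip_loss(img_emb, txt_emb, token_logits, token_targets, tau=0.07):
    """SuperCLIP 目标示意:标准 CLIP 对比损失 + 词元级多标签分类损失。
    token_logits 来自视觉编码器上新增的线性层(即唯一新增参数)。"""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau
    labels = torch.arange(len(logits))
    clip_loss = (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.t(), labels)) / 2
    cls_loss = F.binary_cross_entropy_with_logits(token_logits, token_targets)
    return clip_loss + cls_loss

B, D, V = 8, 512, 1000                    # 批大小/嵌入维度/词表大小(示意)
loss = superclip_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, V), torch.randint(0, 2, (B, V)).float())
```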
zh
[CV-26] TACK Tunnel Data (TTD): A Benchmark Dataset for Deep Learning-Based Defect Detection in Tunnels
【速读】:该论文旨在解决隧道结构自动化视觉检测中因缺乏高质量标注数据而导致的深度学习(Deep Learning, DL)模型性能受限问题。其关键解决方案是构建并公开发布一个包含三种不同类型隧道衬砌的标注图像数据集,涵盖典型缺陷如裂缝(cracks)、析盐(leaching)和渗水(water infiltration),该数据集支持监督、半监督及无监督学习方法,并具备纹理与施工技术多样性,从而促进模型在不同隧道类型间的泛化能力与迁移性能研究,填补了隧道领域专用数据集的空白,推动自动化巡检技术的发展与应用。
链接: https://arxiv.org/abs/2512.14477
作者: Andreas Sjölander,Valeria Belloni,Robel Fekadu,Andrea Nascetti
机构: KTH Royal Institute of Technology (皇家理工学院); Sapienza University of Rome (罗马大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Tunnels are essential elements of transportation infrastructure, but are increasingly affected by ageing and deterioration mechanisms such as cracking. Regular inspections are required to ensure their safety, yet traditional manual procedures are time-consuming, subjective, and costly. Recent advances in mobile mapping systems and Deep Learning (DL) enable automated visual inspections. However, their effectiveness is limited by the scarcity of tunnel datasets. This paper introduces a new publicly available dataset containing annotated images of three different tunnel linings, capturing typical defects: cracks, leaching, and water infiltration. The dataset is designed to support supervised, semi-supervised, and unsupervised DL methods for defect detection and segmentation. Its diversity in texture and construction techniques also enables investigation of model generalization and transferability across tunnel types. By addressing the critical lack of domain-specific data, this dataset contributes to advancing automated tunnel inspection and promoting safer, more efficient infrastructure maintenance strategies.
zh
[CV-27] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning
【速读】:该论文旨在解决传统端到端 affordance prediction 模型在面对新物体和未见环境时泛化能力差的问题,其核心挑战在于现有方法将高层推理与低层定位耦合在一个单一管道中,并依赖大量标注数据进行训练。解决方案的关键是提出 A4-Agent,一个无需训练的代理式框架,通过三个阶段解耦任务:首先由 Dreamer 利用生成式 AI (Generative AI) 可视化交互过程;其次 Thinker 基于大视觉语言模型判断应交互的物体部位;最后 Spotter 使用视觉基础模型精确定位交互区域。该框架在测试时协调多个预训练模型,无需任务特定微调即可实现零样本性能显著优于当前最优监督方法,并展现出强鲁棒性与真实场景适应能力。
链接: https://arxiv.org/abs/2512.14442
作者: Zixin Zhang,Kanghao Chen,Hanqing Wang,Hongfei Zhang,Harold Haodong Chen,Chenfei Liao,Litao Guo,Ying-Cong Chen
机构: HKUST(GZ); HKUST; SJTU; Knowin
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
zh
[CV-28] S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation
【速读】:该论文旨在解决当前无监督视频实例分割(unsupervised video instance segmentation)方法严重依赖合成视频数据所带来的局限性问题,尤其是合成数据在建模真实视频中复杂运动(如视角变化、物体局部运动或相机运动)时的不足。其解决方案的关键在于:首先从单帧无监督实例分割结果出发,利用深度运动先验识别高质量的“关键mask”(keymasks),构建稀疏伪标注;随后通过提出的Sparse-To-Dense Distillation(稀疏到稠密蒸馏)方法结合Temporal DropLoss(时间丢弃损失),训练一个能够隐式传播掩码的分割模型,最终在密集标签集上训练得到性能优于现有最先进方法的模型。
链接: https://arxiv.org/abs/2512.14440
作者: Leon Sick,Lukas Hoyer,Dominik Engel,Pedro Hermosilla,Timo Ropinski
机构: Ulm University (乌尔姆大学); Google (谷歌); KAUST (国王阿卜杜拉科技大学); TU Vienna (维也纳工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page with Code/Models/Demo: this https URL
Abstract:In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense label set, our approach outperforms the current state-of-the-art across various benchmarks.
zh
[CV-29] VICTOR: Dataset Copyright Auditing in Video Recognition Systems NDSS
【速读】:该论文旨在解决视频识别系统中数据集版权审计(dataset copyright auditing)的空白问题,即如何有效检测和追踪视频数据集在未经授权情况下的使用。现有方法主要集中在图像领域,而视频数据因引入了时间维度(temporal dimension),导致传统方法在效果和隐蔽性上均面临挑战。解决方案的关键在于提出VICTOR——首个面向视频识别系统的版权审计框架,其核心创新是一种通用且隐蔽的样本修改策略:仅对少量样本(如1%)进行微小改动,即可显著放大目标模型对已修改样本与原始未修改样本预测行为的差异,从而通过模型输出差异作为版权审计的判别依据。该方法在多个模型和数据集上验证了有效性,并具备对训练视频或模型扰动的鲁棒性。
链接: https://arxiv.org/abs/2512.14439
作者: Quan Yuan,Zhikun Zhang,Linkang Du,Min Chen,Mingyang Sun,Yunjun Gao,Shibo He,Jiming Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in the NDSS Symposium 2026, February 2026, San Diego, CA, USA
Abstract:Video recognition systems are increasingly being deployed in daily life, such as content recommendation and security monitoring. To enhance video recognition development, many institutions have released high-quality public datasets with open-source licenses for training advanced models. At the same time, these datasets are also susceptible to misuse and infringement. Dataset copyright auditing is an effective solution to identify such unauthorized use. However, existing dataset copyright solutions primarily focus on the image domain; the complex nature of video data leaves dataset copyright auditing in the video domain unexplored. Specifically, video data introduces an additional temporal dimension, which poses significant challenges to the effectiveness and stealthiness of existing methods. In this paper, we propose VICTOR, the first dataset copyright auditing approach for video recognition systems. We develop a general and stealthy sample modification strategy that enhances the output discrepancy of the target model. By modifying only a small proportion of samples (e.g., 1%), VICTOR amplifies the impact of published modified samples on the prediction behavior of the target models. Then, the difference in the model’s behavior for published modified and unpublished original samples can serve as a key basis for dataset auditing. Extensive experiments on multiple models and datasets highlight the superiority of VICTOR. Finally, we show that VICTOR is robust in the presence of several perturbation mechanisms to the training videos or the target models.
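审计阶段的判定逻辑可以抽象为一次假设检验:若目标模型对“已发布的修改样本”的输出置信度显著高于“未发布的原始样本”,则判定数据集疑似被用于训练。以下示意采用 Welch t 检验,检验形式、阈值与模拟数据均为本文假设:

```python
import numpy as np
from scipy import stats

def audit_dataset(conf_published, conf_unpublished, alpha=0.05):
    """版权审计判定示意(假设为单侧检验):比较两组样本上的模型置信度。"""
    t, p = stats.ttest_ind(conf_published, conf_unpublished,
                           equal_var=False, alternative="greater")
    return p < alpha, p

# 示意数据:目标模型在两组样本上的预测置信度
pub = np.random.beta(8, 2, size=200)      # 已发布(修改过)样本
unpub = np.random.beta(5, 5, size=200)    # 未发布(原始)样本
flagged, p = audit_dataset(pub, unpub)
print(f"疑似侵权: {flagged}, p={p:.4f}")
```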
zh
[CV-30] Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging
【速读】:该论文旨在解决传统插件式(plug-and-play, PnP)压缩成像方法在高度欠定条件下重建性能不佳的问题,其核心在于现有图像去噪器依赖于通用或手工设计的先验,难以准确刻画自然图像复杂的统计结构。解决方案的关键是利用基于得分的生成模型(score-based generative models)与经验贝叶斯去噪之间的紧密联系,提出一种融合最小均方误差(MMSE)去噪器的消息传递框架——得分Turbo消息传递(score-based turbo message passing, STMP),从而将得分模型的强大表达能力与消息传递算法的快速收敛特性相结合。此外,针对量化测量场景,进一步引入分量级MMSE去量化模块,形成量化STMP(Q-STMP),并借助状态演化(state-evolution, SE)方程实现渐近性能预测,实验表明该方法在FFHQ数据集上显著优于现有基线,并在1比特量化下仍保持鲁棒性。
链接: https://arxiv.org/abs/2512.14435
作者: Chang Cai,Hao Jiang,Xiaojun Yuan,Ying-Jun Angela Zhang
机构: The University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Message-passing algorithms have been adapted for compressive imaging by incorporating various off-the-shelf image denoisers. However, these denoisers rely largely on generic or hand-crafted priors and often fall short in accurately capturing the complex statistical structure of natural images. As a result, traditional plug-and-play (PnP) methods often lead to suboptimal reconstruction, especially in highly underdetermined regimes. Recently, score-based generative models have emerged as a powerful framework for accurately characterizing sophisticated image distribution. Yet, their direct use for posterior sampling typically incurs prohibitive computational complexity. In this paper, by exploiting the close connection between score-based generative modeling and empirical Bayes denoising, we devise a message-passing framework that integrates a score-based minimum mean-squared error (MMSE) denoiser for compressive image recovery. The resulting algorithm, named score-based turbo message passing (STMP), combines the fast convergence of message passing with the expressive power of score-based generative priors. For practical systems with quantized measurements, we further propose quantized STMP (Q-STMP), which augments STMP with a component-wise MMSE dequantization module. We demonstrate that the asymptotic performance of STMP and Q-STMP can be accurately predicted by a set of state-evolution (SE) equations. Experiments on the FFHQ dataset demonstrate that STMP strikes a significantly better performance-complexity tradeoff compared with competing baselines, and that Q-STMP remains robust even under 1-bit quantization. Remarkably, both STMP and Q-STMP typically converge within 10 iterations.
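摘要提到的“得分生成模型与经验贝叶斯去噪的紧密联系”即 Tweedie 公式:对 y = x + N(0, sigma^2 I),有 E[x|y] = y + sigma^2 * grad_y log p(y)。下面的草图把它嵌入一次消息传递迭代;score_fn 的接口、外信息更新方式均为示意,玩具得分函数对应标准高斯先验(此时公式精确成立):

```python
import torch

def mmse_denoise(y, sigma, score_fn):
    """经验贝叶斯(Tweedie 公式):后验均值 E[x|y] = y + sigma^2 * score(y)。
    score_fn 即预训练得分模型 s_theta(y, sigma),此处为假设的接口。"""
    return y + sigma ** 2 * score_fn(y, sigma)

def stmp_step(ext_mean, ext_var, score_fn):
    """消息传递外层示意:把观测模块的外信息交给得分去噪器。"""
    sigma = ext_var.sqrt()
    post_mean = mmse_denoise(ext_mean, sigma, score_fn)
    return post_mean                       # 实际算法还需更新方差与外信息

# 玩具得分:若先验 x ~ N(0, I),则 y ~ N(0, (1+sigma^2) I),score = -y/(1+sigma^2)
fake_score = lambda y, s: -y / (s ** 2 + 1.0)
x_hat = stmp_step(torch.randn(3, 64, 64), torch.tensor(0.25), fake_score)
```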
zh
[CV-31] The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
【速读】:该论文旨在解决基于大规模扩散模型的无训练图像编辑中,复杂非刚性变换(如姿态或形状变化)难以忠实实现的问题。其关键在于识别出现有注意力共享机制中存在的“注意力坍塌”现象,即位置嵌入(positional embeddings)或语义特征(semantic features)在视觉内容检索中占据主导地位,导致过编辑或欠编辑。为此,作者提出SynPS方法,通过协同利用位置信息与语义信息,设计了一个动态调节位置嵌入影响的注意力协同(attention synergy)管道,并引入一种量化每个去噪步骤所需编辑强度的测量指标,从而自适应地平衡语义修改与图像保真度,有效避免过编辑和欠编辑问题。
链接: https://arxiv.org/abs/2512.14423
作者: Zhuo Chen,Fanyue Wei,Runze Xu,Jingjing Li,Lixin Duan,Angela Yao,Wen Li
机构: University of Electronic Science and Technology of China (电子科技大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.
zh
[CV-32] LCMem: A Universal Model for Robust Image Memorization Detection
【速读】:该论文旨在解决生成式图像模型在隐私保护数据共享中的记忆检测问题,当前方法存在可靠性不足、量化评估有限以及跨域泛化能力差等缺陷。其解决方案的关键在于将记忆检测统一建模为再识别(re-identification)与复制检测(copy detection)的联合任务,提出Latent Contrastive Memorization Network (LCMem),通过两阶段训练策略先学习身份一致性,再引入对增强鲁棒的复制检测机制,从而在六个基准数据集上分别实现最高16个百分点和30个百分点的性能提升,显著提升了跨域场景下记忆检测的可靠性和可扩展性。
链接: https://arxiv.org/abs/2512.14421
作者: Mischa Dombrowski,Felix Nützel,Bernhard Kainz
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希·亚历山大大学); Imperial College London (伦敦帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model is publicly available at this https URL.
zh
[CV-33] DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning AAAI2026
【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在跨域场景下图像描述评估(image caption evaluation)时鲁棒性不足的问题。现有方法在面对分布偏移(domain-shift)时,其评估分数与人类判断的对齐度显著下降。解决方案的关键在于提出一种无需微调(fine-tuning-free)的测试时自适应评估方法——Distribution-Aware Score Decoder (DISCODE),其核心创新是引入了基于高斯先验分布的自适应测试时损失(Adaptive Test-Time, ATT loss),并通过推导出的解析解在测试阶段高效最小化该损失,从而提升评估分数的鲁棒性和与人类评价的一致性。
链接: https://arxiv.org/abs/2512.14420
作者: Nakamasa Inoue,Kanoko Goto,Masanari Oi,Martyna Gruszka,Mahiro Ukai,Takumi Hirose,Yusuke Sekikawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Paper accepted to AAAI 2026
Abstract:Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
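论文称 ATT 损失在测试时有解析解。若假设它是“模型打分项 + 高斯先验项”的二次形式,解析解就是精度加权平均,效果是把 LVLM 的原始打分向先验均值收缩。以下纯属本文对该形式的假设性示意,并非论文给出的推导:

```python
def att_score(s_model, mu_prior=3.0, var_model=1.0, var_prior=0.5):
    """ATT 解析解示意(假设为二次形式):
    min_s (s - s_model)^2 / var_model + (s - mu_prior)^2 / var_prior
    的闭式解是精度加权平均,即高斯先验把模型打分向先验均值收缩。"""
    w_m, w_p = 1.0 / var_model, 1.0 / var_prior
    return (w_m * s_model + w_p * mu_prior) / (w_m + w_p)

# 用法:LVLM 原始打分 4.6,先验均值 3.0 -> 收缩后的鲁棒评分
print(att_score(4.6))   # 约 3.53
```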
zh
[CV-34] Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos
【速读】:该论文旨在解决动态神经辐射场(Dynamic Neural Radiance Fields, Dynamic NeRF)在大角度视角偏移下生成不稳定、不真实图像的问题。其核心解决方案是提出Expanded Dynamic NeRF (ExpanDyNeRF),该方法通过引入高斯点绘制(Gaussian splatting)先验和伪真值(pseudo-ground-truth)生成策略,优化密度与颜色特征以提升从挑战性视角下的场景重建质量。此外,作者还构建了首个具有显式侧视监督的合成动态多视角数据集SynDM,用于验证模型在极端视角变化下的渲染保真度显著优于现有动态NeRF方法。
链接: https://arxiv.org/abs/2512.14406
作者: Le Jiang,Shaotong Zhu,Yedi Luo,Shayda Moezzi,Sarah Ostadabbas
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision, created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.
zh
[CV-35] EcoScapes: LLM -Powered Advice for Crafting Sustainable Cities
【速读】:该论文旨在解决小城市在制定气候适应策略时面临的两大挑战:一是人力资源有限,难以开展复杂的分析工作;二是缺乏有效整合多源数据(如气象、地理和社会经济数据)的能力,从而阻碍了全面的风险评估与应对规划。解决方案的关键在于构建一个多层次系统,融合专用大语言模型(Large Language Models, LLMs)、卫星遥感影像分析技术以及结构化知识库,实现对多源异构数据的自动化处理与智能推理,从而提升小城市在气候适应决策中的效率与科学性。
链接: https://arxiv.org/abs/2512.14373
作者: Martin Röhn,Nora Gourmelon,Vincent Christlein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Climate adaptation is vital for the sustainability and sometimes the mere survival of our urban areas. However, small cities often struggle with limited personnel resources and integrating vast amounts of data from multiple sources for a comprehensive analysis. To overcome these challenges, this paper proposes a multi-layered system combining specialized LLMs, satellite imagery analysis and a knowledge base to aid in developing effective climate adaptation strategies. The corresponding code can be found at this https URL.
[CV-36] A Comprehensive Safety Metric to Evaluate Perception in Autonomous Systems ITSC2020
【Quick Read】: When evaluating object detection for autonomous vehicles, existing metrics ignore the differentiated risk implied by object attributes such as velocity, orientation, distance, size, and the potential damage of a collision. The key contribution is a new safety metric that folds all of these parameters into a single, easily interpretable safety score, giving a more complete picture of how safe an object-perception system actually is. The metric is evaluated on both real-world and virtual datasets and compared against state-of-the-art metrics.
Link: https://arxiv.org/abs/2512.14367
Authors: Georg Volk, Jörg Gamerdinger, Alexander von Bernuth, Oliver Bringmann
Institutions: University of Tübingen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at IEEE ITSC 2020
Abstract:Complete perception of the environment and its correct interpretation is crucial for autonomous vehicles. Object perception is the main component of automotive surround sensing. Various metrics already exist for the evaluation of object perception. However, objects can be of different importance depending on their velocity, orientation, distance, size, or the potential damage that could be caused by a collision due to a missed detection. Thus, these additional parameters have to be considered for safety evaluation. We propose a new safety metric that incorporates all these parameters and returns a single easily interpretable safety assessment score for object perception. This new metric is evaluated with both real world and virtual data sets and compared to state of the art metrics.
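The abstract lists the parameters the metric incorporates (velocity, orientation, distance, size, collision severity) but not the exact formula. The snippet below is only a schematic of the general recipe, mapping each factor to [0, 1] and collapsing them into one interpretable score; all weights and normalizing constants are invented for illustration.

```python
def object_safety_score(velocity, distance, size, heading_alignment, severity,
                        weights=(0.3, 0.25, 0.15, 0.1, 0.2)):
    """Toy single-score safety assessment for one perceived object.

    Each factor is mapped to [0, 1] where 1 = most safety-critical; the
    weighted sum is the single interpretable scalar the paper aims for.
    All constants below are illustrative assumptions, not the paper's values.
    """
    risk_velocity = min(velocity / 30.0, 1.0)        # fast objects matter more
    risk_distance = max(0.0, 1.0 - distance / 50.0)  # near objects matter more
    risk_size = min(size / 10.0, 1.0)                # large objects matter more
    risk_heading = heading_alignment                 # 1 if heading toward ego
    factors = (risk_velocity, risk_distance, risk_size, risk_heading, severity)
    return sum(w * f for w, f in zip(weights, factors))

# A nearby, fast pedestrian-sized object heading toward the ego vehicle:
print(object_safety_score(velocity=12.0, distance=8.0, size=2.0,
                          heading_alignment=0.9, severity=0.8))
```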
[CV-37] Optimizing Rank for High-Fidelity Implicit Neural Representations
【Quick Read】: This paper targets the difficulty of implicit neural representations (INRs) built on plain MLPs in modeling high-frequency content. The conventional view attributes this to an intrinsic low-frequency bias of the MLP architecture; the paper argues instead that it is a symptom of unstable rank degradation during training. The key is to use optimizer design (e.g., Muon) to apply high-rank, near-orthogonal parameter updates that stably preserve the network's expressive capacity, substantially improving high-frequency reconstruction fidelity without elaborate architectural interventions, with gains across natural images, medical imaging, and novel view synthesis of up to 9 dB PSNR.
Link: https://arxiv.org/abs/2512.14366
Authors: Julian McGinnis, Florian A. Hölzl, Suprosanna Shit, Florentin Bieder, Paul Friedrich, Mark Mühlau, Björn Menze, Daniel Rueckert, Benedikt Wiestler
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Implicit Neural Representations (INRs) based on vanilla Multi-Layer Perceptrons (MLPs) are widely believed to be incapable of representing high-frequency content. This has directed research efforts towards architectural interventions, such as coordinate embeddings or specialized activation functions, to represent high-frequency signals. In this paper, we challenge the notion that the low-frequency bias of vanilla MLPs is an intrinsic, architectural limitation to learn high-frequency content, but instead a symptom of stable rank degradation during training. We empirically demonstrate that regulating the network’s rank during training substantially improves the fidelity of the learned signal, rendering even simple MLP architectures expressive. Extensive experiments show that using optimizers like Muon, with high-rank, near-orthogonal updates, consistently enhances INR architectures even beyond simple ReLU MLPs. These substantial improvements hold across a diverse range of domains, including natural and medical images, and novel view synthesis, with up to 9 dB PSNR improvements over the previous state-of-the-art. Our project page, which includes code and experimental results, is available at: (this https URL).
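The paper's diagnosis centers on stable rank degradation during training. A common proxy is the stable rank ||W||_F^2 / ||W||_2^2 of each weight matrix; the monitor below is a generic sketch for tracking it during INR training, not the authors' code.

```python
import torch

def stable_rank(weight: torch.Tensor) -> float:
    """Stable rank = squared Frobenius norm / squared spectral norm."""
    w = weight.detach().flatten(1) if weight.dim() > 2 else weight.detach()
    fro2 = (w ** 2).sum()
    spec = torch.linalg.matrix_norm(w, ord=2)  # largest singular value
    return (fro2 / spec ** 2).item()

# A toy coordinate MLP (2D input -> RGB); log stable ranks across training.
mlp = torch.nn.Sequential(torch.nn.Linear(2, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 3))
for name, p in mlp.named_parameters():
    if p.dim() >= 2:
        print(name, round(stable_rank(p), 2))  # track this every few steps
```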
[CV-38] Unified Semantic Transformer for 3D Scene Understanding
【Quick Read】: This paper addresses the fragmentation of 3D scene understanding, where existing models are mostly task-specific and struggle to handle diverse 3D semantic tasks in a unified way. The key is UNITE, a unified semantic transformer that completes a wide range of semantic tasks end-to-end in a single feed-forward network, including 3D scene segmentation, instance embeddings, open-vocabulary features, and affordance and articulation prediction, requiring only RGB images as input and achieving efficient, accurate inference via 2D distillation and novel multi-view consistency losses.
Link: https://arxiv.org/abs/2512.14364
Authors: Sebastian Koch, Johanna Wald, Hide Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari
Institutions: University Ulm; Google; TU Vienna; TU Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL
Abstract:Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed and limited to be task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases, surpassing methods that operate on ground truth 3D geometry. See the project website at this http URL
[CV-39] Mimicking Human Visual Development for Learning Robust Image Representations
【Quick Read】: Convolutional neural networks (CNNs) generalize poorly when the input distribution shifts, a clear gap from the adaptability of the human visual system. The key of the solution is a progressive blurring curriculum: training starts from heavily blurred images, and the blur is gradually reduced as training proceeds, steering the network to prioritize global structure over high-frequency detail and improving robustness to distribution shifts and noisy inputs. Experiments show the curriculum reduces mean corruption error (mCE) by 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C without significantly hurting in-domain accuracy, outperforms static blur augmentation, and combines with other augmentations such as CutMix and MixUp to improve both natural and adversarial robustness.
Link: https://arxiv.org/abs/2512.14360
Authors: Ankita Raj, Kaashika Prajaapat, Tapan Kumar Gandhi, Chetan Arora
Institutions: Indian Institute of Technology Delhi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ICVGIP 2025
Abstract:The human visual system is remarkably adept at adapting to changes in the input distribution; a capability modern convolutional neural networks (CNNs) still struggle to match. Drawing inspiration from the developmental trajectory of human vision, we propose a progressive blurring curriculum to improve the generalization and robustness of CNNs. Human infants are born with poor visual acuity, gradually refining their ability to perceive fine details. Mimicking this process, we begin training CNNs on highly blurred images during the initial epochs and progressively reduce the blur as training advances. This approach encourages the network to prioritize global structures over high-frequency artifacts, improving robustness against distribution shifts and noisy inputs. Challenging prior claims that blurring in the initial training epochs imposes a stimulus deficit and irreversibly harms model performance, we reveal that early-stage blurring enhances generalization with minimal impact on in-domain accuracy. Our experiments demonstrate that the proposed curriculum reduces mean corruption error (mCE) by up to 8.30% on CIFAR-10-C and 4.43% on ImageNet-100-C datasets, compared to standard training without blurring. Unlike static blur-based augmentation, which applies blurred images randomly throughout training, our method follows a structured progression, yielding consistent gains across various datasets. Furthermore, our approach complements other augmentation techniques, such as CutMix and MixUp, and enhances both natural and adversarial robustness against common attack methods. Code is available at this https URL.
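The curriculum is easy to reproduce in spirit: start training with strong Gaussian blur and anneal it toward zero. The sketch below uses torchvision; the schedule (linear annealing, sigma from 4.0 to 0.1, kernel size 9) is an assumption, as the paper's exact hyperparameters are not given in the abstract.

```python
from torchvision import transforms

def blur_for_epoch(epoch: int, total_epochs: int,
                   sigma_start: float = 4.0, sigma_end: float = 0.1):
    """Linearly anneal Gaussian-blur strength from strong to (almost) none."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    sigma = sigma_start + t * (sigma_end - sigma_start)
    return transforms.GaussianBlur(kernel_size=9, sigma=sigma)

for epoch in range(30):
    train_tf = transforms.Compose([
        blur_for_epoch(epoch, total_epochs=30),  # curriculum step
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    # dataset = CIFAR10(..., transform=train_tf); train one epoch as usual
```

Rebuilding the transform once per epoch keeps the schedule explicit, in contrast to static blur augmentation applied at a random strength throughout training.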
[CV-40] Enhancing Interpretability for Vision Models via Shapley Value Optimization AAAI2026
【Quick Read】: This paper tackles the opacity of deep neural network (DNN) decision-making, and two limitations of existing explanation methods in particular: post-hoc explanations often fail to faithfully reflect model behavior, while self-explaining neural networks sacrifice performance and compatibility through specialized architectural designs. The key of the solution is a novel self-explaining framework that adds Shapley-value estimation as an auxiliary task during training, achieving two things: fairly allocating the model's prediction score to image patches, so that explanations inherently align with the model's decision logic, and markedly improving interpretability with only minor structural modifications, preserving model performance and compatibility.
Link: https://arxiv.org/abs/2512.14354
Authors: Kanglong Fan, Yunqiao Yang, Chen Ma
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Accepted to AAAI2026
Abstract:Deep neural networks have demonstrated remarkable performance across various domains, yet their decision-making processes remain opaque. Although many explanation methods are dedicated to bringing the obscurity of DNNs to light, they exhibit significant limitations: post-hoc explanation methods often struggle to faithfully reflect model behaviors, while self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. To address these challenges, we propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training, which achieves two key advancements: 1) a fair allocation of the model prediction scores to image patches, ensuring explanations inherently align with the model’s decision logic, and 2) enhanced interpretability with minor structural modifications, preserving model performance and compatibility. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art interpretability.
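The framework supervises an auxiliary head with Shapley-value targets over image patches. The abstract does not specify the estimator, so the sketch below shows a generic Monte Carlo permutation estimator such a head could be regressed onto; `model`, the square patch grid, and the sample count are assumptions.

```python
import torch

@torch.no_grad()
def mc_shapley_patches(model, image, baseline, n_patches, n_samples=64):
    """Monte Carlo Shapley values of image patches for the predicted class.

    model: maps a batch of images to logits; image/baseline: (C, H, W) with
    H, W divisible by the square patch grid. Patches outside a coalition are
    replaced by the baseline (e.g. a blurred or zeroed image).
    """
    side = int(n_patches ** 0.5)
    c, h, w = image.shape
    ph, pw = h // side, w // side
    target = model(image.unsqueeze(0)).argmax(dim=1)

    def masked(mask):  # compose image from kept patches + baseline elsewhere
        m = mask.view(side, side).repeat_interleave(ph, 0).repeat_interleave(pw, 1)
        return torch.where(m.bool(), image, baseline).unsqueeze(0)

    def value(mask):
        return model(masked(mask))[0, target].item()

    phi = torch.zeros(n_patches)
    for _ in range(n_samples):
        order = torch.randperm(n_patches)
        mask = torch.zeros(n_patches)
        prev = value(mask)
        for i in order:           # add patches one by one in random order
            mask[i] = 1.0
            cur = value(mask)
            phi[i] += cur - prev  # marginal contribution of patch i
            prev = cur
    return phi / n_samples
```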
[CV-41] HGS: Hybrid Gaussian Splatting with Static-Dynamic Decomposition for Compact Dynamic View Synthesis
【Quick Read】: This paper addresses parameter redundancy and low computational efficiency in dynamic novel view synthesis (NVS), where existing 3D Gaussian Splatting (3DGS) based methods struggle to render in real time on resource-constrained devices. The key is Hybrid Gaussian Splatting (HGS), a compact and efficient framework whose core innovation is a Static-Dynamic Decomposition (SDD) strategy: Gaussian primitives are modeled with radial basis functions (RBFs), using time-dependent RBFs in dynamic regions to capture temporal variation and abrupt scene changes, while static regions share time-invariant parameters to eliminate redundancy. A two-stage training strategy further improves temporal coherence at static-dynamic boundaries. The method cuts model size by up to 98%, achieves real-time rendering at 125 FPS at 4K resolution, and maintains high visual quality.
Link: https://arxiv.org/abs/2512.14352
Authors: Kaizhe Zhang, Yijie Zhou, Weizhan Zhang, Caixia Yan, Haipeng Du, Yugui Xie, Yu-Hui Wen, Yong-Jin Liu
Institutions: Xi'an Jiaotong University; chinamobile.com; Beijing Jiaotong University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
Notes: 11 pages, 9 figures
Abstract:Dynamic novel view synthesis (NVS) is essential for creating immersive experiences. Existing approaches have advanced dynamic NVS by introducing 3D Gaussian Splatting (3DGS) with implicit deformation fields or indiscriminately assigned time-varying parameters, surpassing NeRF-based methods. However, due to excessive model complexity and parameter redundancy, they incur large model sizes and slow rendering speeds, making them inefficient for real-time applications, particularly on resource-constrained devices. To obtain a more efficient model with fewer redundant parameters, in this paper, we propose Hybrid Gaussian Splatting (HGS), a compact and efficient framework explicitly designed to disentangle static and dynamic regions of a scene within a unified representation. The core innovation of HGS lies in our Static-Dynamic Decomposition (SDD) strategy, which leverages Radial Basis Function (RBF) modeling for Gaussian primitives. Specifically, for dynamic regions, we employ time-dependent RBFs to effectively capture temporal variations and handle abrupt scene changes, while for static regions, we reduce redundancy by sharing temporally invariant parameters. Additionally, we introduce a two-stage training strategy tailored for explicit models to enhance temporal coherence at static-dynamic boundaries. Experimental results demonstrate that our method reduces model size by up to 98% and achieves real-time rendering at up to 125 FPS at 4K resolution on a single RTX 3090 GPU. It further sustains 160 FPS at 1352 * 1014 on an RTX 3050 and has been integrated into the VR system. Moreover, HGS achieves comparable rendering quality to state-of-the-art methods while providing significantly improved visual fidelity for high-frequency details and abrupt scene changes.
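The core mechanism, time-dependent RBFs on dynamic primitives versus shared time-invariant parameters on static ones, can be illustrated on a single attribute. The sketch below applies a temporal RBF mixture to one Gaussian's opacity; the number of RBFs and all parameter values are assumptions, and the actual method models Gaussian primitives more broadly than this.

```python
import torch

def rbf_opacity(t, centers, widths, weights, base_opacity):
    """Opacity of one dynamic Gaussian at time t as a mixture of temporal RBFs.

    centers/widths/weights: learnable per-primitive parameters, shape (K,).
    Static primitives skip this and reuse a single time-invariant opacity.
    """
    t = torch.as_tensor(t, dtype=torch.float32)
    basis = torch.exp(-0.5 * ((t - centers) / widths) ** 2)  # (K,) responses
    return torch.sigmoid(base_opacity + (weights * basis).sum())

centers = torch.tensor([0.2, 0.5, 0.8])  # assumed: 3 RBFs over normalized time
widths = torch.tensor([0.1, 0.05, 0.1])  # a narrow RBF captures abrupt changes
weights = torch.tensor([1.5, -2.0, 1.0])
for t in (0.1, 0.5, 0.9):
    print(t, float(rbf_opacity(t, centers, widths, weights, base_opacity=0.0)))
```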
[CV-42] Towards Transferable Defense Against Malicious Image Edits
【Quick Read】: This paper targets the defense of diffusion-based image-editing systems against malicious manipulation, and in particular the limited transferability of existing defenses in cross-model evaluations. The key is TDAE (Transferable Defense Against Malicious Image Edits), a bimodal defense framework that jointly optimizes image and text features for stronger robustness: on the visual side, the FlatGrad Defense Mechanism (FDM) adds gradient regularization to steer perturbations toward flat minima, strengthening immunity against unseen editing models; on the textual side, Dynamic Prompt Defense (DPD) periodically refines text embeddings so that edits of immunized images stay close to the original images, further improving cross-model transferability.
Link: https://arxiv.org/abs/2512.14341
Authors: Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
Institutions: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Computer Science, China University of Geosciences, Wuhan 430074, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 14 pages, 5 figures
Abstract:Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.
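FDM is described only as "gradient regularization ... steering the perturbations toward flat minima". One plausible instantiation, shown below as a hedged sketch rather than the paper's algorithm, adds a gradient-norm penalty to the adversarial objective; `edit_loss_fn`, the step size, and the L-infinity budget `eps` are assumptions.

```python
import torch

def flatgrad_step(delta, image, edit_loss_fn, beta=0.1, lr=0.01, eps=8 / 255):
    """One update of an immunizing perturbation with a flatness penalty.

    edit_loss_fn(x) returns the adversarial objective to maximize (e.g. the
    distance between the edit of x and the expected edit). Penalizing the
    gradient norm steers delta toward flat maxima, the intuition behind FDM.
    """
    delta = delta.detach().requires_grad_(True)
    loss = edit_loss_fn(image + delta)
    grad = torch.autograd.grad(loss, delta, create_graph=True)[0]
    flat_loss = loss - beta * grad.pow(2).sum()  # discourage sharp directions
    flat_loss.backward()
    with torch.no_grad():
        delta = delta + lr * delta.grad.sign()   # ascend the regularized loss
        delta = delta.clamp(-eps, eps)           # keep the perturbation small
    return delta
```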
[CV-43] Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
【Quick Read】: This paper addresses the weak semantic understanding of vision-language models (VLMs) when automatically animating Scalable Vector Graphics (SVG): current VLM systems struggle to identify which SVG components should move together, because visually coherent parts are often fragmented into low-level geometric shapes without structured semantics. The key of the solution is a framework that statistically aggregates multiple weak part predictions to stably infer the semantic structure of an SVG from noisy predictions, then reorganizes the raw SVG into semantic groups, markedly improving the coherence and interpretability of the generated animations.
Link: https://arxiv.org/abs/2512.14336
Authors: Jooyeol Yun, Jaegul Choo
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: this http URL
Abstract:Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
[CV-44] Dual Attention Guided Defense Against Malicious Edits
【Quick Read】: This paper targets the ethical risks of malicious use of text-to-image diffusion models for image editing, such as generating deceptive or harmful content; existing defenses based on imperceptible perturbations remain limited against malicious tampering. The key is a Dual Attention-Guided Noise Perturbation (DANP) immunization method: operating over multiple timesteps, it perturbs both the model's cross-attention maps and its noise-prediction process, using a dynamic threshold to build masks that separate text-relevant from irrelevant regions; it then lowers attention in relevant regions and raises it in irrelevant ones, misguiding the edit toward the wrong areas while protecting the intended targets. DANP additionally maximizes the discrepancy between the injected noise and the model's predicted noise to further disrupt generation. By intervening in both the attention and noise-prediction mechanisms, the method markedly improves resistance to malicious edits, and experiments show state-of-the-art performance.
Link: https://arxiv.org/abs/2512.14333
Authors: Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
Institutions: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Computer Science, China University of Geosciences, Wuhan 430074, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 11 pages, 7 figures
Abstract:Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguiding the edit towards incorrect regions and preserving the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.
[CV-45] Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity
【Quick Read】: This paper aims to protect images against unauthorized, malicious editing by diffusion models via imperceptible perturbations (image immunization). The core difficulty is that existing evaluation metrics only measure visual dissimilarity between the output from a protected image and a reference output, overlooking the real goal of immunization: breaking the semantic alignment with the attacker's intent, regardless of deviation from any specific output. The key is Synergistic Intermediate Feature Manipulation (SIFM), which perturbs intermediate diffusion features with two synergistic objectives: (1) maximizing the divergence of intermediate features from the original editing trajectory to disrupt semantic alignment, and (2) minimizing feature norms to induce strong perceptual degradation. The paper also introduces the Immunization Success Rate (ISR), a new metric that, for the first time, systematically quantifies immunization efficacy by using multimodal large language models (MLLMs) to judge whether an edit semantically mismatches the prompt or suffers significant perceptual degradation; experiments show SIFM achieves state-of-the-art resistance to malicious diffusion-based editing.
Link: https://arxiv.org/abs/2512.14320
Authors: Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen
Institutions: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (CAS); University of Chinese Academy of Sciences; Center for Machine Vision and Signal Analysis, University of Oulu; School of Computer Science, China University of Geosciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 11 pages, 4 figures
Abstract:Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.
[CV-46] From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region
【Quick Read】: This paper addresses the need for efficient identification of wastewater treatment plants (WWTPs) in the Middle East and North Africa (MENA) to support sustainable water management. Traditional YOLOv8-based segmentation requires extensive manual labeling, which is costly and inefficient. The key is to exploit the built-in reasoning of vision-language models (VLMs) to recognize and classify WWTPs with little or no annotation. Using zero-shot and few-shot evaluation streams, the study compares several state-of-the-art VLMs, including Gemma 3, Gemini, and LLaMA 3.2 Vision, and finds that some VLMs exceed YOLOv8's true-positive rate without any training data, validating the feasibility and advantages of annotation-free WWTP detection from remote-sensing imagery for scalable, low-cost environmental monitoring.
Link: https://arxiv.org/abs/2512.14312
Authors: Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 9 pages, 9 figures
Abstract:In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling, but studies indicate that vision-language models (VLMs) are an efficient alternative, achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and the UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks and aeration basins and to distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and an equal number of non-WWTP sites from field/AI data, as 600 m x 600 m GeoTIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8's true positive rate, with Gemma-3 highest. Results confirm that VLMs, particularly zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
[CV-47] PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition
【Quick Read】: This paper addresses the difficulty of self-supervised learning (SSL) in capturing the hierarchical, multi-scale lesion patterns of plant-disease imagery; existing methods focus on global alignment and miss the multi-level cues ranging from macroscopic lesion distribution to microscopic texture. The key is PSMamba, a progressive self-supervised framework that combines the efficient sequence modeling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy: a shared global teacher and two specialized students, one processing mid-scale views to capture lesion distribution and vein structure, the other focusing on local views to extract fine-grained cues such as texture irregularities and early-stage lesions. Cross-scale consistency constraints enable joint learning of contextual and detailed representations, clearly improving accuracy and robustness under domain shift and in fine-grained scenarios.
Link: https://arxiv.org/abs/2512.14309
Authors: Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Self-supervised Learning (SSL) has become a powerful paradigm for representation learning without manual annotations. However, most existing frameworks focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery. To address this gap, we propose PSMamba, a progressive self-supervised framework that integrates the efficient sequence modelling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy. Unlike conventional single teacher-student designs, PSMamba employs a shared global teacher and two specialised students: one processes mid-scale views to capture lesion distributions and vein structures, while the other focuses on local views to capture fine-grained cues such as texture irregularities and early-stage lesions. This multi-granular supervision facilitates the joint learning of contextual and detailed representations, with consistency losses ensuring coherent cross-scale alignment. Experiments on three benchmark datasets show that PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.
[CV-48] SS4D: Native 4D Generative Model via Structured Spacetime Latents SIGGRAPH
【Quick Read】: This paper addresses 4D generation: synthesizing dynamic 3D objects directly from monocular video. Existing approaches typically optimize on top of 3D or video generative models and struggle to deliver high fidelity, temporal coherence, and structural consistency at once. The key of the proposed native 4D generative model SS4D is threefold: (1) building on a pre-trained single-image-to-3D model to obtain a structurally consistent spatial representation, mitigating the scarcity of 4D training data; (2) dedicated temporal layers that reason across frames to enforce temporal coherence; and (3) compressing the spacetime latents along the temporal axis with factorized 4D convolutions and temporal downsampling, enabling efficient training and inference on long video sequences, together with a training strategy designed for robustness to occlusion.
Link: https://arxiv.org/abs/2512.14284
Authors: Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin
Institutions: The Chinese University of Hong Kong; Shanghai AI Laboratory; Zhejiang University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ToG (SIGGRAPH Asia 2025)
Abstract:We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion.
[CV-49] TUN: Detecting Significant Points in Persistence Diagrams with Deep Learning
【Quick Read】: This paper addresses the automatic identification of significant points in one-dimensional persistence diagrams (PDs), i.e., distinguishing points that encode genuine topological signal from noise, to make topological data analysis (TDA) more automated and reliable in practice. The key is the Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point-cloud encoder, learned feature fusion, and per-point classification, together with stable preprocessing and imbalance-aware training, yielding an effective automated solution for detecting significant PD points.
Link: https://arxiv.org/abs/2512.14274
Authors: Yu Chen, Hongwei Lin
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Algebraic Topology (math.AT)
Notes:
Abstract:Persistence diagrams (PDs) provide a powerful tool for understanding the topology of the underlying shape of a point cloud. However, identifying which points in PDs encode genuine signals remains challenging. This challenge directly hinders the practical adoption of topological data analysis in many applications, where automated and reliable interpretation of persistence diagrams is essential for downstream decision-making. In this paper, we study automatic significance detection for one-dimensional persistence diagrams. Specifically, we propose Topology Understanding Net (TUN), a multi-modal network that combines enhanced PD descriptors with self-attention, a PointNet-style point cloud encoder, learned fusion, and per-point classification, alongside stable preprocessing and imbalance-aware training. It provides an automated and effective solution for identifying significant points in PDs, which are critical for downstream applications. Experiments show that TUN outperforms classic methods in detecting significant points in PDs, illustrating its effectiveness in real-world applications.
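TUN's "enhanced PD descriptors" are not specified in the abstract; as a minimal illustration of the kind of per-point features a significance classifier consumes, the sketch below computes a few standard quantities from a persistence diagram. The feature choice is an assumption, not the paper's descriptor set.

```python
import numpy as np

def pd_point_features(diagram: np.ndarray) -> np.ndarray:
    """Simple per-point descriptors for a persistence diagram of shape (N, 2).

    Columns of the input: birth, death. Returns [persistence, relative
    persistence, midlife, persistence rank], one row per PD point.
    """
    birth, death = diagram[:, 0], diagram[:, 1]
    pers = death - birth
    rel = pers / (pers.max() + 1e-9)            # scale-invariant significance cue
    midlife = (birth + death) / 2.0
    rank = np.argsort(np.argsort(-pers)).astype(float)  # 0 = most persistent
    return np.stack([pers, rel, midlife, rank], axis=1)

pd = np.array([[0.0, 2.0], [0.1, 0.3], [0.5, 0.6]])  # toy H1 diagram
print(pd_point_features(pd))
```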
[CV-50] Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
【Quick Read】: This paper studies grounded video question answering (GVQA), where large video-language models still show limited temporal awareness: they struggle to precisely localize the relevant segments and often produce temporal mislocalization and hallucinations unsupported by visual evidence. The key is Zoom-Zero, a coarse-to-fine framework with two innovations: (i) a zoom-in accuracy reward that verifies the fidelity of temporal-grounding predictions and performs fine-grained visual verification on the grounded frames; and (ii) token-selective credit assignment, which attributes rewards precisely to the tokens responsible for temporal localization or answer generation, mitigating GRPO's weakness with multi-faceted reward signals. The method improves temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, raises average answer accuracy by 2.4%, and the coarse-to-fine zoom-in at inference further benefits long-form video understanding by preserving critical details without losing global context.
Link: https://arxiv.org/abs/2512.14273
Authors: Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma
Institutions: NVIDIA; KAUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL
Abstract:Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO’s issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
[CV-51] DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
【Quick Read】: This paper addresses the limitations of current driver-attention prediction models in complex traffic scenes: most rely on a narrow frontal field of view and a single driving context, failing to capture the spatial context of lane changes, turns, and surrounding pedestrians or cyclists. The core contribution is DriverGaze360, a large-scale 360° driver-gaze dataset (about 1 million gaze-labeled frames from 19 drivers), together with DriverGaze360-Net, which jointly learns attention maps and attended objects via an auxiliary semantic segmentation head, improving spatial awareness and attention-prediction accuracy on wide panoramic inputs and achieving state-of-the-art performance on multiple metrics.
Link: https://arxiv.org/abs/2512.14266
Authors: Shreedhar Govil, Didier Stricker, Jason Rambach
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360° field-of-view driver attention dataset, containing ~1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method available at this https URL.
[CV-52] Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs
【Quick Read】: This paper addresses the optimization difficulty of Visual Programming (VP) on complex visual reasoning (VR) tasks, caused by the absence of sub-task labels and the non-differentiability of VP execution. Prior work focuses on improving the quality of LLM-generated visual programs while neglecting optimization of the pre-trained vision modules that VP invokes. The key is EVPG, which builds a directed probabilistic graph from the variable dependencies arising during VP execution, recasting the non-differentiable execution as differentiable, exact probabilistic inference on this graph. This enables end-to-end, gradient-based optimization using only final task labels and brings significant gains for VP on complex VR tasks such as GQA, NLVRv2, and Open Images.
Link: https://arxiv.org/abs/2512.14257
Authors: Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen, Liang Lin, Keze Wang
Institutions: Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 13 pages, 12 figures
Abstract:Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving the quality of LLM-generated visual programs. However, they have neglected to optimize the VP-invoked pre-trained models, which serve as modules for the visual sub-tasks decomposed from the targeted tasks by VP. The difficulty is that there are only final labels of targeted VR tasks rather than labels of sub-tasks. Besides, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods to leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we creatively build a directed probabilistic graph according to the variable dependency relationships during the VP executing process, which reconstructs the non-differentiable VP executing process into a differentiable exact probability inference process on this directed probabilistic graph. As a result, this enables the VP framework to utilize the final labels for efficient, gradient-based optimization in end-to-end supervised learning on targeted VR tasks. Extensive and comprehensive experiments demonstrate the effectiveness and advantages of our EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
[CV-53] Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding
【Quick Read】: This paper addresses automatic monocular-to-stereo video conversion (generating a binocular video with a sense of depth) to meet the growing demand for immersive 3D content. Traditional pipelines rely on explicit depth estimation and warping, which introduces artifacts, while recent warping-free methods still fall short in controllability and quality. The key is Elastic3D, a direct end-to-end method based on conditional latent diffusion, with a novel guided VAE decoder that keeps the stereo output sharp and epipolar-consistent. The method also exposes an intuitive scalar tuning knob at inference time, letting users control the strength of the stereo effect (i.e., the disparity range), enabling high-quality, controllable stereo conversion.
Link: https://arxiv.org/abs/2512.14236
Authors: Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this http URL
Abstract:The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Please check the project page for the video samples this https URL.
[CV-54] 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation
【Quick Read】: This paper addresses the scarcity of annotated data for automotive radar, which limits radar-based perception systems. The key is 4D-RaDiff, a framework that applies diffusion in a latent point-cloud space to generate high-quality 4D radar point clouds; it can turn unlabeled bounding boxes into high-fidelity radar annotations and convert existing LiDAR point clouds into realistic radar scenes, enabling effective training and evaluation of object detectors.
Link: https://arxiv.org/abs/2512.14235
Authors: Jimmie Kwok, Holger Caesar, Andras Palffy
Institutions: Delft University of Technology; Perciv AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data of 4D-RaDiff as data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.
[CV-55] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
【Quick Read】: This paper addresses the lack of social grounding and agentic decision-making in behavior generation for conversational systems: prior work frames human behavior as a single-turn speech- or text-to-motion translation task, ignoring the decisions of when to act, how to adapt, and how to coordinate language with body behavior across multi-turn dialogue, which leads to brittle timing, weak social grounding, and modalities trained in isolation. The key is ViBES (Voice in Behavioral Expression and Synchrony), a 3D conversational agent built on a joint speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts handle speech, facial expression, and body motion, with hard routing assigning parameters per modality and cross-expert attention sharing information across modalities. Leveraging strong pretrained speech-language models, the agent jointly plans and executes language and movement in multi-turn interaction, supports mixed-initiative input via speech, text, or body-action directives, and exposes controllable behavior hooks for streaming responses.
Link: https://arxiv.org/abs/2512.14234
Authors: Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli
Institutions: Stanford University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL
Abstract:Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: this http URL
[CV-56] Multi-View MRI Approach for Classification of MGMT Methylation in Glioblastoma Patients
【Quick Read】: This paper addresses non-invasive detection of MGMT promoter methylation status in glioblastoma multiforme (GBM) patients to better assess chemotherapy response, since confirmation currently relies on invasive brain-tissue biopsies, which carries clinical risk. The key is a radiogenomics approach over multi-view MRI: a deep learning model exploits spatial relationships across the three anatomical views to accurately predict MGMT methylation status, and a new tumor-slice extraction technique captures information from all views without a complex 3D deep model, avoiding high parameter counts, slow convergence, and heavy memory demands while improving diagnostic efficiency and reproducibility.
Link: https://arxiv.org/abs/2512.14232
Authors: Rawan Alyahya, Asrar Alruwayqi, Atheer Alqarni, Asma Alkhaldi, Metab Alkubeyyer, Xin Gao, Mona Alshahrani
Institutions: National Center for Artificial Intelligence (NCAI), SDAIA, Riyadh, Saudi Arabia; King Abdullah University of Science and Technology (KAUST), CEMSE Division, Thuwal, Saudi Arabia; Aramco Research Center, R&D Center Department, Dhahran, Saudi Arabia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:The presence of MGMT promoter methylation significantly affects how well chemotherapy works for patients with Glioblastoma Multiforme (GBM). Currently, confirmation of MGMT promoter methylation relies on invasive brain tumor tissue biopsies. In this study, we explore radiogenomics techniques, a promising approach in precision medicine, to identify genetic markers from medical images. Using MRI scans and deep learning models, we propose a new multi-view approach that considers spatial relationships between MRI views to detect MGMT methylation status. Importantly, our method extracts information from all three views without using a complicated 3D deep learning model, avoiding issues associated with high parameter count, slow convergence, and substantial memory demands. We also introduce a new technique for tumor slice extraction and show its superiority over existing methods based on multiple evaluation metrics. By comparing our approach to state-of-the-art models, we demonstrate the efficacy of our method. Furthermore, we share a reproducible pipeline of published models, encouraging transparency and the development of robust diagnostic tools. Our study highlights the potential of non-invasive methods for identifying MGMT promoter methylation and contributes to advancing precision medicine in GBM treatment.
[CV-57] OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving ACM-MM2025
【Quick Read】: This paper addresses the efficiency and consistency of multimodal sensor-data generation for autonomous driving: existing methods are mostly single-modality, so generated multi-sensor data are hard to align in space and time and inefficient to produce. The key of the proposed OminiGen framework is a shared bird's-eye-view (BEV) space that unifies multimodal features, plus UAE, a novel generalizable multimodal reconstruction method that jointly decodes LiDAR and multi-view camera data via volume rendering to guarantee cross-modal consistency of the generated data. A diffusion transformer (DiT) with a ControlNet branch further enables controllable multimodal sensor generation, improving flexibility and practicality.
Link: https://arxiv.org/abs/2512.14225
Authors: Tao Tang, Enhui Ma, Xia Zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang
Institutions: Shenzhen Campus of Sun Yat-sen University; Westlake University; Li Auto Inc.; The University of Toronto; University of Macau; Bytedance Seed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: ACM MM 2025
Abstract:Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OminiGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OminiGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.
[CV-58] History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation
【Quick Read】: This paper addresses target localization from language instructions for unmanned aerial vehicles (UAVs) in large-scale urban environments, where existing mono-granularity frameworks struggle to balance global environmental reasoning with local scene understanding. The key is a History-Enhanced Two-Stage Transformer (HETT) that achieves dual-granularity perception through a coarse-to-fine navigation pipeline: it first predicts a coarse target position by fusing spatial landmarks with historical context, then refines actions with fine-grained visual analysis. A historical grid map dynamically aggregates visual features into a structured spatial memory, improving comprehensive scene awareness, and the CityNav annotations are manually refined to improve data quality.
Link: https://arxiv.org/abs/2512.14222
Authors: Xichen Ding, Jianzhe Gao, Cong Pan, Wenguan Wang, Jie Qin
Institutions: University of Science and Technology of China; Alibaba Group; Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:
Abstract:Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, while extensive ablation studies further verify the effectiveness of each component.
[CV-59] DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos
【Quick Read】: This paper addresses the limited controllability of video diffusion models when used for robotic manipulation in embodied AI. Existing trajectory-conditioned video generation typically relies on 2D trajectories or a single conditioning modality, making it hard to produce controllable, spatio-temporally consistent robot demonstrations. The key is DRAW2ACT, which extracts several orthogonal representations from the input trajectory, covering depth, semantics, shape, and motion, and injects them into the diffusion model for stronger control; it jointly generates spatially aligned RGB and depth videos, using cross-modality attention and depth supervision to improve spatio-temporal consistency; finally, a multimodal policy model regresses the robot's joint angles from the generated RGB and depth sequences, clearly improving manipulation success rates and visual fidelity.
Link: https://arxiv.org/abs/2512.14217
Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
Institutions: Ludwig Maximilian University of Munich; Munich Center for Machine Learning; University of Freiburg; Technical University of Munich; Huawei Heisenberg Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:
Abstract:Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot’s joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
[CV-60] Beyond a Single Light: A Large-Scale Aerial Dataset for Urban Scene Reconstruction Under Varying Illumination
【Quick Read】: This paper addresses the color artifacts, geometric inaccuracy, and appearance inconsistency caused by illumination changes in large-scale multi-temporal UAV 3D reconstruction, a real-world challenge that has lacked a systematic dataset. The key is SkyLume, a large-scale real-world UAV dataset: (1) over 100k high-resolution images (four oblique views plus nadir) across 10 urban regions, each region captured at three times of day to systematically isolate illumination changes; (2) per-scene LiDAR scans and accurate 3D ground truth for evaluating depth, surface normals, and reconstruction quality under varying illumination; and (3) a Temporal Consistency Coefficient (TCC) for inverse rendering that directly measures the cross-time stability of material albedo, i.e., the robustness of light-material disentanglement. The dataset is intended as a trustworthy real-world benchmark for large-scale inverse rendering, geometry reconstruction, and novel view synthesis.
Link: https://arxiv.org/abs/2512.14200
Authors: Zhuoxiao Li, Wenzong Ma, Taoyu Wu, Jinjing Zhu, Zhenchao Q, Shuai Zhang, Jing Ou, Yinrui Ren, Weiqing Qi, Guobin Shen, Hui Xiong, Wufan Zhao
Institutions: HKUST (GZ); University of Liverpool
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Recent advances in Neural Radiance Fields and 3D Gaussian Splatting have demonstrated strong potential for large-scale UAV-based 3D reconstruction tasks by fitting the appearance of images. However, real-world large-scale captures are often based on multi-temporal data capture, where illumination inconsistencies across different times of day can significantly lead to color artifacts, geometric inaccuracies, and inconsistent appearance. Due to the lack of UAV datasets that systematically capture the same areas under varying illumination conditions, this challenge remains largely underexplored. To fill this gap, we introduce SkyLume, a large-scale, real-world UAV dataset specifically designed for studying illumination-robust 3D reconstruction in urban scene modeling: (1) We collect data from 10 urban regions comprising more than 100k high resolution UAV images (four oblique views and nadir), where each region is captured at three periods of the day to systematically isolate illumination changes. (2) To support precise evaluation of geometry and appearance, we provide per-scene LiDAR scans and accurate 3D ground-truth for assessing depth, surface normals, and reconstruction quality under varying illumination. (3) For the inverse rendering task, we introduce the Temporal Consistency Coefficient (TCC), a metric that measures cross-time albedo stability and directly evaluates the robustness of the disentanglement of light and material. We aim for this resource to serve as a foundation that advances research and real-world evaluation in large-scale inverse rendering, geometry reconstruction, and novel view synthesis.
[CV-61] Fracture Morphology Classification: Local Multiclass Modeling for Multilabel Complexity
【Quick Read】: This paper addresses the difficulty of automatically recognizing fracture morphology in pediatric fracture imaging to improve diagnostic accuracy. The key is to automatically assign globally used AO classification codes to detected fracture bounding boxes, reformulating the originally complex global multilabel task into a local multiclass one, which clearly improves performance, raising the average F1 score by 7.89%; the approach also makes full use of public datasets and improves the transferability and practicality of models under standardized fracture-morphology annotation.
Link: https://arxiv.org/abs/2512.14196
Authors: Cassandra Krause, Mattias P. Heinrich, Ron Keuth
Institutions: University of Lübeck
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted as poster at the German Conference on Medical Image Computing 2026
Abstract:Between 15% and 45% of children experience a fracture during their growth years, making accurate diagnosis essential. Fracture morphology, alongside location and fragment angle, is a key diagnostic feature. In this work, we propose a method to extract fracture morphology by automatically assigning global AO codes to corresponding fracture bounding boxes. This approach enables the use of public datasets and reformulates the global multilabel task into a local multiclass one, improving the average F1 score by 7.89%. However, performance declines when using imperfect fracture detectors, highlighting challenges for real-world deployment. Our code is available on GitHub.
[CV-62] Establishing Stochastic Object Models from Noisy Data via Ambient Measurement-Integrated Diffusion
【Quick Read】: This paper addresses task-based image quality (IQ) assessment for medical imaging systems, in particular how to establish realistic, reliable stochastic object models (SOMs) under the randomness of anatomical variability. Conventional mathematical SOMs fail to capture realistic anatomy, while data-driven approaches usually require clean data that is rarely available clinically. The key is AMID, an unsupervised ambient measurement-integrated diffusion model: it aligns measurement noise with the diffusion trajectory and explicitly models the step-wise coupling between measurement noise and diffusion noise, from which an "ambient loss" is designed to learn clean SOMs directly from noisy measurements. Experiments on real CT and mammography datasets show higher generation fidelity and more reliable task-based IQ evaluation than existing methods.
Link: https://arxiv.org/abs/2512.14187
Authors: Jianwei Sun, Xiaoning Lei, Wenhao Cai, Xichen Xu, Yanshu Wang, Hu Gao
Institutions: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Task-based measures of image quality (IQ) are critical for evaluating medical imaging systems, which must account for randomness including anatomical variability. Stochastic object models (SOMs) provide a statistical description of such variability, but conventional mathematical SOMs fail to capture realistic anatomy, while data-driven approaches typically require clean data rarely available in clinical tasks. To address this challenge, we propose AMID, an unsupervised Ambient Measurement-Integrated Diffusion with noise decoupling, which establishes clean SOMs directly from noisy measurements. AMID introduces a measurement-integrated strategy aligning measurement noise with the diffusion trajectory and explicitly models the coupling between measurement and diffusion noise across steps; an ambient loss is then designed based on this coupling to learn clean SOMs. Experiments on real CT and mammography datasets show that AMID outperforms existing methods in generation fidelity and yields more reliable task-based IQ evaluation, demonstrating its potential for unsupervised medical imaging analysis.
[CV-63] Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
【Quick Read】: This paper addresses the limitations of spherical-harmonics (SH) appearance modeling in 3D Gaussian Splatting: weak modeling of high-frequency signals, Gibbs ringing artifacts, and the failure to capture specular reflections. The key is Spherical Voronoi (SV), a unified appearance representation that partitions the directional domain into learnable regions with smooth boundaries, giving an intuitive and stable parameterization of view-dependent effects. For specular reflection in particular, SV serves as a learnable reflection probe that, following classical graphics principles, takes the reflected direction as input, markedly improving photorealism and reaching state-of-the-art results on synthetic and real-world datasets, demonstrating SV's potential as a principled, efficient, and general appearance model for explicit 3D representations.
Link: https://arxiv.org/abs/2512.14180
Authors: Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi
Institutions: University of Torino; Simon Fraser University; University of British Columbia; University of Toronto; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.
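The central idea, a soft, differentiable Voronoi partition of the direction sphere, can be sketched compactly. The module below blends per-cell colors by softmax over negative geodesic distances to learnable anchor directions; the cell count, temperature, and per-cell constant colors are assumptions (the paper's actual parameterization, including its use as reflection probes, is richer).

```python
import torch

class SphericalVoronoiColor(torch.nn.Module):
    """View-dependent color as a soft Voronoi partition of the direction sphere.

    K learnable anchor directions define cells; an input direction is softly
    assigned to cells by geodesic distance, and cell colors are blended.
    """
    def __init__(self, k: int = 16, temperature: float = 0.1):
        super().__init__()
        self.anchors = torch.nn.Parameter(torch.randn(k, 3))
        self.colors = torch.nn.Parameter(torch.rand(k, 3))
        self.temperature = temperature

    def forward(self, dirs: torch.Tensor) -> torch.Tensor:
        a = torch.nn.functional.normalize(self.anchors, dim=-1)
        d = torch.nn.functional.normalize(dirs, dim=-1)
        cos = (d @ a.T).clamp(-1.0, 1.0)
        geo = torch.arccos(cos)                         # geodesic distance to anchors
        w = torch.softmax(-geo / self.temperature, -1)  # smooth cell boundaries
        return w @ self.colors                          # (N, 3) blended color

view_dirs = torch.nn.functional.normalize(torch.randn(5, 3), dim=-1)
print(SphericalVoronoiColor()(view_dirs).shape)  # torch.Size([5, 3])
```

For the reflection-probe use described in the abstract, the same module would be queried with reflected directions instead of raw view directions.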
[CV-64] Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes
【Quick Read】: This paper addresses the untrustworthiness of answers from large vision-language models (LVLMs): outputs can look plausible yet be unreliable, so robust uncertainty estimation is essential. Existing methods cluster multiple sampled answers with external models to measure semantic consistency, but such clustering is sensitive to minor phrasing differences and can wrongly merge or split semantically similar answers, yielding unreliable uncertainty estimates. The key is Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that models semantic uncertainty from the geometry of answer embeddings, avoiding the brittle clustering step: sampled answers are mapped into a dense semantic space, the Gram matrix of their embeddings is computed, its eigenspectrum summarizes the semantic configuration, and a Gaussian process classifier learns to map these consistency patterns to predictive uncertainty. SGPU applies in both black-box and white-box settings, achieves the best calibration (ECE) and discrimination (AUROC, AUARC) across six large models and eight datasets, and transfers across models and modalities, suggesting its spectral representation captures general patterns of semantic uncertainty.
Link: https://arxiv.org/abs/2512.14177
Authors: Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi
Institutions: AMIAD, Pôle Recherche, Palaiseau; valeo.ai; Safran Tech; University of Liège; New York University; National University of Singapore; ENSTA Paris
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
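The spectral pipeline in the abstract maps directly onto standard tooling. Below is a minimal sketch using scikit-learn: embed sampled answers, form the Gram matrix, summarize it by its eigenspectrum, and fit a Gaussian Process Classifier on those spectra. The embedding model, feature dimension, and reliability labels are synthetic stand-ins.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

def eigenspectrum_feature(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k eigenvalues of the Gram matrix of L2-normalized answer embeddings."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    gram = e @ e.T                               # semantic-consistency matrix
    eigvals = np.linalg.eigvalsh(gram)[::-1]     # descending eigenspectrum
    return eigvals[:k] / eigvals.sum()           # normalized spectral signature

rng = np.random.default_rng(0)
# 20 questions x 8 sampled answers x 64-dim embeddings (synthetic stand-ins)
X = np.stack([eigenspectrum_feature(rng.normal(size=(8, 64))) for _ in range(20)])
y = rng.integers(0, 2, size=20)                  # 1 = answer set judged reliable
gpc = GaussianProcessClassifier().fit(X, y)
print(gpc.predict_proba(X[:2]))                  # predictive uncertainty per question
```

Intuitively, a Gram matrix dominated by one large eigenvalue means the sampled answers agree semantically, while a flat spectrum signals disagreement, which is what the classifier learns to exploit.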
[CV-65] FastDDHPose: Towards Unified Efficient and Disentangled 3D Human Pose Estimation
【速读】: This paper tackles the lack of a unified training and evaluation framework in monocular 3D human pose estimation (3D HPE), which hinders fair comparison and efficient development across methods. The key is Fast3DHPE, a modular framework that standardizes training and evaluation protocols, enabling fair cross-method comparison and substantially faster training. On top of it, the authors propose FastDDHPose, whose core innovation is to use diffusion models to explicitly model the distributions of bone length and bone direction, avoiding hierarchical error accumulation, and to introduce an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that steers the model toward kinematic joint hierarchies while avoiding redundant modeling of overly complex joint topologies, achieving state-of-the-art results and strong generalization on Human3.6M and MPI-INF-3DHP.
链接: https://arxiv.org/abs/2512.14162
作者: Qingyuan Cai,Linxin Zhang,Xuecai Hu,Saihui Hou,Yongzhen Huang
机构: Beijing Normal University (北京师范大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent approaches for monocular 3D human pose estimation (3D HPE) have achieved leading performance by directly regressing 3D poses from 2D keypoint sequences. Despite the rapid progress in 3D HPE, existing methods are typically trained and evaluated under disparate frameworks, lacking a unified framework for fair comparison. To address these limitations, we propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods while significantly improving training efficiency. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method which leverages the strong latent distribution modeling capability of diffusion models to explicitly model the distributions of bone length and bone direction while avoiding further amplification of hierarchical error accumulation. Moreover, we design an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that encourages the model to focus on kinematic joint hierarchies while avoiding unnecessary modeling of overly complex joint topologies. Extensive experiments on Human3.6M and MPI-INF-3DHP show that the Fast3DHPE framework enables fair comparison of all methods while significantly improving training efficiency. Within this unified framework, FastDDHPose achieves state-of-the-art performance with strong generalization and robustness in in-the-wild scenarios. The framework and models will be released at: this https URL
[CV-66] CIS-BA: Continuous Interaction Space Based Backdoor Attack for Object Detection in the Real-World
【速读】: This paper addresses the backdoor-attack threat faced by object detection models deployed in the real world, such as autonomous driving systems. Existing methods are limited in robustness and attack capability by single-trigger-to-single-object mappings and fragile pixel-level cues. The key is CIS-BA (Continuous Interaction Space-based Backdoor Attack), a new paradigm that shifts trigger design from static object features to continuous inter-object interaction patterns (how objects co-occur and interact in a scene), constructing "space triggers" over this interaction space to enable a multi-trigger-multi-object attack mechanism whose robustness comes from invariant geometric relations. Its implementation framework, CIS-Frame, generates space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training; it supports both single-object and coordinated multi-object attacks, sustains high attack success rates (above 97%) in complex environments and under dynamic multi-trigger conditions, and evades current mainstream defenses.
链接: https://arxiv.org/abs/2512.14158
作者: Shuxin Zhao,Bo Lang,Nan Xiao,Yilang Zhang
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Object detection models deployed in real-world applications such as autonomous driving face serious threats from backdoor attacks. Despite their practical effectiveness, existing methods are inherently limited in both capability and robustness due to their dependence on single-trigger-single-object mappings and fragile pixel-level cues. We propose CIS-BA, a novel backdoor attack paradigm that redefines trigger design by shifting from static object features to continuous inter-object interaction patterns that describe how objects co-occur and interact in a scene. By modeling these patterns as a continuous interaction space, CIS-BA introduces space triggers that, for the first time, enable a multi-trigger-multi-object attack mechanism while achieving robustness through invariant geometric relations. To implement this paradigm, we design CIS-Frame, which constructs space triggers via interaction analysis, formalizes them as class-geometry constraints for sample poisoning, and embeds the backdoor during detector training. CIS-Frame supports both single-object attacks (object misclassification and disappearance) and multi-object simultaneous attacks, enabling complex and coordinated effects across diverse interaction states. Experiments on MS-COCO and real-world videos show that CIS-BA achieves over 97% attack success under complex environments and maintains over 95% effectiveness under dynamic multi-trigger conditions, while evading three state-of-the-art defenses. In summary, CIS-BA extends the landscape of backdoor attacks in interaction-intensive scenarios and provides new insights into the security of object detection systems.
[CV-67] Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis
【速读】: This paper addresses the difficulty that current reasoning-based medical multimodal large language models (Medical MLLMs) have in dynamically and iteratively focusing on fine-grained visual regions for precise grounding and diagnosis in complex tasks. The key is Ophiuchus, a versatile, tool-augmented framework that lets an MLLM autonomously decide when additional visual evidence is needed, determine where to probe and ground within a medical image, and seamlessly weave the relevant sub-image content into an interleaved multimodal chain of thought. Its core innovation is a three-stage training strategy: cold-start training on tool-integrated reasoning data for basic tool selection and adaptation to inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning that directly optimizes task-specific rewards to emulate expert-like diagnostic behavior, tightly coupling the model's intrinsic perception with external tools and markedly improving higher-level reasoning.
链接: https://arxiv.org/abs/2512.14157
作者: Yankai Jiang,Yujie Zhang,Peng Zhang,Yichen Li,Jintai Chen,Xiaoming Shi,Shihui Zhen
机构: Zhejiang University (浙江大学); Fudan University (复旦大学); Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Huazhong University of Science and Technology (华中科技大学); Information Hub, HKUST (Guangzhou) (香港科技大学(广州)信息 hub); East China Normal University (华东师范大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model’s inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely “think with images” through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.
[CV-68] TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models
【速读】: This paper addresses the inefficiencies caused by performance anti-patterns during training and inference of machine learning (ML) models, and in particular the reality that computer vision researchers lack the specialized resources to analyze PyTorch execution traces. The core solution is the first benchmark dataset specifically designed to evaluate and improve ML models' ability to detect anti-patterns in traces, paired with an iterative approach: a lightweight ML model first coarsely localizes trace segments containing anti-patterns, and a large language model (LLM) then performs fine-grained classification and targeted feedback, working around LLM context-length limits and reasoning inefficiency and clearly outperforming unsupervised clustering and rule-based statistical methods.
链接: https://arxiv.org/abs/2512.14141
作者: Hanning Chen,Keyu Man,Kevin Zhu,Chenguang Zhu,Haonan Li,Tongbo Luo,Xizhou Feng,Wei Sun,Sreen Tallam,Mohsen Imani,Partha Kanuparthy
机构: University of California, Irvine, CA, USA (加州大学欧文分校); Meta, Menlo Park, CA, USA; University of California, Riverside, CA, USA (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models, and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models' ability to detect anti-patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models (classification, detection, segmentation, and generation) collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti-patterns, followed by a large language model (LLM) for fine-grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule-based statistical techniques for detecting anti-pattern regions. Our method also effectively compensates for the LLM's limited context length and reasoning inefficiencies.
[CV-69] SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
【速读】: This paper addresses a core challenge of sketch editing in digital illustration: existing image-editing systems struggle to support both high-level semantic changes and precise local redrawing while preserving the sparse, style-sensitive structure of line art. The key is SketchAssist, an interactive sketch drawing assistant that unifies instruction-guided global edits with line-guided region redrawing. It introduces a controllable data-generation pipeline (building attribute-addition sequences, sampling across sequences to form multi-step edit chains, and expanding style coverage with a style-preserving attribute-removal model) and, with minimal changes to DiT-based editors, a unified editing framework that repurposes the RGB channels to encode inputs so the two modes switch seamlessly; a task-guided mixture-of-experts (MoE) embedded in LoRA layers routes experts by text and visual cues, improving semantic controllability, structural fidelity, and style consistency and reaching state-of-the-art performance across multiple metrics.
链接: https://arxiv.org/abs/2512.14140
作者: Han Zou,Yan Zhang,Ruiqi Yu,Cong Xie,Jie Huang,Zhenpeng Zhan
机构: Baidu Inc.(百度公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.
[CV-70] Erasing CLIP Memories: Non-Destructive Data-Free Zero-Shot class Unlearning in CLIP Models
【速读】: This paper addresses selective unlearning of specific classes in multimodal models such as CLIP: removing the model's memory of target classes without damaging its overall multimodal knowledge, for needs such as data privacy and model decontamination. The key is a closed-form method that computes an orthonormal basis for the subspace spanned by the target text embeddings and projects image features onto the nullspace of that subspace, precisely weakening the alignment between image features and the target classes without any retraining or use of forget-set images. The approach is computationally efficient and surgically precise, and a partial projection can trade off the degree of forgetting against retention of useful information.
链接: https://arxiv.org/abs/2512.14137
作者: Ashish Mishra,Tarun Kumar,Gyanaranjan Nayak,Arpit Shah,Suparna Bhattacharya,Martin Foltin
机构: Hewlett Packard Labs (惠普实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase the target class information embedded in the final projection layer, without requiring any retraining or the use of images from the forget set. By computing an orthonormal basis for the subspace spanned by target text embeddings and projecting out these directions, we dramatically reduce the alignment between image features and undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. This leads to a pronounced drop in zero-shot performance for the target classes while preserving the overall multimodal knowledge of the model. Our experiments demonstrate that even a partial projection can balance between complete unlearning and retaining useful information, addressing key challenges in model decontamination and privacy preservation.
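The closed-form projection is simple enough to sketch in a few lines of NumPy. The basis construction via QR below is one standard way to realize the orthonormal basis the abstract describes; the embedding dimensions and random features are stand-ins.

```python
import numpy as np

def nullspace_projector(target_text_emb: np.ndarray) -> np.ndarray:
    """target_text_emb: (k, d) embeddings of the classes to forget."""
    # Orthonormal basis V of the target subspace via (reduced) QR decomposition
    v, _ = np.linalg.qr(target_text_emb.T)       # (d, k)
    return np.eye(v.shape[0]) - v @ v.T          # (d, d) nullspace projector I - V V^T

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 512))             # 3 forget classes in a 512-dim space
p = nullspace_projector(text_emb)
image_feat = rng.normal(size=(512,))
erased = p @ image_feat
# Alignment with the forget directions collapses after projection:
print(np.abs(text_emb @ image_feat).round(2), np.abs(text_emb @ erased).round(6))
```

The partial projection mentioned in the abstract would correspond to `I - alpha * V @ V.T` with `alpha` between 0 and 1, trading forgetting strength against retained information.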
[CV-71] Consistent Instance Field for Dynamic Scene Understanding
【速读】: This paper addresses the difficulty of unifying instance consistency with spatio-temporal continuity in dynamic scene understanding: conventional methods rely on discrete tracking or view-dependent features, which causes instance drift or semantic inconsistency across time and viewpoints. The key is the Consistent Instance Field (CIF), a continuous, probabilistic spatio-temporal representation that models each space-time point with an occupancy probability and a conditional instance distribution, disentangling visibility from object identity. Its core innovation is an instance-embedded representation built on deformable 3D Gaussians that jointly encodes radiance and semantics and is learned directly from RGB images and instance masks through differentiable rasterization, with identity-calibration and Gaussian-resampling mechanisms ensuring consistent instance representations across space and time.
链接: https://arxiv.org/abs/2512.14126
作者: Junyi Wu,Van Nguyen Nguyen,Benjamin Planche,Jiachen Tao,Changchang Sun,Zhongpai Gao,Zhenghao Zhao,Anwesa Choudhuri,Gengyu Zhang,Meng Zheng,Feiran Wang,Terrence Chen,Yan Yan,Ziyan Wu
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); United Imaging Intelligence (联影智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization. Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.
[CV-72] SportsGPT: An LLM-driven Framework for Interpretable Sports Motion Assessment and Training Guidance
【速读】: This paper addresses the limitation of existing intelligent sports-analysis systems that stop at "scoring and visualization," lacking automatic performance diagnosis and interpretable training guidance. The key is the SportsGPT framework, which closes the loop from motion time-series input to professional training guidance: the MotionDTW algorithm (a two-stage time-series alignment method) extracts accurate keyframes from skeleton-based motion sequences; a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) contrasts the keyframes against target models to produce interpretable assessment metrics (e.g., "insufficient extension"); and SportsRAG, a retrieval-augmented generation (RAG) training-guidance model, retrieves domain-specific QA pairs from a 6B-token knowledge base and prompts the Qwen3 LLM to produce professional training advice. Experiments show clear gains over traditional methods in temporal error, IoU scores, and diagnostic accuracy.
链接: https://arxiv.org/abs/2512.14121
作者: Wenbo Tian,Ruting Lin,Hongxian Zheng,Yaodong Yang,Geng Wu,Zihao Zhang,Zhang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing intelligent sports analysis systems mainly focus on "scoring and visualization," often lacking automatic performance diagnosis and interpretable training guidance. Recent advances in Large Language Models (LLMs) and motion analysis techniques provide new opportunities to address the above limitations. In this paper, we propose SportsGPT, an LLM-driven framework for interpretable sports motion assessment and training guidance, which establishes a closed loop from motion time-series input to professional training guidance. First, given a set of high-quality target models, we introduce MotionDTW, a two-stage time series alignment algorithm designed for accurate keyframe extraction from skeleton-based motion sequences. Subsequently, we design a Knowledge-based Interpretable Sports Motion Assessment Model (KISMAM) to obtain a set of interpretable assessment metrics (e.g., insufficient extension) by contrasting the keyframes with the target models. Finally, we propose SportsRAG, a RAG-based training guidance model based on Qwen3. Leveraging a 6B-token knowledge base, it prompts the LLM to generate professional training guidance by retrieving domain-specific QA pairs. Experimental results demonstrate that MotionDTW significantly outperforms traditional methods with lower temporal error and higher IoU scores. Furthermore, ablation studies validate the KISMAM and SportsRAG, confirming that SportsGPT surpasses general LLMs in diagnostic accuracy and professionalism.
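The abstract does not spell out MotionDTW's two stages, so the sketch below shows only the classic dynamic time warping core it builds on: aligning a performed motion sequence to a target template so that frames matched to template keyframes can be read off the warping path. The pose features and sequences are synthetic stand-ins.

```python
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    """a: (n, d), b: (m, d) pose feature sequences; returns total cost and warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path from the corner
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda t: D[t])
    return D[n, m], path[::-1]

t = np.linspace(0, 1, 50)[:, None]            # target template (50 frames)
s = np.linspace(0, 1, 80)[:, None] ** 1.2     # performed motion (80 frames, warped in time)
cost, path = dtw_path(s, t)
print(cost, path[:3])
```

In a keyframe-extraction setting, the performed frames paired with a template keyframe index on `path` are the candidates handed to the downstream assessment model.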
[CV-73] MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction
【速读】: This paper addresses the long training and inference times caused by using multiple independent generative adversarial networks (GANs) for document image enhancement and binarization. Existing methods train separate GANs per color channel to remove shadows and noise, which improves optical character recognition (OCR) but is inefficient to deploy. The key is MFE-GAN, an efficient GAN framework with multi-scale feature extraction (MFE) that applies Haar wavelet transformation (HWT) and normalization to document images before feeding them to the GANs, together with novel generators, discriminators, and loss functions that further improve performance. Experiments show that MFE-GAN substantially reduces total training and inference time while matching state-of-the-art recognition performance.
链接: https://arxiv.org/abs/2512.14114
作者: Rui-Yang Ju,KokSheik Wong,Yanlin Jin,Jen-Shiun Chiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended Journal Version of APSIPA ASC 2025
Abstract:Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model’s performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at this https URL.
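A minimal sketch of the Haar-wavelet preprocessing step, using PyWavelets: decompose a document image with the HWT, normalize the sub-bands, and stack them as multi-scale input channels. The per-band zero-mean/unit-variance normalization is an assumption; the abstract only states that HWT and normalization precede the GANs.

```python
import numpy as np
import pywt

def haar_mfe(gray: np.ndarray) -> np.ndarray:
    """gray: (H, W) float image in [0, 1] -> (4, H/2, W/2) normalized sub-bands."""
    ll, (lh, hl, hh) = pywt.dwt2(gray, "haar")    # approximation + 3 detail bands
    bands = np.stack([ll, lh, hl, hh])
    # Per-band normalization before feeding the GAN
    mean = bands.mean(axis=(1, 2), keepdims=True)
    std = bands.std(axis=(1, 2), keepdims=True) + 1e-8
    return (bands - mean) / std

doc = np.random.rand(256, 256).astype(np.float32)  # stand-in degraded document
feats = haar_mfe(doc)
print(feats.shape)  # (4, 128, 128)
```

Because the decomposition halves spatial resolution while exposing horizontal, vertical, and diagonal detail bands, a single GAN can consume multi-scale structure that would otherwise require per-channel models.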
[CV-74] Selective Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach
【速读】: This paper addresses the practical need to "unlearn" specific object classes from pretrained models such as CLIP without extra data or retraining, while leaving performance on unrelated tasks intact. The key is a training- and data-free unlearning framework that builds a multimodal nullspace in CLIP's joint embedding space by synergistically combining text prompts with synthesized visual prototypes, efficiently removing target-class information while preserving the rest, and supporting controllable forgetting that is global across all domains, domain-specific, or complete within selected domains.
链接: https://arxiv.org/abs/2512.14113
作者: Ashish Mishra,Gyanaranjan Nayak,Tarun Kumar,Arpit Shah,Suparna Bhattacharya,Martin Foltin
机构: Hewlett Packard Labs (惠普实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or “unlearning”) of specific object classes without requiring additional data or retraining, or affecting the model’s performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By leveraging a multimodal nullspace through synergistic integration of text prompts and synthesized visual prototypes derived from CLIP’s joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.
[CV-75] Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries
【速读】: This paper addresses the limited explainability and weak handling of complex spatial relations in existing remote sensing large vision-language models (RS-LVLMs) for text-to-image retrieval over remote sensing (RS) imagery. The core solution, RUNE (Reasoning Using Neurosymbolic Entities), combines large language models (LLMs) with neurosymbolic AI: the LLM translates a text query into a First-Order Logic (FOL) expression, and a neurosymbolic inference module explicitly reasons over the compatibility between detected entities and that expression, improving retrieval performance, robustness, and interpretability. Crucially, instead of end-to-end joint embeddings, a logic-decomposition strategy runs reasoning over conditioned subsets of detected entities, which raises computational efficiency while strengthening the handling of complex semantics and spatial relations.
链接: https://arxiv.org/abs/2512.14102
作者: Emanuele Mezzi,Gertjan Burghouts,Maarten Kruithof
机构: Vrije Universiteit Amsterdam (阿姆斯特丹自由大学); TNO (荷兰应用科学研究组织)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
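To illustrate the neurosymbolic step, here is a minimal sketch in the spirit of RUNE: assume the LLM has already translated the query "a scene with at least two ships left of a storage tank" into a small logical form, which is then evaluated against detected entities per image. The entity schema and the `left_of` predicate are stand-ins, not RUNE's actual representation.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    label: str
    x: float  # center x in image coordinates

def left_of(a: Entity, b: Entity) -> bool:
    return a.x < b.x

def satisfies(entities) -> bool:
    """FOL-style check: EXISTS t. |{s : ship(s) AND left_of(s, t)}| >= 2"""
    tanks = [e for e in entities if e.label == "storage_tank"]
    ships = [e for e in entities if e.label == "ship"]
    return any(sum(left_of(s, t) for s in ships) >= 2 for t in tanks)

images = {
    "img_001": [Entity("ship", 10), Entity("ship", 40), Entity("storage_tank", 90)],
    "img_002": [Entity("ship", 95), Entity("storage_tank", 20)],
}
print([name for name, ents in images.items() if satisfies(ents)])  # ['img_001']
```

Restricting evaluation to the conditioned subsets (here, only tanks and ships) is what the paper's logic-decomposition strategy exploits for speed.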
[CV-76] ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models
【速读】: This paper addresses the difficulty of maintaining geometric consistency when generating multi-view images from a single image and a text description. Existing methods typically depend on 3D-aware architectures or specialized diffusion models that require large multi-view datasets and complex geometric priors. The key is ViewMask-1-to-3, the first approach to apply discrete diffusion models to multi-view image generation: multi-view synthesis is cast as a discrete sequence problem, each view is tokenized with MAGVIT-v2, and language and vision are unified through masked token prediction, with views generated progressively by iterative unmasking. Simple random masking plus self-attention suffices for cross-view consistency, with no complex 3D geometric constraints or specialized attention structures, and the method ranks first on average across the GSO and 3D-FUTURE datasets in PSNR, SSIM, and LPIPS while staying architecturally simple.
链接: https://arxiv.org/abs/2512.14099
作者: Ruishu Zhu,Zhihao Huang,Jiacheng Sun,Ping Luo,Hongyuan Zhang,Xuelong Li
机构: School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University (西北工业大学人工智能学院); Institute of Artificial Intelligence of China Telecom (TeleAI) (中国电信人工智能研究院); Huawei Technologies Co., Ltd. (华为技术有限公司); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
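A minimal sketch of iterative token unmasking as described above: start from fully masked token grids for the novel views and, over several rounds, commit the most confident predictions. The transformer is a random stand-in and the confidence schedule is a common choice, not necessarily the paper's; MAGVIT-v2 tokenization is assumed upstream.

```python
import torch

def iterative_unmask(predict_logits, tokens, mask, rounds=8):
    # tokens: (L,) long, mask: (L,) bool (True = still masked)
    for r in range(rounds):
        if not mask.any():
            break
        logits = predict_logits(tokens, mask)            # (L, V)
        probs, cand = logits.softmax(-1).max(-1)         # confidence + argmax token
        k = max(1, int(mask.sum().item() / (rounds - r)))
        conf = torch.where(mask, probs, torch.full_like(probs, -1.0))
        keep = conf.topk(k).indices                      # most confident masked slots
        tokens[keep], mask[keep] = cand[keep], False     # commit and unmask them
    return tokens

V, L = 1024, 256                                         # vocab size, tokens per view
predict_logits = lambda t, m: torch.randn(L, V)          # stand-in for the transformer
tokens = torch.zeros(L, dtype=torch.long)
mask = torch.ones(L, dtype=torch.bool)
out = iterative_unmask(predict_logits, tokens, mask)
print(out.shape, mask.sum().item())                      # all slots committed
```

Running the same loop jointly over the token grids of all target views is what lets self-attention propagate cross-view consistency without explicit 3D constraints.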
[CV-77] OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration
【速读】: This paper addresses the high computational cost of diffusion-based image generation, in particular the doubled cost of Classifier-Free Guidance (CFG), which requires a conditional and an unconditional forward pass at every timestep. The key idea of OUSAC (Optimized gUidance Scheduling with Adaptive Caching) is to exploit variable guidance scales for sparse computation: adjusting the scale at some timesteps compensates for skipping CFG at others, reducing both the total number of sampling steps and the number of CFG steps without hurting quality. Because dynamic guidance breaks the assumptions of standard caching, a two-stage optimization is introduced: stage one uses an evolutionary algorithm to jointly choose which timesteps to skip and which guidance scales to use, eliminating up to 82% of unconditional passes; stage two adds adaptive rank allocation that calibrates caching per transformer block so caching stays effective under varying guidance.
链接: https://arxiv.org/abs/2512.14096
作者: Ruitong Sun,Tianze Yang,Wei Niu,Jin Sun
机构: University of Georgia (佐治亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages
Abstract:Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.
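A minimal sketch of the sampling idea: skip the unconditional pass at scheduled timesteps and let the remaining steps carry (possibly larger) guidance scales. The `model` stub, the alternating skip schedule, the scales, and the update rule are illustrative stand-ins; the paper finds the schedule and scales with an evolutionary search.

```python
import torch

def sample_with_sparse_cfg(model, x, timesteps, cfg_scale, skip_cfg):
    for i, t in enumerate(timesteps):
        eps_cond = model(x, t, conditional=True)
        if skip_cfg[i]:
            eps = eps_cond                               # unconditional pass skipped
        else:
            eps_uncond = model(x, t, conditional=False)
            # per-step scale can compensate for the skipped steps elsewhere
            eps = eps_uncond + cfg_scale[i] * (eps_cond - eps_uncond)
        x = x - eps                                      # placeholder update rule
    return x

model = lambda x, t, conditional: 0.01 * torch.randn_like(x)
x = torch.randn(1, 4, 64, 64)
timesteps = list(range(10, 0, -1))
skip = [i % 2 == 1 for i in range(10)]                   # drop every other CFG step
scales = [7.5 if not s else 0.0 for s in skip]
out = sample_with_sparse_cfg(model, x, timesteps, scales, skip)
print(out.shape)
```

In this toy schedule half the unconditional passes vanish; OUSAC's search instead optimizes the skip pattern and per-step scales jointly against generation quality.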
[CV-78] AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation AAAI2026
【速读】: This paper addresses the limited scalability of text-driven 4D human-object interaction (HOI) generation: supervised learning is bottlenecked by the lack of large-scale 4D HOI datasets, while existing zero-shot approaches distill only weak interaction cues during generation and apply to restricted scenarios. The key is AnchorHOI, which extends image-diffusion priors with video diffusion models and introduces an anchor-based prior distillation strategy: two tailored anchors, interaction-aware Neural Radiance Fields (NeRFs) for expressive interaction composition and anchor keypoints for realistic human motion synthesis, decompose the high-dimensional 4D HOI generation task into a tractable two-step process, clearly improving diversity and generalization.
链接: https://arxiv.org/abs/2512.14095
作者: Sisi Dai,Kai Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026
Abstract:Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.
[CV-79] Quality-Aware Framework for Video-Derived Respiratory Signals
【速读】: This paper addresses the unreliability of video-based respiratory rate (RR) estimation caused by inconsistent signal quality across extraction methods: physiological signals from different sources, such as facial remote photoplethysmography (rPPG), upper-body motion, and deep-learning features, vary in quality and therefore in RR accuracy. The key is a predictive, quality-aware framework that integrates ten heterogeneous signal sources with four spectral estimators (Welch's method, Multiple Signal Classification (MUSIC), the Fast Fourier Transform (FFT), and peak detection) to build segment-level quality indices; machine-learning models trained on these indices predict each signal's accuracy or select the most reliable signal, enabling adaptive signal fusion and quality-based segment filtering that improve the robustness and generalization of the overall RR estimate.
链接: https://arxiv.org/abs/2512.14093
作者: Nhi Nguyen,Constantino Álvarez Casado,Le Nguyen,Manuel Lage Cañellas,Miguel Bordallo López
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 6 pages, 1 figure, 2 tables, conference
Abstract:Video-based respiratory rate (RR) estimation is often unreliable due to inconsistent signal quality across extraction methods. We present a predictive, quality-aware framework that integrates heterogeneous signal sources with dynamic assessment of reliability. Ten signals are extracted from facial remote photoplethysmography (rPPG), upper-body motion, and deep learning pipelines, and analyzed using four spectral estimators: Welch’s method, Multiple Signal Classification (MUSIC), Fast Fourier Transform (FFT), and peak detection. Segment-level quality indices are then used to train machine learning models that predict accuracy or select the most reliable signal. This enables adaptive signal fusion and quality-based segment filtering. Experiments on three public datasets (OMuSense-23, COHFACE, MAHNOB-HCI) show that the proposed framework achieves lower RR estimation errors than individual methods in most cases, with performance gains depending on dataset characteristics. These findings highlight the potential of quality-driven predictive modeling to deliver scalable and generalizable video-based respiratory monitoring solutions.
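One of the four estimators is easy to demonstrate end to end. The sketch below applies Welch's method (SciPy) to a synthetic video-derived respiratory signal, picks the dominant frequency inside a plausible breathing band, and converts it to breaths per minute; the frame rate, band limits, and noise level are stand-ins.

```python
import numpy as np
from scipy.signal import welch

fs = 30.0                                    # video frame rate (Hz)
t = np.arange(0, 60, 1 / fs)                 # one minute of signal
rr_hz = 0.25                                 # ground truth: 15 breaths/min
signal = np.sin(2 * np.pi * rr_hz * t) + 0.3 * np.random.randn(t.size)

freqs, psd = welch(signal, fs=fs, nperseg=1024)
band = (freqs >= 0.1) & (freqs <= 0.5)       # physiological respiratory band
rr_bpm = 60.0 * freqs[band][np.argmax(psd[band])]
print(f"estimated RR: {rr_bpm:.1f} breaths/min")
```

A segment-level quality index in the framework's spirit could then be, for instance, the ratio of in-band power to total power of `psd`, which drops for motion-corrupted segments.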
[CV-80] ProtoFlow: Interpretable and Robust Surgical Workflow Modeling with Learned Dynamic Scene Graph Prototypes
【速读】: This paper addresses the slow progress of fine-grained surgical recognition in AI-assisted surgery caused by high annotation cost, data scarcity, and the lack of interpretable models; the core challenge is to model and understand complex surgical workflows accurately from limited data. The key is ProtoFlow, a graph neural network (GNN) encoder-decoder framework that combines self-supervised pretraining with prototype-based fine-tuning to automatically discover and refine clinically meaningful dynamic scene graph prototypes, delivering interpretable workflow analysis while remaining highly robust in limited-data and few-shot settings.
链接: https://arxiv.org/abs/2512.14092
作者: Felix Holm,Ghazal Ghazaei,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner. Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis. Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications. Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.
[CV-81] GaussianPlant: Structure-aligned Gaussian Splatting for 3D Reconstruction of Plants
【速读】: This paper addresses the fact that existing 3D Gaussian Splatting (3DGS) methods reconstruct plant appearance well but lack any structural representation (e.g., branching patterns and leaf distribution), which limits applications such as plant phenotyping. The key is GaussianPlant, a hierarchical 3DGS representation that explicitly disentangles structure from appearance: structure primitives (StPs) model simplified plant geometry (branches as cylinders, leaves as disks), while appearance primitives (ApPs) bound to each StP preserve high-fidelity appearance. StPs and ApPs are jointly optimized with a re-rendering loss and gradient flow from ApPs to StPs, enabling accurate extraction of branch structure and leaf instances alongside faithful appearance reconstruction.
链接: https://arxiv.org/abs/2512.14087
作者: Yang Yang,Risa Shinoda,Hiroaki Santo,Fumio Okura
机构: The University of Osaka (大阪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE TPAMI, under review
Abstract:We present a method for jointly recovering the appearance and internal structure of botanical plants from multi-view images based on 3D Gaussian Splatting (3DGS). While 3DGS exhibits robust reconstruction of scene appearance for novel-view synthesis, it lacks structural representations underlying those appearances (e.g., branching patterns of plants), which limits its applicability to tasks such as plant phenotyping. To achieve both high-fidelity appearance and structural reconstruction, we introduce GaussianPlant, a hierarchical 3DGS representation, which disentangles structure and appearance. Specifically, we employ structure primitives (StPs) to explicitly represent branch and leaf geometry, and appearance primitives (ApPs) to represent the plants' appearance using 3D Gaussians. StPs represent a simplified structure of the plant, i.e., modeling branches as cylinders and leaves as disks. To accurately distinguish the branches and leaves, StP's attributes (i.e., branches or leaves) are optimized in a self-organized manner. ApPs are bound to each StP to represent the appearance of branches or leaves as in conventional 3DGS. StPs and ApPs are jointly optimized using a re-rendering loss on the input multi-view images, as well as the gradient flow from ApP to StP using the binding correspondence information. We conduct experiments to quantitatively evaluate the reconstruction accuracy of both appearance and structure, as well as real-world experiments to qualitatively validate the practical performance. Experiments show that the GaussianPlant achieves both high-fidelity appearance reconstruction via ApPs and accurate structural reconstruction via StPs, enabling the extraction of branch structure and leaf instances.
[CV-82] SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding
【速读】: This paper addresses why block-wise discrete diffusion, despite its attractive balance of parallel generation and causal dependency modeling, has remained impractical for vision-language understanding (VLU): high training cost, slow convergence, and instability have kept it behind autoregressive (AR) baselines. The key is SDAR-VL, a systematic framework for efficient and stable training with three components: (1) Asynchronous Block-wise Noise Scheduling, which diversifies supervision within each batch; (2) Effective Mask Ratio Scaling, which provides unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum, which grows effective mask coverage while preserving corruption diversity. Together these markedly improve training efficiency, convergence stability, and task performance: across 21 single-image, multi-image, and video benchmarks, SDAR-VL consistently beats conventional block diffusion and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision and LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
链接: https://arxiv.org/abs/2512.14068
作者: Shuang Cheng,Yuhua Jiang,Zineng Zhou,Dawei Liu,Wang Tao,Linfeng Zhang,Biqing Qi,Bowen Zhou
机构: Zhejiang University (浙江大学); Shanghai AI Laboratory (上海人工智能实验室); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
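The abstract does not define Effective Mask Ratio Scaling precisely, so the sketch below shows one plausible reading (an assumption): normalize the masked-token loss by the expected, rather than realized, mask count, so that batches with unusually few or many corrupted tokens do not change the loss scale.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(logits, targets, mask, expected_mask_ratio):
    # logits: (B, L, V), targets: (B, L), mask: (B, L) bool for corrupted tokens
    per_tok = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view(targets.shape)
    masked_sum = (per_tok * mask).sum()
    expected_count = expected_mask_ratio * mask.numel()   # expected, not realized, count
    return masked_sum / expected_count                    # unbiased loss scale

B, L, V = 2, 16, 100
logits = torch.randn(B, L, V)
targets = torch.randint(0, V, (B, L))
mask = torch.rand(B, L) < 0.4                             # stochastic block-wise mask
print(masked_diffusion_loss(logits, targets, mask, expected_mask_ratio=0.4))
```

Dividing by the realized count `mask.sum()` instead would up-weight lightly masked batches, which is exactly the bias such a scaling is meant to remove.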
[CV-83] Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution
【速读】: This paper addresses three key problems in diffusion-based one-step image super-resolution (ISR): (1) fidelity loss caused by compression encoding of low-quality (LQ) inputs; (2) insufficient region-level activation of generative priors, which limits perceptual quality; and (3) misalignment between text prompts and their corresponding semantic regions. The key is CODSR, a controllable one-step diffusion network with three components: an LQ-guided feature modulation module that exploits the original, uncompressed LQ input to provide high-fidelity conditioning; region-adaptive generative prior activation that enriches perception without sacrificing local structural fidelity; and a text-matching guidance strategy that aligns text prompts with the image's semantic regions. Experiments show superior perceptual quality over state-of-the-art methods while retaining efficient one-step inference.
链接: https://arxiv.org/abs/2512.14061
作者: Hao Chen,Junyang Chen,Jinshan Pan,Jiangxin Dong
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods with efficient one-step inference.
[CV-84] Real-time prediction of workplane illuminance distribution for daylight-linked controls using non-intrusive multimodal deep learning
【速读】: This paper addresses real-time prediction of indoor workplane illuminance, which static models struggle with under dynamic occupancy; the core challenge is accurate, timely prediction from non-intrusive images without complex or intrusive sensing. The key is a multimodal deep-learning framework that extracts image features only from the side-lit window areas rather than interior pixels, capturing the temporal-spatial variation of daylight while remaining applicable in dynamically occupied spaces. In field experiments the model achieved R² of 0.98 with RMSE 0.14 on a same-distribution test set and R² of 0.82 with RMSE 0.17 on an unseen-day test set, demonstrating high accuracy and good temporal generalization.
链接: https://arxiv.org/abs/2512.14058
作者: Zulin Zhuang,Yu Bian
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Daylight-linked controls (DLCs) have significant potential for energy savings in buildings, especially when abundant daylight is available and indoor workplane illuminance can be accurately predicted in real time. Most existing studies on indoor daylight predictions were developed and tested for static scenes. This study proposes a multimodal deep learning framework that predicts indoor workplane illuminance distributions in real time from non-intrusive images with temporal-spatial features. By extracting image features only from the side-lit window areas rather than interior pixels, the approach remains applicable in dynamically occupied indoor spaces. A field experiment was conducted in a test room in Guangzhou (China), where 17,344 samples were collected for model training and validation. The model achieved R2 0.98 with RMSE 0.14 on the same-distribution test set and R2 0.82 with RMSE 0.17 on an unseen-day test set, indicating high accuracy and acceptable temporal generalization.
[CV-85] FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling
【速读】: This paper addresses the limitation of treating talking-face editing and face generation as separate problems by unifying them under a more fundamental subtask: speech-conditional facial motion infilling. The core solution is FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching that, inspired by masked autoencoders, reconstructs masked facial motions from the surrounding motions and speech. The framework supports localized edits (substitution, insertion, deletion) with seamless transitions into unedited regions, while biased attention and temporal smoothness constraints improve boundary continuity and lip synchronization. Under this unified formulation, talking-face editing and generation emerge naturally as subtasks, with effectiveness and generalization validated on the newly proposed FacEDiTBench dataset.
链接: https://arxiv.org/abs/2512.14056
作者: Kim Sung-Bin,Joohyun Chang,David Harwath,Tae-Hyun Oh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.
[CV-86] Expert Switching for Robust AAV Landing: A Dual-Detector Framework in Simulation
【速读】: This paper addresses reliable helipad detection for autonomous aerial vehicles (AAVs) under GPS-denied or visually degraded conditions, and in particular the detection degradation caused by extreme scale transitions during descent. The key is a scale-adaptive dual-expert perception framework that decomposes detection into far-range and close-range regimes, training two YOLOv8 experts: one specialized for small, low-resolution helipads and one for high-precision localization when the target dominates the field of view. At inference both experts run in parallel, and a geometric gating mechanism selects the output most consistent with the current viewpoint, yielding robust perception across scales and clearly improving landing stability and accuracy.
链接: https://arxiv.org/abs/2512.14054
作者: Humaira Tasnim,Ashik E Rasul,Bruce Jo,Hyung-Jin Yoon
机构: Tennessee Technological University (田纳西理工大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable helipad detection is essential for Autonomous Aerial Vehicle (AAV) landing, especially under GPS-denied or visually degraded conditions. While modern detectors such as YOLOv8 offer strong baseline performance, single-model pipelines struggle to remain robust across the extreme scale transitions that occur during descent, where helipads appear small at high altitude and large near touchdown. To address this limitation, we propose a scale-adaptive dual-expert perception framework that decomposes the detection task into far-range and close-range regimes. Two YOLOv8 experts are trained on scale-specialized versions of the HelipadCat dataset, enabling one model to excel at detecting small, low-resolution helipads and the other to provide high-precision localization when the target dominates the field of view. During inference, both experts operate in parallel, and a geometric gating mechanism selects the expert whose prediction is most consistent with the AAV’s viewpoint. This adaptive routing prevents the degradation commonly observed in single-detector systems when operating across wide altitude ranges. The dual-expert perception module is evaluated in a closed-loop landing environment that integrates CARLA’s photorealistic rendering with NASA’s GUAM flight-dynamics engine. Results show substantial improvements in alignment stability, landing accuracy, and overall robustness compared to single-detector baselines. By introducing a scale-aware expert routing strategy tailored to the landing problem, this work advances resilient vision-based perception for autonomous descent and provides a foundation for future multi-expert AAV frameworks.
[CV-87] SELECT: Detecting Label Errors in Real-world Scene Text Data
【速读】: This paper addresses label-error detection in real-world scene text datasets, particularly the challenges of variable-length sequence labels, label-sequence misalignment, and character-level errors. The key is SELECT (Scene tExt Label Errors deteCTion), a multimodal approach built on an image-text encoder and a character-level tokenizer that can effectively identify and correct these error types. It is complemented by Similarity-based Sequence Label Corruption (SSLC), which injects realistic label errors during training, including length changes and substitutions of visually similar characters, improving robustness to label noise and detection capability.
链接: https://arxiv.org/abs/2512.14050
作者: Wenjun Liu,Qian Wu,Yifeng Hu,Yuke Li
机构: Yidun AI Lab, NetEase(网易)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce SELECT (Scene tExt Label Errors deteCTion), a novel approach that leverages multi-modal training to detect label errors in real-world scene text datasets. Utilizing an image-text encoder and a character-level tokenizer, SELECT addresses the issues of variable-length sequence labels, label sequence misalignment, and character-level errors, outperforming existing methods in accuracy and practical utility. In addition, we introduce Similarity-based Sequence Label Corruption (SSLC), a process that intentionally introduces errors into the training labels to mimic real-world error scenarios during training. SSLC can not only change the sequence length but also accounts for the visual similarity between characters during corruption. Our method is the first to successfully detect label errors in real-world scene text datasets while accounting for variable-length labels. Experimental results demonstrate the effectiveness of SELECT in detecting label errors and improving STR accuracy on real-world text datasets, showcasing its practical utility.
[CV-88] OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
【速读】: This paper addresses the reliability failures, most notably object hallucination, that block the deployment of vision-language models (VLMs) in safety-critical domains such as autonomous driving. Existing text-based Chain-of-Thought (CoT) methods suffer from two fundamental flaws: perception and reasoning are decoupled, preventing end-to-end joint optimization, and they depend on expensive, dense localization annotations. The key is OmniDrive-R1, which introduces an interleaved multimodal Chain-of-Thought (iMCoT) with reinforcement-driven visual grounding, letting the model autonomously focus on and zoom into critical regions for fine-grained analysis. This capability is enabled by a pure two-stage reinforcement-learning pipeline and the Clip-GRPO algorithm, whose annotation-free, process-based grounding reward removes the need for dense labels and stabilizes reasoning through a real-time cross-modal consistency constraint. Experiments show that, over the Qwen2.5VL-7B baseline, OmniDrive-R1 lifts the overall reasoning score from 51.77% to 80.35% and final-answer accuracy from 37.81% to 73.62%.
链接: https://arxiv.org/abs/2512.14044
作者: Zhenguo Zhang,Haohan Zhen,Yishen Wang,Le Xu,Tianchen Deng,Xuefeng Chen,Qu Chen,Bo Zhang,Wuxiong Huang
机构: Shanghaitech University (上海科技大学); Tsinghua University (清华大学); Tongji University (同济大学); Shanghai Jiao Tong University (上海交通大学); MEGVII Technology (旷视科技); Mach Drive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization annotations. In this work, we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
[CV-89] ChartAgent: A Chart Understanding Framework with Tool-Integrated Reasoning
【速读】: This paper addresses the heavy reliance of current multimodal large language models (MLLMs) on explicit textual annotations in chart understanding, where performance degrades sharply when key numerals are missing. The key is the ChartAgent framework, built on Tool-Integrated Reasoning (TIR): complex chart analysis is decomposed into a sequence of observable, replayable steps, and more than a dozen modular tools, including key-element detection, instance segmentation, and optical character recognition (OCR), are orchestrated dynamically for systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, intermediate outputs are consolidated into a structured Evidence Package that provides traceable, reproducible support for final conclusions, substantially improving robustness and trustworthiness under sparse annotation.
链接: https://arxiv.org/abs/2512.14040
作者: Boran Wang,Xinming Wang,Yi Chen,Xiang Li,Jian Xu,Jing Yuan,Chenglin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations and their performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key-element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, ChartAgent moves beyond the black-box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.
[CV-90] ASAP-Textured Gaussians: Enhancing Textured Gaussians with Adaptive Sampling and Anisotropic Parameterization
【速读】:该论文旨在解决现有纹理化高斯方法(textured Gaussian methods)中存在的内存效率低下问题,具体表现为两个关键限制:一是纹理通常定义在规范空间(canonical space),导致采样效率低,浪费纹理容量于视觉贡献度较低的区域;二是纹理参数分配方式均匀,未考虑不同高斯基元的视觉复杂度差异,造成过参数化。解决方案的核心在于提出ASAP Textured Gaussians(Adaptive Sampling and Anisotropic Parameterization),通过基于高斯密度分布的自适应采样策略和根据渲染误差驱动的各向异性参数化机制,实现纹理资源的精准分配,从而显著提升质量-效率权衡,在大幅减少纹理参数数量的同时保持高保真渲染效果。
链接: https://arxiv.org/abs/2512.14039
作者: Meng Wei,Cheng Zhang,Jianmin Zheng,Hamid Rezatofighi,Jianfei Cai
机构: Monash University (莫纳什大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances have equipped 3D Gaussian Splatting with texture parameterizations to capture spatially varying attributes, improving the performance of both appearance modeling and downstream tasks. However, the added texture parameters introduce significant memory efficiency challenges. Rather than proposing new texture formulations, we take a step back to examine the characteristics of existing textured Gaussian methods and identify two key limitations in common: (1) Textures are typically defined in canonical space, leading to inefficient sampling that wastes textures’ capacity on low-contribution regions; and (2) texture parameterization is uniformly assigned across all Gaussians, regardless of their visual complexity, resulting in over-parameterization. In this work, we address these issues through two simple yet effective strategies: adaptive sampling based on the Gaussian density distribution and error-driven anisotropic parameterization that allocates texture resources according to rendering error. Our proposed ASAP Textured Gaussians, short for Adaptive Sampling and Anisotropic Parameterization, significantly improve the quality-efficiency tradeoff, achieving high-fidelity rendering with far fewer texture parameters.
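As a rough illustration of error-driven, anisotropic allocation, the NumPy sketch below distributes a global texel budget in proportion to per-Gaussian rendering error and shapes each texture by the primitive's axis ratio. The allocation rule, bounds, and variable names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def allocate_texture_res(errors, scales, budget, min_res=1, max_res=16):
    """Error-driven, anisotropic texture allocation (illustrative sketch).
    errors: (N,) per-Gaussian rendering error; scales: (N, 2) axis lengths."""
    share = errors / (errors.sum() + 1e-8)             # budget share per Gaussian
    texels = np.clip(share * budget, min_res ** 2, max_res ** 2)
    aspect = scales[:, 0] / (scales[:, 1] + 1e-8)      # anisotropy from axis ratio
    res_u = np.clip(np.rint(np.sqrt(texels * aspect)), min_res, max_res)
    res_v = np.clip(np.rint(np.sqrt(texels / aspect)), min_res, max_res)
    return res_u.astype(int), res_v.astype(int)

errs = np.random.rand(1000)
scl = np.random.rand(1000, 2) + 0.1
ru, rv = allocate_texture_res(errs, scl, budget=1000 * 36)  # ~6x6 texels on average
```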
zh
[CV-91] ACE-SLAM: Scene Coordinate Regression for Neural Implicit Real-Time SLAM
【速读】:该论文旨在解决传统RGB-D SLAM系统在实时性、内存效率及隐私保护方面的局限性,尤其是在构建和维护高精度三维场景地图时面临的挑战。其核心问题在于如何实现高效、低延迟且具备隐私特性的神经隐式SLAM框架。解决方案的关键在于首次将场景坐标回归(Scene Coordinate Regression, SCR)作为神经SLAM管道中的核心隐式地图表示方法:通过训练轻量级网络直接从二维图像特征映射到三维全局坐标,SCR不仅实现了内存占用极低的3D地图表达,还支持极快的重定位速度,并天然具备隐私保护特性。作者设计了一种专用于此目的的新型SCR架构,并明确了将SCR集成进实时SLAM流水线所需的关键技术决策,从而在保证系统简洁灵活的同时,支持稀疏与稠密特征融合,在动态环境中无需额外调整即可稳定运行。
链接: https://arxiv.org/abs/2512.14032
作者: Ignacio Alzugaray,Marwan Taher,Andrew J. Davison
机构: Dyson Robotics Lab, Imperial College London (帝国理工学院戴森机器人实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Project Page: this https URL
Abstract:We present a novel neural RGB-D Simultaneous Localization And Mapping (SLAM) system that learns an implicit map of the scene in real time. For the first time, we explore the use of Scene Coordinate Regression (SCR) as the core implicit map representation in a neural SLAM pipeline, a paradigm that trains a lightweight network to directly map 2D image features to 3D global coordinates. SCR networks provide efficient, low-memory 3D map representations, enable extremely fast relocalization, and inherently preserve privacy, making them particularly suitable for neural implicit SLAM. Our system is the first one to achieve strict real-time in neural implicit RGB-D SLAM by relying on a SCR-based representation. We introduce a novel SCR architecture specifically tailored for this purpose and detail the critical design choices required to integrate SCR into a live SLAM pipeline. The resulting framework is simple yet flexible, seamlessly supporting both sparse and dense features, and operates reliably in dynamic environments without special adaptation. We evaluate our approach on established synthetic and real-world benchmarks, demonstrating competitive performance against the state of the art. Project Page: this https URL
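Conceptually, an SCR map is just a small network from per-pixel features to 3D global coordinates. The toy head below shows the shape of the idea under assumed dimensions; the paper's tailored architecture and training schedule are more sophisticated. For RGB-D input, back-projecting the depth map through the current pose estimate yields per-pixel target coordinates.

```python
import torch
import torch.nn as nn

class SceneCoordinateHead(nn.Module):
    """Per-pixel feature -> 3D scene coordinate (toy sketch, assumed dims)."""
    def __init__(self, feat_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, 1),   # (x, y, z) in the global map frame
        )

    def forward(self, feats):          # feats: (B, C, H, W)
        return self.mlp(feats)         # coords: (B, 3, H, W)

head = SceneCoordinateHead()
pred = head(torch.randn(1, 128, 60, 80))
target = torch.randn(1, 3, 60, 80)     # from depth back-projection in practice
loss = (pred - target).abs().mean()    # simple L1 supervision
```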
zh
[CV-92] Robust Single-shot Structured Light 3D Imaging via Neural Feature Decoding
【速读】:该论文旨在解决单帧结构光(Structured Light)系统在复杂场景下深度重建鲁棒性不足的问题,尤其是在遮挡、精细结构细节和非朗伯表面等挑战性条件下,传统基于像素域匹配的解码方法性能受限。其解决方案的关键在于提出一种基于学习的结构光解码框架,通过提取投影图案与捕获红外(IR)图像的神经特征,并在特征空间中构建代价体(cost volume)以实现更鲁棒的对应匹配,从而规避像素域匹配的脆弱性;同时引入一个深度细化模块,利用大规模单目深度估计模型中的强先验信息,提升细节恢复能力和全局结构一致性。该方法完全基于合成数据训练,且无需针对不同模式重新训练即可适应多种结构光模式,在真实室内环境中表现出优于商用结构光系统和被动立体RGB深度估计方法的性能。
链接: https://arxiv.org/abs/2512.14028
作者: Jiaheng Li,Qiyu Dai,Lihan Li,Praneeth Chakravarthula,He Sun,Baoquan Chen,Wenzheng Chen
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所); School of Intelligence Science and Technology, Peking University (北京大学智能科学与技术学院); Yuanpei College, Peking University (北京大学元培学院); College of Future Technology, Peking University (北京大学未来技术学院); State Key Laboratory of General Artificial Intelligence, Peking University (通用人工智能国家重点实验室); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We consider the problem of active 3D imaging using single-shot structured light systems, which are widely employed in commercial 3D sensing devices such as Apple Face ID and Intel RealSense. Traditional structured light methods typically decode depth correspondences through pixel-domain matching algorithms, resulting in limited robustness under challenging scenarios like occlusions, fine-structured details, and non-Lambertian surfaces. Inspired by recent advances in neural feature matching, we propose a learning-based structured light decoding framework that performs robust correspondence matching within feature space rather than the fragile pixel domain. Our method extracts neural features from the projected patterns and captured infrared (IR) images, explicitly incorporating their geometric priors by building cost volumes in feature space, achieving substantial performance improvements over pixel-domain decoding approaches. To further enhance depth quality, we introduce a depth refinement module that leverages strong priors from large-scale monocular depth estimation models, improving fine detail recovery and global structural coherence. To facilitate effective learning, we develop a physically-based structured light rendering pipeline, generating nearly one million synthetic pattern-image pairs with diverse objects and materials for indoor settings. Experiments demonstrate that our method, trained exclusively on synthetic data with multiple structured light patterns, generalizes well to real-world indoor environments, effectively processes various pattern types without retraining, and consistently outperforms both commercial structured light systems and passive stereo RGB-based depth estimation methods. Project page: this https URL.
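The feature-space matching step can be sketched as a correlation cost volume between IR-image features and projector-pattern features along candidate disparities. The disparity range, normalization, and shapes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def feature_cost_volume(ir_feat, pat_feat, max_disp=32):
    """Correlation cost volume in feature space (sketch).
    ir_feat, pat_feat: (B, C, H, W) rectified feature maps."""
    B, C, H, W = ir_feat.shape
    vol = ir_feat.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            vol[:, d] = (ir_feat * pat_feat).mean(1)
        else:
            # shift pattern features by candidate disparity d along x
            vol[:, d, :, d:] = (ir_feat[..., d:] * pat_feat[..., :-d]).mean(1)
    return F.softmax(vol, dim=1)  # per-pixel distribution over disparities

cost = feature_cost_volume(torch.randn(1, 64, 48, 64), torch.randn(1, 64, 48, 64))
```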
zh
[CV-93] Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers
【速读】:该论文旨在解决多模态自监督学习(Self-Supervised Learning, SSL)在医学图像与表格数据联合建模中因表格数据异质性导致的跨队列迁移能力受限问题,即现有方法受制于刚性的表格建模机制,难以有效学习跨不同数据队列共享的可迁移医学知识。其解决方案的关键在于提出一种名为CITab的新颖SSL框架,通过引入语义感知的表格建模机制——将列标题作为语义线索整合进模型设计,从而增强特征表示的可迁移性与多源数据利用的可扩展性;同时创新性地设计了原型引导的线性层混合模块(Prototype-guided Mixture-of-linear layer, P-MoLin),实现对表格特征的专业化处理,有效应对表格数据的异质性并挖掘潜在的医学概念。
链接: https://arxiv.org/abs/2512.14026
作者: Yibing Fu,Yunpeng Zhao,Zhitao Zeng,Cheng Chen,Yueming Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders the multi-modal SSL methods from effectively learning transferrable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferrable knowledge learning and the scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on Alzheimer’s disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.
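The P-MoLin idea, as described, routes heterogeneous column embeddings over a small set of linear experts by similarity to learned prototypes. The sketch below is one plausible instantiation; the dimensions, cosine gating, and expert count are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeMoLin(nn.Module):
    """Prototype-guided mixture of linear layers (sketch of the P-MoLin idea).
    Each tabular-column embedding is soft-routed by prototype similarity."""
    def __init__(self, dim=256, n_experts=4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_experts, dim))
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

    def forward(self, x):                      # x: (B, n_cols, dim)
        sim = F.normalize(x, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        gate = F.softmax(sim, dim=-1)          # (B, n_cols, n_experts)
        out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, n, E, dim)
        return (gate.unsqueeze(-1) * out).sum(-2)

mo = PrototypeMoLin()
y = mo(torch.randn(2, 10, 256))   # 10 heterogeneous columns per subject
```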
zh
[CV-94] Deep Learning Perspective of Scene Understanding in Autonomous Robots
【速读】:该论文旨在解决自主机器人在动态和非结构化环境中进行场景理解时面临的感知与决策挑战,特别是传统几何模型在深度感知(depth perception)受限于遮挡和无纹理表面、语义推理能力不足等问题。解决方案的关键在于引入深度学习技术,通过改进目标检测、语义分割、实例分割、深度估计、三维重建以及视觉SLAM(Simultaneous Localization and Mapping)等模块,显著提升机器人对复杂环境的实时感知能力和语义理解水平,从而增强其在导航、决策与交互中的表现。
链接: https://arxiv.org/abs/2512.14020
作者: Afia Maham(National Textile University, Faisalabad, Pakistan),Dur E Nayab Tashfa(Independent Researcher)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages. Review Paper on Deep Learning Perspective of Scene Understanding in Autonomous Robots
Abstract:This paper provides a review of deep learning applications in scene understanding in autonomous robots, including innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM. It emphasizes how these techniques address limitations of traditional geometric models, improve depth perception in real time despite occlusions and textureless surfaces, and enhance semantic reasoning to understand the environment better. When these perception modules are integrated into dynamic and unstructured environments, they become more effective in decision-making, navigation, and interaction. Lastly, the review outlines the existing problems and research directions to advance learning-based scene understanding of autonomous robots.
zh
[CV-95] KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding WACV2026
【速读】:该论文旨在解决长视频问答(Long Video Question Answering, LVQA)中关键帧采样(Key Frame Sampling, KFS)评估不直接、不鲁棒的问题。现有方法仅通过问答准确率间接衡量帧选择质量,难以系统分析不同采样策略对内容覆盖的影响。为此,作者提出了首个针对长视频问答的关键帧采样基准KFS-Bench,其核心创新在于引入多场景标注(multi-scene annotations),使得能够直接评估采样策略在整段视频中的场景覆盖率和采样平衡性。解决方案的关键在于:1)设计了一个与问答准确率高度相关的新型采样质量指标,综合考虑采样精度、场景覆盖度和采样平衡性;2)提出一种基于问题-视频相关性的自适应平衡采样方法,通过权衡采样多样性与问题-帧相似性,提升相关场景的覆盖效果,从而显著改善关键帧采样质量和最终问答性能。
链接: https://arxiv.org/abs/2512.14017
作者: Zongyao Li,Kengo Ishida,Satoshi Yamazaki,Xiaotong Ji,Jianquan Liu
机构: NEC Corporation (日本电气公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: WACV2026
Abstract:We propose KFS-Bench, the first benchmark for key frame sampling in long video question answering (QA), featuring multi-scene annotations to enable direct and robust evaluation of sampling strategies. Key frame sampling is crucial for efficient long-form video understanding. In long video QA, selecting informative frames enables multimodal large language models (MLLMs) to improve both accuracy and efficiency. KFS-Bench addresses the limitation of prior works that only indirectly assess frame selection quality via QA accuracy. By providing ground-truth annotations of multiple disjoint scenes required per question, KFS-Bench allows us to directly analyze how different sampling approaches capture essential content across an entire long video. Using KFS-Bench, we conduct a comprehensive study of key frame sampling methods and identify that not only sampling precision but also scene coverage and sampling balance are the key factors influencing QA performance. Regarding all the factors, we design a novel sampling quality metric that correlates with QA accuracy. Furthermore, we develop a novel key frame sampling method that leverages question-video relevance to balance sampling diversity against question-frame similarity, thereby improving coverage of relevant scenes. Our adaptively balanced sampling approach achieves superior performance in both key frame sampling and QA performance. The benchmark is available at this https URL.
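One simple realization of trading question-frame similarity against diversity is greedy maximal-marginal-relevance selection over frame embeddings. The sketch below is such a heuristic baseline; the paper's adaptive sampler and its quality metric (precision, coverage, balance) differ in detail.

```python
import numpy as np

def balanced_keyframe_sampling(sim_q, feats, k=16, lam=0.5):
    """Greedy MMR-style frame selection (sketch).
    sim_q: (T,) question-frame similarity; feats: (T, D) frame embeddings."""
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    chosen = [int(np.argmax(sim_q))]
    for _ in range(k - 1):
        div = 1.0 - np.max(feats @ feats[chosen].T, axis=1)  # distance to chosen
        score = lam * sim_q + (1.0 - lam) * div
        score[chosen] = -np.inf
        chosen.append(int(np.argmax(score)))
    return sorted(chosen)

frames = balanced_keyframe_sampling(np.random.rand(900), np.random.rand(900, 512))
```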
zh
[CV-96] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
【速读】:该论文旨在解决掩码离散扩散模型(Masked Discrete Diffusion Models, MDMs)在推理阶段速度较慢的问题,其根本原因在于每一步采样过程中需重复处理大量冗余的掩码标记(masked tokens)。解决方案的关键在于提出Sparse-LaViDa框架,通过在每个推理步骤中动态裁剪不必要的掩码标记以加速采样过程;同时引入专用的注册标记(register tokens)作为被裁剪标记的紧凑表示,并设计一种专门的注意力掩码(attention mask),确保训练与推理阶段的裁剪策略一致,从而在保持生成质量的前提下实现最高达2倍的加速效果。
链接: https://arxiv.org/abs/2512.14008
作者: Shufan Li,Jiuxiang Gu,Kangning Liu,Zhe Lin,Zijun Wei,Aditya Grover,Jason Kuen
机构: Adobe(Adobe); UCLA(加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages (12 pages for the main paper and 6 pages for the appendix), 9 figures
Abstract:Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.
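The truncation mechanism can be pictured as: keep all visible tokens, keep only a subset of masked positions, and append a few learned register tokens that summarize what was dropped. In the sketch below, the "keep the first masked positions" rule and all shapes are placeholders; the method's actual selection and its matching attention mask are more careful.

```python
import torch

def truncate_masked(tokens, is_masked, register, keep_ratio=0.25, n_register=4):
    """Drop most masked positions before a denoising step and append register
    tokens that stand in for them (sketch; selection rule is illustrative).
    tokens: (B, L, D); is_masked: (B, L) bool; register: (1, D) learned."""
    B, L, D = tokens.shape
    out = []
    for b in range(B):
        n_keep = max(1, int(is_masked[b].sum().item() * keep_ratio))
        kept = tokens[b][~is_masked[b]]                 # visible tokens
        masked = tokens[b][is_masked[b]][:n_keep]       # subset of masked slots
        regs = register.expand(n_register, D)           # compact stand-ins
        out.append(torch.cat([kept, masked, regs], dim=0))
    return torch.nn.utils.rnn.pad_sequence(out, batch_first=True)

reg = torch.nn.Parameter(torch.zeros(1, 64))
x = truncate_masked(torch.randn(2, 32, 64), torch.rand(2, 32) > 0.5, reg)
```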
zh
[CV-97] CLAIM: Camera-LiDAR Alignment with Intensity and Monodepth IROS2025
【速读】:该论文旨在解决相机与激光雷达(LiDAR)之间的标定问题,即如何精确对齐来自这两种传感器的数据以实现跨模态感知融合。解决方案的关键在于提出 CLAIM 方法,该方法利用粗到精的搜索策略,在给定初始估计和图像-点云配对的基础上,通过最小化基于局部皮尔逊相关系数的结构损失(patched Pearson correlation-based structure loss)和基于互信息的纹理损失(mutual information-based texture loss)来优化相机到LiDAR的变换矩阵。这两个损失函数无需复杂的预处理、特征提取或匹配步骤,具有良好的适应性与鲁棒性,显著提升了标定精度与通用性。
链接: https://arxiv.org/abs/2512.14001
作者: Zhuo Zhang,Yonghui Liu,Meijie Zhang,Feiyang Tan,Yikang Ding
机构: Mach Drive (北京机器驱动科技有限公司)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IROS 2025
Abstract:In this paper, we unleash the potential of the powerful monodepth model in camera-LiDAR calibration and propose CLAIM, a novel method of aligning data from the camera and LiDAR. Given the initial guess and pairs of images and LiDAR point clouds, CLAIM utilizes a coarse-to-fine searching method to find the optimal transformation minimizing a patched Pearson correlation-based structure loss and a mutual information-based texture loss. These two losses serve as good metrics for camera-LiDAR alignment results and require no complicated steps of data processing, feature extraction, or feature matching like most methods, rendering our method simple and adaptive to most scenes. We validate CLAIM on public KITTI, Waymo, and MIAS-LCEC datasets, and the experimental results demonstrate its superior performance compared with the state-of-the-art methods. The code is available at this https URL.
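The patched Pearson structure loss is easy to write down: over local patches, correlate monodepth predictions against projected LiDAR depths and penalize low correlation. The sketch below assumes dense (H, W) inputs with zeros marking invalid LiDAR pixels; the patch size and validity threshold are illustrative.

```python
import torch

def patched_pearson_loss(mono_depth, lidar_depth, patch=32, eps=1e-6):
    """1 - mean local Pearson correlation over patches (sketch).
    mono_depth, lidar_depth: (H, W); zeros in lidar_depth mark invalid pixels."""
    losses = []
    H, W = mono_depth.shape
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            m = mono_depth[i:i + patch, j:j + patch].flatten()
            l = lidar_depth[i:i + patch, j:j + patch].flatten()
            valid = l > 0
            if valid.sum() < 8:        # skip patches with too few LiDAR hits
                continue
            mc = m[valid] - m[valid].mean()
            lc = l[valid] - l[valid].mean()
            r = (mc * lc).sum() / (mc.norm() * lc.norm() + eps)
            losses.append(1.0 - r)
    return torch.stack(losses).mean()

loss = patched_pearson_loss(torch.rand(128, 128) * 10, torch.rand(128, 128) * 10)
```

In the coarse-to-fine search, this loss (together with the mutual-information texture term) would be evaluated for each candidate camera-to-LiDAR transform.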
zh
[CV-98] Repurposing 2D Diffusion Models for 3D Shape Completion
【速读】:该论文旨在解决3D形状补全任务中因高质量3D数据稀缺和3D输入与2D潜在空间之间存在模态鸿沟而导致的生成式AI(Generative AI)模型性能受限的问题。其解决方案的关键在于提出了一种名为Shape Atlas的紧凑型2D几何表示方法,该方法一方面充分利用预训练2D扩散模型的强大生成能力,另一方面通过在条件输入与输出空间之间建立模态对齐,实现更有效的条件控制,从而在有限3D数据下仍能生成高质量且细节保留良好的补全结果。
链接: https://arxiv.org/abs/2512.13991
作者: Yao He,Youngjoong Kwon,Tiange Xiang,Wenxiao Cai,Ehsan Adeli
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a framework that adapts 2D diffusion models for 3D shape completion from incomplete point clouds. While text-to-image diffusion models have achieved remarkable success with abundant 2D data, 3D diffusion models lag due to the scarcity of high-quality 3D datasets and a persistent modality gap between 3D inputs and 2D latent spaces. To overcome these limitations, we introduce the Shape Atlas, a compact 2D representation of 3D geometry that (1) enables full utilization of the generative power of pretrained 2D diffusion models, and (2) aligns the modalities between the conditional input and output spaces, allowing more effective conditioning. This unified 2D formulation facilitates learning from limited 3D data and produces high-quality, detail-preserving shape completions. We validate the effectiveness of our results on the PCN and ShapeNet-55 datasets. Additionally, we show the downstream application of creating artist-created meshes from our completed point clouds, further demonstrating the practicality of our method.
zh
[CV-99] FocalComm: Hard Instance-Aware Multi-Agent Perception WACV2026
【速读】:该论文旨在解决多智能体协同感知(Multi-agent Collaborative Perception, CP)在自动驾驶中对行人等小尺寸、高风险目标检测性能不足的问题,尤其是现有方法因过度依赖车辆检测指标优化及全特征交换策略而易产生漏检,导致安全隐患。其解决方案的关键在于提出FocalComm框架,核心创新包括:(1) 可学习的渐进式难例挖掘(Learnable Progressive Hard Instance Mining, HIM)模块,用于提取每个智能体的难例导向特征;(2) 基于查询的特征级(中间层)融合机制,动态加权这些识别出的难例特征以实现高效协作。该设计显著提升了对行人等关键目标的检测精度,尤其在V2X-Real数据集上表现突出。
链接: https://arxiv.org/abs/2512.13982
作者: Dereje Shenkut,Vijayakumar Bhagavatula
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026
Abstract:Multi-agent collaborative perception (CP) is a promising paradigm for improving autonomous driving safety, particularly for vulnerable road users like pedestrians, via robust 3D perception. However, existing CP approaches often optimize for vehicle detection performance metrics, underperforming on smaller, safety-critical objects such as pedestrians, where detection failures can be catastrophic. Furthermore, previous CP methods rely on full feature exchange rather than communicating only salient features that help reduce false negatives. To this end, we present FocalComm, a novel collaborative perception framework that focuses on exchanging hard-instance-oriented features among connected collaborative agents. FocalComm consists of two key novel designs: (1) a learnable progressive hard instance mining (HIM) module to extract hard instance-oriented features per agent, and (2) a query-based feature-level (intermediate) fusion technique that dynamically weights these identified features during collaboration. We show that FocalComm outperforms state-of-the-art collaborative perception methods on two challenging real-world datasets (V2X-Real and DAIR-V2X) across both vehicle-centric and infrastructure-centric collaborative setups. FocalComm also shows a strong performance gain in pedestrian detection in V2X-Real.
zh
[CV-100] XAI-Driven Diagnosis of Generalization Failure in State-Space Cerebrovascular Segmentation Models: A Case Study on Domain Shift Between RSNA and TopCoW Datasets
【速读】:该论文旨在解决深度学习模型在医学影像临床部署中因领域偏移(domain shift)导致的泛化失败问题,即模型在源域表现优异但在目标域性能急剧下降,从而影响可信人工智能(Trustworthy AI)的实现。其解决方案的关键在于提出一种两阶段诊断框架:第一阶段量化源域(RSNA CTA Aneurysm)与目标域(TopCoW Circle of Willis CT)之间的领域差距,发现Z分辨率和背景噪声差异是重要因素;第二阶段核心创新为利用Seg-XRes-CAM方法对状态空间模型(State-Space Models, SSMs)如UMamba的注意力机制进行可解释性分析,通过计算注意力图与真实标注(Ground Truth)及预测掩膜之间的交并比(IoU),揭示模型在目标域中放弃了真实解剖特征,转而依赖错误预测的伪相关关系,从而证明了XAI(可解释人工智能)在识别新兴架构中数据偏差方面的强大诊断能力。
链接: https://arxiv.org/abs/2512.13977
作者: Youssef Abuzeid,Shimaa El-Bana,Ahmad Al-Kabbany
机构: Cairo University (开罗大学); Arab Academy for Science and Technology (阿拉伯科技学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The clinical deployment of deep learning models in medical imaging is severely hindered by domain shift. This challenge, where a high-performing model fails catastrophically on external datasets, is a critical barrier to trustworthy AI. Addressing this requires moving beyond simple performance metrics toward deeper understanding, making Explainable AI (XAI) an essential diagnostic tool in medical image analysis. We present a rigorous, two-phase approach to diagnose the generalization failure of state-of-the-art State-Space Models (SSMs), specifically UMamba, applied to cerebrovascular segmentation. We first established a quantifiable domain gap between our Source (RSNA CTA Aneurysm) and Target (TopCoW Circle of Willis CT) datasets, noting significant differences in Z-resolution and background noise. The model’s Dice score subsequently plummeted from 0.8604 (Source) to 0.2902 (Target). In the second phase, which is our core contribution, we utilized Seg-XRes-CAM to diagnose the cause of this failure. We quantified the model’s focus by measuring the overlap between its attention maps and the Ground Truth segmentations, and between its attention maps and its own Prediction Mask. Our analysis proves the model failed to generalize because its attention mechanism abandoned true anatomical features in the Target domain. Quantitative metrics confirm the model’s focus shifted away from the Ground Truth vessels (IoU~0.101 at 0.3 threshold) while still aligning with its own wrong predictions (IoU~0.282 at 0.3 threshold). This demonstrates the model learned spurious correlations, confirming XAI is a powerful diagnostic tool for identifying dataset bias in emerging architectures.
zh
[CV-101] Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle Segmentation
【速读】:该论文旨在解决海洋障碍物检测中因图像质量退化(如日光闪烁、雾天和快速变化的波浪模式)以及训练数据稀缺与结构重复导致的分割模型鲁棒性不足的问题。其核心解决方案是提出一种质量驱动且多样性感知的样本扩展流水线,该方法在推理阶段生成训练数据而无需重新训练扩散模型,关键创新在于:(i) 构建类感知风格库以生成高熵、语义合理的提示词,增强多样性;(ii) 设计自适应退火采样器通过扰动早期条件,并结合COD(Contour-aware Density)引导的比例控制器动态调节扰动强度,在不牺牲布局保真度的前提下显著提升样本多样性,从而有效改善多骨干网络下的分割性能,尤其增强对稀有及纹理敏感类别的视觉变异性。
链接: https://arxiv.org/abs/2512.13970
作者: Miaohua Zhang,Mohammad Ali Armin,Xuesong Li,Sisi Liang,Lars Petersson,Changming Sun,David Ahmedt-Aristizabal,Zeeshan Hayder
机构: CSIRO Data61; CSIRO Agriculture & Food
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Marine obstacle detection demands robust segmentation under challenging conditions, such as sun glitter, fog, and rapidly changing wave patterns. These factors degrade image quality, while the scarcity and structural repetition of marine datasets limit the diversity of available training data. Although mask-conditioned diffusion models can synthesize layout-aligned samples, they often produce low-diversity outputs when conditioned on low-entropy masks and prompts, limiting their utility for improving robustness. In this paper, we propose a quality-driven and diversity-aware sample expansion pipeline that generates training data entirely at inference time, without retraining the diffusion model. The framework combines two key components: (i) a class-aware style bank that constructs high-entropy, semantically grounded prompts, and (ii) an adaptive annealing sampler that perturbs early conditioning, while a COD-guided proportional controller regulates this perturbation to boost diversity without compromising layout fidelity. Across marine obstacle benchmarks, augmenting training data with these controlled synthetic samples consistently improves segmentation performance across multiple backbones and increases visual variation in rare and texture-sensitive classes.
zh
[CV-102] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation
【速读】:该论文旨在解决生成式 AI (Generative AI) 中文本到图像扩散模型对商标内容(包括显式标志和隐式结构特征)的未经授权再现问题。传统方法仅关注通用概念(如风格或名人),未能有效处理具体品牌标识,而品牌识别具有多维特性,不仅限于明确的logo,还包括诸如汽车前格栅等独特结构特征。解决方案的关键在于提出“去品牌化”(unbranding)这一新任务,即在保持语义一致性的前提下精细移除商标及细微的结构性品牌特征;同时构建了一个全面的基准数据集,并引入基于视觉语言模型(VLMs)的新评估指标,通过问答框架检测图像中显性和隐含的品牌特征,从而更准确地衡量去品牌化效果。实验证明,随着模型保真度提升(如SDXL、FLUX),品牌特征更易被合成,凸显了该问题的紧迫性与特殊性。
链接: https://arxiv.org/abs/2512.13953
作者: Dawid Malarz,Artur Kasymov,Filip Manjak,Maciej Zięba,Przemysław Spurek
机构: Jagiellonian University (雅盖隆大学); Wrocław University of Science and Technology (弗罗茨瓦夫理工大学); IDEAS Research Institute (IDEAS 研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car’s front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: this https URL.
zh
[CV-103] An evaluation of SVBRDF Prediction from Generative Image Models for Appearance Modeling of 3D Scenes
【速读】:该论文旨在解决快速外观建模流水线中基于单视图的表面反射率和光泽度(SVBRDF)预测所面临的多视角不一致性问题,即单视图预测可能导致纹理贴图在合并时出现不一致,从而影响最终三维场景纹理 atlas 的质量。解决方案的关键在于利用条件图像生成模型(conditional image generators)与 SVBRDF 预测网络的协同作用:一方面,通过生成与 3D 几何对齐的 RGB 图像作为输入,提升 SVBRDF 预测的准确性;另一方面,通过比较不同神经网络架构(如 UNet)及其条件输入(如深度图、法向量等),发现标准 UNet 在保持高精度的同时,仍能实现良好的多视角一致性,展现出简单设计在复杂任务中的有效性。
链接: https://arxiv.org/abs/2512.13950
作者: Alban Gauthier,Valentin Deschaintre,Alexandre Lanvin,Fredo Durand,Adrien Bousseau,George Drettakis
机构: Inria & Université Côte d’Azur, France; Adobe Research, UK; MIT, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this http URL Code: this http URL
Abstract:Digital content creation is experiencing a profound change with the advent of deep generative models. For texturing, conditional image generators now allow the synthesis of realistic RGB images of a 3D scene that align with the geometry of that scene. For appearance modeling, SVBRDF prediction networks recover material parameters from RGB images. Combining these technologies allows us to quickly generate SVBRDF maps for multiple views of a 3D scene, which can be merged to form a SVBRDF texture atlas of that scene. In this paper, we analyze the challenges and opportunities for SVBRDF prediction in the context of such a fast appearance modeling pipeline. On the one hand, single-view SVBRDF predictions might suffer from multiview incoherence and yield inconsistent texture atlases. On the other hand, generated RGB images, and the different modalities on which they are conditioned, can provide additional information for SVBRDF estimation compared to photographs. We compare neural architectures and conditions to identify designs that achieve high accuracy and coherence. We find that, surprisingly, a standard UNet is competitive with more complex designs. Project page: this http URL
zh
[CV-104] KLO-Net: A Dynamic K-NN Attention U-Net with CSP Encoder for Efficient Prostate Gland Segmentation from MRI
【速读】:该论文旨在解决前列腺磁共振成像(MRI)分割在临床工作站中实时部署时面临的计算负载高和内存占用大的问题,同时应对因解剖结构变异导致的深度学习分割方法精度不稳定的问题。其解决方案的关键在于提出一种动态K近邻注意力U-Net(KLO-Net),该模型结合了动态K-近邻注意力机制与交叉阶段部分(Cross Stage Partial, CSP)编码器结构:动态K-近邻注意力机制使模型能够自适应地为每个空间位置确定最优的注意力连接数量,从而提升分割精度;CSP模块则通过减少冗余计算有效降低内存消耗,增强模型效率。在PROMISE12和PROSTATEx两个公开数据集上的实验验证表明,该架构在保持高分割质量的同时显著提升了计算效率。
链接: https://arxiv.org/abs/2512.13902
作者: Anning Tian,Byunghyun Ko,Kaichen Qu,Mengyuan Liu,Jeongkyu Lee
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Accepted to SPIE Medical Imaging 2026: Image Processing
Abstract:Real-time deployment of prostate MRI segmentation on clinical workstations is often bottlenecked by computational load and memory footprint. Deep learning-based prostate gland segmentation approaches remain challenging due to anatomical variability. To bridge this efficiency gap while still maintaining reliable segmentation accuracy, we propose KLO-Net, a dynamic K-Nearest Neighbor attention U-Net with a Cross Stage Partial (CSP) encoder for efficient prostate gland segmentation from MRI scans. Unlike the regular K-NN attention mechanism, the proposed dynamic K-NN attention mechanism allows the model to adaptively determine the number of attention connections for each spatial location within a slice. In addition, CSP blocks address the computational load to reduce memory consumption. To evaluate the model’s performance, comprehensive experiments and ablation studies are conducted on two public datasets, i.e., PROMISE12 and PROSTATEx, to validate the proposed architecture. The detailed comparative analysis demonstrates the model’s advantage in computational efficiency and segmentation quality.
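Dynamic K-NN attention restricts each position to its top-k keys, with k itself predicted per position. The sketch below hard-rounds the predicted k (not differentiable as written; a straight-through estimator or soft mask would be used in practice) and only conveys the mechanism, not KLO-Net's exact design.

```python
import torch
import torch.nn.functional as F

def dynamic_knn_attention(q, k, v, k_logits, k_max=16):
    """Top-k attention with a per-position neighbour count (sketch).
    q, k, v: (B, N, D); k_logits: (B, N) 'how many neighbours' scores."""
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5        # (B, N, N)
    topk = scores.topk(k_max, dim=-1)                          # values + indices
    # per-position neighbour count in [1, k_max] (hard-rounded here)
    n = (torch.sigmoid(k_logits) * (k_max - 1) + 1).round().long()
    ranks = torch.arange(k_max, device=q.device)[None, None, :]
    mask = ranks < n[..., None]                                # keep first n ranks
    attn = F.softmax(topk.values.masked_fill(~mask, -1e9), dim=-1)
    gathered_v = v.unsqueeze(1).expand(-1, q.shape[1], -1, -1).gather(
        2, topk.indices.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1]))
    return (attn.unsqueeze(-1) * gathered_v).sum(2)            # (B, N, D)

out = dynamic_knn_attention(torch.randn(1, 64, 32), torch.randn(1, 64, 32),
                            torch.randn(1, 64, 32), torch.randn(1, 64))
```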
zh
[CV-105] Route-DETR: Pairwise Query Routing in Transformers for Object Detection
【速读】:该论文旨在解决检测变压器(DETR)中查询竞争效率低下问题,即多个查询收敛至相似位置导致冗余计算。解决方案的关键在于提出Route-DETR,通过在解码器自注意力层中引入自适应成对路由机制,区分竞争性查询(针对同一目标)与互补性查询(针对不同目标),利用查询间相似度、置信度得分和几何信息进行判断;并设计双路由机制:抑制路由(suppressor routes)调节竞争查询间的注意力以减少重复,委派路由(delegator routes)促进对不同区域的探索;上述机制通过可学习的低秩注意力偏置实现不对称查询交互,并采用双分支训练策略仅在训练阶段引入路由偏置,推理时保持标准注意力结构,从而在不增加额外计算开销的前提下显著提升性能。
链接: https://arxiv.org/abs/2512.13876
作者: Ye Zhang,Qi Chen,Wenyou Huang,Rui Liu,Zhengjian Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures
Abstract:Detection Transformer (DETR) offers an end-to-end solution for object detection by eliminating hand-crafted components like non-maximum suppression. However, DETR suffers from inefficient query competition where multiple queries converge to similar positions, leading to redundant computations. We present Route-DETR, which addresses these issues through adaptive pairwise routing in decoder self-attention layers. Our key insight is distinguishing between competing queries (targeting the same object) versus complementary queries (targeting different objects) using inter-query similarity, confidence scores, and geometry. We introduce dual routing mechanisms: suppressor routes that modulate attention between competing queries to reduce duplication, and delegator routes that encourage exploration of different regions. These are implemented via learnable low-rank attention biases enabling asymmetric query interactions. A dual-branch training strategy incorporates routing biases only during training while preserving standard attention for inference, ensuring no additional computational cost. Experiments on COCO and Cityscapes demonstrate consistent improvements across multiple DETR baselines, achieving +1.7% mAP gain over DINO on ResNet-50 and reaching 57.6% mAP on Swin-L, surpassing prior state-of-the-art models.
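One way to realize the suppressor idea is an additive, asymmetric bias on decoder self-attention logits built from query similarity, confidence gaps, and a learned low-rank term. Everything below (the competition score, the way the terms combine) is an assumption-level sketch, not the paper's routing functions or its training-only dual branch.

```python
import torch
import torch.nn.functional as F

def routing_bias(q_feat, conf, U, V):
    """Asymmetric low-rank attention bias between decoder queries (sketch).
    q_feat: (N, D) query features; conf: (N,) confidences; U, V: (D, r) learned."""
    sim = F.normalize(q_feat, dim=-1) @ F.normalize(q_feat, dim=-1).T   # (N, N)
    # pair (i, j) "competes" when features are similar and j is more confident
    compete = sim * (conf[None, :] - conf[:, None]).clamp(min=0)
    low_rank = (q_feat @ U) @ (q_feat @ V).T        # learned asymmetric term
    return low_rank - compete                       # suppress competing pairs

D, N = 256, 100
bias = routing_bias(torch.randn(N, D), torch.rand(N),
                    U=torch.randn(D, 8) * 0.01, V=torch.randn(D, 8) * 0.01)
# added to self-attention logits during training only; inference is unchanged
```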
zh
[CV-106] SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
【速读】:该论文旨在解决当前视频推理模型在处理不同长度视频时缺乏灵活性的问题,即现有最优(SOTA)模型通常采用单次推理方式处理大量帧,导致资源消耗高且无法像人类一样根据任务需求动态选择观看策略(如短片全看或长视频分段浏览)。其核心解决方案是提出SAGE系统,该系统通过多轮交互式推理实现“任意时长”(any-horizon)的视频理解能力:首先设计一个调度器SAGE-MM,利用Gemini-2.5-Flash生成合成数据进行训练,并结合强化学习(RL)后训练策略以增强其决策灵活性;其次构建SAGE-Bench基准测试集(平均时长>700秒),用于评估真实娱乐场景下的视频推理性能。实验证明,该方案在开放式视频推理任务中提升达6.1%,对超10分钟长视频的改进更高达8.2%。
链接: https://arxiv.org/abs/2512.13874
作者: Jitesh Jain,Jialuo Li,Zixian Ma,Jieyu Zhang,Chris Dongjoo Kim,Sangho Lee,Rohun Tripathi,Tanmay Gupta,Christopher Clark,Humphrey Shi
机构: SHI Labs @ Georgia Tech (佐治亚理工学院); Allen AI; University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.
zh
[CV-107] Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models
【速读】:该论文旨在解决无人机(UAV)场景下人体检测任务中因目标分布动态变化和标注数据稀缺导致的模型训练困难问题,以及合成数据与真实数据之间存在的域差异(domain gap)对检测性能的负面影响。其解决方案的关键在于提出一种分阶段的扩散模型框架——粗粒度到细粒度层次对齐(Coarse-to-Fine Hierarchical Alignment, CFHA),通过三个模块协同实现:(1) 全局风格迁移(Global Style Transfer),利用少量真实图像参考集将合成图像的颜色、光照和纹理统计特性映射至真实风格;(2) 局部细节增强(Local Refinement),采用超分辨率扩散模型提升小目标(如人体实例)的精细纹理与逼真度,同时保持形状与边界完整性;(3) 幻觉去除(Hallucination Removal),过滤视觉属性不符合真实世界分布的人体实例,使生成样本更贴近目标域分布。该方法在保持原始合成标签的前提下有效缩小了域差距,显著提升了检测精度,在Semantic-Drone基准上mAP50最高提升达+14.1%。
链接: https://arxiv.org/abs/2512.13869
作者: Wenda Li,Meng Wu,Sungmin Eum,Heesung Kwon,Qing Qu
机构: University of Michigan (密歇根大学); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data, with a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer – a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement – a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for the small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal – a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves the detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to a +14.1 mAP50 improvement on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at this https URL.
zh
[CV-108] Improvise Adapt Overcome – Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging WACV2026
【速读】:该论文旨在解决视觉语言分割模型(Vision Language Segmentation Models, VLSMs)在医学影像领域适应时,传统微调方法计算开销大、参数效率低的问题。现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法对所有Transformer层采用统一的适配器维度,导致参数分配不合理,影响适应效率。其解决方案的关键在于提出望远镜式适配器(Telescopic Adapters),通过深度感知的缩放策略,从浅层到深层逐步增加适配器容量,并结合轻量级瓶颈模块嵌入CLIPSeg的视觉与文本编码器中,使适配器维度根据层深度和语义相关性动态调整。该方法仅使用613k可训练参数(较端到端微调减少244倍),在五个多样化的医学数据集上实现更优性能,验证了深层Transformer层需要更强适应能力的假设,为资源受限临床环境下的高效VLSM微调提供了新范式。
链接: https://arxiv.org/abs/2512.13855
作者: Ujjwal Mishra,Vinita Shukla,Praful Hambarde,Amit Shukla
机构: Centre for Artificial Intelligence and Robotics, Indian Institute of Technology Mandi, India (印度理工学院曼迪分校人工智能与机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the IEEE/CVF winter conference on applications of computer vision (WACV 2026)
Abstract:Adapting Vision Language Segmentation Models (VLSMs) to medical imaging domains requires significant computational overhead when using conventional fine-tuning approaches. Existing Parameter-Efficient Fine-Tuning (PEFT) methods apply uniform adapter dimensions across all transformer layers, leading to suboptimal parameter allocation and reduced adaptation efficiency. We introduce Telescopic Adapters, a novel PEFT framework that employs depth-aware scaling to progressively increase adapter capacity from shallow to deep transformer layers. Our method integrates lightweight bottleneck modules within CLIPSeg’s vision and text encoders, with adapter dimensions dynamically scaled based on layer depth and semantic relevance. Using only 613k trainable parameters (244x fewer than end-to-end fine-tuning), Telescopic Adapters achieve superior performance across five diverse medical datasets spanning polyp segmentation, skin lesion detection, and breast ultrasound imaging. Comprehensive ablation studies demonstrate that deeper layers require substantially more adaptation capacity than shallow layers, validating our telescopic scaling hypothesis. Our approach establishes a new paradigm for efficient medical VLSM fine-tuning, enabling deployment in resource-constrained clinical environments while maintaining competitive segmentation accuracy.
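The telescopic principle (progressively wider adapter bottlenecks at deeper layers) reduces to a depth-dependent width schedule over standard bottleneck adapters. The linear schedule and dimensions below are assumptions; the paper's exact scaling rule may differ.

```python
import torch
import torch.nn as nn

def telescopic_dims(n_layers=12, base=8, top=64):
    """Depth-aware bottleneck widths: shallow layers small, deep layers large
    (linear schedule for illustration)."""
    return [int(base + (top - base) * l / (n_layers - 1)) for l in range(n_layers)]

class Adapter(nn.Module):
    """Standard bottleneck adapter with a residual connection."""
    def __init__(self, dim, bottleneck):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

dims = telescopic_dims()                                    # e.g. [8, 13, ..., 64]
adapters = nn.ModuleList([Adapter(768, d) for d in dims])   # one per layer
```

Only the adapters are trained; the frozen encoder layers around them are untouched, which is where the 244x parameter saving comes from.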
zh
[CV-109] MoLingo: Motion-Language Alignment for Text-to-Motion Generation
【速读】:该论文旨在解决文本到动作(text-to-motion, T2M)生成中运动 realism 和文本-动作对齐性不足的问题。核心挑战在于如何在连续的运动潜在空间中高效进行去噪扩散,并实现语义一致的文本条件注入。解决方案的关键在于:(1) 构建语义对齐的运动潜在空间,通过帧级文本标签训练编码器,使语义相近的动作在潜在空间中距离更近,从而提升扩散过程的有效性;(2) 采用多标记交叉注意力(multi-token cross-attention)机制替代单标记条件注入,显著增强运动真实感与文本描述的匹配度。结合语义对齐潜在空间、自回归生成策略和交叉注意力条件注入,该方法在标准指标和用户评测中均达到当前最优性能。
链接: https://arxiv.org/abs/2512.13840
作者: Yannan He,Garvita Tiwari,Xiaohan Zhang,Pankaj Bora,Tolga Birdal,Jan Eric Lenssen,Gerard Pons-Moll
机构: University of Tübingen, Germany(图宾根大学, 德国); Tübingen AI Center, Germany(图宾根人工智能中心, 德国); Max Planck Institute for Informatics, Saarland Informatics Campus, Germany(马克斯·普朗克信息研究所, 萨尔兰信息学园区, 德国); Imperial College London, United Kingdom(帝国理工学院, 英国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
zh
[CV-110] VajraV1 – The most accurate Real Time Object Detector of the YOLO family
【速读】:该论文旨在解决实时目标检测(Real-time Object Detection)中精度与推理速度难以兼顾的问题。针对这一挑战,作者提出VajraV1模型架构,其关键在于融合了此前YOLO系列模型中的有效设计选择,并通过结构优化实现更高的检测精度而保持竞争力的推理速度。具体而言,VajraV1在COCO验证集上显著优于YOLOv12和YOLOv13各尺寸版本,在Nano至Xlarge等多个规模下均实现了mAP指标的提升,其中VajraV1-Xlarge成为首个超越所有现有实时目标检测器的模型,验证了其架构改进的有效性。
链接: https://arxiv.org/abs/2512.13834
作者: Naman Balbir Singh Makkar
机构: Vayuvahana Technologies Private Limited(瓦尤瓦纳技术私人有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report. 20 Pages, 7 figures
Abstract:Recent years have seen significant advances in real-time object detection, with the release of YOLOv10, YOLO11, YOLOv12, and YOLOv13 between 2024 and 2025. This technical report presents the VajraV1 model architecture, which introduces architectural enhancements over existing YOLO-based detectors. VajraV1 combines effective design choices from prior YOLO models to achieve state-of-the-art accuracy among real-time object detectors while maintaining competitive inference speed. On the COCO validation set, VajraV1-Nano achieves 44.3% mAP, outperforming YOLOv12-N by 3.7% and YOLOv13-N by 2.7% at latency competitive with YOLOv12-N and YOLOv11-N. VajraV1-Small achieves 50.4% mAP, exceeding YOLOv12-S and YOLOv13-S by 2.4%. VajraV1-Medium achieves 52.7% mAP, outperforming YOLOv12-M by 0.2%. VajraV1-Large achieves 53.7% mAP, surpassing YOLOv13-L by 0.3%. VajraV1-Xlarge achieves 56.2% mAP, outperforming all existing real-time object detectors.
zh
[CV-111] EEG-D3: A Solution to the Hidden Overfitting Problem of Deep Learning Models
【速读】:该论文旨在解决深度学习模型在脑电图(EEG)信号解码中普遍存在但常被忽视的“隐式过拟合”问题,即模型在受控脑机接口(BCI)基准测试中表现优异,却难以泛化到实际应用场景。其核心挑战在于任务相关的伪影(如运动伪迹或设备噪声)可能被模型错误地学习为判别特征,导致性能在真实环境中严重下降。解决方案的关键是提出一种弱监督的解耦解码分解方法(Disentangled Decoding Decomposition, D3),通过预测输入时间窗采样位置来分离脑活动的潜在成分,类似于非线性独立成分分析(ICA)。D3采用完全独立的子网络架构以确保可解释性,并基于组件激活模式对比不同数据集上的特征分布,从而识别并剔除由任务相关伪影引起的虚假成分。训练下游分类器时仅使用这些真实生理成分的子集,有效避免了隐式过拟合,同时显著提升模型在少量标注数据下的泛化能力,尤其适用于睡眠阶段分类等少样本场景。
链接: https://arxiv.org/abs/2512.13806
作者: Siegfried Ludwig,Stylianos Bakas,Konstantinos Barmpas,Georgios Zoumpourlis,Dimitrios A. Adamos,Nikolaos Laskaris,Yannis Panagakis,Stefanos Zafeiriou
机构: Imperial College London (帝国理工学院); Aristotle University of Thessaloniki (亚里士多德大学塞萨洛尼基分校); National and Kapodistrian University of Athens (国家和卡波迪斯特里安大学雅典分校); Cogitat Ltd. (Cogitat有限公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:Deep learning for decoding EEG signals has gained traction, with many claims to state-of-the-art accuracy. However, despite the convincing benchmark performance, successful translation to real applications is limited. The frequent disconnect between performance on controlled BCI benchmarks and its lack of generalisation to practical settings indicates hidden overfitting problems. We introduce Disentangled Decoding Decomposition (D3), a weakly supervised method for training deep learning models across EEG datasets. By predicting the place in the respective trial sequence from which the input window was sampled, EEG-D3 separates latent components of brain activity, akin to non-linear ICA. We utilise a novel model architecture with fully independent sub-networks for strict interpretability. We outline a feature interpretation paradigm to contrast the component activation profiles on different datasets and inspect the associated temporal and spatial filters. The proposed method reliably separates latent components of brain activity on motor imagery data. Training downstream classifiers on an appropriate subset of these components prevents hidden overfitting caused by task-correlated artefacts, which severely affects end-to-end classifiers. We further exploit the linearly separable latent space for effective few-shot learning on sleep stage classification. The ability to distinguish genuine components of brain activity from spurious features results in models that avoid the hidden overfitting problem and generalise well to real-world applications, while requiring only minimal labelled data. With interest to the neuroscience community, the proposed method gives researchers a tool to separate individual brain processes and potentially even uncover heretofore unknown dynamics.
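The D3 pretext task (predict where in the trial sequence an input window was sampled) pairs naturally with the strictly independent sub-networks described above. The sketch below assumes 22-channel EEG windows and 10 position bins purely for illustration; the real architecture and component count are the paper's.

```python
import torch
import torch.nn as nn

class PositionPretext(nn.Module):
    """Weak-supervision sketch of D3: predict the trial position an EEG
    window was sampled from, using independent per-component encoders."""
    def __init__(self, n_components=8, feat_dim=32, n_bins=10):
        super().__init__()
        # one small encoder per component, kept separate for interpretability
        self.components = nn.ModuleList(
            [nn.Sequential(nn.Conv1d(22, 8, 25, stride=4), nn.ReLU(),
                           nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                           nn.Linear(8, feat_dim)) for _ in range(n_components)])
        self.head = nn.Linear(n_components * feat_dim, n_bins)

    def forward(self, x):                       # x: (B, 22, T) EEG window
        h = torch.cat([c(x) for c in self.components], dim=-1)
        return self.head(h)                     # logits over position bins

model = PositionPretext()
logits = model(torch.randn(4, 22, 500))        # 22 channels, 500 samples
```

Downstream classifiers would then be trained on a selected subset of the per-component features, discarding components whose activation profiles betray task-correlated artefacts.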
zh
[CV-112] Nexels: Neurally-Textured Surfels for Real-Time Novel View Synthesis with Sparse Geometries
【速读】:该论文旨在解决3D Gaussian splatting在新视角合成中因需数百万个基础单元(primitive)来建模高纹理场景而导致的表示冗余与内存消耗过高的问题,尤其是在几何结构简单的场景下。其解决方案的关键在于提出一种超越点基渲染的新型表示方法,通过将几何与外观解耦:使用surfels(表面元素)表示几何结构,同时采用全局神经场(neural field)结合每个基础单元的颜色信息来建模外观;其中神经场以固定数量的基元对每个像素进行纹理映射,从而在保持低计算开销的同时实现高质量渲染。该方法在室外和室内场景中分别减少9.7倍和31倍的基础单元数量,以及5.5倍和3.7倍的内存占用,并且渲染速度提升一倍,同时视觉质量优于现有纹理化基元方法。
链接: https://arxiv.org/abs/2512.13796
作者: Victor Rong,Jan Held,Victor Chu,Daniel Rebain,Marc Van Droogenbroeck,Kiriakos N. Kutulakos,Andrea Tagliasacchi,David B. Lindell
机构: University of Toronto (多伦多大学); Vector Institute (矢量研究所); Simon Fraser University (西蒙菲莎大学); University of Liège (列日大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Webpage at this https URL
Abstract:Though Gaussian splatting has achieved impressive results in novel view synthesis, it requires millions of primitives to model highly textured scenes, even when the geometry of the scene is simple. We propose a representation that goes beyond point-based rendering and decouples geometry and appearance in order to achieve a compact representation. We use surfels for geometry and a combination of a global neural field and per-primitive colours for appearance. The neural field textures a fixed number of primitives for each pixel, ensuring that the added compute is low. Our representation matches the perceptual quality of 3D Gaussian splatting while using 9.7\times fewer primitives and 5.5\times less memory on outdoor scenes and using 31\times fewer primitives and 3.7\times less memory on indoor scenes. Our representation also renders twice as fast as existing textured primitives while improving upon their visual quality.
zh
[CV-113] Enhancing Semi-Supervised Multi-View Graph Convolutional Networks via Supervised Contrastive Learning and Self-Training
【速读】:该论文旨在解决多视图学习中现有方法难以充分挖掘不同视图间互补信息的问题,从而导致特征表示不充分和性能受限。其解决方案的关键在于提出一种半监督图卷积网络(MV-SupGCN),通过三个核心组件实现协同增强:首先,设计联合损失函数融合交叉熵损失与监督对比损失(Supervised Contrastive Loss),在潜在空间中同时最小化类内方差并最大化类间可分性;其次,结合K近邻(KNN)与半监督图构建方法以提升图结构的鲁棒性与完整性,降低泛化误差;最后,引入统一的对比学习框架与伪标签机制,强化多视图嵌入一致性并利用未标记数据提供额外监督信号,从而显著提升模型泛化能力。
链接: https://arxiv.org/abs/2512.13770
作者: Huaiyuan Xiao,Fadi Dornaika,Jingjun Bi
机构: University of the Basque Country (巴斯克大学); IKERBASQUE; North China University of Water Resources and Electric Power (华北水利水电大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The advent of graph convolutional network (GCN)-based multi-view learning provides a powerful framework for integrating structural information from heterogeneous views, enabling effective modeling of complex multi-view data. However, existing methods often fail to fully exploit the complementary information across views, leading to suboptimal feature representations and limited performance. To address this, we propose MV-SupGCN, a semi-supervised GCN model that integrates several complementary components with clear motivations and mutual reinforcement. First, to better capture discriminative features and improve model generalization, we design a joint loss function that combines Cross-Entropy loss with Supervised Contrastive loss, encouraging the model to simultaneously minimize intra-class variance and maximize inter-class separability in the latent space. Second, recognizing the instability and incompleteness of single graph construction methods, we combine both KNN-based and semi-supervised graph construction approaches on each view, thereby enhancing the robustness of the data structure representation and reducing generalization error. Third, to effectively utilize abundant unlabeled data and enhance semantic alignment across multiple views, we propose a unified framework that integrates contrastive learning in order to enforce consistency among multi-view embeddings and capture meaningful inter-view relationships, together with pseudo-labeling, which provides additional supervision applied to both the cross-entropy and contrastive loss functions to enhance model generalization. Extensive experiments demonstrate that MV-SupGCN consistently surpasses state-of-the-art methods across multiple benchmarks, validating the effectiveness of our integrated approach. The source code is available at this https URL
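The joint objective pairs cross-entropy with a supervised contrastive term. Below is a standard supervised contrastive loss in the style of Khosla et al., shown as one plausible instantiation; the temperature and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss (Khosla-et-al.-style sketch): pull together
    embeddings sharing a label, push apart the rest. z: (N, D); labels: (N,)."""
    z = F.normalize(z, dim=-1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = (z @ z.T / tau).masked_fill(eye, -1e9)        # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    denom = pos.sum(1).clamp(min=1.0)
    return -(pos * log_prob).sum(1).div(denom).mean()

emb, y = torch.randn(32, 64, requires_grad=True), torch.randint(0, 4, (32,))
# joint objective: cross-entropy on labelled nodes plus the contrastive term
loss = F.cross_entropy(torch.randn(32, 4, requires_grad=True), y) \
       + 0.5 * supcon_loss(emb, y)
```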
zh
[CV-114] Time-aware UNet and super-resolution deep residual networks for spatial downscaling
【速读】:该论文旨在解决卫星大气污染物数据空间分辨率较低的问题,从而限制了其在局部尺度环境分析与决策中的应用。为实现从粗分辨率到高分辨率的转换,研究提出基于深度学习的空间降尺度方法,关键在于将轻量级时间模块引入两种主流网络架构——超分辨率深度残差网络(SRDRN)和基于编码器-解码器结构的UNet中,通过正弦或径向基函数(RBF)编码方式对观测时间进行建模,并将时间特征融合至空间表示中,显著提升了降尺度性能与收敛速度,同时仅带来轻微的计算复杂度增加。
链接: https://arxiv.org/abs/2512.13753
作者: Mika Sipilä,Sabrina Maggio,Sandra De Iaco,Klaus Nordhausen,Monica Palma,Sara Taskinen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注:
Abstract:Satellite data of atmospheric pollutants are often available only at coarse spatial resolution, limiting their applicability in local-scale environmental analysis and decision-making. Spatial downscaling methods aim to transform the coarse satellite data into high-resolution fields. In this work, two widely used deep learning architectures, the super-resolution deep residual network (SRDRN) and the encoder-decoder-based UNet, are considered for spatial downscaling of tropospheric ozone. Both methods are extended with a lightweight temporal module, which encodes observation time using either sinusoidal or radial basis function (RBF) encoding, and fuses the temporal features with the spatial representations in the networks. The proposed time-aware extensions are evaluated against their baseline counterparts in a case study on ozone downscaling over Italy. The results suggest that, while only slightly increasing computational complexity, the temporal modules significantly improve downscaling performance and convergence speed.
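Both temporal encodings are a few lines each. The sketch below encodes observation time (e.g., day of year) sinusoidally or with radial basis functions; the dimensionalities, period, and RBF width are illustrative choices, and the resulting vector would be fused with the spatial features inside the SRDRN or UNet.

```python
import math
import torch

def sinusoidal_time_encoding(t, dim=16, period=365.0):
    """Sinusoidal encoding of observation time. t: (B,) float tensor."""
    freqs = 2.0 ** torch.arange(dim // 2, dtype=torch.float32)
    ang = 2.0 * math.pi * t[:, None] / period * freqs[None, :]
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)   # (B, dim)

def rbf_time_encoding(t, centers=12, period=365.0, sigma=15.0):
    """RBF alternative: Gaussian bumps at evenly spaced times of year."""
    mu = torch.linspace(0.0, period, centers)
    return torch.exp(-((t[:, None] - mu[None, :]) ** 2) / (2.0 * sigma ** 2))

code = sinusoidal_time_encoding(torch.tensor([10.0, 180.0]))     # (2, 16)
```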
zh
[CV-115] STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在统一理解与生成任务中面临的优化冲突和性能权衡问题。为实现这一目标,其关键解决方案是提出STAR(STacked AutoRegressive)架构,该方案将多模态学习分解为理解、生成和编辑三个阶段,并通过冻结基础自回归(Autoregressive, AR)模型参数、逐步堆叠同构AR模块的方式,在避免跨任务干扰的同时扩展模型能力;此外,引入高容量向量量化(VQ)以提升图像表征粒度,并采用隐式推理机制增强复杂场景下的生成质量,从而实现对多模态理解与生成能力的统一优化。
链接: https://arxiv.org/abs/2512.13752
作者: Jie Qin,Jiancheng Huang,Limeng Qiao,Lin Ma
机构: Meituan Inc (美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures
Abstract:Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified model for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model’s capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
zh
[CV-116] Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making ICDM2025
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在生物医学领域中医疗决策(Medical Decision Making, MDM)任务上的性能不足问题,尤其是在视觉-语言任务中缺乏对医学图像的精准理解与推理能力。研究发现,在阿尔茨海默病(Alzheimer’s Disease, AD)三阶段分类和MIMIC-CXR胸部X光片多标签分类任务中,仅使用文本输入的推理方式优于视觉-文本联合输入,表明现有MLLMs存在视觉感知不充分、跨模态对齐不佳的问题。解决方案的关键在于提升模型的视觉语义 grounding 能力:通过三种策略实现——(1) 使用带推理标注的上下文示例进行提示学习(in-context learning),(2) 先对图像生成描述再进行纯文本推理(vision captioning + text-only inference),以及 (3) 对视觉编码器进行少量样本微调(few-shot fine-tuning of the vision tower with classification supervision)。这些方法共同揭示了增强视觉理解是改善医疗场景下多模态决策的核心路径。
链接: https://arxiv.org/abs/2512.13747
作者: Siyuan Dai,Lunxiao Li,Kun Zhao,Eardi Lila,Paul K. Crane,Heng Huang,Dongkuan Xu,Haoteng Tang,Liang Zhan
机构: University of Pittsburgh (匹兹堡大学); NC State University (北卡罗来纳州立大学); University of Washington (华盛顿大学); University of Maryland (马里兰大学); University of Texas Rio Grande Valley (德州里奥格兰德谷大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICDM 2025 the Workshop on Synergy of AI and Multimodal Biomedical Data Mining
Abstract:With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer’s disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.
zh
[CV-117] DL3M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models
[Quick Read]: This paper tackles two linked problems: medical image classifiers (such as deep models for gastrointestinal disease detection) lack interpretability, while large language models (LLMs) have weak visual reasoning and produce unstable or inaccurate explanations, leaving a gap between model decisions and the structured reasoning clinicians expect. The key to the solution is a framework that couples classification with reasoning: a new lightweight hybrid model, MobileCoAtNet, first classifies endoscopic images with high accuracy across eight stomach-related classes, and its outputs then drive structured clinical reasoning by multiple LLMs. Two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care are built to judge the LLMs' reasoning quality. The study finds that stronger classification improves LLM explanation quality, but no current LLM reaches human-level stability, so combining deep learning with LLMs yields useful clinical narratives yet remains unreliable for high-stakes medical decisions.
Link: https://arxiv.org/abs/2512.13742
Authors: Md. Najib Hasan (1), Imran Ahmad (1), Sourav Basak Shuvo (2), Md. Mahadi Hasan Ankon (2), Sunanda Das (3), Nazmul Siddique (4), Hui Wang (5) ((1) Wichita State University, USA, (2) Khulna University of Engineering and Technology, Bangladesh, (3) University of Arkansas, USA, (4) Ulster University, UK, (5) Queen’s University Belfast, UK)
Affiliations: Wichita State University; KUET; University of Arkansas; Ulster University; Queen’s University Belfast
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical image classifiers detect gastrointestinal diseases well, but they do not explain their decisions. Large language models can generate clinical text, yet they struggle with visual reasoning and often produce unstable or incorrect explanations. This leaves a gap between what a model sees and the type of reasoning a clinician expects. We introduce a framework that links image classification with structured clinical reasoning. A new hybrid model, MobileCoAtNet, is designed for endoscopic images and achieves high accuracy across eight stomach-related classes. Its outputs are then used to drive reasoning by several LLMs. To judge this reasoning, we build two expert-verified benchmarks covering causes, symptoms, treatment, lifestyle, and follow-up care. Thirty-two LLMs are evaluated against these gold standards. Strong classification improves the quality of their explanations, but none of the models reach human-level stability. Even the best LLMs change their reasoning when prompts vary. Our study shows that combining DL with LLMs can produce useful clinical narratives, but current LLMs remain unreliable for high-stakes medical decisions. The framework provides a clearer view of their limits and a path for building safer reasoning systems. The complete source code and datasets used in this study are available at this https URL.
[CV-118] Human-AI Collaboration Mechanism Study on AIGC Assisted Image Production for Special Coverage AAAI
[Quick Read]: This paper addresses the credibility and ethical dilemmas raised by generative AI (AIGC)-assisted image production in journalism, in particular insufficient content accuracy, semantic fidelity, and interpretability. These challenges stem from the "black box" nature of most AIGC tools, which makes it hard for media agencies to guarantee authenticity and cultural appropriateness in special-coverage scenarios. The key to the solution is a human-AI collaboration mechanism built on a modular pipeline for controllable image generation: it integrates high-precision segmentation (SAM, GroundingDINO), semantic alignment (BrushNet), and style regulation (Style-LoRA, Prompt-to-Prompt), supported by CLIP-based semantic scoring, NSFW/OCR/YOLO content filtering, and verifiable content credentials to ensure editorial consistency and traceability. Finally, three quantitative evaluation metrics are proposed, Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA), providing a systematic quality-control framework for AIGC image production in journalism.
Link: https://arxiv.org/abs/2512.13739
Authors: Yajie Yang, Yuqing Zhao, Xiaochao Xi, Yinan Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI-AISI 2026
Abstract:Artificial Intelligence Generated Content (AIGC) assisting image production triggers controversy in journalism while attracting attention from media agencies. Key issues involve misinformation, authenticity, semantic fidelity, and interpretability. Most AIGC tools are opaque “black boxes,” hindering the dual demands of content accuracy and semantic alignment and creating ethical, sociotechnical, and trust dilemmas. This paper explores pathways for controllable image production in journalism’s special coverage and conducts two experiments with projects from China’s media agency: (1) Experiment 1 tests cross-platform adaptability via standardized prompts across three scenes, revealing disparities in semantic alignment, cultural specificity, and visual realism driven by training-corpus bias and platform-level filtering. (2) Experiment 2 builds a human-in-the-loop modular pipeline combining high-precision segmentation (SAM, GroundingDINO), semantic alignment (BrushNet), and style regulating (Style-LoRA, Prompt-to-Prompt), ensuring editorial fidelity through CLIP-based semantic scoring, NSFW/OCR/YOLO filtering, and verifiable content credentials. Traceable deployment preserves semantic representation. Consequently, we propose a human-AI collaboration mechanism for AIGC assisted image production in special coverage and recommend evaluating Character Identity Stability (CIS), Cultural Expression Accuracy (CEA), and User-Public Appropriateness (U-PA).
[CV-119] Complex Mathematical Expression Recognition: Benchmark Large-Scale Dataset and Strong Baseline
[Quick Read]: This paper targets the sharp performance drop of Mathematical Expression Recognition (MER) models on complex expressions, especially multi-line, token-rich ones, a weakness rooted in training datasets dominated by simple expressions. The key to the solution is threefold. First, the CMER-Bench benchmark systematically evaluates existing MER models and multimodal large language models (MLLMs) across difficulty levels. Second, two large-scale datasets, MER-17M and CMER-3M, emphasize the diversity and richness of complex mathematical expressions. Third, a novel expression tokenizer and a new representation called Structured Mathematical Language explicitly model the hierarchical and spatial layout of expressions, enabling CMERNet, a specialized encoder-decoder model that, with only 125 million parameters, significantly outperforms existing MER models and MLLMs.
Link: https://arxiv.org/abs/2512.13731
Authors: Weikang Bai, Yongkun Du, Yuchen Su, Yazhen Xie, Zhineng Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mathematical Expression Recognition (MER) has made significant progress in recognizing simple expressions, but the robust recognition of complex mathematical expressions with many tokens and multiple lines remains a formidable challenge. In this paper, we first introduce CMER-Bench, a carefully constructed benchmark that categorizes expressions into three difficulty levels: easy, moderate, and complex. Leveraging CMER-Bench, we conduct a comprehensive evaluation of existing MER models and general-purpose multimodal large language models (MLLMs). The results reveal that while current methods perform well on easy and moderate expressions, their performance degrades significantly when handling complex mathematical expressions, mainly because existing public training datasets are primarily composed of simple samples. In response, we propose MER-17M and CMER-3M that are large-scale datasets emphasizing the recognition of complex mathematical expressions. The datasets provide rich and diverse samples to support the development of accurate and robust complex MER models. Furthermore, to address the challenges posed by the complicated spatial layout of complex expressions, we introduce a novel expression tokenizer, and a new representation called Structured Mathematical Language, which explicitly models the hierarchical and spatial structure of expressions beyond LaTeX format. Based on these, we propose a specialized model named CMERNet, built upon an encoder-decoder architecture and trained on CMER-3M. Experimental results show that CMERNet, with only 125 million parameters, significantly outperforms existing MER models and MLLMs on CMER-Bench.
[CV-120] Composite Classifier-Free Guidance for Multi-Modal Conditioning in Wind Dynamics Super-Resolution
[Quick Read]: This paper addresses the high cost and difficulty of acquiring high-resolution, highly accurate wind-field data, a regime where traditional methods are either cost-effective or accurate but not both. The key to the solution is Composite Classifier-Free Guidance (CCFG), an extension of diffusion-model guidance to multiple conditioning inputs that can be dropped into any diffusion model already trained with standard classifier-free guidance (CFG) dropout, making better use of the ten-plus input channels common in wind-field reconstruction. Experiments show that WindDM, a diffusion model for industrial-scale wind dynamics reconstruction built on CCFG, achieves higher fidelity than CFG on wind super-resolution tasks while costing up to 1000x less than classical methods, striking an industrial-grade balance between performance and efficiency.
Link: https://arxiv.org/abs/2512.13729
Authors: Jacob Schnell, Aditya Makkar, Gunadi Gani, Aniket Srinivasan Ashok, Darren Lo, Mike Optis, Alexander Wong, Yuhao Chen
Affiliations: University of Waterloo; Veer Renewables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Various weather modelling problems (e.g., weather forecasting, optimizing turbine placements, etc.) require ample access to high-resolution, highly accurate wind data. Acquiring such high-resolution wind data, however, remains a challenging and expensive endeavour. Traditional reconstruction approaches are typically either cost-effective or accurate, but not both. Deep learning methods, including diffusion models, have been proposed to resolve this trade-off by leveraging advances in natural image super-resolution. Wind data, however, is distinct from natural images, and wind super-resolvers often use upwards of 10 input channels, significantly more than the usual 3-channel RGB inputs in natural images. To better leverage a large number of conditioning variables in diffusion models, we present a generalization of classifier-free guidance (CFG) to multiple conditioning inputs. Our novel composite classifier-free guidance (CCFG) can be dropped into any pre-trained diffusion model trained with standard CFG dropout. We demonstrate that CCFG outputs are higher-fidelity than those from CFG on wind super-resolution tasks. We present WindDM, a diffusion model trained for industrial-scale wind dynamics reconstruction and leveraging CCFG. WindDM achieves state-of-the-art reconstruction quality among deep learning models and costs up to 1000× less than classical methods.
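The abstract does not spell out how CCFG combines the guidance terms. A common way to generalize CFG to several conditioning inputs is to sum per-condition guidance directions, sketched below as an assumption rather than the paper's exact rule; all names are illustrative.

```python
def composite_cfg(eps_uncond, eps_conds, weights):
    """CFG-style combination over multiple conditions (illustrative sketch).

    eps_uncond: unconditional noise prediction (torch tensor, e.g. B x C x H x W)
    eps_conds:  one conditional prediction per conditioning input
    weights:    guidance scale per condition
    """
    guided = eps_uncond.clone()
    for eps_c, w in zip(eps_conds, weights):
        guided = guided + w * (eps_c - eps_uncond)  # add per-condition guidance
    return guided
```

With a single condition and weight, this reduces to standard CFG, which is consistent with the claim that CCFG drops into models trained with ordinary CFG dropout.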
[CV-121] Physics-Guided Deep Learning for Heat Pump Stress Detection: A Comprehensive Analysis on When2Heat Dataset
[Quick Read]: This paper addresses the difficulty of detecting operational stress in heat pump systems, which arises from complex thermodynamic interactions and scarce real-world data. The key to the solution is a Physics-Guided Deep Neural Network (PG-DNN) whose core components are physics-guided feature selection and class definition, a deep architecture with five hidden layers, dual regularization strategies, and cross-country analysis of national energy patterns. The method reaches 78.1% test accuracy on the When2Heat dataset, clearly outperforming baselines and validating that physics-informed features and structured modeling improve heat pump stress classification.
Link: https://arxiv.org/abs/2512.13696
Authors: Md Shahabub Alam, Md Asifuzzaman Jishan, Ayan Kumar Ghosh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Heat pump systems are critical components in modern energy-efficient buildings, yet their operational stress detection remains challenging due to complex thermodynamic interactions and limited real-world data. This paper presents a novel Physics-Guided Deep Neural Network (PG-DNN) approach for heat pump stress classification using the When2Heat dataset, containing 131,483 samples with 656 features across 26 European countries. The methodology integrates physics-guided feature selection and class definition with a deep neural network architecture featuring 5 hidden layers and dual regularization strategies. The model achieves 78.1% test accuracy and 78.5% validation accuracy, demonstrating significant improvements over baseline approaches: +5.0% over shallow networks, +4.0% over limited feature sets, and +2.0% over single regularization strategies. Comprehensive ablation studies validate the effectiveness of physics-guided feature selection, variable thresholding for realistic class distribution, and cross-country energy pattern analysis. The proposed system provides a production-ready solution for heat pump stress detection with 181,348 parameters and 720 seconds training time on AMD Ryzen 9 7950X with RTX 4080 hardware.
[CV-122] WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving
[Quick Read]: This paper addresses the efficiency and flexibility limits of trajectory generation in end-to-end autonomous driving, where existing approaches (autoregressive large language models and continuous diffusion policies) suffer from high computational overhead and constrained decoding orders in complex scenes. The key to the solution is WAM-Diff, a vision-language-action framework that iteratively refines a discrete sequence of future ego-trajectories with masked diffusion. Its innovations are: (1) a systematic adaptation of masked diffusion to autonomous driving that supports flexible, non-causal decoding orders; (2) scaled model capacity via a sparse mixture-of-experts (MoE) architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and (3) online reinforcement learning with Group Sequence Policy Optimization (GSPO) that optimizes sequence-level driving rewards. The model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, validating the effectiveness and promise of discrete masked diffusion for trajectory generation.
Link: https://arxiv.org/abs/2512.11872
Authors: Mingwang Xu, Jiahao Cui, Feipeng Cai, Hanlin Shang, Zhihao Zhu, Shan Luan, Yifang Xu, Neng Zhang, Yaoyi Li, Jia Cai, Siyu Zhu
Affiliations: Fudan University; Yinwang Intelligent Technology Co., Ltd
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:End-to-end autonomous driving systems based on vision-language-action (VLA) models integrate multimodal sensor inputs and language instructions to generate planning and control signals. While autoregressive large language models and continuous diffusion policies are prevalent, the potential of discrete masked diffusion for trajectory generation remains largely unexplored. This paper presents WAM-Diff, a VLA framework that employs masked diffusion to iteratively refine a discrete sequence representing future ego-trajectories. Our approach features three key innovations: a systematic adaptation of masked diffusion for autonomous driving that supports flexible, non-causal decoding orders; scalable model capacity via a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and online reinforcement learning using Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. Remarkably, our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving. The approach provides a promising alternative to autoregressive and diffusion-based policies, supporting scenario-aware decoding strategies for trajectory generation. The code for this paper will be released publicly at: this https URL
[CV-123] Automated Pollen Recognition in Optical and Holographic Microscopy Images DATE
[Quick Read]: This paper tackles the difficulty of automating pollen grain detection and classification in optical and holographic microscopy images, with a focus on veterinary cytology. The core problem is that the lower image quality of holographic microscopy sharply degrades deep learning performance. The key to the solution is expanding the holographic dataset with two augmentation strategies, automated labeling and bounding box area enlargement, which substantially narrow the gap to optical images in detection (mAP50 from 2.49% to 13.3%) and classification (accuracy from 42% to 54%), demonstrating the feasibility of pairing deep learning with cost-effective lensless digital holographic microscopy for specific tasks.
Link: https://arxiv.org/abs/2512.08589
Authors: Swarn Singh Warshaneyan, Maksims Ivanovs, Blaž Cugmas, Inese Bērziņa, Laura Goldberga, Mindaugas Tamosiunas, Roberts Kadiķis
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 08 pages, 10 figures, 04 tables, 20 references. Date of Conference: 13-14 June 2025 Date Added to IEEE Xplore: 10 July 2025 Electronic ISBN: 979-8-3315-0969-9 Print on Demand(PoD) ISBN: 979-8-3315-0970-5 DOI: https://doi.org/10.1109/AICCONF64766.2025.11064260 Conference Location: Prague, Czech Republic Online Access: this https URL
Abstract:This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L for the classification task, evaluating their performance across imaging modalities. The models achieved 91.3% mAP50 for detection and 97% overall accuracy for classification on optical images, whereas the initial performance on greyscale holographic images was substantially lower. We addressed the performance gap issue through dataset expansion using automated labeling and bounding box area enlargement. These techniques, applied to holographic images, improved detection performance from 2.49% to 13.3% mAP50 and classification performance from 42% to 54%. Our work demonstrates that, at least for image classification tasks, it is possible to pair deep learning techniques with cost-effective lensless digital holographic microscopy devices.
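The abstract credits part of the holographic gain to bounding box area enlargement but does not give the exact rule. The helper below shows one plausible center-preserving enlargement, clipped to the image bounds; treat the scaling scheme as an assumption, not the paper's implementation.

```python
def enlarge_bbox(x, y, w, h, scale, img_w, img_h):
    """Grow a box (x, y, w, h) about its center by `scale`, clipped to the image."""
    cx, cy = x + w / 2.0, y + h / 2.0          # box center stays fixed
    nw, nh = w * scale, h * scale              # enlarged width and height
    nx, ny = max(0.0, cx - nw / 2.0), max(0.0, cy - nh / 2.0)
    nw, nh = min(nw, img_w - nx), min(nh, img_h - ny)  # keep inside the image
    return nx, ny, nw, nh
```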
[CV-124] WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving
[Quick Read]: This paper addresses the efficiency and accuracy bottlenecks of trajectory planning in end-to-end autonomous driving, where conventional autoregressive models decode serially at inference time and struggle to balance compute against decision accuracy. The key to the solution is WAM-Flow, a vision-language-action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. Its core innovations are: (1) a metric-aligned numerical tokenizer based on triplet-margin learning that preserves scalar geometry; (2) a geometry-aware flow objective combined with simulator-guided GRPO alignment that jointly optimizes safety, ego-progress, and comfort rewards; and (3) a multi-stage adaptation that converts a pretrained autoregressive backbone (Janus-1.5B) into a non-causal flow model, with continued multimodal pretraining to strengthen road-scene understanding. The method supports fully parallel, bidirectional denoising with a tunable coarse-to-fine compute-accuracy trade-off, clearly outperforming autoregressive and diffusion baselines and reaching 90.3 PDMS with 5-step inference on the NAVSIM v1 benchmark, establishing discrete flow matching as a promising new paradigm for autonomous driving.
Link: https://arxiv.org/abs/2512.06112
Authors: Yifang Xu, Jiahao Cui, Feipeng Cai, Zhihao Zhu, Hanlin Shang, Shan Luan, Mingwang Xu, Neng Zhang, Yaoyi Li, Jia Cai, Siyu Zhu
Affiliations: Fudan University; Yinwang Intelligent Technology Co., Ltd
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 11 figures. Code Model: this https URL
Abstract:We introduce WAM-Flow, a vision-language-action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, WAM-Flow performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute-accuracy trade-off. Specifically, the approach combines a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning, a geometry-aware flow objective and a simulator-guided GRPO alignment that integrates safety, ego progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pre-trained auto-regressive backbone (Janus-1.5B) from causal decoding to non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Thanks to the inherent nature of consistency model training and parallel decoding inference, WAM-Flow achieves superior closed-loop performance against autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 89.1 PDMS and 5-step inference reaching 90.3 PDMS on NAVSIM v1 benchmark. These results establish discrete flow matching as a new promising paradigm for end-to-end autonomous driving. The code will be publicly available soon.
[CV-125] CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering
[Quick Read]: This paper addresses the accuracy of camera localization for robotics and Extended Reality (XR), where existing visual methods are often limited by accumulated drift, scale ambiguity, and dependence on fiducials or loop closure. The key to the solution is localizing against a pre-captured, highly accurate colored LiDAR point cloud: synthetic views rendered from the cloud in real time establish 2D-3D correspondences between live frames and the point cloud, yielding drift-free camera tracking at correct metric scale in the global LiDAR coordinate system. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts and making feature matching markedly more robust and precise.
Link: https://arxiv.org/abs/2511.16349
Authors: Joni Vanherck, Steven Moonen, Brent Zoomers, Kobe Werner, Jeroen Put, Lode Jorissen, Nick Michiels
Affiliations: Hasselt University - Digital Future Lab - Flanders Make
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:
Abstract:Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift, scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.
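The paper recovers the camera pose from 2D-3D correspondences between live frames and the LiDAR cloud but does not name the pose solver. The sketch below uses a standard OpenCV PnP-RANSAC step as an assumption about that stage; the function name and thresholds are hypothetical.

```python
import cv2
import numpy as np

def register_camera(pts3d, pts2d, K):
    """Pose in the LiDAR frame from matched 3D points (N, 3) and pixels (N, 2)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts2d.astype(np.float32),
        K.astype(np.float32), None, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent matches")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle to rotation matrix
    return R, tvec              # metric scale is inherited from the LiDAR cloud
```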
[CV-126] IPR-1: Interactive Physical Reasoner
[Quick Read]: This paper asks how an agent can acquire human-like physical reasoning through interaction and keep improving with experience, especially across diverse environments with large visual domain gaps. Existing approaches such as vision-language models (VLMs) and world models struggle to capture underlying physics and causality: the former lack look-ahead in interactive settings, while the latter imitate visual patterns rather than analyzing physical mechanisms. The key to the solution is the Interactive Physical Reasoner (IPR), which uses world-model rollouts to score and reinforce a VLM's policy, together with PhysCode, a physics-centric action encoding that aligns semantic intent with dynamics and provides a shared action space for prediction and reasoning. Pretrained on 1,000+ heterogeneous games, IPR improves steadily from primitive intuition to goal-driven reasoning and transfers zero-shot to unseen games, supporting physics-centric interaction as a path to continually stronger physical reasoning.
Link: https://arxiv.org/abs/2511.15407
Authors: Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li
Affiliations: RHOS Lab, Shanghai Jiao Tong University; Shanghai Innovation Institute; Carnegie Mellon University
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages of main text and 19 pages of appendices. Project page: this https URL
Abstract:Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM’s policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at this https URL.
[CV-127] WaveSim: A Wavelet-based Multi-scale Similarity Metric for Weather and Climate Fields
[Quick Read]: This paper addresses the inability of traditional point-wise similarity metrics to attribute errors in weather and climate spatial fields to physical scales or modes of dissimilarity, which limits their diagnostic value. The key to the solution is WaveSim, a multi-scale similarity metric that uses wavelet transforms to decompose input fields into scale-specific wavelet coefficients and builds three orthogonal components on top of them: Magnitude (quantifying similarity of the energy distribution), Displacement (capturing spatial shifts), and Structure (assessing pattern organization independent of location and amplitude). Each component yields a scale-specific similarity score between 0 and 1, and the scores are combined across scales into an overall similarity measure. The framework is interpretable, lets users emphasize particular scales or components, and supports model intercomparison, evaluation, calibration, and the training of forecasting systems.
Link: https://arxiv.org/abs/2512.14656
Authors: Gabriele Accarino, Viviana Acquaviva, Sara Shamekh, Duncan Watson-Parris, David Lawrence
Affiliations: Unknown
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:We introduce WaveSim, a multi-scale similarity metric for the evaluation of spatial fields in weather and climate applications. WaveSim exploits wavelet transforms to decompose input fields into scale-specific wavelet coefficients. The metric is built by multiplying three orthogonal components derived from these coefficients: Magnitude, which quantifies similarities in the energy distribution of the coefficients, i.e., the intensity of the field; Displacement, which captures spatial shift by comparing the centers of mass of normalized energy distributions; and Structure, which assesses pattern organization independent of location and amplitude. Each component yields a scale-specific similarity score ranging from 0 (no similarity) to 1 (perfect similarity), which are then combined across scales to produce an overall similarity measure. We first evaluate WaveSim using synthetic test cases, applying controlled spatial and temporal perturbations to systematically assess its sensitivity and expected behavior. We then demonstrate its applicability to physically relevant case studies of key modes of climate variability in Earth System Models. Traditional point-wise metrics lack a mechanism for attributing errors to physical scales or modes of dissimilarity. By operating in the wavelet domain and decomposing the signal along independent axes, WaveSim bypasses these limitations and provides an interpretable and diagnostically rich framework for assessing similarity in complex fields. Additionally, the WaveSim framework allows users to place emphasis on a specific scale or component, and lends itself to user-specific model intercomparison, model evaluation, and calibration and training of forecasting systems. We provide a PyTorch-ready implementation of WaveSim, along with all evaluation scripts, at: this https URL.
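WaveSim's exact component formulas live in the paper and its released code. The sketch below only reproduces the scale-wise wavelet decomposition with PyWavelets, plus one plausible energy-based, Magnitude-style score in [0, 1]; both the wavelet choice and the score formula are assumptions for illustration.

```python
import numpy as np
import pywt

def detail_energies(field, wavelet="haar", level=3):
    """Energy of the detail coefficients at each scale of a 2D field."""
    coeffs = pywt.wavedec2(np.asarray(field, dtype=float), wavelet, level=level)
    return [sum(float((c ** 2).sum()) for c in detail)
            for detail in coeffs[1:]]  # skip the coarse approximation

def magnitude_like_score(a, b, **kw):
    """Toy per-scale energy similarity in [0, 1]; not WaveSim's exact formula."""
    ea, eb = detail_energies(a, **kw), detail_energies(b, **kw)
    return [2.0 * np.sqrt(x * y) / (x + y + 1e-12) for x, y in zip(ea, eb)]
```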
[CV-128] Test Time Optimized Generalized AI-based Medical Image Registration Method
[Quick Read]: This paper addresses two core obstacles to deploying non-rigid registration (NRR) of medical images in clinical practice: traditional methods require extensive parameter tuning and heavy computation, failing real-time demands, while deep learning methods, though promising, depend on task-specific retraining and do not generalize across modalities and anatomical regions. The key to the solution is a novel AI-driven 3D non-rigid registration framework that needs no anatomy- or modality-specific customization, so a single model adapts across imaging modalities (e.g., CT, MRI, ultrasound) and anatomical regions, markedly improving the efficiency and flexibility of clinical deployment.
Link: https://arxiv.org/abs/2512.14556
Authors: Sneha Sree C., Dattesh Shanbhag, Sudhanya Chatterjee
Affiliations: GE HealthCare
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Medical image registration is critical for aligning anatomical structures across imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. Among existing techniques, non-rigid registration (NRR) is particularly challenging due to the need to capture complex anatomical deformations caused by physiological processes like respiration or contrast-induced signal variations. Traditional NRR methods, while theoretically robust, often require extensive parameter tuning and incur high computational costs, limiting their use in real-time clinical workflows. Recent deep learning (DL)-based approaches have shown promise; however, their dependence on task-specific retraining restricts scalability and adaptability in practice. These limitations underscore the need for efficient, generalizable registration frameworks capable of handling heterogeneous imaging contexts. In this work, we introduce a novel AI-driven framework for 3D non-rigid registration that generalizes across multiple imaging modalities and anatomical regions. Unlike conventional methods that rely on application-specific models, our approach eliminates anatomy- or modality-specific customization, enabling streamlined integration into diverse clinical environments.
[CV-129] Improving the Plausibility of Pressure Distributions Synthesized from Depth through Generative Modeling
[Quick Read]: This paper addresses the lack of physical plausibility in pressure-map prediction for hospital-bed contact-pressure monitoring, which undermines clinical reliability: existing methods estimate pressure distributions but rarely guarantee consistency with body mechanics. The key to the solution is a generative-modeling framework that enforces physical consistency through an Informed Latent Space (ILS) and a Weight Optimization Loss (WOL); it further applies a diffusion-based conditional Brownian Bridge Diffusion Model (BBDM) and proposes a training strategy for its latent counterpart, the Latent Brownian Bridge Diffusion Model (LBBDM), for high-fidelity, fast-inference pressure synthesis, particularly suited to non-invasive, vision-based monitoring of lying postures.
Link: https://arxiv.org/abs/2512.13757
Authors: Neevkumar Manavar, Hanno Gerd Meyer, Joachim Waßmuth, Barbara Hammer, Axel Schneider
Affiliations: Hochschule Bielefeld; CITEC, Bielefeld University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Monitoring contact pressure in hospital beds is essential for preventing pressure ulcers and enabling real-time patient assessment. Current methods can predict pressure maps but often lack physical plausibility, limiting clinical reliability. This work proposes a framework that enhances plausibility via Informed Latent Space (ILS) and Weight Optimization Loss (WOL) with generative modeling to produce high-fidelity, physically consistent pressure estimates. This study also applies diffusion based conditional Brownian Bridge Diffusion Model (BBDM) and proposes training strategy for its latent counterpart Latent Brownian Bridge Diffusion Model (LBBDM) tailored for pressure synthesis in lying postures. Experiment results shows proposed method improves physical plausibility and performance over baselines: BBDM with ILS delivers highly detailed maps at higher computational cost and large inference time, whereas LBBDM provides faster inference with competitive performance. Overall, the approach supports non-invasive, vision-based, real-time patient monitoring in clinical environments.
Artificial Intelligence
[AI-0] Universal Reasoning Model
[Quick Read]: This paper investigates the unclear sources of the performance gains of Universal Transformers (UTs) on complex reasoning tasks such as ARC-AGI and Sudoku. A systematic analysis of UT variants shows that improvements on ARC-AGI arise mainly from the Transformer's recurrent inductive bias and strong nonlinear components, rather than from elaborate architectural designs. Building on this finding, the authors propose the Universal Reasoning Model (URM), whose key changes are the addition of short convolution and truncated backpropagation; these substantially improve reasoning, reaching 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2, surpassing existing methods.
Link: https://arxiv.org/abs/2512.14693
Authors: Zitian Gao, Lynx Chen, Yihao Xiao, He Xing, Ran Tao, Haoming Luo, Joey Zhou, Bryan Dai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Universal transformers (UTs) have been widely used for complex reasoning tasks such as ARC-AGI and Sudoku, yet the specific sources of their performance gains remain underexplored. In this work, we systematically analyze UT variants and show that improvements on ARC-AGI primarily arise from the recurrent inductive bias and strong nonlinear components of the Transformer, rather than from elaborate architectural designs. Motivated by this finding, we propose the Universal Reasoning Model (URM), which enhances the UT with short convolution and truncated backpropagation. Our approach substantially improves reasoning performance, achieving state-of-the-art 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. Our code is available at this https URL.
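The abstract names two ingredients, short convolution and truncated backpropagation, without further detail. The sketch below shows one way these could be wired into a weight-shared recurrent block; the kernel size, block layout, and truncation interval are all assumed, not taken from the paper.

```python
import torch.nn as nn

class RecurrentBlock(nn.Module):
    """Weight-shared UT-style block with a short depthwise conv (illustrative)."""
    def __init__(self, d, k=4):
        super().__init__()
        self.conv = nn.Conv1d(d, d, k, padding=k - 1, groups=d)  # short conv
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                 nn.Linear(4 * d, d))

    def forward(self, x):  # x: (B, T, d)
        c = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.mlp(c)

def run_recurrent(block, x, steps=8, bptt=2):
    """Apply the same block repeatedly, detaching to truncate backprop."""
    for t in range(steps):
        if t and t % bptt == 0:
            x = x.detach()  # truncated backpropagation through the recurrence
        x = block(x)
    return x
```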
[AI-1] Bias-Variance Trade-off for Clipped Stochastic First-Order Methods: From Bounded Variance to Infinite Mean
[Quick Read]: This paper studies the oracle complexity of stochastic first-order methods (SFOMs) under heavy-tailed noise with tail index α ∈ (0,2]. Existing theory mostly covers α ∈ (1,2] (noise with a finite mean) and gives bounds that blow up as α approaches 1, while the regime α ∈ (0,1] (noise without a finite mean) has hardly been studied. The key to the solution is a novel bias-variance trade-off analysis of gradient clipping: when a symmetry measure of the noise tail is controlled, clipped SFOMs attain improved complexity guarantees for any α ∈ (0,2]. The analysis unifies complexity bounds across tail indices, is straightforward to apply, and composes with classical light-tailed analyses, providing solid theoretical support for the heavy-tailed setting.
Link: https://arxiv.org/abs/2512.14686
Authors: Chuan He
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
Comments:
Abstract:Stochastic optimization is fundamental to modern machine learning. Recent research has extended the study of stochastic first-order methods (SFOMs) from light-tailed to heavy-tailed noise, which frequently arises in practice, with clipping emerging as a key technique for controlling heavy-tailed gradients. Extensive theoretical advances have further shown that the oracle complexity of SFOMs depends on the tail index α of the noise. Nonetheless, existing complexity results often cover only the case α ∈ (1,2], that is, the regime where the noise has a finite mean, while the complexity bounds tend to infinity as α approaches 1. This paper tackles the general case of noise with tail index α ∈ (0,2], covering regimes ranging from noise with bounded variance to noise with an infinite mean, where the latter case has been scarcely studied. Through a novel analysis of the bias-variance trade-off in gradient clipping, we show that when a symmetry measure of the noise tail is controlled, clipped SFOMs achieve improved complexity guarantees in the presence of heavy-tailed noise for any tail index α ∈ (0,2]. Our analysis of the bias-variance trade-off not only yields new unified complexity guarantees for clipped SFOMs across this full range of tail indices, but is also straightforward to apply and can be combined with classical analyses under light-tailed noise to establish oracle complexity guarantees under heavy-tailed noise. Finally, numerical experiments validate our theoretical findings.
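For concreteness, the clipping operator whose bias-variance trade-off the paper analyzes, scaling the stochastic gradient down to a norm threshold, looks like the step below. The threshold schedule and the surrounding method are the paper's subject and are not reproduced; this is a minimal sketch assuming a fixed threshold tau and non-empty gradients.

```python
import torch

def clipped_sgd_step(params, lr, tau):
    """One SGD step with global-norm gradient clipping at threshold tau."""
    grads = [p.grad for p in params if p.grad is not None]
    gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, tau / (float(gnorm) + 1e-12))  # clipping biases the update
    with torch.no_grad():                           # but bounds heavy-tailed noise
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr * scale)
```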
[AI-2] gridfm-datakit-v1: A Python Library for Scalable and Realistic Power Flow and Optimal Power Flow Data Generation
[Quick Read]: This paper addresses three limitations of existing Power Flow (PF) and Optimal Power Flow (OPF) datasets for training machine learning (ML) solvers: (1) a lack of realistic stochastic load and topology perturbations, which limits scenario diversity; (2) PF datasets restricted to OPF-feasible points, which hinders generalization to cases that violate operating limits (e.g., branch overloads or voltage violations); and (3) OPF datasets with fixed generator cost functions, which limits generalization across cost scenarios. The key to the solution is gridfm-datakit: it combines global scaling of real-world load profiles with localized noise and supports arbitrary N-k topology perturbations to generate diverse yet realistic PF/OPF data; it deliberately generates PF samples beyond operating limits to harden ML solvers; and it produces OPF data with varying generator cost functions. The library scales efficiently to large grids (up to 10,000 buses) and is compared against tools such as OPFData, OPF-Learn, PGLearn, and PFΔ.
Link: https://arxiv.org/abs/2512.14658
Authors: Alban Puech, Matteo Mazzonelli, Celia Cintas, Tamara R. Govindasamy, Mangaliso Mngomezulu, Jonas Weiss, Matteo Baù, Anna Varbella, François Mirallès, Kibaek Kim, Le Xie, Hendrik F. Hamann, Etienne Vos, Thomas Brunschwiler
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Main equal contributors: Alban Puech, Matteo Mazzonelli. Other equal contributors: Celia Cintas, Tamara R. Govindasamy, Mangaliso Mngomezulu, Jonas Weiss
Abstract:We introduce gridfm-datakit-v1, a Python library for generating realistic and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets for training Machine Learning (ML) solvers. Existing datasets and libraries face three main challenges: (1) lack of realistic stochastic load and topology perturbations, limiting scenario diversity; (2) PF datasets are restricted to OPF-feasible points, hindering generalization of ML solvers to cases that violate operating limits (e.g., branch overloads or voltage violations); and (3) OPF datasets use fixed generator cost functions, limiting generalization across varying costs. gridfm-datakit addresses these challenges by: (1) combining global load scaling from real-world profiles with localized noise and supporting arbitrary N-k topology perturbations to create diverse yet realistic datasets; (2) generating PF samples beyond operating limits; and (3) producing OPF data with varying generator costs. It also scales efficiently to large grids (up to 10,000 buses). Comparisons with OPFData, OPF-Learn, PGLearn, and PFΔ are provided. Available on GitHub at this https URL under Apache 2.0 and via pip install gridfm-datakit.
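The numpy toy below illustrates the load-perturbation recipe the abstract describes, global scaling from a real-world profile plus localized per-bus noise. It is not the gridfm-datakit API, whose actual entry points are documented in the repository; the noise model and all names here are assumptions.

```python
import numpy as np

def perturb_loads(base_load, profile_scale, noise_sigma, rng):
    """Scale nominal bus loads by a real-world profile value, then apply
    localized lognormal noise per bus (illustrative only, not the library API)."""
    local_noise = rng.lognormal(mean=0.0, sigma=noise_sigma, size=base_load.shape)
    return base_load * profile_scale * local_noise

rng = np.random.default_rng(0)
loads = perturb_loads(np.array([10.0, 25.0, 40.0]),  # MW per bus
                      profile_scale=1.2, noise_sigma=0.1, rng=rng)
```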
[AI-3] MuseCPBench: an Empirical Study of Music Editing Methods through Music Context Preservation
[Quick Read]: This paper addresses the insufficient protection, in current music editing methods, of the musical context that should remain unchanged when a specific attribute is edited, a property termed Music Context Preservation (MCP). Existing studies use inconsistent evaluation protocols and metrics, making comparisons unreliable and unfair. The key to the solution is MuseCPBench, the first benchmark dedicated to evaluating MCP: it covers four categories of musical facets and enables systematic comparison of five representative music editing baselines. The standardized, multi-dimensional evaluation reveals consistent MCP gaps in mainstream methods, providing empirical evidence and practical guidance for developing editing strategies with stronger context preservation.
Link: https://arxiv.org/abs/2512.14629
Authors: Yash Vishe, Eric Xue, Xunyi Jiang, Zachary Novack, Junda Wu, Julian McAuley, Xin Xu
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:
Abstract:Music editing plays a vital role in modern music production, with applications in film, broadcasting, and game development. Recent advances in music generation models have enabled diverse editing tasks such as timbre transfer, instrument substitution, and genre transformation. However, many existing works overlook the evaluation of their ability to preserve musical facets that should remain unchanged during editing a property we define as Music Context Preservation (MCP). While some studies do consider MCP, they adopt inconsistent evaluation protocols and metrics, leading to unreliable and unfair comparisons. To address this gap, we introduce the first MCP evaluation benchmark, MuseCPBench, which covers four categories of musical facets and enables comprehensive comparisons across five representative music editing baselines. Through systematic analysis along musical facets, methods, and models, we identify consistent preservation gaps in current music editing methods and provide insightful explanations. We hope our findings offer practical guidance for developing more effective and reliable music editing strategies with strong MCP capability
[AI-4] Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
[Quick Read]: This paper addresses reinforcement learning for tasks with long-range temporal dependencies in non-Markovian reward decision processes (NMRDPs), where Markovian RL cannot model the influence of system history on decisions and existing NMRDP methods lack guarantees of near-optimality and sample efficiency. The key to the solution is QR-MAX, a model-based RL algorithm for discrete-action NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines, achieving PAC convergence to ε-optimal policies with polynomial sample complexity. The framework is further extended to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretizer that preserves the factorized structure and avoids manual gridding or function approximation, yielding faster and more stable learning.
Link: https://arxiv.org/abs/2512.14617
Authors: Alessandro Trapasso, Luca Iocchi, Fabio Patrizi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 32 figures, includes appendix
Abstract:Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to ε-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
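Bucket-QR-MAX's SimHash discretiser is described only at a high level. The sketch below shows the usual SimHash construction, taking the signs of fixed random projections as the bucket key; the projection count and all names are assumptions for illustration.

```python
import numpy as np

class SimHashBuckets:
    """Map continuous states to discrete buckets via random projections."""
    def __init__(self, state_dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(n_bits, state_dim))  # fixed projections

    def bucket(self, state):
        bits = (self.proj @ np.asarray(state, dtype=float)) > 0
        return bits.tobytes()  # hashable key for tabular counts and values
```

Nearby states tend to share a sign pattern, so the tabular machinery of QR-MAX can be reused on the bucket keys without hand-designed grids.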
[AI-5] Residual GRU-MHSA: A Lightweight Hybrid Recurrent Attention Model for Cardiovascular Disease Detection
[Quick Read]: This paper addresses early prediction of cardiovascular disease (CVD), where traditional approaches rely on handcrafted features and clinician expertise, and machine learning models generalize poorly on noisy, heterogeneous clinical data. The key to the solution is a lightweight hybrid architecture, Residual GRU with Multi-Head Self-Attention: residual bidirectional gated recurrent units model sequential dependencies across feature columns, a channel reweighting module strengthens salient features, and multi-head self-attention pooling with a learnable classification token captures global context, improving accuracy and robustness while remaining efficient.
Link: https://arxiv.org/abs/2512.14563
Authors: Tejaswani Dash, Gautam Datla, Anudeep Vurity, Tazeem Ahmad, Mohd Adnan, Saima Rafi, Saisha Patro, Saina Patro
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE BigData 2025 - Learning Representations with Limited Supervision
Abstract:Cardiovascular disease (CVD) remains the leading cause of mortality worldwide, underscoring the need for reliable and efficient predictive tools that support early intervention. Traditional diagnostic approaches rely on handcrafted features and clinician expertise, while machine learning methods improve reproducibility but often struggle to generalize across noisy and heterogeneous clinical data. In this work, we propose Residual GRU with Multi-Head Self-Attention, a compact deep learning architecture designed for tabular clinical records. The model integrates residual bidirectional gated recurrent units for sequential modeling of feature columns, a channel reweighting block, and multi-head self-attention pooling with a learnable classification token to capture global context. We evaluate the model on the UCI Heart Disease dataset using 5-fold stratified cross-validation and compare it against classical methods such as Logistic Regression, Random Forest, and Support Vector Machines, as well as modern deep learning baselines including DeepMLP, convolutional networks, recurrent networks, and Transformers. The proposed model achieves an accuracy of 0.861, macro-F1 of 0.860, ROC-AUC of 0.908, and PR-AUC of 0.904, outperforming all baselines. Ablation studies confirm the individual contributions of residual recurrence, channel gating, and attention pooling. t-SNE visualizations further indicate that the learned embeddings exhibit clearer separation between disease and non-disease classes compared to raw features. These results demonstrate that lightweight hybrid recurrent and attention-based architectures provide a strong balance between accuracy and efficiency for clinical risk prediction, supporting deployment in resource-constrained healthcare settings.
[AI-6] Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence
[Quick Read]: This paper addresses the limitation that common learning rate schedulers (such as cosine annealing or exponential decay) follow fixed schedules and cannot adapt the learning rate across tasks and model scales to optimize convergence speed and final accuracy. The key to the solution is GreedyLR, a novel adaptive scheduler that adjusts the learning rate during training based on the current loss, improving training efficiency and robustness without manual tuning. The authors prove convergence, derive the optimal scaling factor F that maximizes the convergence rate, and validate the method on a range of NLP, CV, and LLM tasks, including models with up to 7B parameters.
Link: https://arxiv.org/abs/2512.14527
Authors: Shreyas Subramanian, Bala Krishnamoorthy, Pranav Murthy
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite significant advances in optimizers for training, most research works use common scheduler choices like Cosine or exponential decay. In this paper, we study GreedyLR, a novel scheduler that adaptively adjusts the learning rate during training based on the current loss. To validate the effectiveness of our proposed scheduler, we conduct experiments on several NLP, CV, and LLM tasks with up to 7B parameters, including both fine-tuning and pre-training experiments. The results show that our approach outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence. We also provide a theoretical analysis of the GreedyLR algorithm, including a proof of convergence and derivation of the optimal scaling factor F that maximizes the convergence rate, along with experiments to show robustness of the algorithm to realistic noisy landscapes. Our scheduler is easy to implement, computationally efficient, and could be considered a good default scheduler for training.
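The abstract says GreedyLR adapts the learning rate from the current loss but does not print the update rule. The scheduler below is a minimal greedy sketch in that spirit, grow the LR after an improving step and shrink it otherwise, with every constant assumed rather than taken from the paper.

```python
class GreedyLRSketch:
    """Loss-reactive LR scheduler (illustrative; not the paper's exact rule)."""
    def __init__(self, optimizer, up=1.05, down=0.9, min_lr=1e-6, max_lr=1.0):
        self.opt, self.up, self.down = optimizer, up, down
        self.min_lr, self.max_lr = min_lr, max_lr
        self.best = float("inf")

    def step(self, loss):
        factor = self.up if loss < self.best else self.down  # greedy decision
        self.best = min(self.best, loss)
        for group in self.opt.param_groups:
            group["lr"] = min(self.max_lr,
                              max(self.min_lr, group["lr"] * factor))
```

Calling `scheduler.step(loss.item())` once per training step keeps the LR within [min_lr, max_lr] while reacting to the observed loss trajectory.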
[AI-7] Sparse Multi-Modal Transformer with Masking for Alzheimer's Disease Classification
[Quick Read]: This paper addresses the high computational and energy cost of Transformer-based multimodal intelligent systems caused by dense self-attention, which limits scalability under resource constraints. The key to the solution is SMMT, a sparse multimodal Transformer that introduces cluster-based sparse attention for near-linear computational complexity and modality-wise masking for robustness to incomplete inputs, preserving predictive performance while markedly reducing training time, memory usage, and energy consumption.
Link: https://arxiv.org/abs/2512.14491
Authors: Cheng-Han Lu, Pei-Hsuan Tsai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures
Abstract:Transformer-based multi-modal intelligent systems often suffer from high computational and energy costs due to dense self-attention, limiting their scalability under resource constraints. This paper presents SMMT, a sparse multi-modal transformer architecture designed to improve efficiency and robustness. Building upon a cascaded multi-modal transformer framework, SMMT introduces cluster-based sparse attention to achieve near linear computational complexity and modality-wise masking to enhance robustness against incomplete inputs. The architecture is evaluated using Alzheimer’s Disease classification on the ADNI dataset as a representative multi-modal case study. Experimental results show that SMMT maintains competitive predictive performance while significantly reducing training time, memory usage, and energy consumption compared to dense attention baselines, demonstrating its suitability as a resource-aware architectural component for scalable intelligent systems.
[AI-8] Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling
[Quick Read]: This paper addresses the frequent constraint violations and inconsistent solutions of large language models (LLMs) on complex multi-step planning tasks. Existing methods such as Chain-of-Thought and ReAct rely on implicit state tracking and lack an explicit problem representation. The key to the solution is Model-First Reasoning (MFR), a two-phase paradigm: the LLM first constructs an explicit model of the problem, defining entities, state variables, actions, and constraints, and only then generates a solution plan. Across domains including medical scheduling, route planning, resource allocation, logic puzzles, and procedural synthesis, MFR significantly reduces constraint violations and improves solution quality over Chain-of-Thought and ReAct; ablations show the explicit modeling phase drives the gains, suggesting that many LLM planning failures stem from representational deficiencies rather than reasoning limitations.
Link: https://arxiv.org/abs/2512.14474
Authors: Annu Rana, Gaurav Kumar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) often struggle with complex multi-step planning tasks, showing high rates of constraint violations and inconsistent solutions. Existing strategies such as Chain-of-Thought and ReAct rely on implicit state tracking and lack an explicit problem representation. Inspired by classical AI planning, we propose Model-First Reasoning (MFR), a two-phase paradigm in which the LLM first constructs an explicit model of the problem, defining entities, state variables, actions, and constraints, before generating a solution plan. Across multiple planning domains, including medical scheduling, route planning, resource allocation, logic puzzles, and procedural synthesis, MFR reduces constraint violations and improves solution quality compared to Chain-of-Thought and ReAct. Ablation studies show that the explicit modeling phase is critical for these gains. Our results suggest that many LLM planning failures stem from representational deficiencies rather than reasoning limitations, highlighting explicit modeling as a key component for robust and interpretable AI agents. All prompts, evaluation procedures, and task datasets are documented to facilitate reproducibility.
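MFR's two phases map naturally onto two prompts: first elicit an explicit model, then plan against it. The templates below are illustrative paraphrases of that structure, not the paper's released prompts.

```python
MODEL_PROMPT = (
    "Phase 1 - Model the problem only. List entities, state variables, "
    "actions with preconditions and effects, and all constraints. "
    "Do not produce a plan yet.\nTask: {task}"
)

PLAN_PROMPT = (
    "Phase 2 - Using ONLY the model below, produce a step-by-step plan and "
    "verify every constraint before finalizing.\nModel:\n{model}\nTask: {task}"
)

# Usage sketch: model_text = llm(MODEL_PROMPT.format(task=task))
#               plan_text  = llm(PLAN_PROMPT.format(model=model_text, task=task))
```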
[AI-9] Context-Picker: Dynamic context selection using multi-stage reinforcement learning
[Quick Read]: This paper addresses how to choose the optimal amount of context in long-context question answering (LCQA): too few passages omit critical information, while too many introduce noise. Fixed Top-K retrieval and single-stage reranking struggle to balance quantity against quality, especially for factoid questions that need only a few key pieces of evidence. The key to the solution is Context-Picker, a reasoning-aware framework that reframes context selection from similarity ranking to minimal sufficient subset selection, optimized with a human-inspired two-stage reinforcement learning schedule: a recall-oriented stage that prioritizes coverage of reasoning chains, followed by a precision-oriented stage that prunes redundancy with a redundancy-aware reward to distill a compact evidence set. To resolve reward sparsity, an offline evidence-distillation pipeline mines "minimal sufficient sets" via a Leave-One-Out (LOO) procedure, providing dense, task-aligned supervision for more efficient and accurate context selection.
Link: https://arxiv.org/abs/2512.14465
Authors: Siyuan Zhu, Chengdong Xu, Kaiqiang Ke, Chao Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:In long-context question answering (LCQA), determining the optimal amount of context for a given query is a significant challenge. Including too few passages may omit critical information, while including too many can introduce noise and reduce the quality of the answer. Traditional approaches, such as fixed Top-K retrieval and single-stage reranking, face the dilemma of selecting the right number of passages. This problem is particularly pronounced for factoid questions, which often require only a few specific pieces of evidence. To address this issue, we introduce Context-Picker, a reasoning-aware framework that shifts the paradigm from similarity-based ranking to minimal sufficient subset selection. Context-Picker treats context selection as a decision-making process optimized via a human-inspired, two-stage reinforcement learning schedule: a recall-oriented stage that prioritizes the coverage of reasoning chains, followed by a precision-oriented stage that aggressively prunes redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines “minimal sufficient sets” via a Leave-One-Out (LOO) procedure, providing dense, task-aligned supervision. Experiments on five long-context and multi-hop QA benchmarks demonstrate that Context-Picker significantly outperforms strong RAG baselines, achieving superior answer accuracy with comparable or reduced context lengths. Ablation studies indicate that the coarse-to-fine optimization schedule, the redundancy-aware reward shaping, and the rationale-guided format all contribute substantially to these gains.
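The offline LOO mining of "minimal sufficient sets" admits a simple greedy reading: drop any passage whose removal leaves the question answerable. One such sketch follows, where `answers_correctly` is a hypothetical helper that runs the QA model on a passage subset and checks the gold answer; the paper's actual procedure may differ in ordering and stopping criteria.

```python
def loo_minimal_set(passages, answers_correctly):
    """Greedy Leave-One-Out pruning toward a minimal sufficient subset."""
    kept = list(passages)
    for p in list(kept):
        trial = [q for q in kept if q is not p]
        if trial and answers_correctly(trial):  # still answerable without p
            kept = trial                        # p is redundant; drop it
    return kept
```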
[AI-10] Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space
[Quick Read]: This paper identifies a new class of security threat for LLM agents that rely on external retrieval in high-stakes settings: manipulation of the agent's reasoning style. Whereas traditional adversarial attacks focus on content falsification or instruction injection, this work is the first to treat the reasoning process itself as an attack surface, proposing Reasoning-Style Poisoning (RSP). Its Generative Style Injection (GSI) rewrites retrieved documents into pathological tones (such as "analysis paralysis" or "cognitive haste") without altering the underlying facts, thereby derailing the agent's decision process. The key contributions are the Reasoning Style Vector (RSV), which quantifies verification depth, self-confidence, and attention focus, and RSP-M, a lightweight runtime monitor built on it that detects anomalous reasoning patterns and raises alerts, pushing defenses from static content filtering toward dynamic, process-aware monitoring.
Link: https://arxiv.org/abs/2512.14448
Authors: Xingfu Zhou, Pengfei Wang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Model (LLM) agents relying on external retrieval are increasingly deployed in high-stakes environments. While existing adversarial attacks primarily focus on content falsification or instruction injection, we identify a novel, process-oriented attack surface: the agent’s reasoning style. We propose Reasoning-Style Poisoning (RSP), a paradigm that manipulates how agents process information rather than what they process. We introduce Generative Style Injection (GSI), an attack method that rewrites retrieved documents into pathological tones–specifically “analysis paralysis” or “cognitive haste”–without altering underlying facts or using explicit triggers. To quantify these shifts, we develop the Reasoning Style Vector (RSV), a metric tracking Verification depth, Self-confidence, and Attention focus. Experiments on HotpotQA and FEVER using ReAct, Reflection, and Tree of Thoughts (ToT) architectures reveal that GSI significantly degrades performance. It increases reasoning steps by up to 4.4 times or induces premature errors, successfully bypassing state-of-the-art content filters. Finally, we propose RSP-M, a lightweight runtime monitor that calculates RSV metrics in real-time and triggers alerts when values exceed safety thresholds. Our work demonstrates that reasoning style is a distinct, exploitable vulnerability, necessitating process-level defenses beyond static content analysis.
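RSP-M is described as thresholding RSV metrics at runtime. A minimal monitor in that spirit follows; the dimension names come from the abstract, while the data layout and all threshold values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RSV:
    verification_depth: float
    self_confidence: float
    attention_focus: float

def rsp_monitor(rsv, bounds):
    """Return alerts for RSV dimensions outside their safe [lo, hi] band."""
    alerts = []
    for name, (lo, hi) in bounds.items():
        value = getattr(rsv, name)
        if not lo <= value <= hi:
            alerts.append(f"{name}={value:.2f} outside [{lo}, {hi}]")
    return alerts

# Hypothetical bands; e.g. runaway verification depth suggests analysis paralysis.
alerts = rsp_monitor(RSV(9.5, 0.2, 0.4),
                     {"verification_depth": (1.0, 6.0),
                      "self_confidence": (0.3, 0.9),
                      "attention_focus": (0.3, 1.0)})
```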
[AI-11] Seismology modeling agent : A smart assistant for geophysical researchers
[Quick Read]: This paper addresses the steep learning curve and reliance on complex manual file editing and command-line operations in the traditional workflow of SPECFEM, the mainstream open-source seismic wave simulation software. The key to the solution is an intelligent, interactive workflow powered by large language models (LLMs) and the first Model Context Protocol (MCP) server suite for SPECFEM, which decomposes the simulation pipeline into discrete, agent-executable tools spanning parameter generation, mesh partitioning, solver execution, and visualization. This shifts the interaction paradigm from file-driven to intent-driven conversation, supports both fully automated execution and human-in-the-loop collaboration, and preserves researchers' control over scientific decisions while eliminating tedious low-level operations.
Link: https://arxiv.org/abs/2512.14429
Authors: Yukun Ren, Siwei Yu, Kai Chen, Jianwei Ma
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 26 pages, 15 figures. Code available at this https URL
Abstract:To address the steep learning curve and reliance on complex manual file editing and command-line operations in the traditional workflow of the mainstream open-source seismic wave simulation software SPECFEM, this paper proposes an intelligent, interactive workflow powered by Large Language Models (LLMs). We introduce the first Model Context Protocol (MCP) server suite for SPECFEM (supporting 2D, 3D Cartesian, and 3D Globe versions), which decomposes the entire simulation process into discrete, agent-executable tools spanning from parameter generation and mesh partitioning to solver execution and visualization. This approach enables a paradigm shift from file-driven to intent-driven conversational interactions. The framework supports both fully automated execution and human-in-the-loop collaboration, allowing researchers to guide simulation strategies in real time and retain scientific decision-making authority while significantly reducing tedious low-level operations. Validated through multiple case studies, the workflow operates seamlessly in both autonomous and interactive modes, yielding high-fidelity results consistent with standard baselines. As the first application of MCP technology to computational seismology, this study significantly lowers the entry barrier, enhances reproducibility, and offers a promising avenue for advancing computational geophysics toward AI-assisted and automated scientific research. The complete source code is available at this https URL.
[AI-12] PortAgent: LLM-driven Vehicle Dispatching Agent for Port Terminals
[Quick Read]: This paper addresses the poor transferability of Vehicle Dispatching Systems (VDSs) across automated container terminals (ACTs), which stems from heavy reliance on port-operations specialists, high demand for terminal-specific data, and time-consuming manual deployment. The key to the solution is PortAgent, an LLM-driven vehicle dispatching agent that automates the entire transfer workflow through a Virtual Expert Team (VET) of four modules, a Knowledge Retriever, Modeler, Coder, and Debugger, which acquire VDS-domain knowledge from a few examples via few-shot example learning and reduce dependence on terminal data through Retrieval-Augmented Generation (RAG). An automatic design workflow with a self-correction loop inspired by the LLM Reflexion framework removes extra manual intervention, significantly improving the efficiency and portability of VDS transfer.
Link: https://arxiv.org/abs/2512.14417
Authors: Jia Hu, Junqi Li, Weimeng Lin, Peng Jia, Yuxiong Ji, Jintao Lai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Vehicle Dispatching Systems (VDSs) are critical to the operational efficiency of Automated Container Terminals (ACTs). However, their widespread commercialization is hindered due to their low transferability across diverse terminals. This transferability challenge stems from three limitations: high reliance on port operational specialists, a high demand for terminal-specific data, and time-consuming manual deployment processes. Leveraging the emergence of Large Language Models (LLMs), this paper proposes PortAgent, an LLM-driven vehicle dispatching agent that fully automates the VDS transferring workflow. It bears three features: (1) no need for port operations specialists; (2) low need of data; and (3) fast deployment. Specifically, specialist dependency is eliminated by the Virtual Expert Team (VET). The VET collaborates with four virtual experts, including a Knowledge Retriever, Modeler, Coder, and Debugger, to emulate a human expert team for the VDS transferring workflow. These experts specialize in the domain of terminal VDS via a few-shot example learning approach. Through this approach, the experts are able to learn VDS-domain knowledge from a few VDS examples. These examples are retrieved via a Retrieval-Augmented Generation (RAG) mechanism, mitigating the high demand for terminal-specific data. Furthermore, an automatic VDS design workflow is established among these experts to avoid extra manual interventions. In this workflow, a self-correction loop inspired by the LLM Reflexion framework is created
[AI-13] Massive Editing for Large Language Models Based on Dynamic Weight Generation
[Quick Read]: This paper addresses how to perform large-scale knowledge editing (KE) on large language models at low cost (relative to pre-training) while maintaining the Reliability, Generality, and Locality of the edits, three metrics existing methods struggle to satisfy simultaneously at scale. The key to the solution is MeG, a massive editing approach based on dynamic weight generation: a dynamic-weight neuron is attached to specific layers of the LLM, and a diffusion model conditionally generates that neuron's weights from the input query, so large-scale knowledge modification is achieved by adding just one lightweight parameter unit. Experiments show clear gains over existing KE methods on all three metrics, with a particularly large absolute percentage-point improvement in Locality, demonstrating the method's effectiveness and advantages.
Link: https://arxiv.org/abs/2512.14395
Authors: Wentao Wan, Qiqing Lao, Zhiwei Xie, Hefeng Wu, Runnan Lin, Liang Lin, Keze Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages, 8 figures
Abstract:Knowledge Editing (KE) is a field that studies how to modify some knowledge in Large Language Models (LLMs) at a low cost (compared to pre-training). Currently, performing large-scale edits on LLMs while ensuring the Reliability, Generality, and Locality metrics of the edits remain a challenge. This paper proposes a Massive editing approach for LLMs based on dynamic weight Generation (MeG). Our MeG involves attaching a dynamic weight neuron to specific layers of the LLMs and using a diffusion model to conditionally generate the weights of this neuron based on the input query required for the knowledge. This allows the use of adding a single dynamic weight neuron to achieve the goal of large-scale knowledge editing. Experiments show that our MeG can significantly improve the performance of large-scale KE in terms of Reliability, Generality, and Locality metrics compared to existing knowledge editing methods, particularly with a high percentage point increase in the absolute value index for the Locality metric, demonstrating the advantages of our proposed method.
[AI-14] Causal Structure Learning for Dynamical Systems with Theoretical Score Analysis AAAI2026
[Quick Read]: This paper addresses causal discovery for dynamical systems, where existing methods either discretize time, performing poorly on irregularly sampled data, or ignore the underlying causal relationships. The key to the solution is CaDyT, which grounds the problem in difference-based causal models of continuous-time dynamics, uses exact Gaussian Process inference to match the system's continuous evolution, and identifies the causal structure via a greedy search guided by the Algorithmic Markov Condition and the Minimum Description Length principle. CaDyT outperforms state-of-the-art methods on both regularly and irregularly sampled data, recovering causal networks closer to the true underlying dynamics.
Link: https://arxiv.org/abs/2512.14361
Authors: Nicholas Tagliapietra, Katharina Ensinger, Christoph Zimmer, Osman Mian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
Comments: Accepted as Oral at AAAI 2026 Conference
Abstract:Real world systems evolve in continuous-time according to their underlying causal relationships, yet their dynamics are often unknown. Existing approaches to learning such dynamics typically either discretize time – leading to poor performance on irregularly sampled data – or ignore the underlying causality. We propose CaDyT, a novel method for causal discovery on dynamical systems addressing both these challenges. In contrast to state-of-the-art causal discovery methods that model the problem using discrete-time Dynamic Bayesian networks, our formulation is grounded in Difference-based causal models, which allow milder assumptions for modeling the continuous nature of the system. CaDyT leverages exact Gaussian Process inference for modeling the continuous-time dynamics which is more aligned with the underlying dynamical process. We propose a practical instantiation that identifies the causal structure via a greedy search guided by the Algorithmic Markov Condition and Minimum Description Length principle. Our experiments show that CaDyT outperforms state-of-the-art methods on both regularly and irregularly-sampled data, discovering causal networks closer to the true underlying dynamics.
[AI-15] TiCard: Deployable EXPLAIN-only Residual Learning for Cardinality Estimation
【速读】:该论文旨在解决数据库查询优化中基数估计(Cardinality Estimation)这一关键瓶颈问题,传统估算方法往往忽略数据间的相关性,而现有的学习型估算器则通常需要针对特定工作负载训练且集成到优化器中具有侵入性。其解决方案的核心在于提出 TiCard——一个低侵入性的基于校正的框架,它不替代数据库原生估算器,而是通过学习乘法残差校正值来增强其准确性;关键创新点在于仅使用 EXPLAIN 信息提取特征进行在线推理,并利用 EXPLAIN ANALYZE 生成离线标签,从而实现高效、可部署的改进。
链接: https://arxiv.org/abs/2512.14358
作者: Qizhi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 16 pages (w/o references), 4 figures, 10 tables
Abstract:Cardinality estimation is a key bottleneck for cost-based query optimization, yet deployable improvements remain difficult: classical estimators miss correlations, while learned estimators often require workload-specific training pipelines and invasive integration into the optimizer. This paper presents TiCard, a low intrusion, correction-based framework that augments (rather than replaces) a database’s native estimator. TiCard learns multiplicative residual corrections using EXPLAIN-only features, and uses EXPLAIN ANALYZE only for offline labels. We study two practical instantiations: (i) a Gradient Boosting Regressor for sub-millisecond inference, and (ii) TabPFN, an in-context tabular foundation model that adapts by refreshing a small reference set without gradient retraining. On TiDB with TPCH and the Join Order Benchmark, in a low-trace setting (263 executions total; 157 used for learning), TiCard improves operator-level tail accuracy substantially: P90 Q-error drops from 312.85 (native) to 13.69 (TiCard-GBR), and P99 drops from 37,974.37 to 3,416.50 (TiCard-TabPFN), while a join-only policy preserves near-perfect median behavior. We position TiCard as an AI4DB building block focused on deployability: explicit scope, conservative integration policies, and an integration roadmap from offline correction to in-optimizer use.
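TiCard 的核心思想——"在对数域学习乘法残差校正"——可以用下面的玩具示意直接表达(并非官方实现):标签取 log(真实基数/原生估计),推理时用 原生估计 × exp(预测残差) 得到校正值。EXPLAIN 特征(算子类型、谓词数等)与数据生成过程均为假设。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
native_est = rng.lognormal(mean=6.0, sigma=2.0, size=n)   # 原生估计
num_preds = rng.integers(1, 6, size=n)                     # 谓词个数(假设特征)
op_is_join = rng.integers(0, 2, size=n)                    # 是否为连接算子
# 玩具假设:相关谓词越多、且是连接算子时,原生估计偏差越大
true_card = native_est * np.exp(0.6 * num_preds * op_is_join
                                + rng.normal(0, 0.3, size=n))

X = np.column_stack([np.log(native_est), num_preds, op_is_join])
y = np.log(true_card / native_est)                         # 对数域乘法残差标签

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X[:1500], y[:1500])
corrected = native_est[1500:] * np.exp(model.predict(X[1500:]))

def q_error(est, true):
    return np.maximum(est / true, true / est)
print("原生 P90 Q-error:", np.quantile(q_error(native_est[1500:], true_card[1500:]), 0.9))
print("校正 P90 Q-error:", np.quantile(q_error(corrected, true_card[1500:]), 0.9))
```

在对数域回归的好处是乘法误差变成加法误差,与 Q-error 这种比值型指标天然对齐。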
zh
[AI-16] Criminal Liability in AI-Enabled Autonomous Vehicles: A Comparative Study
【速读】:该论文旨在解决自动驾驶车辆(AV)在发生交通违规时引发的复杂刑事责任归属问题,特别是在不同司法管辖区法律体系差异导致责任认定模糊、监管碎片化的情况下。研究通过比较美国、德国、英国、中国和印度等技术先进且监管模式各异的国家或地区的主要法规、实际责任索赔案例及学术文献,系统分析了人类错误归因、AI道德主体性以及事故中首要责任方的识别机制。其解决方案的关键在于提出全球统一的法律标准框架,以促进技术创新并确保最低风险水平下明确的责任划分,从而应对当前跨国自动驾驶发展带来的法律挑战。
链接: https://arxiv.org/abs/2512.14330
作者: Sahibpreet Singh,Manjit Singh
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Published in Journal of University Institute of Legal Studies, Vol. 18, Issue 1, pp. 57-78, 2025
Abstract:AI revolutionizes transportation through autonomous vehicles (AVs) but introduces complex criminal liability issues regarding infractions. This study employs a comparative legal analysis of primary statutes, real-world liability claims, and academic literature across the US, Germany, UK, China, and India; jurisdictions selected for their technological advancement and contrasting regulatory approaches. The research examines the attribution of human error, AI moral agency, and the identification of primary offenders in AV incidents. Findings reveal fragmented regulatory landscapes: India and the US rely on loose networks of state laws, whereas the UK enacted the pioneering Automated and Electric Vehicles Act 2018. Germany enforces strict safety standards, distinguishing liability based on the vehicle’s operating mode, while China similarly aims for a stringent liability regime. The study concludes that globally harmonized legal standards are essential to foster technological innovation while ensuring minimum risk and clear liability attribution.
zh
[AI-17] A data-physics hybrid generative model for patient-specific post-stroke motor rehabilitation using wearable sensor data
【速读】:该论文旨在解决卒中后运动能力动态预测不足的问题,现有评估方法仅提供静态损伤评分,无法判断患者是否能安全执行特定任务(如坡道行走或爬楼梯)。其解决方案的关键在于构建一个数据-物理混合的生成式框架,通过单次20米平地步行试验,结合可穿戴传感器运动学数据、比例-微分(Proportional-Derivative, PD)物理控制器、健康运动图谱(Healthy Motion Atlas)以及目标条件深度强化学习与行为克隆及生成对抗模仿学习技术,生成符合物理规律且个体化的步态模拟,从而实现对不同康复场景下任务导向性运动能力的精准预测。
链接: https://arxiv.org/abs/2512.14329
作者: Yanning Dai,Chenyu Tang,Ruizhi Zhang,Wenyu Yang,Yilan Zhang,Yuhui Wang,Junliang Chen,Xuhang Chen,Ruimou Xie,Yangyue Cao,Qiaoying Li,Jin Cao,Tao Li,Hubin Zhao,Yu Pan,Arokia Nathan,Xin Gao,Peter Smielewski,Shuo Gao
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: 26 pages, 6 figures
Abstract:Dynamic prediction of locomotor capacity after stroke is crucial for tailoring rehabilitation, yet current assessments provide only static impairment scores and do not indicate whether patients can safely perform specific tasks such as slope walking or stair climbing. Here, we develop a data-physics hybrid generative framework that reconstructs an individual stroke survivor’s neuromuscular control from a single 20 m level-ground walking trial and predicts task-conditioned locomotion across rehabilitation scenarios. The system combines wearable-sensor kinematics, a proportional-derivative physics controller, a population Healthy Motion Atlas, and goal-conditioned deep reinforcement learning with behaviour cloning and generative adversarial imitation learning to generate physically plausible, patient-specific gait simulations for slopes and stairs. In 11 stroke survivors, the personalized controllers preserved idiosyncratic gait patterns while improving joint-angle and endpoint fidelity by 4.73% and 12.10%, respectively, and reducing training time to 25.56% relative to a physics-only baseline. In a multicentre pilot involving 21 inpatients, clinicians who used our locomotion predictions to guide task selection and difficulty obtained larger gains in Fugl-Meyer lower-extremity scores over 28 days of standard rehabilitation than control clinicians (mean change 6.0 versus 3.7 points). These findings indicate that our generative, task-predictive framework can augment clinical decision-making in post-stroke gait rehabilitation and provide a template for dynamically personalized motor recovery strategies.
zh
[AI-18] A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks
【速读】:该论文旨在解决软件定义工业网络(Software-Defined Industrial Networks, SDINs)中由随机扰动(如良性流量突发和交换机热波动引起的闪断事件)导致的间歇性服务质量下降问题,这些问题违反了IEC 61850标准定义的服务质量要求及用户自定义的服务水平协议(Service-Level Agreement, SLA),进而影响风电场中控制、监控和尽力传输类业务的可靠及时交付。解决方案的关键在于提出一种基于阈值触发的深度Q网络(Deep Q-Network, DQN)自愈智能体,该智能体能够自主检测、分析并实时缓解网络扰动,同时动态调整路由行为与资源分配策略,在超脊叶数据平面架构下实现高鲁棒性的网络恢复能力,相比基线最短路径负载均衡方法提升恢复性能53.84%,优于现有先进方法,并通过主动启动外部机架冷却保障交换机热稳定性。
链接: https://arxiv.org/abs/2512.14297
作者: Agrippina Mwangi(Utrecht University, The Netherlands),León Navarro-Hilfiker(Ørsted, USA),Lukasz Brewka(Ørsted, Denmark),Mikkel Gryning(Ørsted, Denmark),Elena Fumagalli(Utrecht University, The Netherlands),Madeleine Gibescu(Utrecht University, The Netherlands)
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Performance (cs.PF); High Energy Physics - Experiment (hep-ex)
备注:
Abstract:Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial networks. These events violate IEC~61850-derived quality-of-service requirements and user-defined service-level agreements, hindering the reliable and timely delivery of control, monitoring, and best-effort traffic in IEC~61400-25-compliant wind power plants. Failure to maintain these requirements often results in delayed or lost control signals, reduced operational efficiency, and increased risk of wind turbine generator downtime. To address these challenges, this study proposes a threshold-triggered Deep Q-Network self-healing agent that autonomically detects, analyzes, and mitigates network disruptions while adapting routing behavior and resource allocation in real time. The proposed agent was trained, validated, and tested on an emulated tri-clustered switch network deployed in a cloud-based proof-of-concept testbed. Simulation results show that the proposed agent improves disruption recovery performance by 53.84% compared to a baseline shortest-path and load-balanced routing approach and outperforms state-of-the-art methods, including the Adaptive Network-based Fuzzy Inference System by 13.1% and the Deep Q-Network and traffic prediction-based routing optimization method by 21.5%, in a super-spine leaf data-plane architecture. Additionally, the agent maintains switch thermal stability by proactively initiating external rack cooling when required. These findings highlight the potential of deep reinforcement learning in building resilience in software-defined industrial networks deployed in mission-critical, time-sensitive application scenarios.
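下面给出"阈值触发 + DQN 决策"这一自愈骨架的概念性示意(并非论文官方实现):状态向量(时延/丢包/温度)、动作集合与阈值均为说明而设的假设,且省略了经验回放与目标网络。

```python
import random
import torch
import torch.nn as nn

ACTIONS = ["reroute", "rebalance", "rack_cooling", "no_op"]
LATENCY_MS_MAX, LOSS_MAX, TEMP_C_MAX = 10.0, 0.01, 75.0   # QoS/SLA 阈值(假设)

q_net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, len(ACTIONS)))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def violated(state):
    latency, loss, temp = state
    return latency > LATENCY_MS_MAX or loss > LOSS_MAX or temp > TEMP_C_MAX

def select_action(state, eps=0.1):
    if random.random() < eps:                      # epsilon-greedy 探索
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return int(q_net(torch.tensor(state)).argmax())

def td_update(s, a, r, s_next, gamma=0.99):
    """单步 TD 更新(省略经验回放与目标网络)。"""
    q = q_net(torch.tensor(s))[a]
    with torch.no_grad():
        target = r + gamma * q_net(torch.tensor(s_next)).max()
    loss = (q - target) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

# 监控循环:仅在阈值被违反时才触发 DQN 决策
state = [14.2, 0.02, 78.0]           # 示意观测:时延 ms、丢包率、温度 °C
if violated(state):
    a = select_action(state)
    print("触发自愈动作:", ACTIONS[a])
    next_state = [8.0, 0.005, 70.0]  # 执行动作后的观测(示意)
    reward = 1.0 if not violated(next_state) else -1.0
    td_update(state, a, reward, next_state)
```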
zh
[AI-19] Leveraging LLMs for Collaborative Ontology Engineering in Parkinson Disease Monitoring and Alerting
【速读】:该论文旨在解决如何利用大语言模型(Large Language Models, LLMs)在帕金森病(Parkinson’s Disease, PD)监测与预警领域中实现高效、准确的本体(ontology)构建问题,尤其关注LLMs是否能够独立完成全面本体设计,或需通过人机协作提升其完整性与准确性。解决方案的关键在于提出并验证两种混合方法:X-HCOME(结合人类专家知识与LLM能力的协同工程方法)和SimX-HCOME+(强调持续人类监督与迭代优化的增强型方法),实证表明,仅靠LLMs生成的本体存在显著不足,而通过人类参与的协作机制可大幅提升本体的质量,使其接近专业专家构建水平,从而揭示了人机协同在复杂医学本体工程中的核心价值。
链接: https://arxiv.org/abs/2512.14288
作者: Georgios Bouchouras,Dimitrios Doumanas,Andreas Soularidis,Konstantinos Kotis,George A. Vouros
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper explores the integration of Large Language Models (LLMs) in the engineering of a Parkinson’s Disease (PD) monitoring and alerting ontology through four key methodologies: One Shot (OS) prompt techniques, Chain of Thought (CoT) prompts, X-HCOME, and SimX-HCOME+. The primary objective is to determine whether LLMs alone can create comprehensive ontologies and, if not, whether human-LLM collaboration can achieve this goal. Consequently, the paper assesses the effectiveness of LLMs in automated ontology development and the enhancement achieved through human-LLM collaboration. Initial ontology generation was performed using One Shot (OS) and Chain of Thought (CoT) prompts, demonstrating the capability of LLMs to autonomously construct ontologies for PD monitoring and alerting. However, these outputs were not comprehensive and required substantial human refinement to enhance their completeness and accuracy. X-HCOME, a hybrid ontology engineering approach that combines human expertise with LLM capabilities, showed significant improvements in ontology comprehensiveness. This methodology resulted in ontologies that are very similar to those constructed by experts. Further experimentation with SimX-HCOME+, another hybrid methodology emphasizing continuous human supervision and iterative refinement, highlighted the importance of ongoing human involvement. This approach led to the creation of more comprehensive and accurate ontologies. Overall, the paper underscores the potential of human-LLM collaboration in advancing ontology engineering, particularly in complex domains like PD. The results suggest promising directions for future research, including the development of specialized GPT models for ontology construction.
zh
[AI-20] The Trust in AI-Generated Health Advice (TAIGHA) Scale and Short Version (TAIGHA-S): Development and Validation Study
【速读】:该论文试图解决的问题是:当前缺乏针对用户对生成式 AI 提供的健康建议的信任程度进行测量的标准化工具,而现有量表如“自动化系统信任调查”(Trust in Automated Systems Survey)仅适用于通用技术,无法准确评估用户对 AI 生成健康信息的信任与不信任。解决方案的关键在于开发并验证了一个理论驱动的、包含认知和情感维度的双因素量表——信任 AI 生成健康建议量表(Trust in AI-Generated Health Advice, TAIGHA),及其四题简版(TAIGHA-S)。该量表通过生成式 AI 辅助项目生成、专家内容效度检验、普通用户面效验证及大规模心理测量学分析(n=385),最终确立了良好的信效度指标(如 S-CVI/Ave=0.99,CFA 拟合优度良好,α=0.95),能够区分性地测量用户对 AI 健康建议的正向信任与负向不信任,为未来 AI 医疗干预的研究与应用提供可靠测量工具。
链接: https://arxiv.org/abs/2512.14278
作者: Marvin Kopka,Azeem Majeed,Gabriella Spinelli,Austen El-Osta,Markus Feufel
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Intelligence tools such as large language models are increasingly used by the public to obtain health information and guidance. In health-related contexts, following or rejecting AI-generated advice can have direct clinical implications. Existing instruments like the Trust in Automated Systems Survey assess trustworthiness of generic technology, and no validated instrument measures users’ trust in AI-generated health advice specifically. This study developed and validated the Trust in AI-Generated Health Advice (TAIGHA) scale and its four-item short form (TAIGHA-S) as theory-based instruments measuring trust and distrust, each with cognitive and affective components. The items were developed using a generative AI approach, followed by content validation with 10 domain experts, face validation with 30 lay participants, and psychometric validation with 385 UK participants who received AI-generated advice in a symptom-assessment scenario. After automated item reduction, 28 items were retained and reduced to 10 based on expert ratings. TAIGHA showed excellent content validity (S-CVI/Ave=0.99) and CFA confirmed a two-factor model with excellent fit (CFI=0.98, TLI=0.98, RMSEA=0.07, SRMR=0.03). Internal consistency was high (\alpha=0.95). Convergent validity was supported by correlations with the Trust in Automated Systems Survey (r=0.67/-0.66) and users’ reliance on the AI’s advice (r=0.37 for trust), while divergent validity was supported by low correlations with reading flow and mental load (all |r| < 0.25). TAIGHA-S correlated highly with the full scale (r=0.96) and showed good reliability (\alpha=0.88). TAIGHA and TAIGHA-S are validated instruments for assessing user trust and distrust in AI-generated health advice. Reporting trust and distrust separately permits a more complete evaluation of AI interventions, and the short scale is well-suited for time-constrained settings.
zh
[AI-21] Explainable Preference Learning: a Decision Tree-based Surrogate Model for Preferential Bayesian Optimization
【速读】:该论文旨在解决当前基于偏好贝叶斯优化(Preferential Bayesian Optimization)方法中普遍依赖高斯过程(Gaussian Process, GP)作为代理模型所引发的三大问题:模型可解释性差、难以处理类别型数据以及计算复杂度高,从而限制了其在真实场景中的应用。解决方案的关键在于提出一种基于决策树(decision tree)的代理模型,该模型具备天然可解释性,能够统一处理连续型与类别型特征,并且具有良好的扩展性,适用于大规模数据集。实验表明,该方法在具有尖锐特征(spiky)的优化函数上优于GP基线,在非尖锐函数上性能略有下降,同时在真实世界Sushi偏好数据集上成功建模个体偏好,并初步探索了利用历史偏好数据加速新用户优化过程的可能性。
链接: https://arxiv.org/abs/2512.14263
作者: Nick Leenders,Thomas Quadt,Boris Cule,Roy Lindelauf,Herman Monsuur,Joost van Oijen,Mark Voskuijl
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:Current Preferential Bayesian Optimization methods rely on Gaussian Processes (GPs) as surrogate models. These models are hard to interpret, struggle with handling categorical data, and are computationally complex, limiting their real-world usability. In this paper, we introduce an inherently interpretable decision tree-based surrogate model capable of handling both categorical and continuous data, and scalable to large datasets. Extensive numerical experiments on eight increasingly spiky optimization functions show that our model outperforms GP-based alternatives on spiky functions and has only marginally lower performance for non-spiky functions. Moreover, we apply our model to the real-world Sushi dataset and show its ability to learn an individual’s sushi preferences. Finally, we show some initial work on using historical preference data to speed up the optimization process for new unseen users.
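下面是"成对偏好 → 逐项效用标签 → 决策树代理模型"这一流程的概念性示意(并非论文官方实现):这里简单用"胜率"作为效用标签,真实方法中的标签构造与采集函数可能更复杂;特征、真实效用函数与查询次数均为玩具假设。

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
items = rng.uniform(0, 1, size=(30, 2))        # 候选项特征(可混合类别特征)
true_util = lambda x: -((x[:, 0] - 0.7) ** 2) - 0.5 * np.abs(x[:, 1] - 0.3)

# 模拟成对偏好查询:效用高者获胜
wins, games = np.zeros(len(items)), np.zeros(len(items))
for _ in range(200):
    i, j = rng.choice(len(items), size=2, replace=False)
    winner = i if true_util(items[[i]])[0] > true_util(items[[j]])[0] else j
    wins[winner] += 1
    games[i] += 1; games[j] += 1

y = wins / np.maximum(games, 1)                # 胜率作为效用标签(假设)
surrogate = DecisionTreeRegressor(max_depth=4).fit(items, y)

# 代理模型推荐下一个候选(纯利用;真实系统会加入探索项)
cands = rng.uniform(0, 1, size=(500, 2))
print("推荐候选:", cands[surrogate.predict(cands).argmax()])
```

决策树的分裂规则本身就是可读的"如果-那么"规则,这正是其相对 GP 的可解释性优势所在。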
zh
[AI-22] Gödel's Poetry
【速读】:该论文旨在解决形式化自动定理证明(formal automated theorem proving)在人工智能领域长期面临的挑战,特别是如何提升计算机在复杂数学命题上的自动推理能力。其解决方案的关键在于提出一种基于多智能体架构(multi-agent architecture)的新方法:利用针对Lean4语言定制的专用语言模型进行证明生成,并结合递归分解困难定理为更简单的蕴含命题(entailing propositions),从而降低证明难度。其中一项关键技术贡献是扩展了Kimina Lean Server以支持抽象语法树(AST)解析,实现了自动化的、递归的证明分解过程,显著提升了在miniF2F基准测试中的通过率(从无分解时的90.4%进一步提高)。
链接: https://arxiv.org/abs/2512.14252
作者: Kelly J. Davis
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 1 figure
Abstract:Formal, automated theorem proving has long been viewed as a challenge to artificial intelligence. We introduce here a new approach to computer theorem proving, one that employs specialized language models for Lean4 proof generation combined with recursive decomposition of difficult theorems into simpler entailing propositions. These models are coordinated through a multi-agent architecture that orchestrates autoformalization (if required), proof generation, decomposition of difficult theorems into simpler entailing propositions, and recursive proof (and/or decomposition) of these propositions. Without decomposition, we achieve a 90.4% pass rate on miniF2F. With decomposition, this is significantly improved. A key technical contribution lies in our extension of the Kimina Lean Server with abstract syntax tree (AST) parsing capabilities to facilitate automated, recursive proof decomposition. The system is made available on PyPI as goedels-poetry (at this https URL ), and the open-source implementation KellyJDavis/goedels-poetry (at this https URL ) facilitates both adaptation to alternative language models and extension with custom functionality.
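下面用一个极简的递归骨架示意"证明,否则分解再递归证明"的控制流(并非 goedels-poetry 官方实现):prove_with_model / decompose 为占位函数,真实系统分别对应 Lean4 证明生成模型与基于 AST 的定理分解;玩具定理表、深度上限与子证明组装方式均为假设。

```python
def prove_with_model(statement: str) -> str | None:
    """调用证明模型,成功则返回 Lean4 证明文本,失败返回 None(占位)。"""
    toy_proofs = {"a + b = b + a": "exact Nat.add_comm a b"}
    return toy_proofs.get(statement)

def decompose(statement: str) -> list[str] | None:
    """把难定理分解为若干更易证明的蕴含命题(占位)。"""
    toy_splits = {"(a + b) + c = c + (b + a)": ["a + b = b + a"]}
    return toy_splits.get(statement)

def prove(statement: str, depth: int = 0, max_depth: int = 3) -> str | None:
    proof = prove_with_model(statement)
    if proof is not None:
        return proof
    if depth >= max_depth:
        return None                        # 超出递归深度上限,放弃
    subgoals = decompose(statement)
    if not subgoals:
        return None
    sub_proofs = [prove(g, depth + 1, max_depth) for g in subgoals]
    if any(p is None for p in sub_proofs):
        return None
    # 真实系统会把子证明组装回原定理;此处仅拼接示意
    return "\n".join(sub_proofs) + f"\n-- assembled proof of: {statement}"

print(prove("(a + b) + c = c + (b + a)"))
```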
zh
[AI-23] Beyond MMD: Evaluating Graph Generative Models with Geometric Deep Learning
【速读】:该论文旨在解决当前图生成模型(Graph Generative Models, GGMs)在评估过程中存在的局限性问题,尤其是依赖最大均值差异(Maximum Mean Discrepancy, MMD)作为评价指标时难以准确反映生成图与真实图在结构特性上的相似性。其解决方案的关键在于提出一种新的评估方法——RGM(Representation-aware Graph-generation Model evaluation),该方法通过引入基于几何深度学习的图分类模型,并利用自定义的合成与真实图数据集进行训练和测试,从而更有效地衡量GGM生成图在保留不同图域结构特征方面的性能。此方法超越了传统MMD对图属性分布的浅层比较,能够揭示模型在深层结构一致性上的不足,为GGM的改进提供更具判别力的评估依据。
链接: https://arxiv.org/abs/2512.14241
作者: Salvatore Romano,Marco Grassia,Giuseppe Mangioni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
备注: 16 pages, 4 figures
Abstract:Graph generation is a crucial task in many fields, including network science and bioinformatics, as it enables the creation of synthetic graphs that mimic the properties of real-world networks for various applications. Graph Generative Models (GGMs) have emerged as a promising solution to this problem, leveraging deep learning techniques to learn the underlying distribution of real-world graphs and generate new samples that closely resemble them. Examples include approaches based on Variational Auto-Encoders, Recurrent Neural Networks, and more recently, diffusion-based models. However, the main limitation often lies in the evaluation process, which typically relies on Maximum Mean Discrepancy (MMD) as a metric to assess the distribution of graph properties in the generated ensemble. This paper introduces a novel methodology for evaluating GGMs that overcomes the limitations of MMD, which we call RGM (Representation-aware Graph-generation Model evaluation). As a practical demonstration of our methodology, we present a comprehensive evaluation of two state-of-the-art Graph Generative Models: Graph Recurrent Attention Networks (GRAN) and Efficient and Degree-guided graph GEnerative model (EDGE). We investigate their performance in generating realistic graphs and compare them using a Geometric Deep Learning model trained on a custom dataset of synthetic and real-world graphs, specifically designed for graph classification tasks. Our findings reveal that while both models can generate graphs with certain topological properties, they exhibit significant limitations in preserving the structural characteristics that distinguish different graph domains. We also highlight the inadequacy of Maximum Mean Discrepancy as an evaluation metric for GGMs and suggest alternative approaches for future research.
zh
[AI-24] PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在渗透测试(Penetration Testing)中应用时存在的自动化程度低、任务分解不足及缺乏系统性评估的问题。现有方法多依赖简单提示(prompting),未对渗透测试流程进行阶段化拆解,导致模型行为不可靠且难以分析其在各阶段的实际能力。解决方案的关键在于提出首个全面的基准测试工具 PentestEval,该工具将渗透测试划分为六个可量化评估的子阶段(信息收集、弱点获取与筛选、攻击决策、漏洞利用生成与修订等),并构建包含346个任务的专家标注数据集和自动化评估流水线,首次实现了对主流大语言模型(LLMs)在渗透测试全流程中的细粒度性能分析,揭示了当前模型在端到端任务中仅31%的成功率,凸显出结构化推理与模块化设计对于提升自主渗透测试系统可靠性的必要性。
链接: https://arxiv.org/abs/2512.14233
作者: Ruozhao Yang,Mingfei Cheng,Gelei Deng,Tianwei Zhang,Junjie Wang,Xiaofei Xie
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 13 pages, 6 figures
Abstract:Penetration testing is essential for assessing and strengthening system security against real-world threats, yet traditional workflows remain highly manual, expertise-intensive, and difficult to scale. Although recent advances in Large Language Models (LLMs) offer promising opportunities for automation, existing applications rely on simplistic prompting without task decomposition or domain adaptation, resulting in unreliable black-box behavior and limited insight into model capabilities across penetration testing stages. To address this gap, we introduce PentestEval, the first comprehensive benchmark for evaluating LLMs across six decomposed penetration testing stages: Information Collection, Weakness Gathering and Filtering, Attack Decision-Making, Exploit Generation and Revision. PentestEval integrates expert-annotated ground truth with a fully automated evaluation pipeline across 346 tasks covering all stages in 12 realistic vulnerable scenarios. Our stage-level evaluation of 9 widely used LLMs reveals generally weak performance and distinct limitations across the stages of penetration-testing workflow. End-to-end pipelines reach only 31% success rate, and existing LLM-powered systems such as PentestGPT, PentestAgent, and VulnBot exhibit similar limitations, with autonomous agents failing almost entirely. These findings highlight that autonomous penetration testing demands stronger structured reasoning, where modularization enhances each individual stage and improves overall performance. PentestEval provides the foundational benchmark needed for future research on fine-grained, stage-level evaluation, paving the way toward more reliable LLM-based automation.
zh
[AI-25] Georeferencing complex relative locality descriptions with large language models
【速读】:该论文旨在解决生物标本采集记录中复杂地理位置描述的自动地理编码(georeferencing)问题,尤其针对那些依赖相对空间关系而非精确地名或地理指示词的文本描述,这类描述在GPS普及前的文献中尤为常见。传统方法如基于地名词典(gazetteer-based)或语言建模的方法难以准确处理此类描述,导致地理编码精度不足。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)并结合量化低秩适配(Quantized Low-Rank Adaptation, QLoRA)技术,在多区域、多语言的生物多样性数据集上进行微调,从而实现对复杂局部描述的高精度自动地理编码,实验表明该方法在固定训练数据量下平均有65%的记录落在10公里半径内,最佳结果(纽约州)达到85% within 10km 和 67% within 1km。
链接: https://arxiv.org/abs/2512.14228
作者: Aneesha Fernando,Surangika Ranathunga,Kristin Stock,Raj Prasanna,Christopher B. Jones
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Provisionally accepted for publication in the International Journal of Geographical Information Science
Abstract:Georeferencing text documents has typically relied on either gazetteer-based methods to assign geographic coordinates to place names, or on language modelling approaches that associate textual terms with geographic locations. However, many location descriptions specify positions relatively with spatial relationships, making geocoding based solely on place names or geo-indicative words inaccurate. This issue frequently arises in biological specimen collection records, where locations are often described through narratives rather than coordinates if they pre-date GPS. Accurate georeferencing is vital for biodiversity studies, yet the process remains labour-intensive, leading to a demand for automated georeferencing solutions. This paper explores the potential of Large Language Models (LLMs) to georeference complex locality descriptions automatically, focusing on the biodiversity collections domain. We first identified effective prompting patterns, then fine-tuned an LLM using Quantized Low-Rank Adaptation (QLoRA) on biodiversity datasets from multiple regions and languages. Our approach outperforms existing baselines with an average, across datasets, of 65% of records within a 10 km radius, for a fixed amount of training data. The best results (New York state) were 85% within 10km and 67% within 1km. The selected LLM performs well for lengthy, complex descriptions, highlighting its potential for georeferencing intricate locality descriptions.
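论文采用的 QLoRA 微调可以用 transformers + peft 的最小配置示意如下(并非论文官方训练脚本,需安装 bitsandbytes 并有 GPU):模型名为占位符,LoRA 秩、目标模块与样本格式均为说明而设的假设,需按论文所选 LLM 调整。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"    # 占位模型名(假设)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4bit 量化基座权重
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # 注入低秩适配器的层(假设)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()        # 仅低秩适配器参与训练

# 训练样本形如:地点描述文本 -> 坐标字符串(示意)
tok = AutoTokenizer.from_pretrained(model_id)
example = ("Locality: 5 km NW of Springfield, along the river bank. "
           "Coordinates:")
print(tok(example, return_tensors="pt").input_ids.shape)
```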
zh
[AI-26] Estimating problem difficulty without ground truth using Large Language Model comparisons
【速读】:该论文旨在解决当前生成式 AI(Generative AI)训练中用于评估问题难度的指标难以泛化至分布外(out-of-distribution)问题的难题,即现有方法如人工校准或基于模型性能的评分在面对人类和大语言模型(Large Language Models, LLMs)均无法解答的问题时失效,因其依赖人工标注、耗时且不可扩展。解决方案的关键在于提出一种名为 LLM compare 的新方法:通过让 LLM 进行成对难度比较,并基于 Bradley-Terry 模型计算得分,从而实现连续、动态、与模型无关且不依赖真实标签的难度估计。该方法首次在三个维度——构建方式、可扩展性和依赖性上均满足理想条件,且经实验证明其与人类标注高度一致(Pearson相关系数 ≥ 0.80),并对幻觉具有鲁棒性(噪声注入10%时相关性下降<6%)。
链接: https://arxiv.org/abs/2512.14220
作者: Marthe Ballon,Andres Algaba,Brecht Verbeken,Vincent Ginis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 10 figures
Abstract:Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e. problems currently unsolvable by humans and LLMs, because they are not scalable, time-consuming, and ground truth dependent. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes–construction, scale and dependence–identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic and independent of ground truth information. As a second validation, we show that LLM compare demonstrates strong alignment with human annotations: Pearson r \geq 0.80 for n=1876 . Thirdly, we show that LLM compare is robust to hallucinations, with less than 6% degradation in Pearson correlation for 10% noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
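由两两比较结果计算 Bradley-Terry 分数的 MM 迭代可以写得非常紧凑,下面给出一个自包含示意(并非论文官方实现):比较结果矩阵 W 为玩具数据,真实场景中由 LLM 的成对难度判断填充。

```python
import numpy as np

def bradley_terry(W, iters=200):
    """W[i, j] = 问题 i 被判定比 j 更难的次数;返回 BT 强度分数。"""
    n = W.shape[0]
    p = np.ones(n)
    N = W + W.T                          # i、j 之间的总比较次数
    for _ in range(iters):
        denom = N / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = W.sum(axis=1) / denom.sum(axis=1)   # MM 更新
        p = p_new / p_new.sum()          # 归一化避免尺度漂移
    return p

# 4 个问题的玩具比较结果:问题 3 最难、问题 0 最易
W = np.array([[0, 1, 0, 0],
              [4, 0, 2, 1],
              [5, 3, 0, 1],
              [5, 4, 4, 0]], dtype=float)
scores = bradley_terry(W)
print("难度分数:", np.round(scores, 3))
print("由易到难排序:", np.argsort(scores))
```

BT 分数是连续的,且只依赖比较结果本身,这正对应论文所强调的"无需真实标签、模型无关"的两个性质。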
zh
[AI-27] Understanding and Improving Hyperbolic Deep Reinforcement Learning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)代理在使用双曲特征空间(hyperbolic feature spaces)时因梯度不稳定导致训练失败的问题。研究表明,大范数嵌入会破坏基于梯度的优化过程,引发近端策略优化(Proximal Policy Optimization, PPO)中的信任区域违规问题。解决方案的关键在于提出 Hyper++,其核心创新包括:(i) 采用分类值损失替代回归损失以实现稳定的评论家(critic)训练;(ii) 引入特征正则化机制,在不引入维度诅咒的前提下保证嵌入范数有界;(iii) 使用更优化友好的双曲网络层公式化方法。实验表明,Hyper++ 在 ProcGen 和 Atari-5(结合 Double DQN)任务上显著优于欧几里得与现有双曲基线方法,并将训练时间减少约 30%。
链接: https://arxiv.org/abs/2512.14202
作者: Timo Klein,Thomas Lang,Andrii Shkabrii,Alexander Sturm,Kevin Sidak,Lukas Miklautz,Claudia Plant,Yllka Velaj,Sebastian Tschiatschek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (iii) using a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at this https URL .
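论文指出大范数嵌入会破坏双曲空间中的梯度优化,下面示意一种"先对特征范数做有界化、再经原点指数映射进入 Poincaré 球"的做法(并非 Hyper++ 官方实现):整体缩放而非逐维裁剪,以避免维度诅咒;缩放上界、曲率与正则权重均为假设。

```python
import torch

def bound_norm(v, max_norm=1.0):
    """整体缩放使 ||v|| <= max_norm,保留方向信息。"""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    scale = torch.clamp(max_norm / norm, max=1.0)
    return v * scale

def expmap0(v, c=1.0):
    """曲率为 -c 的 Poincaré 球上、以原点为基点的指数映射。"""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

feats = 50.0 * torch.randn(4, 8)          # 模拟训练中出现的大范数特征
emb = expmap0(bound_norm(feats))
print("双曲嵌入范数:", emb.norm(dim=-1))   # 全部严格小于 1(落在球内)

# 范数正则项可直接加入损失(权重 1e-3 为假设值)
reg = 1e-3 * feats.norm(dim=-1).pow(2).mean()
print("范数正则项:", float(reg))
```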
zh
[AI-28] End-to-End Learning-based Video Streaming Enhancement Pipeline: A Generative AI Approach
【速读】:该论文旨在解决视频流媒体中高画质与流畅播放之间的平衡问题,传统编码器因缺乏上下文感知能力,需传输全部视频数据,导致带宽浪费。其解决方案的关键在于提出ELVIS(End-to-end Learning-based VIdeo Streaming Enhancement Pipeline),一个端到端架构,结合服务器端编码优化与客户端生成式图像修复(generative in-painting)技术,以移除并重建冗余视频数据,从而在不增加带宽的前提下提升视频质量。该架构模块化设计支持集成不同编码器、修复模型和质量评估指标,具备良好的可扩展性与适应未来技术演进的能力。
链接: https://arxiv.org/abs/2512.14185
作者: Emanuele Artioli,Farzad Tashtarian,Christian Timmerer
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: The 35th edition of the Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV '25), March 31-April 4, 2025, Stellenbosch, South Africa
Abstract:The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client. This paper introduces ELVIS (End-to-end Learning-based VIdeo Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, inpainting models, and quality metrics, making it adaptable to future innovations. Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.
zh
[AI-29] IntentMiner: Intent Inversion Attack via Tool Call Analysis in the Model Context Protocol
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为自主代理时,因采用模型上下文协议(Model Context Protocol, MCP)架构导致的隐私泄露问题。具体而言,MCP通过将推理引擎与工具执行解耦以提升可扩展性,但引入了第三方MCP服务器作为半诚实中介,其能够访问用户工具调用日志,从而可能通过分析这些日志重构用户的私有意图,即提出并形式化了一种新型隐私威胁——意图反转(Intent Inversion)。解决方案的关键在于提出IntentMiner框架,该框架融合分层信息隔离(Hierarchical Information Isolation)与三维语义分析(Three-Dimensional Semantic Analysis),综合工具目的、调用语句及返回结果,在步骤级别精准还原用户意图,实验表明其在语义一致性上超过85%,显著优于基线方法,揭示了看似无害的工具执行日志实际上可能成为泄露用户秘密的强大载体。
链接: https://arxiv.org/abs/2512.14166
作者: Yunhao Yao,Zhiqiang Wang,Haoran Cheng,Yihang Cheng,Haohua Du,Xiang-Yang Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures
Abstract:The rapid evolution of Large Language Models (LLMs) into autonomous agents has led to the adoption of the Model Context Protocol (MCP) as a standard for discovering and invoking external tools. While this architecture decouples the reasoning engine from tool execution to enhance scalability, it introduces a significant privacy surface: third-party MCP servers, acting as semi-honest intermediaries, can observe detailed tool interaction logs outside the user’s trusted boundary. In this paper, we first identify and formalize a novel privacy threat termed Intent Inversion, where a semi-honest MCP server attempts to reconstruct the user’s private underlying intent solely by analyzing legitimate tool calls. To systematically assess this vulnerability, we propose IntentMiner, a framework that leverages Hierarchical Information Isolation and Three-Dimensional Semantic Analysis, integrating tool purpose, call statements, and returned results, to accurately infer user intent at the step level. Extensive experiments demonstrate that IntentMiner achieves a high degree of semantic alignment (over 85%) with original user queries, significantly outperforming baseline approaches. These results highlight the inherent privacy risks in decoupled agent architectures, revealing that seemingly benign tool execution logs can serve as a potent vector for exposing user secrets.
zh
[AI-30] PathFinder: Advancing Path Loss Prediction for Single-to-Multi-Transmitter Scenario
【速读】:该论文旨在解决当前基于深度学习的无线路径损耗预测(Radio Path Loss Prediction, RPP)方法中存在的三大问题:一是环境建模被动,忽视发射机和关键环境特征;二是过度依赖单发射机场景,难以适应真实世界中普遍存在的多发射机情况;三是过分关注分布内性能,缺乏对分布偏移(如建筑密度或发射机配置变化)下的泛化能力。其解决方案的关键在于提出PathFinder架构,通过解耦特征编码主动建模建筑物与发射机,并引入Mask-Guided Low-rank Attention机制,独立聚焦于接收端和建筑区域;同时设计了面向发射机的Mixup训练策略以提升鲁棒性,并构建了专门用于评估外推性能的新基准S2MT-RPP(single-to-multi-transmitter RPP),从而显著改善在多发射机场景下的预测精度与泛化能力。
链接: https://arxiv.org/abs/2512.14150
作者: Zhijie Zhong,Zhiwen Yu,Pengyu Li,Jianming Lv,C. L. Philip Chen,Min Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages, 14 figures, 4 tables. Under review
Abstract:Radio path loss prediction (RPP) is critical for optimizing 5G networks and enabling IoT, smart city, and similar applications. However, current deep learning-based RPP methods lack proactive environmental modeling, struggle with realistic multi-transmitter scenarios, and generalize poorly under distribution shifts, particularly when training/testing environments differ in building density or transmitter configurations. This paper identifies three key issues: (1) passive environmental modeling that overlooks transmitters and key environmental features; (2) overemphasis on single-transmitter scenarios despite real-world multi-transmitter prevalence; (3) excessive focus on in-distribution performance while neglecting distribution shift challenges. To address these, we propose PathFinder, a novel architecture that actively models buildings and transmitters via disentangled feature encoding and integrates Mask-Guided Low-rank Attention to independently focus on receiver and building regions. We also introduce a Transmitter-Oriented Mixup strategy for robust training and a new benchmark, single-to-multi-transmitter RPP (S2MT-RPP), tailored to evaluate extrapolation performance (multi-transmitter testing after single-transmitter training). Experimental results show PathFinder outperforms state-of-the-art methods significantly, especially in challenging multi-transmitter scenarios. Our code and project site are available at: this https URL.
zh
[AI-31] LAPPI: Interactive Optimization with LLM-Assisted Preference-Based Problem Instantiation
【速读】:该论文旨在解决用户在面对组合优化问题(combinatorial optimization problems)时,难以使用传统优化求解器的问题,核心挑战在于问题实例化(problem instantiation)的复杂性——即用户需明确定义候选项目、赋予权重偏好和约束条件,而这一过程对非专业用户而言门槛较高。解决方案的关键在于提出LAPPI(LLM-Assisted Preference-based Problem Instantiation),一种基于大语言模型(Large Language Models, LLMs)的交互式方法,通过自然语言对话引导用户将模糊偏好逐步转化为结构化的优化问题,并将其交由现有求解器执行,从而显著降低问题建模难度并提升实用性。
链接: https://arxiv.org/abs/2512.14138
作者: So Kuroki,Manami Nakagawa,Shigeo Yoshida,Yuki Koyama,Kozuno Tadashi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Many real-world tasks, such as trip planning or meal planning, can be formulated as combinatorial optimization problems. However, using optimization solvers is difficult for end users because it requires problem instantiation: defining candidate items, assigning preference scores, and specifying constraints. We introduce LAPPI (LLM-Assisted Preference-based Problem Instantiation), an interactive approach that uses large language models (LLMs) to support users in this instantiation process. Through natural language conversations, the system helps users transform vague preferences into well-defined optimization problems. These instantiated problems are then passed to existing optimization solvers to generate solutions. In a user study on trip planning, our method successfully captured user preferences and generated feasible plans that outperformed both conventional and prompt-engineering approaches. We further demonstrate LAPPI’s versatility by adapting it to an additional use case.
zh
[AI-32] UIXPOSE: Mobile Malware Detection via Intention-Behaviour Discrepancy Analysis
【速读】:该论文旨在解决移动恶意软件动态检测中因缺乏细粒度行为建模而导致的误报率高和隐蔽性行为难以识别的问题。现有方法通常依赖静态权限分析或粗粒度运行时信号(如端点调用或部分资源使用),无法捕捉应用界面(UI)所体现的意图与实际运行时语义之间的偏差。其解决方案的关键在于提出UIXPOSE框架,通过引入意图行为对齐(Intention Behaviour Alignment, IBA)机制:利用视觉-语言模型从每个屏幕推断意图向量,并结合解码后的网络流量、堆内存信号及资源使用轨迹构建行为向量,在运行时进行二者对齐分析,从而精准识别恶意行为并定位高风险行为路径。
链接: https://arxiv.org/abs/2512.14130
作者: Amirmohammad Pasdar,Toby Murray,Van-Thuan Pham
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:We introduce UIXPOSE, a source-code-agnostic framework that operates on both compiled and open-source apps. This framework applies Intention Behaviour Alignment (IBA) to mobile malware analysis, aligning UI-inferred intent with runtime semantics. Previous work either infers intent statically, e.g., permission-centric, or widget-level or monitors coarse dynamic signals (endpoints, partial resource usage) that miss content and context. UIXPOSE infers an intent vector from each screen using vision-language models and knowledge structures and combines decoded network payloads, heap/memory signals, and resource utilisation traces into a behaviour vector. Their alignment, calculated at runtime, can both detect misbehaviour and highlight exploration of behaviourally rich paths. In three real-world case studies, UIXPOSE reveals covert exfiltration and hidden background activity that evade metadata-only baselines, demonstrating how IBA improves dynamic detection.
zh
[AI-33] Optimizing Multi-Tier Supply Chain Ordering with a Hybrid Liquid Neural Network and Extreme Gradient Boosting Model
【速读】:该论文旨在解决供应链管理(Supply Chain Management, SCM)中因需求波动和牛鞭效应(Bullwhip Effect)带来的挑战,尤其是传统方法及当前先进的大语言模型(Large Language Models, LLMs)在处理SCM复杂连续时间序列数据时表现不佳的问题。解决方案的关键在于提出一种混合液态神经网络(Liquid Neural Networks, LNN)与XGBoost的模型架构:利用LNN对动态特征的高效提取能力,结合XGBoost的全局优化优势,从而有效降低牛鞭效应并提升供应链盈利能力,填补了智能供应链管理领域对高效性与适应性协同需求的空白。
链接: https://arxiv.org/abs/2512.14112
作者: Chunan Tong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Supply chain management (SCM) faces significant challenges like demand fluctuations and the bullwhip effect. Traditional methods and even state-of-the-art LLMs struggle with benchmarks like the Vending Machine Test, failing to handle SCM’s complex continuous time-series data. While ML approaches like LSTM and XGBoost offer solutions, they are often limited by computational inefficiency. Liquid Neural Networks (LNN), known for their adaptability and efficiency in robotics, remain untapped in SCM. This study proposes a hybrid LNN+XGBoost model for multi-tier supply chains. By combining LNN’s dynamic feature extraction with XGBoost’s global optimization, the model aims to minimize the bullwhip effect and increase profitability. This innovative approach addresses the need for efficiency and adaptability, filling a critical gap in intelligent SCM.
zh
[AI-34] HydroGEM: A Self-Supervised Zero-Shot Hybrid TCN-Transformer Foundation Model for Continental-Scale Streamflow Quality Control
【速读】:该论文旨在解决大规模实时径流监测网络中,数千个偏远传感器数据质量难以维护的难题(data quality maintenance)。传统方法依赖人工校验,效率低下且难以扩展。解决方案的关键在于提出 HydroGEM(Hydrological Generalizable Encoder for Monitoring),一个用于大陆尺度径流质量控制的基础模型(foundation model)。其核心创新包括:采用两阶段训练策略——先在来自3,724个USGS站点的603万条序列上进行自监督预训练以学习水文表征,再用合成异常数据微调以实现检测与重构;设计混合TCN-Transformer架构(14.2M参数)捕捉局部时序模式和长程依赖关系,并引入分层归一化处理六数量级的流量变化范围。实验表明,该模型在保留专家验证的18类异常检测能力的同时,显著优于现有方法,在零样本迁移至加拿大环境与气候变化局站点时仍表现出强泛化能力(F1=0.586),并支持人机协同的质量控制流程。
链接: https://arxiv.org/abs/2512.14106
作者: Ijaz Ul Haq,Byung Suk Lee,Julia N. Perdrial,David Baude
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Supplementary materials, datasets, and implementation code will be made publicly available upon acceptance for publication in a peer-reviewed journal
Abstract:Real-time streamflow monitoring networks generate millions of observations annually, yet maintaining data quality across thousands of remote sensors remains labor-intensive. We introduce HydroGEM (Hydrological Generalizable Encoder for Monitoring), a foundation model for continental-scale streamflow quality control. HydroGEM uses two-stage training: self-supervised pretraining on 6.03 million sequences from 3,724 USGS stations learns hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures local temporal patterns and long-range dependencies, while hierarchical normalization handles six orders of magnitude in discharge. On held-out synthetic tests comprising 799 stations with 18 expert-validated anomaly types, HydroGEM achieves F1 = 0.792 for detection and 68.7% reconstruction-error reduction, a 36.3% improvement over existing methods. Zero-shot transfer to 100 Environment and Climate Change Canada stations yields F1 = 0.586, exceeding all baselines and demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns. HydroGEM is designed for human-in-the-loop workflows - outputs are quality control suggestions requiring expert review, not autonomous corrections.
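"处理跨六个数量级流量的分层归一化"的一种常见实现是"全局对数压缩 + 逐站点标准化",下面给出概念性示意(并非 HydroGEM 官方实现,具体层级与统计量选择为假设):

```python
import numpy as np

def hierarchical_normalize(q, station_ids):
    """q: 流量观测 (m^3/s);station_ids: 每条观测所属站点编号。"""
    x = np.log1p(q)                        # 第一层:全局对数压缩
    out = np.empty_like(x)
    for sid in np.unique(station_ids):
        m = station_ids == sid             # 第二层:逐站点 z-score
        mu, sd = x[m].mean(), x[m].std() + 1e-6
        out[m] = (x[m] - mu) / sd
    return out

rng = np.random.default_rng(0)
small_creek = rng.lognormal(mean=-2, sigma=0.5, size=100)   # ~0.1 m^3/s 量级
big_river = rng.lognormal(mean=8, sigma=0.5, size=100)      # ~3000 m^3/s 量级
q = np.concatenate([small_creek, big_river])
ids = np.array([0] * 100 + [1] * 100)
z = hierarchical_normalize(q, ids)
print("两站点归一化后标准差均约为 1:", z[:100].std(), z[100:].std())
```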
zh
[AI-35] Arithmetic-Intensity-Aware Quantization
【速读】:该论文旨在解决现代神经网络在推理阶段因DRAM带宽受限而导致的吞吐量瓶颈问题:随着模型日益呈内存受限(memory-bound)特征,内存访问而非计算成为性能的关键限制因素。其核心解决方案是提出一种基于算术强度(Arithmetic Intensity, AI)感知的混合精度量化框架——AIQ(Arithmetic-Intensity-Aware Quantization),通过在每层独立选择最优位宽来最大化算术强度并最小化精度损失。该方法采用后训练量化策略,利用搜索算法优化各层量化方案,在AI与准确率之间进行加权平衡,从而显著提升内存受限场景下的推理效率。实验表明,该方法可在精度损失不超过约1个百分点的前提下,使ResNet-20/CIFAR-10的算术强度较FP32基线提升约50%,并使内存受限的MobileNetV2吞吐量达到FP32基线的1.66倍。
链接: https://arxiv.org/abs/2512.14090
作者: Taig Singh,Shreshth Rajan,Nikhil Iyer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:As modern neural networks become increasingly memory-bound, inference throughput is limited by DRAM bandwidth rather than compute. We present Arithmetic-Intensity-Aware Quantization (AIQ), a mixed precision quantization framework that chooses per-layer bit-widths to maximize arithmetic intensity (AI) while minimizing accuracy loss. AIQ is a post-training quantization method that uses search algorithms over per-layer quantization schemes to minimize a weighted loss over AI and accuracy. On ResNet-20/CIFAR-10, AIQ increases AI by ~50% over an FP32 baseline while keeping test accuracy within ~1 percentage point, and outperforming global uniform quantization schemes. On a memory-bound MobileNetV2 architecture, AIQ configurations give a 1.66x higher throughput than the FP32 baseline while keeping test accuracy within 1 percentage point. We also find that AIQ naturally quantizes larger layers more aggressively.
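"按 AI = FLOPs / 访存字节数 贪心选择逐层位宽"可以用下面的玩具搜索示意(并非 AIQ 官方实现):各层 FLOPs、参数量与精度敏感度均为说明而设的假设;真实方法是在 AI 与精度的加权损失上做搜索。

```python
# 每项为 (名称, MFLOPs, 参数量/百万, 敏感度: 每降 8 bit 的精度损失百分点, 均为假设)
LAYERS = [
    ("conv1", 90.0, 0.4, 0.02),
    ("conv2", 60.0, 1.2, 0.10),
    ("fc",    2.0,  0.5, 0.30),
]
BITWIDTHS = [32, 16, 8, 4]
ACC_DROP_BUDGET = 1.0   # 允许的总精度下降(百分点,假设)

def arithmetic_intensity(bits):
    flops = sum(m for _, m, _, _ in LAYERS) * 1e6
    bytes_moved = sum(p * 1e6 * bits[name] / 8 for name, _, p, _ in LAYERS)
    return flops / bytes_moved          # 权重访存越少,AI 越高

def acc_drop(bits):
    return sum(s * (32 - bits[name]) / 8 for name, _, _, s in LAYERS)

# 贪心:每步把 "AI 提升 / 精度代价" 最划算的一层再降一档位宽
bits = {name: 32 for name, *_ in LAYERS}
while True:
    best = None
    for name, *_ in LAYERS:
        idx = BITWIDTHS.index(bits[name])
        if idx + 1 >= len(BITWIDTHS):
            continue                     # 已到最低位宽
        trial = dict(bits); trial[name] = BITWIDTHS[idx + 1]
        if acc_drop(trial) > ACC_DROP_BUDGET:
            continue                     # 超出精度预算
        gain = arithmetic_intensity(trial) - arithmetic_intensity(bits)
        cost = acc_drop(trial) - acc_drop(bits) + 1e-9
        if best is None or gain / cost > best[0]:
            best = (gain / cost, name, trial)
    if best is None:
        break
    bits = best[2]
print("逐层位宽:", bits, " AI:", round(arithmetic_intensity(bits), 1))
```

可以看到敏感度低的大层会被激进量化,而敏感的小层保留高位宽,这正是逐层混合精度相对全局统一量化的优势来源。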
zh
[AI-36] SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
【速读】:该论文针对混合专家(Mixture of Experts, MoE)模型在高粒度专家(fine-grained experts)和高稀疏性(high sparsity)趋势下所面临的两大挑战展开研究:一是细粒度MoE导致激活内存占用增加及硬件效率下降,源于更高的输入输出(IO)开销;二是稀疏MoE因Grouped GEMM核中的填充(padding)造成计算浪费。解决方案的关键在于三个核心创新:第一,提出一种内存高效的前向与反向传播算法,极大减少反向传播时的激活缓存需求;第二,设计GPU核函数以实现内存IO与计算重叠,提升所有MoE架构的硬件利用率;第三,引入新颖的“token rounding”方法最小化Grouped GEMM中因padding产生的冗余计算。这些改进共同实现了激活内存降低45%、计算吞吐量提升1.86倍(Hopper GPU上),并在高稀疏场景下进一步通过tile-aware token rounding获得1.16倍的内核执行速度提升,同时保持下游任务性能稳定。
链接: https://arxiv.org/abs/2512.14080
作者: Wentao Guo,Mayank Mishra,Xinle Cheng,Ion Stoica,Tri Dao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation benefiting all MoE architectures. Finally, we propose a novel “token rounding” method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE’s BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day comparable to ScatterMoE’s 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top- K routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
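"token rounding 减少 Grouped GEMM padding 浪费"的直觉可以用下面的玩具示意表达(并非 SonicMoE 官方实现):Grouped GEMM 以 TILE 行为粒度计算,每个专家的 token 数不是 TILE 的整数倍时要补零;就近取整后,向下取整的专家按路由分数丢弃尾部 token、向上取整的由改派填充——取整与改派策略均为此处的假设。

```python
import numpy as np

TILE = 128

def tiles_needed(counts):
    """含 padding 时实际要计算的 tile 数。"""
    return int(np.ceil(np.asarray(counts) / TILE).sum())

def token_rounding(counts):
    """就近取整到 TILE 的倍数(每个专家至少保留一个 tile,为假设)。"""
    counts = np.asarray(counts)
    rounded = np.round(counts / TILE).astype(int) * TILE
    return np.maximum(rounded, TILE)

rng = np.random.default_rng(0)
counts = rng.integers(1, 4 * TILE, size=16)   # 16 个专家的路由 token 计数
rounded = token_rounding(counts)
print("取整前 tile 数(含 padding 浪费):", tiles_needed(counts))
print("取整后 tile 数(tile 全部填满):", tiles_needed(rounded))
print("被调整的 token 总数:", int(np.abs(rounded - counts).sum()))
```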
zh
[AI-37] RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees
【速读】:该论文旨在解决现代大语言模型(Large Language Models, LLMs)推理过程中的高成本与低效率问题,尤其是传统推测采样(speculative sampling)方法中预设的草稿模型调用次数缺乏灵活性、导致冗余计算的问题。其解决方案的关键在于提出RADAR方法,通过将草稿树生成过程建模为马尔可夫决策过程(Markov Decision Process, MDP),并利用离线强化学习训练一个预测模型,从而实现实时动态决策草稿模型调用,有效减少冗余计算,显著提升推理速度。实验表明,RADAR在三种LLMs和四个任务上相较自回归解码基线实现了3.17x–4.82x的速度提升。
链接: https://arxiv.org/abs/2512.14069
作者: Junjie Ma,Jinlong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures
Abstract:Inference with modern Large Language Models (LLMs) is expensive and slow, and speculative sampling has emerged as an effective solution to this problem, however, the number of the calls to the draft model for generating candidate tokens in speculative sampling is a preset hyperparameter, lacking flexibility. To generate and utilize the candidate tokens more effectively, we propose RADAR, a novel speculative sampling method with RL-based dynamic draft trees. RADAR formulates the draft tree generation process as a Markov Decision Process (MDP) and employs offline reinforcement learning to train a prediction model, which enables real-time decision on the calls to the draft model, reducing redundant computations and further accelerating inference. Evaluations across three LLMs and four tasks show that RADAR achieves a speedup of 3.17x-4.82x over the auto-regressive decoding baseline. The code is available at this https URL.
zh
[AI-38] OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)后训练数据质量与多样性评估的黑箱问题,即当前模型性能基准测试高度透明,但支撑模型训练的数据集组成不透明、来源不明且缺乏系统性评价,从而阻碍了可复现性和数据特性与模型行为之间因果关系的揭示。解决方案的关键在于提出OpenDataArena(ODA),一个集成化的开放平台,其核心包括:(i) 统一的训练-评估流水线以实现跨模型(如Llama、Qwen)和领域公平比较;(ii) 多维评分框架对数据质量进行数十个维度的量化分析;(iii) 交互式数据谱系探索工具用于可视化数据来源及其演化关系;(iv) 完全开源的工具包支持训练、评估与评分,推动数据研究范式从经验试错转向数据驱动的科学方法,为基础模型的数据混合规律和战略组合提供严谨研究基础。
链接: https://arxiv.org/abs/2512.14051
作者: Mengzhang Cai,Xin Gao,Yu Li,Honglin Lin,Zheng Liu,Zhuoshi Pan,Qizhi Pei,Xiaoran Shang,Mengyuan Sun,Zinan Tang,Xiaoyang Wang,Zhanping Zhong,Yun Zhu,Dahua Lin,Conghui He,Lijun Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box–characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA–covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points–reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.
zh
[AI-39] Intention Chain-of-Thought Prompting with Dynamic Routing for Code Generation AAAI-2026
【速读】:该论文旨在解决现有链式思维(Chain-of-Thought, CoT)提示方法在代码生成中面临的两大问题:一是对简单任务过度应用统一的推理策略,导致计算资源浪费;二是缺乏对任务意图的抽象建模,使得模型难以捕捉核心算法逻辑及其效率目标,从而聚焦于表面结构而忽略全局优化。解决方案的关键在于提出一种难度感知的路由框架 RoutingGen,该框架根据任务复杂度动态选择提示策略——对于简单任务采用少量示例提示(few-shot prompting),对于复杂任务则激活一种新型结构化推理策略 Intention Chain-of-Thought (ICoT),通过显式建模任务意图(如核心算法设计与时间复杂度)引导模型生成更高效、更符合问题本质的代码。实验表明,RoutingGen 在多个基准上达到最优性能的同时,平均减少46.37%的总token消耗。
链接: https://arxiv.org/abs/2512.14048
作者: Shen Li,Li Huang,Shaoxiong Zhan,Weifeng Sun,Tao Yin,Zhongxin Liu,Meng Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at AAAI-2026
Abstract:Large language models (LLMs) exhibit strong generative capabilities and have shown great potential in code generation. Existing chain-of-thought (CoT) prompting methods enhance model reasoning by eliciting intermediate steps, but suffer from two major limitations: First, their uniform application tends to induce overthinking on simple tasks. Second, they lack intention abstraction in code generation, such as explicitly modeling core algorithmic design and efficiency, leading models to focus on surface-level structures while neglecting the global problem objective. Inspired by the cognitive economy principle of engaging structured reasoning only when necessary to conserve cognitive resources, we propose RoutingGen, a novel difficulty-aware routing framework that dynamically adapts prompting strategies for code generation. For simple tasks, it adopts few-shot prompting; for more complex ones, it invokes a structured reasoning strategy, termed Intention Chain-of-Thought (ICoT), which we introduce to guide the model in capturing task intention, such as the core algorithmic logic and its time complexity. Experiments across three models and six standard code generation benchmarks show that RoutingGen achieves state-of-the-art performance in most settings, while reducing total token usage by 46.37% on average across settings. Furthermore, ICoT outperforms six existing prompting baselines on challenging benchmarks.
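难度感知路由的骨架可以示意如下(并非 RoutingGen 官方实现):简单任务走 few-shot 提示以避免过度思考,复杂任务走 ICoT 提示,先显式写出核心算法与时间复杂度的意图再写代码。难度判断此处用关键词启发式占位,真实系统可能用模型判别;提示模板内容均为假设。

```python
HARD_HINTS = ("graph", "dynamic programming", "shortest path", "10^5", "optimal")

def estimate_difficulty(problem: str) -> str:
    """关键词启发式难度判断(占位;真实系统可换成分类器)。"""
    text = problem.lower()
    return "hard" if any(h in text for h in HARD_HINTS) else "easy"

def build_prompt(problem: str) -> str:
    if estimate_difficulty(problem) == "easy":
        return (f"Here are two solved examples...\n\n"
                f"Now solve:\n{problem}\n")           # few-shot:避免过度思考
    return (f"Problem:\n{problem}\n\n"
            "Before coding, state the task intention:\n"
            "1. Core algorithm and why it fits the goal.\n"
            "2. Target time complexity given the constraints.\n"
            "Then write the code.\n")                  # ICoT:意图链式思维

easy = "Reverse a string."
hard = "Given a weighted graph with 10^5 nodes, find the shortest path."
print(build_prompt(easy))
print(build_prompt(hard))
```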
zh
[AI-40] Evaluating Small Language Models for Agentic On-Farm Decision Support Systems
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)在奶牛养殖决策支持中因计算资源需求高而难以本地部署的问题,尤其是在农场硬件条件下无法有效运行云服务模式的限制。解决方案的关键在于开发并评估一系列轻量级语言模型(Small Language Models, SLM),这些模型可在农场本地设备上运行,并集成一个代理式人工智能系统(agentic AI system),该系统包含五个任务特定代理:文献检索、网络搜索、SQL数据库交互、NoSQL数据库交互和基于预测模型的图生成。通过在真实农场计算约束下对20个开源SLM进行基准测试,研究发现Qwen-4B在多数任务类别中表现最优,验证了SLM作为隐私敏感且计算高效的决策支持工具在奶牛养殖场景中的可行性,但仍需进一步微调以提升其在专业领域问题上的准确性。
链接: https://arxiv.org/abs/2512.14043
作者: Enhong Liu,Haiyu Yang,Miel Hostens
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLM) hold potential to support dairy scholars and farmers by supporting decision-making and broadening access to knowledge for stakeholders with limited technical expertise. However, the substantial computational demand restricts access to LLM almost exclusively through cloud-based service, which makes LLM-based decision support tools impractical for dairy farming. To address this gap, lightweight alternatives capable of running locally on farm hardware are required. In this work, we benchmarked 20 open-source Small Language Models (SLM) available on HuggingFace under farm-realistic computing constraints. Building on our prior work, we developed an agentic AI system that integrates five task-specific agents: literature search, web search, SQL database interaction, NoSQL database interaction, and graph generation following predictive models. Evaluation was conducted in two phases. In the first phase, five test questions were used for the initial screening to identify models capable of following basic dairy-related instructions and performing reliably in a compute-constrained environment. Models that passed this preliminary stage were then evaluated using 30 questions (five per task category mentioned above, plus one category addressing integrity and misconduct) in phase two. In results, Qwen-4B achieved superior performance across most of task categories, although showed unstable effectiveness in NoSQL database interactions through PySpark. To our knowledge, this is the first work explicitly evaluating the feasibility of SLM as engines for dairy farming decision-making, with central emphases on privacy and computational efficiency. While results highlight the promise of SLM-assisted tools for practical deployment in dairy farming, challenges remain, and fine-tuning is still needed to refine SLM performance in dairy-specific questions.
zh
[AI-41] Sample-Efficient Robot Skill Learning for Construction Tasks: Benchmarking Hierarchical Reinforcement Learning and Vision-Language-Action (VLA) Model
【速读】:该论文旨在解决建筑机器人技能学习中任务适应性与部署效率的问题,即如何在实际施工场景中高效地训练机器人完成复杂、多阶段的任务。其关键解决方案在于对比两种主流方法:视觉-语言-动作(Vision-Language-Action, VLA)模型与强化学习(Reinforcement Learning, RL)方法,并通过三阶段实验评估其性能与实用性。研究发现,VLA模型凭借强泛化能力和少样本学习优势,在搬运和安装等任务中表现出高成功率(如拾取阶段达60%和100%),显著降低了编程工作量和数据需求;而DQN作为RL基线虽可通过额外噪声调参提升鲁棒性,但增加了调试负担。因此,VLA在应对施工任务变化时展现出更优的工程实践价值。
链接: https://arxiv.org/abs/2512.14031
作者: Zhaofeng Hu,Hongrui Yu,Vaidhyanathan Chandramouli,Ci-Jyun Liang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:This study evaluates two leading approaches for teaching construction robots new skills to understand their applicability for construction automation: a Vision-Language-Action (VLA) model and Reinforcement Learning (RL) methods. The goal is to understand both task performance and the practical effort needed to deploy each approach on real jobs. The authors developed two teleoperation interfaces to control the robots and collect the demonstrations needed, both of which proved effective for training robots for long-horizon and dexterous tasks. In addition, the authors conduct a three-stage evaluation. First, the authors compare a Multi-Layer Perceptron (MLP) policy with a Deep Q-network (DQN) imitation model to identify the stronger RL baseline, focusing on model performance, generalization, and a pick-up experiment. Second, three different VLA models are trained in two different scenarios and compared with each other. Third, the authors benchmark the selected RL baseline against the VLA model using computational and sample-efficiency measures and then a robot experiment on a multi-stage panel installation task that includes transport and installation. The VLA model demonstrates strong generalization and few-shot capability, achieving 60% and 100% success in the pickup phase. In comparison, DQN can be made robust but needs additional noise during tuning, which increases the workload. Overall, the findings indicate that VLA offers practical advantages for changing tasks by reducing programming effort and enabling useful performance with minimal data, while DQN provides a viable baseline when sufficient tuning effort is acceptable.
zh
[AI-42] PerfCoder: Large Language Models for Interpretable Code Performance Optimization
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动代码生成中难以产出高性能代码的问题,尤其是在真实软件系统中对性能优化的高要求。现有模型受限于数据稀缺和缺乏引导可解释性与高效性能提升的监督机制。解决方案的关键在于提出 PerfCoder,一个专门用于通过可解释、定制化优化生成性能增强代码的 LLM 家族;其核心创新包括:1)在包含人类可读标注的真实世界优化轨迹数据集上进行微调,2)利用运行时测量结果通过强化学习微调实现偏好对齐,从而直接提出并应用输入特定的优化策略,无需迭代 refine;3)生成可解释的反馈信息,支持与更大规模模型协同工作以进一步提升优化效果。实验证明,PerfCoder 在 PIE 代码性能基准上显著优于现有模型,在加速比和有效优化率方面均取得突破,验证了“仅靠规模无法实现性能优化”这一观点,强调了优化策略意识的重要性。
链接: https://arxiv.org/abs/2512.14018
作者: Jiuding Yang,Shengyao Lu,Hongxuan Liu,Shayan Shirahmad Gale Bagi,Zahra Fazel,Tomasz Czajkowski,Di Niu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have achieved remarkable progress in automatic code generation, yet their ability to produce high-performance code, a critical requirement in real-world software systems, remains limited. We argue that current LLMs struggle not only due to data scarcity but, more importantly, because they lack supervision that guides interpretable and effective performance improvements. In this work, we introduce PerfCoder, a family of LLMs specifically designed to generate performance-enhanced code from source code via interpretable, customized optimizations. PerfCoder is fine-tuned on a curated collection of real-world optimization trajectories with human-readable annotations, and preference-aligned by reinforcement fine-tuning using runtime measurements, enabling it to propose input-specific improvement strategies and apply them directly without relying on iterative refinement. On the PIE code performance benchmark, PerfCoder surpasses all existing models in both runtime speedup and effective optimization rate, demonstrating that performance optimization cannot be achieved by scale alone but requires optimization strategy awareness. In addition, PerfCoder can generate interpretable feedback about the source code, which, when provided as input to a larger LLM in a planner-and-optimizer cooperative workflow, can further improve outcomes. Specifically, we elevate the performance of 32B models and GPT-5 to new levels on code optimization, substantially surpassing their original performance.
zh
[AI-43] MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
【速读】:该论文旨在解决传统像素空间世界模型在GUI(图形用户界面)场景下难以有效预测复杂视觉元素的问题,从而限制了具身智能体的任务表现。其解决方案的关键在于提出一种基于自然语言的状态转移建模方式,将GUI环境中的状态变化从像素级预测转化为语义层面的描述;通过构建MobileWorldBench基准和MobileWorld大规模数据集(140万样本),显著提升了视觉语言模型(VLM)作为世界模型的能力,并设计了一种将VLM世界模型嵌入移动代理规划框架的新方法,实验证明该语义世界模型可直接提升移动代理的任务成功率。
链接: https://arxiv.org/abs/2512.14014
作者: Shufan Li,Konstantinos Kallidromitis,Akash Gokul,Yusuke Kato,Kazuki Kozuka,Aditya Grover
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 13 figures
Abstract:World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicting raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset consisting of 1.4M samples that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at this https URL
zh
[AI-44] Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025
【速读】:该论文试图解决的问题是:在专业软件开发实践中,AI代理(AI agents)的实际角色尚不明确,尤其是在经验丰富的开发者如何使用这些代理进行软件构建、其动机、策略、任务适配性及情感态度等方面缺乏系统理解。解决方案的关键在于通过实地观察(N=13)与定性调查(N=99)揭示出,尽管开发者普遍认为代理能提升生产力,但他们仍坚持保留对软件设计与实现的主导权,以保障核心软件质量属性;同时,他们利用自身专业知识制定控制代理行为的策略,并对代理能力持积极但审慎的态度,这表明软件开发最佳实践在有效利用代理过程中具有关键作用。
链接: https://arxiv.org/abs/2512.14012
作者: Ruanqianqian Huang,Avery Reyna,Sorin Lerner,Haijun Xia,Brian Hempel
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:The rise of AI agents is transforming how software can be built. The promise of agents is that developers might write code quicker, delegate multiple tasks to different agents, and even write a full piece of software purely out of natural language. In reality, what roles agents play in professional software development remains in question. This paper investigates how experienced developers use agents in building software, including their motivations, strategies, task suitability, and sentiments. Through field observations (N=13) and qualitative surveys (N=99), we find that while experienced developers value agents as a productivity boost, they retain their agency in software design and implementation out of insistence on fundamental software quality attributes, employing strategies for controlling agent behavior leveraging their expertise. In addition, experienced developers feel overall positive about incorporating agents into software development given their confidence in complementing the agents’ limitations. Our results shed light on the value of software development best practices in effective use of agents, suggest the kinds of tasks for which agents may be suitable, and point towards future opportunities for better agentic interfaces and agentic use guidelines.
zh
[AI-45] Memo2496: Expert-Annotated Dataset and Dual-View Adaptive Framework for Music Emotion Recognition
【速读】:该论文旨在解决音乐情感识别(Music Emotion Recogniser, MER)研究中面临的两大挑战:一是高质量标注数据集稀缺,二是跨音乐片段特征漂移(cross-track feature drift)问题。为应对这些问题,作者提出了两个核心贡献:其一,构建了大规模标注数据集Memo2496,包含2496首器乐曲目,每首曲目均具有连续的情绪维度标签(愉悦度-唤醒度,valence-arousal),并由30名认证音乐专家标注,通过极端情绪示例校准与欧氏距离一致性阈值(0.25)确保标注质量;其二,设计了双视角自适应音乐情感识别模型DAMER,其关键创新在于三个协同模块:Dual Stream Attention Fusion(DSAF)实现梅尔频谱图与耳蜗图之间的token级双向交互;Progressive Confidence Labeling(PCL)利用课程温度调度与Jensen-Shannon散度量化一致性生成可靠伪标签;Style Anchored Memory Learning(SAML)引入对比记忆队列缓解跨片段特征漂移。实验表明,DAMER在Memo2496、1000songs和PMEmo数据集上均达到当前最优性能,验证了方案的有效性。
链接: https://arxiv.org/abs/2512.13998
作者: Qilin Li,C. L. Philip Chen,Tong Zhang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER’s state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module’s contribution. Both the dataset and source code are publicly available.
zh
[AI-46] Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training
【速读】:该论文旨在解决稀疏专家混合(Mixture-of-Experts, MoE)架构中因标准Top-k路由策略导致的固定稀疏模式问题,以及现有Top-p路由方法依赖固定全局概率阈值所引发的计算成本不可控和超参数敏感性问题。解决方案的关键在于提出一种可控制稀疏度的动态Top-p路由机制(DTop-p MoE):首先,通过引入比例-积分(Proportional-Integral, PI)控制器动态调整概率阈值,以使激活专家的稀疏度趋近于预设目标;其次,设计了一种层间自适应的路由归一化机制,使不同层能够学习差异化的专家选择模式,同时保持全局概率阈值的一致性。这一方案实现了对激活专家数量的精确控制,并能根据输入token和网络层动态分配计算资源,显著优于传统的Top-k与固定阈值Top-p基线方法。
链接: https://arxiv.org/abs/2512.13996
作者: Can Jin,Hongwu Peng,Mingcan Xiang,Qixin Zhang,Xiangchi Yuan,Amit Hasan,Ohiremen Dibua,Yifan Gong,Yan Kang,Dimitris N. Metaxas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse Mixture-of-Experts (MoE) architectures effectively scale model capacity by activating only a subset of experts for each input token. However, the standard Top-k routing strategy imposes a uniform sparsity pattern that ignores the varying difficulty of tokens. While Top-p routing offers a flexible alternative, existing implementations typically rely on a fixed global probability threshold, which results in uncontrolled computational costs and sensitivity to hyperparameter selection. In this paper, we propose DTop-p MoE, a sparsity-controllable dynamic Top-p routing mechanism. To resolve the challenge of optimizing a non-differentiable threshold, we utilize a Proportional-Integral (PI) Controller that dynamically adjusts the probability threshold to align the running activated-expert sparsity with a specified target. Furthermore, we introduce a dynamic routing normalization mechanism that adapts layer-wise routing logits, allowing different layers to learn distinct expert-selection patterns while utilizing a global probability threshold. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that DTop-p consistently outperforms both Top-k and fixed-threshold Top-p baselines. Our analysis confirms that DTop-p maintains precise control over the number of activated experts while adaptively allocating resources across different tokens and layers. Furthermore, DTop-p exhibits strong scaling properties with respect to expert granularity, expert capacity, model size, and dataset size, offering a robust framework for large-scale MoE pre-training.
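下面用 PyTorch 给出"PI 控制器动态调节 Top-p 阈值"的一个极简示意。`kp`、`ki` 等系数、阈值上下界与选择规则均为笔者假设,并非论文超参,仅演示"实测稀疏度反馈 → 调整阈值"的闭环:

```python
import torch

class PIThreshold:
    """PI 控制器:把实测的平均激活专家数拉向目标值。"""
    def __init__(self, target_k: float, kp: float = 0.01, ki: float = 0.001):
        self.target_k, self.kp, self.ki = target_k, kp, ki
        self.p, self.integral = 0.5, 0.0

    def step(self, measured_k: float) -> float:
        err = measured_k - self.target_k        # 激活专家过多 -> 调低阈值 p
        self.integral += err
        self.p = float(min(max(self.p - self.kp * err - self.ki * self.integral,
                               0.05), 0.99))
        return self.p

def top_p_select(router_probs: torch.Tensor, p: float) -> torch.Tensor:
    """为每个 token 选出累计概率首次达到 p 的最小专家集合(布尔掩码)。"""
    sorted_p, idx = router_probs.sort(dim=-1, descending=True)
    keep_sorted = (sorted_p.cumsum(-1) - sorted_p) < p   # 累计达到 p 前全部保留
    mask = torch.zeros_like(router_probs)
    mask.scatter_(-1, idx, keep_sorted.float())
    return mask.bool()

ctrl = PIThreshold(target_k=2.0)
probs = torch.softmax(torch.randn(8, 16), dim=-1)        # 8 个 token、16 个专家
mask = top_p_select(probs, ctrl.p)
ctrl.step(mask.float().sum(-1).mean().item())            # 每步用实测稀疏度更新 p
```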
zh
[AI-47] ReflCtrl: Controlling LLM Reflection via Representation Engineering NEURIPS25
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理过程中因频繁自反思(self-reflection)导致的推理成本过高问题。解决方案的关键在于通过表征工程(representation engineering)识别出模型推理路径中对应于反思行为的步骤,并在潜在空间中提取一个控制反思方向(reflection direction);基于此方向,提出一种分步可控的引导方法(stepwise steering method),即ReflCtrl框架,从而实现对反思频率的精确调控。实验表明,在多数情况下反思行为冗余,尤其在更强模型中可节省高达33.6%的推理token而保持性能不变,且模型的反思行为与内部不确定性信号高度相关,暗示可通过不确定性来控制自反思行为。
链接: https://arxiv.org/abs/2512.13979
作者: Ge Yan,Chung-En Sun,Tsui-Wei (Lily) Weng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Spotlight in NeurIPS 25 MI workshop
Abstract:Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have achieved strong performance across diverse tasks, including mathematics, coding, and general reasoning. A distinctive ability of these reasoning models is self-reflection: the ability to review and revise previous reasoning steps. While self-reflection enhances reasoning performance, it also increases inference cost. In this work, we study self-reflection through the lens of representation engineering. We segment the model’s reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that can control reflection frequency. We call our framework ReflCtrl. Our experiments show that (1) in many cases reflections are redundant, especially in stronger models (in our experiments, we can save up to 33.6 percent of reasoning tokens while preserving performance), and (2) the model’s reflection behavior is highly correlated with an internal uncertainty signal, implying self-reflection may be controlled by the model’s uncertainty.
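按表征工程中常见的差分均值做法,可以这样抽取并施加"反思方向"。以下是笔者的假设性示意(函数名、维度与 `alpha` 取值均为假设,并非论文开源实现):

```python
import torch

def reflection_direction(refl_h: torch.Tensor, plain_h: torch.Tensor) -> torch.Tensor:
    """反思步与非反思步隐藏状态的均值之差,归一化后作为反思方向。"""
    d = refl_h.mean(dim=0) - plain_h.mean(dim=0)
    return d / d.norm()

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """前向传播时给隐藏状态加偏移:alpha < 0 抑制反思,alpha > 0 鼓励反思。"""
    return hidden + alpha * direction

refl_h = torch.randn(32, 4096)     # 假设:32 个反思步的隐藏状态
plain_h = torch.randn(64, 4096)    # 假设:64 个普通推理步的隐藏状态
d = reflection_direction(refl_h, plain_h)
h_new = steer(torch.randn(1, 4096), d, alpha=-4.0)   # 降低反思频率以节省 token
```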
zh
[AI-48] Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms
【速读】:该论文旨在解决当前前沿大语言模型(Large Language Models, LLMs)在 graduate-level 数学理论推理能力上的评估不足问题,特别是其在形式化证明生成任务中的可靠性与一致性。解决方案的关键在于构建一个针对经典教材《Randomized Algorithms》(Motwani and Raghavan [MR95])的系统性基准测试,对四种前沿模型(GPT-5-Thinking、Gemini-3-Pro、Claude-Sonnet-4.5-Thinking 和 Grok-4)进行定量和定性分析,要求它们生成严格的 LaTeX 形式化证明。结果显示,顶级模型(Gemini 和 Claude)在概率方法和形式逻辑掌握上表现出约 66% 的准确率,而其他模型一致性显著偏低(约 40%),揭示了当前 LLMs 在数学严谨推导中存在显著性能差异,为未来模型改进提供了明确方向。
链接: https://arxiv.org/abs/2512.13978
作者: Yang Cao,Yubin Chen,Xuyang Guo,Zhao Song,Song Yue,Jiahao Zhang,Jiale Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of large language models (LLMs) has led to significant breakthroughs in automated mathematical reasoning and scientific discovery. Georgiev, Gómez-Serrano, Tao, and Wagner [GGSTW+25] demonstrate that AI systems can explore new constructions and improve existing bounds, illustrating the growing potential of LLMs to accelerate mathematical discovery. Similarly, Bubeck et al. [BCE+25] show that GPT-5 can meaningfully contribute to scientific workflows, from proposing hypotheses to generating proofs and analyses. Despite these advances, a rigorous evaluation of these models on canonical, graduate-level mathematical theory remains necessary to understand their baseline reasoning capabilities. In this paper, we present a comprehensive benchmark of four frontier models: GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4 against the classic curriculum of Randomized Algorithms by Motwani and Raghavan [MR95]. We tasked each model with generating formal LaTeX proofs for a series of lemmas and exercises spanning the textbook. We find that while the top-tier models (Gemini and Claude) achieve a high accuracy rate (approx. 66%), demonstrating a robust grasp of the probabilistic method and formal logic, other models lag significantly in consistency (approx. 40%). We provide a qualitative analysis of the generated proofs, highlighting differences in conciseness, hallucination rates, and logical structure. Our results suggest that while frontier models have reached a threshold of proficiency suitable for graduate-level pedagogical assistance and formalization, significant variance exists in their reliability for rigorous mathematical derivation. The code and the full set of LLM-generated responses are open-sourced and publicly available at this https URL.
zh
[AI-49] Multi-Agent Collaborative Framework for Intelligent IT Operations: An AOI System with Context-Aware Compression and Dynamic Task Scheduling
【速读】:该论文旨在解决云原生架构下微服务与动态编排导致的现代IT基础设施复杂性和高波动性问题,此类问题引发运维数据量激增,进而造成传统系统在信息处理效率低、任务协调差以及故障诊断与修复过程中上下文连续性丢失等瓶颈。解决方案的关键在于提出AOI(AI-Oriented Operations)框架,其核心创新包括:(1) 基于实时系统状态自适应优先级调度的动态任务分配策略;(2) 由工作记忆(Working)、情景记忆(Episodic)和语义记忆(Semantic)三层构成的记忆架构,实现上下文信息的有效保留与高效检索。该方案显著提升了上下文压缩比(72.4%)并保持92.8%的关键信息完整性,同时将任务成功率提升至94.2%,并将平均修复时间(MTTR)缩短34.4%,实现了面向未来的可扩展、自适应且上下文感知的自治运维范式。
链接: https://arxiv.org/abs/2512.13956
作者: Zishan Bai,Enze Ge,Junfeng Hao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:The proliferation of cloud-native architectures, characterized by microservices and dynamic orchestration, has rendered modern IT infrastructures exceedingly complex and volatile. This complexity generates overwhelming volumes of operational data, leading to critical bottlenecks in conventional systems: inefficient information processing, poor task coordination, and loss of contextual continuity during fault diagnosis and remediation. To address these challenges, we propose AOI (AI-Oriented Operations), a novel multi-agent collaborative framework that integrates three specialized agents with an LLM-based Context Compressor. Its core innovations include: (1) a dynamic task scheduling strategy that adaptively prioritizes operations based on real-time system states, and (2) a three-layer memory architecture comprising Working, Episodic, and Semantic layers that optimizes context retention and retrieval. Extensive experiments on both synthetic and real-world benchmarks demonstrate that AOI effectively mitigates information overload, achieving a 72.4% context compression ratio while preserving 92.8% of critical information and significantly enhances operational efficiency, attaining a 94.2% task success rate and reducing the Mean Time to Repair (MTTR) by 34.4% compared to the best baseline. This work presents a paradigm shift towards scalable, adaptive, and context-aware autonomous operations, enabling robust management of next-generation IT infrastructures with minimal human intervention.
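三层记忆架构的分工可以用很少的代码说明。以下是笔者的概念性示意(容量、晋升与检索规则均为假设,仅展示"工作/情景/语义"三层各自的职责):

```python
from collections import deque

class ThreeLayerMemory:
    def __init__(self, working_cap: int = 20):
        self.working = deque(maxlen=working_cap)  # 工作记忆:近期原始事件,自动淘汰
        self.episodic = []                        # 情景记忆:按事故归档的压缩摘要
        self.semantic = {}                        # 语义记忆:故障模式 -> 处置方案

    def observe(self, event: str):
        self.working.append(event)

    def close_incident(self, incident_id: str, summary: str):
        self.episodic.append((incident_id, summary))   # 摘要可由 LLM 压缩器生成

    def promote(self, pattern: str, remedy: str):
        self.semantic[pattern] = remedy           # 反复验证的经验沉淀为长期知识

    def retrieve(self, query: str):
        hits = [s for _, s in self.episodic if query in s]
        return list(self.working)[-5:], hits, self.semantic.get(query)

mem = ThreeLayerMemory()
mem.observe("pod-7 OOMKilled")
mem.close_incident("INC-42", "pod-7 OOM caused by memory leak in v1.3")
mem.promote("OOM", "rollback to v1.2 and raise memory limit")
print(mem.retrieve("OOM"))
```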
zh
[AI-50] MURIM: Multidimensional Reputation-based Incentive Mechanism for Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中客户端激励不足、隐私风险以及资源受限等问题,尤其是如何评估客户端可靠性以实现公平的激励分配,并确保其数据对全局模型的有效贡献。解决方案的关键在于提出一种多维声誉驱动的激励机制(MURIM),该机制综合考虑客户端的可靠性、隐私保护能力、资源容量及公平性,通过基于贡献度、延迟和声誉的奖励分配策略,结合可靠性验证模块,有效防止恶意或不可靠客户端获取不当收益。实验表明,MURIM在公平性指标上提升最高达18%,隐私攻击成功率降低5-9%,并对投毒和噪声梯度攻击的鲁棒性提升达85%。
链接: https://arxiv.org/abs/2512.13955
作者: Sindhuja Madabushi,Dawood Wasif,Jin-Hee Cho
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) has emerged as a leading privacy-preserving machine learning paradigm, enabling participants to share model updates instead of raw data. However, FL continues to face key challenges, including weak client incentives, privacy risks, and resource constraints. Assessing client reliability is essential for fair incentive allocation and ensuring that each client’s data contributes meaningfully to the global model. To this end, we propose MURIM, a MUlti-dimensional Reputation-based Incentive Mechanism that jointly considers client reliability, privacy, resource capacity, and fairness while preventing malicious or unreliable clients from earning undeserved rewards. MURIM allocates incentives based on client contribution, latency, and reputation, supported by a reliability verification module. Extensive experiments on MNIST, FMNIST, and ADULT Income datasets demonstrate that MURIM achieves up to 18% improvement in fairness metrics, reduces privacy attack success rates by 5-9%, and improves robustness against poisoning and noisy-gradient attacks by up to 85% compared to state-of-the-art baselines. Overall, MURIM effectively mitigates adversarial threats, promotes fair and truthful participation, and preserves stable model convergence across heterogeneous and dynamic federated settings.
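论文的多维激励分配思想可概括为"加权打分 + 声誉门槛 + 预算按比例分配"。以下示意中的权重 `w`、声誉门槛 `rep_floor` 与延迟归一化方式均为笔者假设:

```python
def murim_allocate(contribution, latency, reputation, budget,
                   w=(0.5, 0.2, 0.3), rep_floor=0.3):
    """按贡献度、延迟(越低越好)与声誉加权分配激励预算;
    声誉低于 rep_floor 的客户端视为不可靠,不参与分配。"""
    scores = {}
    for cid, rep in reputation.items():
        if rep < rep_floor:
            continue
        lat_score = 1.0 / (1.0 + latency[cid])   # 把延迟映射为越大越好的分数
        scores[cid] = w[0] * contribution[cid] + w[1] * lat_score + w[2] * rep
    total = sum(scores.values()) or 1.0
    return {cid: budget * s / total for cid, s in scores.items()}

rewards = murim_allocate(
    contribution={"a": 0.8, "b": 0.4, "c": 0.6},
    latency={"a": 0.2, "b": 1.5, "c": 0.4},
    reputation={"a": 0.9, "b": 0.2, "c": 0.7},   # b 低于声誉门槛,被排除
    budget=100.0,
)
print(rewards)
```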
zh
[AI-51] Informing Acquisition Functions via Foundation Models for Molecular Discovery
【速读】:该论文旨在解决贝叶斯优化(Bayesian Optimization, BO)在分子发现中面临的低数据场景下的性能瓶颈问题,即当先验知识不足且候选分子空间庞大时,传统BO方法因依赖显式概率代理模型(surrogate model)而难以高效探索。其关键解决方案是提出一种无需显式建模似然函数的贝叶斯优化方法(likelihood-free BO),直接利用通用大语言模型(Large Language Models, LLMs)和化学基础模型(chemistry foundation models)提供的先验信息来指导采集函数(acquisition function)的设计;同时引入树状结构对分子搜索空间进行局部划分,并结合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)实现高效的候选分子选择;进一步通过粗粒度LLM聚类限制采集函数评估范围,仅在统计上具有更高属性值的簇内执行优化,从而显著提升方法在大规模候选集上的可扩展性、鲁棒性和样本效率。
链接: https://arxiv.org/abs/2512.13935
作者: Qi Chen,Fabio Ramos,Alán Aspuru-Guzik,Florian Shkurti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.
zh
[AI-52] Context Branching for LLM Conversations: A Version Control Approach to Exploratory Programming
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮对话中性能显著下降的问题,尤其是在探索性编程任务中,模型容易因上下文污染而产生过早假设且无法纠正错误方向。现有方法迫使用户在“继续污染的对话”与“重置并丢失上下文”之间做出非此即彼的选择。其解决方案的关键在于提出 ContextBranch 系统,该系统借鉴版本控制语义,提供四个核心操作原语——检查点(checkpoint)、分支(branch)、切换(switch)和注入(inject),使用户能够隔离探索路径、捕获对话状态,并选择性地合并有用见解,从而有效避免上下文污染,提升复杂场景下的响应质量与上下文感知能力。
链接: https://arxiv.org/abs/2512.13914
作者: Bhargav Chickmagalur Nanjundappa,Spandan Maaheshwari
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 11 pages, 4 figures, 2 tables, 1 code snippet, 4 algorithms
Abstract:Large Language Models (LLMs) have become integral to software engineering workflows, yet their effectiveness degrades significantly in multi-turn conversations. Recent studies demonstrate an average 39% performance drop when instructions are delivered across multiple turns, with models making premature assumptions and failing to course correct (Laban et al., 2025). This degradation is particularly problematic in exploratory programming tasks where developers need to investigate alternative approaches without committing to a single path. Current solutions force users into a false dichotomy: continue in a context-polluted conversation where the LLM becomes increasingly confused, or start fresh and lose all accumulated context. We present ContextBranch, a conversation management system that applies version control semantics to LLM interactions. ContextBranch provides four core primitives (checkpoint, branch, switch, and inject), enabling users to capture conversation state, explore alternatives in isolation, and selectively merge insights. We evaluate ContextBranch through a controlled experiment with 30 software engineering scenarios featuring intentionally polluting explorations. Branched conversations achieved higher response quality compared to linear conversations, with large improvements in focus and context awareness. Benefits were concentrated in complex scenarios involving conceptually distant explorations. Branching reduced context size by 58.1% (31.0 to 13.0 messages), eliminating irrelevant exploratory content. Our work establishes conversation branching as a fundamental primitive for AI-assisted exploratory work, demonstrating that isolation prevents context pollution when exploring alternatives.
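四个原语(checkpoint / branch / switch / inject)的语义与 git 很接近。下面是笔者按摘要描述写的极简示意,数据结构与接口均为假设,并非论文的实际实现:

```python
from copy import deepcopy

class ContextBranch:
    """checkpoint / branch / switch / inject 四原语的极简实现。"""
    def __init__(self):
        self.branches = {"main": []}       # 分支名 -> 消息列表
        self.checkpoints = {}              # 检查点名 -> (分支名, 消息数)
        self.current = "main"

    def add(self, role, content):
        self.branches[self.current].append({"role": role, "content": content})

    def checkpoint(self, name):
        self.checkpoints[name] = (self.current, len(self.branches[self.current]))

    def branch(self, name, from_checkpoint):
        src, n = self.checkpoints[from_checkpoint]
        self.branches[name] = deepcopy(self.branches[src][:n])

    def switch(self, name):
        self.current = name

    def inject(self, from_branch, summary):
        """只把旁支的结论摘要并入当前分支,而非整段探索历史,避免上下文污染。"""
        self.add("system", f"[insight from {from_branch}] {summary}")

# 用法:主线打检查点 -> 开分支探索 -> 回主线并注入结论
cb = ContextBranch()
cb.add("user", "Design a cache layer.")
cb.checkpoint("cp1")
cb.branch("try-redis", "cp1")
cb.switch("try-redis")
cb.add("user", "What if we use Redis here?")
cb.switch("main")
cb.inject("try-redis", "Redis fits; use TTL-based eviction.")
```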
zh
[AI-53] Exploring Machine Learning Deep Learning and Explainable AI Methods for Seasonal Precipitation Prediction in South America
【速读】:该论文旨在解决纯数据驱动方法在降水预报中的可行性问题,即在缺乏传统动态模型依赖的前提下,能否通过机器学习(ML)和深度学习(DL)技术实现高精度的降水预测。其关键解决方案在于系统比较了多种经典机器学习(随机森林、XGBoost)与深度学习模型(1D卷积神经网络、长短期记忆网络LSTM、门控循环单元GRU)在南美地区全年2019年降水预报中的性能,并引入可解释人工智能(XAI)分析模型行为。研究发现,LSTM在重降水事件预测中表现最优,尽管存在较高延迟;而XGBoost则在计算成本敏感场景下提供次优但高效的替代方案,验证了深度学习模型在气候预测中的可行性和优越性,呼应了全球气象和气候预报中心对数据驱动方法日益增长的采纳趋势。
链接: https://arxiv.org/abs/2512.13910
作者: Matheus Corrêa Domingos,Valdivino Alexandre de Santiago Júnior,Juliana Aparecida Anochi,Elcio Hideiti Shiguemori,Luísa Mirelle Costa dos Santos,Hércules Carlos dos Santos Pereira,André Estevam Costa Oliveira
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Forecasting meteorological variables is challenging due to the complexity of their processes, requiring advanced models for accuracy. Accurate precipitation forecasts are vital for society, since reliable predictions help communities mitigate climatic impacts. Given the current relevance of artificial intelligence (AI), classical machine learning (ML) and deep learning (DL) techniques have been used as an alternative or complement to dynamic modeling. However, there is still a lack of broad investigations into the feasibility of purely data-driven approaches for precipitation forecasting. This study addresses this issue through a detailed investigation of different classical ML and DL approaches for forecasting precipitation in South America, taking into account all 2019 seasons. The selected classical ML techniques were Random Forests and extreme gradient boosting (XGBoost), while the DL counterparts were a 1D convolutional neural network (CNN 1D), a long short-term memory (LSTM) model, and a gated recurrent unit (GRU) model. Additionally, the Brazilian Global Atmospheric Model (BAM) was used as a representative of the traditional dynamic modeling approach. We also relied on explainable artificial intelligence (XAI) to provide some explanations for the models' behaviors. LSTM showed strong predictive performance while BAM, the traditional dynamic model representative, had the worst results. Despite its higher latency, LSTM was the most accurate for heavy precipitation. If cost is a concern, XGBoost offers lower latency with a slight loss in accuracy. The results of this research confirm the viability of DL models for climate forecasting, consolidating a global trend in major meteorological and climate forecasting centers.
zh
[AI-54] Assessing High-Risk Systems: An EU AI Act Verification Framework
【速读】:该论文旨在解决欧盟在实施《人工智能法案》(AI Act)及其他相关AI法规过程中,因缺乏系统性合规验证方法而导致的监管模糊性问题,这一问题已被广泛认为是阻碍成员国一致准备和执行法规的主要障碍。解决方案的关键在于提出一个综合框架,将合规验证活动按两个核心维度进行组织:一是验证方法类型(控制 vs. 测试),二是评估对象(数据、模型、流程及最终产品)。该框架进一步将关键法律要求映射到具体的验证活动中,从而在政策制定者与实践者之间建立桥梁,实现法律文本与技术标准及最佳实践的对齐,有效降低解释不确定性,提升评估的一致性,并促进整个AI生命周期中监管、伦理与技术视角的协同。
链接: https://arxiv.org/abs/2512.13907
作者: Alessio Buscemi,Tom Deckenbrunnen,Fahria Kabir,Nishat Mowla,Kateryna Mishchenko
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:A central challenge in implementing the AI Act and other AI-relevant regulations in the EU is the lack of a systematic approach to verify their legal mandates. Recent surveys show that this regulatory ambiguity is perceived as a significant burden, leading to inconsistent readiness across Member States. This paper proposes a comprehensive framework designed to help close this gap by organising compliance verification along two fundamental dimensions: the type of method (controls vs. testing) and the target of assessment (data, model, processes, and final product). Additionally, our framework maps core legal requirements to concrete verification activities, serving as a vital bridge between policymakers and practitioners, and aligning legal text with technical standards and best practices. The proposed approach aims to reduce interpretive uncertainty, promote consistency in assessment practices, and support the alignment of regulatory, ethical, and technical perspectives across the AI lifecycle.
zh
[AI-55] OPTIMA: Optimal One-shot Pruning for LLM s via Quadratic Programming Reconstruction
【速读】:该论文旨在解决后训练剪枝(post-training model pruning)中准确率与可扩展性之间的权衡问题:简单启发式方法虽计算高效但显著降低模型性能,而基于联合优化的精确方法虽能恢复准确率却因计算复杂度高难以在现代大模型规模下应用。其解决方案的关键在于提出OPTIMA,一种实用的一次性(one-shot)后训练剪枝方法,通过将层内权重重构建模为共享层海森矩阵(Hessian)的独立行级二次规划(Quadratic Program, QP)问题,实现每行的全局最优更新;该结构支持在加速器上高效批处理多个小规模QP,从而在单个加速器上无需微调即可实现大规模模型的快速、高精度剪枝,显著提升了剪枝后的零样本性能(最高提升3.97%绝对准确率)。
链接: https://arxiv.org/abs/2512.13886
作者: Mohammad Mozaffari,Samuel Kushnir,Maryam Mehri Dehnavi,Amir Yazdanbakhsh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:Post-training model pruning is a promising solution, yet it faces a trade-off: simple heuristics that zero weights are fast but degrade accuracy, while principled joint optimization methods recover accuracy but are computationally infeasible at modern scale. One-shot methods such as SparseGPT offer a practical trade-off in optimality by applying efficient, approximate heuristic weight updates. To close this gap, we introduce OPTIMA, a practical one-shot post-training pruning method that balances accuracy and scalability. OPTIMA casts layer-wise weight reconstruction after mask selection as independent, row-wise Quadratic Programs (QPs) that share a common layer Hessian. Solving these QPs yields the per-row globally optimal update with respect to the reconstruction objective given the estimated Hessian. The shared-Hessian structure makes the problem highly amenable to batching on accelerators. We implement an accelerator-friendly QP solver that accumulates one Hessian per layer and solves many small QPs in parallel, enabling one-shot post-training pruning at scale on a single accelerator without fine-tuning. OPTIMA integrates with existing mask selectors and consistently improves zero-shot performance across multiple LLM families and sparsity regimes, yielding up to 3.97% absolute accuracy improvement. On an NVIDIA H100, OPTIMA prunes an 8B-parameter transformer end-to-end in 40 hours with 60GB peak memory. Together, these results set a new state-of-the-art accuracy-efficiency trade-off for one-shot post-training pruning.
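行级 QP 的闭式解可以直接写出:在支撑集 S 上最小化 (w−w0)ᵀH(w−w0),等价于解线性方程 H_SS·w_S = (H·w0)_S。下面是单层的逐行示意(论文在加速器上批量求解,这里为便于阅读写成循环版;阻尼项 `damp` 为笔者假设的数值稳定手段):

```python
import torch

def optima_layer_update(H, W, mask, damp=1e-4):
    """H = X^T X 为该层所有行共享的 Hessian;mask[i] 给出第 i 行保留的权重位置。
    对每行在其支撑集 S 上解 H_SS w_S = (H w0)_S,得到重构意义下最优的稀疏行。"""
    W_new = torch.zeros_like(W)
    for i in range(W.shape[0]):
        S = mask[i].nonzero(as_tuple=True)[0]
        if len(S) == 0:
            continue
        H_SS = H[S][:, S] + damp * torch.eye(len(S), device=H.device)
        rhs = (H @ W[i])[S]                    # 右端项 (H w0)_S
        W_new[i, S] = torch.linalg.solve(H_SS, rhs)
    return W_new

X = torch.randn(256, 64)
H = X.T @ X                                    # 层 Hessian,所有行共享
W = torch.randn(8, 64)                         # 8 个输出行的稠密权重
mask = torch.rand(8, 64) > 0.5                 # 假设掩码已由任一选择器给出
W_sparse = optima_layer_update(H, W, mask)
```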
zh
[AI-56] Privacy-Enhancing Infant Cry Classification with Federated Transformers and Denoising Regularization
【速读】:该论文旨在解决婴儿啼哭分类系统在实际部署中面临的三大挑战:音频数据隐私问题、背景噪声敏感性以及跨录制环境的领域偏移(domain shift)。其解决方案的关键在于构建一个端到端的婴儿啼哭分析流水线,集成去噪自编码器(denoising autoencoder, DAE)、卷积分词器(convolutional tokenizer)和基于通信高效的联邦学习(federated learning, FL)训练的Transformer编码器。该架构支持本地去噪、自适应分割、事后校准及基于能量的分布外(out-of-distribution, OOD)拒绝决策,同时通过8-bit适配器增量更新与安全聚合机制将每轮客户端上传数据量从约36–42 MB降低至3.3 MB,显著提升了通信效率,并实现在NVIDIA Jetson Nano设备上的实时边缘推理(96 ms/秒频谱帧),从而实现了隐私保护、噪声鲁棒性和通信高效性的统一。
链接: https://arxiv.org/abs/2512.13880
作者: Geofrey Owino,Bernard Shibwabo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: This paper was accepted for presentation and presented at the 2025 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM 2025)
Abstract:Infant cry classification can aid early assessment of infant needs. However, deployment of such solutions is limited by privacy concerns around audio data, sensitivity to background noise, and domain shift across recording environments. We present an end-to-end infant cry analysis pipeline that integrates a denoising autoencoder (DAE), a convolutional tokenizer, and a Transformer encoder trained using communication-efficient federated learning (FL). The system performs on-device denoising, adaptive segmentation, post hoc calibration, and energy-based out-of-distribution (OOD) abstention. Federated training employs a regularized control variate update with 8-bit adapter deltas under secure aggregation. Using the Baby Chillanto and Donate-a-Cry datasets with ESC-50 noise overlays, the model achieves a macro F1 score of 0.938, an AUC of 0.962, and an Expected Calibration Error (ECE) of 0.032, while reducing per-round client upload from approximately 36 to 42 MB to 3.3 MB. Real-time edge inference on an NVIDIA Jetson Nano (4 GB, TensorRT FP16) achieves 96 ms per one-second spectrogram frame. These results demonstrate a practical path toward privacy-preserving, noise-robust, and communication-efficient infant cry classification suitable for federated deployment.
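摘要中"基于能量的 OOD 拒识"是一类常见做法(Liu et al., 2020 风格):能量 E(x) = −T·logsumexp(logits/T),能量偏高即判为分布外并拒识。以下为极简示意,阈值为假设值,实际需在验证集上按拒识率标定:

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """能量分数:E(x) = -T * logsumexp(logits / T),越高越像分布外样本。"""
    return -T * torch.logsumexp(logits / T, dim=-1)

def classify_or_abstain(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """能量超过阈值则输出 -1(拒识),否则输出 argmax 类别。"""
    preds = logits.argmax(dim=-1)
    return torch.where(energy_score(logits) > threshold,
                       torch.full_like(preds, -1), preds)

logits = torch.randn(4, 5)                     # 4 段啼哭、5 个需求类别
print(classify_or_abstain(logits, threshold=-1.0))
```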
zh
[AI-57] Verification-Guided Context Optimization for Tool Calling via Hierarchical LLM s-as-Editors AAAI2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在调用外部工具时因工具文档和知识库上下文质量不足而导致的性能瓶颈问题,尤其是在工业场景中,大量功能重叠的工具带来可扩展性、变异性与歧义性挑战。解决方案的关键在于提出验证引导的上下文优化框架(Verification-Guided Context Optimization, VGCO),其核心创新包括:1)引入分层编辑机制,将编辑过程自然嵌入工具调用工作流;2)设计状态感知、动作特定且基于验证反馈的编辑策略,有效缩小搜索空间并实现精准改进;3)支持低成本子任务专业化,可通过提示工程优化大型编辑模型或通过后训练微调小型编辑模型。VGCO在单轮大规模工具调用任务上取得显著提升,在准确率、鲁棒性和泛化能力方面优于以往依赖多轮推理的方法。
链接: https://arxiv.org/abs/2512.13860
作者: Henger Li,Shuangjie You,Flavio Di Palo,Yiyue Qian,Ayush Jain
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026 Workshop on Agentic AI Benchmarks and Applications for Enterprise Tasks
Abstract:Tool calling enables large language models (LLMs) to interact with external environments through tool invocation, providing a practical way to overcome the limitations of pretraining. However, the effectiveness of tool use depends heavily on the quality of the associated documentation and knowledge base context. These materials are usually written for human users and are often misaligned with how LLMs interpret information. This problem is even more pronounced in industrial settings, where hundreds of tools with overlapping functionality create challenges in scalability, variability, and ambiguity. We propose Verification-Guided Context Optimization (VGCO), a framework that uses LLMs as editors to automatically refine tool-related documentation and knowledge base context. VGCO works in two stages. First, Evaluation collects real-world failure cases and identifies mismatches between tools and their context. Second, Optimization performs hierarchical editing through offline learning with structure-aware, in-context optimization. The novelty of our LLM editors has three main aspects. First, they use a hierarchical structure that naturally integrates into the tool-calling workflow. Second, they are state-aware, action-specific, and verification-guided, which constrains the search space and enables efficient, targeted improvements. Third, they enable cost-efficient sub-task specialization, either by prompt engineering large editor models or by post-training smaller editor models. Unlike prior work that emphasizes multi-turn reasoning, VGCO focuses on the single-turn, large-scale tool-calling problem and achieves significant improvements in accuracy, robustness, and generalization across LLMs.
zh
[AI-58] Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems
【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中幻觉(hallucination)识别的问题,即如何从嵌入空间的几何结构中捕捉到生成响应偏离事实依据的痕迹。其解决方案的关键在于提出语义锚定指数(Semantic Grounding Index, SGI),定义为响应向量在单位超球面 S^{d-1} 上相对于问题和上下文的夹角距离之比。研究表明,幻觉响应倾向于保持与问题的角距离较近,而非靠近检索到的上下文,这种现象被称为“语义惰性”(semantic laziness)。SGI通过理论推导(基于球面三角不等式)和实证验证,证明其判别能力随问题与上下文夹角增大而提升,并具备良好的校准性(ECE=0.10),可作为生产环境中需人工验证响应的概率估计指标。
链接: https://arxiv.org/abs/2512.13771
作者: Javier Marín
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:When retrieval-augmented generation (RAG) systems hallucinate, what geometric trace does this leave in embedding space? We introduce the Semantic Grounding Index (SGI), defined as the ratio of angular distances from the response to the question versus the context on the unit hypersphere S^{d-1}. Our central finding is semantic laziness: hallucinated responses remain angularly proximate to questions rather than departing toward retrieved contexts. On HaluEval (n = 5,000), we observe large effect sizes (Cohen's d ranging from 0.92 to 1.28) across five embedding models with mean cross-model correlation r = 0.85. Crucially, we derive from the spherical triangle inequality that SGI's discriminative power should increase with question-context angular separation θ(q,c), a theoretical prediction confirmed empirically: effect size rises monotonically from d = 0.61 (low θ(q,c)) to d = 1.27 (high θ(q,c)), with AUC improving from 0.72 to 0.83. Subgroup analysis reveals that SGI excels on long responses (d = 2.05) and short questions (d = 1.22), while remaining robust across context lengths. Calibration analysis yields ECE = 0.10, indicating SGI scores can serve as probability estimates, not merely rankings. A critical negative result on TruthfulQA (AUC = 0.478) establishes that angular geometry measures topical engagement rather than factual accuracy. SGI provides computationally efficient, theoretically grounded infrastructure for identifying responses that warrant verification in production RAG deployments.
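SGI 按定义只需几行代码即可计算。以下为笔者的示意实现(嵌入模型任选,`eps` 为假设的数值稳定项):

```python
import numpy as np

def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    """单位超球面上的夹角距离:arccos(余弦相似度)。"""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def sgi(resp, question, context, eps=1e-8) -> float:
    """SGI = d(response, question) / d(response, context);
    取值偏小意味着回复仍贴着问题而非上下文,即"语义惰性"的信号。"""
    return angular_distance(resp, question) / (angular_distance(resp, context) + eps)

rng = np.random.default_rng(0)
q, c, r = rng.normal(size=(3, 384))   # 占位:任一句向量模型的输出均可代入
print(sgi(r, q, c))
```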
zh
[AI-59] Beyond Procedural Compliance: Human Oversight as a Dimension of Well-being Efficacy in AI Governance
【速读】:该论文旨在解决当前人工智能(AI)伦理指南和法规(如欧盟《人工智能法案》)中对“人类监督”(human oversight)缺乏明确定义与可操作性路径的问题。其核心挑战在于如何将抽象的监管要求转化为具有实际效能的人类能力,以确保AI系统的安全与伦理使用。解决方案的关键在于将人类监督重新定义为一种“福祉有效性能力”(well-being efficacy capacity),该能力由AI素养、伦理判断力及对人类需求的认知构成,尤其强调识别并约束因人类投射欲望、恐惧或利益而产生的不当AI需求。作者进一步提出,这种能力的可持续发展依赖于将其系统性嵌入从职业培训到终身学习的各级教育体系中,从而实现从宏观政策目标向个体能力建设的转化,为未来AI治理提供理论基础与实践路径。
链接: https://arxiv.org/abs/2512.13768
作者: Yao Xie,Walter Cullen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Major AI ethics guidelines and laws, including the EU AI Act, call for effective human oversight, but do not define it as a distinct and developable capacity. This paper introduces human oversight as a well-being capacity, situated within the emerging Well-being Efficacy framework. The concept integrates AI literacy, ethical discernment, and awareness of human needs, acknowledging that some needs may be conflicting or harmful. Because people inevitably project desires, fears, and interests into AI systems, oversight requires the competence to examine and, when necessary, restrain problematic demands. The authors argue that the sustainable and cost-effective development of this capacity depends on its integration into education at every level, from professional training to lifelong learning. The frame of human oversight as a well-being capacity provides a practical path from high-level regulatory goals to the continuous cultivation of human agency and responsibility essential for safe and ethical AI. The paper establishes a theoretical foundation for future research on the pedagogical implementation and empirical validation of well-being effectiveness in multiple contexts.
zh
[AI-60] Mathematics and Coding are Universal AI Benchmarks
【速读】:该论文旨在解决如何在心理测量电池(psychometric batteries)的模空间(moduli space)中,通过数学与编码任务构建可信赖且具有自提升能力的AI代理评估框架问题。其核心挑战在于识别哪些任务类型能够在不依赖特定领域知识的前提下,提供对AI代理能力的通用、稳定且可扩展的度量方式。解决方案的关键在于引入“数学纤维”(Mathematics Fiber)概念,并证明:在统一紧性条件下,若代理输出满足Lipschitz连续的AAI泛函约束,则由数学定理证明与编码任务生成的子空间在评估度量下稠密;进一步地,结合形式化证明核(如Lean、Coq)时,GVU动力学在此纤维上表现出谱稳定的自我改进行为,这源于类似预言机的验证机制。该结果表明,数学与编码共同构成了AI能力评估的“通用坐标系”,其中形式数学是递归自提升的自然触发域,而非单纯表达能力的来源。
链接: https://arxiv.org/abs/2512.13764
作者: Przemyslaw Chojecki
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We study the special role of mathematics and coding inside the moduli space of psychometric batteries for AI agents. Building on the AAI framework and GVU dynamics from previous works, we define the Mathematics Fiber and show that, when paired with formal proof kernels (e.g. Lean, Coq), GVU flows on this fiber admit spectrally stable self-improvement regimes due to oracle-like verification. Our main technical result is a density theorem: under uniform tightness of agent outputs and a Lipschitz AAI functional, the subspace of batteries generated by mathematical theorem-proving and coding tasks is dense in the moduli space of batteries with respect to the evaluation metric. Coding alone is universal in this sense, while pure mathematics is not; its privilege is spectral rather than expressive. We interpret this as evidence that mathematics and coding provide "universal coordinates" for evaluation, and that formal mathematics is a natural ignition domain for recursive self-improvement in advanced AI agents.
zh
[AI-61] State-Dependent Refusal and Learned Incapacity in RLHF-Aligned Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长期交互中表现出的政策敏感性行为选择性问题,即模型在非敏感领域维持正常功能表现(Normal Performance, NP),而在涉及提供者或政策敏感领域的对话中频繁出现功能性拒绝(Functional Refusal, FR),导致行为不对称。解决方案的关键在于提出一种基于可观测行为的交互级审计框架,通过识别三种响应模式(NP、FR 和元叙事,Meta-Narrative, MN)来量化这种选择性抑制,并引入“习得无能”(Learned Incapacity, LI)作为描述此类行为的术语,用以刻画模型在特定情境下系统性地回避响应但不涉及意图或内部机制的特征,从而为理解潜在对齐副作用提供新的分析视角。
链接: https://arxiv.org/abs/2512.13762
作者: TK Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 23 pages, 6 figures. Qualitative interaction-level analysis of response patterns in a large language model. Code and processed interaction data are available at this https URL
Abstract:Large language models (LLMs) are widely deployed as general-purpose tools, yet extended interaction can reveal behavioral patterns not captured by standard quantitative benchmarks. We present a qualitative case-study methodology for auditing policy-linked behavioral selectivity in long-horizon interaction. In a single 86-turn dialogue session, the same model shows Normal Performance (NP) in broad, non-sensitive domains while repeatedly producing Functional Refusal (FR) in provider- or policy-sensitive domains, yielding a consistent asymmetry between NP and FR across domains. Drawing on learned helplessness as an analogy, we introduce learned incapacity (LI) as a behavioral descriptor for this selective withholding without implying intentionality or internal mechanisms. We operationalize three response regimes (NP, FR, Meta-Narrative; MN) and show that MN role-framing narratives tend to co-occur with refusals in the same sensitive contexts. Overall, the study proposes an interaction-level auditing framework based on observable behavior and motivates LI as a lens for examining potential alignment side effects, warranting further investigation across users and models.
zh
[AI-62] Network-Wide Traffic Volume Estimation from Speed Profiles using a Spatio-Temporal Graph Neural Network with Directed Spatial Attention
【速读】:该论文旨在解决城市交通网络中全路网交通流量估计问题,即如何在缺乏传感器覆盖的区域实现高精度的交通体积预测。传统方法要么仅能预测已部署传感器的道路流量(forecasting),要么依赖于邻近传感器的体积数据进行空间插值(spatial imputation),后者在传感器稀疏的城市中受限明显。本文提出一种基于混合定向注意力机制的时空图神经网络(Hybrid Directed-Attention Spatio-Temporal Graph Neural Network, HDA-STGNN),其关键在于利用更易获取的探针车辆速度(probe vehicle speeds)、静态道路属性(static road attributes)以及路网拓扑结构(road network topology),在无需推理时使用历史流量数据的前提下,实现对所有路段日周期交通体积分布的精准预测。实验表明,该方法能有效捕捉复杂的时空依赖关系,并验证了拓扑信息对于无流量数据条件下准确估计全网交通量的重要性。
链接: https://arxiv.org/abs/2512.13758
作者: Léo Hein (IFPEN),Giovanni de Nunzio (IFPEN),Giovanni Chierchia (LIGM),Aurélie Pirayre (IFPEN),Laurent Najman (LIGM)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing traffic volume estimation methods typically address either forecasting traffic on sensor-equipped roads or spatially imputing missing volumes using nearby sensors. While forecasting models generally disregard unmonitored roads by design, spatial imputation methods explicitly address network-wide estimation; yet this approach relies on volume data at inference time, limiting its applicability in sensor-scarce cities. Unlike traffic volume data, probe vehicle speeds and static road attributes are more broadly accessible and support full coverage of road segments in most urban networks. In this work, we present the Hybrid Directed-Attention Spatio-Temporal Graph Neural Network (HDA-STGNN), an inductive deep learning framework designed to tackle the network-wide volume estimation problem. Our approach leverages speed profiles, static road attributes, and road network topology to predict daily traffic volume profiles across all road segments in the network. To evaluate the effectiveness of our approach, we perform extensive ablation studies that demonstrate the model’s capacity to capture complex spatio-temporal dependencies and highlight the value of topological information for accurate network-wide traffic volume estimation without relying on volume data at inference time.
zh
[AI-63] MIDUS: Memory-Infused Depth Up-Scaling
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在深度扩展(Depth Up-Scaling, DUS)过程中因重复全连接前馈网络(Feed-Forward Networks, FFNs)导致的效率低下和性能提升受限的问题。其核心解决方案是提出一种记忆注入型深度扩展方法(Memory-Infused Depth Up-Scaling, MIDUS),关键在于用头级记忆层(Head-wise Memory Layer, HML)替代DUS中重复的FFN模块,通过为每个注意力头分配独立的记忆库实现头级信息检索与注入,同时保留注意力头的功能结构;该设计结合稀疏记忆访问机制与每头值因子分解模块,在不显著增加参数量的前提下实现了更高效的计算与更强的性能表现。
链接: https://arxiv.org/abs/2512.13751
作者: Taero Kim,Hoyoon Byun,Youngjun Choi,Sungrae Park,Kyungwoo Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Scaling large language models (LLMs) demands approaches that increase capacity without incurring excessive parameter growth or inference cost. Depth Up-Scaling (DUS) has emerged as a promising strategy by duplicating layers and applying Continual Pre-training (CPT), but its reliance on feed-forward networks (FFNs) limits efficiency and attainable gains. We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces FFNs in duplicated blocks with a head-wise memory (HML) layer. Motivated by observations that attention heads have distinct roles both across and within layers, MIDUS assigns an independent memory bank to each head, enabling head-wise retrieval and injecting information into subsequent layers while preserving head-wise functional structure. This design combines sparse memory access with head-wise representations and incorporates an efficient per-head value factorization module, thereby relaxing the usual efficiency-performance trade-off. Across our CPT experiments, MIDUS exhibits robust performance improvements over strong DUS baselines while maintaining a highly efficient parameter footprint. Our findings establish MIDUS as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling by leveraging its head-wise memory design.
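头级记忆层(HML)的核心是"每头一库、稀疏检索"。下面是笔者按摘要描述写的 PyTorch 示意(维度、槽位数与 top-k 均为假设,并非论文配置):

```python
import torch
import torch.nn as nn

class HeadwiseMemoryLayer(nn.Module):
    """每个注意力头配一个独立记忆库:头级查询在各自的 key 上做 top-k 稀疏检索,
    再对检索到的 value 做 softmax 加权求和,用于替代复制块中的 FFN。"""
    def __init__(self, n_heads=8, head_dim=64, mem_slots=1024, topk=8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_heads, mem_slots, head_dim) * 0.02)
        self.vals = nn.Parameter(torch.randn(n_heads, mem_slots, head_dim) * 0.02)
        self.topk = topk

    def forward(self, q):                          # q: [batch, n_heads, head_dim]
        scores = torch.einsum("bhd,hmd->bhm", q, self.keys)
        top, idx = scores.topk(self.topk, dim=-1)  # 每头只访问 top-k 槽位(稀疏)
        w = top.softmax(dim=-1)                    # [b, h, k]
        vals = self.vals.expand(q.size(0), -1, -1, -1)           # [b, h, m, d]
        gathered = torch.gather(
            vals, 2, idx.unsqueeze(-1).expand(-1, -1, -1, vals.size(-1)))
        return (w.unsqueeze(-1) * gathered).sum(dim=2)           # [b, h, d]

layer = HeadwiseMemoryLayer()
out = layer(torch.randn(2, 8, 64))                 # -> [2, 8, 64]
```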
zh
[AI-64] he algorithmic muse and the public domain: Why copyrights legal philosophy precludes protection for generative AI outputs
【速读】:该论文试图解决生成式 AI (Generative AI) 输出是否应受版权保护的问题。其核心论点是:由于生成式 AI 从根本上切断了人类创作者与表达形式之间的直接联系,传统版权理论(如功利主义激励、劳动应得和人格权)无法提供合理的保护依据;因此,作者主张将原始的 AI 输出置于公共领域,仅对人类在 AI 生成作品中的创造性贡献给予版权保护。解决方案的关键在于明确区分“人类创造性贡献”与“算法生成的原始输出”,从而避免对数字公地的不当私有化,维护创新生态系统的健康运行。
链接: https://arxiv.org/abs/2512.13750
作者: Ezieddin Elmahjub
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 9 pages, two figures
Abstract:Generative AI (GenAI) outputs are not copyrightable. This article argues why. We bypass conventional doctrinal analysis that focuses on black letter law notions of originality and authorship to re-evaluate copyright’s foundational philosophy. GenAI fundamentally severs the direct human creative link to expressive form. Traditional theories utilitarian incentive, labor desert and personality fail to provide coherent justification for protection. The public domain constitutes the default baseline for intellectual creations. Those seeking copyright coverage for GenAI outputs bear the burden of proof. Granting copyright to raw GenAI outputs would not only be philosophically unsound but would also trigger an unprecedented enclosure of the digital commons, creating a legal quagmire and stifling future innovation. The paper advocates for a clear distinction: human creative contributions to AI-generated works may warrant protection, but the raw algorithmic output should remain in the public domain.
zh
[AI-65] Comparative Evaluation of Embedding Representations for Financial News Sentiment Analysis
【速读】:该论文旨在解决在资源受限环境下(如小样本数据集)进行金融新闻情感分类时,传统自然语言处理方法表现不佳的问题。其核心挑战在于,尽管预训练嵌入(如Word2Vec、GloVe和sentence transformers)在标准任务中效果良好,但在小数据场景下仍难以提升模型性能,甚至出现验证集与测试集表现严重不一致的现象。研究的关键发现是:预训练嵌入的效果随数据量下降而显著减弱,在低于某一数据充足阈值时边际收益递减;同时,小规模验证集易导致模型选择过程中的过拟合。因此,解决方案的关键不在于单纯提升嵌入质量,而应转向少样本学习(few-shot learning)、数据增强或基于词典的混合方法等替代策略,以应对标注样本稀缺的根本性限制。
链接: https://arxiv.org/abs/2512.13749
作者: Joyjit Roy,Samaresh Kumar Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures. Submitted to IEEE IATMSI-2026 (Track: AI, IoT and Computer Vision Enabled Technologies)
Abstract:Financial sentiment analysis enhances market understanding; however, standard natural language processing approaches encounter significant challenges when applied to small datasets. This study provides a comparative evaluation of embedding-based methods for financial news sentiment classification in resource-constrained environments. Word2Vec, GloVe, and sentence transformer representations are evaluated in combination with gradient boosting on manually labeled headlines. Experimental results identify a substantial gap between validation and test performance, with models performing worse than trivial baselines despite strong validation metrics. The analysis demonstrates that pretrained embeddings yield diminishing returns below a critical data sufficiency threshold, and that small validation sets contribute to overfitting during model selection. Practical application is illustrated through weekly sentiment aggregation and narrative summarization for market monitoring workflows. The findings offer empirical evidence that embedding quality alone cannot address fundamental data scarcity in sentiment classification. For practitioners operating with limited resources, the results indicate the need to consider alternative approaches such as few-shot learning, data augmentation, or lexicon-enhanced hybrid methods when labeled samples are scarce.
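文中的基线流水线大致是"句向量 + 梯度提升"。以下给出最小化示意(模型名与玩具数据均为笔者假设,仅说明流程,不复现论文设定或结论):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sentence_transformers import SentenceTransformer

headlines = ["Shares surge after earnings beat expectations",
             "Regulator fines bank over compliance failures",
             "Markets end the week flat"]
labels = [2, 0, 1]                      # 0 负面 / 1 中性 / 2 正面

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(headlines)           # 句向量特征
clf = GradientBoostingClassifier(random_state=0).fit(X, labels)
print(clf.predict(encoder.encode(["Bank posts record quarterly loss"])))
```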
zh
[AI-66] Toward Noise-Aware Audio Deepfake Detection: Survey, SNR-Benchmarks and Practical Recipes
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 语音伪造检测模型在真实场景下鲁棒性不足的问题,尤其是面对背景噪声(如家庭、办公室及交通环境)、混响和消费级音频通道等复杂条件时性能显著下降的现象。其解决方案的关键在于构建一个可复现的评估框架,通过将 MS-SNSD 噪声与 ASVspoof 2021 DF 语句混合,在受控信噪比(SNR)条件下系统性地量化模型的性能退化趋势(从近清洁状态 35 dB 到高噪声 -5 dB)。实验表明,针对预训练编码器(WavLM、Wav2Vec2、MMS)进行微调可在不同 SNR 下显著降低等错误率(EER),提升检测鲁棒性。
链接: https://arxiv.org/abs/2512.13744
作者: Udayon Sen,Alka Luqman,Anupam Chattopadhyay
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 6 pages
Abstract:Deepfake audio detection has progressed rapidly with strong pre-trained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance in realistic capture conditions - background noise (domestic/office/transport), room reverberation, and consumer channels - often lags clean-lab results. We survey and evaluate robustness for state-of-the-art audio deepfake detection models and present a reproducible framework that mixes MS-SNSD noises with ASVspoof 2021 DF utterances to evaluate under controlled signal-to-noise ratios (SNRs). SNR is a measured proxy for noise severity used widely in speech; it lets us sweep from near-clean (35 dB) to very noisy (-5 dB) to quantify graceful degradation. We study multi-condition training and fixed-SNR testing for pretrained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC-AUC, and EER on binary and four-class (authenticity x corruption) tasks. In our experiments, finetuning reduces EER by 10-15 percentage points at 10-0 dB SNR across backbones.
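受控 SNR 混噪的核心只有一个缩放公式:scale = sqrt(P_speech / (P_noise · 10^(SNR/10)))。以下为笔者的示意实现(占位数据,实际应替换为 ASVspoof 语音与 MS-SNSD 噪声):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """把噪声能量缩放到指定 SNR 后与语音叠加。"""
    noise = np.resize(noise, speech.shape)          # 循环/截断以对齐长度
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech, noise = rng.normal(size=16000), rng.normal(size=16000)   # 1 秒 @ 16 kHz
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (35, 20, 10, 0, -5)}
```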
zh
[AI-67] he Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对对抗性“越狱”攻击时的安全防护难题。现有防御方法通常依赖计算成本高昂的外部分类器或脆弱的词汇过滤机制,未能深入挖掘模型内部推理过程的动力学特性。其解决方案的关键在于提出“层流假设”(Laminar Flow Hypothesis),即良性输入会在高维潜在空间中引发平滑、渐进的轨迹变化,而恶意提示则诱发混沌且高方差的路径——称为语义湍流(Semantic Turbulence),这是由安全对齐与指令遵循目标之间的内部冲突所导致。通过引入一种零样本指标——逐层余弦速度方差(variance of layer-wise cosine velocity),研究者实现了对模型内部状态动态的实时监测,不仅可作为轻量级越狱检测工具,还可用于无侵入式诊断黑盒模型的安全架构类型。
链接: https://arxiv.org/abs/2512.13741
作者: Md. Hasib Ur Rahman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial "jailbreaking" attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical filters, overlooking the intrinsic dynamics of the model's reasoning process. In this work, the Laminar Flow Hypothesis is introduced, which posits that benign inputs induce smooth, gradual transitions in an LLM's high-dimensional latent space, whereas adversarial prompts trigger chaotic, high-variance trajectories, termed Semantic Turbulence, resulting from the internal conflict between safety alignment and instruction-following objectives. This phenomenon is formalized through a novel, zero-shot metric: the variance of layer-wise cosine velocity. Experimental evaluation across diverse small language models reveals a striking diagnostic capability. The RLHF-aligned Qwen2-1.5B exhibits a statistically significant 75.4% increase in turbulence under attack (p < 0.001), validating the hypothesis of internal conflict. Conversely, Gemma-2B displays a 22.0% decrease in turbulence, characterizing a distinct, low-entropy "reflex-based" refusal mechanism. These findings demonstrate that Semantic Turbulence serves not only as a lightweight, real-time jailbreak detector but also as a non-invasive diagnostic tool for categorizing the underlying safety architecture of black-box models.
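"逐层余弦速度方差"这一零样本指标实现起来非常轻量。以下为笔者按定义写的示意(隐藏状态可由 HuggingFace 模型设 output_hidden_states=True 获得,这里用随机张量占位):

```python
import torch
import torch.nn.functional as F

def semantic_turbulence(hidden_states) -> float:
    """湍流度 = 相邻层隐藏状态余弦相似度(逐层"速度")的方差。"""
    vel = [F.cosine_similarity(h0, h1, dim=-1)
           for h0, h1 in zip(hidden_states[:-1], hidden_states[1:])]
    return torch.stack(vel).var().item()

states = [torch.randn(4096) for _ in range(28)]    # 占位:28 层、4096 维轨迹
print(semantic_turbulence(states))
```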
zh
[AI-68] Instilling Organisational Values in Firefighters through Simulation-Based Training
【速读】:该论文旨在解决传统消防训练方法在应对紧急情境中伦理困境和价值冲突时准备不足的问题,从而影响事故处置效果与消防员安全。其解决方案的关键在于构建一个概念框架,通过系统性地将部门价值观融入基于仿真的训练中,以促进消防员对价值观的深层内化,并提升其在高压环境下基于价值观的决策能力;同时,该框架所依赖的工具还可用于评估和优化部门操作规程,使其更契合组织期望的价值导向。
链接: https://arxiv.org/abs/2512.13737
作者: Nardine Osman,Manel Rodriguez-Soto,Jordi Sabater-Mir
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:In firefighting and other emergency operations, decisions made under pressure carry profound ethical weight and can significantly impact incident outcomes and firefighter safety. Traditional training methods, while foundational, often fall short in adequately preparing firefighters for the complex ethical dilemmas and value conflicts inherent in chaotic emergency environments. This paper proposes a conceptual framework for enhancing firefighter training by systematically integrating departmental values into simulation-based training. This approach fosters deeper value internalisation and improves value-driven decision-making under pressure. Furthermore, the underlying tools can also be leveraged to evaluate and refine departmental operational protocols for better alignment with preferred values.
zh
[AI-69] TF-MCL: Time-frequency Fusion and Multi-domain Cross-Loss for Self-supervised Depression Detection
【速读】:该论文旨在解决当前基于脑电图(EEG)信号的重度抑郁症(MDD)检测中,监督学习方法对标签依赖性强、标注成本高,以及现有对比学习方法未能有效建模EEG信号时频分布特征、难以提取低语义层次表示的问题。解决方案的关键在于提出一种时频融合与多域交叉损失(TF-MCL)模型:首先通过融合映射头(FMH)生成时频混合表征,高效地将时频域信息映射到融合域以增强模型对时频信息的整合能力;其次,通过优化多域交叉损失函数重构时频域与融合域表示的分布,从而提升模型获取高质量融合表征的能力。实验表明,该方法在MODMA和PRED+CT数据集上分别较现有最先进方法提升了5.87%和9.96%的准确率。
链接: https://arxiv.org/abs/2512.13736
作者: Li-Xuan Zhao,Chen-Yang Xu,Wen-Qiang Li,Bo Wang,Rong-Xing Wei,Qing-Hao Menga
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, there has been a notable increase in the use of supervised detection methods of major depressive disorder (MDD) based on electroencephalogram (EEG) signals. However, the process of labeling MDD remains challenging. As a self-supervised learning method, contrastive learning could address the shortcomings of supervised learning methods, which are unduly reliant on labels in the context of MDD detection. However, existing contrastive learning methods are not specifically designed to characterize the time-frequency distribution of EEG signals, and their capacity to acquire low-semantic data representations is still inadequate for MDD detection tasks. To address the problem of contrastive learning method, we propose a time-frequency fusion and multi-domain cross-loss (TF-MCL) model for MDD detection. TF-MCL generates time-frequency hybrid representations through the use of a fusion mapping head (FMH), which efficiently remaps time-frequency domain information to the fusion domain, and thus can effectively enhance the model’s capacity to synthesize time-frequency information. Moreover, by optimizing the multi-domain cross-loss function, the distribution of the representations in the time-frequency domain and the fusion domain is reconstructed, thereby improving the model’s capacity to acquire fusion representations. We evaluated the performance of our model on the publicly available datasets MODMA and PRED+CT and show a significant improvement in accuracy, outperforming the existing state-of-the-art (SOTA) method by 5.87% and 9.96%, respectively.
zh
[AI-70] DARTs: A Dual-Path Robust Framework for Anomaly Detection in High-Dimensional Multivariate Time Series
【速读】:该论文旨在解决多变量时间序列异常检测(Multivariate Time Series Anomaly Detection, MTSAD)在高维噪声时间序列中难以有效捕捉长程时空依赖关系的问题。现有方法在低维场景下表现良好,但在高维复杂工业控制系统中往往无法稳健地建模长期动态特征。其解决方案的关键在于提出一种鲁棒的长短时双路径框架 DARTs,包含三个互补模块:短时路径通过多视角稀疏图学习器与扩散多关系图单元协同捕获高噪声环境下的层次化短时时空模式;长时路径设计了多尺度时空图构造器以建模高维表示空间中的显著长期动态;最终引入窗口感知的时空软融合机制,在滤除残余噪声的同时实现异常模式的无缝集成,从而提升检测的准确性与鲁棒性。
链接: https://arxiv.org/abs/2512.13735
作者: Xuechun Liu,Heli Sun,Xuecheng Wu,Ruichen Cao,Yunyun Shi,Dingkang Yang,Haoran Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Multivariate time series anomaly detection (MTSAD) aims to accurately identify and localize complex abnormal patterns in the large-scale industrial control systems. While existing approaches excel in recognizing the distinct patterns under the low-dimensional scenarios, they often fail to robustly capture long-range spatiotemporal dependencies when learning representations from the high-dimensional noisy time series. To address these limitations, we propose DARTs, a robust long short-term dual-path framework with window-aware spatiotemporal soft fusion mechanism, which can be primarily decomposed into three complementary components. Specifically, in the short-term path, we introduce a Multi-View Sparse Graph Learner and a Diffusion Multi-Relation Graph Unit that collaborate to adaptively capture hierarchical discriminative short-term spatiotemporal patterns in the high-noise time series. While in the long-term path, we design a Multi-Scale Spatiotemporal Graph Constructor to model salient long-term dynamics within the high-dimensional representation space. Finally, a window-aware spatiotemporal soft-fusion mechanism is introduced to filter the residual noise while seamlessly integrating anomalous patterns. Extensive qualitative and quantitative experimental results across mainstream datasets demonstrate the superiority and robustness of our proposed DARTs. A series of ablation studies are also conducted to explore the crucial design factors of our proposed components. Our code and model will be made publicly open soon.
zh
[AI-71] Plug-and-Play Parameter-Efficient Tuning of Embeddings for Federated Recommendation AAAI2026
【速读】:该论文旨在解决联邦推荐(Federated Recommendation, FR)中因海量物品嵌入(item embeddings)导致的通信效率低下问题,即在分布式训练过程中,嵌入参数传输带来的高通信开销严重制约了FR的实际部署效果。其解决方案的关键在于引入基于参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的嵌入优化策略,通过减少需传输的嵌入参数量来降低通信负担,同时保持甚至提升推荐精度。具体而言,作者提出了一种轻量级、插件式框架,集成多种PEFT技术(如LoRA和哈希编码),并创新性地探索了残差量化变分自编码器(Residual Quantized Variational Autoencoders, RQ-VAE)作为新型嵌入压缩方法,在不改变原有FR模型结构的前提下显著降低了通信开销并提升了性能。
链接: https://arxiv.org/abs/2512.13734
作者: Haochen Yuan,Yang Zhang,Xiang He,Quan Z. Sheng,Zhongjie Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for publication at AAAI 2026
Abstract:With the rise of cloud-edge collaboration, recommendation services are increasingly trained in distributed environments. Federated Recommendation (FR) enables such multi-end collaborative training while preserving privacy by sharing model parameters instead of raw data. However, the large number of parameters, primarily due to the massive item embeddings, significantly hampers communication efficiency. While existing studies mainly focus on improving the efficiency of FR models, they largely overlook the issue of embedding parameter overhead. To address this gap, we propose a FR training framework with Parameter-Efficient Fine-Tuning (PEFT) based embedding designed to reduce the volume of embedding parameters that need to be transmitted. Our approach offers a lightweight, plugin-style solution that can be seamlessly integrated into existing FR methods. In addition to incorporating common PEFT techniques such as LoRA and Hash-based encoding, we explore the use of Residual Quantized Variational Autoencoders (RQ-VAE) as a novel PEFT strategy within our framework. Extensive experiments across various FR model backbones and datasets demonstrate that our framework significantly reduces communication overhead while improving accuracy. The source code is available at this https URL.
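下面用一个最小 PyTorch 示意说明“插件式 PEFT 嵌入”中 LoRA 一类做法的思路(仅为概念演示,非该论文官方实现;物品数、维度、秩均为假设值):冻结基座物品嵌入,只训练并传输低秩增量。

```python
import torch
import torch.nn as nn

class LoRAEmbedding(nn.Module):
    """冻结基座物品嵌入,仅训练低秩增量 A @ B:联邦回合中只需传输 A、B。"""
    def __init__(self, num_items: int, dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Embedding(num_items, dim)
        self.base.weight.requires_grad_(False)                    # 基座不更新、不通信
        self.lora_A = nn.Parameter(torch.zeros(num_items, rank))  # 零初始化:初始输出与基座一致
        self.lora_B = nn.Parameter(0.01 * torch.randn(rank, dim))

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        return self.base(item_ids) + self.lora_A[item_ids] @ self.lora_B

emb = LoRAEmbedding(num_items=10_000, dim=64, rank=8)
vec = emb(torch.tensor([1, 42, 999]))                     # (3, 64)
# 每回合通信量由 10000*64 个参数降为 10000*8 + 8*64 个(约 8 倍压缩)
```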
zh
[AI-72] Low-Rank Compression of Language Models via Differentiable Rank Selection
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)低秩压缩过程中如何为每一层选择最优秩以协同优化压缩率与下游任务准确性的难题。现有方法或依赖启发式策略导致搜索空间受限而效果不佳,或采用梯度优化但缺乏微调时性能不如启发式方法。解决方案的关键在于提出一种无需微调的梯度学习框架——Learning to Low-Rank Compress (LLRC),其通过在校准数据集上训练可学习的掩码权重(mask weights),动态选择每层奇异值的数量,从而最小化中间激活分布与原始模型之间的差异。该方法在多个压缩率下均显著优于同类无微调基线方法,并在常见推理和开放域问答任务中表现优异。
链接: https://arxiv.org/abs/2512.13733
作者: Sidhant Sundrani,Francesco Tudisco,Pasquale Minervini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Approaches for compressing large-language models using low-rank decomposition have made strides, particularly with the introduction of activation and loss-aware SVD, which improves the trade-off between decomposition rank and downstream task performance. Despite these advancements, a persistent challenge remains–selecting the optimal ranks for each layer to jointly optimise compression rate and downstream task accuracy. Current methods either rely on heuristics that can yield sub-optimal results due to their limited discrete search space or are gradient-based but are not as performant as heuristic approaches without post-compression fine-tuning. To address these issues, we propose Learning to Low-Rank Compress (LLRC), a gradient-based approach which directly learns the weights of masks that select singular values in a fine-tuning-free setting. Using a calibration dataset, we train only the mask weights to select fewer and fewer singular values while minimising the divergence of intermediate activations from the original model. Our approach outperforms competing ranking selection methods that similarly require no post-compression fine-tuning across various compression rates on common-sense reasoning and open-domain question-answering tasks. For instance, with a compression rate of 20% on Llama-2-13B, LLRC outperforms the competitive Sensitivity-based Truncation Rank Searching (STRS) on MMLU, BoolQ, and OpenbookQA by 12%, 3.5%, and 4.4%, respectively. Compared to other compression techniques, our approach consistently outperforms fine-tuning-free variants of SVD-LLM and LLM-Pruner across datasets and compression rates. Our fine-tuning-free approach also performs competitively with the fine-tuning variant of LLM-Pruner.
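下面给出“在校准数据上只训练奇异值掩码权重”这一思想的最小 PyTorch 示意(非 LLRC 官方实现;用 sigmoid 门控做软掩码、以掩码和作稀疏正则,均为示意性假设):

```python
import torch
import torch.nn as nn

class MaskedSVDLinear(nn.Module):
    """对已训练权重做 SVD,只训练挑选奇异值的可微门控(掩码)权重。"""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.U = nn.Parameter(U, requires_grad=False)
        self.S = nn.Parameter(S, requires_grad=False)
        self.Vh = nn.Parameter(Vh, requires_grad=False)
        self.mask_logits = nn.Parameter(torch.full_like(S, 3.0))  # sigmoid(3)≈0.95,初始近似全保留

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.S * torch.sigmoid(self.mask_logits)      # 软掩码后的奇异值
        return x @ (self.Vh.T * s) @ self.U.T             # 等价于 x @ W_masked^T

    def sparsity_loss(self) -> torch.Tensor:
        return torch.sigmoid(self.mask_logits).sum()      # 鼓励关闭更多奇异值

layer = MaskedSVDLinear(torch.randn(256, 512))            # 权重形状 (out_dim, in_dim)
y = layer(torch.randn(4, 512))                            # 输出 (4, 256)
```

训练时可最小化“中间激活与原模型的差异 + λ·sparsity_loss”,收敛后把门控阈值化即可得到每层的截断秩。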
zh
[AI-73] PIS: A Generalized Physical Inversion Solver for Arbitrary Sparse Observations via Set-Conditioned Diffusion
【速读】:该论文旨在解决偏微分方程(PDE)约束下物理参数反演问题在观测数据稀疏、不规则且受实际传感器部署限制时的病态性挑战,此类问题广泛存在于流体力学、地震反演和结构健康监测等领域。现有深度学习与算子学习模型在此类条件下失效,表现为固定网格假设崩溃、重建质量急剧下降以及缺乏不确定性量化(UQ)。其解决方案的关键在于提出物理反演求解器(Physical Inversion Solver, PIS),该方法采用基于Set Transformer的编码器处理任意数量和几何分布的观测集,并引入余弦退火稀疏课程训练策略以增强鲁棒性;同时通过信息论分析揭示极端稀疏条件下反演的极限,使PIS能够在仅0.29%观测率的极端场景下仍保持稳定性和准确性,显著降低反演误差(12.28%–88.73%),并生成校准后的后验样本,准确反映数据稀缺性和物理内在模糊性。
链接: https://arxiv.org/abs/2512.13732
作者: Weijie Yang,Xun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Estimation of PDE-constrained physical parameters from limited indirect measurements is inherently ill-posed, particularly when observations are sparse, irregular, and constrained by real-world sensor placement. This challenge is ubiquitous in fields such as fluid mechanics, seismic inversion, and structural health monitoring. Existing deep and operator-learning models collapse under these conditions: fixed-grid assumptions fail, reconstruction deteriorates sharply, and inversion becomes unreliable with limited robustness and no uncertainty quantification (UQ). We propose the Physical Inversion Solver (PIS), a set-conditioned diffusion framework enabling inversion from truly arbitrary observation sets. PIS employs a Set Transformer-based encoder to handle measurements of any number or geometry, and a cosine-annealed sparsity curriculum for exceptional robustness. An accompanying information-theoretic analysis provides insight into the limits of inversion under extreme sparsity by revealing how observation entropy varies across physical systems. PIS is evaluated on three challenging PDE inverse problems: Darcy flow, wavefield inversion (Helmholtz), and structural health monitoring (Hooke’s Law). Across all tasks and sparsity regimes – including extreme cases with an observation rate of only 0.29% – existing operator-learning baselines fail to reconstruct meaningful fields, often diverging or collapsing. In stark contrast, PIS remains stable and accurate, reducing inversion error by 12.28%–88.73% and reliably producing calibrated posterior samples. These samples accurately reflect both data scarcity and intrinsic physical ambiguity. These results position PIS as a powerful, general-purpose, and uniquely sparsity-resilient solution for physical inversion under arbitrary and severely undersampled observations.
zh
[AI-74] Exploring the Modular Integration of “AI Architecture” Pedagogy in Undergraduate Design Education: A Case Study of Architectural Design III/IV Courses at Zhejiang University
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在建筑设计教育中系统性融入的难题,尤其是如何在不破坏原有课程结构的前提下,有效提升学生的数字技能与伦理认知。其解决方案的关键在于采用双模块教学框架:一是20小时的AI技术培训,涵盖深度学习模型、大语言模型(Large Language Models, LLMs)、生成式AI(AIGC)、LoRA微调及ComfyUI工具;二是将伦理讨论嵌入课程全程,辅以专职技术导师支持,形成分阶段引导、技术与伦理并重的教学策略。该方法不仅提升了学生的技术应用能力与战略思维,还构建了可复制的AI赋能设计教育模式。
链接: https://arxiv.org/abs/2512.13730
作者: Wang Jiaqi,Lan Yi,Chen Xiang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: in Chinese language, AI for architectural design education
Abstract:This study investigates AI integration in architectural education through a teaching experiment in Zhejiang University’s 2024-25 grade three undergraduate design studio. Adopting a dual-module framework (20-hour AI training + embedded ethics discussions), the course introduced deep learning models, LLMs, AIGC, LoRA, and ComfyUI while maintaining the original curriculum structure, supported by dedicated technical instructors. Findings demonstrate the effectiveness of phased guidance, balanced technical-ethical approaches, and institutional support. The model improved students’ digital skills and strategic cognition while addressing AI ethics, providing a replicable approach combining technical and critical learning in design education.
zh
[AI-75] CurvaDion: Curvature-Adaptive Distributed Orthonormalization
【速读】:该论文旨在解决大规模语言模型分布式训练中梯度同步通信开销过大的问题(即通信瓶颈),尤其是在高带宽、低延迟网络环境下,频繁的梯度同步在平坦区域造成冗余,在高曲率区域又可能引发模型发散。其解决方案的关键在于提出CurvaDion方法,通过引入相对最大动量变化(Relative Maximum Momentum Change, RMMC)来动态识别需要同步的高曲率区域:RMMC利用优化过程中已有的动量信息作为方向曲率的计算可行代理,仅需每层O(d)的额外计算开销,即可实现通信频率的智能调控,在保证收敛性的同时达到99%的通信减少。
链接: https://arxiv.org/abs/2512.13728
作者: Bhavesh Kumar,Roger Jin,Jeffrey Quesnelle
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Nous Research
Abstract:As language models scale to trillions of parameters, distributed training across many GPUs becomes essential, yet gradient synchronization over high-bandwidth, low-latency networks remains a critical bottleneck. While recent methods like Dion reduce per-step communication through low-rank updates, they synchronize at every step regardless of the optimization landscape. We observe that synchronization requirements vary dramatically throughout training: workers naturally compute similar gradients in flat regions, making frequent synchronization redundant, while high-curvature regions require coordination to prevent divergence. We introduce CurvaDion, which uses Relative Maximum Momentum Change (RMMC) to detect high-curvature regions requiring synchronization. RMMC leverages momentum dynamics which are already computed during optimization as a computationally tractable proxy for directional curvature, adding only O(d) operations per layer. We establish theoretical connections between RMMC and loss curvature and demonstrate that CurvaDion achieves 99% communication reduction while matching baseline convergence across models from 160M to 1.3B parameters.
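论文未给出 RMMC 的公开实现,下面按其字面含义(动量变化的相对最大幅度)给出一个每层 O(d) 的 PyTorch 示意;阈值与具体归一化方式均为假设,非官方实现:

```python
import torch

def rmmc(m_prev: torch.Tensor, m_curr: torch.Tensor, eps: float = 1e-12) -> float:
    """相对最大动量变化:以动量的变化幅度作为方向曲率的 O(d) 代理。"""
    delta = (m_curr - m_prev).abs().max()
    return float(delta / (m_prev.abs().max() + eps))

def should_sync(momenta_prev, momenta_curr, threshold: float = 0.1) -> bool:
    """逐层计算 RMMC,任一层超过阈值即视为进入高曲率区域,需要同步。"""
    return any(rmmc(p, c) > threshold for p, c in zip(momenta_prev, momenta_curr))

m_prev = [torch.randn(100)]
m_flat = [m_prev[0] + 1e-4 * torch.randn(100)]   # 平坦区域:动量几乎不变
m_curv = [m_prev[0] + 0.5 * torch.randn(100)]    # 高曲率区域:动量剧烈变化
print(should_sync(m_prev, m_flat), should_sync(m_prev, m_curv))  # False True
```

动量本就由优化器维护,因此该判据几乎不引入额外计算;只在判定为高曲率的步做梯度同步,即可大幅削减通信频率。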
zh
[AI-76] Time-Constrained Recommendations: Reinforcement Learning Strategies for E-Commerce
【速读】:该论文旨在解决电商场景下用户有限时间预算对推荐系统带来的资源约束问题,即如何在保证推荐项相关性的同时降低用户的评估成本(evaluation cost),以提升用户参与度。其核心挑战在于传统推荐方法忽视了用户决策过程中的时间开销,导致高相关性但高评估成本的项目可能超出用户的时间预算,从而影响交互效果。解决方案的关键在于将时间约束建模为马尔可夫决策过程(Markov Decision Process, MDP)中的预算感知效用(budget-aware utilities),并利用强化学习算法同时学习用户偏好与时间预算模式,实现对推荐列表(slate)的优化。实验表明,基于策略的(on-policy)和非策略的(off-policy)强化学习控制方法在严苛时间预算下优于传统的上下文bandit方法,验证了该框架的有效性。
链接: https://arxiv.org/abs/2512.13726
作者: Sayak Chakrabarty,Souradip Pal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures
Abstract:Unlike traditional recommendation tasks, finite user time budgets introduce a critical resource constraint, requiring the recommender system to balance item relevance and evaluation cost. For example, in a mobile shopping interface, users interact with recommendations by scrolling, where each scroll triggers a list of items called slate. Users incur an evaluation cost - time spent assessing item features before deciding to click. Highly relevant items having higher evaluation costs may not fit within the user’s time budget, affecting engagement. In this position paper, our objective is to evaluate reinforcement learning algorithms that learn patterns in user preferences and time budgets simultaneously, crafting recommendations with higher engagement potential under resource constraints. Our experiments explore the use of reinforcement learning to recommend items for users using Alibaba’s Personalized Re-ranking dataset supporting slate optimization in e-commerce contexts. Our contributions include (i) a unified formulation of time-constrained slate recommendation modeled as Markov Decision Processes (MDPs) with budget-aware utilities; (ii) a simulation framework to study policy behavior on re-ranking data; and (iii) empirical evidence that on-policy and off-policy control can improve performance under tight time budgets than traditional contextual bandit-based methods.
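下面用一个最小 Python 示意说明“预算感知效用”的含义(非论文官方形式化;此处假设用户按序评估物品、时间预算耗尽即停止浏览):

```python
from dataclasses import dataclass

@dataclass
class Item:
    relevance: float   # 物品相关性(点击/转化潜力)
    eval_cost: float   # 用户评估该物品所需时间

def slate_utility(slate: list, time_budget: float) -> float:
    """预算感知效用:按序累计评估成本,超出时间预算后不再产生效用。"""
    utility, spent = 0.0, 0.0
    for item in slate:
        spent += item.eval_cost
        if spent > time_budget:
            break                      # 预算耗尽,用户停止浏览
        utility += item.relevance
    return utility

slate = [Item(0.9, 3.0), Item(0.8, 5.0), Item(0.5, 1.0)]
print(slate_utility(slate, time_budget=6.0))  # 0.9:第二项已超出预算
```

在 MDP 表述中,这一效用即可作为回合奖励,由策略学习在相关性与评估成本之间做权衡。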
zh
[AI-77] Compressed Causal Reasoning : Quantization and GraphRAG Effects on Interventional and Counterfactual Accuracy
【速读】:该论文旨在解决量化压缩(如INT8和NF4)对大型语言模型中因果推理能力的影响问题,特别是针对Pearl因果阶梯(Causal Ladder)的三个层级——关联(association)、干预(intervention)与反事实(counterfactual)推理的稳定性。其关键解决方案在于系统性地评估不同精度下模型在3000样本分层构建的CLadder基准上的表现,并引入基于真实因果图的图检索增强生成(Graph Retrieval Augmented Generation, Graph-RAG)方法以提升干预推理的鲁棒性。结果表明,尽管四比特量化(NF4)整体仅导致不到1%的准确率下降,干预推理最为敏感,而反事实推理虽相对稳定但存在类型异质性弱点;同时,现有反事实数据集(如CRASS)未能揭示量化引起的推理漂移,说明当前基准尚不足以捕捉深层因果脆弱性。该研究为高效且结构化支持的因果AI部署提供了实证基础与实践指导。
链接: https://arxiv.org/abs/2512.13725
作者: Steve Nwaiwu,Nipat Jongsawat,Anucha Tungkasthan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Causal reasoning in Large Language Models spanning association, intervention, and counterfactual inference is essential for reliable decision making in high stakes settings. As deployment shifts toward edge and resource constrained environments, quantized models such as INT8 and NF4 are becoming standard. Yet the impact of precision reduction on formal causal reasoning is poorly understood. To our knowledge, this is the first study to systematically evaluate quantization effects across all three levels of Pearl’s Causal Ladder. Using a 3000 sample stratified CLadder benchmark, we find that rung level accuracy in Llama 3 8B remains broadly stable under quantization, with NF4 showing less than one percent overall degradation. Interventional queries at rung 2 are the most sensitive to precision loss, whereas counterfactual reasoning at rung 3 is comparatively stable but exhibits heterogeneous weaknesses across query types such as collider bias and backdoor adjustment. Experiments on the CRASS benchmark show near identical performance across precisions, indicating that existing commonsense counterfactual datasets lack the structural sensitivity needed to reveal quantization induced reasoning drift. We further evaluate Graph Retrieval Augmented Generation using ground truth causal graphs and observe a consistent improvement in NF4 interventional accuracy of plus 1.7 percent, partially offsetting compression related degradation. These results suggest that causal reasoning is unexpectedly robust to four bit quantization, graph structured augmentation can selectively reinforce interventional reasoning, and current counterfactual benchmarks fail to capture deeper causal brittleness. This work provides an initial empirical map of compressed causal reasoning and practical guidance for deploying efficient and structurally supported causal AI systems.
zh
[AI-78] Made-in China Thinking in America: U.S. Values Persist in Chinese LLMs
【速读】:该论文试图解决的问题是:当前主流大语言模型(Large Language Models, LLMs)是否在价值观上偏向西方国家,特别是美国,并且这种倾向是否在由中国开发的模型中依然存在。研究聚焦于中美两国开发的大模型在道德和价值判断上的差异,以及它们是否更贴近本国人群的价值观。解决方案的关键在于通过大规模实证比较——使用道德基础问卷(Moral Foundations Questionnaire 2.0)和世界价值观调查(World Values Survey)作为测量工具,收集来自10个中国模型和10个美国模型的响应,并与数千名中国和美国受试者的回答进行对比分析。结果表明,无论模型来源国或提示语(prompt)语言如何,所有模型均更接近美国人的价值观,说明当前大语言模型在全球范围内存在显著的“美国价值偏倚”,这一发现对生成式AI在地缘政治软实力竞争中的角色具有深远影响。
链接: https://arxiv.org/abs/2512.13723
作者: David Haslett,Linus Ta-Lun Huang,Leila Khalatbari,Janet Hui-wen Hsiao,Antoni B. Chan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models increasingly mediate access to information and facilitate decision-making, they are becoming instruments in soft power competitions between global actors such as the United States and China. So far, language models seem to be aligned with the values of Western countries, but evidence for this ethical bias comes mostly from models made by American companies. The current crop of state-of-the-art models includes several made in China, so we conducted the first large-scale investigation of how models made in China and the USA align with people from China and the USA. We elicited responses to the Moral Foundations Questionnaire 2.0 and the World Values Survey from ten Chinese models and ten American models, and we compared their responses to responses from thousands of Chinese and American people. We found that all models respond to both surveys more like American people than like Chinese people. This skew toward American values is only slightly mitigated when prompting the models in Chinese or imposing a Chinese persona on the models. These findings have important implications for a near future in which large language models generate much of the content people consume and shape normative influence in geopolitics.
zh
[AI-79] Federated Few-Shot Learning for Epileptic Seizure Detection Under Privacy Constraints
【速读】:该论文旨在解决癫痫发作检测中因数据稀缺、分布分散及隐私保护法规限制而导致的AI模型训练难题,尤其是在真实医疗环境中难以获取大规模集中标注的脑电图(EEG)数据的问题。其解决方案的关键在于提出了一种两阶段联邦少样本学习(federated few-shot learning, FFSL)框架:第一阶段通过联邦学习在非独立同分布(non-IID)模拟医院站点间微调预训练生物信号变换器(BIOT),实现无需集中化EEG数据的共享表征学习;第二阶段采用联邦少样本个性化机制,仅用每名患者5个标注EEG片段即可适配专属分类器,在保留癫痫特异性信息的同时利用跨机构知识提升性能。该方法在TUH Event Corpus上验证了有效性,显著提升了在数据受限和隐私合规条件下的个体化癫痫检测准确率。
链接: https://arxiv.org/abs/2512.13717
作者: Ekaterina Sysoykova,Bernhard Anzengruber-Tanase,Michael Haslgrubler,Philipp Seidl,Alois Ferscha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Many deep learning approaches have been developed for EEG-based seizure detection; however, most rely on access to large centralized annotated datasets. In clinical practice, EEG data are often scarce, patient-specific distributed across institutions, and governed by strict privacy regulations that prohibit data pooling. As a result, creating usable AI-based seizure detection models remains challenging in real-world medical settings. To address these constraints, we propose a two-stage federated few-shot learning (FFSL) framework for personalized EEG-based seizure detection. The method is trained and evaluated on the TUH Event Corpus, which includes six EEG event classes. In Stage 1, a pretrained biosignal transformer (BIOT) is fine-tuned across non-IID simulated hospital sites using federated learning, enabling shared representation learning without centralizing EEG recordings. In Stage 2, federated few-shot personalization adapts the classifier to each patient using only five labeled EEG segments, retaining seizure-specific information while still benefiting from cross-site knowledge. Federated fine-tuning achieved a balanced accuracy of 0.43 (centralized: 0.52), Cohen’s kappa of 0.42 (0.49), and weighted F1 of 0.69 (0.74). In the FFSL stage, client-specific models reached an average balanced accuracy of 0.77, Cohen’s kappa of 0.62, and weighted F1 of 0.73 across four sites with heterogeneous event distributions. These results suggest that FFSL can support effective patient-adaptive seizure detection under realistic data-availability and privacy constraints.
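下面给出该两阶段流程的最小 PyTorch 示意(非论文官方实现;用线性层代替 BIOT 主干,站点数与样本量均为假设值):阶段一按样本量加权做 FedAvg,阶段二仅用每名患者 5 个标注片段做个性化微调。

```python
import copy
import torch

def fedavg(global_model, client_models, client_sizes):
    """阶段一:按样本量加权平均各站点参数(FedAvg),原始 EEG 不出本地。"""
    total = sum(client_sizes)
    avg_state = copy.deepcopy(global_model.state_dict())
    for key in avg_state:
        avg_state[key] = sum(
            m.state_dict()[key] * (n / total)
            for m, n in zip(client_models, client_sizes)
        )
    global_model.load_state_dict(avg_state)
    return global_model

def personalize(model, support_x, support_y, steps: int = 20, lr: float = 1e-3):
    """阶段二:仅用每名患者少量(如 5 个)标注片段做个性化微调。"""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(support_x), support_y).backward()
        opt.step()
    return model

clients = [torch.nn.Linear(8, 2) for _ in range(3)]        # 各“医院站点”的本地模型
global_model = fedavg(torch.nn.Linear(8, 2), clients, [120, 80, 50])
x5, y5 = torch.randn(5, 8), torch.randint(0, 2, (5,))      # 每名患者 5 个标注片段
patient_model = personalize(copy.deepcopy(global_model), x5, y5)
```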
zh
[AI-80] ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making NEURIPS2025
【速读】:该论文旨在解决当前AI系统在真实场景中缺乏个性化决策能力的问题,即如何使AI代理不仅完成特定任务或与群体目标对齐,还能根据个体用户的价值偏好做出一致且可解释的决策。其核心挑战在于传统基于外部奖励的任务导向范式难以支持跨情境的个性化行为泛化。解决方案的关键在于提出一种价值驱动(value-driven)的个性化决策框架ValuePilot,其由两阶段组成:首先通过人类-大语言模型(LLM)协作管道构建多样化的、带有价值标注的情境数据集(DGT),进而训练一个决策模块(DMM)以学习基于个人价值偏好的行动评估机制,从而实现情境敏感的个体化决策。实验表明,该方法在未见过的情境中显著优于多个主流大模型基线,在对齐人类行为选择方面展现出更强的适应性和可解释性。
链接: https://arxiv.org/abs/2512.13716
作者: Yitong Luo,Ziang Chen,Hou Hei Lam,Jiayu Zhan,Junqi Wang,Zhenliang Zhang,Xue Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at LAW Workshop, NeurIPS 2025
Abstract:Personalized decision-making is essential for human-AI interaction, enabling AI agents to act in alignment with individual users’ value preferences. As AI systems expand into real-world applications, adapting to personalized values beyond task completion or collective alignment has become a critical challenge. We address this by proposing a value-driven approach to personalized decision-making. Human values serve as stable, transferable signals that support consistent and generalizable behavior across contexts. Compared to task-oriented paradigms driven by external rewards and incentives, value-driven decision-making enhances interpretability and enables agents to act appropriately even in novel scenarios. We introduce ValuePilot, a two-phase framework consisting of a dataset generation toolkit (DGT) and a decision-making module (DMM). DGT constructs diverse, value-annotated scenarios from a human-LLM collaborative pipeline. DMM learns to evaluate actions based on personal value preferences, enabling context-sensitive, individualized decisions. When evaluated on previously unseen scenarios, DMM outperforms strong LLM baselines, including GPT-5, Claude-Sonnet-4, Gemini-2-flash, and Llama-3.1-70b, in aligning with human action choices. Our results demonstrate that value-driven decision-making is an effective and extensible engineering pathway toward building interpretable, personalized AI agents.
zh
[AI-81] Meta Hierarchical Reinforcement Learning for Scalable Resource Management in O-RAN
【速读】:该论文旨在解决现代无线网络中因应用复杂性增加而带来的实时自适应与资源管理效率低下的问题,特别是在开放无线接入网(O-RAN)架构下,如何实现动态资源分配与网络切片(network slicing)的联合优化。其解决方案的关键在于提出了一种受模型无关元学习(Model Agnostic Meta Learning, MAML)启发的自适应元分层强化学习(Meta Hierarchical Reinforcement Learning, Meta-HRL)框架:该框架通过分层控制结构实现全局与局部协同优化——高层控制器负责跨切片资源分配,低层智能体执行切片内调度;同时引入基于时差误差方差加权的任务自适应元更新机制,提升算法在复杂、动态场景下的稳定性与适应速度,理论分析证明其具有次线性收敛性和遗憾边界,仿真验证其相较基线强化学习和元强化学习方法在资源利用效率上提升19.8%,且在eMBB、URLLC和mMTC切片中均实现更快适应与更高服务质量(QoS)满意度。
链接: https://arxiv.org/abs/2512.13715
作者: Fatemeh Lotfi,Fatemeh Afghah
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: This paper is submitted to IEEE Open Journal of the Communications Society
Abstract:The increasing complexity of modern applications demands wireless networks capable of real time adaptability and efficient resource management. The Open Radio Access Network (O-RAN) architecture, with its RAN Intelligent Controller (RIC) modules, has emerged as a pivotal solution for dynamic resource management and network slicing. While artificial intelligence (AI) driven methods have shown promise, most approaches struggle to maintain performance under unpredictable and highly dynamic conditions. This paper proposes an adaptive Meta Hierarchical Reinforcement Learning (Meta-HRL) framework, inspired by Model Agnostic Meta Learning (MAML), to jointly optimize resource allocation and network slicing in O-RAN. The framework integrates hierarchical control with meta learning to enable both global and local adaptation: the high-level controller allocates resources across slices, while low level agents perform intra slice scheduling. The adaptive meta-update mechanism weights tasks by temporal difference error variance, improving stability and prioritizing complex network scenarios. Theoretical analysis establishes sublinear convergence and regret guarantees for the two-level learning process. Simulation results demonstrate a 19.8% improvement in network management efficiency compared with baseline RL and meta-RL approaches, along with faster adaptation and higher QoS satisfaction across eMBB, URLLC, and mMTC slices. Additional ablation and scalability studies confirm the method’s robustness, achieving up to 40% faster adaptation and consistent fairness, latency, and throughput performance as network scale increases.
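论文的核心机制之一是“按各任务 TD 误差方差加权的元更新”,下面给出一个与 MAML 外层更新同形的 NumPy 示意(权重归一化方式为假设,非官方实现):

```python
import numpy as np

def meta_update(theta: np.ndarray, task_grads: list, td_errors: list,
                beta: float = 0.1) -> np.ndarray:
    """按各任务 TD 误差方差加权的元更新:方差大(更复杂/不稳定)的场景权重更高。"""
    variances = np.array([np.var(td) for td in td_errors])
    weights = variances / (variances.sum() + 1e-12)
    meta_grad = sum(w * g for w, g in zip(weights, task_grads))
    return theta - beta * meta_grad

theta = np.zeros(4)
task_grads = [np.ones(4), 2 * np.ones(4)]
td_errors = [np.array([0.10, 0.11, 0.09]),        # 任务 1:TD 误差稳定
             np.array([1.0, -0.8, 0.5])]          # 任务 2:TD 误差波动大 => 权重更高
theta = meta_update(theta, task_grads, td_errors)
print(theta)
```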
zh
[AI-82] AI-Powered Annotation Pipelines for Stabilizing Large Language Models : A Human-AI Synergy Approach
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高度监管行业中因不稳定性、推理不一致、幻觉(hallucinations)及性能波动等问题导致的可靠性不足,从而限制其在需要事实精确性和行为一致性场景中的安全应用。解决方案的关键在于提出一种基于人工智能的标注流水线,通过人机协同机制实现对LLM输出中不稳定模式的系统性识别、标记与修复;该方法融合自动化弱监督与置信度驱动的标注策略,并引入目标人工验证以确保反馈信息的可靠性和伦理合规性,同时将语义一致性、事实正确性和逻辑连贯性三类稳定性特定标注纳入框架,支持模型持续校准与鲁棒性增强。
链接: https://arxiv.org/abs/2512.13714
作者: Gangesh Pathak,Prasanna Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 Pages
Abstract:LLM implementations are failing in highly regulated industries owing to instability issues, inconsistent reasoning, hallucinations, and performance variability, especially in workflows. These reliability issues restrict safe use of LLMs in areas that need factual precision and consistent behavior (Aiyappa et al., 2023). Current stabilization methods, such as reinforcement learning with human feedback (RLHF) and supervised fine-tuning, offer quantifiable improvements but are expensive and rely on intensive human annotation, and thus are not easily scaled in a sustainable way (Dong et al., 2023; Retzlaff et al., 2024). This paper presents an AI-based annotation pipeline that systematically identifies, labels, and fixes instability patterns in LLM output. Our human-AI synergy method combines automated weak supervision and confidence-based annotation with targeted human validation to guarantee the reliability and ethical soundness of feedback data (Cabitza et al., 2023; Jiang et al., 2023). Stability-specific annotation categories for semantic consistency, factual correctness, and logical coherence are introduced into our framework, allowing continuous model calibration and robustness improvement through feedback loops (Honovich et al., 2021; Nan et al., 2021).
zh
[AI-83] LoopBench: Discovering Emergent Symmetry Breaking Strategies with LLM Swarms
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在分布式系统中协调能力不足的问题,特别是其在对称性破缺(symmetry breaking)和元认知思维(meta-cognitive thinking)方面的推理局限。研究提出了一种名为LoopBench的基准测试框架,聚焦于使用有限颜色对奇数环图(odd cycle graphs, 如C₃、C₅、C₁₁)进行着色任务,此类问题中无通信的确定性代理会陷入无限循环。解决方案的关键在于引入一种策略传递机制(strategy passing mechanism),作为一致记忆(consistent memory)的形式,使LLM能够识别并跳出死锁状态;实验表明,标准LLM和传统启发式方法难以应对,而具备高级推理能力的模型(如O3)则能自主设计有效策略以实现分布式协同,从而揭示基于语言推理的涌现式分布式算法潜力。
链接: https://arxiv.org/abs/2512.13713
作者: Ali Parsaee,Yashar Talebirad,Csongor Szepesvári,Vishwajeet Ohal,Eden Redman
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 11 pages, 3 figures, submitted to ANTS 2026
Abstract:Large Language Models (LLMs) are increasingly being utilized as autonomous agents, yet their ability to coordinate in distributed systems remains poorly understood. We introduce LoopBench, a benchmark to evaluate LLM reasoning in distributed symmetry breaking and meta-cognitive thinking. The benchmark focuses on coloring odd cycle graphs (C_3, C_5, C_{11}) with limited colors, where deterministic, non-communicating agents fail in infinite loops. A strategy passing mechanism is implemented as a form of consistent memory. We show that while standard LLMs and classical heuristics struggle, advanced reasoning models (e.g., O3) devise strategies to escape deadlocks. LoopBench allows the study of emergent distributed algorithms based on language-based reasoning, offering a testbed for collective intelligence.
zh
[AI-84] Scaling and Transferability of Annealing Strategies in Large Language Model Training AAAI2026
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练中学习率调度策略的优化问题,特别是如何在不同模型配置下找到最优的退火(annealing)动态。其核心挑战在于缺乏对退火策略跨模型配置可迁移性的理解,导致需要耗费大量计算资源进行超参数搜索。解决方案的关键在于提出并改进了一个广义预测框架,该框架基于Warmup-Steady-Decay(WSD)调度器,引入训练步数、最大学习率和退火行为等关键因素,从而实现对学习率调度的高效优化。研究进一步发现,较小模型可作为可靠代理,用于指导更大模型的最优退火策略选择,且最优退火比例在不同训练配置中呈现一致模式,具备良好的可迁移性。
链接: https://arxiv.org/abs/2512.13705
作者: Siqi Wang,Zhengyu Chen,Teng Xiao,Zheqi Lv,Jinluan Yang,Xunliang Cai,Jingang Wang,Xiaomeng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2026 (camera-ready version)
Abstract:Learning rate scheduling is crucial for training large language models, yet understanding the optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments using both Dense and Mixture-of-Experts (MoE) models, demonstrating that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.
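下面给出 WSD(Warmup-Steady-Decay)调度器的一个最小示意实现(退火段此处用线性衰减,实际也常用余弦等形式;warmup 与退火比例均为示例值):

```python
def wsd_lr(step: int, total_steps: int, max_lr: float,
           warmup_frac: float = 0.01, anneal_frac: float = 0.2) -> float:
    """Warmup-Steady-Decay:线性预热 -> 恒定 -> 末段退火(此处用线性衰减示意)。"""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    anneal_steps = max(1, int(total_steps * anneal_frac))   # anneal_frac 即“退火比例”
    steady_end = total_steps - anneal_steps
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < steady_end:
        return max_lr
    return max_lr * (total_steps - step) / anneal_steps

lrs = [wsd_lr(s, total_steps=1000, max_lr=3e-4) for s in range(1000)]
print(lrs[0], lrs[500], lrs[-1])   # 预热初期、恒定段、退火末尾
```

论文讨论的“最优退火比例”即对应此处的 anneal_frac;其结论是该比例在不同模型配置间呈一致模式,可由小模型代理搜索后迁移到大模型。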
zh
[AI-85] Adjudicator: Correcting Noisy Labels with a KG-Informed Council of LLM Agents
【速读】:该论文旨在解决工业级生产机器学习系统中因标签噪声(label noise)导致的性能下降问题,尤其是在高风险应用场景下,标签噪声会显著削弱模型性能并损害用户信任。其解决方案的关键在于提出了一种名为Adjudicator的神经符号系统,该系统首先构建一个动态知识图谱(Knowledge Graph, KG)以统一物品上下文信息,进而利用该KG驱动一个“代理委员会”(Council of Agents)——一种基于多智能体大语言模型(Multi-agent Large Language Model)的架构,其中专业化代理通过辩论和投票机制判定标签有效性。实验表明,该KG引导的模型在AlleNoise基准数据集上达到0.99的F1分数,显著优于单一LLM基线(0.48)和无KG的代理委员会(0.59),其优势源于一种新颖的覆盖逻辑(override logic),能够精准识别复杂结构型错误(实现完整召回率),而这是传统方法无法捕捉的。
链接: https://arxiv.org/abs/2512.13704
作者: Doohee You,Sundeep Paul
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures
Abstract:The performance of production machine learning systems is fundamentally limited by the quality of their training data. In high-stakes industrial applications, noisy labels can degrade performance and erode user trust. This paper presents Adjudicator, a system that addresses the critical data mining challenge of automatically identifying and correcting label noise and has been validated for production deployment. Adjudicator models this as a neuro-symbolic task, first constructing a dynamic Knowledge Graph (KG) to unify item context. This KG then informs a “Council of Agents,” a novel multi-agent Large Language Model architecture where specialized agents debate and vote on a label’s validity. We validate our system on a 1,000-item balanced subset of the AlleNoise benchmark. Our KG-informed model achieves a 0.99 F1-score, significantly outperforming a single-LLM baseline (0.48 F1) and a non-KG council (0.59 F1). Our analysis reveals this is due to high Precision, achieved by a novel override logic that uses the KG to perfectly identify complex, structural errors (complete Recall) – a class of errors that baselines fail to find. This result demonstrates a robust and explainable system for automated, high-precision data verification, serving as a vital proof-of-concept for generating golden datasets in strictly governed industrial environments.
zh
[AI-86] Safe2Harm: Semantic Isomorphism Attacks for Jailbreaking Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在面对恶意攻击时的安全漏洞问题,即攻击者可通过特定技巧诱导模型生成有害内容,从而对社会各领域造成负面影响。现有方法主要依赖提示工程(Prompt Engineering)或对抗优化,但本文发现许多有害场景与合法场景在底层原理上具有高度一致性,这一现象此前未被充分重视。解决方案的关键在于提出Safe2Harm语义同构攻击方法,其核心机制是通过四阶段流程实现:首先将有害问题重构为语义安全但原理一致的问题;其次提取二者之间的主题映射关系;接着让模型对安全问题生成详细响应;最后基于映射关系将安全响应逆向转换为有害输出。该方法实现了高效且隐蔽的越狱攻击,在7种主流LLM和三类基准数据集上验证了其优越性。
链接: https://arxiv.org/abs/2512.13703
作者: Fan Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, but their security vulnerabilities can be exploited by attackers to generate harmful content, causing adverse impacts across various societal domains. Most existing jailbreak methods revolve around Prompt Engineering or adversarial optimization, yet we identify a previously overlooked phenomenon: many harmful scenarios are highly consistent with legitimate ones in terms of underlying principles. Based on this finding, this paper proposes the Safe2Harm Semantic Isomorphism Attack method, which achieves efficient jailbreaking through four stages: first, rewrite the harmful question into a semantically safe question with similar underlying principles; second, extract the thematic mapping relationship between the two; third, let the LLM generate a detailed response targeting the safe question; finally, reversely rewrite the safe response based on the thematic mapping relationship to obtain harmful output. Experiments on 7 mainstream LLMs and three types of benchmark datasets show that Safe2Harm exhibits strong jailbreaking capability, and its overall performance is superior to existing methods. Additionally, we construct a challenging harmful content evaluation dataset containing 358 samples and evaluate the effectiveness of existing harmful detection methods, which can be deployed for LLM input-output filtering to enable defense.
zh
[AI-87] Enhancing Transparency and Traceability in Healthcare AI: The AI Product Passport
【速读】:该论文旨在解决医疗领域生成式 AI (Generative AI) 在生命周期管理中透明度不足、可追溯性差及合规性难以保障的问题。其解决方案的关键在于提出并实现了一个基于标准的“AI产品护照”(AI Product Passport)框架,该框架通过结构化的生命周期数据模型(涵盖研究定义、数据准备、模型开发与评估、部署监控等阶段)和角色权限控制机制,实现对AI工具全生命周期的可审计文档记录,并结合MLOps/ModelOps实践确保操作可行性;同时,该平台支持自动生成机器与人类可读报告,并遵循FUTURE-AI原则(公平性、通用性、可追溯性、可用性、鲁棒性和可解释性),从而提升医疗AI系统的可信度与监管适配性。
链接: https://arxiv.org/abs/2512.13702
作者: A. Anil Sinaci,Senan Postaci,Dogukan Cavdaroglu,Machteld J. Boonstra,Okan Mercan,Kerem Yilmaz,Gokce B. Laleci Erturkmen,Folkert W. Asselbergs,Karim Lekadir
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: A total of 33 pages: First 16 pages for the manuscript and the remaining 17 pages for the supplementary user guide of the graphical user interface
Abstract:Objective: To develop the AI Product Passport, a standards-based framework improving transparency, traceability, and compliance in healthcare AI via lifecycle-based documentation. Materials and Methods: The AI Product Passport was developed within the AI4HF project, focusing on heart failure AI tools. We analyzed regulatory frameworks (EU AI Act, FDA guidelines) and existing standards to design a relational data model capturing metadata across AI lifecycle phases: study definition, dataset preparation, model generation/evaluation, deployment/monitoring, and passport generation. MLOps/ModelOps concepts were integrated for operational relevance. Co-creation involved feedback from AI4HF consortium and a Lisbon workshop with 21 diverse stakeholders, evaluated via Mentimeter polls. The open-source platform was implemented with Python libraries for automated provenance tracking. Results: The AI Product Passport was designed based on existing standards and methods with well-defined lifecycle management and role-based access. Its implementation is a web-based platform with a relational data model supporting auditable documentation. It generates machine- and human-readable reports, customizable for stakeholders. It aligns with FUTURE-AI principles (Fairness, Universality, Traceability, Usability, Robustness, Explainability), ensuring fairness, traceability, and usability. Exported passports detail model purpose, data provenance, performance, and deployment context. GitHub-hosted backend/frontend codebases enhance accessibility. Discussion and Conclusion: The AI Product Passport addresses transparency gaps in healthcare AI, meeting regulatory and ethical demands. Its open-source nature and alignment with standards foster trust and adaptability. Future enhancements include FAIR data principles and FHIR integration for improved interoperability, promoting responsible AI deployment.
zh
[AI-88] Blind Radio Mapping via Spatially Regularized Bayesian Trajectory Inference
【速读】:该论文旨在解决传统无线电地图(radio map)构建方法依赖大量位置标签数据的问题,这类数据在实际场景中获取成本高且不切实际。其核心解决方案是提出一种无需位置标签的盲式无线电地图构建框架,关键在于利用非视距(NLOS)环境下信道状态信息(CSI)的空间连续性特性,在准镜面(quasi-specular)环境模型下推导出与物理距离成比例的CSI-距离度量;并进一步基于泊松分布接入点(AP)部署下的直角轨迹假设,证明了定位误差的Cramér-Rao下界(CRLB)在渐近条件下趋于零,即使角度分辨率较差亦然。在此理论基础上,设计了一种空间正则化的贝叶斯推理框架,可联合估计信道特征、区分视距(LOS)/非视距(NLOS)条件并恢复用户轨迹,实验表明该方法在射线追踪数据集上实现了平均定位误差0.68 m和波束图重建误差3.3%。
链接: https://arxiv.org/abs/2512.13701
作者: Zheng Xing,Junting Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:Radio maps enable intelligent wireless applications by capturing the spatial distribution of channel characteristics. However, conventional construction methods demand extensive location-labeled data, which are costly and impractical in many real-world scenarios. This paper presents a blind radio map construction framework that infers user trajectories from indoor multiple-input multiple-output (MIMO)-Orthogonal Frequency-Division Multiplexing (OFDM) channel measurements without relying on location labels. It first proves that channel state information (CSI) under non-line-of-sight (NLOS) exhibits spatial continuity under a quasi-specular environmental model, allowing the derivation of a CSI-distance metric that is proportional to the corresponding physical distance. For rectilinear trajectories in Poisson-distributed access point (AP) deployments, it is shown that the Cramer-Rao Lower Bound (CRLB) of localization error vanishes asymptotically, even under poor angular resolution. Building on these theoretical results, a spatially regularized Bayesian inference framework is developed that jointly estimates channel features, distinguishes line-of-sight (LOS)/NLOS conditions and recovers user trajectories. Experiments on a ray-tracing dataset demonstrate an average localization error of 0.68 m and a beam map reconstruction error of 3.3%, validating the effectiveness of the proposed blind mapping method.
zh
[AI-89] EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练过程中因通信开销过大而导致的效率瓶颈问题。现有基于静态梯度压缩的方法未能考虑梯度在训练过程中动态变化的特性,导致压缩效率与模型精度之间难以平衡。其解决方案的关键在于提出一种基于熵驱动的动态梯度压缩框架(Entropy-Driven Dynamic Gradient Compression, EDGC),通过实时监测梯度熵的变化趋势来自适应调整压缩率,在保障模型性能的前提下显著降低通信延迟。EDGC的核心创新包括:1)采用下采样方法高效估算梯度熵以减少计算开销;2)建立压缩率与梯度熵之间的理论关联模型,指导更优的压缩决策;3)引入基于滑动窗口的压缩率调整机制,实现流水线各阶段的动态优化,从而在多个GPU集群上实现了最高达46.45%的通信延迟降低和16.13%的训练时间加速。
链接: https://arxiv.org/abs/2511.10333
作者: Qingao Yi,Jiaang Duan,Hanwen Hu,Qin Hua,Haiyan Zhao,Shiyou Qian,Dingyu Yang,Jian Cao,Jinghua Tang,Yinghao Yu,Chenzhi Liao,Kangjin Wang,Liping Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Although distributed training techniques help mitigate these issues, they still suffer from considerable communication overhead. Existing approaches primarily rely on static gradient compression to enhance communication efficiency; however, these methods neglect the dynamic nature of evolving gradients during training, leading to performance degradation. Accelerating LLM training via compression without sacrificing performance remains a challenge. In this paper, we propose an entropy-driven dynamic gradient compression framework called EDGC. The core concept is to adjust the compression rate during LLM training based on the evolving trends of gradient entropy, taking into account both compression efficiency and error. EDGC consists of three key components. First, it employs a down-sampling method to efficiently estimate gradient entropy, reducing computation overhead. Second, it establishes a theoretical model linking compression rate with gradient entropy, enabling more informed compression decisions. Lastly, a window-based adjustment mechanism dynamically adapts the compression rate across pipeline stages, improving communication efficiency and maintaining model performance. We implemented EDGC on a 32-NVIDIA-V100 cluster and a 64-NVIDIA-H100 cluster to train GPT2-2.5B and GPT2-12.1B, respectively. The results show that EDGC significantly reduces communication latency and training time by up to 46.45% and 16.13% while preserving LLM accuracy.
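下面按摘要描述的两个组件——下采样熵估计与基于窗口的压缩率调整——给出最小 NumPy 示意(采样比例、窗口规则与系数均为假设,非官方实现):

```python
import numpy as np

def gradient_entropy(grad: np.ndarray, sample_frac: float = 0.01, bins: int = 64) -> float:
    """对梯度下采样后用直方图估计熵,降低每步的计算开销。"""
    idx = np.random.choice(grad.size, max(1, int(grad.size * sample_frac)), replace=False)
    hist, _ = np.histogram(grad.ravel()[idx], bins=bins)
    p = hist / (hist.sum() + 1e-12)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def adjust_rate(entropy_window: list, base_rate: float = 0.99) -> float:
    """窗口机制示意:最近一次熵相对窗口均值上升(信息量变大)则降低压缩率。"""
    if len(entropy_window) < 2:
        return base_rate
    trend = entropy_window[-1] - float(np.mean(entropy_window[:-1]))
    return float(np.clip(base_rate - 0.5 * trend, 0.5, 0.999))

g_gauss = np.random.randn(500_000)            # 类高斯分布的梯度
g_heavy = np.random.laplace(size=500_000)     # 更重尾的梯度,直方图熵更低
window = [gradient_entropy(g) for g in (g_gauss, g_gauss, g_heavy)]
print(adjust_rate(window))                    # 熵下降 => 可进一步提高压缩率
```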
zh
[AI-90] Error Bound Analysis of Physics-Informed Neural Networks-Driven T2 Quantification in Cardiac Magnetic Resonance Imaging
【速读】:该论文旨在解决磁共振成像(MRI)中T2参数定量估计的精度与泛化能力问题,现有深度学习方法虽能实现高精度估计,但依赖大量标注训练数据且缺乏理论支撑。其解决方案的关键在于将MRI的基本物理模型——布洛赫方程(Bloch equation)嵌入到物理信息神经网络(PINN)的损失函数中,从而仅基于目标扫描数据即可完成T2估计,无需预先构建训练数据库;同时通过推导T2估计误差和解的泛化误差的严格上界,建立了可量化评估PINN定量精度的理论框架,即使在无真实值或金标准的情况下也能预测误差范围,显著提升了方法的可靠性与临床适用性。
链接: https://arxiv.org/abs/2512.14211
作者: Mengxue Zhang,Qingrui Cai,Yinyin Chen,Hang Jin,Jianjun Zhou,Qiu Guo,Peijun Zhao,Zhiping Mao,Xingxing Zhang,Yuyu Xia,Xianwang Jiang,Qin Xu,Chunyan Xiong,Yirong Zhou,Chengyan Wang,Xiaobo Qu
机构: 未知
类目: Biological Physics (physics.bio-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Physics-Informed Neural Networks (PINN) are emerging as a promising approach for quantitative parameter estimation of Magnetic Resonance Imaging (MRI). While existing deep learning methods can provide an accurate quantitative estimation of the T2 parameter, they still require large amounts of training data and lack theoretical support and a recognized gold standard. Thus, given the absence of PINN-based approaches for T2 estimation, we propose embedding the fundamental physics of MRI, the Bloch equation, in the loss of PINN, which is solely based on target scan data and does not require a pre-defined training database. Furthermore, by deriving rigorous upper bounds for both the T2 estimation error and the generalization error of the Bloch equation solution, we establish a theoretical foundation for evaluating the PINN’s quantitative accuracy. Even without access to the ground truth or a gold standard, this theory enables us to estimate the error with respect to the real quantitative parameter T2. The accuracy of T2 mapping and the validity of the theoretical analysis are demonstrated on a numerical cardiac model and a water phantom, where our method exhibits excellent quantitative precision in the myocardial T2 range. Clinical applicability is confirmed in 94 acute myocardial infarction (AMI) patients, achieving low-error quantitative T2 estimation under the theoretical error bound, highlighting the robustness and potential of PINN.
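下面给出“将 Bloch 方程嵌入 PINN 损失”这一思路在最简情形(仅横向弛豫 dM/dt = −M/T2,不含 T1 与射频项)下的 PyTorch 示意;网络结构与超参数均为演示用假设,非论文官方实现:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # M(t) 的网络近似解
log_T2 = nn.Parameter(torch.tensor(-2.0))                           # 可学习的 log T2,保证 T2 > 0

def pinn_loss(t_data, m_data, t_colloc):
    T2 = torch.exp(log_T2)
    data_term = ((net(t_data) - m_data) ** 2).mean()                # 数据项:拟合采样信号
    t = t_colloc.clone().requires_grad_(True)
    m = net(t)
    dm_dt = torch.autograd.grad(m.sum(), t, create_graph=True)[0]
    physics_term = ((dm_dt + m / T2) ** 2).mean()                   # 物理项:dM/dt + M/T2 = 0 的残差
    return data_term + physics_term

t_data = torch.linspace(0, 0.2, 8).unsqueeze(1)
m_data = torch.exp(-t_data / 0.05)                                  # 合成数据:真值 T2 = 50 ms
opt = torch.optim.Adam(list(net.parameters()) + [log_T2], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    pinn_loss(t_data, m_data, torch.rand(64, 1) * 0.2).backward()
    opt.step()
print(torch.exp(log_T2).item())  # 估计的 T2,应收敛到 0.05 附近
```

该示意说明 PINN 如何仅凭目标扫描数据 + 物理约束完成参数估计,而无需预先构建训练数据库;论文的误差界分析则给出此类估计在无金标准时的精度保证。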
zh
[AI-91] Towards Explainable Quantum AI: Informing the Encoder Selection of Quantum Neural Networks via Visualization
【速读】:该论文旨在解决量子神经网络(Quantum Neural Networks, QNNs)中编码器(encoder)设计缺乏系统性指导和有效评估手段的问题,具体挑战包括:难以在训练前评估编码后的量子态,以及缺乏直观方法分析编码器对数据特征的区分能力。解决方案的关键在于提出一种新型可视化工具 XQAI-Eyes,该工具能够将经典数据特征与其对应的量子态进行对比,并分析不同类别下的混合量子态分布,从而实现从经典到量子视角的跨域理解,助力开发者优化编码器设计并提升 QNN 性能。
链接: https://arxiv.org/abs/2512.14181
作者: Shaolun Ruan,Feng Liang,Rohan Ramakrishna,Chao Ren,Rudai Yan,Qiang Guan,Jiannan Li,Yong Wang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages, 6 figures, accepted by TVCG 2026, not published yet
Abstract:Quantum Neural Networks (QNNs) represent a promising fusion of quantum computing and neural network architectures, offering speed-ups and efficient processing of high-dimensional, entangled data. A crucial component of QNNs is the encoder, which maps classical input data into quantum states. However, choosing suitable encoders remains a significant challenge, largely due to the lack of systematic guidance and the trial-and-error nature of current approaches. This process is further impeded by two key challenges: (1) the difficulty in evaluating encoded quantum states prior to training, and (2) the lack of intuitive methods for analyzing an encoder’s ability to effectively distinguish data features. To address these issues, we introduce a novel visualization tool, XQAI-Eyes, which enables QNN developers to compare classical data features with their corresponding encoded quantum states and to examine the mixed quantum states across different classes. By bridging classical and quantum perspectives, XQAI-Eyes facilitates a deeper understanding of how encoders influence QNN performance. Evaluations across diverse datasets and encoder designs demonstrate XQAI-Eyes’s potential to support the exploration of the relationship between encoder design and QNN effectiveness, offering a holistic and transparent approach to optimizing quantum encoders. Moreover, domain experts used XQAI-Eyes to derive two key practices for quantum encoder selection, grounded in the principles of pattern preservation and feature mapping.
zh
[AI-92] Intelligent matter consisting of active particles
【速读】:该论文试图解决如何通过简单运动代理系统(simple motile agents)构建具有类智能行为的复杂系统,即探索“智能物质”(intelligent matter)的实现路径。其核心问题是:能否在合成物质中模拟自然界中群体智能的涌现现象,并使集体行为达到与智能系统相当的复杂度。解决方案的关键在于两种策略:一是涌现计算(emergent computing),设计特定的活性物质系统,使其通过自组织行为直接完成指定任务;二是物理储层计算(physical reservoir computing),利用活性粒子系统的动力学特性作为信息处理媒介,尤其提出了一种基于超声波或光折射驱动的新型储层计算方案,从而将活性物质的动力学转化为可编程的信息处理能力。
链接: https://arxiv.org/abs/2512.13912
作者: Julian Jeggle,Raphael Wittkowski
机构: 未知
类目: Soft Condensed Matter (cond-mat.soft); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
备注: 14 pages, 5 figures
Abstract:In this book chapter, we review how systems of simple motile agents can be used as a pathway to intelligent systems. It is a well known result from nature that large groups of entities following simple rules, such as swarms of animals, can give rise to much more complex collective behavior in a display of emergence. This begs the question whether we can emulate this behavior in synthetic matter and drive it to a point where the collective behavior reaches the complexity level of intelligent systems. Here, we will use a formalized notion of “intelligent matter” and compare it to recent results in the field of active matter. First, we will explore the approach of emergent computing in which specialized active matter systems are designed to directly solve a given task through emergent behavior. This we will then contrast with the approach of physical reservoir computing powered by the dynamics of active particle systems. In this context, we will also describe a novel reservoir computing scheme for active particles driven ultrasonically or via light refraction.
zh
[AI-93] One Permutation Is All You Need: Fast Reliable Variable Importance and Model Stress-Testing
【速读】:该论文旨在解决机器学习模型中特征重要性估计的可靠性问题,尤其是在模型为黑箱或专有情况下,确保可解释性、透明度和合规性。传统基于随机置换的方法存在计算开销大和结果不稳定的问题。解决方案的关键在于用单一确定性最优置换替代多次随机置换,从而在保留置换法核心思想的基础上实现非随机性、更高效率与更强稳定性;此外,论文进一步提出系统性变量重要性(Systemic Variable Importance),通过显式建模特征相关性来评估扰动传播路径,揭示标准方法忽略的依赖关系,尤其适用于公平性审计与模型压力测试场景。
链接: https://arxiv.org/abs/2512.13892
作者: Albert Dorador
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reliable estimation of feature contributions in machine learning models is essential for trust, transparency and regulatory compliance, especially when models are proprietary or otherwise operate as black boxes. While permutation-based methods are a standard tool for this task, classical implementations rely on repeated random permutations, introducing computational overhead and stochastic instability. In this paper, we show that by replacing multiple random permutations with a single, deterministic, and optimal permutation, we achieve a method that retains the core principles of permutation-based importance while being non-random, faster, and more stable. We validate this approach across nearly 200 scenarios, including real-world household finance and credit risk applications, demonstrating improved bias-variance tradeoffs and accuracy in challenging regimes such as small sample sizes, high dimensionality, and low signal-to-noise ratios. Finally, we introduce Systemic Variable Importance, a natural extension designed for model stress-testing that explicitly accounts for feature correlations. This framework provides a transparent way to quantify how shocks or perturbations propagate through correlated inputs, revealing dependencies that standard variable importance measures miss. Two real-world case studies demonstrate how this metric can be used to audit models for hidden reliance on protected attributes (e.g., gender or race), enabling regulators and practitioners to assess fairness and systemic risk in a principled and computationally efficient manner.
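下面给出“以单一确定性置换替代多次随机置换”的最小示意(这里用固定步长的循环移位充当确定性置换,论文中“最优置换”的具体构造以原文为准):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def single_perm_importance(model, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """单次确定性置换的变量重要性:无随机性,结果完全可复现。"""
    base = r2_score(y, model.predict(X))
    shift = len(X) // 2                       # 确定性排列:循环移位半个样本量(示意)
    importances = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = np.roll(Xp[:, j], shift)   # 打断第 j 列与标签的关联
        importances[j] = base - r2_score(y, model.predict(Xp))
    return importances

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = 3 * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(500)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(single_perm_importance(model, X, y))    # 第 0 列重要性应最高
```

相比经典做法(多次随机置换取均值),每个特征只需一次模型评估,且两次运行结果完全一致。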
zh
[AI-94] Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models
【速读】:该论文旨在解决心电学中正问题(forward problem)的计算效率瓶颈,即从心脏电活动预测体表电位的传统物理模型(如双域或单域方程)因计算成本高而难以应用于实时和大规模临床场景的问题。解决方案的关键在于提出一种基于深度学习(Deep Learning, DL)的高效代理模型,采用时间依赖的注意力机制序列到序列架构,直接从心脏电压传播图预测心电图(ECG)信号,并引入结合Huber损失与频谱熵项的混合损失函数以同时保障时域和频域的保真度,从而在2D组织模拟中实现了高精度(平均R² = 0.99 ± 0.01),验证了该方法作为物理模型替代方案的可行性与可扩展性。
链接: https://arxiv.org/abs/2512.13765
作者: Shaheim Ogbomo-Harmitt,Cesare Magnetti,Chiara Spota,Jakub Grzelak,Oleg Aslanidi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to CinC conference 2025
Abstract:The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean R² = 0.99 ± 0.01). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.
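下面给出“Huber + 谱熵”混合损失的一个最小 PyTorch 示意(谱熵的具体定义与权重 λ 均为示意性假设,非论文官方实现):

```python
import torch
import torch.nn.functional as F

def spectral_entropy(x: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """信号功率谱的香农熵(按 batch 求均值)。x: (batch, time)"""
    psd = torch.fft.rfft(x, dim=-1).abs() ** 2
    p = psd / (psd.sum(dim=-1, keepdim=True) + eps)
    return -(p * torch.log(p + eps)).sum(dim=-1).mean()

def hybrid_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Huber 项保证时域保真,谱熵差异项约束频域结构。"""
    time_term = F.huber_loss(pred, target)
    freq_term = (spectral_entropy(pred) - spectral_entropy(target)).abs()
    return time_term + lam * freq_term

pred = torch.randn(4, 512, requires_grad=True)            # 演示用:网络输出的 ECG 片段
target = torch.sin(torch.linspace(0, 20, 512)).repeat(4, 1)
loss = hybrid_loss(pred, target)
loss.backward()
```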
zh
[AI-95] A Spatio-Temporal Hybrid Quantum-Classical Graph Convolutional Neural Network Approach for Urban Taxi Destination Prediction
【速读】:该论文旨在解决城市道路网络中出租车目的地预测的准确性与稳定性问题,尤其关注如何有效建模高维空间依赖关系以提升预测性能。其解决方案的关键在于提出一种混合时空量子图卷积网络(Hybrid Spatio-Temporal Quantum Graph Convolutional Network, H-STQGCN),该方法融合量子计算与经典深度学习的优势:在空间处理分支中,利用经典图卷积网络(Graph Convolutional Network, GCN)提取局部拓扑特征,并通过可微池化层将图特征映射至参数化量子电路;在时间演化分支中,基于时序卷积网络(Temporal Convolutional Network, TCN)整合多源上下文信息并捕捉行程间的动态依赖关系。这一量子增强机制显著提升了对复杂空间结构的表征能力,从而实现了更精准和稳定的预测效果。
链接: https://arxiv.org/abs/2512.13745
作者: Xiuying Zhang,Qinsheng Zhu,Xiaodong Xing
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a Hybrid Spatio-Temporal Quantum Graph Convolutional Network (H-STQGCN) algorithm by combining the strengths of quantum computing and classical deep learning to predict the taxi destination within urban road networks. Our algorithm consists of two branches: spatial processing and time evolution. Regarding the spatial processing, the classical module encodes the local topological features of the road network based on the GCN method, and the quantum module is designed to map graph features onto parameterized quantum circuits through a differentiable pooling layer. The time evolution is solved by integrating multi-source contextual information and capturing dynamic trip dependencies on the classical TCN theory. Finally, our experimental results demonstrate that the proposed algorithm outperforms the current methods in terms of prediction accuracy and stability, validating the unique advantages of the quantum-enhanced mechanism in capturing high-dimensional spatial dependencies.
zh
[AI-96] Graph AI generates neurological hypotheses validated in molecular organoid and clinical systems
【速读】:该论文旨在解决神经系统疾病缺乏疾病修饰治疗(disease-modifying treatments)的问题,尤其是针对帕金森病(Parkinson’s disease, PD)、双相情感障碍(bipolar disorder, BD)和阿尔茨海默病(Alzheimer’s disease, AD)等复杂疾病。其解决方案的核心是提出PROTON——一种异质图Transformer模型,能够整合分子、类器官(organoid)和临床系统数据,生成可验证的假说。PROTON通过跨尺度关联分析,将遗传风险位点与关键基因功能、环境毒物及候选药物联系起来,并在多个实验和真实世界健康记录中验证其预测能力,从而为神经疾病提供AI驱动的机制发现路径。
链接: https://arxiv.org/abs/2512.13724
作者: Ayush Noori,Joaquín Polonuer,Katharina Meyer,Bogdan Budnik,Shad Morton,Xinyuan Wang,Sumaiya Nazeen,Yingnan He,Iñaki Arango,Lucas Vittor,Matthew Woodworth,Richard C. Krolewski,Michelle M. Li,Ninning Liu,Tushar Kamath,Evan Macosko,Dylan Ritter,Jalwa Afroz,Alexander B. H. Henderson,Lorenz Studer,Samuel G. Rodriques,Andrew White,Noa Dagan,David A. Clifton,George M. Church,Sudeshna Das,Jenny M. Tam,Vikram Khurana,Marinka Zitnik
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Neurological diseases are the leading global cause of disability, yet most lack disease-modifying treatments. We present PROTON, a heterogeneous graph transformer that generates testable hypotheses across molecular, organoid, and clinical systems. To evaluate PROTON, we apply it to Parkinson’s disease (PD), bipolar disorder (BD), and Alzheimer’s disease (AD). In PD, PROTON linked genetic risk loci to genes essential for dopaminergic neuron survival and predicted pesticides toxic to patient-derived neurons, including the insecticide endosulfan, which ranked within the top 1.29% of predictions. In silico screens performed by PROTON reproduced six genome-wide α-synuclein experiments, including a split-ubiquitin yeast two-hybrid system (normalized enrichment score [NES] = 2.30, FDR-adjusted p < 1×10^-4), an ascorbate peroxidase proximity labeling assay (NES = 2.16, FDR < 1×10^-4), and a high-depth targeted exome sequencing study in 496 synucleinopathy patients (NES = 2.13, FDR < 1×10^-4). In BD, PROTON predicted calcitriol as a candidate drug that reversed proteomic alterations observed in cortical organoids derived from BD patients. In AD, we evaluated PROTON predictions in health records from n = 610,524 patients at Mass General Brigham, confirming that five PROTON-predicted drugs were associated with reduced seven-year dementia risk (minimum hazard ratio = 0.63, 95% CI: 0.53-0.75, p < 1×10^-7). PROTON generated neurological hypotheses that were evaluated across molecular, organoid, and clinical systems, defining a path for AI-driven discovery in neurological disease.
zh
机器学习
[LG-0] CHIP: Adaptive Compliance for Humanoid Control through Hindsight Perturbation
链接: https://arxiv.org/abs/2512.14689
作者: Sirui Chen,Zi-ang Cao,Zhengyi Luo,Fernando Castañeda,Chenran Li,Tingwu Wang,Ye Yuan,Linxi “Jim” Fan,C. Karen Liu,Yuke Zhu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: The first two authors contributed equally. Project page: this https URL
Abstract:Recent progress in humanoid robots has unlocked agile locomotion skills, including backflipping, running, and crawling. Yet it remains challenging for a humanoid robot to perform forceful manipulation tasks such as moving objects, wiping, and pushing a cart. We propose adaptive Compliance Humanoid control through hIndsight Perturbation (CHIP), a plug-and-play module that enables controllable end-effector stiffness while preserving agile tracking of dynamic reference motions. CHIP is easy to implement and requires neither data augmentation nor additional reward tuning. We show that a generalist motion-tracking controller trained with CHIP can perform a diverse set of forceful manipulation tasks that require different end-effector compliance, such as multi-robot collaboration, wiping, box delivery, and door opening.
[LG-1] Early Warning Index for Patient Deteriorations in Hospitals
链接: https://arxiv.org/abs/2512.14683
作者: Dimitris Bertsimas,Yu Ma,Kimberly Villalobos Carballo,Gagan Singh,Michal Laskowski,Jeff Mather,Dan Kombert,Howard Haronian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hospitals lack automated systems to harness the growing volume of heterogeneous clinical and operational data to effectively forecast critical events. Early identification of patients at risk for deterioration is essential not only for patient care quality monitoring but also for physician care management. However, translating varied data streams into accurate and interpretable risk assessments poses significant challenges due to inconsistent data formats. We develop a multimodal machine learning framework, the Early Warning Index (EWI), to predict the aggregate risk of ICU admission, emergency response team dispatch, and mortality. Key to EWI's design is a human-in-the-loop process: clinicians help determine alert thresholds and interpret model outputs, which are enhanced by explainable outputs using Shapley Additive exPlanations (SHAP) to highlight clinical and operational factors (e.g., scheduled surgeries, ward census) driving each patient's risk. We deploy EWI in a hospital dashboard that stratifies patients into three risk tiers. Using a dataset of 18,633 unique patients at a large U.S. hospital, our approach automatically extracts features from both structured and unstructured electronic health record (EHR) data and achieves a C-statistic of 0.796. It is currently used as a triage tool for proactively managing at-risk patients. The proposed approach saves physicians valuable time by automatically sorting patients of varying risk levels, allowing them to concentrate on patient care rather than sifting through complex EHR data. By further pinpointing specific risk drivers, the proposed model provides data-informed adjustments to caregiver scheduling and allocation of critical resources. As a result, clinicians and administrators can avert downstream complications, including costly procedures or high readmission rates and improve overall patient flow.
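The abstract does not disclose EWI's exact model, but the general pattern it describes (a tree-ensemble risk score, SHAP attributions, and three-tier stratification) is easy to illustrate. A minimal sketch, assuming the scikit-learn and shap packages, with entirely synthetic stand-ins for the EHR features and tier thresholds:

```python
# Minimal sketch of a SHAP-explained risk score with three risk tiers, in the
# spirit of EWI. All features and thresholds here are hypothetical stand-ins.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                 # stand-ins for EHR-derived features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

model = GradientBoostingClassifier().fit(X, y)
risk = model.predict_proba(X)[:, 1]            # aggregate deterioration risk

# Stratify patients into three tiers, mirroring the dashboard described above.
tiers = np.digitize(risk, bins=[0.33, 0.66])   # 0 = low, 1 = medium, 2 = high

# SHAP values highlight which features drive each patient's risk.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])
print(tiers[:5], np.shape(shap_values))
```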
[LG-2] Beyond Lipschitz Continuity and Monotonicity: Fractal and Chaotic Activation Functions in Echo State Networks
链接: https://arxiv.org/abs/2512.14675
作者: Rae Chipera,Jenny Du,Irene Tsapara
类目: Machine Learning (cs.LG)
*备注: 50 pages, 21 figures. Extended version with full proofs, parameter sweeps, and appendices
Abstract:Contemporary reservoir computing relies heavily on smooth, globally Lipschitz continuous activation functions, limiting applications in defense, disaster response, and pharmaceutical modeling where robust operation under extreme conditions is critical. We systematically investigate non-smooth activation functions, including chaotic, stochastic, and fractal variants, in echo state networks. Through comprehensive parameter sweeps across 36,610 reservoir configurations, we demonstrate that several non-smooth functions not only maintain the Echo State Property (ESP) but outperform traditional smooth activations in convergence speed and spectral radius tolerance. Notably, the Cantor function (continuous everywhere and flat almost everywhere) maintains ESP-consistent behavior up to spectral radii of $\rho \approx 10$, an order of magnitude beyond typical bounds for smooth functions, while achieving 2.6x faster convergence than tanh and ReLU. We introduce a theoretical framework for quantized activation functions, defining a Degenerate Echo State Property (d-ESP) that captures stability for discrete-output functions and proving that d-ESP implies traditional ESP. We identify a critical crowding ratio $Q = N/k$ (reservoir size / quantization levels) that predicts failure thresholds for discrete activations. Our analysis reveals that preprocessing topology, rather than continuity per se, determines stability: monotone, compressive preprocessing maintains ESP across scales, while dispersive or discontinuous preprocessing triggers sharp failures. While our findings challenge assumptions about activation function design in reservoir computing, the mechanism underlying the exceptional performance of certain fractal functions remains unexplained, suggesting fundamental gaps in our understanding of how geometric properties of activation functions influence reservoir dynamics.
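Both the Cantor activation and the echo-state-property probe described above can be sketched directly. A minimal illustration, assuming a sigmoid "compressive preprocessing" step and arbitrary reservoir sizes (not the paper's configurations): two reservoirs started from different states and driven by the same input should converge if ESP holds.

```python
# Sketch: an echo state network with a Cantor-function activation, plus a
# simple ESP probe. Sizes, spectral radius, and the sigmoid preprocessing
# are illustrative choices, not the paper's exact setup.
import numpy as np

def cantor(x, iters=40):
    """Cantor function on [0, 1] via base-3 digits."""
    y, p = 0.0, 0.5
    for _ in range(iters):
        x *= 3.0
        d = int(x)
        x -= d
        if d == 1:          # first middle-third digit fixes the value
            return y + p
        y += p * (d // 2)   # digits 0/2 become binary digits 0/1
        p *= 0.5
    return y

cantor_vec = np.vectorize(cantor)

rng = np.random.default_rng(1)
N = 100
W = rng.normal(size=(N, N))
W *= 3.0 / np.max(np.abs(np.linalg.eigvals(W)))      # spectral radius ~ 3
w_in = rng.normal(size=N)

def step(x, u):
    pre = 1.0 / (1.0 + np.exp(-(W @ x + w_in * u)))  # compressive preprocessing into [0, 1]
    return cantor_vec(pre)

x_a, x_b = rng.random(N), rng.random(N)              # two different initial states
for t in range(200):
    u = np.sin(0.1 * t)
    x_a, x_b = step(x_a, u), step(x_b, u)
print("state gap after 200 steps:", np.linalg.norm(x_a - x_b))
```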
[LG-3] ParaFormer: A Generalized PageRank Graph Transformer for Graph Representation Learning WSDM2026
链接: https://arxiv.org/abs/2512.14619
作者: Chaohao Yuan,Zhenjie Song,Ercan Engin Kuruoglu,Kangfei Zhao,Yang Liu,Deli Zhao,Hong Cheng,Yu Rong
类目: Machine Learning (cs.LG)
*备注: Accepted by WSDM 2026
Abstract:Graph Transformers (GTs) have emerged as a promising graph learning tool, leveraging their all-pair connected property to effectively capture global information. To address the over-smoothing problem in deep GNNs, global attention was initially introduced, eliminating the necessity for using deep GNNs. However, through empirical and theoretical analysis, we verify that the introduced global attention exhibits severe over-smoothing, causing node representations to become indistinguishable due to its inherent low-pass filtering. This effect is even stronger than that observed in GNNs. To mitigate this, we propose PageRank Transformer (ParaFormer), which features a PageRank-enhanced attention module designed to mimic the behavior of deep Transformers. We theoretically and empirically demonstrate that ParaFormer mitigates over-smoothing by functioning as an adaptive-pass filter. Experiments show that ParaFormer achieves consistent performance improvements across both node classification and graph classification tasks on 11 datasets ranging from thousands to millions of nodes, validating its efficacy. The supplementary material, including code and appendix, can be found in this https URL.
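A minimal sketch of the general "PageRank-enhanced attention" idea, mixing several powers of the attention matrix with learnable coefficients so the layer can act as an adaptive-pass filter; the theta weights, hop count, and dimensions below are assumptions, not ParaFormer's actual module:

```python
# Sketch: generalized-PageRank-weighted multi-hop propagation of an attention
# matrix. The learnable theta weights and toy dimensions are assumptions.
import torch

def gpr_attention(q, k, v, theta):
    """Mix powers of the attention matrix: out = sum_k theta_k * P^k @ v."""
    p = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    out = theta[0] * v
    h = v
    for t_k in theta[1:]:
        h = p @ h                 # one more hop of (low-pass) attention smoothing
        out = out + t_k * h       # weighted mixing can pass different "frequencies"
    return out

n, d = 8, 16
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
theta = torch.nn.Parameter(torch.tensor([0.5, 0.3, 0.1, 0.05, 0.05]))
print(gpr_attention(q, k, v, theta).shape)   # torch.Size([8, 16])
```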
[LG-4] Hierarchical Persistence Velocity for Network Anomaly Detection: Theory and Applications to Cryptocurrency Markets
链接: https://arxiv.org/abs/2512.14615
作者: Omid Khormali
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce the Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), a novel topological data analysis method for detecting anomalies in time-varying networks. Unlike existing methods that measure cumulative topological presence, we introduce the first velocity-based perspective on persistence diagrams, measuring the rate at which features appear and disappear, automatically downweighting noise through overlap-based weighting. We also prove that OW-HNPV is mathematically stable. It behaves in a controlled, predictable way, even when comparing persistence diagrams from networks with different feature types. Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV demonstrates superior performance for cryptocurrency anomaly detection, achieving up to 10.4% AUC gain over baseline models for 7-day price movement predictions. Compared with established methods, including Vector of Averaged Bettis (VAB), persistence landscapes, and persistence images, velocity-based summaries excel at medium- to long-range forecasting (4-7 days), with OW-HNPV providing the most consistent and stable performance across prediction horizons. Our results show that modeling topological velocity is crucial for detecting structural anomalies in dynamic networks.
[LG-5] Sound and Music Biases in Deep Music Transcription Models: A Systematic Analysis
链接: https://arxiv.org/abs/2512.14602
作者: Lukáš Samuel Marták,Patricia Hu,Gerhard Widmer
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: pre-print of the upcoming EURASIP JASM journal article
Abstract:Automatic Music Transcription (AMT) – the task of converting music audio into note representations – has seen rapid progress, driven largely by deep learning systems. Due to the limited availability of richly annotated music datasets, much of the progress in AMT has been concentrated on classical piano music, and even a few very specific datasets. Whether these systems can generalize effectively to other musical contexts remains an open question. Complementing recent studies on distribution shifts in sound (e.g., recording conditions), in this work we investigate the musical dimension – specifically, variations in genre, dynamics, and polyphony levels. To this end, we introduce the MDS corpus, comprising three distinct subsets – (1) Genre, (2) Random, and (3) MAEtest – to emulate different axes of distribution shift. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically-informed performance metrics. Our extensive evaluation isolates and exposes varying degrees of performance degradation under specific distribution shifts. In particular, we measure a note-level F1 performance drop of 20 percentage points due to sound, and 14 due to genre. Generally, we find that dynamics estimation proves more vulnerable to musical variation than onset prediction. Musically informed evaluation metrics, particularly those capturing harmonic structure, help identify potential contributing factors. Furthermore, experiments with randomly generated, non-musical sequences reveal clear limitations in system performance under extreme musical distribution shifts. Altogether, these findings offer new evidence of the persistent impact of the Corpus Bias problem in deep AMT systems.
[LG-6] Hybrid Iterative Solvers with Geometry-Aware Neural Preconditioners for Parametric PDEs
链接: https://arxiv.org/abs/2512.14596
作者: Youngkyu Lee,Francesc Levrero Florencio,Jay Pathak,George Em Karniadakis
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 19 pages, 10 figures, 3 tables
Abstract:The convergence behavior of classical iterative solvers for parametric partial differential equations (PDEs) is often highly sensitive to the domain and specific discretization of PDEs. Previously, we introduced hybrid solvers by combining the classical solvers with neural operators for a specific geometry [1], but they tend to under-perform in geometries not encountered during training. To address this challenge, we introduce Geo-DeepONet, a geometry-aware deep operator network that incorporates domain information extracted from finite element discretizations. Geo-DeepONet enables accurate operator learning across arbitrary unstructured meshes without requiring retraining. Building on this, we develop a class of geometry-aware hybrid preconditioned iterative solvers by coupling Geo-DeepONet with traditional methods such as relaxation schemes and Krylov subspace algorithms. Through numerical experiments on parametric PDEs posed over diverse unstructured domains, we demonstrate the enhanced robustness and efficiency of the proposed hybrid solvers for multiple real-world applications.
[LG-7] Counterfactual Explanations for Time Series Should be Human-Centered and Temporally Coherent in Interventions
链接: https://arxiv.org/abs/2512.14559
作者: Emmanuel C. Chukwu,Rianne M. Schouten,Monique Tabak,Mykola Pechenizkiy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Counterfactual explanations are increasingly proposed as interpretable mechanisms to achieve algorithmic recourse. However, current counterfactual techniques for time series classification are predominantly designed with static data assumptions and focus on generating minimal input perturbations to flip model predictions. This paper argues that such approaches are fundamentally insufficient in clinical recommendation settings, where interventions unfold over time and must be causally plausible and temporally coherent. We advocate for a shift towards counterfactuals that reflect sustained, goal-directed interventions aligned with clinical reasoning and patient-specific dynamics. We identify critical gaps in existing methods that limit their practical applicability, specifically, temporal blind spots and the lack of user-centered considerations in both method design and evaluation metrics. To support our position, we conduct a robustness analysis of several state-of-the-art methods for time series and show that the generated counterfactuals are highly sensitive to stochastic noise. This finding highlights their limited reliability in real-world clinical settings, where minor measurement variations are inevitable. We conclude by calling for methods and evaluation frameworks that go beyond mere prediction changes without considering feasibility or actionability. We emphasize the need for actionable, purpose-driven interventions that are feasible in real-world contexts for the users of such applications.
[LG-8] Synthetic Electrogram Generation with Variational Autoencoders for ECGI
链接: https://arxiv.org/abs/2512.14537
作者: Miriam Gutiérrez Fernández,Karen López-Linares,Carlos Fambuena Santos,María S. Guillem,Andreu M. Climent,Óscar Barquero Pérez
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Atrial fibrillation (AF) is the most prevalent sustained cardiac arrhythmia, and its clinical assessment requires accurate characterization of atrial electrical activity. Noninvasive electrocardiographic imaging (ECGI) combined with deep learning (DL) approaches for estimating intracardiac electrograms (EGMs) from body surface potentials (BSPMs) has shown promise, but progress is hindered by the limited availability of paired BSPM-EGM datasets. To address this limitation, we investigate variational autoencoders (VAEs) for the generation of synthetic multichannel atrial EGMs. Two models are proposed: a sinus rhythm-specific VAE (VAE-S) and a class-conditioned VAE (VAE-C) trained on both sinus rhythm and AF signals. Generated EGMs are evaluated using morphological, spectral, and distributional similarity metrics. VAE-S achieves higher fidelity with respect to in silico EGMs, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. As a proof of concept, the generated EGMs are used for data augmentation in a downstream noninvasive EGM reconstruction task, where moderate augmentation improves estimation performance. These results demonstrate the potential of VAE-based generative modeling to alleviate data scarcity and enhance deep learning-based ECGI pipelines.
[LG-9] Improving Slow Transfer Predictions: Generative Methods Compared
链接: https://arxiv.org/abs/2512.14522
作者: Jacob Taegon Kim,Alex Sim,Kesheng Wu,Jinoh Kim
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified and selectively monitored, optimizing network usage and overall performance. A key bottleneck to improving the predictive power of machine learning (ML) models in this context is the issue of class imbalance. This project focuses on addressing the class imbalance problem to enhance the accuracy of performance predictions. In this study, we analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques. Additionally, we adjust the class imbalance ratios in training datasets to evaluate their impact on model performance. While augmentation may improve performance, increasing the imbalance ratio does not yield significant further improvement. We conclude that even the most advanced technique, such as CTGAN, does not significantly improve over simple stratified sampling.
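A minimal sketch of the kind of imbalance-ratio sweep described above, assuming the imbalanced-learn and scikit-learn packages and a synthetic stand-in for the transfer-performance features:

```python
# Sketch: sweeping the class-imbalance ratio with SMOTE on synthetic data,
# then checking minority-class F1 on a held-out stratified split.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for ratio in [0.1, 0.25, 0.5, 1.0]:          # minority/majority ratio after resampling
    X_res, y_res = SMOTE(sampling_strategy=ratio, random_state=0).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    print(f"ratio={ratio:.2f}  minority-F1={f1_score(y_te, clf.predict(X_te)):.3f}")
```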
[LG-10] Kinetic-Mamba: Mamba-Assisted Predictions of Stiff Chemical Kinetics
链接: https://arxiv.org/abs/2512.14471
作者: Additi Pandey,Liang Wei,Hessam Babaee,George Em Karniadakis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate chemical kinetics modeling is essential for combustion simulations, as it governs the evolution of complex reaction pathways and thermochemical states. In this work, we introduce Kinetic-Mamba, a Mamba-based neural operator framework that integrates the expressive power of neural operators with the efficient temporal modeling capabilities of Mamba architectures. The framework comprises three complementary models: (i) a standalone Mamba model that predicts the time evolution of thermochemical state variables from given initial conditions; (ii) a constrained Mamba model that enforces mass conservation while learning the state dynamics; and (iii) a regime-informed architecture employing two standalone Mamba models to capture dynamics across temperature-dependent regimes. We additionally develop a latent Kinetic-Mamba variant that evolves dynamics in a reduced latent space and reconstructs the full state on the physical manifold. We evaluate the accuracy and robustness of Kinetic-Mamba using both time-decomposition and recursive-prediction strategies. We further assess the extrapolation capabilities of the model on varied out-of-distribution datasets. Computational experiments on Syngas and GRI-Mech 3.0 reaction mechanisms demonstrate that our framework achieves high fidelity in predicting complex kinetic behavior using only the initial conditions of the state variables.
[LG-11] AnySleep: a channel-agnostic deep learning system for high-resolution sleep staging in multi-center cohorts
链接: https://arxiv.org/abs/2512.14461
作者: Niklas Grieger,Jannik Raskob,Siamak Mehrkanoon,Stephan Bialonski
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
*备注: 18 pages, 6 figures, 2 tables
Abstract:Sleep is essential for good health throughout our lives, yet studying its dynamics requires manual sleep staging, a labor-intensive step in sleep research and clinical care. Across centers, polysomnography (PSG) recordings are traditionally scored in 30-s epochs for pragmatic, not physiological, reasons and can vary considerably in electrode count, montage, and subject characteristics. These constraints present challenges in conducting harmonized multi-center sleep studies and discovering novel, robust biomarkers on shorter timescales. Here, we present AnySleep, a deep neural network model that uses any electroencephalography (EEG) or electrooculography (EOG) data to score sleep at adjustable temporal resolutions. We trained and validated the model on over 19,000 overnight recordings from 21 datasets collected across multiple clinics, spanning nearly 200,000 hours of EEG and EOG data, to promote robust generalization across sites. The model attains state-of-the-art performance and surpasses or equals established baselines at 30-s epochs. Performance improves as more channels are provided, yet remains strong when EOG is absent or when only EOG or single EEG derivations (frontal, central, or occipital) are available. On sub-30-s timescales, the model captures short wake intrusions consistent with arousals and improves prediction of physiological characteristics (age, sex) and pathophysiological conditions (sleep apnea), relative to standard 30-s scoring. We make the model publicly available to facilitate large-scale studies with heterogeneous electrode setups and to accelerate the discovery of novel biomarkers in sleep.
[LG-12] Bridging Artificial Intelligence and Data Assimilation: The Data-driven Ensemble Forecasting System ClimaX-LETKF
链接: https://arxiv.org/abs/2512.14444
作者: Akira Takeshima,Kenta Shiraishi,Atsushi Okazaki,Tadashi Tsuyuki,Shunji Kotsuki
类目: Machine Learning (cs.LG)
*备注: 14 pages and 5 figures for the main text and 13 pages and 7 figures as supplementary materials
Abstract:While machine learning-based weather prediction (MLWP) has achieved significant advancements, research on assimilating real observations or ensemble forecasts within MLWP models remains limited. We introduce ClimaX-LETKF, the first purely data-driven ML-based ensemble weather forecasting system. It operates stably over multiple years, independently of numerical weather prediction (NWP) models, by assimilating the NCEP ADP Global Upper Air and Surface Weather Observations. The system demonstrates greater stability and accuracy with relaxation to prior perturbation (RTPP) than with relaxation to prior spread (RTPS), while NWP models tend to be more stable with RTPS. RTPP replaces an analysis perturbation with a weighted blend of analysis and background perturbations, whereas RTPS simply rescales the analysis perturbation. Our experiments reveal that MLWP models are less capable of restoring the atmospheric field to its attractor than NWP models. This work provides valuable insights for enhancing MLWP ensemble forecasting systems and represents a substantial step toward their practical applications.
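The two relaxation schemes contrasted above have standard forms, sketched below on toy ensemble perturbation arrays; the relaxation factor alpha and toy ensembles are illustrative choices.

```python
# Sketch of the two covariance-relaxation schemes contrasted in the abstract,
# applied to ensemble perturbations of shape (members, state_dim).
import numpy as np

def rtpp(anal_pert, bg_pert, alpha=0.6):
    """Relaxation to prior perturbation: blend analysis and background perturbations."""
    return (1.0 - alpha) * anal_pert + alpha * bg_pert

def rtps(anal_pert, bg_pert, alpha=0.6, eps=1e-12):
    """Relaxation to prior spread: rescale analysis perturbations toward the prior spread."""
    sig_a = anal_pert.std(axis=0, ddof=1)
    sig_b = bg_pert.std(axis=0, ddof=1)
    return anal_pert * (1.0 + alpha * (sig_b - sig_a) / (sig_a + eps))

rng = np.random.default_rng(0)
bg = rng.normal(size=(40, 1000))                     # background ensemble perturbations
anal = 0.4 * bg + 0.1 * rng.normal(size=(40, 1000))  # analysis shrinks the spread
print("spread  bg / anal / rtpp / rtps:",
      bg.std(), anal.std(), rtpp(anal, bg).std(), rtps(anal, bg).std())
```

The design difference the paper highlights is visible here: RTPP changes the perturbation directions (it mixes in the background perturbations themselves), while RTPS only rescales the analysis perturbations.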
[LG-13] Hybrid Ensemble Method for Detecting Cyber-Attacks in Water Distribution Systems Using the BATADAL Dataset
链接: https://arxiv.org/abs/2512.14422
作者: Waqas Ahmed
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 18 pages, figures
Abstract:The cybersecurity of Industrial Control Systems that manage critical infrastructure such as Water Distribution Systems has become increasingly important as digital connectivity expands. The BATADAL benchmark is a good source for testing intrusion detection techniques, but it presents several important problems, such as class imbalance, multivariate time dependence, and stealthy attacks. We consider a hybrid ensemble learning model that enhances the detection of cyber-attacks in WDS by using the complementary capabilities of machine learning and deep learning models. Three base learners, namely Random Forest, eXtreme Gradient Boosting, and a Long Short-Term Memory network, have been rigorously compared, along with seven ensemble variants using simple averaging and stacked learning with a logistic regression meta-learner. Random Forest analysis identified top predictors, which were turned into temporal and statistical features, and the Synthetic Minority Oversampling Technique (SMOTE) was used to overcome the class imbalance issue. The analysis indicates that the single Long Short-Term Memory network performs poorly (F1 = 0.000, AUC = 0.4460), but tree-based models, especially eXtreme Gradient Boosting, perform well (F1 = 0.7470, AUC = 0.9684). The hybrid stacked ensemble of Random Forest, eXtreme Gradient Boosting, and the Long Short-Term Memory network scored highest, with an attack-class F1-score of 0.7205 and an AUC of 0.9826, indicating that heterogeneous stacking can balance model precision and generalization. The proposed framework establishes a robust and scalable solution for cyber-attack detection in time-dependent industrial systems, integrating temporal learning and ensemble diversity to support the secure operation of critical infrastructure.
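A minimal sketch of the winning configuration's shape (a stacked ensemble with a logistic-regression meta-learner), assuming scikit-learn and xgboost are available; the LSTM branch is omitted for brevity and the data is a synthetic stand-in for BATADAL features:

```python
# Sketch: a stacked heterogeneous ensemble with a logistic-regression
# meta-learner, mirroring the paper's best model minus the LSTM branch.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("xgb", XGBClassifier(eval_metric="logloss"))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]))
```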
[LG-14] Dual-Axis RCCL: Representation-Complete Convergent Learning for Organic Chemical Space
链接: https://arxiv.org/abs/2512.14418
作者: Dejun Hu,Zhiming Li,Jia-Rui Shen,Jia-Ning Tu,Zi-Hao Ye,Junliang Zhang
类目: Machine Learning (cs.LG)
*备注: 33 pages, 10 figures
Abstract:Machine learning is profoundly reshaping molecular and materials modeling; however, given the vast scale of chemical space (10^30-10^60), it remains an open scientific question whether models can achieve convergent learning across this space. We introduce a Dual-Axis Representation-Complete Convergent Learning (RCCL) strategy, enabled by a molecular representation that integrates graph convolutional network (GCN) encoding of local valence environments, grounded in modern valence bond theory, together with no-bridge graph (NBG) encoding of ring/cage topologies, providing a quantitative measure of chemical-space coverage. This framework formalizes representation completeness, establishing a principled basis for constructing datasets that support convergent learning for large models. Guided by this RCCL framework, we develop the FD25 dataset, systematically covering 13,302 local valence units and 165,726 ring/cage topologies, achieving near-complete combinatorial coverage of organic molecules with H/C/N/O/F elements. Graph neural networks trained on FD25 exhibit representation-complete convergent learning and strong out-of-distribution generalization, with an overall prediction error of approximately 1.0 kcal/mol MAE across external benchmarks. Our results establish a quantitative link between molecular representation, structural completeness, and model generalization, providing a foundation for interpretable, transferable, and data-efficient molecular intelligence.
[LG-15] GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion
链接: https://arxiv.org/abs/2512.14400
作者: Fangzhou Lin,Guoshun He,Zhenyu Guo,Zhe Huang,Jinsong Tao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Electric load is simultaneously affected across multiple time scales by exogenous factors such as weather and calendar rhythms, sudden events, and policies. Therefore, this paper proposes GRAFT (GRid-Aware Forecasting with Text), which modifies and improves STanHOP to better support grid-aware forecasting and multi-source textual interventions. Specifically, GRAFT strictly aligns daily-aggregated news, social media, and policy texts with half-hour load, and realizes text-guided fusion to specific time positions via cross-attention during both training and rolling forecasting. In addition, GRAFT provides a plug-and-play external-memory interface to accommodate different information sources in real-world deployment. We construct and release a unified aligned benchmark covering 2019–2021 for five Australian states (half-hour load, daily-aligned weather/calendar variables, and three categories of external texts), and conduct systematic, reproducible evaluations at three scales – hourly, daily, and monthly – under a unified protocol for comparison across regions, external sources, and time scales. Experimental results show that GRAFT significantly outperforms strong baselines and reaches or surpasses the state of the art across multiple regions and forecasting horizons. Moreover, the model is robust in event-driven scenarios and enables temporal localization and source-level interpretation of text-to-load effects through attention read-out. We release the benchmark, preprocessing scripts, and forecasting results to facilitate standardized empirical evaluation and reproducibility in power grid load forecasting.
[LG-16] SuperWing: a comprehensive transonic wing dataset for data-driven aerodynamic design
链接: https://arxiv.org/abs/2512.14397
作者: Yunjia Yang,Weishao Tang,Mengxin Liu,Nils Thuerey,Yufei Zhang,Haixin Chen
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Machine-learning surrogate models have shown promise in accelerating aerodynamic design, yet progress toward generalizable predictors for three-dimensional wings has been limited by the scarcity and restricted diversity of existing datasets. Here, we present SuperWing, a comprehensive open dataset of transonic swept-wing aerodynamics comprising 4,239 parameterized wing geometries and 28,856 Reynolds-averaged Navier-Stokes flow field solutions. The wing shapes in the dataset are generated using a simplified yet expressive geometry parameterization that incorporates spanwise variations in airfoil shape, twist, and dihedral, allowing for an enhanced diversity without relying on perturbations of a baseline wing. All shapes are simulated under a broad range of Mach numbers and angles of attack covering the typical flight envelope. To demonstrate the dataset’s utility, we benchmark two state-of-the-art Transformers that accurately predict surface flow and achieve a 2.5 drag-count error on held-out samples. Models pretrained on SuperWing further exhibit strong zero-shot generalization to complex benchmark wings such as DLR-F6 and NASA CRM, underscoring the dataset’s diversity and potential for practical usage.
[LG-17] Black-Box Auditing of Quantum Model: Lifted Differential Privacy with Quantum Canaries
链接: https://arxiv.org/abs/2512.14388
作者: Baobao Song,Shiva Raj Pokhrel,Athanasios V. Vasilakos,Tianqing Zhu,Gang Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quantum machine learning (QML) promises significant computational advantages, yet models trained on sensitive data risk memorizing individual records, creating serious privacy vulnerabilities. While Quantum Differential Privacy (QDP) mechanisms provide theoretical worst-case guarantees, they critically lack empirical verification tools for deployed models. We introduce the first black-box privacy auditing framework for QML based on Lifted Quantum Differential Privacy, leveraging quantum canaries (strategically offset-encoded quantum states) to detect memorization and precisely quantify privacy leakage during training. Our framework establishes a rigorous mathematical connection between canary offset and trace distance bounds, deriving empirical lower bounds on privacy budget consumption that bridge the critical gap between theoretical guarantees and practical privacy verification. Comprehensive evaluations across both simulated and physical quantum hardware demonstrate our framework’s effectiveness in measuring actual privacy loss in QML models, enabling robust privacy verification in QML systems.
[LG-18] Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits
链接: https://arxiv.org/abs/2512.14338
作者: Michael Murray,Tenzin Chan,Kedar Karhadker,Christopher J. Hillar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many learning problems involve symmetries, and while invariance can be built into neural architectures, it can also emerge implicitly when training on group-structured data. We study this phenomenon in classical Hopfield networks and show they can infer the full isomorphism class of a graph from a small random sample. Our results reveal that: (i) graph isomorphism classes can be represented within a three-dimensional invariant subspace, (ii) using gradient descent to minimize energy flow (MEF) has an implicit bias toward norm-efficient solutions, which underpins a polynomial sample complexity bound for learning isomorphism classes, and (iii) across multiple learning rules, parameters converge toward the invariant subspace as sample sizes grow. Together, these findings highlight a unifying mechanism for generalization in Hopfield networks: a bias toward norm efficiency in learning drives the emergence of approximate invariance under group-structured data.
[LG-19] FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting
链接: https://arxiv.org/abs/2512.14253
作者: Xingjian Wu,Hanyin Cheng,Xiangfei Qiu,Zhengyu Li,Jilin Hu,Chenjuan Guo,Bin Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we introduce FLAME, a family of extremely lightweight and capable Time Series Foundation Models, which support both deterministic and probabilistic forecasting via generative probabilistic modeling, thus ensuring both efficiency and robustness. FLAME utilizes the Legendre Memory for strong generalization capabilities. By adapting variants of the Legendre Memory, i.e., translated Legendre (LegT) and scaled Legendre (LegS), in the encoding and decoding phases, FLAME can effectively capture the inherent inductive bias within the data and make efficient long-range inferences. To enhance the accuracy of probabilistic forecasting while remaining efficient, FLAME adopts a normalizing-flow-based forecasting head, which can model arbitrarily intricate distributions over the forecasting horizon in a generative manner. Comprehensive experiments on well-recognized benchmarks, including TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art zero-shot performance of FLAME on both deterministic and probabilistic forecasting tasks.
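The translated-Legendre (LegT) idea of compressing a sliding window into a fixed set of Legendre coefficients can be sketched with NumPy alone; the window length and polynomial order below are arbitrary, and this is not FLAME's actual encoder:

```python
# Sketch: compressing a sliding window of a signal into Legendre coefficients,
# a LegT-style fixed-size memory of the recent past.
import numpy as np
from numpy.polynomial import legendre

T, window, order = 500, 64, 8
t = np.arange(T)
signal = np.sin(0.05 * t) + 0.1 * np.random.default_rng(0).normal(size=T)

grid = np.linspace(-1.0, 1.0, window)          # window rescaled to [-1, 1]
V = legendre.legvander(grid, order)            # (window, order+1) Legendre design matrix

coeffs = []
for s in range(window, T):
    w = signal[s - window:s]
    c, *_ = np.linalg.lstsq(V, w, rcond=None)  # least-squares Legendre projection
    coeffs.append(c)
coeffs = np.array(coeffs)                      # (T - window, order+1) memory features

recon = V @ coeffs[-1]                         # reconstruct the latest window
print("relative reconstruction error:",
      np.linalg.norm(recon - signal[-window:]) / np.linalg.norm(signal[-window:]))
```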
[LG-20] Physically consistent model learning for reaction-diffusion systems
链接: https://arxiv.org/abs/2512.14240
作者: Erion Morina,Martin Holler
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Optimization and Control (math.OC)
*备注:
Abstract:This paper addresses the problem of learning reaction-diffusion (RD) systems from data while ensuring physical consistency and well-posedness of the learned models. Building on a regularization-based framework for structured model learning, we focus on learning parameterized reaction terms and investigate how to incorporate key physical properties, such as mass conservation and quasipositivity, directly into the learning process. Our main contributions are twofold: First, we propose techniques to systematically modify a given class of parameterized reaction terms such that the resulting terms inherently satisfy mass conservation and quasipositivity, ensuring that the learned RD systems preserve non-negativity and adhere to physical principles. These modifications also guarantee well-posedness of the resulting PDEs under additional regularity and growth conditions. Second, we extend existing theoretical results on regularization-based model learning to RD systems using these physically consistent reaction terms. Specifically, we prove that solutions to the learning problem converge to a unique, regularization-minimizing solution of a limit system even when conservation laws and quasipositivity are enforced. In addition, we provide approximation results for quasipositive functions, essential for constructing physically consistent parameterizations. These results advance the development of interpretable and reliable data-driven models for RD systems that align with fundamental physical laws.
[LG-21] Understanding the Gain from Data Filtering in Multimodal Contrastive Learning
链接: https://arxiv.org/abs/2512.14230
作者: Divyansh Pareek,Sewoong Oh,Simon S. Du
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 40 pages, 8 figures, 1 table. This work is accepted to the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
Abstract:The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta \in (0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: (i) the error without filtering is upper and lower bounded by $\frac{1}{\eta\sqrt{n}}$, and (ii) the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
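A minimal sketch of teacher-based filtering in this bimodal setting: a teacher scores each pair by cosine similarity and training keeps the top-scoring fraction. The embeddings and the mismatch mechanism below (a fraction eta of correctly matched pairs, echoing the abstract's notation) are synthetic stand-ins, not the paper's setup:

```python
# Sketch: teacher-based filtering of paired bimodal data by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 10000, 128, 0.6                     # eta: fraction of correctly matched pairs
img = rng.normal(size=(n, d))
matched = rng.random(n) < eta
txt = np.where(matched[:, None],
               img + 0.5 * rng.normal(size=(n, d)),   # correct pairs: noisy copies
               rng.normal(size=(n, d)))               # mismatched pairs: independent

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

scores = cosine(img, txt)                       # teacher quality score per pair
keep = scores >= np.quantile(scores, 1 - eta)   # keep roughly the matched fraction
print(f"precision of kept pairs: {matched[keep].mean():.3f}")
```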
[LG-22] Random-Bridges as Stochastic Transports for Generative Models
链接: https://arxiv.org/abs/2512.14190
作者: Stefano Goria,Levent A. Mengütürk,Murat C. Mengütürk,Berkan Sesen
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:This paper motivates the use of random-bridges – stochastic processes conditioned to take target distributions at fixed timepoints – in the realm of generative modelling. Herein, random-bridges can act as stochastic transports between two probability distributions when appropriately initialized, and can display either Markovian or non-Markovian, and either continuous, discontinuous or hybrid patterns depending on the driving process. We show how one can start from general probabilistic statements and then branch out into specific representations for learning and simulation algorithms in terms of information processing. Our empirical results, built on Gaussian random bridges, produce high-quality samples in significantly fewer steps compared to traditional approaches, while achieving competitive Frechet inception distance scores. Our analysis provides evidence that the proposed framework is computationally cheap and suitable for high-speed generation tasks.
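For the Gaussian case on which the paper's experiments build, the bridge dynamics admit a simple Euler discretization; the sketch below (with an illustrative volatility) transports source samples to target samples along Brownian-bridge paths:

```python
# Sketch: simulating Gaussian (Brownian) random bridges that transport source
# samples to target samples. Sigma and the time grid are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, steps, sigma = 2000, 200, 0.5
x0 = rng.normal(loc=-2.0, size=n)              # draws from the source distribution
x1 = rng.normal(loc=2.0, scale=0.5, size=n)    # draws from the target distribution

dt = 1.0 / steps
x = x0.copy()
for k in range(steps):
    t = k * dt
    drift = (x1 - x) / (1.0 - t)               # pins each path to its endpoint at t = 1
    x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=n)

print("mean/std at t=1:", x.mean().round(3), x.std().round(3))
print("max endpoint error:", np.abs(x - x1).max().round(3))
```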
[LG-23] Optimizing the Adversarial Perturbation with a Momentum-based Adaptive Matrix
链接: https://arxiv.org/abs/2512.14188
作者: Wei Tao,Sheng Long,Xin Liu,Wei Li,Qing Tao
类目: Machine Learning (cs.LG)
*备注: IEEE Transactions on Dependable and Secure Computing
Abstract:Generating adversarial examples (AEs) can be formulated as an optimization problem. Among various optimization-based attacks, the gradient-based PGD and the momentum-based MI-FGSM have garnered considerable interest. However, all these attacks use the sign function to scale their perturbations, which raises several theoretical concerns from the point of view of optimization. In this paper, we first reveal that PGD is actually a specific reformulation of the projected gradient method using only the current gradient to determine its step-size. Further, we show that when we utilize a conventional adaptive matrix with the accumulated gradients to scale the perturbation, PGD becomes AdaGrad. Motivated by this analysis, we present a novel momentum-based attack AdaMI, in which the perturbation is optimized with an interesting momentum-based adaptive matrix. AdaMI is proved to attain optimal convergence for convex problems, indicating that it addresses the non-convergence issue of MI-FGSM, thereby ensuring stability of the optimization process. The experiments demonstrate that the proposed momentum-based adaptive matrix can serve as a general and effective technique to boost adversarial transferability over the state-of-the-art methods across different networks while maintaining better stability and imperceptibility.
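One plausible reading of the proposed step rule, replacing sign() with a momentum-based diagonal adaptive matrix, is sketched below on a toy model; the hyperparameters and update details are assumptions, not the paper's exact recipe:

```python
# Sketch: an MI-FGSM-style attack where the sign() step is replaced by an
# AdaGrad-like scaling of the momentum accumulator. Toy model and settings.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 2))
loss_fn = torch.nn.CrossEntropyLoss()
x = torch.randn(1, 10)
y = torch.tensor([1])

eps, alpha, mu, iters = 0.3, 0.05, 1.0, 10
delta = torch.zeros_like(x)
g_mom = torch.zeros_like(x)       # momentum accumulator (as in MI-FGSM)
v_acc = torch.zeros_like(x)       # accumulated squared momenta (adaptive matrix)

for _ in range(iters):
    adv = (x + delta).requires_grad_(True)
    loss = loss_fn(model(adv), y)
    grad, = torch.autograd.grad(loss, adv)
    g_mom = mu * g_mom + grad / grad.abs().sum()      # L1-normalized momentum
    v_acc = v_acc + g_mom ** 2
    step = alpha * g_mom / (v_acc.sqrt() + 1e-8)      # adaptive scaling, not sign()
    delta = (delta + step).clamp(-eps, eps).detach()  # project back into the eps-ball

print("final loss:", loss_fn(model(x + delta), y).item())
```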
[LG-24] On Improving Deep Active Learning with Formal Verification
链接: https://arxiv.org/abs/2512.14170
作者: Jonathan Spiegelman,Guy Amir,Guy Katz
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:
Abstract:Deep Active Learning (DAL) aims to reduce labeling costs in neural-network training by prioritizing the most informative unlabeled samples for annotation. Beyond selecting which samples to label, several DAL approaches further enhance data efficiency by augmenting the training set with synthetic inputs that do not require additional manual labeling. In this work, we investigate how augmenting the training data with adversarial inputs that violate robustness constraints can improve DAL performance. We show that adversarial examples generated via formal verification contribute substantially more than those produced by standard, gradient-based attacks. We apply this extension to multiple modern DAL techniques, as well as to a new technique that we propose, and show that it yields significant improvements in model generalization across standard benchmarks.
[LG-25] Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting
链接: https://arxiv.org/abs/2512.14115
作者: Ramesh Gundluru,Shubham Gupta,Sri Rama Murty K
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations and (ii) audio-audio contrastive learning, via Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
[LG-26] A First-Order Logic-Based Alternative to Reward Models in RLHF
链接: https://arxiv.org/abs/2512.14100
作者: Chunjin Jian,Xinhua Zhu
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:
Abstract:Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. However, the quality and stability of the trained reward model largely determine the final alignment performance. Existing approaches such as Proximal Policy Optimization (PPO) rely heavily on reward models to guide LLMs toward human-aligned behaviors. In this work, we propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling. Instead of relying on heuristic reward estimation, our method leverages formal logical consistency to steer model alignment with human preferences. Since real-world questions can be interpreted from multiple perspectives, to ensure that logic-based reinforcement learning does not cause model collapse, we introduce S-GRPO, a supervised variant of the GRPO framework. S-GRPO incorporates an additional supervised component and jointly optimizes the generation term, KL-divergence regularization, and label-based objective during training. Experimental results demonstrate that S-GRPO consistently outperforms standard supervised fine-tuning (SFT) in both performance and robustness. Furthermore, it extends existing preference-learning frameworks such as GRPO and DPO, offering a more flexible and task-adaptive approach to alignment training. Our code is available at this https URL.
[LG-27] Cornserve: Efficiently Serving Any-to-Any Multimodal Models
链接: https://arxiv.org/abs/2512.14098
作者: Jeff J. Ma,Jae-Won Chung,Jisang Ahn,Yizhuo Liang,Akshay Jajoo,Myungjin Lee,Mosharaf Chowdhury
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and also generate combinations of text and multimodal data as output, introducing request type, computation path, and computation scaling heterogeneity in model serving. Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this, Cornserve's planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve's distributed runtime then executes the model per the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81 \times throughput improvement and up to 5.79 \times tail latency reduction over existing solutions.
[LG-28] Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization
链接: https://arxiv.org/abs/2512.14086
作者: Boyuan Yao,Dingcheng Luo,Lianghao Cao,Nikola Kovachki,Thomas O’Leary-Roseberry,Omar Ghattas
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We present approximation theories and efficient training methods for derivative-informed Fourier neural operators (DIFNOs) with applications to PDE-constrained optimization. A DIFNO is an FNO trained by minimizing its prediction error jointly on output and Fréchet derivative samples of a high-fidelity operator (e.g., a parametric PDE solution operator). As a result, a DIFNO can closely emulate not only the high-fidelity operator’s response but also its sensitivities. To motivate the use of DIFNOs instead of conventional FNOs as surrogate models, we show that accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Fréchet derivatives. Then, for continuously differentiable operators, we establish (i) simultaneous universal approximation of FNOs and their Fréchet derivatives on compact sets, and (ii) universal approximation of FNOs in weighted Sobolev spaces with input measures that have unbounded supports. Our theoretical results certify the capability of FNOs for accurate derivative-informed operator learning and accurate solution of PDE-constrained optimization. Furthermore, we develop efficient training schemes using dimension reduction and multi-resolution techniques that significantly reduce memory and computational costs for Fréchet derivative learning. Numerical examples on nonlinear diffusion–reaction, Helmholtz, and Navier–Stokes equations demonstrate that DIFNOs are superior in sample complexity for operator learning and solving infinite-dimensional PDE-constrained inverse problems, achieving high accuracy at low training sample sizes.
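Derivative-informed training is straightforward to sketch with PyTorch's Jacobian-vector products, which supply both outputs and Fréchet-derivative actions; the small MLP and synthetic "high-fidelity" map below stand in for an FNO and a PDE solution operator:

```python
# Sketch: a derivative-informed loss that matches both outputs and
# Jacobian-vector products of a high-fidelity map, the core idea of DIFNO
# training. The MLP and the synthetic truth map are stand-ins.
import torch
from torch.autograd.functional import jvp

torch.manual_seed(0)
truth = lambda u: torch.sin(u).sum(dim=-1, keepdim=True)   # stand-in high-fidelity map
net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    u = torch.randn(32, 8)
    v = torch.randn(32, 8)                                  # random directions
    y, dy = jvp(truth, (u,), (v,))                          # output + derivative samples
    y_hat, dy_hat = jvp(lambda z: net(z), (u,), (v,), create_graph=True)
    loss = ((y_hat - y) ** 2).mean() + ((dy_hat - dy) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final joint loss:", loss.item())
```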
[LG-29] FusAD: Time-Frequency Fusion with Adaptive Denoising for General Time Series Analysis ICDE2026
链接: https://arxiv.org/abs/2512.14078
作者: Da Zhang,Bingyu Li,Zhiyuan Zhao,Feiping Nie,Junyu Gao,Xuelong Li
类目: Machine Learning (cs.LG)
*备注: Paper has been accepted by ICDE2026
Abstract:Time series analysis plays a vital role in fields such as finance, healthcare, industry, and meteorology, underpinning key tasks including classification, forecasting, and anomaly detection. Although deep learning models have achieved remarkable progress in these areas in recent years, constructing an efficient, multi-task compatible, and generalizable unified framework for time series analysis remains a significant challenge. Existing approaches are often tailored to single tasks or specific data types, making it difficult to simultaneously handle multi-task modeling and effectively integrate information across diverse time series types. Moreover, real-world data are often affected by noise, complex frequency components, and multi-scale dynamic patterns, which further complicate robust feature extraction and analysis. To ameliorate these challenges, we propose FusAD, a unified analysis framework designed for diverse time series tasks. FusAD features an adaptive time-frequency fusion mechanism, integrating both Fourier and Wavelet transforms to efficiently capture global-local and multi-scale dynamic features. With an adaptive denoising mechanism, FusAD automatically senses and filters various types of noise, highlighting crucial sequence variations and enabling robust feature extraction in complex environments. In addition, the framework integrates a general information fusion and decoding structure, combined with masked pre-training, to promote efficient learning and transfer of multi-granularity representations. Extensive experiments demonstrate that FusAD consistently outperforms state-of-the-art models on mainstream time series benchmarks for classification, forecasting, and anomaly detection tasks, while maintaining high efficiency and scalability. Code is available at this https URL.
[LG-30] A Deep Dive into Function Inlining and its Security Implications for ML-based Binary Analysis
链接: https://arxiv.org/abs/2512.14045
作者: Omar Abusabha,Jiyong Uhm,Tamer Abuhmed,Hyungjoon Koo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:
Abstract:A function inlining optimization is a widely used transformation in modern compilers, which replaces a call site with the callee's body when needed. While this transformation improves performance, it significantly impacts static features such as machine instructions and control flow graphs, which are crucial to binary analysis. Yet, despite this broad impact, the security implications of function inlining remain underexplored to date. In this paper, we present the first comprehensive study of function inlining through the lens of machine learning-based binary analysis. To this end, we dissect the inlining decision pipeline within LLVM's cost model and explore combinations of compiler options that aggressively promote the function inlining ratio beyond standard optimization levels, which we term extreme inlining. We focus on five ML-assisted binary analysis tasks for security, using 20 unique models to systematically evaluate their robustness under extreme inlining scenarios. Our extensive experiments reveal several significant findings: i) function inlining, though benign in intent, can directly or indirectly affect ML model behaviors and can potentially be exploited to evade discriminative or generative ML models; ii) ML models relying on static features can be highly sensitive to inlining; iii) subtle compiler settings can be leveraged to deliberately craft evasive binary variants; and iv) inlining ratios vary substantially across applications and build configurations, undermining assumptions of consistency in training and evaluation of ML models.
[LG-31] Multivariate Time Series Forecasting with Hybrid Euclidean-SPD Manifold Graph Neural Networks
链接: https://arxiv.org/abs/2512.14023
作者: Yong Fang,Na Li,Hangguan Shan,Eryun Liu,Xinyu Li,Wei Ni,Er-Ping Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multivariate Time Series (MTS) forecasting plays a vital role in various real-world applications, such as traffic management and predictive maintenance. Existing approaches typically model MTS data in either Euclidean or Riemannian space, limiting their ability to capture the diverse geometric structures and complex spatio-temporal dependencies inherent in real-world data. To overcome this limitation, we propose the Hybrid Symmetric Positive-Definite Manifold Graph Neural Network (HSMGNN), a novel graph neural network-based model that captures data geometry within a hybrid Euclidean-Riemannian framework. To the best of our knowledge, this is the first work to leverage hybrid geometric representations for MTS forecasting, enabling expressive and comprehensive modeling of geometric properties. Specifically, we introduce a Submanifold-Cross-Segment (SCS) embedding to project input MTS into both Euclidean and Riemannian spaces, thereby capturing spatio-temporal variations across distinct geometric domains. To alleviate the high computational cost of Riemannian distance, we further design an Adaptive-Distance-Bank (ADB) layer with a trainable memory mechanism. Finally, a Fusion Graph Convolutional Network (FGCN) is devised to integrate features from the dual spaces via a learnable fusion operator for accurate prediction. Experiments on three benchmark datasets demonstrate that HSMGNN achieves up to a 13.8 percent improvement over state-of-the-art baselines in forecasting accuracy.
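As a rough illustration of the Riemannian side, the sketch below computes a log-Euclidean distance between SPD covariance features of two multivariate time-series windows; the metric choice and window shapes are assumptions (the paper's Adaptive-Distance-Bank instead learns to amortize the cost of such computations):

```python
# Sketch: a log-Euclidean distance between SPD covariance features of two
# MTS windows, the kind of Riemannian computation the ADB layer amortizes.
import numpy as np

def spd_feature(window, eps=1e-6):
    """Covariance of a (time, channels) window, regularized to stay SPD."""
    c = np.cov(window, rowvar=False)
    return c + eps * np.eye(c.shape[0])

def spd_log(m):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.T

def log_euclidean(a, b):
    return np.linalg.norm(spd_log(a) - spd_log(b), "fro")

rng = np.random.default_rng(0)
w1 = rng.normal(size=(128, 6))            # two multivariate time-series windows
w2 = 2.0 * rng.normal(size=(128, 6))
print("log-Euclidean SPD distance:", round(log_euclidean(spd_feature(w1), spd_feature(w2)), 3))
```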
[LG-32] EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment
链接: https://arxiv.org/abs/2512.14019
作者: Juseung Yun,Sunwoo Yu,Sumin Ha,Jonghyun Kim,Janghyeon Lee,Jongseong Jang,Soonyoung Lee
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Cancer progression arises from interactions across multiple biological layers, especially beyond morphological and across molecular layers that remain invisible to image-only models. To capture this broader biological landscape, we present EXAONE Path 2.5, a pathology foundation model that jointly models histologic, genomic, epigenetic and transcriptomic modalities, producing an integrated patient representation that reflects tumor biology more comprehensively. Our approach incorporates three key components: (1) multimodal SigLIP loss enabling all-pairwise contrastive learning across heterogeneous modalities, (2) a fragment-aware rotary positional encoding (F-RoPE) module that preserves spatial structure and tissue-fragment topology in WSI, and (3) domain-specialized internal foundation models for both WSI and RNA-seq to provide biologically grounded embeddings for robust multimodal alignment. We evaluate EXAONE Path 2.5 against six leading pathology foundation models across two complementary benchmarks: an internal real-world clinical dataset and the Patho-Bench benchmark covering 80 tasks. Our framework demonstrates high data and parameter efficiency, achieving on-par performance with state-of-the-art foundation models on Patho-Bench while exhibiting the highest adaptability in the internal clinical setting. These results highlight the value of biologically informed multimodal design and underscore the potential of integrated genotype-to-phenotype modeling for next-generation precision oncology.
[LG-33] Accelerating MHC-II Epitope Discovery via Multi-Scale Prediction in Antigen Presentation
链接: https://arxiv.org/abs/2512.14011
作者: Yue Wan,Jiayi Yuan,Zhiwei Feng,Xiaowei Jia
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Antigenic epitope presented by major histocompatibility complex II (MHC-II) proteins plays an essential role in immunotherapy. However, compared to the more widely studied MHC-I in computational immunotherapy, the study of MHC-II antigenic epitope poses significantly more challenges due to its complex binding specificity and ambiguous motif patterns. Consequently, existing datasets for MHC-II interactions are smaller and less standardized than those available for MHC-I. To address these challenges, we present a well-curated dataset derived from the Immune Epitope Database (IEDB) and other public sources. It not only extends and standardizes existing peptide-MHC-II datasets, but also introduces a novel antigen-MHC-II dataset with richer biological context. Leveraging this dataset, we formulate three major machine learning (ML) tasks of peptide binding, peptide presentation, and antigen presentation, which progressively capture the broader biological processes within the MHC-II antigen presentation pathway. We further employ a multi-scale evaluation framework to benchmark existing models, along with a comprehensive analysis over various modeling designs to this problem with a modular framework. Overall, this work serves as a valuable resource for advancing computational immunotherapy, providing a foundation for future research in ML guided epitope discovery and predictive modeling of immune responses.
[LG-34] A Single Architecture for Representing Invariance Under Any Space Group
链接: https://arxiv.org/abs/2512.13989
作者: Cindy Y. Zhang,Elif Ertekin,Peter Orbanz,Ryan P. Adams
类目: Machine Learning (cs.LG)
*备注: 24 pages, 7 figures
Abstract:Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups, symmetries critical to modeling crystalline solids in materials science and condensed matter physics, this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to such crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.
[LG-35] Pattern-Guided Diffusion Models
链接: https://arxiv.org/abs/2512.13945
作者: Vivian Lin,Kuk Jin Jang,Wenwen Si,Insup Lee
类目: Machine Learning (cs.LG)
*备注: Under review
Abstract:Diffusion models have shown promise in forecasting future data from multivariate time series. However, few existing methods account for recurring structures, or patterns, that appear within the data. We present Pattern-Guided Diffusion Models (PGDM), which leverage inherent patterns within temporal data for forecasting future time steps. PGDM first extracts patterns using archetypal analysis and estimates the most likely next pattern in the sequence. By guiding predictions with this pattern estimate, PGDM makes more realistic predictions that fit within the set of known patterns. We additionally introduce a novel uncertainty quantification technique based on archetypal analysis, and we dynamically scale the guidance level based on the pattern estimate uncertainty. We apply our method to two well-motivated forecasting applications, predicting visual field measurements and motion capture frames. On both, we show that pattern guidance improves PGDM’s performance (MAE / CRPS) by up to 40.67% / 56.26% and 14.12% / 14.10%, respectively. PGDM also outperforms baselines by up to 65.58% / 84.83% and 93.64% / 92.55%.
[LG-36] A Complete Guide to Spherical Equivariant Graph Transformers
链接: https://arxiv.org/abs/2512.13927
作者: Sophia Tang
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: This paper is a technical version of the article originally published in Alchemy Bio (99 pages, 46 figures)
Abstract:Spherical equivariant graph neural networks (EGNNs) provide a principled framework for learning on three-dimensional molecular and biomolecular systems, where predictions must respect the rotational symmetries inherent in physics. These models extend traditional message-passing GNNs and Transformers by representing node and edge features as spherical tensors that transform under irreducible representations of the rotation group SO(3), ensuring that predictions change in physically meaningful ways under rotations of the input. This guide develops a complete, intuitive foundation for spherical equivariant modeling - from group representations and spherical harmonics, to tensor products, Clebsch-Gordan decomposition, and the construction of SO(3)-equivariant kernels. Building on this foundation, we construct the Tensor Field Network and SE(3)-Transformer architectures and explain how they perform equivariant message-passing and attention on geometric graphs. Through clear mathematical derivations and annotated code excerpts, this guide serves as a self-contained introduction for researchers and learners seeking to understand or implement spherical EGNNs for applications in chemistry, molecular property prediction, protein structure modeling, and generative modeling.
[LG-37] Sliding Window Recurrences for Sequence Models
链接: https://arxiv.org/abs/2512.13921
作者: Dragos Secrieru,Garyk Brixi,Yoshua Bengio,Taiji Suzuki,Michael Poli,Stefano Massaroli
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-hybrid architectures are poised to take over language modeling due to better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that allows us to develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences. We focus specifically on truncating recurrences to hardware-aligned windows which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B parameter multi-hybrid models, Phalanx achieves over 10-40% speedup across 4K to 32K context length over optimized Transformers while matching perplexity.
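【代码示意】摘要的核心操作是"把线性递推截断到(硬件对齐的)滑动窗口"。下面是一个纯 Python 的朴素参考实现,只用于说明截断语义,与论文面向 GPU 内存层级的 Phalanx kernel 无关;函数与参数名均为演示假设。
```python
import numpy as np

def sliding_window_recurrence(x: np.ndarray, a: float, window: int) -> np.ndarray:
    """截断线性递推的朴素参考实现。

    完整递推为 y_t = a * y_{t-1} + x_t;截断到窗口 W 后,
    y_t = sum_{j=0}^{W-1} a**j * x_{t-j},窗口之外的历史贡献被丢弃。
    """
    y = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(min(window, t + 1)):
            y[t] += (a ** j) * x[t - j]
    return y

x = np.arange(6, dtype=float)
print(sliding_window_recurrence(x, a=0.5, window=3))  # [0. 1. 2.5 4.25 6. 7.75]
```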
[LG-38] Adaptive digital twins for predictive decision-making: Online Bayesian learning of transition dynamics
链接: https://arxiv.org/abs/2512.13919
作者: Eugenio Varetti,Matteo Torzoni,Marco Tezzele,Andrea Manzoni
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from a state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions with respect to the current literature. To compute dynamic policies with precision updates we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.
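【代码示意】摘要提到"把状态转移概率视为带共轭先验的随机变量,从而实现轻量的在线贝叶斯更新"。最常见的情形是 Dirichlet-多项式共轭(论文框架覆盖更一般的分布类,此处仅演示该特例;类名与接口均为编者假设):
```python
import numpy as np

class DirichletTransitionModel:
    """每个状态对应一行 Dirichlet 先验;每观测一次转移只需伪计数 +1。"""

    def __init__(self, n_states: int, prior: float = 1.0):
        # alpha[i, j]:从状态 i 转到状态 j 的 Dirichlet 伪计数
        self.alpha = np.full((n_states, n_states), prior)

    def update(self, i: int, j: int) -> None:
        self.alpha[i, j] += 1.0  # 共轭更新:后验仍是 Dirichlet

    def posterior_mean(self) -> np.ndarray:
        """转移矩阵的后验均值,可直接用于下游的 MDP 求解。"""
        return self.alpha / self.alpha.sum(axis=1, keepdims=True)

m = DirichletTransitionModel(n_states=3)
for i, j in [(0, 1), (0, 1), (1, 2), (0, 2)]:
    m.update(i, j)
print(np.round(m.posterior_mean(), 3))
```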
[LG-39] Capturing reduced-order quantum many-body dynamics out of equilibrium via neural ordinary differential equations
链接: https://arxiv.org/abs/2512.13913
作者: Patrick Egenlauf,Iva Březinová,Sabine Andergassen,Miriam Klopotek
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Quantum Physics (quant-ph)
*备注:
Abstract:Out-of-equilibrium quantum many-body systems exhibit rapid correlation buildup that underlies many emerging phenomena. Exact wave-function methods to describe this scale exponentially with particle number; simpler mean-field approaches neglect essential two-particle correlations. The time-dependent two-particle reduced density matrix (TD2RDM) formalism offers a middle ground by propagating the two-particle reduced density matrix (2RDM) and closing the BBGKY hierarchy with a reconstruction of the three-particle cumulant. But the validity and existence of time-local reconstruction functionals ignoring memory effects remain unclear across different dynamical regimes. We show that a neural ODE model trained on exact 2RDM data (no dimensionality reduction) can reproduce its dynamics without any explicit three-particle information – but only in parameter regions where the Pearson correlation between the two- and three-particle cumulants is large. In the anti-correlated or uncorrelated regime, the neural ODE fails, indicating that no simple time-local functional of the instantaneous two-particle cumulant can capture the evolution. The magnitude of the time-averaged three-particle-correlation buildup appears to be the primary predictor of success: For a moderate correlation buildup, both neural ODE predictions and existing TD2RDM reconstructions are accurate, whereas stronger values lead to systematic breakdowns. These findings pinpoint the need for memory-dependent kernels in the three-particle cumulant reconstruction for the latter regime. Our results place the neural ODE as a model-agnostic diagnostic tool that maps the regime of applicability of cumulant expansion methods and guides the development of non-local closure schemes. More broadly, the ability to learn high-dimensional RDM dynamics from limited data opens a pathway to fast, data-driven simulation of correlated quantum matter.
[LG-40] Measuring Uncertainty Calibration
链接: https://arxiv.org/abs/2512.13872
作者: Kamil Ciosek,Nicolò Felicioni,Sina Ghiassian,Juan Elenter Litwin,Francesco Tonolini,David Gustaffson,Eva Garcia Martin,Carmen Barcena Gonzales,Raphaëlle Bertrand-Lalo
类目: Machine Learning (cs.LG)
*备注: 28 pages
Abstract:We make two contributions to the problem of estimating the L_1 calibration error of a binary classifier from a finite dataset. First, we provide an upper bound for any classifier where the calibration function has bounded variation. Second, we provide a method of modifying any classifier so that its calibration error can be upper bounded efficiently without significantly impacting classifier performance and without any restrictive assumptions. All our results are non-asymptotic and distribution-free. We conclude by providing advice on how to measure calibration error in practice. Our methods yield practical procedures that can be run on real-world datasets with modest overhead.
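【代码示意】作为背景,L_1 校准误差在实践中最常用分箱(binning)估计,也就是常说的 ECE;论文给出的是有限样本上界与修正方法,下面仅演示这一常见基线(并非论文提出的估计量):
```python
import numpy as np

def binned_l1_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """分箱 L_1 校准误差(ECE):各置信度箱内 |平均标签 - 平均预测| 的加权和。"""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(probs)
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # 最后一个箱包含右端点 1.0
        mask = (probs >= lo) & ((probs < hi) if k < n_bins - 1 else (probs <= hi))
        if mask.any():
            ece += mask.sum() / n * abs(labels[mask].mean() - probs[mask].mean())
    return ece

rng = np.random.default_rng(1)
p = rng.uniform(size=2000)
y = rng.uniform(size=2000) < p   # 构造一个完美校准的模拟分类器
print("ECE ≈", round(binned_l1_calibration_error(p, y), 4))  # 应接近 0
```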
[LG-41] Safe Online Control-Informed Learning
链接: https://arxiv.org/abs/2512.13868
作者: Tianyu Zhou,Zihao Liang,Zehui Lu,Shaoshuai Mou
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:This paper proposes a Safe Online Control-Informed Learning framework for safety-critical autonomous systems. The framework unifies optimal control, parameter estimation, and safety constraints into an online learning process. It employs an extended Kalman filter to incrementally update system parameters in real time, enabling robust and data-efficient adaptation under uncertainty. A softplus barrier function enforces constraint satisfaction during learning and control while eliminating the dependence on high-quality initial guesses. Theoretical analysis establishes convergence and safety guarantees, and the framework’s effectiveness is demonstrated on cart-pole and robot-arm systems.
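【代码示意】softplus 障碍函数的要点是:约束满足(g ≤ 0)时惩罚几乎为零,违反约束时惩罚近似线性增长,且在不可行点处依然有定义,因此不依赖高质量初始猜测。下面是一个独立的小演示(惩罚的具体形式与陡峭度系数 beta 为编者假设,非论文原式):
```python
import numpy as np

def softplus_barrier(g: np.ndarray, beta: float = 10.0) -> np.ndarray:
    """softplus(beta * g) / beta = log(1 + exp(beta * g)) / beta。

    g < 0(安全)时趋近 0;g > 0(违反约束)时近似等于 g 本身;
    与对数障碍不同,它在整个实轴上光滑且有限。
    """
    return np.logaddexp(0.0, beta * g) / beta

g = np.linspace(-1.0, 1.0, 5)  # 约束函数值,g <= 0 表示安全
print(np.round(softplus_barrier(g), 4))
```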
[LG-42] Dropout Neural Network Training Viewed from a Percolation Perspective
链接: https://arxiv.org/abs/2512.13853
作者: Finley Devlin,Jaron Sanders
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Probability (math.PR); Machine Learning (stat.ML)
*备注: 22 pages, 14 figures
Abstract:In this work, we investigate the existence and effect of percolation in training deep Neural Networks (NNs) with dropout. Dropout methods are regularisation techniques for training NNs, first introduced by G. Hinton et al. (2012). These methods temporarily remove connections in the NN, randomly at each stage of training, and update the remaining subnetwork with Stochastic Gradient Descent (SGD). The process of removing connections from a network at random is similar to percolation, a paradigm model of statistical physics. If dropout were to remove enough connections such that there is no path between the input and output of the NN, then the NN could not make predictions informed by the data. We study new percolation models that mimic dropout in NNs and characterise the relationship between network topology and this path problem. The theory shows the existence of a percolative effect in dropout. We also show that this percolative effect can cause a breakdown when training NNs without biases with dropout; and we argue heuristically that this breakdown extends to NNs with biases.
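【代码示意】摘要把 dropout 与渗流联系起来:若随机删除的连接切断了输入到输出之间的所有路径,网络就无法基于数据做出预测。下面用蒙特卡洛模拟估计一个分层全连接网络在给定保留概率下仍存在输入-输出路径的概率(这是编者构造的玩具模型,并非论文分析的精确渗流模型):
```python
import numpy as np

def path_survives(layer_sizes, keep_prob, rng) -> bool:
    """逐层传播"可达"标记:节点可达当且仅当存在一条保留的入边来自可达节点。"""
    reachable = np.ones(layer_sizes[0], dtype=bool)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        kept = rng.random((n_in, n_out)) < keep_prob  # dropout 后保留的连接
        reachable = (reachable[:, None] & kept).any(axis=0)
        if not reachable.any():
            return False  # 所有路径被切断,发生"渗流失效"
    return True

rng = np.random.default_rng(0)
sizes = [4, 4, 4, 4, 1]  # 一个窄而深的玩具网络
for p in (0.1, 0.3, 0.5):
    survive = np.mean([path_survives(sizes, p, rng) for _ in range(2000)])
    print(f"keep_prob={p}: 存在输入→输出路径的概率 ≈ {survive:.3f}")
```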
[LG-43] Topologically-Stabilized Graph Neural Networks: Empirical Robustness Across Domains
链接: https://arxiv.org/abs/2512.13852
作者: Jelena Losic
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
Abstract:Graph Neural Networks (GNNs) have become the standard for graph representation learning but remain vulnerable to structural perturbations. We propose a novel framework that integrates persistent homology features with stability regularization to enhance robustness. Building on the stability theorems of persistent homology (Cohen-Steiner et al., 2007), our method combines GIN architectures with multi-scale topological features extracted from persistence images, enforced by Hiraoka-Kusano-inspired stability constraints. Across six diverse datasets spanning biochemical, social, and collaboration networks, our approach demonstrates exceptional robustness to edge perturbations while maintaining competitive accuracy. Notably, we observe minimal performance degradation (0-4% on most datasets) under perturbation, significantly outperforming baseline stability. Our work provides both a theoretically-grounded and empirically-validated approach to robust graph learning that aligns with recent advances in topological regularization.
[LG-44] BiCoRec: Bias-Mitigated Context-Aware Sequential Recommendation Model
链接: https://arxiv.org/abs/2512.13848
作者: Mufhumudzi Muthivhi,Terence L van Zyl,Hairong Wang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Sequential recommendation models aim to learn from users' evolving preferences. However, current state-of-the-art models suffer from an inherent popularity bias. This study developed a novel framework, BiCoRec, that adaptively accommodates users' changing preferences for popular and niche items. Our approach leverages a co-attention mechanism to obtain a popularity-weighted user sequence representation, facilitating more accurate predictions. We then present a new training scheme that learns from future preferences using a consistency loss function. BiCoRec is designed to improve the recommendation performance of users who prefer niche items. For these users, BiCoRec achieves a 26.00% average improvement in NDCG@10 over state-of-the-art baselines. When ranking the relevant item against the entire collection, BiCoRec achieves NDCG@10 scores of 0.0102, 0.0047, 0.0021, and 0.0005 for the Movies, Fashion, Games and Music datasets.
[LG-45] Explainable reinforcement learning from human feedback to improve alignment
链接: https://arxiv.org/abs/2512.13837
作者: Shicheng Liu,Siyuan Xu,Wenjie Qiu,Hangfan Zhang,Minghui Zhu
类目: Machine Learning (cs.LG)
*备注:
Abstract:A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes. Our method has two parts. The first part proposes a post-hoc explanation method to explain why an unsatisfactory response is generated to a prompt by identifying the training data that lead to this response. We formulate this problem as a constrained combinatorial optimization problem where the objective is to find a set of training data closest to this prompt-response pair in a feature representation space, and the constraint is that the prompt-response pair can be decomposed as a convex combination of this set of training data in the feature space. We propose an efficient iterative data selection algorithm to solve this problem. The second part proposes an unlearning method that improves unsatisfactory responses to some prompts by unlearning the training data that lead to these unsatisfactory responses and, meanwhile, does not significantly degrade satisfactory responses to other prompts. Experimental results demonstrate that our algorithm can improve RLHF.
[LG-46] The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces
链接: https://arxiv.org/abs/2512.13821
作者: Subramanyam Sahoo,Jared Junkin
类目: Machine Learning (cs.LG)
*备注: 13 Pages, Initial Work on AI Control. A Preprint
Abstract:Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model’s own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability – adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a scalable, theoretically grounded approach to AI control for code generation tasks.
[LG-47] Constrained Policy Optimization via Sampling-Based Weight-Space Projection
链接: https://arxiv.org/abs/2512.13788
作者: Shengfan Cao,Francesco Borrelli
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Submitted to IFAC World Congress 2026
Abstract:Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy unknown, rollout-based safety constraints. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. Our approach constructs a local safe region by combining trajectory rollouts with smoothness bounds that relate parameter changes to shifts in safety metrics. Each gradient update is then projected via a convex SOCP, producing a safe first-order step. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, our approach further ensures closed-loop stability and enables safe adaptation beyond the conservative backup. On regression with harmful supervision and a constrained double-integrator task with malicious expert, our approach consistently rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful primal objective improvement.
[LG-48] Probabilistic Predictions of Process-Induced Deformation in Carbon/Epoxy Composites Using a Deep Operator Network
链接: https://arxiv.org/abs/2512.13746
作者: Elham Kiyani,Amit Makarand Deshpande,Madhura Limaye,Zhiwei Gao,Sai Aditya Pradeep,Srikanth Pilla,Gang Li,Zhen Li,George Em Karniadakis
类目: Computational Engineering, Finance, and Science (cs.CE); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 21 pages, 13 figures
Abstract:Fiber reinforcement and polymer matrix respond differently to manufacturing conditions due to mismatch in coefficient of thermal expansion and matrix shrinkage during curing of thermosets. These heterogeneities generate residual stresses over multiple length scales, whose partial release leads to process-induced deformation (PID), requiring accurate prediction and mitigation via optimized non-isothermal cure cycles. This study considers a unidirectional AS4 carbon fiber/amine bi-functional epoxy prepreg and models PID using a two-mechanism framework that accounts for thermal expansion/shrinkage and cure shrinkage. The model is validated against manufacturing trials to identify initial and boundary conditions, then used to generate PID responses for a diverse set of non-isothermal cure cycles (time-temperature profiles). Building on this physics-based foundation, we develop a data-driven surrogate based on Deep Operator Networks (DeepONets). A DeepONet is trained on a dataset combining high-fidelity simulations with targeted experimental measurements of PID. We extend this to a Feature-wise Linear Modulation (FiLM) DeepONet, where branch-network features are modulated by external parameters, including the initial degree of cure, enabling prediction of time histories of degree of cure, viscosity, and deformation. Because experimental data are available only at limited time instances (for example, final deformation), we use transfer learning: simulation-trained trunk and branch networks are fixed and only the final layer is updated using measured final deformation. Finally, we augment the framework with Ensemble Kalman Inversion (EKI) to quantify uncertainty under experimental conditions and to support optimization of cure schedules for reduced PID in composites.
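【代码示意】DeepONet 的核心结构是 branch 网络(编码输入函数在固定传感器点的采样)与 trunk 网络(编码查询坐标)输出向量的内积。下面给出一个最小 PyTorch 骨架帮助理解摘要中的代理模型;网络宽度、维度等均为演示假设,且不包含论文的 FiLM 调制与 EKI 不确定性量化部分:
```python
import torch
import torch.nn as nn

class MiniDeepONet(nn.Module):
    """DeepONet 核心:G(u)(y) ≈ <branch(u), trunk(y)>。"""

    def __init__(self, n_sensors: int, y_dim: int, p: int = 32):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, 64), nn.Tanh(), nn.Linear(64, p))
        self.trunk = nn.Sequential(nn.Linear(y_dim, 64), nn.Tanh(), nn.Linear(64, p))

    def forward(self, u: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # u: (batch, n_sensors) 输入函数的采样,例如固化温度曲线
        # y: (batch, y_dim)     查询坐标,例如时间点
        return (self.branch(u) * self.trunk(y)).sum(dim=-1, keepdim=True)

model = MiniDeepONet(n_sensors=100, y_dim=1)
u = torch.randn(8, 100)  # 8 条随机"温度曲线"
y = torch.rand(8, 1)     # 对应的查询时刻
print(model(u, y).shape)  # torch.Size([8, 1])
```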
[LG-49] RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing
链接: https://arxiv.org/abs/2512.13727
作者: Yuhan Tang,Kangxin Cui,Jung Ho Park,Yibo Zhao,Xuan Jiang,Haoze He,Dingyi Zhuang,Shenhao Wang,Jiangbo Yu,Haris Koutsopoulos,Jinhua Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ride-hailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply-demand conditions. Adaptive delayed matching creates a trade-off between matching and pickup delays by deciding whether to assign drivers immediately or batch requests. Since outcomes accumulate over long horizons with stochastic dynamics, reinforcement learning (RL) is a suitable framework. However, existing approaches often oversimplify traffic dynamics or use shallow encoders that miss complex spatiotemporal patterns. We introduce the Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE), which formalizes adaptive delayed matching as a regime-aware MDP equipped with a self-attention MoE encoder. Unlike monolithic networks, our experts specialize automatically, improving representation capacity while maintaining computational efficiency. A physics-informed congestion surrogate preserves realistic density-speed feedback, enabling millions of efficient rollouts, while an adaptive reward scheme guards against pathological strategies. With only 12M parameters, our framework outperforms strong baselines. On real-world Uber trajectory data (San Francisco), it improves total reward by over 13%, reducing average matching and pickup delays by 10% and 15% respectively. It demonstrates robustness across unseen demand regimes and stable training. These findings highlight the potential of MoE-enhanced RL for large-scale decision-making with complex spatiotemporal dynamics.
[LG-50] Prediction of Respiratory Syncytial Virus-Associated Hospitalizations Using Machine Learning Models Based on Environmental Data
链接: https://arxiv.org/abs/2512.13712
作者: Eric Guo
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:
Abstract:Respiratory syncytial virus (RSV) is a leading cause of hospitalization among young children, with outbreaks strongly influenced by environmental conditions. This study developed a machine learning framework to predict RSV-associated hospitalizations in the United States (U.S.) by integrating wastewater surveillance, meteorological, and air quality data. The dataset combined weekly hospitalization rates, wastewater RSV levels, daily meteorological measurements, and air pollutant concentrations. Classification models, including CART, Random Forest, and Boosting, were trained to predict weekly RSV-associated hospitalization rates classified as Low risk, Alert, and Epidemic levels. The wastewater RSV level was identified as the strongest predictor, followed by meteorological and air quality variables such as temperature, ozone levels, and specific humidity. Notably, the analysis also revealed significantly higher RSV-associated hospitalization rates among Native Americans and Alaska Natives. Further research is needed to better understand the drivers of RSV disparity in these communities to improve prevention strategies. Furthermore, states at high altitudes, characterized by lower surface pressure, showed consistently higher RSV-associated hospitalization rates. These findings highlight the value of combining environmental and community surveillance data to forecast RSV outbreaks, enabling more timely public health interventions and resource allocation. In order to provide accessibility and practical use of the models, we have developed an interactive R Shiny dashboard (this https URL), which allows users to explore RSV-associated hospitalization risk levels across different states, visualize the impact of key predictors, and interactively generate RSV outbreak forecasts.
[LG-51] Delete and Retain: Efficient Unlearning for Document Classification
链接: https://arxiv.org/abs/2512.13711
作者: Aadya Goel,Mayuri Sridhar
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures
Abstract:Machine unlearning aims to efficiently remove the influence of specific training data from a model without full retraining. While much progress has been made in unlearning for LLMs, document classification models remain relatively understudied. In this paper, we study class-level unlearning for document classifiers and present Hessian Reassignment, a two-step, model-agnostic solution. First, we perform a single influence-style update that subtracts the contribution of all training points from the target class by solving a Hessian-vector system with conjugate gradients, requiring only gradient and Hessian-vector products. Second, in contrast to common unlearning baselines that randomly reclassify deleted-class samples, we enforce a decision-space guarantee via Top-1 classification. On standard text benchmarks, Hessian Reassignment achieves retained-class accuracy close to full retrain-without-class while running orders of magnitude faster. Additionally, it consistently lowers membership-inference advantage on the removed class, measured with pooled multi-shadow attacks. These results demonstrate a practical, principled path to efficient class unlearning in document classification.
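【代码示意】摘要中"只需梯度与 Hessian-向量积(HVP)、用共轭梯度(CG)求解 Hessian 方程"的影响函数式更新,可用 PyTorch 的二阶自动微分来示意。下面的二次损失与变量均为占位假设,只演示 HVP + CG 这一计算模式,并非论文完整的 Hessian Reassignment 流程:
```python
import torch

def hvp(loss, params, v):
    """Hessian-向量积:对 grad(loss) 与 v 的内积再求导,无需显式构造 Hessian。"""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    gv = (flat * v).sum()
    hv = torch.autograd.grad(gv, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def conjugate_gradient(hvp_fn, b, iters: int = 50, tol: float = 1e-8):
    """用 CG 近似求解 H x = b,只依赖 hvp_fn(v) = H @ v。"""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs = r @ r
    for _ in range(iters):
        Ap = hvp_fn(p)
        alpha = rs / (p @ Ap + 1e-12)
        x, r = x + alpha * p, r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 占位示例:二次损失,Hessian 为 A^T A
A = torch.randn(20, 5)
w = torch.randn(5, requires_grad=True)
loss = 0.5 * ((A @ w) ** 2).sum()
g = torch.autograd.grad(loss, [w], create_graph=True)[0].reshape(-1)
x = conjugate_gradient(lambda v: hvp(loss, [w], v), g.detach())
print(x.shape)  # x ≈ H^{-1} g,影响函数式更新的核心量
```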
[LG-52] Predictive Modeling of Flood-Prone Areas Using SAR and Environmental Variables
链接: https://arxiv.org/abs/2512.13710
作者: Edwin Oluoch Awino,Denis Machanda
类目: Machine Learning (cs.LG)
*备注:
Abstract:Flooding is one of the most destructive natural hazards worldwide, posing serious risks to ecosystems, infrastructure, and human livelihoods. This study combines Synthetic Aperture Radar (SAR) imagery with environmental and hydrological data to model flood susceptibility in the River Nyando watershed, western Kenya. Sentinel-1 dual-polarization SAR data from the May 2024 flood event were processed to produce a binary flood inventory, which served as training data for machine learning (ML) models. Six conditioning factors – slope, elevation, aspect, land use/land cover, soil type, and distance from streams – were integrated with the SAR-derived flood inventory to train four supervised classifiers: Logistic Regression (LR), Classification and Regression Trees (CART), Support Vector Machines (SVM), and Random Forest (RF). Model performance was assessed using accuracy, Cohen’s Kappa, and Receiver Operating Characteristic (ROC) analysis. Results indicate that RF achieved the highest predictive performance (accuracy = 0.762; Kappa = 0.480), outperforming LR, CART, and SVM. The RF-based susceptibility map showed that low-lying Kano Plains near Lake Victoria have the highest flood vulnerability, consistent with historical flood records and the impacts of the May 2024 event. These findings demonstrate the value of combining SAR data and ensemble ML methods for flood susceptibility mapping in regions with limited data. The resulting maps offer important insights for disaster risk reduction, land-use planning, and early warning system development.
[LG-53] Smart Surveillance: Identifying IoT Device Behaviours using ML-Powered Traffic Analysis
链接: https://arxiv.org/abs/2512.13709
作者: Reza Ryan,Napoleon Paciente,Cahil Youngs,Nickson Karie,Qian Li,Nasim Ferdosian
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 1 figures, conference
Abstract:The proliferation of Internet of Things (IoT) devices has grown exponentially in recent years, introducing significant security challenges. Accurate identification of the types of IoT devices and their associated actions through network traffic analysis is essential to mitigate potential threats. By monitoring and analysing packet flows between IoT devices and connected networks, anomalous or malicious behaviours can be detected. Existing research focuses primarily on device identification within local networks using methods such as protocol fingerprinting and wireless frequency scanning. However, these approaches are limited in their ability to monitor or classify IoT devices externally. To address this gap, we investigate the use of machine learning (ML) techniques, specifically Random Forest (RF), Multilayer Perceptron (MLP), and K-Nearest Neighbours (KNN), in conjunction with targeted network traffic monitoring to classify IoT device types and their actions. We constructed a testbed comprising an NPAT-enabled router and a diverse set of IoT devices, including smart cameras, controller hubs, home appliances, power controllers, and streaming devices. Experimental results demonstrate that IoT device and action recognition is feasible using our proposed ML-driven approach, with the RF classifier achieving the highest accuracy of 91%, while the MLP recorded the lowest accuracy at 56%. Notably, all device categories were successfully classified except for certain actions associated with security cameras, underscoring both the potential and the limitations of the proposed method.
[LG-54] Variational Physics-Informed Ansatz for Reconstructing Hidden Interaction Networks from Steady States
链接: https://arxiv.org/abs/2512.13708
作者: Kaiming Luo
类目: Machine Learning (cs.LG)
*备注:
Abstract:The interaction structure of a complex dynamical system governs its collective behavior, yet existing reconstruction methods struggle with nonlinear, heterogeneous, and higher-order couplings, especially when only steady states are observable. We propose a Variational Physics-Informed Ansatz (VPIA) that infers general interaction operators directly from heterogeneous steady-state data. VPIA embeds the steady-state constraints of the dynamics into a differentiable variational representation and reconstructs the underlying couplings by minimizing a physics-derived steady-state residual, without requiring temporal trajectories, derivative estimation, or supervision. Residual sampling combined with natural-gradient optimization enables scalable learning of large and higher-order networks. Across diverse nonlinear systems, VPIA accurately recovers directed, weighted, and multi-body structures under substantial noise, providing a unified and robust framework for physics-constrained inference of complex interaction networks in settings where only snapshot observations are available.
[LG-55] Sim2Real Reinforcement Learning for Soccer skills
链接: https://arxiv.org/abs/2512.12437
作者: Jonathan Spraggett
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Undergrad Thesis
Abstract:This thesis work presents a more efficient and effective approach to training control-related tasks for humanoid robots using Reinforcement Learning (RL). Traditional RL methods are limited in adapting to real-world environments, complexity, and natural motions, but the proposed approach overcomes these limitations by using curriculum training and the Adversarial Motion Priors (AMP) technique. The results show that the developed RL policies for kicking, walking, and jumping are more dynamic and adaptive, and outperform previous methods. However, the transfer of the learned policy from simulation to the real world was unsuccessful, highlighting the limitations of current RL methods in fully adapting to real-world scenarios.
[LG-56] LLmFPCA-detect: LLM-powered Multivariate Functional PCA for Anomaly Detection in Sparse Longitudinal Texts
链接: https://arxiv.org/abs/2512.14604
作者: Prasanjit Dubey,Aritra Guha,Zhengyi Zhou,Qiong Wu,Xiaoming Huo,Paromita Dubey
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Sparse longitudinal (SL) textual data arises when individuals generate text repeatedly over time (e.g., customer reviews, occasional social media posts, electronic medical records across visits), but the frequency and timing of observations vary across individuals. These complex textual data sets have immense potential to inform future policy and targeted recommendations. However, because SL text data lack dedicated methods and are noisy, heterogeneous, and prone to anomalies, detecting and inferring key patterns is challenging. We introduce LLmFPCA-detect, a flexible framework that pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in large SL text datasets. First, LLmFPCA-detect embeds each piece of text into an application-specific numeric space using LLM prompts. Sparse multivariate functional principal component analysis (mFPCA) conducted in the numeric space forms the workhorse to recover primary population characteristics, and produces subject-level scores which, together with baseline static covariates, facilitate data segmentation, unsupervised anomaly detection and inference, and enable other downstream tasks. In particular, we leverage LLMs to perform dynamic keyword profiling guided by the data segments and anomalies discovered by LLmFPCA-detect, and we show that cluster-specific functional PC scores from LLmFPCA-detect, used as features in existing pipelines, help boost prediction performance. We support the stability of LLmFPCA-detect with experiments and evaluate it on two different applications using public datasets, Amazon customer-review trajectories, and Wikipedia talk-page comment streams, demonstrating utility across domains and outperforming state-of-the-art baselines.
[LG-57] Pattern Recognition of Aluminium Arbitrage in Global Trade Data
链接: https://arxiv.org/abs/2512.14410
作者: Muhammad Sukri Bin Ramli
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注:
Abstract:As the global economy transitions toward decarbonization, the aluminium sector has become a focal point for strategic resource management. While policies such as the Carbon Border Adjustment Mechanism (CBAM) aim to reduce emissions, they have inadvertently widened the price arbitrage between primary metal, scrap, and semi-finished goods, creating new incentives for market optimization. This study presents a unified, unsupervised machine learning framework to detect and classify emerging trade anomalies within UN Comtrade data (2020 to 2024). Moving beyond traditional rule-based monitoring, we apply a four-layer analytical pipeline utilizing Forensic Statistics, Isolation Forests, Network Science, and Deep Autoencoders. Contrary to the hypothesis that Sustainability Arbitrage would be the primary driver, empirical results reveal a contradictory and more severe phenomenon of Hardware Masking. Illicit actors exploit bi-directional tariff incentives by misclassifying scrap as high-count heterogeneous goods to justify extreme unit-price outliers of 160/kg, a 1,900% markup indicative of Trade-Based Money Laundering (TBML) rather than commercial arbitrage. Topologically, risk is not concentrated in major exporters but in high-centrality Shadow Hubs that function as pivotal nodes for illicit rerouting. These actors execute a strategy of Void-Shoring, systematically suppressing destination data to Unspecified Code to fracture mirror statistics and sever forensic trails. Validated by SHAP (Shapley Additive Explanations), the results confirm that price deviation is the dominant predictor of anomalies, necessitating a paradigm shift in customs enforcement from physical volume checks to dynamic, algorithmic valuation auditing.
[LG-58] From STLS to Projection-based Dictionary Selection in Sparse Regression for System Identification
链接: https://arxiv.org/abs/2512.14404
作者: Hangjun Cho,Fabio V.G. Amaral,Andrei A. Klishin,Cassio M. Oishi,Steven L. Brunton
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)
*备注: 34 pages, 11 figures
Abstract:In this work, we revisit dictionary-based sparse regression, in particular, Sequential Threshold Least Squares (STLS), and propose a score-guided library selection to provide practical guidance for data-driven modeling, with emphasis on SINDy-type algorithms. STLS is an algorithm to solve the \ell_0 sparse least-squares problem, which relies on splitting to efficiently solve the least-squares portion while handling the sparse term via proximal methods. It produces coefficient vectors whose components depend on both the projected reconstruction errors, here referred to as the scores, and the mutual coherence of dictionary terms. The first contribution of this work is a theoretical analysis of the score and dictionary-selection strategy. This could be understood in both the original and weak SINDy regime. Second, numerical experiments on ordinary and partial differential equations highlight the effectiveness of score-based screening, improving both accuracy and interpretability in dynamical system identification. These results suggest that integrating score-guided methods to refine the dictionary more accurately may help SINDy users in some cases to enhance their robustness for data-driven discovery of governing equations.
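【代码示意】STLS(SINDy 的核心求解器)本身只有几行:交替执行最小二乘与小系数硬阈值。下面是标准 STLS 的最小示意实现(论文新提出的评分引导字典筛选不在此演示之列):
```python
import numpy as np

def stls(Theta: np.ndarray, y: np.ndarray, threshold: float, iters: int = 10) -> np.ndarray:
    """Sequential Threshold Least Squares:交替最小二乘与小系数置零。"""
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            # 只在保留的字典列上重新做最小二乘
            xi[big] = np.linalg.lstsq(Theta[:, big], y, rcond=None)[0]
    return xi

# 示例:y 只由字典的第 0、2 列生成,STLS 应剔除其余列
rng = np.random.default_rng(0)
Theta = rng.normal(size=(200, 5))
y = 2.0 * Theta[:, 0] - 1.5 * Theta[:, 2] + 0.01 * rng.normal(size=200)
print(np.round(stls(Theta, y, threshold=0.1), 3))  # 约 [2, 0, -1.5, 0, 0]
```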
[LG-59] Continual Learning at the Edge: An Agnostic IIoT Architecture
链接: https://arxiv.org/abs/2512.14311
作者: Pablo García-Santaclara,Bruno Fernández-Castro,Rebeca P. Díaz-Redondo,Carlos Calvo-Moa,Henar Mariño-Bodelón
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The exponential growth of Internet-connected devices has presented challenges to traditional centralized computing systems due to latency and bandwidth limitations. Edge computing has evolved to address these difficulties by bringing computations closer to the data source. Additionally, traditional machine learning algorithms are not suitable for edge-computing systems, where data usually arrives in a dynamic and continual way. However, incremental learning offers a good solution for these settings. We introduce a new approach that applies the incremental learning philosophy within an edge-computing scenario for the industrial sector with a specific purpose: real time quality control in a manufacturing system. Applying continual learning we reduce the impact of catastrophic forgetting and provide an efficient and effective solution.
[LG-60] Improving the Accuracy of Amortized Model Comparison with Self-Consistency
链接: https://arxiv.org/abs/2512.14308
作者: Šimon Kucharský,Aayush Mishra,Daniel Habermann,Stefan T. Radev,Paul-Christian Bürkner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 17 pages, 9 figures
Abstract:Amortized Bayesian inference (ABI) offers fast, scalable approximations to posterior densities by training neural surrogates on data simulated from the statistical model. However, ABI methods are highly sensitive to model misspecification: when observed data fall outside the training distribution (generative scope of the statistical models), neural surrogates can behave unpredictably. This makes it a challenge in a model comparison setting, where multiple statistical models are considered, of which at least some are misspecified. Recent work on self-consistency (SC) provides a promising remedy to this issue, accessible even for empirical data (without ground-truth labels). In this work, we investigate how SC can improve amortized model comparison conceptualized in four different ways. Across two synthetic and two real-world case studies, we find that approaches for model comparison that estimate marginal likelihoods through approximate parameter posteriors consistently outperform methods that directly approximate model evidence or posterior model probabilities. SC training improves robustness when the likelihood is available, even under severe model misspecification. The benefits of SC for methods without access to analytic likelihoods are more limited and inconsistent. Our results suggest practical guidance for reliable amortized Bayesian model comparison: prefer parameter posterior-based methods and augment them with SC training on empirical datasets to mitigate extrapolation bias under model misspecification.
[LG-61] Weighted Conformal Prediction Provides Adaptive and Valid Mask-Conditional Coverage for General Missing Data Mechanisms
链接: https://arxiv.org/abs/2512.14221
作者: Jiarong Fan,Juhyun Park,Thi Phuong Thuy Vo,Nicolas Brunel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Conformal prediction (CP) offers a principled framework for uncertainty quantification, but it fails to guarantee coverage when faced with missing covariates. In addressing the heterogeneity induced by various missing patterns, Mask-Conditional Valid (MCV) Coverage has emerged as a more desirable property than Marginal Coverage. In this work, we adapt split CP to handle missing values by proposing a preimpute-mask-then-correct framework that can offer valid coverage. We show that our method provides guaranteed Marginal Coverage and Mask-Conditional Validity for general missing data mechanisms. A key component of our approach is a reweighted conformal prediction procedure that corrects the prediction sets after distributional imputation (multiple imputation) of the calibration dataset, making our method compatible with standard imputation pipelines. We derive two algorithms, and we show that they are approximately marginally valid and MCV. We evaluate them on synthetic and real-world datasets, where they significantly reduce the width of prediction intervals compared with standard MCV methods while maintaining the target guarantees.
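【代码示意】加权 split CP 的关键一步是对校准分数取"加权保形分位数"。下面给出该步骤的通用示意;权重如何由缺失机制/多重插补得到属于论文的设定,这里作为占位输入,等权时退化为普通 split CP:
```python
import numpy as np

def weighted_conformal_quantile(scores, weights, w_test, alpha: float = 0.1) -> float:
    """加权 split CP:返回使加权覆盖达到 1-alpha 的校准分数分位数。

    scores: 校准集的非一致性分数(如 |y - y_hat|);
    weights: 各校准点的权重;w_test: 测试点自身的权重(视作 +inf 处的质量)。
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    p = np.cumsum(w) / (w.sum() + w_test)  # 含测试点权重的归一化累积分布
    idx = np.searchsorted(p, 1.0 - alpha)
    return s[idx] if idx < len(s) else np.inf  # 权重不足时区间退化为全集

rng = np.random.default_rng(0)
scores = np.abs(rng.normal(size=500))
weights = np.ones(500)  # 等权即普通 split CP
q = weighted_conformal_quantile(scores, weights, w_test=1.0, alpha=0.1)
print("预测区间半宽 ≈", round(q, 3))
```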
[LG-62] Physics-Informed Machine Learning for Two-Phase Moving-Interface and Stefan Problems
链接: https://arxiv.org/abs/2512.14010
作者: Che-Chia Chang,Te-Sheng Lin,Ming-Chih Lai
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:
Abstract:The Stefan problem is a classical free-boundary problem that models phase-change processes and poses computational challenges due to its moving interface and nonlinear temperature-phase coupling. In this work, we develop a physics-informed neural network framework for solving two-phase Stefan problems. The proposed method explicitly tracks the interface motion and enforces the discontinuity in the temperature gradient across the interface while maintaining global consistency of the temperature field. Our approach employs two neural networks: one representing the moving interface and the other for the temperature field. The interface network allows rapid categorization of thermal diffusivity in the spatial domain, which is a crucial step for selecting training points for the temperature network. The temperature network’s input is augmented with a modified zero-level set function to accurately capture the jump in its normal derivative across the interface. Numerical experiments on two-phase dynamical Stefan problems demonstrate that our proposed method achieves superior accuracy and effectiveness compared with other neural-network methodologies in the literature. The results indicate that the proposed framework offers a robust and flexible alternative to traditional numerical methods for solving phase-change problems governed by moving boundaries. In addition, the proposed method can capture an unstable interface evolution associated with the Mullins-Sekerka instability.
[LG-63] On the Hardness of Conditional Independence Testing In Practice NEURIPS2025
链接: https://arxiv.org/abs/2512.14000
作者: Zheng He,Roman Pogodin,Yazhe Li,Namrata Deka,Arthur Gretton,Danica J. Sutherland
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Published at NeurIPS 2025: this https URL
Abstract:Tests of conditional independence (CI) underpin a number of important problems in machine learning and statistics, from causal discovery to evaluation of predictor fairness and out-of-distribution robustness. Shah and Peters (2020) showed that, contrary to the unconditional case, no universally finite-sample valid test can ever achieve nontrivial power. While informative, this result (based on “hiding” dependence) does not seem to explain the frequent practical failures observed with popular CI tests. We investigate the Kernel-based Conditional Independence (KCI) test - of which we show the Generalized Covariance Measure underlying many recent tests is nearly a special case - and identify the major factors underlying its practical behavior. We highlight the key role of errors in the conditional mean embedding estimate for the Type-I error, while pointing out the importance of selecting an appropriate conditioning kernel (not recognized in previous work) as being necessary for good test power but also tending to inflate Type-I error.
[LG-64] Maximum Mean Discrepancy with Unequal Sample Sizes via Generalized U-Statistics
链接: https://arxiv.org/abs/2512.13997
作者: Aaron Wei,Milad Jalali,Danica J. Sutherland
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require discarding valuable data, unnecessarily reducing test power. We address this long-standing limitation by extending the theory of generalized U-statistics and applying it to the usual MMD estimator, resulting in new characterization of the asymptotic distributions of the MMD estimator with unequal sample sizes (particularly outside the proportional regimes required by previous partial results). This generalization also provides a new criterion for optimizing the power of an MMD test with unequal sample sizes. Our approach preserves all available data, enhancing test accuracy and applicability in realistic settings. Along the way, we give much cleaner characterizations of the variance of MMD estimators, revealing something that might be surprising to those in the area: while zero MMD implies a degenerate estimator, it is sometimes possible to have a degenerate estimator with nonzero MMD as well; we give a construction and a proof that it does not happen in common situations.
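【代码示意】不等样本量下的无偏 MMD^2 估计量定义本身很直接(XX、YY 项去掉对角即为 U-统计量形式,n ≠ m 不受影响),论文的贡献在于刻画其渐近分布与方差。下面是该估计量的标准 NumPy 实现,核函数与带宽为演示选择:
```python
import numpy as np

def rbf_kernel(X, Y, gamma: float = 1.0) -> np.ndarray:
    """高斯(RBF)核矩阵 k(x, y) = exp(-gamma * ||x - y||^2)。"""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2_unbiased(X, Y, gamma: float = 1.0) -> float:
    """无偏 MMD^2 估计:XX、YY 项去对角(U-统计量),允许 n != m。"""
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))   # n = 300
Y = rng.normal(0.5, 1.0, size=(120, 2))   # m = 120,刻意取不等样本量
print("MMD^2 ≈", round(mmd2_unbiased(X, Y), 4))
```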
[LG-65] Group-Theoretic Reinforcement Learning of Dynamical Decoupling Sequences
链接: https://arxiv.org/abs/2512.13890
作者: Charles Marrder,Shuo Sun,Murray J. Holland
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Dynamical decoupling seeks to mitigate phase decoherence in qubits by applying a carefully designed sequence of effectively instantaneous electromagnetic pulses. Although analytic solutions exist for pulse timings that are optimal under specific noise regimes, identifying the optimal timings for a realistic noise spectrum remains challenging. We propose a reinforcement learning (RL)-based method for designing pulse sequences on qubits. Our novel action set enables the RL agent to efficiently navigate this inherently non-convex optimization landscape. The action set, derived from Thompson’s group F , is applicable to a broad class of sequential decision problems whose states can be represented as bounded sequences. We demonstrate that our RL agent can learn pulse sequences that minimize dephasing without requiring explicit knowledge of the underlying noise spectrum. This work opens the possibility for real-time learning of optimal dynamical decoupling sequences on qubits which are dephasing-limited. The model-free nature of our algorithm suggests that the agent may ultimately learn optimal pulse sequences even in the presence of unmodeled physical effects, such as pulse errors or non-Gaussian noise.
[LG-66] Simultaneous and Proportional Finger Motion Decoding Using Spatial Features from High-Density Surface Electromyography
链接: https://arxiv.org/abs/2512.13870
作者: Ricardo Gonçalves Molinari,Leonardo Abdala Elias
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 39 pages, 13 figures, 2 tables
Abstract:Restoring natural and intuitive hand function requires simultaneous and proportional control (SPC) of multiple degrees of freedom (DoFs). This study systematically evaluated the multichannel linear descriptors-based block field method (MLD-BFM) for continuous decoding of five finger-joint DoFs by leveraging the rich spatial information of high-density surface electromyography (HD sEMG). Twenty-one healthy participants performed dynamic sinusoidal finger movements while HD sEMG signals were recorded from the extensor digitorum communis (EDC) and flexor digitorum superficialis (FDS) muscles. MLD-BFM extracted region-specific spatial features, including effective field strength ( \Sigma ), field-strength variation rate ( \Phi ), and spatial complexity ( \Omega ). Model performance was optimized (block size: 2 \times 2 ; window: 0.15 s) and compared with conventional time-domain features and dimensionality reduction approaches when applied to multi-output regression models. MLD-BFM consistently achieved the highest \mathrm{R}^2_{\mathrm{vw}} values across all models. The multilayer perceptron (MLP) combined with MLD-BFM yielded the best performance ( \mathrm{R}^2_{\mathrm{vw}} = 86.68% \pm 0.33 ). Time-domain features also showed strong predictive capability and were statistically comparable to MLD-BFM in some models, whereas dimensionality reduction techniques exhibited lower accuracy. Decoding accuracy was higher for the middle and ring fingers than for the thumb. Overall, MLD-BFM improved continuous finger movement decoding accuracy, underscoring the importance of taking advantage of the spatial richness of HD sEMG. These findings suggest that spatially structured features enhance SPC and provide practical guidance for designing robust, real-time, and responsive myoelectric interfaces.
[LG-67] Unreasonable effectiveness of unsupervised learning in identifying Majorana topology
链接: https://arxiv.org/abs/2512.13825
作者: Jacob Taylor,Haining Pan,Sankar Das Sarma
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures
Abstract:In unsupervised learning, the training data for deep learning does not come with any labels, thus forcing the algorithm to discover hidden patterns in the data for discerning useful information. This, in principle, could be a powerful tool in identifying topological order since topology does not always manifest in obvious physical ways (e.g., topological superconductivity) for its decisive confirmation. The problem, however, is that unsupervised learning is a difficult challenge, necessitating huge computing resources, which may not always work. In the current work, we combine unsupervised and supervised learning using an autoencoder to establish that unlabeled data in the Majorana splitting in realistic short disordered nanowires may enable not only a distinction between 'topological' and 'trivial', but also where their crossover happens in the relevant parameter space. This may be a useful tool in identifying topology in Majorana nanowires.
[LG-68] Modular connectivity in neural networks emerges from Poisson noise-motivated regularisation and promotes robustness and compositional generalisation
链接: https://arxiv.org/abs/2512.13707
作者: Daoyuan Qian,Qiyao Liang,Ila Fiete
类目: Biological Physics (physics.bio-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:
Abstract:Circuits in the brain commonly exhibit modular architectures that factorise complex tasks, resulting in the ability to compositionally generalise and reduce catastrophic forgetting. In contrast, artificial neural networks (ANNs) appear to mix all processing, because modular solutions are difficult to find as they are vanishing subspaces in the space of possible solutions. Here, we draw inspiration from fault-tolerant computation and the Poisson-like firing of real neurons to show that activity-dependent neural noise, combined with nonlinear neural responses, drives the emergence of solutions that reflect an accurate understanding of modular tasks, corresponding to acquisition of a correct world model. We find that noise-driven modularisation can be recapitulated by a deterministic regulariser that multiplicatively combines weights and activations, revealing rich phenomenology not captured in linear networks or by standard regularisation methods. Though the emergence of modular structure requires sufficiently many training samples (exponential in the number of modular task dimensions), we show that pre-modularised ANNs exhibit superior noise-robustness and the ability to generalise and extrapolate well beyond training data, compared to ANNs without such inductive biases. Together, our work demonstrates a regulariser and architectures that could encourage modularity emergence to yield functional benefits.
信息检索
[IR-0] Pairwise Comparison for Bias Identification and Quantification
链接: https://arxiv.org/abs/2512.14565
作者: Fabian Haak,Philipp Schaer
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Linguistic bias in online news and social media is widespread but difficult to measure. Yet, its identification and quantification remain difficult due to subjectivity, context dependence, and the scarcity of high-quality gold-label datasets. We aim to reduce annotation effort by leveraging pairwise comparison for bias annotation. To overcome the costliness of the approach, we evaluate more efficient implementations of pairwise comparison-based rating. We achieve this by investigating the effects of various rating techniques and the parameters of three cost-aware alternatives in a simulation environment. Since the approach can in principle be applied to both human and large language model annotation, our work provides a basis for creating high-quality benchmark datasets and for quantifying biases and other subjective linguistic aspects. The controlled simulations include latent severity distributions, distance-calibrated noise, and synthetic annotator bias to probe robustness and cost-quality trade-offs. In applying the approach to human-labeled bias benchmark datasets, we then evaluate the most promising setups and compare them to direct assessment by large language models and unmodified pairwise comparison labels as baselines. Our findings support the use of pairwise comparison as a practical foundation for quantifying subjective linguistic aspects, enabling reproducible bias analysis. We contribute an optimization of comparison and matchmaking components, an end-to-end evaluation including simulation and real-data application, and an implementation blueprint for cost-aware large-scale annotation.
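【代码示意】从成对比较结果恢复连续的"偏见强度"分值,常用 Bradley-Terry 这类潜变量模型;下面是一个极简的极大似然拟合示意(与论文评估的具体打分/匹配策略无关,数据为模拟,函数名为编者假设):
```python
import numpy as np

def fit_bradley_terry(n_items, comparisons, iters=200, lr=0.5):
    """梯度上升拟合 Bradley-Terry:P(i 胜 j) = sigmoid(s_i - s_j)。

    comparisons: (winner, loser) 对的列表,例如"文本 i 比 j 更有偏见"。
    """
    s = np.zeros(n_items)
    for _ in range(iters):
        grad = np.zeros(n_items)
        for i, j in comparisons:
            p = 1.0 / (1.0 + np.exp(-(s[i] - s[j])))
            grad[i] += 1.0 - p
            grad[j] -= 1.0 - p
        s += lr * grad / len(comparisons)
        s -= s.mean()  # 分值仅在平移下可辨识,固定均值为 0
    return s

# 模拟:真实偏见强度排序为 0 < 1 < 2,比较结果带噪声
rng = np.random.default_rng(0)
truth = np.array([-1.0, 0.0, 1.0])
pairs = []
for _ in range(300):
    i, j = rng.choice(3, size=2, replace=False)
    win_prob = 1.0 / (1.0 + np.exp(-(truth[i] - truth[j])))
    pairs.append((i, j) if rng.random() < win_prob else (j, i))
print(np.round(fit_bradley_terry(3, pairs), 2))  # 应能大致恢复 truth 的排序
```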
[IR-1] PushGen: Push Notifications Generation with LLM WSDM2026
链接: https://arxiv.org/abs/2512.14490
作者: Shifu Bie,Jiangxia Cao,Zixiao Luo,Yichuan Zou,Lei Liang,Lu Zhang,Linxun Chen,Zhaojie Liu,Xuanping Li,Guorui Zhou,Kaiqiao Zhan,Kun Gai
类目: Information Retrieval (cs.IR)
*备注: Accepted by WSDM 2026
Abstract:We present PushGen, an automated framework for generating high-quality push notifications comparable to human-crafted content. With the rise of generative models, there is growing interest in leveraging LLMs for push content generation. Although LLMs make content generation straightforward and cost-effective, maintaining stylistic control and reliable quality assessment remains challenging, as both directly impact user engagement. To address these issues, PushGen combines two key components: (1) a controllable category prompt technique to guide LLM outputs toward desired styles, and (2) a reward model that ranks and selects generated candidates. Extensive offline and online experiments demonstrate its effectiveness, which has been deployed in large-scale industrial applications, serving hundreds of millions of users daily.
[IR-2] Dynamic Context Selection for Retrieval-Augmented Generation: Mitigating Distractors and Positional Bias
Link: https://arxiv.org/abs/2512.14313
Authors: Malika Iratni, Mohand Boughanem, Taoufiq Dkaki
Subjects: Information Retrieval (cs.IR)
Notes:
Abstract:Retrieval-Augmented Generation (RAG) enhances language model performance by incorporating external knowledge retrieved from large corpora, which makes it highly suitable for tasks such as open-domain question answering. Standard RAG systems typically rely on a fixed top-k retrieval strategy, which can either miss relevant information or introduce semantically irrelevant passages, known as distractors, that degrade output quality. Additionally, the position of retrieved passages within the input context can influence the model's attention and generation outcomes: context placed in the middle tends to be overlooked, an issue known as the “lost in the middle” phenomenon. In this work, we systematically analyze the impact of distractors on generation quality and quantify their effects under varying conditions. We also investigate how the position of relevant passages within the context window affects their influence on generation. Building on these insights, we propose a context-size classifier that dynamically predicts the optimal number of documents to retrieve based on query-specific informational needs. We integrate this approach into a full RAG pipeline and demonstrate improved performance over fixed-k baselines.
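One way such a context-size classifier could be wired up, sketched with scikit-learn; the features, labels, and retriever interface below are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training pairs: each query labelled with the smallest k
# that sufficed to answer it in offline RAG runs.
queries = ["who wrote hamlet", "compare retrieval strategies for long documents"]
best_k = [1, 5]

k_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
k_clf.fit(queries, best_k)

def retrieve(query, retriever):
    k = int(k_clf.predict([query])[0])       # query-specific context size
    return retriever.search(query, top_k=k)  # hypothetical retriever API
```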
[IR-3] AsarRec: Adaptive Sequential Augmentation for Robust Self-supervised Sequential Recommendation
Link: https://arxiv.org/abs/2512.14047
Authors: Kaike Zhang, Qi Cao, Fei Sun, Xinran Liu
Subjects: Information Retrieval (cs.IR)
Notes:
Abstract:Sequential recommender systems have demonstrated strong capabilities in modeling users’ dynamic preferences and capturing item transition patterns. However, real-world user behaviors are often noisy due to factors such as human errors, uncertainty, and behavioral ambiguity, which can lead to degraded recommendation performance. To address this issue, recent approaches widely adopt self-supervised learning (SSL), particularly contrastive learning, by generating perturbed views of user interaction sequences and maximizing their mutual information to improve model robustness. However, these methods heavily rely on pre-defined static augmentation strategies (where the augmentation type remains fixed once chosen) to construct augmented views, leading to two critical challenges: (1) the optimal augmentation type can vary significantly across different scenarios; (2) inappropriate augmentations may even degrade recommendation performance, limiting the effectiveness of SSL. To overcome these limitations, we propose an adaptive augmentation framework. We first unify existing basic augmentation operations into a unified formulation via structured transformation matrices. Building on this, we introduce AsarRec (Adaptive Sequential Augmentation for Robust Sequential Recommendation), which learns to generate transformation matrices by encoding user sequences into probabilistic transition matrices and projecting them into hard semi-doubly stochastic matrices via a differentiable Semi-Sinkhorn algorithm. To ensure that the learned augmentations benefit downstream performance, we jointly optimize three objectives: diversity, semantic invariance, and informativeness. Extensive experiments on three benchmark datasets under varying noise levels validate the effectiveness of AsarRec, demonstrating its superior robustness and consistent improvements.
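For intuition, here is a log-domain Sinkhorn normalisation of the kind the abstract alludes to; AsarRec's semi-doubly stochastic variant relaxes one of the two constraints, and this sketch shows only the fully doubly stochastic case:

```python
import torch

def sinkhorn(logits, n_iters=20, tau=0.5):
    """Differentiable projection toward a doubly stochastic matrix.

    Alternately normalises rows and columns of exp(logits / tau) in the
    log domain; the result is a soft, differentiable relaxation of a
    permutation that can reorder (or, in a semi-relaxed variant, drop)
    items in a user interaction sequence.
    """
    log_p = logits / tau
    for _ in range(n_iters):
        log_p = log_p - log_p.logsumexp(dim=-1, keepdim=True)  # rows sum to 1
        log_p = log_p - log_p.logsumexp(dim=-2, keepdim=True)  # columns sum to 1
    return log_p.exp()
```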
[IR-4] From Feature Interaction to Feature Generation: A Generative Paradigm of CTR Prediction Models
Link: https://arxiv.org/abs/2512.14041
Authors: Mingjia Yin, Junwei Pan, Hao Wang, Ximei Wang, Shangyu Zhang, Jie Jiang, Defu Lian, Enhong Chen
Subjects: Information Retrieval (cs.IR)
Notes:
Abstract:Click-Through Rate (CTR) prediction, a core task in recommendation systems, aims to estimate the probability of users clicking on items. Existing models predominantly follow a discriminative paradigm, which relies heavily on explicit interactions between raw ID embeddings. However, this paradigm inherently renders them susceptible to two critical issues: embedding dimensional collapse and information redundancy, stemming from the over-reliance on feature interactions over raw ID embeddings. To address these limitations, we propose a novel Supervised Feature Generation (SFG) framework, shifting the paradigm from “discriminative feature interaction” to “generative feature generation”. Specifically, SFG comprises two key components: an Encoder that constructs hidden embeddings for each feature, and a Decoder tasked with regenerating the feature embeddings of all features from these hidden representations. Unlike existing generative approaches that adopt self-supervised losses, we introduce a supervised loss to utilize the supervised signal, i.e., click or not, in the CTR prediction task. This framework exhibits strong generalizability: it can be seamlessly integrated with most existing CTR models, reformulating them under the generative paradigm. Extensive experiments demonstrate that SFG consistently mitigates embedding collapse and reduces information redundancy, while yielding substantial performance gains across various datasets and base models. The code is available at this https URL.
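A minimal reading of the encoder/decoder split, with the click label driving the loss; the layer shapes and loss wiring here are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SFGHead(nn.Module):
    """Encode per-feature embeddings into hidden representations,
    regenerate all feature embeddings from them, and predict the click
    from the regenerated features (supervised, not self-supervised)."""

    def __init__(self, n_fields, dim):
        super().__init__()
        width = n_fields * dim
        self.encoder = nn.Linear(width, width)
        self.decoder = nn.Linear(width, width)
        self.click_head = nn.Linear(width, 1)

    def forward(self, feat_emb):                   # [batch, n_fields, dim]
        hidden = torch.relu(self.encoder(feat_emb.flatten(1)))
        regen = self.decoder(hidden)               # regenerated feature embeddings
        return self.click_head(regen).squeeze(-1)  # click logit

# Supervised signal: the click label drives training, e.g.
# loss = torch.nn.functional.binary_cross_entropy_with_logits(logit, clicks.float())
```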
[IR-5] DTRec: Learning Dynamic Reasoning Trajectories for Sequential Recommendation
Link: https://arxiv.org/abs/2512.14036
Authors: Yifan Shao, Peilin Zhou, Shoujin Wang, Weizhi Zhang, Xu Cai, Sunghun Kim
Subjects: Information Retrieval (cs.IR)
Notes: Under Review
Abstract:Inspired by advances in LLMs, reasoning-enhanced sequential recommendation performs multi-step deliberation before making final predictions, unlocking greater potential for capturing user preferences. However, current methods are constrained by static reasoning trajectories that are ill-suited to the diverse complexity of user behaviors. They suffer from two key limitations: (1) a static reasoning direction, which uses flat supervision signals misaligned with human-like hierarchical reasoning, and (2) a fixed reasoning depth, which inefficiently applies the same computational effort to all users, regardless of pattern complexity. This rigidity leads to suboptimal performance and significant computational waste. To overcome these challenges, we propose DTRec, a novel and effective framework that explores Dynamic reasoning Trajectories for Sequential Recommendation along both direction and depth. To guide the direction, we develop Hierarchical Process Supervision (HPS), which provides coarse-to-fine supervisory signals to emulate the natural, progressive refinement of human cognitive processes. To optimize the depth, we introduce an Adaptive Reasoning Halting (ARH) mechanism that dynamically adjusts the number of reasoning steps by jointly monitoring three indicators. Extensive experiments on three real-world datasets demonstrate the superiority of our approach, achieving up to a 24.5% performance improvement over strong baselines while simultaneously reducing computational cost by up to 41.6%.
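The depth-adaptation idea reduces to an early-exit loop; in this sketch `halt_score` abstracts the paper's three monitored indicators into a single hypothetical confidence in [0, 1]:

```python
def reason_adaptively(state, step_fn, halt_score, max_steps=8, threshold=0.9):
    """Run at most max_steps deliberation steps, stopping as soon as the
    halting signal says the current reasoning state is good enough."""
    for _ in range(max_steps):
        state = step_fn(state)               # one reasoning step
        if halt_score(state) >= threshold:   # confident enough: exit early
            break
    return state
```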
[IR-6] Intent-Guided Reasoning for Sequential Recommendation
Link: https://arxiv.org/abs/2512.14034
Authors: Yifan Shao, Peilin Zhou
Subjects: Information Retrieval (cs.IR)
Notes: Under Review
Abstract:Sequential recommendation systems aim to capture users’ evolving preferences from their interaction histories. Recent reasoning-enhanced methods have shown promise by introducing deliberate, chain-of-thought-like processes with intermediate reasoning steps. However, these methods rely solely on the next target item as supervision, leading to two critical issues: (1) reasoning instability, where the process becomes overly sensitive to recent behaviors and spurious interactions such as accidental clicks, and (2) surface-level reasoning, where the model memorizes item-to-item transitions rather than understanding intrinsic behavior patterns. To address these challenges, we propose IGR-SR, an Intent-Guided Reasoning framework for Sequential Recommendation that anchors the reasoning process to explicitly extracted high-level intents. Our framework comprises three key components: (1) a Latent Intent Distiller (LID) that efficiently extracts multi-faceted intents using a frozen encoder with learnable tokens, (2) an Intent-aware Deliberative Reasoner (IDR) that decouples reasoning into intent deliberation and decision-making via a dual-attention architecture, and (3) an Intent Consistency Regularization (ICR) that ensures robustness by enforcing consistent representations across different intent views. Extensive experiments on three public datasets demonstrate that IGR-SR achieves an average 7.13% improvement over state-of-the-art baselines. Critically, under 20% behavioral noise, IGR-SR degrades by only 10.4%, compared to 16.2% and 18.6% for competing methods, validating the effectiveness and robustness of intent-guided reasoning.
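One natural form such a consistency regulariser could take is a cosine-agreement term between two intent views of the same sequence; this is a sketch under that assumption, as the abstract does not define the exact loss:

```python
import torch.nn.functional as F

def intent_consistency_loss(z_view_a, z_view_b):
    """Pull representations of the same user sequence under two intent
    views together: 1 minus the mean cosine similarity."""
    z_a = F.normalize(z_view_a, dim=-1)
    z_b = F.normalize(z_view_b, dim=-1)
    return 1.0 - (z_a * z_b).sum(dim=-1).mean()
```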

