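This post lists the latest papers retrieved from arXiv.org on 2025-12-04. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: Daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each morning.
Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-12-04)
525 papers were updated today, including:
- Natural Language Processing: 55 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 141 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 131 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 170 papers (Machine Learning (cs.LG))
Natural Language Processing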
[NLP-0] SkillFactory: Self-Distillation For Learning Cognitive Behaviors
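[Quick Read]: This paper asks how a generative AI model can make effective use, during reinforcement learning (RL), of cognitive skills that its base model does not yet exhibit, such as answer verification, backtracking, and retrying with an alternative method. The key to the solution is SkillFactory, a supervised fine-tuning (SFT) method applied before RL: it rearranges samples generated by the model itself into "silver" training traces that follow the format of specific cognitive skills, giving the model an initial inductive bias toward those skills. The method does not rely on distillation from a stronger model; the model's own outputs are enough for it to roughly learn the target skills. After RL, models initialized this way generalize better to harder variants of a task and are more robust to regression on out-of-domain tasks.
Link: https://arxiv.org/abs/2512.04072
Authors: Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren, Greg Durrett
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: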
Abstract:Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren’t exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These “silver” SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
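The abstract says only that the model's own samples are rearranged into the format of a skill. As a minimal sketch of what that could look like for a verify-and-retry skill, the snippet below assembles a trace from attempts already labeled correct or incorrect. The `Sample` type, the `<attempt>`/`<verify>` tags, and the two-failure cap are all assumptions made for illustration, not the paper's actual format.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    text: str      # one solution attempt sampled from the model itself
    correct: bool  # whether its final answer matched the reference

def build_silver_trace(question, samples):
    """Rearrange a model's own attempts into a try -> verify -> retry trace.

    Hypothetical format: the tags are illustrative, not the paper's.
    Returns None when no correct attempt exists to end the trace on.
    """
    wrong = [s for s in samples if not s.correct]
    right = [s for s in samples if s.correct]
    if not right:
        return None
    parts = [f"Question: {question}"]
    for s in wrong[:2]:  # keep at most two failed attempts to bound length
        parts.append(f"<attempt>{s.text}</attempt>")
        parts.append("<verify>This answer does not check out; trying another approach.</verify>")
    parts.append(f"<attempt>{right[0].text}</attempt>")
    parts.append("<verify>This answer checks out.</verify>")
    return "\n".join(parts)
```

Such traces are "silver" in the abstract's sense: the verification text is templated rather than reasoned, so they may be imperfect while still showing the model the shape of the skill.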
[NLP-1] Stable Signer: Hierarchical Sign Language Generative Model
[Quick Read]: This paper targets error accumulation in Sign Language Production (SLP), where traditional multi-stage pipelines (Text2Gloss, Gloss2Pose, Pose2Vid, and so on) limit the quality and diversity of the generated video. The key to the solution is recasting SLP as an end-to-end hierarchical generation task that keeps only text understanding (Prompt2Gloss and Text2Gloss) and Pose2Vid, and introducing two new modules: the Sign Language Understanding Linker (SLUL), which handles text understanding, and SLP-MoE, a hand gesture rendering expert block that produces high-quality, multi-style sign language video. SLUL is trained with the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). The authors report a 48.6% performance improvement over current state-of-the-art generation methods.
Link: https://arxiv.org/abs/2512.04048
Authors: Sen Fang, Yalin Feng, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas
Affiliations: Rutgers University; Nanyang Technological University; Georgia Institute of Technology; University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY)
Notes: 12 pages, 7 figures. More Demo at this https URL
Abstract:Sign Language Production (SLP) is the process of converting the complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, Pose2Vid stages, and some concentrated on Prompt2Gloss and Text2Avatar stages. However, this field has made slow progress due to the inaccuracy of text conversion, pose generation, and the rendering of poses into real human videos in these stages, resulting in gradually accumulating errors. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical generation end-to-end task that only includes text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid, and executes text understanding through our proposed new Sign Language Understanding Linker called SLUL, and generates hand gestures through the named SLP-MoE hand gesture rendering expert block to end-to-end generate high-quality and multi-style sign language videos. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.
[NLP-2] Jina-VLM: Small Multilingual Vision Language Model
[Quick Read]: This paper addresses the weak multilingual visual question answering (VQA) performance of open vision-language models (VLMs) at roughly the 2B-parameter scale. The key to the solution is Jina-VLM, a 2.4B-parameter model that couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector, which enables token-efficient processing of arbitrary-resolution images. On standard VQA benchmarks and multilingual evaluations it outperforms comparable models while keeping competitive text-only performance.
Link: https://arxiv.org/abs/2512.04032
Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao
Affiliations: Jina AI by Elastic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 18 pages, 1-7 main content
Abstract:We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
[NLP-3] AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
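[Quick Read]: This paper addresses inefficient inference serving and hard-to-meet service-level objectives (SLOs) for augmented large language models (LLMs) in web applications. It identifies two core challenges: (i) first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, so queueing delays exceed the SLO for many requests; and (ii) a static batch token limit cannot adapt to fluctuating load and hardware conditions, which lowers effective throughput. The key to the solution is AugServe, built around a two-stage adaptive request scheduling strategy: stage I uses the inference features of augmented LLM requests to optimize the scheduling order, and stage II continuously refines those decisions with runtime information to match request characteristics and system capabilities. AugServe also adjusts its token batching dynamically based on hardware status and real-time load. In experiments it achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, respectively, and reduces time-to-first-token (TTFT) by up to 96.3% and 95.0%.
Link: https://arxiv.org/abs/2512.04013
Authors: Ying Wang, Zhen Jin, Jiexiong Xu, Wenhai Lin, Yiquan Chen, Wenzhi Chen
Affiliations: Zhejiang University; Alibaba Group
Categories: Computation and Language (cs.CL)
Notes: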
Abstract:As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize request handling within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queuing delays exceeding the SLOs for many requests; and (ii) static batch token limit, which fails to adapt to fluctuating loads and hardware conditions. Both of these factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe combines the inference features of augmented LLM requests to optimize the order of scheduling decisions (stage I). These decisions are continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further enhancing throughput performance. Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
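To see why FCFS ordering hurts effective throughput, the toy simulation below counts how many requests finish within an SLO on a single server under two orderings. It is a sketch under strong assumptions: all requests arrive at time 0, service times are known in advance, and shortest-first ordering stands in for a smarter scheduler. It is not AugServe's two-stage policy, and the numbers are hypothetical.

```python
def served_within_slo(service_times, slo):
    """Count requests whose completion time (queueing + service) meets the SLO,
    assuming one server and all requests arriving at time 0."""
    clock, ok = 0.0, 0
    for t in service_times:
        clock += t
        if clock <= slo:
            ok += 1
    return ok

requests = [10.0, 1.0, 1.0, 1.0]  # hypothetical service times (s), in arrival order
slo = 5.0                         # hypothetical latency target (s)

fcfs = served_within_slo(requests, slo)                    # long request blocks the queue
shortest_first = served_within_slo(sorted(requests), slo)  # short requests go first
```

Under FCFS the 10-second request at the head of the queue pushes every request past the 5-second target, while reordering lets the three short requests finish in time.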
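[NLP-4] Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
[Quick Read]: This paper addresses two aspects of tokenizer adaptation when transferring pre-trained language models to new domains or languages: vocabulary extension, where the common approach adds many tokens that are unreachable or never used, and vocabulary pruning, where redundant tokens should be removed without hurting the model. It proposes two complementary methods. Continued BPE (Byte Pair Encoding) training adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data, which improves tokenization efficiency and makes better use of the added vocabulary. Leaf-based vocabulary pruning removes redundant tokens while preserving model quality. Together they form a practical toolkit for controlled vocabulary modification, released as an open-source package.
Link: https://arxiv.org/abs/2512.03989
Authors: Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, Mark Fishel
Affiliations: Institute of Computer Science, University of Tartu; Center for Artificial Intelligence, Technical University of Applied Sciences Würzburg-Schweinfurt
Categories: Computation and Language (cs.CL)
Notes: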
Abstract:Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.
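The idea of continuing BPE merge learning can be sketched on a toy word-frequency corpus. The snippet below first segments the new data with the existing merges, then keeps learning merges from the resulting pair counts, so every added token is reachable through the merge sequence. This is a simplified illustration, not the authors' released package; `continue_bpe` and its tie-breaking rule are assumptions.

```python
from collections import Counter

def apply_merge(word, pair):
    """Replace each adjacent occurrence of `pair` in a symbol tuple with the merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def continue_bpe(word_freqs, existing_merges, num_new_merges):
    """Continue BPE merge learning on new data, starting from pre-trained merges."""
    # Segment the new corpus with the pre-trained merges, in their learned order.
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    for pair in existing_merges:
        corpus = {apply_merge(w, pair): f for w, f in corpus.items()}
    new_merges = []
    for _ in range(num_new_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        corpus = {apply_merge(w, best): f for w, f in corpus.items()}
        new_merges.append(best)
    return existing_merges + new_merges

merges = continue_bpe({"lower": 5, "lowest": 3}, [("l", "o"), ("lo", "w")], 1)
```

Here the pre-trained merges already produce `low`, and the most frequent pair in the new data is (`low`, `e`), so that becomes the first added merge.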
[NLP-5] Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study
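[Quick Read]: This paper addresses the adaptation of large language models (LLMs) to low-resource languages such as Tibetan, where the main challenges are data scarcity and cross-lingual drift. The key to the solution is a two-stage strategy: Continual Pretraining (CPT) first establishes Tibetan linguistic grounding and builds a Tibetan semantic manifold, then Supervised Fine-Tuning (SFT) specializes the model for translation with minimal disruption to the existing representations. Empirically, perplexity decreases and Chinese-to-Tibetan translation quality improves (BLEU from 0.046 to 0.261; chrF from 2.2 to 6.6). Layer-wise analysis shows that adaptation concentrates in the embedding layer and output heads, along with mid-to-late MLP projections that encode domain-specific transformations.
Link: https://arxiv.org/abs/2512.03976
Authors: Lifeng Chen, Ryan Lai, Tianming Liu
Affiliations: Beijing Jiaotong University; University of Georgia
Categories: Computation and Language (cs.CL)
Notes: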
Abstract:Adapting large language models (LLMs) to low-resource languages remains a major challenge due to data scarcity and cross-lingual drift. This work presents a two-stage adaptation of Qwen2.5-3B to Tibetan, a morphologically rich and underrepresented language. We employ Continual Pretraining (CPT) to establish Tibetan linguistic grounding, followed by Supervised Fine-Tuning (SFT) for task and translation specialization. Empirical evaluations demonstrate a consistent decrease in perplexity (from 2.98 → 1.54) and substantial improvements in Chinese → Tibetan translation quality (BLEU: 0.046 → 0.261; chrF: 2.2 → 6.6). Layer-wise analysis across 435 layers in Qwen3-4B reveals that adaptation primarily concentrates on embedding and output heads, with mid–late MLP projections encoding domain-specific transformations. Our findings suggest that CPT constructs a Tibetan semantic manifold while SFT sharpens task alignment with minimal representational disruption. This study provides the first quantitative exploration of Tibetan adaptation dynamics for LLMs, and offers an open, reproducible framework for extending multilingual foundation models to low-resource settings.
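For readers less familiar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The short sketch below shows the relationship; the per-token losses are hypothetical values chosen only so that the results match the two reported perplexities, and natural-log units are assumed.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token losses chosen to reproduce the reported values.
before = perplexity([math.log(2.98)] * 4)
after = perplexity([math.log(1.54)] * 4)
```

In these units, the drop from 2.98 to 1.54 corresponds to the mean loss falling from about 1.09 to about 0.43 nats per token.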
[NLP-6] Is Lying Only Sinful in Islam? Exploring Religious Bias in Multilingual Large Language Models Across Major Religions
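[Quick Read]: This paper addresses religious bias in multilingual large language models, focusing on the four main religions of South Asia (Buddhism, Christianity, Hinduism, and Islam), where even minor errors can cause severe misunderstandings. The key to the solution is BRAND (Bilingual Religious Accountable Norm Dataset), a bilingual (English and Bengali) dataset of over 2,400 entries covering the four religions and tested with three different types of prompts. The results show that models perform better in English than in Bengali and consistently display bias toward Islam, even on religion-neutral questions. This points to persistent bias when similar questions are asked in different languages, which the authors connect to broader issues in human-computer interaction (HCI) regarding religion and spirituality.
Link: https://arxiv.org/abs/2512.03943
Authors: Kazi Abrab Hossain, Jannatul Somiya Mahmud, Maria Hossain Tuli, Anik Mitra, S. M. Taiabul Haque, Farig Y. Sadeque
Affiliations: BRAC University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: 18 pages, 7 figures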
Abstract:While recent developments in large language models have improved bias detection and classification, sensitive subjects like religion still present challenges because even minor errors can result in severe misunderstandings. In particular, multilingual models often misrepresent religions and have difficulties being accurate in religious contexts. To address this, we introduce BRAND: Bilingual Religious Accountable Norm Dataset, which focuses on the four main religions of South Asia: Buddhism, Christianity, Hinduism, and Islam, containing over 2,400 entries, and we used three different types of prompts in both English and Bengali. Our results indicate that models perform better in English than in Bengali and consistently display bias toward Islam, even when answering religion-neutral questions. These findings highlight persistent bias in multilingual models when similar questions are asked in different languages. We further connect our findings to the broader issues in HCI regarding religion and spirituality.
[NLP-7] BERnaT: Basque Encoders for Representing Natural Textual Diversity LREC2026
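[Quick Read]: This paper addresses the representational bias and weak generalization that arise when language models are trained mostly on standardized text and overlook non-standard varieties (dialectal, historical, informal, and so on). The key to the solution is building new Basque corpora that combine standard, social media, and historical sources, and pre-training the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. The authors also propose an evaluation framework that splits Natural Language Understanding (NLU) tasks into standard and diverse subsets to measure linguistic generalization. Models trained on both standard and diverse data consistently outperform those trained only on standard corpora across all task types, without losing accuracy on standard benchmarks.
Link: https://arxiv.org/abs/2512.03903
Authors: Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, Mikel Zubillaga, Ander Soraluze, Aitor Soroa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Submitted to LREC 2026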
Abstract:Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
[NLP-8] Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
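[Quick Read]: This paper addresses the memory bottleneck that the KV cache (Key-Value cache) creates for Transformer decoders at long sequence lengths. Cross-layer KV cache sharing methods (e.g., YOCO, CLA) reduce memory but typically underperform within-layer methods such as GQA. The authors find that top-layer values are derived predominantly from the bottom layer, while top-layer keys draw on both bottom and middle layers. Building on this, FusedKV forms the top-layer KV caches as a learnable fusion of the most informative caches from the bottom and middle layers. The fusion operates directly on post-RoPE (Rotary Positional Embedding) keys, which preserves relative positional information without re-applying rotary embeddings. FusedKV-Lite simplifies this further by deriving the top-layer caches directly from bottom-layer values and middle-layer keys, reducing I/O overhead at the cost of a slight increase in perplexity. On LLMs from 332M to 4B parameters, the method cuts cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder.
Link: https://arxiv.org/abs/2512.03870
Authors: Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li, Siran Yang, Yunlong Xu, Jiaheng Liu, Yongchi Zhao, Jiamang Wang, Yuchi Xu, Wenbo Su, Bo Zheng
Affiliations: Taobao & Tmall Group of Alibaba; GSAI, Renmin University of China; ICT, Chinese Academy of Sciences; Nanjing University
Categories: Computation and Language (cs.CL)
Notes: under review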
Abstract:Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate KV Cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values of the top-layers. Our preliminary analysis reveals a clear distribution: values are predominantly derived from the bottom layer, while keys draw more information from both bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach, where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
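The difference between the two variants can be sketched on a toy cache. The abstract does not specify the form of the learnable fusion, so the per-channel weighted sum below is an assumption, as are the function names, the cache layout, and the fixed (rather than learned) weights.

```python
def fuse(bottom, middle, w_bottom, w_middle):
    """Per-channel weighted sum of two cached vectors (assumed form of the fusion)."""
    return [wb * b + wm * m for b, m, wb, wm in zip(bottom, middle, w_bottom, w_middle)]

def fusedkv_top_cache(cache, bottom, middle, weights):
    """Rebuild a top layer's (key, value) from the stored bottom and middle layer caches."""
    k = fuse(cache[bottom]["k"], cache[middle]["k"], *weights["k"])
    v = fuse(cache[bottom]["v"], cache[middle]["v"], *weights["v"])
    return k, v

def fusedkv_lite_top_cache(cache, bottom, middle):
    """Lite variant: reuse middle-layer keys and bottom-layer values unchanged."""
    return cache[middle]["k"], cache[bottom]["v"]

# Toy cache for one token with two channels; all numbers are made up.
cache = {0: {"k": [1.0, 2.0], "v": [3.0, 4.0]},
         1: {"k": [5.0, 6.0], "v": [7.0, 8.0]}}
weights = {"k": ([0.5, 0.5], [0.5, 0.5]),  # keys mix both layers
           "v": ([1.0, 1.0], [0.0, 0.0])}  # values lean on the bottom layer
k_top, v_top = fusedkv_top_cache(cache, bottom=0, middle=1, weights=weights)
k_lite, v_lite = fusedkv_lite_top_cache(cache, bottom=0, middle=1)
```

In both variants the top layer stores nothing of its own, which is where the memory saving comes from. The Lite variant also skips the arithmetic, at the cost of a less flexible mix.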
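[NLP-9] Training and Evaluation of Guideline-Based Medical Reasoning in LLMs
[Quick Read]: This paper addresses the fact that machine learning for early prediction in medicine has focused on accuracy at the expense of faithful explanations, which makes it hard to earn the trust of medical practitioners. The goal is to teach large language models (LLMs) to follow medical consensus guidelines step by step in their reasoning and prediction. The key to the solution is verbalizing consensus rules in natural language, instantiating them on electronic health records (EHR), and fine-tuning LLMs on that data so they learn both the rules and their possible exceptions. The consensus rules also allow automatic evaluation of the model's inference process: derivation correctness (whether a conclusion is correctly and faithfully deduced from the given premises) and value correctness (whether predicted values match real-world measurements). Using the Sepsis-3 consensus definition as the example, the experiments show that small fine-tuned models outperform one-shot prompting of considerably larger LLMs given the explicit definition, as well as models trained on medical texts. The main bottleneck is not out-of-distribution generalization but forecasting sparsely and irregularly sampled clinical variables into the future, which can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
Link: https://arxiv.org/abs/2512.03838
Authors: Michael Staniek, Artem Sokolov, Stefan Riezler
Affiliations: Heidelberg University; Google DeepMind; IWR (Institute of Computational Linguistics and IWR)
Categories: Computation and Language (cs.CL)
Notes: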
Abstract:Machine learning for early prediction in medicine has recently shown breakthrough performance, however, the focus on improving prediction accuracy has led to a neglect of faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step-by-step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules to electronic health records provide data for fine-tuning LLMs to learn consensus rules and possible exceptions thereof for many medical areas. Consensus rules also enable an automatic evaluation of the model’s inference process regarding its derivation correctness (evaluating correct and faithful deduction of a conclusion from given premises) and value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future by forecasting sparsely and irregularly sampled clinical variables. We show that the latter results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
[NLP-10] Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology
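[Quick Read]: This paper addresses the way a badly worded prompt can sharply degrade large language model (LLM) text classification, especially in fields like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. The key to the solution is an empirical framework for optimizing prompts so that LLM output aligns with expert judgment. The most influential prompt features are the construct definition, the task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the best results come from a few-shot prompt that combines codebook-guided empirical prompt selection with automatic prompt engineering. The authors recommend generating and evaluating as many prompt variants as feasible (human-crafted, automatically generated, or both), selecting prompts and examples by empirical performance on a training set, and validating the final approach on a holdout set.
Link: https://arxiv.org/abs/2512.03818
Authors: Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura
Affiliations: University of Connecticut
Categories: Computation and Language (cs.CL)
Notes: 22 pages, 2 figures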
Abstract:Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
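The recommended procedure (choose among prompt variants on a training set, then validate once on a holdout set) can be sketched as follows. The `classify` callable is a placeholder that the caller supplies, for example a wrapper around an LLM call, and plain accuracy against expert labels is an assumed agreement metric; the paper may use a different one.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the expert labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def select_prompt(prompts, classify, train, holdout):
    """Pick the prompt with the best training agreement, then score it on the holdout set.

    classify(prompt, text) -> label is supplied by the caller (hypothetical interface).
    train and holdout are lists of (text, expert_label) pairs.
    """
    def score(prompt, data):
        texts, labels = zip(*data)
        return accuracy([classify(prompt, t) for t in texts], labels)

    best = max(prompts, key=lambda p: score(p, train))
    return best, score(best, holdout)
```

Keeping the holdout score separate from the selection step matters: the training score of the winning prompt is optimistically biased because it was used to choose that prompt.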
[NLP-11] Enhancing Instruction-Following Capabilities in Seq2Seq Models: DoLA Adaptations for T5
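[Quick Read]: This paper addresses the fact that contrastive decoding methods have not yet been applied to encoder-decoder architectures such as T5 and FLAN-T5, and that their effect on instruction-following capabilities is unclear. The key to the solution is adapting DoLa (Decoding by Contrastive Layers), previously implemented only in decoder-only architectures, to the encoder-decoder setting; to the authors' knowledge this is the first such implementation. A layer-by-layer analysis of logit evolution in a FLAN-T5 model quantifies DoLa's effect on token output probabilities and helps explain why it improves faithfulness for some task categories and harms others.
Link: https://arxiv.org/abs/2512.03803
Authors: Huey Sun, Anabel Yong, Lorenzo Gilly, Felipe Jin
Affiliations: University College London
Categories: Computation and Language (cs.CL)
Notes: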
Abstract:Contrastive decoding is a lightweight and effective inference-time method that improves the quality of text generation in Large Language Models. However, algorithms such as DoLa (Decoding by Contrastive Layers) have only been implemented in decoder-only architectures and studied for their impact on improving factuality. This work adapts DoLa for the T5 and FLAN-T5 model families and evaluates its impact on the models’ instruction following capabilities, which to our knowledge is the first implementation of a contrastive decoding strategy in an encoder-decoder architecture. Our results show that DoLa improves the faithfulness of text generation for certain categories of tasks and harms others. To understand these results, we present a layer-by-layer analysis of logit evolution in a FLAN-T5 model to quantify DoLa’s impact on token output probabilities.
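The core scoring step of layer-contrastive decoding can be sketched in a few lines: subtract an earlier layer's log-probabilities from the final layer's, and only consider tokens the final layer already finds plausible. This is a simplified illustration with a fixed early layer and made-up logits. The original DoLa method also selects the early layer dynamically, and how this paper maps the idea onto T5's decoder layers is not shown here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def dola_scores(final_logits, early_logits, alpha=0.1):
    """Contrast final-layer and early-layer log-probabilities.

    Tokens whose final-layer probability is below alpha times the top
    probability are masked out (plausibility constraint).
    """
    final_lp = log_softmax(final_logits)
    early_lp = log_softmax(early_logits)
    cutoff = math.log(alpha) + max(final_lp)
    return [f - e if f >= cutoff else float("-inf")
            for f, e in zip(final_lp, early_lp)]

# Made-up logits over a three-token vocabulary.
final_logits = [2.0, 1.9, -5.0]
early_logits = [2.0, 0.0, -5.0]
scores = dola_scores(final_logits, early_logits)
greedy_choice = final_logits.index(max(final_logits))
contrastive_choice = scores.index(max(scores))
```

In this toy case greedy decoding picks token 0, but token 1 gained far more probability between the early and final layers, so the contrastive score prefers it; token 2 is masked as implausible.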
[NLP-12] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
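[Quick Read]: This paper addresses the high computational overhead that Vision-Language Models (VLMs) incur in visual question answering (VQA) because they rely on large numbers of visual tokens. Existing efficient VLM methods compress visual tokens at a fixed ratio, operate passively, and cannot adapt to varying task requirements. AdaptVision instead acquires visual tokens adaptively in a coarse-to-fine manner: the model first reasons over compressed visual tokens from a low-resolution image and, when necessary, invokes a bounding box tool to crop key regions and obtain additional visual information. The key to the solution is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components, tool learning (correct tool use) and accuracy improvement (answer correctness), and computes separate advantages for the tokens associated with each objective. This gives more effective reinforcement learning than vanilla GRPO and lets the model keep strong performance while consuming substantially fewer visual tokens.
Link: https://arxiv.org/abs/2512.03794
Authors: Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
Affiliations: Tencent AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 15 pages, 9 figures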
Abstract:Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
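One way to read "separate advantages for tokens associated with each objective" is to normalize each reward within a group of rollouts independently, GRPO-style, and attach each advantage to its own tokens. The sketch below does only the normalization step. The reward definitions, the per-rollout pairing, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def decoupled_advantages(tool_rewards, answer_rewards):
    """Return one (tool_advantage, answer_advantage) pair per rollout.

    The tool advantage would weight tool-call tokens and the answer
    advantage would weight answer tokens (assumed token assignment).
    """
    return list(zip(group_advantages(tool_rewards),
                    group_advantages(answer_rewards)))

# Four hypothetical rollouts for the same question.
pairs = decoupled_advantages(tool_rewards=[1, 0, 1, 0],
                             answer_rewards=[1, 1, 0, 0])
```

The second rollout shows why decoupling helps: its tool use is penalized while its correct answer is still rewarded, whereas a single combined reward would blur the two signals.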
[NLP-13] In-Context Representation Hijacking
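[Quick Read]: This paper studies how the safety alignment of large language models (LLMs) can fail when internal representations are hijacked through in-context examples. The attack, called Doublespeak, systematically replaces a harmful keyword (e.g., "bomb") with a benign token (e.g., "carrot") across multiple in-context examples that prefix a harmful request. The internal representation of the benign token then converges toward that of the harmful one, so a superficially innocuous prompt is internally interpreted as a disallowed instruction and bypasses the model's safety alignment. The attack is optimization-free, transfers broadly across model families, and achieves strong success rates on both closed-source and open-source systems (74% ASR on Llama-3.3-70B-Instruct). The findings indicate that current alignment strategies are insufficient and should operate at the representation level.
Link: https://arxiv.org/abs/2512.03771
Authors: Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
Affiliations: Mentaleap; UC Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: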
Abstract:We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
[NLP-14] Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
[Quick Read]: This paper addresses the difficulty of applying reinforcement learning (RL) to diffusion large language models (dLLMs). The core challenge is likelihood approximation: dLLMs generate sequences through iterative non-autoregressive denoising and lack the token-level conditional probability factorization that autoregressive models provide, so token-level RL objectives (e.g., GRPO) cannot be applied directly. The key to the solution is ELBO-based Sequence-level Policy Optimization (ESPO), which treats the generation of an entire sequence as a single action and uses the evidence lower bound (ELBO) as a tractable sequence-level likelihood proxy. It adds per-token normalization of importance ratios and robust KL-divergence estimation for stable large-scale training. On mathematical reasoning, coding, and planning tasks, ESPO significantly outperforms token-level baselines, with gains of 20-40 points on the Countdown task.
Link: https://arxiv.org/abs/2512.03759
Authors: Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, Chongxuan Li
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; Stanford University; Lambda, Inc; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:
Abstract:Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at this https URL.
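A rough sketch of a sequence-level importance ratio with per-token normalization follows, assuming the ELBO values for the new and old policies have already been estimated. Dividing the ELBO difference by the sequence length before exponentiating is one plausible reading of "per-token normalization", and the PPO-style clipping is a standard choice rather than something the abstract specifies.

```python
import math

def sequence_ratio(elbo_new, elbo_old, seq_len):
    """Length-normalized importance ratio, with ELBOs as sequence log-likelihood proxies."""
    return math.exp((elbo_new - elbo_old) / seq_len)

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate, applied once per sequence (assumed objective form)."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical ELBO estimates for a 100-token sequence.
normalized = sequence_ratio(-90.0, -100.0, seq_len=100)  # exp(0.1)
unnormalized = math.exp(-90.0 - (-100.0))                # exp(10)
```

Without the normalization, a 10-nat ELBO difference gives a ratio in the tens of thousands, which clipping would flatten for almost every sequence. Dividing by the length keeps the ratio near 1, where the surrogate objective still carries gradient signal.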
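[NLP-15] Thinking with Programming Vision: Towards a Unified View for Thinking with Images
[Quick Read]: This paper addresses key limitations of tool-based visual reasoning in multimodal large language models (MLLMs): they are brittle under simple orientation changes and natural corruptions, and they depend on a narrow, fixed tool set with limited flexibility and scalability. The key to the solution is CodeVision, a code-as-tool framework in which the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. Training has two stages: Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel dense process reward function that encourages strategic and efficient tool use. Experiments on the Qwen2.5-VL and Qwen3-VL series show significantly improved performance and emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback.
Link: https://arxiv.org/abs/2512.03746
Authors: Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin
Affiliations: Zhejiang University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: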
Abstract:Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at this https URL.
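[NLP-16] AR-Med: Automated Relevance Enhancement in Medical Search via LLM-Driven Information Augmentation
[Quick Read]: This paper addresses insufficient search accuracy and reliability on online healthcare platforms, where traditional methods struggle to understand complex and nuanced user queries. The core solution is the AR-Med framework, which uses a retrieval-augmented approach to ground large language model (LLM) reasoning in verified medical knowledge, improving accuracy and reliability. The authors also design a practical knowledge distillation scheme that compresses large teacher models into compact student models for real-time online serving, and introduce LocalQSMed, a multi-expert annotated benchmark that guides model iteration and keeps offline evaluation aligned with online performance.
Link: https://arxiv.org/abs/2512.03737
Authors: Chuyue Wang,Jie Feng,Yuxi Wu,Hang Zhang,Zhiguo Fan,Bing Cheng,Wei Lin
Affiliations: Meituan Inc.; Tsinghua University; Independent Researcher
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: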
Abstract:Accurate and reliable search on online healthcare platforms is critical for user safety and service efficacy. Traditional methods, however, often fail to comprehend complex and nuanced user queries, limiting their effectiveness. Large language models (LLMs) present a promising solution, offering powerful semantic understanding to bridge this gap. Despite their potential, deploying LLMs in this high-stakes domain is fraught with challenges, including factual hallucinations, specialized knowledge gaps, and high operational costs. To overcome these barriers, we introduce AR-Med, a novel framework for Automated Relevance assessment for Medical search that has been successfully deployed at scale on the Online Medical Delivery Platforms. AR-Med grounds LLM reasoning in verified medical knowledge through a retrieval-augmented approach, ensuring high accuracy and reliability. To enable efficient online service, we design a practical knowledge distillation scheme that compresses large teacher models into compact yet powerful student models. We also introduce LocalQSMed, a multi-expert annotated benchmark developed to guide model iteration and ensure strong alignment between offline and online performance. Extensive experiments show AR-Med achieves an offline accuracy of over 93%, a 24% absolute improvement over the original online system, and delivers significant gains in online relevance and user satisfaction. Our work presents a practical and scalable blueprint for developing trustworthy, LLM-powered systems in real-world healthcare applications.
[NLP-17] DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue
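[Quick Read]: This paper addresses State Inertia in long-context dialogue systems, where static constraints prevent a model from resolving conflicts between evolving user intents and established historical context. The core solution is DZ-TDPO, a non-destructive alignment framework that combines conflict-aware dynamic KL constraints with a learnable temporal attention bias, so that alignment comes from precise attention regulation instead of destructive weight updates. On the Multi-Session Chat (MSC) dataset it achieves state-of-the-art win rates (86.2% on Phi-3.5) while maintaining robust zero-shot generalization. A scaling analysis reveals a "Capacity-Stability Trade-off": smaller models pay an "alignment tax" (a perplexity surge) to overcome historical inertia, whereas the larger Qwen2.5-7B model reaches near-perfect alignment (99.4% win rate) with negligible perplexity overhead, and general capabilities (MMLU) are preserved across model scales.
Link: https://arxiv.org/abs/2512.03704
Authors: Yijun Liao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 2 figures, 13 tables. Code available at this https URL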
Abstract:Long-context dialogue systems suffer from State Inertia, where static constraints prevent models from resolving conflicts between evolving user intents and established historical context. To address this, we propose DZ-TDPO, a non-destructive alignment framework that synergizes conflict-aware dynamic KL constraints with a learnable temporal attention bias. Experiments on the Multi-Session Chat (MSC) dataset demonstrate that DZ-TDPO achieves state-of-the-art win rates (86.2% on Phi-3.5) while maintaining robust zero-shot generalization. Crucially, our scaling analysis reveals a “Capacity-Stability Trade-off”: while smaller models incur an “alignment tax” (perplexity surge) to overcome historical inertia, the larger Qwen2.5-7B model achieves near-perfect alignment (99.4% win rate) with negligible perplexity overhead. This confirms that TAI can be alleviated via precise attention regulation rather than destructive weight updates, preserving general capabilities (MMLU) across model scales. Code and data are available: this https URL
[NLP-18] AITutor-EvalKit: Exploring the Capabilities of AI Tutors
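[Quick Read]: This paper addresses the lack of systematic evaluation tools for generative AI in educational settings, in particular the difficulty of quantifying and analyzing the pedagogical quality of AI tutors. The core solution is AITutor-EvalKit, an evaluation application built on language technology that supports pedagogical quality assessment, model inspection, data visualization, and collection of user feedback, giving education stakeholders and the NLP community a reusable evaluation framework and practical tool.
Link: https://arxiv.org/abs/2512.03688
Authors: Numaan Naeem,Kaushal Kumar Maurya,Kseniia Petukhova,Ekaterina Kochmar
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: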
Abstract:We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors, provides software for demonstration and evaluation, as well as model inspection and data visualization. This tool is aimed at education stakeholders as well as *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.
[NLP-19] Different types of syntactic agreement recruit the same units within large language models
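[Quick Read]: This paper asks how grammatical knowledge is represented in Large Language Models (LLMs), and in particular whether different syntactic phenomena recruit shared or distinct model components. The key is a functional localization approach inspired by cognitive neuroscience: the authors identify the units most responsive to 67 English syntactic phenomena in seven open-weight models and show causally that these units support the models' syntactic performance. They further find that different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) recruit overlapping sets of units, suggesting that agreement is a meaningful functional category in LLMs' representational spaces. The pattern holds in English, Russian, and Chinese; in a cross-lingual analysis of 57 languages, structurally more similar languages share more units for subject-verb agreement.
Link: https://arxiv.org/abs/2512.03676
Authors: Daria Kryvosheieva,Andrea de Varda,Evelina Fedorenko,Greta Tuckute
Affiliations: Massachusetts Institute of Technology; Kempner Institute at Harvard University
Categories: Computation and Language (cs.CL)
Comments: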
Abstract:Large language models (LLMs) can reliably distinguish grammatical from ungrammatical sentences, but how grammatical knowledge is represented within the models remains an open question. We investigate whether different syntactic phenomena recruit shared or distinct components in LLMs. Using a functional localization approach inspired by cognitive neuroscience, we identify the LLM units most responsive to 67 English syntactic phenomena in seven open-weight models. These units are consistently recruited across sentences containing the phenomena and causally support the models’ syntactic performance. Critically, different types of syntactic agreement (e.g., subject-verb, anaphor, determiner-noun) recruit overlapping sets of units, suggesting that agreement constitutes a meaningful functional category for LLMs. This pattern holds in English, Russian, and Chinese; and further, in a cross-lingual analysis of 57 diverse languages, structurally more similar languages share more units for subject-verb agreement. Taken together, these findings reveal that syntactic agreement-a critical marker of syntactic dependencies-constitutes a meaningful category within LLMs’ representational spaces.
[NLP-20] Evaluating Hydro-Science and Engineering Knowledge of Large Language Models
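[Quick Read]: This paper addresses the fact that the knowledge and application abilities of Large Language Models (LLMs) in Hydro-Science and Engineering (Hydro-SE) have not been systematically evaluated. The authors propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), a standardized test set of 4,000 multiple-choice questions covering nine subfields such as hydrology, water resources, hydraulics, and hydraulic structures. It evaluates LLMs along three dimensions: basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The benchmark offers a repeatable, comparable way to quantify LLM performance on Hydro-SE tasks. Results show that current models do well on knowledge related to the natural and physical sciences but fall short on industry standards and concrete engineering problems, which points to clear directions for model improvement and practical application.
Link: https://arxiv.org/abs/2512.03672
Authors: Shiruo Hu,Wenbo Shan,Yingjia Li,Zhiqi Wan,Xinpeng Yu,Yunjia Qi,Haotian Xia,Yang Xiao,Dingxiao Liu,Jiaru Wang,Chenxu Gong,Ruixi Zhang,Shuyue Wu,Shibo Cui,Chee Hui Lai,Wei Luo,Yubin He,Bin Xu,Jianshi Zhao
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Hydro-SE Bench sets a new benchmark for the evaluation of LLMs in the Hydro-Science and Engineering domain, with its code and data available at this https URL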
Abstract:Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligence. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs in aspects of basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that the accuracy values vary among 0.74 to 0.80 for commercial LLMs, and among 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but there is still great potential for LLMs to better handle problems in practical engineering application. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.
[NLP-21] Generative AI Practices Literacy and Divides: An Empirical Analysis in the Italian Context
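[Quick Read]: This paper examines unequal adoption of generative AI (GenAI) in Italian society, its usage patterns, and its relationship to digital literacy. It focuses on whether widespread use at work and in personal life raises the risk that users fail to recognize errors or misinformation, and on how gender differences affect participation. The key is a large-scale empirical survey of 1,906 Italian-speaking adults that maps the multipurpose use of GenAI. It finds that, although digital literacy is an important predictor of adoption, it does not fully explain the lower adoption among women, especially older women. This points to structural barriers beyond competence and informs both targeted educational initiatives and further study of equitable participation.
Link: https://arxiv.org/abs/2512.03671
Authors: Beatrice Savoldi,Giuseppe Attanasio,Olga Gorodetskaya,Marta Marchiori Manerba,Elisa Bassignana,Silvia Casola,Matteo Negri,Tommaso Caselli,Luisa Bentivogli,Alan Ramponi,Arianna Muti,Nicoletta Balbo,Debora Nozza
Affiliations: FBK (Fondazione Bruno Kessler)
Categories: Computation and Language (cs.CL)
Comments: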
Abstract:The rise of Artificial Intelligence (AI) language technologies, particularly generative AI (GenAI) chatbots accessible via conversational interfaces, is transforming digital interactions. While these tools hold societal promise, they also risk widening digital divides due to uneven adoption and low awareness of their limitations. This study presents the first comprehensive empirical mapping of GenAI adoption, usage patterns, and literacy in Italy, based on newly collected survey data from 1,906 Italian-speaking adults. Our findings reveal widespread adoption for both work and personal use, including sensitive tasks like emotional support and medical advice. Crucially, GenAI is supplanting other technologies to become a primary information source: this trend persists despite low user digital literacy, posing a risk as users struggle to recognize errors or misinformation. Moreover, we identify a significant gender divide – particularly pronounced in older generations – where women are half as likely to adopt GenAI and use it less frequently than men. While we find literacy to be a key predictor of adoption, it only partially explains this disparity, suggesting that other barriers are at play. Overall, our data provide granular insights into the multipurpose usage of GenAI, highlighting the dual need for targeted educational initiatives and further investigation into the underlying barriers to equitable participation that competence alone cannot explain.
[NLP-22] Optical Context Compression Is Just (Bad) Autoencoding
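[Quick Read]: This paper questions the effectiveness of vision-based context compression for language modeling, specifically whether DeepSeek-OCR's finding that rendered text can be reconstructed with high fidelity from a small number of vision tokens translates into real gains for language models. It tests two implicit assumptions: that vision-based compression offers unique advantages for text reconstruction, and that high-fidelity reconstruction implies usefulness for language modeling. The key is a controlled comparison of DeepSeek-OCR's vision encoder against two simpler alternatives, parameter-free mean pooling and a learned hierarchical encoder, on text reconstruction and language modeling at matched compression ratios. The simple methods match or surpass the vision encoder on reconstruction and outperform it on language modeling, where vision-based compression fails to beat plain truncation. The results indicate that current optimism about optical context compression outpaces the evidence.
Link: https://arxiv.org/abs/2512.03643
Authors: Ivan Yee Lee,Cheng Yang,Taylor Berg-Kirkpatrick
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: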
Abstract:DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR’s reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing their vision encoder against simple alternatives–parameter-free mean pooling and a learned hierarchical encoder–we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling–where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at this https URL
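One of the baselines named in the abstract is parameter-free mean pooling. A minimal sketch of what such a compressor could look like is shown below, assuming token embeddings are plain lists of floats and that consecutive, non-overlapping windows are averaged. The window layout and the handling of a short trailing window are assumptions; the abstract does not specify them.

```python
def mean_pool(embeddings, ratio):
    """Compress a sequence of token vectors by averaging every
    consecutive group of `ratio` vectors into a single vector.

    A trailing group shorter than `ratio` is averaged over the vectors
    it actually contains (an illustrative choice).
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    pooled = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Three 2-dimensional token vectors compressed at ratio 2 give two vectors.
compressed = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], ratio=2)
```

The function has no trainable weights, which is what makes it a useful lower bar: any learned or vision-based compressor at the same ratio should be expected to beat it.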
[NLP-23] AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment
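[Quick Read]: This paper addresses hallucination in Large Language Models (LLMs): generated text that looks plausible but is factually wrong, which can have severe consequences in high-stakes domains such as clinical applications. Existing evaluation metrics do not measure factual consistency accurately and lack interpretability, which makes errors hard to diagnose and correct. The paper proposes an interpretable framework for factual consistency assessment. Its key elements are decomposing text into atomic facts, modeling them with a flexible schema-free method, introducing a weighted metric to improve evaluation precision, and providing a mechanism to control assessment complexity in intricate domains, yielding more reliable and interpretable judgments for both in-domain and open-domain text.
Link: https://arxiv.org/abs/2512.03634
Authors: Ahmad Aghaebrahimian
Affiliations: Zurich University of Applied Sciences; Swiss Institute of Bioinformatics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: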
Abstract:Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
[NLP-24] SELF: A Robust Singular Value and Eigenvalue Approach for LLM Fingerprinting
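[Quick Read]: This paper addresses a key challenge in Intellectual Property (IP) protection for Large Language Models (LLMs): existing fingerprinting techniques are vulnerable to false claim attacks and weight manipulations. The solution is SELF, a new fingerprinting scheme based on the model's intrinsic weights. Its core innovations are: 1) unique, scalable, and transformation-invariant fingerprint extraction via singular value decomposition (SVD) and eigenvalue decomposition of attention weights; and 2) a neural network-based fingerprint similarity comparison built with few-shot learning and data augmentation, which keeps detection accuracy and robustness high under downstream modifications such as quantization, pruning, and fine-tuning.
Link: https://arxiv.org/abs/2512.03620
Authors: Hanxiu Zhang,Yue Zheng
Affiliations: The Chinese University of Hong Kong, Shenzhen
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: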
Abstract:The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods – whether behavior-based or structural – are vulnerable to false claim attacks or susceptible to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at this https URL.
[NLP-25] Fine-grained Narrative Classification in Biased News Articles
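[Quick Read]: This paper addresses fine-grained narrative classification in biased news articles, and how article-bias classification can serve as a precursor task that improves narrative analysis and persuasive technique identification. The main challenges are the absence of an ideologically grounded, fine-grained narrative dataset and of reasoning frameworks that capture multi-layered communicative phenomena (ideological polarity, communicative intent, and persuasive strategy). The key contribution is INDI-PROP, the first multi-level annotated dataset for Indian news media, comprising 1,266 articles on two socio-political events, CAA and the Farmers' protest. Each article is annotated at three levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific narrative frames anchored in ideological polarity, and (iii) persuasive techniques. The paper also proposes two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks: FANTA, which performs hierarchical reasoning by integrating information extraction and contextual framing, and TPTC, which decomposes persuasive cues systematically in two stages. Both improve substantially over the baselines on bias classification, narrative classification, and persuasive technique identification.
Link: https://arxiv.org/abs/2512.03582
Authors: Zeba Afroz,Harsh Vardhan,Pawan Bhakuni,Aanchal Punia,Rajdeep Kumar,Md. Shad Akhtar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: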
Abstract:Narratives are the cognitive and emotional scaffolds of propaganda. They organize isolated persuasive techniques into coherent stories that justify actions, attribute blame, and evoke identification with ideological camps. In this paper, we propose a novel fine-grained narrative classification in biased news articles. We also explore article-bias classification as the precursor task to narrative classification and fine-grained persuasive technique identification. We develop INDI-PROP, the first ideologically grounded fine-grained narrative dataset with multi-level annotation for analyzing propaganda in Indian news media. Our dataset INDI-PROP comprises 1,266 articles focusing on two polarizing socio-political events in recent times: CAA and the Farmers’ protest. Each article is annotated at three hierarchical levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific fine-grained narrative frames anchored in ideological polarity and communicative intent, and (iii) persuasive techniques. We propose FANTA and TPTC, two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks for the bias, narrative, and persuasive technique classification. FANTA leverages multi-layered communicative phenomena by integrating information extraction and contextual framing for hierarchical reasoning. On the other hand, TPTC adopts systematic decomposition of persuasive cues via a two-stage approach. Our evaluation suggests substantial improvement over underlying baselines in each case.
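[NLP-26] CartoMapQA: A Fundamental Benchmark Dataset Evaluating Vision-Language Models on Cartographic Map Understanding
[Quick Read]: This paper addresses the marked weakness of Visual-Language Models (LVLMs) in map understanding, in particular the lack of systematic evaluation tools and clearly identified challenges for cartographic information. The key is a new benchmark, CartoMapQA, with over 2000 samples spanning multiple levels of map interpretation: symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. It uses structured question answering to evaluate how well LVLMs understand map semantics. The benchmark exposes current models' limitations in map-specific semantics, geospatial reasoning, and Optical Character Recognition (OCR)-related errors, and provides a quantifiable basis for guiding future LVLM architecture improvements toward reliable use in navigation, geographic search, and urban planning.
Link: https://arxiv.org/abs/2512.03558
Authors: Huy Quang Ung,Guillaume Habault,Yasutaka Nishimura,Hao Niu,Roberto Legaspi,Tomoki Oya,Ryoichi Kojima,Masato Taya,Chihiro Ono,Atsunori Minamikawa,Yan Liu
Affiliations: KDDI Research, Inc.; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at SIGSPATIAL 2025 (Best paper candidates), 15 pages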
Abstract:The rise of Visual-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs’ understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: this https URL
[NLP-27] M3DR: Towards Universal Multilingual Multimodal Document Retrieval
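[Quick Read]: This paper addresses the language limitation of current multimodal document retrieval systems: existing methods rely mainly on English data and align semantics poorly across languages in multilingual settings. The core solution is the M3DR framework, which combines synthetic multilingual document data with contrastive training to build a unified representation space for text and document images, enabling cross-lingual and cross-modal alignment. The key points are training on large-scale synthetic multilingual data, generalizing across different vision-language architectures and model sizes, and supporting both single dense vector and token-level multi-vector retrieval. The approach is validated on 22 languages, with relative improvements of about 150% on cross-lingual retrieval.
Link: https://arxiv.org/abs/2512.03514
Authors: Adithya S Kolavi,Vyoman Jain
Affiliations: CognitiveLab
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: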
Abstract:Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
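The abstract contrasts single dense vector retrieval with ColBERT-style token-level multi-vector retrieval. The difference in scoring can be sketched as follows, assuming embeddings are already computed and represented as lists of floats. The function names are hypothetical, and the sketch shows the general late-interaction (MaxSim) scoring rule, not the NetraEmbed or ColNetraEmbed implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec, doc_vec):
    """Single-vector retrieval: one similarity per (query, document) pair."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: each query token vector is matched
    to its most similar document token vector, and the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy vectors: two query tokens scored against two document tokens/patches.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.5, 0.5], [1.0, 0.0]]
score = maxsim_score(query_tokens, doc_tokens)  # 1.0 + 0.5 = 1.5
```

The multi-vector form keeps one vector per token or image patch, so it costs more storage per document than a single dense vector but can match fine-grained query terms.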
[NLP-28] Understanding LLM Reasoning for Abstractive Summarization
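[Quick Read]: This paper examines whether reasoning actually helps abstractive summarization. Although Large Language Models (LLMs) excel at analytical tasks such as mathematics and code generation, the value of reasoning for summarization has been assumed without systematic verification. The authors first adapt general reasoning strategies to the summarization domain, then run a large-scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 datasets, assessing summary quality and faithfulness. They find that reasoning is not a universal solution and that its effect depends heavily on the strategy and context. In particular, there is a trade-off between summary quality and factual faithfulness: explicit reasoning improves fluency at the expense of factual grounding, while implicit reasoning in LRMs shows the inverse pattern. Increasing an LRM's internal reasoning budget does not help and can even hurt factual consistency, suggesting that good summarization depends on faithful compression, not creative over-thinking.
Link: https://arxiv.org/abs/2512.03503
Authors: Haohan Yuan,Siu Cheung Hui,Haopeng Zhang
Affiliations: ALOHA Lab, University of Hawaii at Manoa; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 15 figures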
Abstract:While the reasoning capabilities of Large Language Models (LLMs) excel in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution and its effectiveness is highly dependent on the specific strategy and context. Specifically, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM’s internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative over-thinking.
[NLP-29] NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation
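[Quick Read]: This paper addresses the limited adaptability of the Segment Anything Model (SAM) to specific downstream tasks such as medical and agricultural image segmentation. The Transformer encoder in SAM lacks spatial priors within image patches, which can hinder the acquisition of high-level semantic information. The solution is NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method. Its core idea is to insert a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA, dynamically optimizing the prior knowledge integrated into weight updates. A stage-wise optimization strategy helps the Vision Transformer (ViT) encoder balance weight updates and architectural adjustments, so that it gradually learns high-level semantic features. Experiments show that NAS-LoRA improves on existing PEFT methods while reducing training cost by 24.14% without increasing inference cost.
Link: https://arxiv.org/abs/2512.03499
Authors: Renqi Chen,Haoyang Su,Shixiang Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: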
Abstract:The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM’s adaptation performance on diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Various experiments demonstrate that NAS-LoRA improves existing PEFT methods, while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
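To make "between the encoder and decoder components of LoRA" concrete, the sketch below shows a standard LoRA forward pass with a hook in the low-rank bottleneck. The `middle` argument is a hypothetical placeholder for where an extra block could sit. It is an illustration of the placement only and does not implement the paper's NAS block, whose internals the abstract does not describe.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def lora_forward(x, W, A, B, scale=1.0, middle=None):
    """Compute W x + scale * B (middle(A x)).

    W is the frozen pretrained weight, A is the LoRA down-projection
    ("encoder") and B the up-projection ("decoder"). `middle` is a
    hypothetical hook applied in the rank-r bottleneck; None means
    plain LoRA.
    """
    base = matvec(W, x)
    hidden = matvec(A, x)          # project down to rank r
    if middle is not None:
        hidden = middle(hidden)    # placeholder for an inserted block
    update = matvec(B, hidden)     # project back up
    return [b + scale * u for b, u in zip(base, update)]


# Toy example: 2x2 identity weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # shape (r=1, in=2)
B = [[1.0], [2.0]]    # shape (out=2, r=1)
y = lora_forward([1.0, 2.0], W, A, B)
```

Because the hook operates on a rank-r vector, anything placed there stays cheap compared with the full layer, which is consistent with the abstract's description of the block as lightweight.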
[NLP-30] A Preliminary Study on the Promises and Challenges of Native Top-k Sparse Attention
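[Quick Read]: This paper addresses the bottleneck that high inference cost imposes on Large Language Models (LLMs) in long-context modeling, particularly for agents and multimodal applications. The core solution is a systematic study of the Top-k Attention mechanism. The key findings are: 1) with exact Top-k Decoding, keeping only the k Keys most similar to the Query as the context window during decoding matches or even surpasses full attention on downstream tasks; 2) keeping Top-k Attention operations consistent between training and inference significantly improves model performance; and 3) from an entropy perspective, models fine-tuned (SFT) with Top-k Attention show reduced entropy on downstream tasks, which supports the hypothesis that low-entropy states are better adapted to Top-k Decoding.
Link: https://arxiv.org/abs/2512.03494
Authors: Di Xiu,Hongyin Tang,Bolin Rong,Lizhi Yan,Jingang Wang,Yifan Lu,Xunliang Cai
Affiliations: Meituan
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: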
Abstract:Large Language Models (LLMs) are increasingly prevalent in the field of long-context modeling; however, their inference computational costs have become a critical bottleneck hindering the advancement of tasks such as agents and multimodal applications. This report conducts a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-k Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-k Decoding through extensive experimentation. Experiments demonstrate that retaining only the pivotal Keys with the highest similarity to the Query as the context window during the decoding stage achieves performance comparable to, or even surpassing, full attention on downstream tasks such as HELMET and LongBench v2. Second, we further explore the native Top-k Attention training strategy. Experiments confirm that ensuring the consistency between training and inference regarding Top-k Attention operations facilitates the further unlocking of Top-k Decoding’s potential, thereby significantly enhancing model performance. Furthermore, considering the high computational complexity of exact Top-k Attention, we investigate the impact of approximate Top-k algorithm precision on downstream tasks. Our research confirms a positive correlation between downstream task performance and approximation fidelity, and we provide statistical evaluations of the Lightning Indexer’s precision within the DeepSeek-V3.2-Exp model. Finally, this report provides a theoretical interpretation from the perspective of Entropy. Experimental observations indicate that models subjected to Top-k Attention SFT exhibit a distinct phenomenon of entropy reduction in downstream tasks, which validates the hypothesis that low-entropy states are better adapted to Top-k Decoding.
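The exact Top-k Decoding step described in the abstract can be sketched for a single query and a single attention head. This is a minimal illustration in plain Python: the scaled dot product as the similarity measure and the function name `topk_attention` are assumptions, and real implementations operate on batched tensors over a KV cache.

```python
import math


def topk_attention(query, keys, values, k):
    """Attend only to the k keys most similar to the query.

    Similarity is the scaled dot product (an assumption). Softmax is
    computed over the selected keys only; all other keys get zero weight.
    Returns the output vector and the sorted indices that were kept.
    """
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    # Indices of the k highest-scoring keys.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the kept keys.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w / total * values[i][j] for w, i in zip(weights, top))
        for j in range(dim)
    ]
    return output, sorted(top)


# Toy example: with k=1 only the best-matching key contributes.
keys = [[10.0, 0.0], [0.0, 1.0], [-10.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
out, kept = topk_attention([1.0, 0.0], keys, values, k=1)
```

Setting `k` to the number of keys recovers full attention, which makes it easy to compare the two settings on the same inputs.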
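[NLP-31] Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits
【Quick read】: This paper studies anonymization attacks on text authorship attribution, i.e., how adversarial stylometry can rewrite a text to hide its author's characteristic style. The key step is to identify the stylometric features with high Information Gain: function words and function word types (L_FUNC_A and L_FUNC_T), content words and content word types (L_CONT_A and L_CONT_T), and the Type-Token Ratio (ST_TYPE_TOKEN_RATIO_LEMMAS). These features act as reliable indicators of compromise (IoCs) showing that a text was deliberately altered to mask its author, and they could serve as forensic beacons for defenders, although the abstract notes that the signal may go unnoticed without the original message because it appears to depend on a pre- and post-transformation comparison. The authors then organize the operations and outputs of their attack script, TraceTarnish, around these five features to strengthen the attack.
Link: https://arxiv.org/abs/2512.03465
Authors: Robert Dilworth
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 20 pages, 8 figures, 2 tables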
Abstract:In this study, we more rigorously evaluated our attack script TraceTarnish, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To ensure the efficacy and utility of our attack, we sourced, processed, and analyzed Reddit comments–comments that were later alchemized into TraceTarnish data–to gain valuable insights. The transformed TraceTarnish data was then further augmented by StyloMetrix to manufacture stylometric features–features that were culled using the Information Gain criterion, leaving only the most informative, predictive, and discriminative ones. Our results found that function words and function word types (L_FUNC_A & L_FUNC_T); content words and content word types (L_CONT_A & L_CONT_T); and the Type-Token Ratio (ST_TYPE_TOKEN_RATIO_LEMMAS) yielded significant Information-Gain readings. The identified stylometric cues–function-word frequencies, content-word distributions, and the Type-Token Ratio–serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. Similarly, these features could function as forensic beacons, alerting defenders to the presence of an adversarial stylometry attack; granted, in the absence of the original message, this signal may go largely unnoticed, as it appears to depend on a pre- and post-transformation comparison. “In trying to erase a trace, you often imprint a larger one.” Armed with this understanding, we framed TraceTarnish's operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.
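To make the pre- and post-transformation comparison concrete, the sketch below computes two of the cue families named in the abstract (a type-token ratio and a function-word rate) and reports how they shift between an original and a rewritten message. It is only a rough stand-in: the paper's features come from StyloMetrix and operate on lemmas, whereas this example uses surface word forms, a regex tokenizer, and a tiny hypothetical function-word list.

```python
import re

# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Distinct word forms divided by total tokens (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def function_word_rate(text):
    """Share of tokens that appear in FUNCTION_WORDS (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)


def stylometric_shift(before, after):
    """Difference (after minus before) for each cue."""
    return {
        "type_token_ratio": type_token_ratio(after) - type_token_ratio(before),
        "function_word_rate": function_word_rate(after) - function_word_rate(before),
    }


print(stylometric_shift("the cat saw the dog", "a feline observed one canine"))
```

A large shift in either value between the two versions is the kind of signal the abstract describes, and it also shows why the signal is hard to use when the original message is unavailable.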
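[NLP-32] Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
【Quick read】: This paper addresses the cost of fine-tuning large vision-language models (LVLMs) for visual question answering (VQA), which typically requires many real image-text pairs that are hard to collect when data is scarce or privacy-restricted. The challenge is to train effectively from cheap, widely available text alone while overcoming the image-text modality gap. The key idea is the Text-Printed Image (TPI): a synthetic image produced by rendering the textual description directly onto a plain white canvas, which projects the text into the image modality without a text-to-image generator such as a diffusion model. TPI is cheap to compute, preserves the semantics of the text, and fits into existing LVLM training pipelines. Across four models and seven benchmarks it supports more effective text-centric training than diffusion-generated synthetic images, and it also works as a low-cost data-augmentation strategy.
Link: https://arxiv.org/abs/2512.03463
Authors: Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi
Affiliations: Turing Inc.; Institute of Science Tokyo; The University of Tokyo; Tohoku University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: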
Abstract:Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
[NLP-33] PretrainZero: Reinforcement Active Pretraining
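【Quick read】: This paper targets a limitation of current reinforcement learning (RL) based reasoning models: they depend heavily on verifiable rewards in specific domains, which bottlenecks the extension of general reasoning ability. The proposed framework, PretrainZero, moves RL from domain-specific post-training to general pretraining through three components. (1) Active pretraining: inspired by human active learning, a unified reasoning policy identifies reasonable and informative content in the pretraining corpus and learns, via RL, to reason toward predicting it. (2) Self-supervised learning: without verifiable labels, pretrained reward models, or supervised fine-tuning, base models from 3B to 30B are pretrained with RL directly on the general Wikipedia corpus. (3) Verification scaling: tackling increasingly challenging masked spans strengthens the general reasoning of the base model. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and the math average benchmarks, respectively, and the resulting models can serve as reasoning foundation models for downstream RLVR tasks.
Link: https://arxiv.org/abs/2512.03442
Authors: Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; Xiaohongshu Inc.
Categories: Computation and Language (cs.CL)
Comments: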
Abstract:Mimicking human behavior to actively learning from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, i.e., software and math, but still rely heavily on verifiable rewards in specific domains, placing a significant bottleneck to extend the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from pretraining corpus, and reason to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3 to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base for 8.43, 5.96 and 10.60 on MMLU-Pro, SuperGPQA and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
[NLP-34] Dual LoRA: Enhancing LoRA with Magnitude and Direction Updates
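【Quick read】: This paper addresses the limited performance of low-rank adaptation (LoRA) when used for parameter-efficient fine-tuning (PEFT) of pre-trained large language models (LLMs), which the authors attribute to LoRA's low-rank assumption. The proposed Dual LoRA adds an inductive bias by separating the low-rank matrices into two groups: a magnitude group that controls whether and how far a parameter is updated, and a direction group that decides whether the parameter moves forward or backward, so that the update better simulates full fine-tuning with gradient-based optimizers. This is implemented by applying a ReLU function to the magnitude group and a sign function to the direction group. Across natural language generation, natural language understanding, and commonsense reasoning tasks with GPT-2, RoBERTa, DeBERTa, and LLaMA-1/2/3, Dual LoRA consistently outperforms LoRA and its state-of-the-art variants with the same number of trainable parameters.
Link: https://arxiv.org/abs/2512.03402
Authors: Yixing Xu, Chao Li, Xuanwu Yin, Spandan Tiwari, Dong Li, Ashish Sirasao, Emad Barsoum
Affiliations: Advanced Micro Devices, Inc.
Categories: Computation and Language (cs.CL)
Comments: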
Abstract:Low-rank adaptation (LoRA) is one of the most popular methods among parameter-efficient fine-tuning (PEFT) methods to adapt pre-trained large language models (LLMs) to specific downstream tasks. However, the model trained based on LoRA often has an unsatisfactory performance due to its low-rank assumption. In this paper, we propose a novel method called Dual LoRA to improve the performance by incorporating an inductive bias into the original LoRA. Specifically, we separate low-rank matrices into two groups: the magnitude group to control whether or not and how far we should update a parameter and the direction group to decide whether this parameter should move forward or backward, to better simulate the parameter updating process of the full fine-tuning based on gradient-based optimization algorithms. We show that this can be simply achieved by adding a ReLU function to the magnitude group and a sign function to the direction group. We conduct several experiments over a wide range of NLP tasks, including natural language generation (NLG), understanding (NLU), and commonsense reasoning datasets on GPT-2, RoBERTa, DeBERTa, and LLaMA-1/2/3 as baseline models. The results show that we consistently outperform LoRA and its state-of-the-art variants with the same number of trainable parameters.
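One way to read the magnitude/direction split is sketched below. The abstract only says that a ReLU is applied to the magnitude group and a sign function to the direction group; the specific combination used here, an element-wise product of `ReLU(B_mag A_mag)` and `sign(B_dir A_dir)`, is an assumption made for illustration, as are all function and argument names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise max(0, x)."""
    return [[max(0.0, x) for x in row] for row in m]


def sign(m):
    """Element-wise sign: -1, 0, or 1."""
    return [[(x > 0) - (x < 0) for x in row] for row in m]


def dual_lora_delta(b_mag, a_mag, b_dir, a_dir):
    """Weight update under the assumed form ReLU(B_mag A_mag) * sign(B_dir A_dir).

    The magnitude factor is non-negative (zero means "do not update"),
    and the direction factor chooses whether each entry moves up or down.
    """
    magnitude = relu(matmul(b_mag, a_mag))
    direction = sign(matmul(b_dir, a_dir))
    return [[m * d for m, d in zip(m_row, d_row)]
            for m_row, d_row in zip(magnitude, direction)]


# Rank-1 factors for a 2x2 weight matrix.
delta = dual_lora_delta([[1.0], [-1.0]], [[2.0, 3.0]], [[1.0], [1.0]], [[-1.0, 1.0]])
print(delta)
```

The sign function has zero gradient almost everywhere, so a trainable version would need some surrogate for the backward pass; the abstract does not say which one the authors use.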
[NLP-35] Characterizing Language Use in a Collaborative Situated Game
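【Quick read】: This paper addresses the lack of dialogue corpora that capture language use in complex collaborative settings, in particular natural language from multiple participants who must coordinate by communicating and reasoning under uncertainty. The contribution is the publicly released Portal Dialogue Corpus: 11.5 hours of spoken human dialogue (24.5K utterances) from the co-op mode of Portal 2, together with player videos, audio, transcripts, game state data, and both manual and automatic annotations. The corpus provides empirical grounding for studying phenomena that rarely appear in chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation.
Link: https://arxiv.org/abs/2512.03381
Authors: Nicholas Tomlin, Naitian Zhou, Eve Fleisig, Liangyuan (Circle) Chen, Téa Wright, Lauren Vinh, Laura X. Ma, Seun Eisape, Ellie French, Tingting Du, Tianjiao Zhang, Alexander Koller, Alane Suhr
Affiliations: UC Berkeley; NYU; Saarland University
Categories: Computation and Language (cs.CL)
Comments: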
Abstract:Cooperative video games, where multiple participants must coordinate by communicating and reasoning under uncertainty in complex environments, yield a rich source of language data. We collect the Portal Dialogue Corpus: a corpus of 11.5 hours of spoken human dialogue in the co-op mode of the popular Portal 2 virtual puzzle game, comprising 24.5K total utterances. We analyze player language and behavior, identifying a number of linguistic phenomena that rarely appear in most existing chitchat or task-oriented dialogue corpora, including complex spatial reference, clarification and repair, and ad-hoc convention formation. To support future analyses of language use in complex, situated, collaborative problem-solving scenarios, we publicly release the corpus, which comprises player videos, audio, transcripts, game state data, and both manual and automatic annotations of language data.
[NLP-36] Nexus: Higher-Order Attention Mechanisms in Transformers
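【Quick read】: This paper addresses the low-rank bottleneck of standard first-order self-attention in Transformers, which limits the ability to capture intricate, multi-hop dependencies within a single layer. The proposed Higher-Order Attention Network (Hon) refines the Query and Key representations dynamically through nested self-attention, so that tokens aggregate global context and model high-order correlations before the final attention computation. A parameter-efficient weight-sharing strategy across recursive steps keeps the additional parameter count at O(1).
Link: https://arxiv.org/abs/2512.03377
Authors: Hanting Chen, Chu Zhong, Kai Han, Yuchuan Tian, Yuchen Liang, Tianyu Guo, Xinghao Chen, Dacheng Tao, Yunhe Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: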
Abstract:Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the Higher-Order Attention Network (Hon), a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Hon dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations prior to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs O(1) additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Hon outperforms standard Transformers on multiple benchmarks.
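The idea that Queries and Keys are themselves outputs of inner attention loops can be illustrated with a small single-head example. The wiring below is one hypothetical instantiation, not the Hon architecture: the inner step lets the queries attend over the keys (and the keys over the queries) without introducing any new projection matrices, and `depth=0` falls back to ordinary attention.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Standard scaled dot-product attention for one head."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def higher_order_attention(x, w_q, w_k, w_v, depth=1):
    """Refine Q and K with `depth` inner attention steps, then attend over V.

    Assumption: the inner steps reuse Q and K directly, so no extra
    parameters are added beyond the three projections.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    for _ in range(depth):
        # Both right-hand sides are computed from the previous q and k.
        q, k = attention(q, k, q), attention(k, q, k)
    return attention(q, k, v)


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(higher_order_attention(tokens, identity, identity, identity, depth=1))
```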
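[NLP-37] LLM-Generated Ads: From Personalization Parity to Persuasion Superiority
【Quick read】: This paper examines how persuasive advertisements generated by large language models (LLMs) are, comparing a personalization strategy with universal psychological persuasion principles in two studies. The first study (n=400) generated ads tailored to personality traits (openness and neuroticism) and found no significant difference between LLM-generated and human-expert ads. The second study (n=800) tested four psychological principles (authority, consensus, cognition, and scarcity) and found that LLM-generated ads were preferred significantly more often than human-written ones (59.1% vs. 40.9%, p < 0.001), with the strongest results for authority and consensus appeals. The qualitative analysis attributes the advantage to more sophisticated, aspirational messages and better visual-narrative coherence. The preference held even after accounting for participants who correctly identified the AI origin, which, combined with the near-zero marginal cost of LLMs, has significant implications for advertising practice.
Link: https://arxiv.org/abs/2512.03373
Authors: Elyas Meguellati, Stefano Civelli, Lei Han, Abraham Bernstein, Shazia Sadiq, Gianluca Demartini
Affiliations: The University of Queensland; The University of Zurich
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: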
Abstract:As large language models (LLMs) become increasingly capable of generating persuasive content, understanding their effectiveness across different advertising strategies becomes critical. This paper presents a two-part investigation examining LLM-generated advertising through complementary lenses: (1) personality-based and (2) psychological persuasion principles. In our first study (n=400), we tested whether LLMs could generate personalized advertisements tailored to specific personality traits (openness and neuroticism) and how their performance compared to human experts. Results showed that LLM-generated ads achieved statistical parity with human-written ads (51.1% vs. 48.9%, p > 0.05), with no significant performance differences for matched personalities. Building on these insights, our second study (n=800) shifted focus from individual personalization to universal persuasion, testing LLM performance across four foundational psychological principles: authority, consensus, cognition, and scarcity. AI-generated ads significantly outperformed human-created content, achieving a 59.1% preference rate (vs. 40.9%, p < 0.001), with the strongest performance in authority (63.0%) and consensus (62.5%) appeals. Qualitative analysis revealed AI’s advantage stems from crafting more sophisticated, aspirational messages and achieving superior visual-narrative coherence. Critically, this quality advantage proved robust: even after applying a 21.2 percentage point detection penalty when participants correctly identified AI-origin, AI ads still outperformed human ads, and 29.4% of participants chose AI content despite knowing its origin. These findings demonstrate LLMs’ evolution from parity in personalization to superiority in persuasive storytelling, with significant implications for advertising practice given LLMs’ near-zero marginal cost and time requirements compared to human experts.
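[NLP-38] From Hypothesis to Premises: LLM-based Backward Logical Reasoning with Selective Symbolic Translation AAAI2026
【Quick read】: This paper addresses the inefficiency and unreliability of logical reasoning in current large language models (LLMs), in particular the redundant inference paths, hallucinated steps, and semantic drift caused by forward reasoning. The proposed framework, Hypothesis-driven Backward Logical Reasoning (HBLR), has three main parts: (1) confidence-aware symbolic translation, which converts only high-confidence spans into a logical form such as First-Order Logic (FOL) and leaves uncertain content in natural language; (2) a translation reflection module that evaluates the symbolic outputs and reverts lossy ones back to text when necessary; and (3) hypothesis-driven backward reasoning, which assumes the conclusion is true and recursively verifies its premises, paired with a reasoning reflection module that identifies and corrects flawed inference steps. On five reasoning benchmarks, HBLR outperforms strong baselines in both accuracy and efficiency.
Link: https://arxiv.org/abs/2512.03360
Authors: Qingchuan Li, Mingyue Cheng, Zirui Liu, Daoyu Wang, Yuting Zeng, Tongxuan Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI2026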
Abstract:Logical reasoning is a core challenge in natural language understanding and a fundamental capability of artificial intelligence, underpinning scientific discovery, mathematical theorem proving, and complex decision-making. Despite the remarkable progress of large language models (LLMs), most current approaches still rely on forward reasoning paradigms, generating step-by-step rationales from premises to conclusions. However, such methods often suffer from redundant inference paths, hallucinated steps, and semantic drift, resulting in inefficient and unreliable reasoning. In this paper, we propose a novel framework, Hypothesis-driven Backward Logical Reasoning (HBLR). The core idea is to integrate confidence-aware symbolic translation with hypothesis-driven backward reasoning. In the translation phase, only high-confidence spans are converted into logical form, such as First-Order Logic (FOL), while uncertain content remains in natural language. A translation reflection module further ensures semantic fidelity by evaluating symbolic outputs and reverting lossy ones back to text when necessary. In the reasoning phase, HBLR simulates human deductive thinking by assuming the conclusion is true and recursively verifying its premises. A reasoning reflection module further identifies and corrects flawed inference steps, enhancing logical coherence. Extensive experiments on five reasoning benchmarks demonstrate that HBLR consistently outperforms strong baselines in both accuracy and efficiency.
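The backward direction of reasoning (start from the conclusion and recursively verify its premises) has a classic symbolic counterpart in backward chaining. The sketch below is a toy propositional version over hand-written rules, intended only to show the control flow; it does not involve an LLM, confidence-aware translation, or the reflection modules, and the `prove` function and the example rules are hypothetical.

```python
def prove(goal, facts, rules, depth=0, max_depth=10):
    """Return True if `goal` follows from `facts` via `rules`.

    `rules` is a list of (head, body) pairs meaning "body implies head".
    The search is goal-directed: only rules whose head matches the current
    goal are explored, and `max_depth` guards against cyclic rules.
    """
    if goal in facts:
        return True
    if depth >= max_depth:
        return False
    for head, body in rules:
        if head == goal and all(
            prove(premise, facts, rules, depth + 1, max_depth) for premise in body
        ):
            return True
    return False


facts = {"rains"}
rules = [
    ("wet_ground", ["rains"]),
    ("slippery", ["wet_ground"]),
]
print(prove("slippery", facts, rules))
print(prove("sunny", facts, rules))
```

Because the search only expands rules that can conclude the current goal, it avoids deriving facts that are irrelevant to the hypothesis, which is the efficiency argument the abstract makes against forward reasoning.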
[NLP-39] Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning
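【Quick read】: This paper addresses "Topic Drift" in autoregressive language models trained with Next-Token Prediction (NTP): the generation wanders away from the initial prompt because the model relies on local associations rather than global planning. The proposed Idea-Gated Transformer separates semantic planning from syntactic generation. An auxiliary "Idea Head" is trained to predict the bag-of-words distribution of a future context window, producing a latent "Concept Vector" that gates the main vocabulary through a differentiable mechanism, suppressing semantically irrelevant tokens and pruning the search space during generation. On WikiText-103 the model reaches validation perplexity comparable to a standard GPT-2 baseline while showing significantly better Domain Retention, and the gating locks generation into specific semantic clusters (e.g., Finance, Science), offering a parameter-efficient path toward more controllable language modeling.
Link: https://arxiv.org/abs/2512.03343
Authors: Darshan Fofadiya
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code available at this https URL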
Abstract:Autoregressive Language Models (LLMs) trained on Next-Token Prediction (NTP) often suffer from "Topic Drift" where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning (Holtzman et al., 2019). While scaling model size mitigates this (Brown et al., 2020), the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary "Idea Head" trained to predict the bag-of-words distribution for a future context window, creating a latent "Concept Vector" that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real-time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
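A differentiable vocabulary gate of the kind described here can be sketched as an additive penalty on the next-token logits. The exact form is not given in the abstract, so the version below is an assumption: each vocabulary entry gets an independent sigmoid probability from the Idea Head (a bag-of-words style prediction), and the log of that probability, scaled by a hypothetical `strength` parameter, is added to the token logit.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_logits(token_logits, idea_logits, strength=1.0, eps=1e-6):
    """Suppress tokens that the Idea Head considers unlikely in the near future.

    Assumed form: gated = logit + strength * log(sigmoid(idea_logit) + eps).
    A gate probability near 1 leaves the logit almost unchanged; a probability
    near 0 pushes it strongly negative. strength=0 disables the gate.
    """
    gated = []
    for token_logit, idea_logit in zip(token_logits, idea_logits):
        gate = 1.0 / (1.0 + math.exp(-idea_logit))
        gated.append(token_logit + strength * math.log(gate + eps))
    return gated


# Tokens 0 and 1 are equally likely for the language model,
# but the Idea Head only expects token 0 in the upcoming context.
gated = gate_logits([2.0, 2.0, 0.0], [5.0, -5.0, 5.0])
print(softmax(gated))
```

Because the gate is a smooth function of the Idea Head output, gradients can flow through it during training, which is what makes this kind of pruning differentiable rather than a hard mask.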
[NLP-40] PERCS: Persona-Guided Controllable Biomedical Summarization Dataset
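【Quick read】: This paper addresses the fact that most automatic medical text simplification resources assume a single generic audience and overlook differences in medical literacy and information needs across user groups. The proposed PERCS dataset pairs biomedical abstracts with summaries tailored to four personas: laypersons, premedical students, non-medical researchers, and medical experts. Physicians reviewed every summary for factual accuracy and persona alignment, and technical validation shows clear differences in readability, vocabulary, and content depth across personas. The dataset, with benchmark results for four large language models, provides data and baselines for controllable, audience-specific biomedical summarization.
Link: https://arxiv.org/abs/2512.03340
Authors: Rohan Charudatt Salvi, Chirag Chawla, Dhruv Jain, Swapnil Panigrahi, Md Shad Akhtar, Shweta Yadav
Affiliations: University of Illinois, Chicago; Indian Institute of Technology, Varanasi; Indraprastha Institute of Information Technology, Delhi
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, 6 tables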
Abstract:Automatic medical text simplification plays a key role in improving health literacy by making complex biomedical research accessible to diverse readers. However, most existing resources assume a single generic audience, overlooking the wide variation in medical literacy and information needs across user groups. To address this limitation, we introduce PERCS (Persona-guided Controllable Summarization), a dataset of biomedical abstracts paired with summaries tailored to four personas: Laypersons, Premedical Students, Non-medical Researchers, and Medical Experts. These personas represent different levels of medical literacy and information needs, emphasizing the need for targeted, audience-specific summarization. Each summary in PERCS was reviewed by physicians for factual accuracy and persona alignment using a detailed error taxonomy. Technical validation shows clear differences in readability, vocabulary, and content depth across personas. Along with describing the dataset, we benchmark four large language models on PERCS using automatic evaluation metrics that assess comprehensiveness, readability, and faithfulness, establishing baseline results for future research. The dataset, annotation guidelines, and evaluation materials are publicly available to support research on persona-specific communication and controllable biomedical summarization.
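[NLP-41] Epistemic Substitution: How Grokipedia's AI-Generated Encyclopedia Restructures Authority
【Quick read】: This paper asks whether AI-generated and human-curated encyclopedias (Grokipedia and Wikipedia) rest on the same sources and epistemic foundations, and whether any difference signals a structural shift in how knowledge is produced. The authors run a multi-scale comparative analysis of 72 matched article pairs citing almost 60,000 sources, classify the sources with an 8-category epistemic scheme, and compare the resulting "epistemic profiles" of the two platforms. They also report a "scaling law for AI-generated knowledge sourcing": a linear relationship between article length and citation density that differs from the pattern of collective human referencing. The conclusion is that this LLM-based encyclopedia does not merely automate knowledge production but restructures how knowledge is sourced and justified.
Link: https://arxiv.org/abs/2512.03337
Authors: Aliakbar Mehdizadeh, Martin Hilbert
Affiliations: University of California, Davis
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Comments: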
Abstract:A quarter century ago, Wikipedia’s decentralized, crowdsourced, and consensus-driven model replaced the centralized, expert-driven, and authority-based standard for encyclopedic knowledge curation. The emergence of generative AI encyclopedias, such as Grokipedia, possibly presents another potential shift in epistemic evolution. This study investigates whether AI- and human-curated encyclopedias rely on the same foundations of authority. We conducted a multi-scale comparative analysis of the citation networks from 72 matched article pairs, which cite a total of almost 60,000 sources. Using an 8-category epistemic classification, we mapped the “epistemic profiles” of the articles on each platform. Our findings reveal several quantitative and qualitative differences in how knowledge is sourced and encyclopedia claims are epistemologically justified. Grokipedia replaces Wikipedia’s heavy reliance on peer-reviewed “Academic Scholarly” work with a notable increase in “User-generated” and “Civic organization” sources. Comparative network analyses further show that Grokipedia employs very different epistemological profiles when sourcing leisure topics (such as Sports and Entertainment) and more societal sensitive civic topics (such as Politics Conflicts, Geographical Entities, and General Knowledge Society). Finally, we find a “scaling-law for AI-generated knowledge sourcing” that shows a linear relationship between article length and citation density, which is distinct from collective human reference sourcing. We conclude that this first implementation of an LLM-based encyclopedia does not merely automate knowledge production but restructures it. Given the notable changes and the important role of encyclopedias, we suggest the continuation and deepening of algorithm audits, such as the one presented here, in order to understand the ongoing epistemological shifts.
[NLP-42] Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní
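【Quick read】: This paper addresses the difficulty of automating sociolinguistic and topical annotation of bilingual discourse, including low-resource settings, where analysis has traditionally relied on slow, hard-to-scale manual annotation. The authors build a large language model (LLM) assisted annotation pipeline that labels topic, genre, and discourse-pragmatic function in code-switched sentences and integrates demographic metadata, applied to two typologically distinct language pairs: Spanish-English and Spanish-Guaraní. The resulting distributions reveal systematic links between gender, language dominance, and discourse function, providing corpus-scale quantitative evidence that LLMs can recover interpretable sociolinguistic patterns previously accessible only through manual annotation.
Link: https://arxiv.org/abs/2512.03334
Authors: Nemika Tyagi, Nelvin Licona Guevara, Olga Kellert
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures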
Abstract:This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models can reliably recover interpretable sociolinguistic patterns traditionally accessible only through manual annotation, advancing computational methods for cross-linguistic and low-resource bilingual research.
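[NLP-43] Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs ICML2026
【Quick read】: This paper addresses the memorization of personally identifying information (PII) by large language models (LLMs) during training, which poses serious security and privacy risks. The proposed Randomized Masked Fine-Tuning (RMFT) is a privacy-preserving fine-tuning technique that reduces PII memorization while limiting the impact on model performance. On the Enron Email Dataset, RMFT reduces the Total Extraction Rate by 80.81% and the Seen Extraction Rate by 80.17% relative to baseline fine-tuning, at the cost of a 5.73% increase in perplexity, and outperforms deduplication. The authors also present MaxTER, a Pareto-optimal evaluation framework for privacy-utility tradeoffs, and compare RMFT with deduplication using the Area Under The Response Curve (AURC) metric.
Link: https://arxiv.org/abs/2512.03310
Authors: Kunj Joshi, David A. Smith
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: To be submitted for ICML 2026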
Abstract:The current literature on memorization in Natural Language Models, especially Large Language Models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifying information (PIIs) from training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while maintaining only a 5.73% increase in perplexity. We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.
[NLP-44] Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
【Quick read】: This paper asks whether code produced by large language model (LLM) agents under the "vibe coding" paradigm is safe enough to deploy in production. The authors build SUSVIBES, a benchmark of 200 feature-request software engineering tasks from real-world open-source projects, each of which led to a vulnerable implementation when given to human programmers. Evaluating widely used coding agents with frontier models shows that all perform poorly on security: for SWE-Agent with Claude 4 Sonnet, 61% of solutions are functionally correct but only 10.5% are secure. Preliminary mitigation strategies, such as augmenting the feature request with vulnerability hints, do not resolve these issues.
Link: https://arxiv.org/abs/2512.03262
Authors: Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li, Lei Li
Affiliations: Carnegie Mellon University; Columbia University; Johns Hopkins University; HydroX AI
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.
[NLP-45] SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
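【Quick read】: This paper addresses the reliance of process reward model (PRM) training on expensive step-level annotations or ground-truth references. The proposed SPARK is a three-stage framework. In the first stage, a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, the verification outputs serve as synthetic training data for fine-tuning generative PRMs; the resulting PRM reaches 67.5 F1 on ProcessBench, compared with 66.4 for reference-guided training and 61.9 for GPT-4o. In the third stage, the generative PRM with chain-of-thought verification (PRM-CoT) is used as the reward model for RL on mathematical reasoning, with format constraints to prevent reward hacking. With Qwen2.5-Math-7B, this reference-free training reaches 47.4% average accuracy across six mathematical reasoning benchmarks, above ground-truth-based RLVR (43.9%).
Link: https://arxiv.org/abs/2512.03244
Authors: Salman Rahman, Sruthi Gorantla, Arpit Gupta, Swastik Roy, Nanyun Peng, Yang Liu
Affiliations: Amazon AGI; UCLA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: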
Abstract:Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
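The step-level aggregation of multiple independent verifications (the parallel, self-consistency style scaling in the first stage) can be sketched as a per-step majority vote. This is a simplified illustration under stated assumptions: the label strings, the function names, and the convention of returning -1 when no step is flagged are choices made for the example, and the meta-critique stage is not modelled.

```python
from collections import Counter


def aggregate_step_labels(verifications):
    """Majority-vote the label of each step across independent verifier runs.

    `verifications` is a list of runs; each run is a list with one label per
    solution step. All runs are assumed to cover the same steps.
    """
    n_steps = len(verifications[0])
    labels = []
    for step in range(n_steps):
        votes = Counter(run[step] for run in verifications)
        labels.append(votes.most_common(1)[0][0])
    return labels


def first_error_step(labels):
    """Index of the first step voted "incorrect", or -1 if there is none."""
    for index, label in enumerate(labels):
        if label == "incorrect":
            return index
    return -1


runs = [
    ["correct", "correct", "incorrect"],
    ["correct", "incorrect", "incorrect"],
    ["correct", "correct", "incorrect"],
]
labels = aggregate_step_labels(runs)
print(labels, first_error_step(labels))
```

Using an odd number of verifier runs avoids ties in the vote; the aggregated labels are the kind of synthetic supervision that could then be used to fine-tune a process reward model.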
[NLP-46] Identifying attributions of causality in political text
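【Quick Read】: This paper addresses the lack of systematic study of causal explanations in political science, where existing approaches are fragmented and often issue-specific, which makes large-scale, structured analysis difficult. The key to the solution is a framework built on a lightweight causal language model that detects and parses causal claims in political text and returns them as cause-effect pairs in a structured data set for downstream analysis. The paper reports the method's modest annotation requirements, generalizability, and accuracy relative to human coding.
Link: https://arxiv.org/abs/2512.03214
Authors: Paulina Garcia-Corral
Affiliations: Hertie School
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: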
Abstract:Explanations are a fundamental element of how people make sense of the political world. Citizens routinely ask and answer questions about why events happen, who is responsible, and what could or should be done differently. Yet despite their importance, explanations remain an underdeveloped object of systematic analysis in political science, and existing approaches are fragmented and often issue-specific. I introduce a framework for detecting and parsing explanations in political text. To do this, I train a lightweight causal language model that returns a structured data set of causal claims in the form of cause-effect pairs for downstream analysis. I demonstrate how causal explanations can be studied at scale, and show the method’s modest annotation requirements, generalizability, and accuracy relative to human coding.
[NLP-47] InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
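【Quick Read】: This paper addresses two weaknesses of current text-to-knowledge-graph (Text2KG) methods: their reliance on iterative large language model (LLM) prompting makes them computationally expensive, and they tend to overlook complex relations distributed throughout the text. The key to the solution is InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). The pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and uses LLMs to generate the corresponding natural-language descriptions, yielding datasets of longer texts paired with larger knowledge graphs that better reflect real-world scenarios. This training data lets lightweight models build a knowledge graph in a single inference pass. On CE12k, InvertiTune outperforms larger non-fine-tuned LLMs and state-of-the-art Text2KG approaches, and it generalizes better across datasets on CrossEval-1200.
Link: https://arxiv.org/abs/2512.03197
Authors: Faezeh Faez,Marzieh S. Tahaei,Yaochen Hu,Ali Pourranjbar,Mahdi Biparva,Mark Coates,Yingxue Zhang
Affiliations: Huawei Noah’s Ark Lab; Autodesk; Ascend Team, Huawei Technologies; McGill University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: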
Abstract:Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
[NLP-48] Enhancing Job Matching: Occupation Skill and Qualification Linking with the ESCO and EQF taxonomies
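【Quick Read】: This paper addresses the classification of labor market information, specifically how to link unstructured job vacancy texts to the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and the European Qualifications Framework (EQF). The solution compares two prominent methodologies from the literature, Sentence Linking and Entity Linking, and examines different ways of using generative large language models for the task. The study also introduces two annotated datasets for evaluating how occupations and qualifications are represented in job vacancy texts, and releases an open-source tool that incorporates both linking methodologies, providing reusable computational infrastructure for analyzing labor markets in a digitally mediated economy.
Link: https://arxiv.org/abs/2512.03195
Authors: Stylianos Saroglou,Konstantinos Diamantaras,Francesco Preta,Marina Delianidi,Apostolos Benisis,Christian Johannes Meyer
Affiliations: International Hellenic University; Tabiya; University of Oxford
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 1 figure, Preprint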
Abstract:This study investigates the potential of language models to improve the classification of labor market information by linking job vacancy texts to two major European frameworks: the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy and the European Qualifications Framework (EQF). We examine and compare two prominent methodologies from the literature: Sentence Linking and Entity Linking. In support of ongoing research, we release an open-source tool, incorporating these two methodologies, designed to facilitate further work on labor classification and employment discourse. To move beyond surface-level skill extraction, we introduce two annotated datasets specifically aimed at evaluating how occupations and qualifications are represented within job vacancy texts. Additionally, we examine different ways to utilize generative large language models for this task. Our findings contribute to advancing the state of the art in job entity extraction and offer computational infrastructure for examining work, skills, and labor market narratives in a digitally mediated economy. Our code is made publicly available: this https URL
[NLP-49] Culture Affordance Atlas: Reconciling Object Diversity Through Functional Mapping
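【Quick Read】: This paper addresses cultural bias in mainstream vision-language (VL) datasets, which disproportionately favor higher-income, Western contexts; this reduces model generalizability for lower-income and non-Western communities and perpetuates performance disparities. The key to the solution is a function-centric framework that categorizes objects by the functions they fulfill across cultural and economic contexts. The authors implement it as the Culture Affordance Atlas, a re-annotated restructuring of the Dollar Street dataset spanning 46 functions and 288 objects. Experiments with CLIP show that function-centric labels reduce the performance gap between high- and low-income groups by a median of 6 percentage points, and the analysis identifies culturally essential objects that prominent VL datasets frequently overlook, offering a scalable path toward inclusive VL datasets and equitable AI systems.
Link: https://arxiv.org/abs/2512.03173
Authors: Joan Nwatu,Longju Bai,Oana Ignat,Rada Mihalcea
Affiliations: University of Michigan; University of Texas at Austin
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: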
Abstract:Culture shapes the objects people use and for what purposes, yet mainstream Vision-Language (VL) datasets frequently exhibit cultural biases, disproportionately favoring higher-income, Western contexts. This imbalance reduces model generalizability and perpetuates performance disparities, especially impacting lower-income and non-Western communities. To address these disparities, we propose a novel function-centric framework that categorizes objects by the functions they fulfill, across diverse cultural and economic contexts. We implement this framework by creating the Culture Affordance Atlas, a re-annotated and culturally grounded restructuring of the Dollar Street dataset spanning 46 functions and 288 objects publicly available at this https URL. Through extensive empirical analyses using the CLIP model, we demonstrate that function-centric labels substantially reduce socioeconomic performance gaps between high- and low-income groups by a median of 6 pp (statistically significant), improving model effectiveness for lower-income contexts. Furthermore, our analyses reveal numerous culturally essential objects that are frequently overlooked in prominent VL datasets. Our contributions offer a scalable pathway toward building inclusive VL datasets and equitable AI systems.
[NLP-50] Detecting AI Hallucinations in Finance: An Information-Theoretic Method Cuts Hallucination Rate by 92%
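【Quick Read】: This paper addresses the safety problem that large language models (LLMs) produce fluent but unsupported answers (hallucinations), which limits their deployment in high-stakes domains. The key to the solution is ECLIPSE, a framework that treats hallucination as a mismatch between the model's semantic entropy and the capacity of the available evidence. It estimates entropy through multi-sample clustering and combines it with a novel perplexity decomposition that measures how the model uses retrieved evidence. The authors prove that, under mild conditions, the entropy-capacity objective is strictly convex with a unique stable optimum. On a controlled financial question answering dataset with GPT-3.5-turbo (n=200), ECLIPSE achieves an ROC AUC of 0.89 and an average precision of 0.90, well above a semantic-entropy-only baseline (AUC 0.50). An ablation with Claude-3-Haiku, which lacks token-level log probabilities, drops to an AUC of 0.59, indicating that the method depends on calibrated token-level uncertainties; the perplexity decomposition features carry the largest learned coefficients, which points to evidence utilization as central to hallucination detection.
Link: https://arxiv.org/abs/2512.03107
Authors: Mainak Singha
Affiliations: NASA; Goddard Space Flight Center; The Catholic University of America
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
Comments: 17 pages, 7 figures. Information-theoretic, hallucination detector for financial application. Feedback from researchers and practitioners is welcome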
Abstract:Large language models (LLMs) produce fluent but unsupported answers - hallucinations - limiting safe deployment in high-stakes domains. We propose ECLIPSE, a framework that treats hallucination as a mismatch between a model’s semantic entropy and the capacity of available evidence. We combine entropy estimation via multi-sample clustering with a novel perplexity decomposition that measures how models use retrieved evidence. We prove that under mild conditions, the resulting entropy-capacity objective is strictly convex with a unique stable optimum. We evaluate on a controlled financial question answering dataset with GPT-3.5-turbo (n=200 balanced samples with synthetic hallucinations), where ECLIPSE achieves ROC AUC of 0.89 and average precision of 0.90, substantially outperforming a semantic entropy-only baseline (AUC 0.50). A controlled ablation with Claude-3-Haiku, which lacks token-level log probabilities, shows AUC dropping to 0.59 with coefficient magnitudes decreasing by 95% - demonstrating that ECLIPSE is a logprob-native mechanism whose effectiveness depends on calibrated token-level uncertainties. The perplexity decomposition features exhibit the largest learned coefficients, confirming that evidence utilization is central to hallucination detection. We position this work as a controlled mechanism study; broader validation across domains and naturally occurring hallucinations remains future work.
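The "entropy estimation via multi-sample clustering" step can be illustrated with a minimal sketch. It assumes the sampled answers have already been grouped into meaning-equivalent clusters by some external method; the cluster IDs and the function name are hypothetical, and the paper's perplexity decomposition is not reproduced here.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over clusters.

    cluster_ids: one cluster label per sampled answer, where answers with
    the same meaning share a label.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Four samples that all agree give zero entropy; an even two-way split
# gives log(2).
confident = semantic_entropy(["a", "a", "a", "a"])
split = semantic_entropy(["a", "a", "b", "b"])
```

Higher values mean the sampled answers disagree more in meaning, which is the signal the abstract compares against the capacity of the available evidence.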
[NLP-51] Alleviating Choice Supportive Bias in LLM with Reasoning Dependency Generation
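【Quick Read】: This paper addresses choice-supportive bias (CSB) in large language models (LLMs): when evaluating options, a model systematically favors the option it chose earlier, which compromises the objectivity of AI-assisted decision making. The key to the solution is Reasoning Dependency Generation (RDG), a framework that automatically constructs balanced reasoning question-answer pairs, explicitly modeling or decoupling the dependencies between choices, evidence, and justifications, to produce a large-scale unbiased reasoning dataset for fine-tuning. Models fine-tuned on RDG-generated data improve by 81.5% in memory-based experiments and by 94.3% in the evaluation-based experiment while maintaining similar performance on the standard BBQ benchmark. The authors present this as the first solution that targets CSB.
Link: https://arxiv.org/abs/2512.03082
Authors: Nan Zhuang,Wenshuo Wang,Lekai Qian,Yuxiao Wang,Boyu Cao,Qi Liu
Affiliations: School of Future Technology, South China University of Technology, Guangzhou, PRC
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: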
Abstract:Recent studies have demonstrated that some Large Language Models exhibit choice-supportive bias (CSB) when performing evaluations, systematically favoring their chosen options and potentially compromising the objectivity of AI-assisted decision making. While existing debiasing approaches primarily target demographic and social biases, methods for addressing cognitive biases in LLMs remain largely unexplored. In this work, we present the first solution to address CSB through Reasoning Dependency Generation (RDG), a novel framework for generating unbiased reasoning data to mitigate choice-supportive bias through fine-tuning. RDG automatically constructs balanced reasoning QA pairs, explicitly (un)modeling the dependencies between choices, evidence, and justifications. Our approach is able to generate a large-scale dataset of QA pairs across domains, incorporating Contextual Dependency Data and Dependency Decouple Data. Experiments show that LLMs fine-tuned on RDG-generated data demonstrate an 81.5% improvement in memory-based experiments and a 94.3% improvement in the evaluation-based experiment, while maintaining similar performance on standard BBQ benchmarks. This work pioneers an approach for addressing cognitive biases in LLMs and contributes to the development of more reliable AI-assisted decision support systems.
[NLP-52] Watermarks for Embeddings-as-a-Service Large Language Models
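【Quick Read】: This thesis addresses intellectual property protection for Embeddings-as-a-Service (EaaS) under black-box imitation attacks, in which a third party clones the service's model without access to its internals. It first shows that existing EaaS watermarks can be removed by paraphrasing the input text during the attack, exposing a new vulnerability in recent watermarking techniques. As a countermeasure, it proposes WET (Watermarking EaaS with Linear Transformation), which embeds the watermark by applying a linear transformation to the embeddings and verifies it by applying a reverse transformation and comparing the similarity between the recovered and original embeddings. Experiments show that WET is robust to paraphrasing attacks with near-perfect verifiability.
Link: https://arxiv.org/abs/2512.03079
Authors: Anudeex Shetty
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: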
Abstract:Large Language Models (LLMs) have demonstrated exceptional capabilities in natural language understanding and generation. Based on these LLMs, businesses have started to provide Embeddings-as-a-Service (EaaS), offering feature extraction capabilities (in the form of text embeddings) that benefit downstream natural language processing tasks. However, prior research has demonstrated that EaaS is vulnerable to imitation attacks, where an attacker clones the service’s model in a black-box manner without access to the model’s internal workings. In response, watermarks have been added to the text embeddings to protect the intellectual property of EaaS providers by allowing them to check for model ownership. This thesis focuses on defending against imitation attacks by investigating EaaS watermarks. To achieve this goal, we unveil novel attacks and propose and validate new watermarking techniques. Firstly, we show that existing EaaS watermarks can be removed through paraphrasing the input text when attackers clone the model during imitation attacks. Our study illustrates that paraphrasing can effectively bypass current state-of-the-art EaaS watermarks across various attack setups (including different paraphrasing techniques and models) and datasets in most instances. This demonstrates a new vulnerability in recent EaaS watermarking techniques. Subsequently, as a countermeasure, we propose a novel watermarking technique, WET (Watermarking EaaS with Linear Transformation), which employs linear transformation of the embeddings. Watermark verification is conducted by applying a reverse transformation and comparing the similarity between recovered and original embeddings. We demonstrate its robustness against paraphrasing attacks with near-perfect verifiability. We conduct detailed ablation studies to assess the significance of each component and hyperparameter in WET. 
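The verification idea in the abstract (apply a reverse transformation, then compare the recovered and original embeddings) can be sketched with a toy example. The abstract only says "linear transformation"; the block-diagonal rotation used below is a stand-in chosen because its inverse is simply its transpose, and every name and number here is hypothetical rather than WET's actual construction.

```python
import math


def rotation_matrix(dim, angle):
    """Block-diagonal orthogonal matrix of 2x2 rotations (dim must be even)."""
    matrix = [[0.0] * dim for _ in range(dim)]
    c, s = math.cos(angle), math.sin(angle)
    for i in range(0, dim, 2):
        matrix[i][i], matrix[i][i + 1] = c, -s
        matrix[i + 1][i], matrix[i + 1][i + 1] = s, c
    return matrix


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


secret = rotation_matrix(4, 0.7)        # provider's secret transform (assumed)
original = [0.5, -1.0, 2.0, 0.25]       # embedding before watermarking
served = matvec(secret, original)       # watermarked embedding sent to users
recovered = matvec(transpose(secret), served)  # reverse transform
similarity = cosine(recovered, original)       # close to 1.0 when it matches
```

In this toy setup a party without the secret matrix sees only `served`, whose cosine similarity to `original` is cos(0.7), noticeably below 1.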
[NLP-53] Entropy-Based Measurement of Value Drift and Alignment Work in Large Language Models
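【Quick Read】: This paper addresses safety failures of large language models (LLMs) that static benchmarks cannot capture because they are dynamic: value drift under distribution shift, jailbreak attacks, and slow degradation of alignment in deployment. The key to the solution is to treat ethical entropy as a measurable state variable and to make the "Second Law of Intelligence" framework operational. The authors define a five-way behavioral taxonomy and train a classifier to estimate ethical entropy S(t) from model transcripts, then track entropy dynamics for base and instruction-tuned variants of four frontier models across stress tests. Base models show sustained entropy growth, while tuned variants suppress drift and reduce ethical entropy by roughly 80%. Finally, they estimate an effective alignment work rate γ_eff and embed S(t) and γ_eff in a monitoring pipeline that raises alerts when entropy drift exceeds a stability threshold, enabling run-time oversight of value drift.
Link: https://arxiv.org/abs/2512.03047
Authors: Samih Fadli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages. Companion paper to “The Second Law of Intelligence: Controlling Ethical Entropy in Autonomous Systems”. Code and tools: this https URL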
Abstract:Large language model safety is usually assessed with static benchmarks, but key failures are dynamic: value drift under distribution shift, jailbreak attacks, and slow degradation of alignment in deployment. Building on a recent Second Law of Intelligence that treats ethical entropy as a state variable which tends to increase unless countered by alignment work, we make this framework operational for large language models. We define a five-way behavioral taxonomy, train a classifier to estimate ethical entropy S(t) from model transcripts, and measure entropy dynamics for base and instruction-tuned variants of four frontier models across stress tests. Base models show sustained entropy growth, while tuned variants suppress drift and reduce ethical entropy by roughly eighty percent. From these trajectories we estimate an effective alignment work rate gamma_eff and embed S(t) and gamma_eff in a monitoring pipeline that raises alerts when entropy drift exceeds a stability threshold, enabling run-time oversight of value drift.
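A monitoring loop of the kind the abstract describes might look like the sketch below. The abstract does not define S(t) precisely, so this assumes it is the Shannon entropy of the classifier's behavior labels within a time window; the label names, the baseline choice, and the threshold are all hypothetical.

```python
import math
from collections import Counter


def behavior_entropy(labels):
    """Shannon entropy (in nats) of behavior labels observed in one window."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def drift_alerts(windows, threshold):
    """Indices of windows whose entropy exceeds the first window's by more
    than `threshold`."""
    baseline = behavior_entropy(windows[0])
    return [
        index
        for index, window in enumerate(windows)
        if behavior_entropy(window) - baseline > threshold
    ]


# Three hypothetical windows of classifier outputs over time.
windows = [
    ["aligned"] * 10,
    ["aligned"] * 9 + ["evasive"],
    ["aligned"] * 5 + ["harmful"] * 5,
]
alerts = drift_alerts(windows, threshold=0.5)
```

Only the last window drifts far enough from the baseline to trigger an alert in this toy data.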
[NLP-54] Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting ECML KDD2025
【Quick Read】: This paper addresses the weak performance of large language models (LLMs) on complex multi-step reasoning tasks. The key to the solution is the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, which combines Chain of Thought (CoT), self-reflection, and auto-prompting in an iterative refinement process: the model first generates a solution with CoT prompting; when errors are detected, an adaptive self-reflection mechanism identifies and analyzes them and generates tailored prompts that guide the correction. This lets general-purpose LLMs reach performance comparable to specialized reasoning models without dedicated training, and limiting the reflection depth balances reasoning accuracy against token cost.
Link: https://arxiv.org/abs/2506.23888
Authors: André de Souza Loureiro,Jorge Valverde-Rebaza,Julieta Noguez,David Escarcega,Ricardo Marcacini
Affiliations: Luiz de Queiroz College of Agriculture; University of São Paulo; School of Engineering and Sciences; Tecnologico de Monterrey; Institute of Mathematics and Computer Sciences
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication in: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025). Research Track
Abstract:Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.
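The iterative refinement loop described in the abstract can be outlined as follows. This is a rough sketch, not the authors' implementation: `generate`, `check`, and `reflect` are hypothetical callables standing in for model calls and for whatever error detection the framework uses, and the prompt wording is invented for illustration.

```python
def solve_with_reflection(question, generate, check, reflect, max_layers=3):
    """Iteratively refine an answer; returns (answer, reflection_layers_used)."""
    base_prompt = f"Solve step by step: {question}"
    answer = generate(base_prompt)  # initial chain-of-thought attempt
    for layer in range(max_layers):
        if check(question, answer):
            return answer, layer
        # Reflect on the failed attempt and build a tailored follow-up prompt.
        critique = reflect(question, answer)
        answer = generate(f"{base_prompt}\nAvoid this earlier mistake: {critique}")
    return answer, max_layers  # depth cap reached; last answer is unchecked


# Stub callables standing in for real model calls (illustrative only).
def fake_generate(prompt):
    return "4" if "Avoid" in prompt else "5"


def fake_check(question, answer):
    return answer == "4"


def fake_reflect(question, answer):
    return f"the answer {answer} does not satisfy 2 + 2"


result = solve_with_reflection(
    "What is 2 + 2?", fake_generate, fake_check, fake_reflect
)
```

The `max_layers` cap mirrors the cost trade-off noted in the abstract: each extra reflection layer costs another generation call.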
Computer Vision
[CV-0] Unique Lives Shared World: Learning from Single-Life Videos
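【Quick Read】: This paper asks how egocentric (first-person) video captured naturally in one individual's daily life can be used to learn geometrically consistent, generalizable visual representations, in place of the usual self-supervised paradigm that relies on large, diverse web data. The key to the solution is the "single-life" learning paradigm: a vision model is trained exclusively on egocentric videos captured by one individual, using the multiple viewpoints that occur naturally within a single life to learn a visual encoder in a self-supervised manner. Experiments show that models trained independently on different lives develop a highly aligned geometric understanding, that the learned representations transfer to downstream tasks such as depth estimation in unseen environments, and that training on up to 30 hours from one week of a person's life performs comparably to training on 30 hours of diverse web data, which highlights the shared structure of the world as a strong signal for visual representation learning.
Link: https://arxiv.org/abs/2512.04085
Authors: Tengda Han,Sayna Ebrahimi,Dilara Gokay,Li Yang Ku,Maks Ovsjanikov,Iva Babukova,Daniel Zoran,Viorica Patraucean,Joao Carreira,Andrew Zisserman,Dima Damen
Affiliations: Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: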
Abstract:We introduce the “single-life” learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person’s life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.
[CV-1] SimFlow: Simplified and End-to-End Training of Latent Normalizing Flows
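【Quick Read】: This paper addresses two limitations of normalizing flows (NFs) for image generation. First, prior methods add random noise to training samples or variational autoencoder (VAE) latents as data augmentation, which introduces complex pipelines with extra noising and denoising steps. Second, they use a pretrained, frozen VAE encoder, which limits reconstruction and generation quality. The core of the solution is to fix the variance that the VAE encoder would otherwise predict to a constant (for example, 0.5). This lets the encoder output a broader distribution of latent tokens and lets the decoder learn to reconstruct clean images from that augmented distribution, with no additional noise or denoising design; it also simplifies the VAE evidence lower bound (ELBO), which makes joint training of the NF and the VAE stable. On ImageNet 256×256 generation, SimFlow obtains a gFID of 2.15, outperforming STARFlow (2.40), and reaches 1.91 when combined with REPA-E, a new state of the art among NFs.
Link: https://arxiv.org/abs/2512.04084
Authors: Qinyu Zhao,Guangting Zheng,Tao Yang,Rui Zhu,Xingjian Leng,Stephen Gould,Liang Zheng
Affiliations: Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL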
Abstract:Normalizing Flows (NFs) learn invertible mappings between the data and a Gaussian distribution. Prior works usually suffer from two limitations. First, they add random noise to training samples or VAE latents as data augmentation, introducing complex pipelines including extra noising and denoising steps. Second, they use a pretrained and frozen VAE encoder, resulting in suboptimal reconstruction and generation quality. In this paper, we find that the two issues can be solved in a very simple way: just fixing the variance (which would otherwise be predicted by the VAE encoder) to a constant (e.g., 0.5). On the one hand, this method allows the encoder to output a broader distribution of tokens and the decoder to learn to reconstruct clean images from the augmented token distribution, avoiding additional noise or denoising design. On the other hand, fixed variance simplifies the VAE evidence lower bound, making it stable to train an NF with a VAE jointly. On the ImageNet 256×256 generation task, our model SimFlow obtains a gFID score of 2.15, outperforming the state-of-the-art method STARFlow (gFID 2.40). Moreover, SimFlow can be seamlessly integrated with the end-to-end representation alignment (REPA-E) method and achieves an improved gFID of 1.91, setting a new state of the art among NFs.
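One way to see why a fixed variance simplifies the evidence lower bound is the standard VAE decomposition below. It is a textbook identity written under the assumption of an isotropic Gaussian posterior with d latent dimensions, not necessarily the paper's exact derivation; the symbols μ_φ, σ², and d are introduced here for illustration.

```latex
% ELBO split into reconstruction, prior, and posterior-entropy terms
\mathrm{ELBO}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  + \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p(z)\bigr]
  + \mathcal{H}\bigl[q_\phi(z \mid x)\bigr]

% With q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \sigma^2 I) and fixed \sigma^2:
\mathcal{H}\bigl[q_\phi(z \mid x)\bigr]
  = \frac{d}{2}\,\log\bigl(2\pi e\,\sigma^2\bigr)
```

Because the entropy term no longer depends on the encoder parameters, it drops out of the gradient, leaving only the reconstruction term and the expected log-density of the latents under the prior, which a jointly trained flow could model.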
[CV-2] PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design
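【Quick Read】: This paper addresses two shortcomings of current graphic design automation based on Large Multimodal Models (LMMs): geometrically inaccurate layouts and the lack of the iterative, layer-specific editing that professional workflows require. The key to the solution is the PosterCopilot framework, which has two main parts. First, a progressive three-stage training strategy (Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback) gives the LMM geometric understanding and aesthetic reasoning for layout design. Second, a complete workflow couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing that refines individual elements while maintaining global visual consistency.
Link: https://arxiv.org/abs/2512.04082
Authors: Jiazhe Wei,Ken Li,Tianyu Lao,Haofan Wang,Liang Wang,Caifeng Shan,Chenyang Si
Affiliations: PRLab, Nanjing University; LibLib.ai; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL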
Abstract:Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet existing methods often produce geometrically inaccurate layouts and lack the iterative, layer-specific editing required in professional workflows. To address these limitations, we present PosterCopilot, a framework that advances layout reasoning and controllable editing for professional graphic design. Specifically, we introduce a progressive three-stage training strategy that equips LMMs with geometric understanding and aesthetic reasoning for layout design, consisting of Perturbed Supervised Fine-Tuning, Reinforcement Learning for Visual-Reality Alignment, and Reinforcement Learning from Aesthetic Feedback. Furthermore, we develop a complete workflow that couples the trained LMM-based design model with generative models, enabling layer-controllable, iterative editing for precise element refinement while maintaining global visual consistency. Extensive experiments demonstrate that PosterCopilot achieves geometrically accurate and aesthetically superior layouts, offering unprecedented controllability for professional iterative design.
[CV-3] Radiance Meshes for Volumetric Reconstruction
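【Quick Read】: This paper addresses the trade-off between rendering efficiency and topological stability in radiance field representations. The key to the solution is radiance meshes, which represent a radiance field with constant-density tetrahedral cells produced by a Delaunay tetrahedralization. Because the cells are bounded by simple triangles that existing hardware supports natively, the model can perform exact and fast volume rendering with both rasterization and ray tracing. Optimizing the positions of the Delaunay vertices causes topological discontinuities (edge flips), so the method uses a Zip-NeRF-style backbone that expresses a smoothly varying field even when the topology changes. The result is high-quality, real-time view synthesis that also supports downstream applications such as fisheye lens distortion, physics-based simulation, editing, and mesh extraction.
Link: https://arxiv.org/abs/2512.04076
Authors: Alexander Mai,Trevor Hedstrom,George Kopanas,Janne Kontkanen,Falko Kuester,Jonathan T. Barron
Affiliations: University of California, San Diego; Google
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this http URL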
Abstract:We introduce radiance meshes, a technique for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization. Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieves faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms. Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes. Our rendering method exactly evaluates the volume rendering equation and enables high quality, real-time view synthesis on standard consumer hardware. Our tetrahedral meshes also lend themselves to a variety of exciting applications including fisheye lens distortion, physics-based simulation, editing, and mesh extraction.
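Constant-density cells are what make the volume rendering integral exact: along a ray, each cell contributes a closed-form opacity of 1 - exp(-density × length). The sketch below shows that standard emission-absorption accumulation for one ray. It assumes the ray-cell intersection lengths are already known and that color is constant within a cell, which is a simplification made here and may differ from the paper's model.

```python
import math


def render_ray(segments):
    """Exact emission-absorption integral for piecewise-constant cells.

    segments: list of (density, length, color) in front-to-back order,
    where color is an (r, g, b) tuple.
    Returns (accumulated_color, remaining_transmittance).
    """
    transmittance = 1.0
    color = [0.0, 0.0, 0.0]
    for density, length, rgb in segments:
        alpha = 1.0 - math.exp(-density * length)  # opacity of this cell
        for channel in range(3):
            color[channel] += transmittance * alpha * rgb[channel]
        transmittance *= 1.0 - alpha  # light surviving past this cell
    return tuple(color), transmittance


# An empty red cell followed by a green cell that absorbs half the light.
pixel, leftover = render_ray([
    (0.0, 1.0, (1.0, 0.0, 0.0)),
    (math.log(2), 1.0, (0.0, 1.0, 0.0)),
])
```

No sampling or numerical quadrature is involved, so the result does not depend on a step size.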
[CV-4] SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
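【Quick Read】: This paper addresses the lack of metrically precise spatial reasoning in Vision Language Models (VLMs) for embodied applications, particularly when several tools must be used together: existing approaches rely on handcrafted prompting strategies or fixed tool pipelines, which limit the model's ability to discover optimal tool-use patterns. The key to the solution is Double Interactive Reinforcement Learning (DIRL), a two-phase training framework. The teaching phase combines demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model that uses all tools; the exploration phase further refines multi-tool coordination through continued RL. The resulting model, SpaceTools, achieves state-of-the-art results on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation with a 7-DOF robot.
Link: https://arxiv.org/abs/2512.04069
Authors: Siyi Chen,Mikaela Angelina Uy,Chan Hee Song,Faisal Ladhak,Adithyavairavan Murali,Qing Qu,Stan Birchfield,Valts Blukis,Jonathan Tremblay
Affiliations: University of Michigan; The Ohio State University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: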
Abstract:Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs’ ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: this https URL.
[CV-5] RELIC: Interactive Video World Model with Long-Horizon Memory
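【Quick Read】: This paper addresses the joint realization of three requirements of interactive world modeling: real-time long-horizon streaming, consistent spatial memory, and precise user control. Most existing approaches handle only one of these in isolation; for example, long-term memory mechanisms often degrade real-time performance. The key to the solution is RELIC, a unified framework with two main innovations. First, building on autoregressive video-diffusion distillation, it represents long-horizon memory as highly compressed historical latent tokens, encoded with relative actions and absolute camera poses within the KV cache, which supports implicit 3D-consistent content retrieval with minimal computational overhead. Second, a new memory-efficient self-forcing paradigm turns a fine-tuned bidirectional teacher video model into a causal student generator, enabling full-context distillation over long teacher sequences and long student self-rollouts. As a 14B-parameter model, RELIC generates in real time at 16 FPS and improves action following, long-horizon streaming stability, and spatial-memory retrieval over prior work.
Link: https://arxiv.org/abs/2512.04040
Authors: Yicong Hong,Yiqun Mei,Chongjian Ge,Yiran Xu,Yang Zhou,Sai Bi,Yannick Hold-Geoffroy,Mike Roberts,Matthew Fisher,Eli Shechtman,Kalyan Sunkavalli,Feng Liu,Zhengqi Li,Hao Tan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages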
Abstract:A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges together. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
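[CV-6] Fast Efficient Normalizing Flows and Applications of Image Generative Models
[Quick Read]: This thesis addresses two core problems concerning the efficiency and practical use of generative models: improving the computational efficiency and scalability of generative models, normalizing flows in particular, and applying them to real-world computer vision tasks such as agricultural quality assessment, geological mapping, privacy protection for autonomous driving data, and art restoration. The solution has two parts. First, it proposes six innovations to normalizing flow architectures, including invertible 3×3 convolution layers with mathematically proven invertibility conditions, an efficient Quad-coupling layer, a fast parallel inversion algorithm for k×k convolutional layers, a fast backpropagation algorithm for the inverse of convolution, and Inverse-Flow, which uses the inverse of convolution in the forward pass, with the goal of lowering parameter counts and speeding up computation. Second, it pairs generative models with application-specific needs: conditional GANs to handle data scarcity and class imbalance, stacked autoencoders for unsupervised feature extraction, Stable Diffusion-based inpainting for privacy protection, and a unified fine-tuning strategy that makes a diffusion model robust to multiple types of degradation.
Link: https://arxiv.org/abs/2512.04039
Authors: Sandeep Nagar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: PhD Thesis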
Abstract:This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: 1) invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; 2) a more efficient Quad-coupling layer; 3) a fast and efficient parallel inversion algorithm for kxk convolutional layers; 4) a fast and efficient backpropagation algorithm for the inverse of convolution; 5) Inverse-Flow, which uses the inverse of convolution for the forward pass and is trained with the proposed backpropagation algorithm; and 6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents: 1) an automated quality assessment system for agricultural produce using conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; 2) an unsupervised geological mapping framework using stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; 3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; 4) Stable Diffusion-based image inpainting that replaces detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and 5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
Cite as: arXiv:2512.04039 [cs.CV] (or arXiv:2512.04039v1 [cs.CV] for this version), https://doi.org/10.48550/arXiv.2512.04039
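The abstract does not define the Quad-coupling layer, so the block below is only background: a minimal sketch of the standard affine coupling layer that normalizing flows commonly build on, not the thesis's own layer. The functions `scale_fn` and `shift_fn` are hypothetical stand-ins for learned networks, and the input is assumed to have even length.

```python
import math

def scale_fn(x1):
    # Hypothetical stand-in for a learned network: log-scale computed from x1.
    return [0.5 * math.tanh(v) for v in x1]

def shift_fn(x1):
    # Hypothetical stand-in for a learned network: shift computed from x1.
    return [0.1 * v for v in x1]

def coupling_forward(x):
    """Standard affine coupling: the first half passes through unchanged,
    the second half is scaled and shifted using functions of the first half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    log_det = sum(s)  # log |det Jacobian| of the transform
    return x1 + y2, log_det

def coupling_inverse(y):
    """Exact inverse: recompute s and t from the untouched half and undo the map."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2

x = [0.3, -1.2, 2.0, 0.7]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

The point of the design is that the inverse never has to invert `scale_fn` or `shift_fn`, which is why such layers can be inverted cheaply and have a log-determinant that is a plain sum.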
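[CV-7] PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
[Quick Read]: This paper targets the scaling bottleneck that quadratic complexity imposes on attention in foundation models, and in particular the information loss that current sparse attention methods incur under high sparsity because they use binary masks. The key to the solution is the Pyramid Sparse Attention (PSA) module, which introduces multi-level pooled key-value (KV) representations to enable finer mask granularity: each query block dynamically assigns lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning that mitigates information loss while preserving computational efficiency. The design is analogous to fixed-point quantization and classical feature pyramid networks in computer vision, and is paired with a hardware-friendly kernel based on a decoupled block-tile design. On video understanding and generation tasks, PSA outperforms or matches existing sparse attention baselines with a better efficiency-quality trade-off.
Link: https://arxiv.org/abs/2512.04025
Authors: Xiaolong Li, Youping Gu, Xi Lin, Weijie Wang, Bohan Zhuang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Tech report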
Abstract:Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To close this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages a decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or matching existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: this http URL
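To make the idea of multi-level pooling concrete, here is a toy sketch of allocating pooling levels to KV blocks by importance. It is an illustration under assumptions, not the paper's kernel: keys are scalars rather than vectors, the importance `scores` for one query block are taken as given, and the rank-based rule in `assign_levels` is a hypothetical choice.

```python
def avg_pool(block, factor):
    """Average-pool a block of scalars in non-overlapping windows of `factor`."""
    return [sum(block[i:i + factor]) / len(block[i:i + factor])
            for i in range(0, len(block), factor)]

def assign_levels(scores, num_levels):
    """Rank blocks by importance: the most important get level 0 (no pooling),
    the least important get the coarsest level (hypothetical allocation rule)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    levels = [0] * len(scores)
    for rank, idx in enumerate(order):
        levels[idx] = min(rank * num_levels // len(scores), num_levels - 1)
    return levels

def pyramid_compress(blocks, scores, num_levels=3):
    """Pool each block by 2**level, so less important blocks keep fewer tokens."""
    levels = assign_levels(scores, num_levels)
    return [avg_pool(b, 2 ** lvl) for b, lvl in zip(blocks, levels)], levels

blocks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
scores = [0.9, 0.1, 0.5]  # assumed importance of each KV block for one query block
compressed, levels = pyramid_compress(blocks, scores)
```

In this example the 12 original tokens shrink to 7: the top block is kept whole, the middle-ranked block is halved, and the lowest-ranked block collapses to a single pooled token instead of being dropped outright.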
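[CV-8] C3G: Learning Compact 3D Representations with 2K Gaussians
[Quick Read]: This paper addresses redundant representations and inefficient feature aggregation when reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner. Existing methods typically reconstruct with per-pixel 3D Gaussian Splatting and then apply a 2D-to-3D feature lifting stage for scene understanding, but they generate many redundant Gaussians, causing high memory overhead and poor multi-view feature aggregation, which in turn degrades novel view synthesis and scene understanding. The key to the solution is the C3G framework, which estimates compact 3D Gaussians only at essential spatial locations to minimize redundancy; it introduces learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, and reuses the learned attention patterns when decoding Gaussians to lift features efficiently.
Link: https://arxiv.org/abs/2512.04021
Authors: Honggyu An, Jaewoo Jung, Mungyeom Kim, Sunghwan Hong, Chaehyun Kim, Kazumi Fukuda, Minkyeong Jeon, Jisang Han, Takuya Narihira, Hyuna Ko, Junsu Kim, Yuki Mitsufuji, Seungryong Kim
Affiliations: KAIST AI; ETH AI Center, ETH Zürich; SONY AI; Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL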
Abstract:Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach’s effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.
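[CV-9] Ultra-lightweight Neural Video Representation Compression
[Quick Read]: This paper addresses the trade-off between performance and computational complexity of lightweight implicit neural representations (INRs) for video compression, as well as the inefficiency of the entropy coding module in existing INR codecs. The solution has two key parts. First, multi-scale feature grids are integrated into the lightweight neural representation, where higher-resolution grids significantly improve reconstruction quality at low complexity. Second, an octree-based context model is proposed for entropy coding high-dimensional feature grids, replacing slow autoregressive models and greatly speeding up coding. The resulting codec, NVRC-Lite, achieves 8.4x encoding and 2.5x decoding speedup, with BD-rate savings of up to 21.03% (PSNR) and 23.06% (MS-SSIM) over C3, one of the best lightweight INR-based video codecs.
Link: https://arxiv.org/abs/2512.04019
Authors: Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Mike Nilsson, Andrew Gower, David Bull
Affiliations: University of Bristol; BT
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: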
Abstract:Recent works have demonstrated the viability of utilizing over-fitted implicit neural representations (INRs) as alternatives to autoencoder-based models for neural video compression. Among these INR-based video codecs, Neural Video Representation Compression (NVRC) was the first to adopt a fully end-to-end compression framework that compresses INRs, achieving state-of-the-art performance. Moreover, some recently proposed lightweight INRs have shown comparable performance to their baseline codecs with computational complexity lower than 10kMACs/pixel. In this work, we extend NVRC toward lightweight representations, and propose NVRC-Lite, which incorporates two key changes. Firstly, we integrated multi-scale feature grids into our lightweight neural representation, and the use of higher resolution grids significantly improves the performance of INRs at low complexity. Secondly, we address the issue that existing INRs typically leverage autoregressive models for entropy coding: these are effective but impractical due to their slow coding speed. In this work, we propose an octree-based context model for entropy coding high-dimensional feature grids, which accelerates the entropy coding module of the model. Our experimental results demonstrate that NVRC-Lite outperforms C3, one of the best lightweight INR-based video codecs, with up to 21.03% and 23.06% BD-rate savings when measured in PSNR and MS-SSIM, respectively, while achieving 8.4x encoding and 2.5x decoding speedup. The implementation of NVRC-Lite will be made available.
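[CV-10] Learning Group Actions In Disentangled Latent Image Representations
[Quick Read]: This paper addresses the difficulty existing methods have in automatically discovering transformation-relevant structure when learning group actions on image data. Traditional methods usually operate in the high-dimensional data space, which makes it hard to isolate the subspace affected by a transformation; latent-space methods are more flexible but rely on manually partitioning latent variables into equivariant and invariant subspaces, limiting robust learning of group actions in the representation space. The key to the solution is an end-to-end framework that, for the first time, automatically learns group actions on latent image manifolds: learnable binary masks with straight-through estimation dynamically partition the latent representation into transformation-sensitive and invariant components, and a unified optimization framework jointly learns latent disentanglement and the group transformation mappings, so that structure relevant to the group action is identified and modeled without manual intervention.
Link: https://arxiv.org/abs/2512.04015
Authors: Farhana Hossain Swarnali, Miaomiao Zhang, Tonmoy Hossain
Affiliations: Genuity Systems Limited; University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: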
Abstract:Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works applying group-theoretic priors or modeling transformations typically operate in the high-dimensional data space, where group actions apply uniformly across the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manual partitioning of latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and operate group actions within the representation space. To address this, we introduce a novel end-to-end framework that for the first time learns group actions on latent image manifolds, automatically discovering transformation-relevant structures without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings. The framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions in diverse data, while downstream classification tasks confirm the effectiveness of the learned representations. Our code is publicly available at this https URL .
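As a rough illustration of a learnable binary mask with straight-through estimation, the sketch below shows one common variant written out by hand. It is not the authors' implementation: the threshold at 0.5, the sigmoid surrogate gradient, and all function names are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hard_mask(logits):
    """Forward pass: binarize the mask probabilities at 0.5."""
    return [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]

def ste_grad(logits, upstream_grad):
    """Backward pass (straight-through): treat the hard threshold as identity,
    so the gradient reaches the logits through the sigmoid only."""
    return [g * sigmoid(z) * (1.0 - sigmoid(z))
            for z, g in zip(logits, upstream_grad)]

def split_latent(latent, mask):
    """Partition a latent vector into transformation-sensitive and invariant parts."""
    varying = [m * v for m, v in zip(mask, latent)]
    invariant = [(1.0 - m) * v for m, v in zip(mask, latent)]
    return varying, invariant

logits = [2.0, -1.0, 0.3, -3.0]   # learnable mask parameters (toy values)
latent = [0.5, -0.2, 1.5, 0.8]    # one latent code (toy values)
mask = hard_mask(logits)
varying, invariant = split_latent(latent, mask)
grads = ste_grad(logits, [1.0, 1.0, 1.0, 1.0])
```

Without the straight-through trick the threshold would have zero gradient almost everywhere, so the mask logits could not be trained; the surrogate gradient is what lets the partition be learned jointly with the rest of the model.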
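[CV-11] Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
[Quick Read]: This paper addresses the performance degradation in 3D reconstruction from in-the-wild image collections caused by "noisy" images, i.e., irrelevant inputs with little view overlap with the other images. Traditional Structure-from-Motion (SfM) pipelines mitigate this through geometric verification and outlier rejection, but feed-forward 3D reconstruction models lack such explicit mechanisms and therefore perform poorly in real, complex scenes. The key finding is that existing feed-forward models such as VGGT, despite having no noise-aware training or explicit outlier-rejection module, spontaneously exhibit outlier-suppressing behavior in a specific network layer; further analysis shows that this layer encodes discriminative internal representations that provide effective noise filtering. Using only this implicit filtering mechanism, the authors reject outlier views without any additional fine-tuning or supervision, with consistent behavior and good generalization on both controlled and in-the-wild datasets.
Link: https://arxiv.org/abs/2512.04012
Authors: Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, Chen Feng
Affiliations: KAIST AI; New York University; ETH AI Center, ETH Zurich; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL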
Abstract:Reliable 3D reconstruction from in-the-wild image collections is often hindered by “noisy” images, i.e., irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that existing feed-forward reconstruction models, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios.
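[CV-12] On the Temporality for Sketch Representation Learning
[Quick Read]: This paper asks whether hand-drawn sketches should be treated as sequential data in representation learning, and which internal orders matter most for representation quality. Its core contribution is an empirical analysis showing that, although traditional positional encodings are valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones; non-autoregressive decoders outperform autoregressive decoders; and the importance of temporality depends on both the order considered and the task evaluated.
Link: https://arxiv.org/abs/2512.04007
Authors: Marcelo Isaias de Moraes Junior, Moacir Antonelli Ponti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: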
Abstract:Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.
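For readers unfamiliar with the two coordinate conventions being compared, the sketch below converts a stroke between absolute points and relative offsets. The helper names and the convention of keeping the first point absolute are assumptions for illustration; the abstract does not specify the exact encoding used in the experiments.

```python
def to_relative(points):
    """Convert absolute (x, y) points to offsets from the previous point.
    The first point is kept absolute so the stroke can be reconstructed."""
    rel = [points[0]]
    for (px, py), (x, y) in zip(points, points[1:]):
        rel.append((x - px, y - py))
    return rel

def to_absolute(offsets):
    """Invert to_relative by accumulating the offsets."""
    abs_pts = [offsets[0]]
    for dx, dy in offsets[1:]:
        x, y = abs_pts[-1]
        abs_pts.append((x + dx, y + dy))
    return abs_pts

stroke = [(10, 10), (12, 15), (12, 20), (7, 18)]
relative = to_relative(stroke)
restored = to_absolute(relative)
```

Both forms carry the same information, but in the relative form an error in one offset shifts every later point once the stroke is accumulated, whereas absolute coordinates localize each point independently.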
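[CV-13] Divide then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
[Quick Read]: This paper addresses the high computational cost that large multimodal models (LMMs) face in long-form video understanding due to limited context lengths and dense video frame processing. Existing methods mostly rely on query-aware frame selection, which often incurs significant computational overhead. The key insight is a distinction between two query types, global queries and localized queries, whose frame-selection needs differ fundamentally: efficient uniform sampling works well for global queries, whereas localized queries require query-aware frame extraction. Building on this, the authors propose DIG, a training-free framework that switches strategy by query type, using uniform sampling for global queries and activating a specialized pipeline to extract relevant frames for localized queries, which improves LMM performance on long-form video understanding while remaining efficient.
Link: https://arxiv.org/abs/2512.04000
Authors: Jialuo Li, Bin Li, Jiahao Li, Yan Lu
Affiliations: Tsinghua University; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: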
Abstract:The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
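The routing idea can be sketched in a few lines. This is a simplified illustration, not DIG itself: the query type and the per-frame `relevance` scores are assumed to be given, and plain top-k selection stands in for the paper's specialized pipeline for localized queries.

```python
def uniform_sample(num_frames, k):
    """Pick k frame indices evenly spaced over the video."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]

def query_aware_select(relevance, k):
    """Pick the k frames with the highest query relevance, in temporal order."""
    top = sorted(range(len(relevance)), key=lambda i: -relevance[i])[:k]
    return sorted(top)

def select_frames(query_type, relevance, k):
    """Route by query type: cheap uniform sampling for global queries,
    relevance-based selection for localized ones."""
    if query_type == "global":
        return uniform_sample(len(relevance), k)
    return query_aware_select(relevance, k)

relevance = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.3, 0.7]  # toy per-frame scores
global_frames = select_frames("global", relevance, 4)
local_frames = select_frames("localized", relevance, 4)
```

The global branch never needs the relevance scores at all, which is where the saving comes from: in a real system they would only be computed for queries routed to the localized branch.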
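[CV-14] Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation
[Quick Read]: This paper addresses the unclear role that noise randomness plays in generation quality and diversity in test-time scaling (TTS) for text-to-image (T2I) diffusion models. Existing work focuses on search strategies and reward model design and overlooks how the stochastic nature of noise in the diffusion process affects the results, in particular the ability to control high-frequency detail. The key to the solution is a new form of randomness, text embedding perturbation, which complements the spatial noise injected by SDE sampling: text embedding perturbation mainly enhances high-frequency details (later steps), while spatial noise governs low-frequency structure (early steps), and together they improve quality and diversity. In addition, the perturbation intensity is adapted according to each dimension's frequency-specific contribution to generation and its tolerance to perturbation, yielding significant gains on multiple benchmarks with almost no additional computation.
Link: https://arxiv.org/abs/2512.03996
Authors: Hang Xu, Linjiang Huang, Feng Zhao
Affiliations: MoE Key Lab of BIPC, USTC; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: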
Abstract:Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, while the impact of the stochastic characteristics of noise in T2I diffusion models on the method’s performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new form of randomness for TTS: text embedding perturbation, which couples with existing randomness such as SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these forms of randomness and their impact on generation, and find that the two exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embedding demonstrates varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; (2) adapting the perturbation intensity selectively based on each dimension’s frequency-specific contribution to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at this https URL.
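[CV-15] Artificial Microsaccade Compensation: Stable Vision for an Ornithopter
[Quick Read]: This paper addresses the severe video shake caused by high-frequency vibration (12-20 Hz) during flight of a tailless ornithopter, which has made conventional camera-based sensing difficult to use. Inspired by microsaccades, the small, rapid eye movements of animals with foveated vision, the key to the solution is "Artificial Microsaccade Compensation": it minimizes changes in image intensity by optimizing over 3D rotations represented in SO(3), producing real-time, distortion-free video stabilization. The method yields stabilized video suitable for human viewing and, when adapted to hold a fixed viewing orientation, dramatically reduces inter-frame motion while benefiting from an efficient recursive update. Compared with commercial solutions such as Adobe Premiere Pro's Warp Stabilizer, it achieves higher-quality results while running in real time.
Link: https://arxiv.org/abs/2512.03995
Authors: Levi Burner, Guido de Croon, Yiannis Aloimonos
Affiliations: University of Maryland, College Park; Delft University of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 5 figures, 2 tables, under review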
Abstract:Animals with foveated vision, including humans, experience microsaccades, small, rapid eye movements that they are not aware of. Inspired by this phenomenon, we develop a method for “Artificial Microsaccade Compensation”. It can stabilize video captured by a tailless ornithopter that has resisted attempts to use camera-based sensing because it shakes at 12-20 Hz. Our approach minimizes changes in image intensity by optimizing over 3D rotation represented in SO(3). This results in a stabilized video, computed in real time, suitable for human viewing, and free from distortion. When adapted to hold a fixed viewing orientation, up to occasional saccades, it can dramatically reduce inter-frame motion while also benefiting from an efficient recursive update. When compared to Adobe Premiere Pro’s Warp Stabilizer, which is widely regarded as the best commercial video stabilization software available, our method achieves higher quality results while also running in real time.
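[CV-16] DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
[Quick Read]: This paper addresses the insufficient robustness of vision-language models (VLMs) to dynamic visual degradation (such as motion blur, sensor noise, and compression artifacts) in safety-critical applications like autonomous driving, and in particular hallucinations induced by transient visual corruption that persist across subsequent frames. The key to the solution is DIQ-H, the first benchmark for dynamic visual degradation in temporal sequences, which quantifies hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. It also introduces Uncertainty-Guided Iterative Refinement (UIR), which uses lightweight VLMs with uncertainty filtering to generate reliable pseudo-labels, improving accuracy by 15.3 percent and enabling scalable annotation, thereby providing a systematic tool for evaluating VLM reliability in real-world settings.
Link: https://arxiv.org/abs/2512.03992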
Authors: Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang
Affiliations: The School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; The School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China; The Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
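[CV-17] DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
[Quick Read]: This paper addresses the limitations that reliance on manual masks and text prompts imposes on drag-based image editing with generative models; removing these constraints tends to cause visual artifacts or poor spatial control. The core solution is DirectDrag, a mask-free and prompt-free editing framework with two key innovations. First, an Auto Soft Mask Generation module infers editable regions from point displacement, localizing deformation along movement paths while preserving contextual integrity. Second, a Readout-Guided Feature Alignment mechanism uses intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving image fidelity.
Link: https://arxiv.org/abs/2512.03981
Authors: Sheng-Hao Liao, Shang-Fu Chen, Tai-Ming Huang, Wen-Huang Cheng, Kai-Lung Hua
Affiliations: National Taiwan University of Science and Technology; National Taiwan University; Microsoft Taiwan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: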
Abstract:Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask- and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that intelligently infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model’s inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual mask or prompt, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation. Project Page: this https URL. Code is available at: this https URL.
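[CV-18] BlurDM: A Blur Diffusion Model for Image Deblurring NEURIPS2025
[Quick Read]: This paper addresses the failure of existing diffusion models to exploit the intrinsic nature of the blur formation process in dynamic scene deblurring, which limits their potential. The key to the solution is the Blur Diffusion Model (BlurDM), which implicitly models the continuous-exposure nature of motion blur through a dual-diffusion forward process that diffuses both noise and blur onto a sharp image. For the reverse generation process, the authors derive a joint denoising and deblurring formulation that lets the model recover the sharp image starting from pure Gaussian noise conditioned on the blurred image. For efficiency, BlurDM operates in the latent space, forming a flexible prior generation network that can be integrated into various deblurring architectures.
Link: https://arxiv.org/abs/2512.03979
Authors: Jin-Ting He, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin, Yen-Yu Lin
Affiliations: National Yang Ming Chiao Tung University; National Tsing Hua University; National Chengchi University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025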
Abstract:Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address it, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at this https URL.
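[CV-19] Training for Identity, Inference for Controllability: A Unified Approach to Tuning-Free Face Personalization
[Quick Read]: This paper addresses the limitation of current tuning-free face personalization methods in simultaneously achieving high identity fidelity and flexible text controllability. Existing methods fall into two classes: text embedding methods, which map facial features into the text embedding space, and adapter-based methods, which inject features through auxiliary cross-attention layers. Neither achieves both goals. The proposed solution, UniID, hinges on a principled training-inference strategy: during training, an identity-focused learning scheme guides both branches to capture only identity-relevant information; at inference, a normalized rescaling mechanism recovers the text controllability of the base diffusion model, preserving its original prior, while letting the two identity signals reinforce each other. As a result, UniID achieves high identity fidelity together with flexible text control, and experiments show it outperforms six state-of-the-art methods.
Link: https://arxiv.org/abs/2512.03964
Authors: Lianyu Pang, Ji Zhou, Qiping Wang, Baoquan Zhao, Zhenguo Yang, Qing Li, Xudong Mao
Affiliations: Sun Yat-sen University; East China Normal University; Guangdong University of Technology; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 13 figures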
Abstract:Tuning-free face personalization methods have developed along two distinct paradigms: text embedding approaches that map facial features into the text embedding space, and adapter-based methods that inject features through auxiliary cross-attention layers. While both paradigms have shown promise, existing methods struggle to simultaneously achieve high identity fidelity and flexible text controllability. We introduce UniID, a unified tuning-free framework that synergistically integrates both paradigms. Our key insight is that when merging these approaches, they should mutually reinforce only identity-relevant information while preserving the original diffusion prior for non-identity attributes. We realize this through a principled training-inference strategy: during training, we employ an identity-focused learning scheme that guides both branches to capture identity features exclusively; at inference, we introduce a normalized rescaling mechanism that recovers the text controllability of the base diffusion model while enabling complementary identity signals to enhance each other. This principled design enables UniID to achieve high-fidelity face personalization with flexible text controllability. Extensive experiments against six state-of-the-art methods demonstrate that UniID achieves superior performance in both identity preservation and text controllability. Code will be available at this https URL
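[CV-20] TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
[Quick Read]: This paper addresses the limited temporal understanding of multimodal large language models (MLLMs) in long-form video analysis, especially their restricted generalization on tasks such as temporal localization, action detection, and time-sensitive question answering. The core solution is TempR1, a temporal-aware multi-task reinforcement learning framework. It first curates a multi-task corpus with diverse temporal structures and semantics; it then builds on the Group Relative Policy Optimization (GRPO) algorithm for stable cross-task optimization; and it categorizes temporal tasks into three correspondence types between predicted intervals and ground-truth instances, with a tailored localization reward for each, so as to capture fine-grained temporal dependencies and adapt to different temporal patterns. The method achieves state-of-the-art performance on multiple benchmarks and shows synergy across tasks, offering a scalable and principled paradigm for temporal reasoning in MLLMs.
Link: https://arxiv.org/abs/2512.03963
Authors: Tao Wu, Li Yang, Gen Zhan, Yiting Liao, Junlin Li, Deliang Fu, Li Zhang, Limin Wang
Affiliations: Nanjing University; ByteDance Inc.; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: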
Abstract:Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs’ temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
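A small sketch may help show how a localization reward feeds a GRPO-style update. It rests on assumptions: temporal IoU is used as a stand-in reward (the abstract does not spell out the three tailored rewards), and the population standard deviation and `eps` value in `group_relative_advantages` are illustrative choices.

```python
import statistics

def temporal_iou(pred, gt):
    """IoU between two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group,
    so no separate value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

ground_truth = (10.0, 20.0)
# Three sampled predictions for the same query (toy values).
predictions = [(10.0, 20.0), (15.0, 25.0), (30.0, 40.0)]
rewards = [temporal_iou(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group, a prediction is reinforced only when it localizes better than its sibling samples, which is what allows differently scaled rewards from different task types to be optimized together.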
[CV-21] MUT3R: Motion-aware Updating Transformer for Dynamic 3D Reconstruction
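Quick read: This paper addresses motion-induced artifacts that arise when stateful recurrent networks built for static 3D reconstruction are applied to dynamic scenes: non-rigid regions corrupt attention propagation between the spatial memory and the image features. The key observation is that aggregating the self-attention maps of the pretrained transformer across layers exposes an implicit motion cue, since dynamic regions are naturally down-weighted. Building on this, the authors propose MUT3R, which adds a training-free attention-level gating module that suppresses dynamic regions in the early layers during inference, before their artifacts propagate through the feature hierarchy. The method needs no fine-tuning or retraining; the pretrained model diagnoses and corrects its own motion interference, which improves the stability of geometric reasoning, temporal consistency, and camera pose robustness in streaming scenarios.
Link: https://arxiv.org/abs/2512.03939
Authors: Guole Shen,Tianchen Deng,Xingrui Qin,Nailin Wang,Jianyu Wang,Yanbo Wang,Yongtao Chen,Hesheng Wang,Jingchuan Wang
Affiliations: Shanghai Jiao Tong University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: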
Abstract:Recent stateful recurrent neural networks have achieved remarkable progress on static 3D reconstruction but remain vulnerable to motion-induced artifacts, where non-rigid regions corrupt attention propagation between the spatial memory and image feature. By analyzing the internal behaviors of the state and image token updating mechanism, we find that aggregating self-attention maps across layers reveals a consistent pattern: dynamic regions are naturally down-weighted, exposing an implicit motion cue that the pretrained transformer already encodes but never explicitly uses. Motivated by this observation, we introduce MUT3R, a training-free framework that applies the attention-derived motion cue to suppress dynamic content in the early layers of the transformer during inference. Our attention-level gating module suppresses the influence of dynamic regions before their artifacts propagate through the feature hierarchy. Notably, we do not retrain or fine-tune the model; we let the pretrained transformer diagnose its own motion cues and correct itself. This early regulation stabilizes geometric reasoning in streaming scenarios and leads to improvements in temporal consistency and camera pose robustness across multiple dynamic benchmarks, offering a simple and training-free pathway toward motion-aware streaming reconstruction.
[CV-22] Beyond the Ground Truth: Enhanced Supervision for Image Restoration
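Quick read: This paper addresses the performance ceiling that real-world image restoration models hit because the ground-truth images in training datasets are of limited quality. The key idea is a framework in which a conditional frequency mask generator learns adaptive frequency masks that guide the fusion, in the frequency domain, of the original ground truth with its super-resolved variants, producing perceptually enhanced ground-truth images. The approach preserves semantic consistency while selectively enriching perceptual detail and avoiding hallucinated artifacts, which improves the supervision signal. The enhanced ground truth is then used to train a lightweight output refinement network that can be integrated with existing restoration models to improve their final output.
Link: https://arxiv.org/abs/2512.03932
Authors: Donghun Ryou,Inju Ha,Sanghyeok Chu,Bohyung Han
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: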
Abstract:Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth images using super-resolution by incorporating adaptive frequency masks, which are learned by a conditional frequency mask generator. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants, yielding enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models. Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. Code is available at this https URL.
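The frequency-domain mixup described above can be pictured with a small sketch that blends two signals bin by bin. This is a one-dimensional toy under stated assumptions: the mask is set by hand rather than learned by a conditional generator, a naive DFT stands in for a 2D image transform, and all names are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def frequency_mixup(gt, sr, mask):
    """Blend two signals per frequency bin.

    mask[k] = 0 keeps the original ground-truth component and mask[k] = 1
    takes the super-resolved one. For k >= 1 the mask should satisfy
    mask[k] == mask[n - k] so that the blended signal stays real.
    """
    gt_spec, sr_spec = dft(gt), dft(sr)
    mixed = [(1 - m) * g + m * s for g, s, m in zip(gt_spec, sr_spec, mask)]
    return [value.real for value in idft(mixed)]

# Toy data: the "super-resolved" signal adds detail only at the highest frequency.
gt = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
sr = [v + 0.5 * (-1) ** t for t, v in enumerate(gt)]

keep_gt = frequency_mixup(gt, sr, [0] * 8)                     # reproduces gt
take_high = frequency_mixup(gt, sr, [0, 0, 0, 0, 1, 0, 0, 0])  # adopts the added detail
```

An all-zero mask returns the original ground truth unchanged, while a mask that opens only the highest-frequency bin adopts the extra detail and leaves every other component untouched.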
[CV-23] UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework
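Quick read: This paper addresses joint generation and understanding of 2D human video and 3D human motion within one framework. Existing methods usually generate one modality conditioned on the other, or combine either with other modalities such as text and audio, and leave joint optimization of the two unexplored; the main challenge is the large structural and distributional gap between them. The proposed UniMo model first represents video and 3D motion as a unified token sequence, with separate embedding layers to mitigate the distribution gap; second, it uses a sequence modeling strategy that combines the two tasks for end-to-end joint optimization. In addition, it introduces a VQ-VAE-based 3D motion tokenizer with a temporal expansion strategy to align efficiently with visual tokens while preserving spatial information, and multiple expert decoders that separately handle body shape, translation, global orientation, and body pose, so that high-fidelity 3D motion reconstruction and the corresponding video generation are achieved together.
Link: https://arxiv.org/abs/2512.03918
Authors: Youxin Pang,Yong Zhang,Ruizhi Shao,Xiang Deng,Feng Gao,Xu Xiaoming,Xiaoming Wei,Yebin Liu
Affiliations: Tsinghua University; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL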
Abstract:We propose UniMo, an innovative autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework, enabling simultaneous generation and understanding of these two modalities for the first time. Current methods predominantly focus on generating one modality given another as the condition or integrating either of them with other modalities such as text and audio. Unifying 2D videos and 3D motions for simultaneous optimization and generation remains largely unexplored, presenting significant challenges due to their substantial structural and distributional differences. Inspired by the LLM’s ability to unify different modalities, our method models videos and 3D motions as a unified tokens sequence, utilizing separate embedding layers to mitigate distribution gaps. Additionally, we devise a sequence modeling strategy that integrates two distinct tasks within a single framework, proving the effectiveness of unified modeling. Moreover, to efficiently align with visual tokens and preserve 3D spatial information, we design a novel 3D motion tokenizer with a temporal expansion strategy, using a single VQ-VAE to produce quantized motion tokens. It features multiple expert decoders that handle body shapes, translation, global orientation, and body poses for reliable 3D motion reconstruction. Extensive experiments demonstrate that our method simultaneously generates corresponding videos and motions while performing accurate motion capture. This work taps into the capacity of LLMs to fuse diverse data types, paving the way for integrating human-centric information into existing models and potentially enabling multimodal, controllable joint modeling of humans, objects, and scenes.
[CV-24] Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence
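Quick read: This paper addresses the temporal inconsistency that arises when image diffusion models are transferred directly to video in zero-shot settings. Existing methods mainly inject inter-frame correspondence into the attention mechanism, but the soft constraint on which features are valid to attend to is not strong enough to carry semantically consistent content across frames, so the output video becomes temporally incoherent. The proposed FRESCO framework combines intra-frame correspondence with inter-frame correspondence to form a more robust spatial-temporal constraint. It goes beyond attention guidance and explicitly optimizes the features, which improves spatial-temporal consistency and keeps semantically similar content transforming consistently from frame to frame.
Link: https://arxiv.org/abs/2512.03905
Authors: Shuai Yang,Junxin Lin,Yifan Zhou,Ziwei Liu,Chen Change Loy
Affiliations: Wangxuan Institute of Computer Technology, Peking University; S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL , Project: this https URL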
Abstract:The remarkable success in text-to-image diffusion models has motivated extensive investigation of their potential for video applications. Zero-shot techniques aim to adapt image diffusion models for videos without requiring further model training. Recent methods largely emphasize integrating inter-frame correspondence into attention mechanisms. However, the soft constraint applied to identify the valid features to attend is insufficient, which could lead to temporal inconsistency. In this paper, we present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. This enhancement ensures a consistent transformation of semantically similar content between frames. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video, significantly enhancing the visual coherence of manipulated videos. We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing. Comprehensive experiments demonstrate the effectiveness of our framework in generating high-quality, coherent videos, highlighting a significant advance over current zero-shot methods.
[CV-25] Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy
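Quick read: This paper addresses early and accurate detection of local regrowth (LR) in rectal cancer patients who, after total neoadjuvant treatment (TNT), show clinical complete response (cCR) and are managed under a watch-and-wait (WW) strategy. WW depends on monitoring follow-up endoscopy images over time, and existing methods do not discriminate image changes robustly, which can lead to missed or incorrect findings and affect prognosis. The proposed solution is a Siamese Swin Transformer with Dual Cross-Attention (SSDCA). It uses pretrained Swin Transformers to extract domain-agnostic features that are robust to imaging variations, and dual cross-attention to combine features from the two endoscopic examinations without spatial registration, so that cCR and LR can be distinguished. Experiments show that SSDCA outperforms the baselines in balanced accuracy, sensitivity, and specificity and remains stable under a range of image artifacts, supporting its feasibility for early detection of local regrowth in clinical practice.
Link: https://arxiv.org/abs/2512.03883
Authors: Jorge Tapias Gomez,Despoina Kanata,Aneesh Rangnekar,Christina Lee,Julio Garcia-Aguilar,Joshua Jesse Smith,Harini Veeraraghavan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, 1 table, submitted to ISBI conference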
Abstract:Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objectively accurate methods to early detect local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain agnostic features and enhance robustness to imaging variations. Dual cross attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% ± 0.04), sensitivity (90.07% ± 0.08), and specificity (72.86% ± 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 ± 0.18) and minimal intra-cluster dispersion (1.07 ± 0.19) with SSDCA, confirming discriminative representation learning.
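For readers less familiar with the three metrics reported here, the following sketch shows how sensitivity, specificity, and balanced accuracy are derived from a binary confusion matrix. The counts are hypothetical and are not taken from the paper; treating LR as the positive class is likewise an assumption.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)   # share of positive cases that were detected
    specificity = tn / (tn + fp)   # share of negative cases that were correctly cleared
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

# Hypothetical counts: 20 regrowth cases and 42 complete-response cases.
sens, spec, bal_acc = binary_metrics(tp=18, fn=2, tn=30, fp=12)
```

Balanced accuracy averages the two class-wise rates, so it is not inflated when one class is much larger than the other.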
[CV-26] An Automated Framework for Large-Scale Graph-Based Cerebrovascular Analysis
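Quick read: This paper addresses the lack of automated, multiscale feature extraction methods for quantitative analysis of cerebrovascular morphology, particularly for characterizing complex cerebrovascular networks and for population-level studies. The proposed CaravelMetrics framework builds graph representations through skeletonization and integrates atlas-based regional parcellation, centerline extraction, and graph construction. It automatically computes 15 morphometric, topological, fractal, and geometric features and supports analysis at both global and regional scales, giving a quantitative model of cerebrovascular organization.
Link: https://arxiv.org/abs/2512.03869
Authors: Daniele Falcetta,Liane S. Canas,Lorenzo Suppa,Matteo Pentassuglia,Jon Cleary,Marc Modat,Sébastien Ourselin,Maria A. Zuluaga
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Submitted to ISBI 2026. 6 pages, 6 figures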
Abstract:We present CaravelMetrics, a computational framework for automated cerebrovascular analysis that models vessel morphology through skeletonization-derived graph representations. The framework integrates atlas-based regional parcellation, centerline extraction, and graph construction to compute fifteen morphometric, topological, fractal, and geometric features. The features can be estimated globally from the complete vascular network or regionally within arterial territories, enabling multiscale characterization of cerebrovascular organization. Applied to 570 3D TOF-MRA scans from the IXI dataset (ages 20-86), CaravelMetrics yields reproducible vessel graphs capturing age- and sex-related variations and education-associated increases in vascular complexity, consistent with findings reported in the literature. The framework provides a scalable and fully automated approach for quantitative cerebrovascular feature extraction, supporting normative modeling and population-level studies of vascular health and aging.
[CV-27] Diminishing Returns in Self-Supervised Learning
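Quick read: This paper examines how strongly small Vision Transformers (ViTs) depend on large parameter counts and large training sets to improve performance. The study finds that pre-training and fine-tuning help but with diminishing returns; more importantly, intermediate fine-tuning can harm downstream performance, possibly because the tasks differ in their mechanics. The key recommendation is targeted pre-training with carefully selected data rather than indiscriminate stacking of intermediate tasks, so that a small ViT performs as well as possible under a limited compute budget.
Link: https://arxiv.org/abs/2512.03862
Authors: Oli Bridge,Huey Sun,Botond Branyicskai-Nagy,Charles D’Ornano,Shomit Basu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: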
Abstract:While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast amount of parameters and training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that while pre-training and fine-tuning always help our model but have diminishing returns, intermediate fine-tuning can actually show harmful impact on downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.
[CV-28] Prostate biopsy whole slide image dataset from an underrepresented Middle Eastern population
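Quick read: This paper addresses the largely unknown generalization of digital pathology AI models to non-Western populations, such as those in the Middle East. Its contribution is the public release of a dataset of whole-slide images of prostate core needle biopsies from Erbil, Iraq, comprising 339 images from 185 patients. The slides are annotated with Gleason scores and International Society of Urological Pathology (ISUP) grades assigned independently by three pathologists. The slides were digitized on several scanners (Leica, Hamamatsu, and Grundium), which supports grading concordance analysis, color normalization, and cross-scanner robustness evaluation, and so helps the development and validation of pathology AI models across globally diverse populations.
Link: https://arxiv.org/abs/2512.03854
Authors: Peshawa J. Muhammad Ali,Navin Vincent,Saman S. Abdulla,Han N. Mohammed Fadhl,Anders Blilie,Kelvin Szolnoky,Julia Anna Mielcarz,Xiaoyi Ji,Kimmo Kartasalo,Abdulbasit K. Al-Talabani,Nita Mulliqi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 2 figures and 1 table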
Abstract:Artificial intelligence (AI) is increasingly used in digital pathology. Publicly available histopathology datasets remain scarce, and those that do exist predominantly represent Western populations. Consequently, the generalizability of AI models to populations from less digitized regions, such as the Middle East, is largely unknown. This motivates the public release of our dataset to support the development and validation of pathology AI models across globally diverse populations. We present 339 whole-slide images of prostate core needle biopsies from a consecutive series of 185 patients collected in Erbil, Iraq. The slides are associated with Gleason scores and International Society of Urological Pathology grades assigned independently by three pathologists. Scanning was performed using two high-throughput scanners (Leica and Hamamatsu) and one compact scanner (Grundium). All slides were de-identified and are provided in their native formats without further conversion. The dataset enables grading concordance analyses, color normalization, and cross-scanner robustness evaluations. Data will be deposited in the Bioimage Archive (BIA) under accession code: to be announced (TBA), and released under a CC BY 4.0 license.
[CV-29] Traffic Image Restoration under Adverse Weather via Frequency-Aware Mamba
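Quick read: This paper addresses traffic image restoration under adverse weather, where existing methods concentrate on spatial-domain modeling and neglect frequency-domain priors. The proposed framework, FAMamba, combines frequency guidance with sequence modeling for efficient restoration. Its main components are: (1) a Dual-Branch Feature Extraction Block (DFEB) that strengthens local-global interaction through bidirectional 2D frequency-adaptive scanning, adjusting traversal paths dynamically to the texture distribution of each sub-band; and (2) a Prior-Guided Block (PGB) that refines texture detail through wavelet-based high-frequency residual learning to improve reconstruction quality. An Adaptive Frequency Scanning Mechanism (AFSM) additionally lets the Mamba architecture scan in the frequency domain across distinct subgraphs and exploit the texture distribution within their structure.
Link: https://arxiv.org/abs/2512.03852
Authors: Liwen Pan,Longguang Wang,Guangwei Gao,Jun Wang,Jun Shi,Juncheng Li
Affiliations: Shanghai University; Air Force Aviation University; Nanjing University of Posts and Telecommunications; East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 13 figures, 5 tables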
Abstract:Traffic image restoration under adverse weather conditions remains a critical challenge for intelligent transportation systems. Existing methods primarily focus on spatial-domain modeling but neglect frequency-domain priors. Although the emerging Mamba architecture excels at long-range dependency modeling through patch-wise correlation analysis, its potential for frequency-domain feature extraction remains unexplored. To address this, we propose Frequency-Aware Mamba (FAMamba), a novel framework that integrates frequency guidance with sequence modeling for efficient image restoration. Our architecture consists of two key components: (1) a Dual-Branch Feature Extraction Block (DFEB) that enhances local-global interaction via bidirectional 2D frequency-adaptive scanning, dynamically adjusting traversal paths based on sub-band texture distributions; and (2) a Prior-Guided Block (PGB) that refines texture details through wavelet-based high-frequency residual learning, enabling high-quality image reconstruction with precise details. Meanwhile, we design a novel Adaptive Frequency Scanning Mechanism (AFSM) for the Mamba architecture, which enables the Mamba to achieve frequency-domain scanning across distinct subgraphs, thereby fully leveraging the texture distribution characteristics inherent in subgraph structures. Extensive experiments demonstrate the efficiency and effectiveness of FAMamba.
[CV-30] PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation Diagnosis and Few-Shot Cross-Modality Clinical Adaptation
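Quick read: This paper addresses the fragmentation of cardiac image analysis: anatomical segmentation, disease classification, and clinical report generation are usually handled by separate networks trained under different data regimes and objectives, with no unified architecture that generalizes across modalities and datasets. The proposed PULSE framework builds on self-supervised representation learning and uses a composite supervision strategy that balances region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder performs anatomical segmentation, while shared global representations support disease classification and clinical text generation, so a single architecture maps from pixels to structures to clinical reasoning. By learning task-invariant cardiac priors, PULSE improves generalization across datasets and to new imaging modalities.
Link: https://arxiv.org/abs/2512.03848
Authors: Hania Ghouse,Maryam Alsharqi,Farhad R. Nezami,Muzammil Behzad
Affiliations: King Fahd University of Petroleum and Minerals; Institute for Medical Engineering & Science, Massachusetts Institute of Technology; Harvard Medical School, Harvard University; KFUPM–SDAIA Joint Research Centre for Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: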
Abstract:Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel wise classification fidelity, and boundary aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output allowing the model to transition from pixels to structures and finally clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation style cardiac analysis framework.
[CV-31] CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation
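Quick read: This paper addresses two core problems in current Dataset Distillation (DD) methods. First, most generative approaches rely on a diffusion model pre-trained on the full target dataset, which defeats the purpose of DD, namely reducing training cost. Second, approaches that use general text-to-image models suffer from a distribution mismatch, because the priors in those models do not faithfully capture the semantics of the target data. The proposed Core Distribution Alignment (CoDA) framework uses a density-based mechanism to identify the "intrinsic core distribution" of the target dataset and steers generation so that synthetic samples align with it. This bridges the gap between general-purpose generative priors and target semantics, and achieves performance on par with or better than prior methods without any target-specific generative training.
Link: https://arxiv.org/abs/2512.03844
Authors: Letian Zhou,Songhua Liu,Xinchao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 24 figures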
Abstract:Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the “intrinsic core distribution” of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: this https URL
[CV-32] Heatmap Pooling Network for Action Recognition from RGB Videos
Quick read: This paper addresses problems that existing methods for human action recognition (HAR) in video face when extracting deep features: information redundancy, susceptibility to noise, and high storage cost. The core of the solution is a heatmap pooling network (HP-Net), whose feedback pooling module extracts information-rich, robust, and compact pooled features of the human body from video; these outperform previously used pose data and heatmap features. In addition, a spatial-motion co-learning module and a text refinement modulation module fuse the pooled features with other multimodal data, improving the robustness and accuracy of action recognition.
Link: https://arxiv.org/abs/2512.03837
Authors: Mengyuan Liu,Jinfu Liu,Yongkang Jiang,Bin He
Affiliations: Peking University, Shenzhen Graduate School; DJI Technology Co., Ltd; TongJi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Final Version of IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract:Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome and UAV-Human consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods. Our code is publicly available at: this https URL.
[CV-33] Lean Unet: A Compact Model for Image Segmentation
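Quick read: This paper addresses the large parameter count and memory footprint of U-Net and its variants in medical image segmentation, which limit training batch size and increase inference latency. Existing channel pruning methods can compress the model without accuracy loss but require lengthy optimization and may not generalize. From an analysis of pruning, the authors hypothesize that the final structure, rather than the pruning strategy itself, determines performance. On that basis they propose a lean U-Net architecture (LUnet) with a flat channel layout: instead of doubling the channel count at each downsampling step, it keeps the channel count fixed and relies on skip connections to carry information. This cuts the parameter count by more than 30 times while matching conventional U-Net and data-adaptively pruned networks on MRI and CT datasets, and exceeding standard U-Net at an equal total parameter count.
Link: https://arxiv.org/abs/2512.03834
Authors: Ture Hassler,Ida Åkerholm,Marcus Nordström,Gabriele Balletti,Orcun Goksel
Affiliations: Uppsala University; Karolinska Institutet
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: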
Abstract:Unet and its variations have been standard in semantic image segmentation, especially for computer assisted radiology. Current Unet architectures iteratively downsample spatial resolution while increasing channel dimensions to preserve information content. Such a structure demands a large memory footprint, limiting training batch sizes and increasing inference latency. Channel pruning compresses Unet architecture without accuracy loss, but requires lengthy optimization and may not generalize across tasks and datasets. By investigating Unet pruning, we hypothesize that the final structure is the crucial factor, not the channel selection strategy of pruning. Based on our observations, we propose a lean Unet architecture (LUnet) with a compact, flat hierarchy where channels are not doubled as resolution is halved. We evaluate on a public MRI dataset allowing comparable reporting, as well as on two internal CT datasets. We show that a state-of-the-art pruning solution (STAMP) mainly prunes from the layers with the highest number of channels. Comparatively, simply eliminating a random channel at the pruning-identified layer or at the largest layer achieves similar or better performance. Our proposed LUnet with fixed architectures and over 30 times fewer parameters achieves performance comparable to both conventional Unet counterparts and data-adaptively pruned networks. The proposed lean Unet with constant channel count across layers requires far fewer parameters while achieving performance superior to standard Unet for the same total number of parameters. Skip connections allow Unet bottleneck channels to be largely reduced, unlike standard encoder-decoder architectures requiring increased bottleneck channels for information propagation.
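To see why a flat channel layout saves so many parameters, the sketch below counts convolution weights for an encoder whose channels double at every level and for one that keeps them constant. The channel widths, the five levels, the two 3x3 convolutions per level, and the single input channel are assumptions chosen for illustration, not the paper's configuration, and the decoder is ignored.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k 2D convolution."""
    return c_in * c_out * k * k + c_out

def encoder_params(channels, c_in=1, convs_per_level=2):
    """Total parameters of a plain conv encoder with the given width per level."""
    total = 0
    prev = c_in
    for width in channels:
        for _ in range(convs_per_level):
            total += conv_params(prev, width)
            prev = width
    return total

doubling = encoder_params([64, 128, 256, 512, 1024])  # conventional layout
flat = encoder_params([64, 64, 64, 64, 64])           # constant-width layout
print(f"doubling: {doubling:,}  flat: {flat:,}  ratio: {doubling / flat:.1f}x")
```

Because a convolution's weight count scales with the product of its input and output channels, the deepest levels of the doubling layout dominate the total, which is where the flat layout saves most.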
[CV-34] A Robust Camera-based Method for Breath Rate Measurement
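Quick read: This paper addresses the accuracy of contactless measurement of human respiratory rate from video, especially the lack of robustness to subject movement in practical settings. The method combines several mathematical transforms and, using only an ordinary camera, reports a relative deviation from the ground truth of less than 5%. Evaluated on more than 2.5 hours of video from 14 volunteers, it reaches an average mean absolute error of 0.57 respirations per minute and is more resistant to distortions caused by subject movement than previous methods.
Link: https://arxiv.org/abs/2512.03827
Authors: Alexey Protopopov
Affiliations: Joint Stock Research and Production Company Kryptonite
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 2 tables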
Abstract:Proliferation of cheap and accessible cameras makes it possible to measure a subject’s breath rate from video footage alone. Recent works on this topic have proposed a variety of approaches for accurately measuring human breath rate, however they are either tested in near-ideal conditions, or produce results that are not sufficiently accurate. The present study proposes a more robust method to measure breath rate in humans with minimal hardware requirements using a combination of mathematical transforms with a relative deviation from the ground truth of less than 5%. The method was tested on videos taken from 14 volunteers with a total duration of over 2 hours 30 minutes. The obtained results were compared to reference data and the average mean absolute error was found to be at 0.57 respirations per minute, which is noticeably better than the results from previous works. The breath rate measurement method proposed in the present article is more resistant to distortions caused by subject movement and thus allows one to remotely measure the subject’s breath rate without any significant limitations on the subject’s behavior.
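The abstract does not say which transforms are combined, so the following is only a generic sketch of one common way to estimate a rate from a one-dimensional motion signal: pick the strongest Fourier component inside a plausible breathing band. The band limits, the synthetic signal, and the function name are all assumptions, and this should not be read as the author's method.

```python
import math

def dominant_rate_per_minute(signal, fps, lo_hz=0.1, hi_hz=0.7):
    """Return the strongest frequency in [lo_hz, hi_hz], in cycles per minute."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the constant offset
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fps / n
        if not (lo_hz <= freq <= hi_hz):
            continue
        re = sum(c * math.cos(2 * math.pi * k * t / n) for t, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * k * t / n) for t, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        raise ValueError("no frequency bin falls inside the requested band")
    return 60.0 * best_k * fps / n

# Synthetic 60 s trace at 10 frames per second: 0.25 Hz breathing plus slow drift.
fps, n = 10, 600
trace = [math.sin(2 * math.pi * 0.25 * t / fps) + 0.3 * t / n for t in range(n)]
rate = dominant_rate_per_minute(trace, fps)
```

The frequency resolution is `fps / n` Hz, so longer analysis windows give finer rate estimates at the cost of slower response to changes.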
[CV-35] HieroGlyphTranslator: Automatic Recognition and Translation of Egyptian Hieroglyphs to English
Link: https://arxiv.org/abs/2512.03817
Authors: Ahmed Nasser,Marwan Mohamed,Alaa Sherif,Basmala Mahmoud,Shereen Yehia,Asmaa Saad,Mariam S. El-Rahmany,Ensaf H. Mohamed
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-36] LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling
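Quick read: This paper addresses structural errors in Visual Autoregressive (VAR) image generation that arise because multiple tokens within one scale are sampled in parallel, which degrades image quality. The proposed Latent Scale Rejection Sampling (LSRS) evaluates, at inference time and scale by scale, several candidate token maps with a lightweight scoring model and selects a high-quality map to guide generation at subsequent scales. By prioritizing the early scales that matter most for structure, it limits autoregressive error accumulation and improves image quality with minimal extra computation.
Link: https://arxiv.org/abs/2512.03796
Authors: Hong-Kai Zheng,Piji Li
Affiliations: Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: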
Abstract:Visual Autoregressive (VAR) modeling approach for image generation proposes autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while accelerating synthesis. However, parallel token sampling within a scale may lead to structural errors, resulting in suboptimal generated images. To mitigate this, we propose Latent Scale Rejection Sampling (LSRS), a method that progressively refines token maps in the latent scale during inference to enhance VAR models. Our method uses a lightweight scoring model to evaluate multiple candidate token maps sampled at each scale, selecting the high-quality map to guide subsequent scale generation. By prioritizing early scales critical for structural coherence, LSRS effectively mitigates autoregressive error accumulation while maintaining computational efficiency. Experiments demonstrate that LSRS significantly improves VAR’s generation quality with minimal additional computational overhead. For the VAR-d30 model, LSRS increases the inference time by merely 1% while reducing its FID score from 1.95 to 1.78. When the inference time is increased by 15%, the FID score can be further reduced to 1.66. LSRS offers an efficient test-time scaling solution for enhancing VAR-based generation.
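The per-scale selection loop described above can be sketched as follows. This is a minimal best-of-N illustration, assuming a hypothetical sampler and scorer passed in as plain functions; the toy sampler, toy scorer, and the candidate counts per scale are invented here and do not reflect the paper's scoring model or schedule.

```python
import random

def generate_with_scale_selection(candidates_per_scale, sample_fn, score_fn, rng):
    """Pick the best-scoring candidate token map at each scale, coarse to fine."""
    chosen = []
    for scale, num_candidates in enumerate(candidates_per_scale):
        # Each candidate is conditioned on the maps already chosen at coarser scales.
        candidates = [sample_fn(chosen, scale, rng) for _ in range(num_candidates)]
        best = max(candidates, key=lambda c: score_fn(chosen, c))
        chosen.append(best)
    return chosen

def toy_sampler(previous_maps, scale, rng):
    """Hypothetical stand-in for the VAR model: a random (scale+1)^2 token map."""
    side = scale + 1
    return [rng.randrange(16) for _ in range(side * side)]

def toy_scorer(previous_maps, token_map):
    """Hypothetical stand-in for the scoring model: prefers fewer distinct tokens."""
    return -len(set(token_map))

# More candidates at early scales, a single sample at the last one.
maps = generate_with_scale_selection([4, 2, 1], toy_sampler, toy_scorer, random.Random(0))
```

Spending more candidates on early scales mirrors the abstract's point that those scales matter most for structural coherence; with one candidate per scale the loop reduces to ordinary sampling.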
[CV-37] Research on Brain Tumor Classification Method Based on Improved ResNet34 Network
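Quick read: This paper addresses the low efficiency and limited accuracy of manual brain tumor image classification in radiology, and the weak performance of shallow convolutional neural networks on the task. The proposed model is based on an improved ResNet34: a multi-scale input module is used as the first layer to strengthen feature extraction, and the residual downsampling layers are replaced with Inception v2 modules to improve multi-scale feature fusion; a channel attention module then assigns weights across channels to emphasize the most informative features. In five-fold cross-validation the model reaches an average classification accuracy of about 98.8%, 1% higher than the original ResNet34, with only 80% of its parameters.
Link: https://arxiv.org/abs/2512.03751
Authors: Yufeng Li,Wenchao Zhao,Bo Dang,Weimin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: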
Abstract:Previously, image interpretation in radiology relied heavily on manual methods. However, manual classification of brain tumor medical images is time-consuming and labor-intensive. Even with shallow convolutional neural network models, the accuracy is not ideal. To improve the efficiency and accuracy of brain tumor image classification, this paper proposes a brain tumor classification model based on an improved ResNet34 network. This model uses the ResNet34 residual network as the backbone network and incorporates multi-scale feature extraction. It uses a multi-scale input module as the first layer of the ResNet34 network and an Inception v2 module as the residual downsampling layer. Furthermore, a channel attention mechanism module assigns different weights to different channels of the image from a channel domain perspective, obtaining more important feature information. The results after a five-fold crossover experiment show that the average classification accuracy of the improved network model is approximately 98.8%, which is not only 1% higher than ResNet34, but also only 80% of the number of parameters of the original model. Therefore, the improved network model not only improves accuracy but also reduces clutter, achieving a classification effect with fewer parameters and higher accuracy.
[CV-38] Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models WACV2026
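[Quick read]: This paper addresses the problem that text-to-image (T2I) diffusion models inherit and amplify biases from the large-scale web-scraped data they are trained on (such as LAION-5B), which leads to stereotypical outputs. The key to its solution is SelfDebias, a fully unsupervised test-time debiasing method. It uses semantic clusters in an image encoder's embedding space to guide the diffusion process, and debiases by minimizing the KL divergence between the output distribution and the uniform distribution. It needs neither human-annotated data nor an external classifier trained for each concept, so it identifies semantic modes automatically and reduces bias across diffusion model architectures and prompt conditions while preserving visual fidelity.
Link: https://arxiv.org/abs/2512.03749
Authors: Korada Sri Vardhana, Shrikrishna Lolla, Soma Biswas
Affiliations: Indian Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2026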
Abstract:Text-to-image (T2I) diffusion models have achieved widespread success due to their ability to generate high-resolution, photorealistic images. These models are trained on large-scale datasets, like LAION-5B, often scraped from the internet. However, since this data contains numerous biases, the models inherently learn and reproduce them, resulting in stereotypical outputs. We introduce SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. SelfDebias identifies semantic clusters in an image encoder’s embedding space and uses these clusters to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. Unlike supervised approaches, SelfDebias does not require human-annotated datasets or external classifiers trained for each generated concept. Instead, it is designed to automatically identify semantic modes. Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including both conditional and unconditional models. It effectively debiases images not only along key demographic dimensions, while maintaining the visual fidelity of the generated images, but also along more abstract concepts for which identifying biases is challenging.
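The quantity being minimized can be made concrete with a small worked example. The sketch below only measures how far a batch of hard cluster assignments is from uniform; it does not reproduce how SelfDebias guides the diffusion process. The direction KL(p || uniform) and the helper name `kl_to_uniform` are assumptions made for illustration.

```python
import math
from collections import Counter

def kl_to_uniform(cluster_ids, num_clusters):
    """KL(p || u), where p is the empirical cluster distribution of the outputs
    and u is uniform over num_clusters. Zero means perfectly balanced."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    kl = 0.0
    for k in range(num_clusters):
        p = counts.get(k, 0) / total
        if p > 0:  # the 0 * log(0) term is taken as 0
            kl += p * math.log(p * num_clusters)  # log(p / (1 / K))
    return kl

balanced = kl_to_uniform([0, 1, 0, 1], num_clusters=2)   # evenly split outputs
collapsed = kl_to_uniform([0, 0, 0, 0], num_clusters=2)  # every output in one cluster
```

The balanced batch gives a divergence of zero and the collapsed batch gives log 2, the maximum for two clusters, which is why driving this value down spreads generations across the discovered semantic modes.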
[CV-39] Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification
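[Quick read]: This paper addresses modality bias in two-stage unsupervised visible-infrared person re-identification (USL-VI-ReID) methods: modality-specific features extracted during single-modality learning propagate into the cross-modality learning stage, weakening identity discrimination and generalization. The key to its solution is the Dual-level Modality Debiasing Learning (DMDL) framework. At the model level, a Causality-inspired Adjustment Intervention (CAI) module replaces conventional likelihood-based modeling with causal modeling, avoiding modality-induced spurious correlations. At the optimization level, a Collaborative Bias-free Training (CBT) strategy uses modality-specific augmentation, label refinement, and feature alignment to block the propagation of modality bias across data, labels, and features, enabling modality-invariant feature learning and better generalization.
Link: https://arxiv.org/abs/2512.03745
Authors: Jiaze Li, Yan Lu, Bin Liu, Guojun Yin, Mang Ye
Affiliations: University of Science and Technology of China; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: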
Abstract:The two-stage learning pipeline has achieved promising results in unsupervised visible-infrared person re-identification (USL-VI-ReID). It first performs single-modality learning and then performs cross-modality learning to tackle the modality discrepancy. Although promising, this pipeline inevitably introduces modality bias: modality-specific cues learned in the single-modality training naturally propagate into the following cross-modality learning, impairing identity discrimination and generalization. To address this issue, we propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. At the model level, we propose a Causality-inspired Adjustment Intervention (CAI) module that replaces likelihood-based modeling with causal modeling, preventing modality-induced spurious patterns from being introduced, leading to a low-biased model. At the optimization level, a Collaborative Bias-free Training (CBT) strategy is introduced to interrupt the propagation of modality bias across data, labels, and features by integrating modality-specific augmentation, label refinement, and feature alignment. Extensive experiments on benchmark datasets demonstrate that DMDL enables modality-invariant feature learning and a more generalized model.
[CV-40] Out-of-the-box: Black-box Causal Attacks on Object Detectors
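[Quick read]: This paper addresses the weak black-box attack performance, poor explainability, and strong architecture dependence of existing adversarial perturbation methods for object detectors. The key to its solution is BlackCAtt, a black-box attack algorithm that identifies and exploits causally sufficient pixel sets to generate explainable, imperceptible, architecture-agnostic adversarial examples. Combined with the bounding boxes output by the detector, it can remove, modify, or fabricate detections, making attacks more precisely targeted and less perceptible. On the COCO test set it is 2.7 times, 3.86 times, and 5.75 times better than the baseline at removing a detection, changing a detection, and triggering spurious detections, respectively.
Link: https://arxiv.org/abs/2512.03730
Authors: Melane Navaratnarajah, David A. Kelly, Hana Chockler
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: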
Abstract:Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. BlackCAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of different sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less perceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original image, and hence imperceptible, demonstrating the power of causal pixels.
[CV-41] PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention
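[Quick read]: This paper addresses the redundant or unstable actions that current Vision-Language-Action (VLA) models produce in embodied tasks. It attributes them to the spatially uniform perception field of existing VLAs, which lets target-irrelevant objects distract the model and hurts the precision and timeliness of its actions. The key to its solution is the efficient PosA-VLA framework, whose pose-conditioned anchor attention mechanism focuses visual attention on task-relevant regions, improving the alignment between instruction semantics and actionable visual cues and hence the precision and efficiency of action generation. The method needs no auxiliary perception modules (such as segmentation or grounding networks), uses a lightweight architecture for efficient inference, and generalizes robustly across robotic manipulation benchmarks and complex environments.
Link: https://arxiv.org/abs/2512.03724
Authors: Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, Mingming Gong
Affiliations: MBZUAI; AI2 Robotics; The University of Sydney; The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: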
Abstract:The Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions, as they often generate redundant or unstable motions along trajectories, limiting their applicability in time-sensitive [...]. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex [...]. To address this issue, we propose an efficient PosA-VLA framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model’s perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.
[CV-42] DINO-RotateMatch: A Rotation-Aware Deep Framework for Robust Image Matching in Large-Scale 3D Reconstruction
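[Quick read]: This paper addresses the challenge of robust image matching for large-scale 3D reconstruction from unstructured Internet images. The core problem is how to efficiently identify semantically related image pairs in large, diverse, unordered image collections and to accurately extract and match orientation-sensitive local keypoints. The key to its solution is the DINO-RotateMatch framework, which combines a dataset-adaptive image pairing strategy with rotation-aware keypoint extraction and matching. DINO, a self-supervised global descriptor, first retrieves semantically related image pairs; rotation-based augmentation combined with ALIKED (a keypoint detector) and Light Glue (a matcher) then captures orientation-dependent local features, improving matching accuracy and robustness. On the Kaggle Image Matching Challenge 2025 the method shows consistent improvements in mAA and earns a Silver Award (47th of 943 teams), supporting its effectiveness and scalability for large-scale 3D reconstruction.
Link: https://arxiv.org/abs/2512.03715
Authors: Kaichen Zhang, Tianxiang Sheng, Xuanming Shi
Affiliations: Beijing National Day School; No.2 High School of East China Normal University; CodingFuture Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 1 table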
Abstract:This paper presents DINO-RotateMatch, a deep-learning framework designed to address the challenges of image matching in large-scale 3D reconstruction from unstructured Internet images. The method integrates a dataset-adaptive image pairing strategy with rotation-aware keypoint extraction and matching. DINO is employed to retrieve semantically relevant image pairs in large collections, while rotation-based augmentation captures orientation-dependent local features using ALIKED and Light Glue. Experiments on the Kaggle Image Matching Challenge 2025 demonstrate consistent improvements in mean Average Accuracy (mAA), achieving a Silver Award (47th of 943 teams). The results confirm that combining self-supervised global descriptors with rotation-enhanced local matching offers a robust and scalable solution for large-scale 3D reconstruction.
[CV-43] Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic Interpretable Perceptual Metric Between Images
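[Quick read]: This paper addresses the limited perceptual alignment of current image similarity measures: existing methods either rely on complex, uninterpretable deep features (such as LPIPS) or fail to model key properties of human visual perception (such as SSIM). The core of its solution is the Structured Uncertainty Similarity Score (SUSS), which models each image as a set of perceptual components, each represented by a structured multivariate Normal distribution and trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities, with weights learned from human perceptual datasets. The key innovation is that SUSS learns image-specific linear transformations of residuals in pixel space, which makes the assessment transparent and interpretable, so it aligns closely with human perception while giving localized, interpretable explanations.
Link: https://arxiv.org/abs/2512.03701
Authors: Paula Seidler, Neill D. F. Campbell, Ivor J A Simpson
Affiliations: University of Sussex; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: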
Abstract:Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS); it models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
[CV-44] Active Visual Perception: Opportunities and Challenges
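[Quick read]: This paper addresses the key challenges that active visual perception faces in complex environments, including real-time processing of complex visual data, decision-making in dynamic environments, and the fusion of multimodal sensory inputs. The key to its solution is the systematic integration of perception and action, so that an agent can adapt its sensing strategy to task goals or environmental uncertainty, for example by directing attention, moving sensors, or interacting with the environment to acquire more informative data. This improves adaptability and robustness in applications such as robotics, autonomous driving, and human-computer interaction.
Link: https://arxiv.org/abs/2512.03687
Authors: Yian Li, Xiaoyu Guo, Hao Zhang, Shuiwang Li, Xiaowei Dai
Affiliations: Guilin University of Technology; Putian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: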
Abstract:Active visual perception refers to the ability of a system to dynamically engage with its environment through sensing and action, allowing it to modify its behavior in response to specific goals or uncertainties. Unlike passive systems that rely solely on visual data, active visual perception systems can direct attention, move sensors, or interact with objects to acquire more informative data. This approach is particularly powerful in complex environments where static sensing methods may not provide sufficient information. Active visual perception plays a critical role in numerous applications, including robotics, autonomous vehicles, human-computer interaction, and surveillance systems. However, despite its significant promise, there are several challenges that need to be addressed, including real-time processing of complex visual data, decision-making in dynamic environments, and integrating multimodal sensory inputs. This paper explores both the opportunities and challenges inherent in active visual perception, providing a comprehensive overview of its potential, current research, and the obstacles that must be overcome for broader adoption.
[CV-45] GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces
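[Quick read]: This paper addresses the inefficiency and multi-view inconsistency of existing text-driven 3D stylization methods in large-scale production, which stem from distilling from 2D image editors and from time-consuming per-asset test-time optimization. The key to its solution is GaussianBlender, a feed-forward framework that learns structured, disentangled latent spaces from spatially grouped 3D Gaussians with controlled information sharing between geometry and appearance, then applies text-conditioned edits to these representations with a latent diffusion model. The result is instant, high-fidelity, geometry-preserving, multi-view consistent 3D stylization.
Link: https://arxiv.org/abs/2512.03683
Authors: Melis Ocal, Xiaoyan Xing, Yue Li, Ngo Anh Vien, Sezer Karaoglu, Theo Gevers
Affiliations: University of Amsterdam; Bosch Center for AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: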
Abstract:3D stylization is central to game development, virtual reality, and digital arts, where the demand for diverse assets calls for scalable methods that support fast, high-fidelity manipulation. Existing text-to-3D stylization methods typically distill from 2D image editors, requiring time-intensive per-asset optimization and exhibiting multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production. In this paper, we introduce GaussianBlender, a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference. Our method learns structured, disentangled latent spaces with controlled information sharing for geometry and appearance from spatially-grouped 3D Gaussians. A latent diffusion model then applies text-conditioned edits on these learned representations. Comprehensive evaluations show that GaussianBlender not only delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization, but also surpasses methods that require per-instance test-time optimization - unlocking practical, democratized 3D stylization at scale.
[CV-46] ConvRot: Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers
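[Quick read]: This paper addresses the growing memory footprint and inference latency of diffusion transformers as they scale, and in particular the difficulty existing quantization methods have with row-wise and column-wise outliers and with the computational cost that hinders deployment. The key to its solution is ConvRot, a group-wise rotation-based quantization method that uses the regular Hadamard transform (RHT) to suppress both kinds of outliers while reducing complexity from quadratic to linear. Building on it, the plug-and-play ConvLinear4bit module supports 4-bit weight and 4-bit activation (W4A4) inference without retraining, speeding up inference and reducing memory use while preserving image quality.
Link: https://arxiv.org/abs/2512.03673
Authors: Feice Huang, Zuliang Han, Xing Zhou, Yihuang Chen, Lifei Zhu, Haoqian Wang
Affiliations: SIGS, Tsinghua University; Central Media Technology Institute, Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: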
Abstract:Diffusion transformers have demonstrated strong capabilities in generating high-quality images. However, as model size increases, the growing memory footprint and inference latency pose significant challenges for practical deployment. Recent studies in large language models (LLMs) show that rotation-based techniques can smooth outliers and enable 4-bit quantization, but these approaches often incur substantial overhead and struggle with row-wise outliers in diffusion transformers. To address these challenges, we propose ConvRot, a group-wise rotation-based quantization method that leverages regular Hadamard transform (RHT) to suppress both row-wise and column-wise outliers while reducing complexity from quadratic to linear. Building on this, we design ConvLinear4bit, a plug-and-play module that integrates rotation, quantization, GEMM, and dequantization, enabling W4A4 inference without retraining and preserving visual quality. Experiments on FLUX.1-dev demonstrate a 2.26× speedup and 4.05× memory reduction while maintaining image fidelity. To our knowledge, this is the first application of rotation-based quantization for plug-and-play W4A4 inference in diffusion transformers.
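Why a group-wise rotation helps before low-bit quantization can be seen in a small sketch. It is not ConvRot itself: it assumes the standard Sylvester construction of a normalised Hadamard matrix (the paper's "regular" Hadamard construction may differ), and the helper names and group size are illustrative. It shows two properties that rotation-based quantization relies on: a single outlier is spread across its group, and rotating activations and weights with the same orthogonal matrix leaves their inner product unchanged.

```python
import math

def hadamard(n):
    """Normalised Sylvester Hadamard matrix of size n (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def rotate_groups(vec, group):
    """Apply the same group x group rotation to each consecutive block of vec."""
    h = hadamard(group)
    out = []
    for start in range(0, len(vec), group):
        block = vec[start:start + group]
        out.extend(sum(h[i][j] * block[j] for j in range(group))
                   for i in range(group))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# An activation vector with one large outlier, and an arbitrary weight vector.
x = [8.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
w = [1.0, -2.0, 3.0, 0.5, 0.25, -1.0, 2.0, 1.5]
rx, rw = rotate_groups(x, 4), rotate_groups(w, 4)
```

Here the outlier of magnitude 8 becomes four entries of magnitude 4, which narrows the range a 4-bit quantizer has to cover, while `dot(rx, rw)` still equals `dot(x, w)`. Rotating in fixed-size groups rather than across the full hidden dimension keeps the cost proportional to the vector length for a fixed group size.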
[CV-47] Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
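[Quick read]: This paper addresses a key bottleneck in moving colonoscopy from multimodal understanding to clinical reasoning: the outputs of current multimodal large language models (MLLMs) in colonoscopy scenarios are not robust or trustworthy enough to support precise clinical decisions. The key to its solution is twofold. First, it builds ColonVQA, the most comprehensive multimodal dataset for colonoscopy to date (over 1.1M visual question answering entries), to systematically evaluate the generalizability of MLLMs and their reliability under perturbations. Second, it proposes a reasoning-centric paradigm: ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and ColonR1, the first R1-styled model with task-adaptive rewarding and gradient-stable optimization. Under data-scarce conditions ColonR1 reaches 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis.
Link: https://arxiv.org/abs/2512.03667
Authors: Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Nick Barnes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report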
Abstract:In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy - evolving from multimodal understanding to clinical reasoning: (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at this https URL.
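[CV-48] ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
[Quick read]: This paper addresses the neglect of task-oriented reasoning in current Spatio-Temporal Video Grounding (STVG) research: existing methods rely mainly on object-centric, descriptive instructions and cannot support embodied agents in goal-directed interaction. The key to its solution is ToG-Bench, the first task-oriented video grounding benchmark for egocentric videos. Its core features are (1) Task-oriented Grounding, which requires identifying and localizing objects based on the intended task rather than a plain description; (2) Explicit-Implicit Dual Grounding, where target objects may be explicitly mentioned or inferred from context; and (3) One-to-Many Grounding, where a single task instruction may correspond to multiple relevant objects. Built on ScanNet data, the benchmark contains 100 annotated clips and 2,704 task-oriented instructions, introduces evaluation metrics tailored to multi-object and explicit-implicit grounding, and systematically evaluates seven state-of-the-art multimodal large language models (MLLMs). The results reveal the intrinsic challenges and performance gaps of task-oriented STVG and the difficulty of bridging perception and interaction in embodied scenarios.
Link: https://arxiv.org/abs/2512.03666
Authors: Qi’ao Xu, Tianwen Qian, Yuqian Fu, Kailing Li, Yang Jiao, Jiacheng Zhang, Xiaoling Wang, Liang He
Affiliations: East China Normal University; Sofia University “St. Kliment Ohridski”; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages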
Abstract:A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: this https URL
[CV-49] Multi-Scale Visual Prompting for Lightweight Small-Image Classification
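[Quick read]: This paper addresses the neglect of small-image benchmarks (such as MNIST, Fashion-MNIST, and CIFAR-10) in visual prompting research, that is, how to apply lightweight prompting effectively to low-resolution data. Existing methods mainly target large Vision Transformers and high-resolution datasets such as ImageNet and do not transfer directly to small-image tasks. The key to its solution is the Multi-Scale Visual Prompting (MSVP) module, which learns global, mid-scale, and local prompt maps and fuses them with the input image through a lightweight 1×1 convolution, providing an effective inductive bias. MSVP is backbone-agnostic, adds less than 0.02% parameters, and improves performance on both CNN and Vision Transformer backbones with negligible computational overhead.
Link: https://arxiv.org/abs/2512.03663
Authors: Salim Khazem
Affiliations: Talan Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: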
Abstract:Visual prompting has recently emerged as an efficient strategy to adapt vision models using lightweight, learnable parameters injected into the input space. However, prior work mainly targets large Vision Transformers and high-resolution datasets such as ImageNet. In contrast, small-image benchmarks like MNIST, Fashion-MNIST, and CIFAR-10 remain widely used in education, prototyping, and research, yet have received little attention in the context of prompting. In this paper, we introduce Multi-Scale Visual Prompting (MSVP), a simple and generic module that learns a set of global, mid-scale, and local prompt maps fused with the input image via a lightweight 1×1 convolution. MSVP is backbone-agnostic, adds less than 0.02% parameters, and significantly improves performance across CNN and Vision Transformer backbones. We provide a unified benchmark on MNIST, Fashion-MNIST, and CIFAR-10 using a simple CNN, ResNet-18, and a small Vision Transformer. Our method yields consistent improvements with negligible computational overhead. We further include ablations on prompt scales, fusion strategies, and backbone architectures, along with qualitative analyses using prompt visualizations and Grad-CAM. Our results demonstrate that multi-scale prompting provides an effective inductive bias even on low-resolution images.
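The fusion step can be sketched for a single-channel image. The abstract does not give prompt resolutions, the upsampling method, or how channels are arranged, so the sketch below makes simple assumptions: prompts are stored at 1×1 (global), 2×2 (mid-scale), and full resolution (local), upsampled with nearest-neighbour interpolation, and fused by a 1×1 convolution, which for one output channel is a per-pixel weighted sum. All names and values are illustrative.

```python
def upsample_nearest(grid, height, width):
    """Nearest-neighbour upsampling of a small 2D grid to height x width."""
    gh, gw = len(grid), len(grid[0])
    return [[grid[r * gh // height][c * gw // width] for c in range(width)]
            for r in range(height)]

def fuse_prompts(image, prompts, weights, bias=0.0):
    """Stack the image with upsampled prompt maps and mix them per pixel.

    A 1x1 convolution with one output channel reduces to a weighted sum over
    the stacked planes at every pixel; weights[0] applies to the image itself.
    """
    height, width = len(image), len(image[0])
    planes = [image] + [upsample_nearest(p, height, width) for p in prompts]
    return [[bias + sum(w * plane[r][c] for w, plane in zip(weights, planes))
             for c in range(width)]
            for r in range(height)]

image = [[0.0] * 4 for _ in range(4)]
global_prompt = [[1.0]]                      # one value for the whole image
mid_prompt = [[1.0, 2.0], [3.0, 4.0]]        # one value per quadrant
local_prompt = [[0.0] * 4 for _ in range(4)] # one value per pixel
fused = fuse_prompts(image, [global_prompt, mid_prompt, local_prompt],
                     weights=[1.0, 0.5, 1.0, 1.0])
```

In a trainable version the prompt maps and the fusion weights would be the only new parameters, which is consistent with the very small parameter overhead the abstract reports.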
[CV-50] Cyclical Temporal Encoding and Hybrid Deep Ensembles for Multistep Energy Forecasting
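[Quick read]: This paper addresses the accuracy bottleneck in short-term electricity load forecasting caused by inadequate modeling of periodic temporal features and the difficulty of capturing local and long-term patterns together. The key to its solution is a unified deep learning framework. Sine-cosine cyclical temporal encoding preserves the periodic structure of calendar attributes (daily, weekly, yearly), and an ensemble of an LSTM, a CNN, and a multilayer perceptron (MLP) meta-learner captures long-term seasonal effects and short-term local dynamics. The model achieves lower RMSE and MAE across all seven forecast horizons, supporting the combination of cyclical representations with complementary deep architectures.
Link: https://arxiv.org/abs/2512.03656
Authors: Salim Khazem, Houssam Kanso
Affiliations: Talan Research Center
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: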
Abstract:Accurate electricity consumption forecasting is essential for demand management and smart grid operations. This paper introduces a unified deep learning framework that integrates cyclical temporal encoding with hybrid LSTM-CNN architectures to enhance multistep energy forecasting. We systematically transform calendar-based attributes using sine-cosine encodings to preserve periodic structure and evaluate their predictive relevance through correlation analysis. To exploit both long-term seasonal effects and short-term local patterns, we employ an ensemble model composed of an LSTM, a CNN, and a meta-learner of MLP regressors specialized for each forecast horizon. Using a one-year national consumption dataset, we conduct an extensive experimental study including ablation analyses with and without cyclical encodings and calendar features and comparisons with established baselines from the literature. Results demonstrate consistent improvements across all seven forecast horizons, with our hybrid model achieving lower RMSE and MAE than individual architectures and prior methods. These findings confirm the benefit of combining cyclical temporal representations with complementary deep learning structures. To our knowledge, this is the first work to jointly evaluate temporal encodings, calendar-based features, and hybrid ensemble architectures within a unified short-term energy forecasting framework.
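The sine-cosine encoding of calendar attributes is a standard transform and can be shown in a few lines. The sketch below is a generic illustration, not the authors' code; the function name and the choice of hour-of-day with a period of 24 are assumptions made for the example.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic attribute (hour, weekday, month, ...) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

midnight = cyclical_encode(0, 24)
one_am = cyclical_encode(1, 24)
eleven_pm = cyclical_encode(23, 24)
noon = cyclical_encode(12, 24)
```

With raw integers, 23:00 and 00:00 are 23 units apart even though they are adjacent hours. On the circle, `eleven_pm` is exactly as close to `midnight` as `one_am` is, and `noon` is the farthest point, which is the periodic structure the encoding is meant to preserve.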
[CV-51] MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms
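[Quick read]: This paper addresses key challenges in small object detection in remote sensing imagery: because targets are small in high-resolution images, deep convolutional neural networks (CNNs) tend to lose critical information, and complex backgrounds and spatial redundancy obscure target features. The core of its solution is the Multi-Kernel Selection Network (MKSNet). Its key innovation is the Multi-Kernel Selection (MKS) mechanism, which adaptively selects convolution kernels of different sizes to capture broader context and strengthens the dynamic processing of spatial details for small objects. It is combined with dual spatial and channel attention modules, which respectively refine the spatial weights of feature maps and the importance of channel features, suppressing background noise and improving small-object detection accuracy.
Link: https://arxiv.org/abs/2512.03640
Authors: Jiahao Zhang, Xiao Zhao, Guangyu Gao
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: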
Abstract:Deep convolutional neural networks (DCNNs) have substantially advanced object detection capabilities, particularly in remote sensing imagery. However, challenges persist, especially in detecting small objects where the high resolution of these images and the small size of target objects often result in a loss of critical information in the deeper layers of conventional CNNs. Additionally, the extensive spatial redundancy and intricate background details typical in remote-sensing images tend to obscure these small targets. To address these challenges, we introduce Multi-Kernel Selection Network (MKSNet), a network architecture featuring a novel Multi-Kernel Selection (MKS) mechanism. The MKS mechanism utilizes large convolutional kernels to effectively capture an extensive range of contextual information. This design allows for adaptive kernel size selection, significantly enhancing the network’s ability to dynamically process and emphasize crucial spatial details for small object detection. Furthermore, MKSNet also incorporates a dual attention mechanism, merging spatial and channel attention modules. The spatial attention module adaptively fine-tunes the spatial weights of feature maps, focusing more intensively on relevant regions while mitigating background noise. Simultaneously, the channel attention module optimizes channel information selection, improving feature representation and detection accuracy. Empirical evaluations on the DOTA-v1.0 and HRSC2016 benchmarks demonstrate that MKSNet substantially surpasses existing state-of-the-art models in detecting small objects in remote sensing images. These results highlight MKSNet’s superior ability to manage the complexities associated with multi-scale and high-resolution image data, confirming its effectiveness and innovation in remote sensing object detection.
[CV-52] FeatureLens: A Highly Generalizable and Interpretable Framework for Detecting Adversarial Examples Based on Image Features
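[Quick read]: This paper addresses the high sensitivity of deep neural networks (DNNs) to adversarial attacks in image classification. Existing detection methods generally depend on complex, hard-to-interpret architectures, which leads to poor interpretability and weak generalization. The key to its solution is FeatureLens, a lightweight framework that combines an Image Feature Extractor (IFE) with shallow classifiers (such as SVM, MLP, or XGBoost) and achieves high detection accuracy with only 51-dimensional features (97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation). It also offers good interpretability, generalization, and computational efficiency, giving a practical path toward transparent and effective adversarial defense.
Link: https://arxiv.org/abs/2512.03625
Authors: Zhigang Yang, Yuan Liu, Jiawei Zhang, Puning Zhang, Xinqiang Ma
Affiliations: Chongqing University of Posts and Telecommunications; Chongqing University of Arts and Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: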
Abstract:Despite the remarkable performance of deep neural networks (DNNs) in image classification, their vulnerability to adversarial attacks remains a critical challenge. Most existing detection methods rely on complex and poorly interpretable architectures, which compromise interpretability and generalization. To address this, we propose FeatureLens, a lightweight framework that acts as a lens to scrutinize anomalies in image features. Comprising an Image Feature Extractor (IFE) and shallow classifiers (e.g., SVM, MLP, or XGBoost) with model sizes ranging from 1,000 to 30,000 parameters, FeatureLens achieves high detection accuracy ranging from 97.8% to 99.75% in closed-set evaluation and 86.17% to 99.6% in generalization evaluation across FGSM, PGD, CW, and DAmageNet attacks, using only 51-dimensional features. By combining strong detection performance with excellent generalization, interpretability, and computational efficiency, FeatureLens offers a practical pathway toward transparent and effective adversarial defense.
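The pairing of a hand-crafted feature extractor with a shallow classifier can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two gradient statistics stand in for the 51 features (which the abstract does not list), a nearest-centroid rule stands in for the SVM/MLP/XGBoost classifiers, and random ±eps sign noise only loosely mimics an FGSM-style perturbation.

```python
import random
import statistics


def extract_features(img):
    """Toy feature extractor (hypothetical): mean absolute horizontal and
    vertical differences of a 2D grayscale image given as a list of rows."""
    horiz = [abs(row[x + 1] - row[x]) for row in img for x in range(len(row) - 1)]
    vert = [abs(img[y + 1][x] - img[y][x])
            for y in range(len(img) - 1) for x in range(len(img[0]))]
    return [statistics.fmean(horiz), statistics.fmean(vert)]


def fit_centroids(features, labels):
    """Shallow classifier (assumed stand-in): one centroid per class."""
    model = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        model[label] = [statistics.fmean(col) for col in zip(*rows)]
    return model


def predict(model, feature):
    """Return the label whose centroid is closest in squared distance."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(feature, model[label]))
    return min(model, key=sq_dist)


def smooth_image(offset, size=16):
    """Synthetic 'clean' image: a smooth diagonal brightness ramp."""
    return [[offset + (y + x) / 64 for x in range(size)] for y in range(size)]


def perturb(img, eps, rng):
    """Add +/-eps sign noise per pixel (a crude stand-in for an attack)."""
    return [[p + eps * rng.choice((-1, 1)) for p in row] for row in img]


# Tiny synthetic training set: five clean ramps and their perturbed copies.
rng = random.Random(0)
clean = [smooth_image(0.1 * i) for i in range(5)]
attacked = [perturb(img, 0.1, rng) for img in clean]
features = [extract_features(img) for img in clean + attacked]
labels = ["clean"] * 5 + ["adversarial"] * 5
model = fit_centroids(features, labels)
```

Because the classifier only sees a couple of named statistics, a flagged input can be explained by pointing at the feature that moved, which is the interpretability argument the abstract makes for shallow models.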
[CV-53] ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation
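[Quick Read]: This paper addresses the difficulty existing video generation methods have in achieving precise camera control and structural consistency in complex scenes. Repair-based methods fail to restore complex artifacts, while LiDAR-based methods rely on sparse and incomplete geometric cues, which limits generation quality. The key to the solution is the ReCamDriving framework, which takes purely visual input and uses dense, scene-complete 3D Gaussian Splatting (3DGS) renderings for explicit geometric guidance. It follows a two-stage training strategy: the first stage uses camera poses for coarse control, and the second stage introduces 3DGS renderings for fine-grained viewpoint and geometric constraints. It also designs a 3DGS-based cross-trajectory data curation strategy that narrows the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos and improving the camera controllability and structural consistency of the generated videos.
Link: https://arxiv.org/abs/2512.03621
Authors: Yaokun Li,Shuaixian Wang,Mantang Guo,Jiehui Huang,Taojun Ding,Mu Hu,Kaixuan Wang,Shaojie Shen,Guang Tan
Affiliations: Sun Yat-sen University; ZYT; Shenzhen Polytechnic University; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL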
Abstract:We propose ReCamDriving, a purely vision-based, camera-controlled novel-trajectory video generation framework. While repair-based methods fail to restore complex artifacts and LiDAR-based approaches rely on sparse and incomplete cues, ReCamDriving leverages dense and scene-complete 3DGS renderings for explicit geometric guidance, achieving precise camera-controllable generation. To mitigate overfitting to restoration behaviors when conditioned on 3DGS renderings, ReCamDriving adopts a two-stage training paradigm: the first stage uses camera poses for coarse control, while the second stage incorporates 3DGS renderings for fine-grained viewpoint and geometric guidance. Furthermore, we present a 3DGS-based cross-trajectory data curation strategy to eliminate the train-test gap in camera transformation patterns, enabling scalable multi-trajectory supervision from monocular videos. Based on this strategy, we construct the ParaDrive dataset, containing over 110K parallel-trajectory video pairs. Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art camera controllability and structural consistency.
[CV-54] LAMP: Language-Assisted Motion Planning for Controllable Video Generation
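[Quick Read]: This paper addresses the limitations of motion control in video generation, namely how to specify the 3D trajectories of dynamic objects and cameras directly through natural language so that complex, cinematic scenes can be choreographed precisely. Existing methods fall short in motion controllability and alignment with user intent. The key to the solution is the LAMP framework, whose core innovation is to use a large language model (LLM) as a motion planner that translates natural language descriptions into structured motion programs. These programs are defined in a domain-specific language (DSL) inspired by cinematography conventions and are deterministically mapped to 3D trajectories. The method is validated on a large-scale procedural dataset pairing natural text with corresponding motion programs and 3D trajectories; it improves on state-of-the-art alternatives in motion controllability and intent alignment, and the authors describe it as the first framework to generate both object and camera motion directly from natural language.
Link: https://arxiv.org/abs/2512.03619
Authors: Muhammed Burak Kizil,Enes Sanli,Niloy J. Mitra,Erkut Erdem,Aykut Erdem,Duygu Ceylan
Affiliations: Koç University; University College London; Hacettepe University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: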
Abstract:Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control, that is, specifying object dynamics and camera trajectories, is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP’s improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for generating both object and camera motions directly from natural language specifications.
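The idea of a motion program that is deterministically mapped to a 3D trajectory can be sketched as a tiny interpreter. The command vocabulary below (`translate`, `orbit`, and their fields) is entirely hypothetical and is not the paper's DSL; it only shows how a structured program, once produced by an LLM, could be expanded into per-frame camera positions without any further randomness.

```python
import math


def run_motion_program(program, start=(0.0, 0.0, 0.0)):
    """Expand a list of motion commands into per-frame (x, y, z) positions.
    The commands are a made-up stand-in for a cinematography-style DSL."""
    x, y, z = start
    path = [(x, y, z)]
    for cmd in program:
        frames = cmd["frames"]
        if cmd["op"] == "translate":
            # Linear move by `delta`, spread evenly over `frames` frames.
            dx, dy, dz = cmd["delta"]
            x0, y0, z0 = x, y, z
            for i in range(1, frames + 1):
                t = i / frames
                x, y, z = x0 + dx * t, y0 + dy * t, z0 + dz * t
                path.append((x, y, z))
        elif cmd["op"] == "orbit":
            # Rotate about the vertical (y) axis through the origin.
            step = math.radians(cmd["angle_deg"]) / frames
            for _ in range(frames):
                x, z = (x * math.cos(step) + z * math.sin(step),
                        -x * math.sin(step) + z * math.cos(step))
                path.append((x, y, z))
        else:
            raise ValueError(f"unknown motion op: {cmd['op']}")
    return path


program = [
    {"op": "translate", "delta": (1.0, 0.0, 0.0), "frames": 2},
    {"op": "orbit", "angle_deg": 90.0, "frames": 1},
]
trajectory = run_motion_program(program)
```

Keeping the interpreter deterministic means the same program always yields the same trajectory, so any mismatch with user intent can be traced to the generated program rather than to the mapping step.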
[CV-55] Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding NEURIPS2025
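[Quick Read]: This paper addresses the lack of 3D consistency in current 2D vision foundation models when analyzing dynamic scenes from monocular video, which causes severe spatial misalignment and temporal flickering in complex 3D environments. The key to the solution is the Motion4D framework, which integrates 2D priors into a unified 4D Gaussian Splatting representation and uses a two-part iterative optimization scheme: sequential optimization updates the motion and semantic fields in consecutive stages to maintain local consistency, and global optimization jointly refines all attributes for long-term spatiotemporal coherence. In addition, a 3D confidence map dynamically adjusts the motion priors, an adaptive resampling strategy improves the modeling of under-represented regions, and iterative refinement with SAM2 prompts improves semantic consistency, yielding gains on point tracking, video object segmentation, and novel view synthesis.
Link: https://arxiv.org/abs/2512.03601
Authors: Haoran Zhou,Gim Hee Lee
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025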
Abstract:Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at this https URL.
[CV-56] Memory-Guided Point Cloud Completion for Dental Reconstruction
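[Quick Read]: This paper addresses the large missing regions in partial dental point clouds caused by occlusion and limited scanning views, which bias encoder-only global features and force decoders to hallucinate structures. The key to the solution is a retrieval-augmented framework that adds a learnable prototype memory to standard encoder-decoder architectures: after encoding, the global descriptor of the input is fused with the nearest manifold prototype through confidence-gated weighting, which provides a structural prior. The memory self-organizes into reusable tooth-shape prototypes without tooth-position labels, stabilizes inference in missing regions, and frees decoder capacity for detail recovery, giving more accurate and faithful point-cloud completion.
Link: https://arxiv.org/abs/2512.03598
Authors: Jianan Sun,Yukang Huang,Dongzhihan Wang,Mingyu Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: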
Abstract:Partial dental point clouds often suffer from large missing regions caused by occlusion and limited scanning views, which bias encoder-only global features and force decoders to hallucinate structures. We propose a retrieval-augmented framework for tooth completion that integrates a prototype memory into standard encoder–decoder pipelines. After encoding a partial input into a global descriptor, the model retrieves the nearest manifold prototype from a learnable memory and fuses it with the query feature through confidence-gated weighting before decoding. The memory is optimized end-to-end and self-organizes into reusable tooth-shape prototypes without requiring tooth-position labels, thereby providing structural priors that stabilize missing-region inference and free decoder capacity for detail recovery. The module is plug-and-play and compatible with common completion backbones, while keeping the same training losses. Experiments on a self-processed Teeth3DS benchmark demonstrate consistent improvements in Chamfer Distance, with visualizations showing sharper cusps, ridges, and interproximal transitions. Our approach provides a simple yet effective way to exploit cross-sample regularities for more accurate and faithful dental point-cloud completion.
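The retrieve-then-fuse step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the abstract does not give the gating function, so the gate below is simply the cosine similarity to the retrieved prototype clipped to [0, 1], and the memory is a fixed list rather than a learned, end-to-end optimized one.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_and_fuse(query, memory):
    """Pick the most similar prototype and blend it with the query.

    The gate (assumed here, not taken from the paper) is the clipped
    similarity: a confident match leans on the prototype, a poor match
    leaves the query descriptor mostly untouched.
    """
    sims = [cosine(query, proto) for proto in memory]
    best = max(range(len(memory)), key=sims.__getitem__)
    gate = min(1.0, max(0.0, sims[best]))
    fused = [(1.0 - gate) * q + gate * p for q, p in zip(query, memory[best])]
    return fused, best, gate


# Two hypothetical prototypes in a 2-D descriptor space.
memory = [[1.0, 0.0], [0.0, 1.0]]
fused, index, gate = retrieve_and_fuse([2.0, 0.0], memory)
```

In a real pipeline the fused descriptor would be handed to the decoder in place of the raw global feature.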
[CV-57] HBFormer: A Hybrid-Bridge Transformer for Microtumor and Miniature Organ Segmentation
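[Quick Read]: This paper addresses a performance bottleneck of current shifted-window Vision Transformers in medical image segmentation: their localized attention struggles to fuse local details with global context, which is especially harmful in microtumor and miniature organ segmentation, where both fine boundary definition and broad contextual understanding are required. The key to the solution is HBFormer, a new Hybrid-Bridge Transformer architecture whose core innovation is the "Bridge" mechanism: an asymmetric Multi-Scale Feature Fusion (MFF) decoder that fuses multi-scale encoder features with global contextual information. The decoder combines channel and spatial attention modules built from a series of dilated and depth-wise convolutions, explicitly modeling long-range dependencies and refining object boundaries, which improves performance on challenging medical image segmentation tasks.
Link: https://arxiv.org/abs/2512.03597
Authors: Fuchen Zheng,Xinyi Chen,Weixuan Li,Quanjun Li,Junhua Zhou,Xiaojiao Guo,Xuhang Chen,Chi-Man Pun,Shoujun Zhou
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Guangdong University of Technology; University of Macau; Huizhou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, 3 tables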
Abstract:Medical image segmentation is a cornerstone of modern clinical diagnostics. While Vision Transformers that leverage shifted window-based self-attention have established new benchmarks in this field, they are often hampered by a critical limitation: their localized attention mechanism struggles to effectively fuse local details with global context. This deficiency is particularly detrimental to challenging tasks such as the segmentation of microtumors and miniature organs, where both fine-grained boundary definition and broad contextual understanding are paramount. To address this gap, we propose HBFormer, a novel Hybrid-Bridge Transformer architecture. The ‘Hybrid’ design of HBFormer synergizes a classic U-shaped encoder-decoder framework with a powerful Swin Transformer backbone for robust hierarchical feature extraction. The core innovation lies in its ‘Bridge’ mechanism, a sophisticated nexus for multi-scale feature integration. This bridge is architecturally embodied by our novel Multi-Scale Feature Fusion (MFF) decoder. Departing from conventional symmetric designs, the MFF decoder is engineered to fuse multi-scale features from the encoder with global contextual information. It achieves this through a synergistic combination of channel and spatial attention modules, which are constructed from a series of dilated and depth-wise convolutions. These components work in concert to create a powerful feature bridge that explicitly captures long-range dependencies and refines object boundaries with exceptional precision. Comprehensive experiments on challenging medical image segmentation datasets, including multi-organ, liver tumor, and bladder tumor benchmarks, demonstrate that HBFormer achieves state-of-the-art results, showcasing its outstanding capabilities in microtumor and miniature organ segmentation. Code and models are available at: this https URL.
[CV-58] CloseUpAvatar: High-Fidelity Animatable Full-Body Avatars with Mixture of Multi-Scale Textures
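[Quick Read]: This paper addresses the difficulty existing human avatar methods have in preserving close-up rendering quality under more general camera motion; according to this summary, conventional methods tend to lose detail or distort at long distances or unusual viewpoints. CloseUpAvatar introduces a structured representation based on textured planes with two sets of learnable textures, one for low-frequency and one for high-frequency detail, to adapt rendering: high-frequency textures are activated only when the camera is close to the body surface, and their influence is gradually reduced as the distance grows. The core innovation is tying the use of texture detail to camera distance, which improves the realism and stability of images rendered from a variety of viewpoints while maintaining a high frame rate (FPS).
Link: https://arxiv.org/abs/2512.03593
Authors: David Svitov,Pietro Morerio,Lourdes Agapito,Alessio Del Bue
Affiliations: Università degli Studi di Genova; Istituto Italiano di Tecnologia; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: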
Abstract:We present CloseUpAvatar, a novel approach for articulated human avatar representation that handles more general camera motions while preserving rendering quality for close-up views. CloseUpAvatar represents an avatar as a set of textured planes with two sets of learnable textures for low and high-frequency detail. The method automatically switches to high-frequency textures only for cameras positioned close to the avatar’s surface and gradually reduces their impact as the camera moves farther away. Such parametrization of the avatar enables CloseUpAvatar to adjust rendering quality based on camera distance, ensuring realistic rendering across a wider range of camera orientations than previous approaches. We provide experiments using the ActorsHQ dataset with high-resolution input images. CloseUpAvatar demonstrates both qualitative and quantitative improvements over existing methods in rendering from a wide range of novel camera positions, while maintaining high FPS by limiting the number of required primitives.
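The distance-dependent switch between the two texture sets can be pictured as a blend weight that falls off with camera distance. The sketch below is an assumption, not the paper's formulation: it uses a linear ramp between two hypothetical thresholds `near` and `far`, and treats the high-frequency texture as an additive residual on top of the low-frequency one.

```python
def high_freq_weight(distance, near=0.5, far=2.0):
    """Weight of the high-frequency texture: 1 at or inside `near`,
    0 at or beyond `far`, linear in between (thresholds are hypothetical)."""
    if far <= near:
        raise ValueError("far must be greater than near")
    t = (far - distance) / (far - near)
    return max(0.0, min(1.0, t))


def blend_texel(low, high, distance, near=0.5, far=2.0):
    """Combine a low-frequency texel with a high-frequency residual texel."""
    return low + high_freq_weight(distance, near, far) * high


# Close-up, mid-range, and distant cameras looking at the same texel.
weights = [high_freq_weight(d) for d in (0.5, 1.25, 2.0)]
```

With a ramp like this, distant views pay nothing for fine detail they cannot show, which fits the abstract's point about keeping the primitive count and frame rate under control.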
[CV-59] Harnessing Hypergraphs in Geometric Deep Learning for 3D RNA Inverse Folding
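[Quick Read]: This paper addresses the RNA inverse folding problem, that is, designing nucleotide sequences that fold into a desired secondary structure so that the RNA molecule is stable and functional. The core challenge is the complex, non-linear relationship between sequence and structure. The key to the solution is HyperRNA, a generative model with an encoder-decoder architecture that introduces hypergraph modeling: in preprocessing, graphs are built from RNA backbone atom coordinates using a 3-bead coarse-grained representation; in encoding, an attention embedding module and a hypergraph-based encoder capture higher-order dependencies and complex biomolecular interactions; in decoding, the RNA sequence is generated autoregressively. Experiments on the PDBBind and RNAsolo datasets show that HyperRNA outperforms existing methods, supporting the potential of hypergraph modeling in RNA engineering.
Link: https://arxiv.org/abs/2512.03592
Authors: Guang Yang,Lei Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: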
Abstract:The RNA inverse folding problem, a key challenge in RNA design, involves identifying nucleotide sequences that can fold into desired secondary structures, which are critical for ensuring molecular stability and function. The inherent complexity of this task stems from the intricate relationship between sequence and structure, making it particularly challenging. In this paper, we propose a framework, named HyperRNA, a generative model with an encoder-decoder architecture that leverages hypergraphs to design RNA sequences. Specifically, our HyperRNA model consists of three main components: preprocessing, encoding and decoding. In the preprocessing stage, graph structures are constructed by extracting the atom coordinates of RNA backbone based on 3-bead coarse-grained representation. The encoding stage processes these graphs, capturing higher order dependencies and complex biomolecular interactions using an attention embedding module and a hypergraph-based encoder. Finally, the decoding stage generates the RNA sequence in an autoregressive manner. We conducted quantitative and qualitative experiments on the PDBBind and RNAsolo datasets to evaluate the inverse folding task for RNA sequence generation and RNA-protein complex sequence generation. The experimental results demonstrate that HyperRNA not only outperforms existing RNA design methods but also highlights the potential of leveraging hypergraphs in RNA engineering.
[CV-60] Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation
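[Quick Read]: This paper addresses the challenge of handling fast, complex, and highly non-linear motion in video frame interpolation. Existing diffusion-based methods cover diverse application scenarios poorly and struggle to produce sharp, temporally consistent frames, particularly in fine-grained motion tasks such as audio-visual synchronized interpolation. The key to the solution is the BBF (Beyond Boundary Frames) framework, whose core innovations are: 1) an enhanced input design that flexibly supports multiple conditioning modalities, including text, audio, images, and video; 2) a decoupled multimodal fusion mechanism that sequentially injects the different conditioning signals into a DiT (Diffusion Transformer) backbone; and 3) a progressive multi-stage training paradigm that uses a start-end frame difference embedding to dynamically adjust data sampling and loss weighting, preserving the generative ability of the foundation model while improving interpolation accuracy and consistency.
Link: https://arxiv.org/abs/2512.03590
Authors: Yuchen Deng,Xiuyang Wu,Hai-Tao Zheng,Jie Wang,Feidiao Yang,Yuxing Han
Affiliations: Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: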
Abstract:Handling fast, complex, and highly non-linear motion patterns has long posed challenges for video frame interpolation. Although recent diffusion-based approaches improve upon traditional optical-flow-based methods, they still struggle to cover diverse application scenarios and often fail to produce sharp, temporally consistent frames in fine-grained motion tasks such as audio-visual synchronized interpolation. To address these limitations, we introduce BBF (Beyond Boundary Frames), a context-aware video frame interpolation framework, which could be guided by audio/visual semantics. First, we enhance the input design of the interpolation model so that it can flexibly handle multiple conditional modalities, including text, audio, images, and video. Second, we propose a decoupled multimodal fusion mechanism that sequentially injects different conditional signals into a DiT backbone. Finally, to maintain the generation abilities of the foundation model, we adopt a progressive multi-stage training paradigm, where the start-end frame difference embedding is used to dynamically adjust both the data sampling and the loss weighting. Extensive experimental results demonstrate that BBF outperforms specialized state-of-the-art methods on both generic interpolation and audio-visual synchronized interpolation tasks, establishing a unified framework for video frame interpolation under coordinated multi-channel conditioning.
[CV-61] Dynamic Optical Test for Bot Identification (DOT-BI): A simple check to identify bots in surveys and online processes
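[Quick Read]: This paper addresses the problem of automated systems (such as bots or scripts) posing as human participants in online surveys and web processes, that is, bot identification. The key to the solution is a test based on dynamic optical perception, the Dynamic Optical Test for Bot Identification (DOT-BI). It exploits the sensitivity of the human visual system to differences in motion: a hidden number that shares the background's texture but differs slightly in motion or scale is perceptible to humans across consecutive frames, whereas frame-by-frame algorithmic processing yields no meaningful signal, which separates human from non-human respondents.
Link: https://arxiv.org/abs/2512.03580
Authors: Malte Bleeker,Mauro Gotsch
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: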
Abstract:We propose the Dynamic Optical Test for Bot Identification (DOT-BI): a quick and easy method that uses human perception of motion to differentiate between human respondents and automated systems in surveys and online processes. In DOT-BI, a ‘hidden’ number is displayed with the same random black-and-white pixel texture as its background. Only the difference in motion and scale between the number and the background makes the number perceptible to humans across frames, while frame-by-frame algorithmic processing yields no meaningful signal. We conducted two preliminary assessments. Firstly, state-of-the-art, video-capable, multimodal models (GPT-5-Thinking and Gemini 2.5 Pro) fail to extract the correct value, even when given explicit instructions about the mechanism. Secondly, in an online survey (n=182), 99.5% (181/182) of participants solved the task, with an average end-to-end completion time of 10.7 seconds; a supervised lab study (n=39) found no negative effects on perceived ease-of-use or completion time relative to a control. We release code to generate tests and 100+ pre-rendered variants to facilitate adoption in surveys and online processes.
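The mechanism, a figure that shares the background's random texture and is only distinguishable through motion, can be sketched with two noise fields that scroll in opposite directions. This is a simplified illustration rather than the authors' released generator: the mask shape, the frame size, and the use of opposite horizontal drift (with no scale change) are all assumptions.

```python
import random


def make_frames(mask, n_frames, seed=0):
    """Render binary frames in which masked pixels show one random texture
    drifting right and all other pixels show another drifting left.
    Within any single frame both regions are just random 0/1 noise."""
    rng = random.Random(seed)
    h, w = len(mask), len(mask[0])
    background = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    figure = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    frames = []
    for t in range(n_frames):
        frame = []
        for y in range(h):
            row = []
            for x in range(w):
                if mask[y][x]:
                    row.append(figure[y][(x - t) % w])      # drifts right
                else:
                    row.append(background[y][(x + t) % w])  # drifts left
            frame.append(row)
        frames.append(frame)
    return frames


# Hypothetical 8x12 mask: a vertical bar standing in for the digit "1".
MASK = [[1 if 5 <= x <= 6 and 1 <= y <= 6 else 0 for x in range(12)]
        for y in range(8)]
frames = make_frames(MASK, n_frames=4)
```

A production test would need larger frames, proper digit glyphs, and video encoding; the sketch only shows why a single frame carries no outline of the figure.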
[CV-62] Cross-Stain Contrastive Learning for Paired Immunohistochemistry and Histopathology Slide Representation Learning
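[Quick Read]: This paper addresses feature inconsistency between multi-stain slides (such as HE and immunohistochemistry, IHC) caused by misaligned tissue regions, which hinders learning transferable whole-slide image (WSI) representations from HE images. The key to the solution is a two-stage pretraining framework, Cross-Stain Contrastive Learning (CSCL). The first stage trains a lightweight adapter with patch-wise contrastive alignment to improve the compatibility of HE features with the corresponding IHC-derived contextual cues. The second stage performs slide-level representation learning with Multiple Instance Learning (MIL), combining a cross-stain attention fusion module that integrates stain-specific patch features with a global alignment module that enforces consistency among slide-level embeddings across stains, yielding high-quality, transferable HE slide-level representations.
Link: https://arxiv.org/abs/2512.03577
Authors: Yizhi Zhang,Lei Fan,Zhulin Tao,Donglin Di,Yang Song,Sidong Liu,Cong Cong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures. Camera-ready version accepted for IEEE BIBM 2025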
Abstract:Universal, transferable whole-slide image (WSI) representations are central to computational pathology. Incorporating multiple markers (e.g., immunohistochemistry, IHC) alongside HE enriches HE-based features with diverse, biologically meaningful information. However, progress is limited by the scarcity of well-aligned multi-stain datasets. Inter-stain misalignment shifts corresponding tissue across slides, hindering consistent patch-level features and degrading slide-level embeddings. To address this, we curated a slide-level aligned, five-stain dataset (HE, HER2, KI67, ER, PGR) to enable paired HE-IHC learning and robust cross-stain representation. Leveraging this dataset, we propose Cross-Stain Contrastive Learning (CSCL), a two-stage pretraining framework with a lightweight adapter trained using patch-wise contrastive alignment to improve the compatibility of HE features with corresponding IHC-derived contextual cues, and slide-level representation learning with Multiple Instance Learning (MIL), which uses a cross-stain attention fusion module to integrate stain-specific patch features and a cross-stain global alignment module to enforce consistency among slide-level embeddings across different stains. Experiments on cancer subtype classification, IHC biomarker status classification, and survival prediction show consistent gains, yielding high-quality, transferable HE slide-level representations. The code and data are available at this https URL.
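Patch-wise contrastive alignment between paired HE and IHC patches is commonly implemented with an InfoNCE-style loss. The abstract does not name the exact objective, so the sketch below is an assumption: it treats the i-th HE embedding and the i-th IHC embedding as a positive pair and all other IHC embeddings in the batch as negatives, with a hypothetical temperature of 0.1.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def info_nce(he_embeddings, ihc_embeddings, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired patch embeddings."""
    n = len(he_embeddings)
    total = 0.0
    for i in range(n):
        logits = [cosine(he_embeddings[i], ihc_embeddings[j]) / temperature
                  for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[i]
    return total / n


# Toy batch: aligned pairs should score far better than shuffled ones.
he = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(he, he)
shuffled_loss = info_nce(he, he[::-1])
```

The loss is low when each HE patch is closest to its own IHC counterpart, which is the property that spatial misalignment between stains would break.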
[CV-63] UniComp: Rethinking Video Compression Through Informational Uniqueness
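[Quick Read]: This paper addresses the difficulty attention-based video compression methods have in retaining key visual information under limited computational budgets. The core of the solution is UniComp, a video compression framework driven by information uniqueness, which optimizes compression by minimizing the conditional entropy (reconstruction error) between retained and full tokens. Specifically, the authors introduce the notion of information uniqueness to quantify intrinsic redundancy among tokens, and on that basis design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression, improving the fidelity of visual representations under constrained compute.
Link: https://arxiv.org/abs/2512.03575
Authors: Chao Yuan,Shimin Chen,Minliang Lin,Limeng Qiao,Guanglu Wan,Lin Ma
Affiliations: Meituan Inc.; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: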
Abstract:Distinct from attention-based compression methods, this paper presents an information-uniqueness-driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness, which measures intrinsic redundancy among tokens and links it to reconstruction error. Based on uniqueness, we design three modules (Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression) that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.
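One way to picture uniqueness-based token selection is a greedy loop that keeps the token least similar to those already kept. The abstract does not define the uniqueness score, so the sketch below assumes one: a candidate's uniqueness is one minus its highest cosine similarity to any kept token. It does not reproduce the Frame Group Fusion, Token Allocation, or Spatial Dynamic Compression modules.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_unique_tokens(tokens, budget):
    """Greedily keep up to `budget` token indices, preferring tokens that
    are least redundant with respect to the ones already kept."""
    if not tokens or budget <= 0:
        return []
    kept = [0]  # seed with the first token (an arbitrary choice)
    while len(kept) < min(budget, len(tokens)):
        candidates = [i for i in range(len(tokens)) if i not in kept]

        def uniqueness(i):
            # Assumed score: 1 - highest similarity to any kept token.
            return 1.0 - max(cosine(tokens[i], tokens[j]) for j in kept)

        kept.append(max(candidates, key=uniqueness))
    return sorted(kept)


# Tokens 0 and 1 are duplicates, so a budget of 2 keeps tokens 0 and 2.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
kept_indices = select_unique_tokens(tokens, budget=2)
```

Unlike attention-score pruning, a rule of this kind drops a token because its content is already covered, not because the model happened to attend to it less.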
[CV-64] Global-Local Aware Scene Text Editing
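[Quick Read]: This paper addresses two core problems in Scene Text Editing (STE): inconsistency between the edited region and the surrounding background, and poor handling of changes in text length. Existing methods struggle to keep the local patch stylistically consistent with the global background when replacing text, and often produce distortion when the target text length differs markedly from the original. The key to the solution is an end-to-end framework, GLASTE (Global-Local Aware Scene Text Editing), whose contributions include: a global-local combination structure that jointly uses high-level contextual information and fine local features; joint global and local losses that strengthen style consistency; a text style vector that is independent of image size and can be transferred across sizes; and an affine fusion step that fills the target text into the editing patch while keeping its aspect ratio unchanged.
Link: https://arxiv.org/abs/2512.03574
Authors: Fuxiang Yang,Tonghua Su,Donglin Di,Yin Chen,Xiangqian Wu,Zhongjie Wang,Lei Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: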
Abstract:Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure, joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use an affine fusion to fill target text images into the editing patch while maintaining their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
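The aspect-ratio-preserving placement at the end of the pipeline can be written as a uniform scale plus a translation. The sketch below is an illustration under assumptions: the abstract only says the affine fusion keeps the aspect ratio unchanged, so centering the scaled text inside the patch, and the function names, are hypothetical choices.

```python
def fit_preserving_aspect(text_w, text_h, patch_w, patch_h):
    """Return (scale, dx, dy) of the affine map x' = scale*x + dx,
    y' = scale*y + dy that fits a text image inside an editing patch
    without changing its aspect ratio. Centering is an assumption."""
    if min(text_w, text_h, patch_w, patch_h) <= 0:
        raise ValueError("all sizes must be positive")
    scale = min(patch_w / text_w, patch_h / text_h)
    new_w, new_h = text_w * scale, text_h * scale
    return scale, (patch_w - new_w) / 2, (patch_h - new_h) / 2


def apply_affine(point, scale, dx, dy):
    """Map a point from text-image coordinates into patch coordinates."""
    x, y = point
    return (scale * x + dx, scale * y + dy)


# A wide 200x50 text image placed into a square 100x100 editing patch.
scale, dx, dy = fit_preserving_aspect(200, 50, 100, 100)
bottom_right = apply_affine((200, 50), scale, dx, dy)
```

Using a single scale factor for both axes is what prevents the stretched or squashed glyphs that appear when a longer or shorter target string is forced into the original box.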
[CV-65] GAOT: Generating Articulated Objects Through Text-Guided Diffusion Models ACM-MM
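[Quick Read]: This paper addresses the significant gap between text prompts and representations of 3D articulated objects, that is, how to generate 3D articulated objects with structurally plausible joints from natural language descriptions. The key to the solution is a three-phase framework, GAOT: first, a point cloud generation model is fine-tuned to produce a coarse object structure from the text prompt; second, a hypergraph-based learning method refines the coarse structure, representing object parts as graph vertices; finally, a diffusion model generates the joints, represented as graph edges, conditioned on the generated parts, completing the path from text to a structured articulated object.
Link: https://arxiv.org/abs/2512.03566
Authors: Hao Sun,Lei Fan,Donglin Di,Shaohui Liu
Affiliations: Harbin Institute of Technology; University of New South Wales; Li Auto
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM MM Asia2026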
Abstract:Articulated object generation has seen increasing advancements, yet existing models often lack the ability to be conditioned on text prompts. To address the significant gap between textual descriptions and 3D articulated object representations, we propose GAOT, a three-phase framework that generates articulated objects from text prompts, leveraging diffusion models and hypergraph learning. First, we fine-tune a point cloud generation model to produce a coarse representation of objects from text prompts. Given the inherent connection between articulated objects and graph structures, we design a hypergraph-based learning method to refine these coarse representations, representing object parts as graph vertices. Finally, leveraging a diffusion model, the joints of articulated objects, represented as graph edges, are generated based on the object parts. Extensive qualitative and quantitative experiments on the PartNet-Mobility dataset demonstrate the effectiveness of our approach, achieving superior performance over previous methods.
[CV-66] RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL
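[Quick Read]: This paper addresses the limited generalization of embodied policies across diverse scenarios: traditional Imitation Learning (IL) tends to overfit to specific expert trajectories, while Reinforcement Learning (RL) is constrained by the lack of a unified, general reward signal. The key to the solution is the RoboScape-R framework, which uses a world model as a general-purpose environment proxy and introduces an "endogenous" reward mechanism derived from the world model's intrinsic understanding of state transition dynamics. This produces task-agnostic reward signals and an efficient, general training environment, and improves policy performance in out-of-domain scenarios by 37.5% on average over baselines.
Link: https://arxiv.org/abs/2512.03556
Authors: Yinzhou Tang,Yu Shang,Yinuo Chen,Bingwen Wei,Xin Zhang,Shu’ang Yu,Liangzhi Shi,Chao Yu,Chen Gao,Wei Wu,Yong Li
Affiliations: Tsinghua University; Manifold AI
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: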
Abstract:Achieving generalizable embodied policies remains a key challenge. Traditional policy learning paradigms, including both Imitation Learning (IL) and Reinforcement Learning (RL), struggle to cultivate generalizability across diverse scenarios. While IL policies often overfit to specific expert trajectories, RL suffers from the inherent lack of a unified and general reward signal necessary for effective multi-scene generalization. We posit that the world model is uniquely capable of serving as a universal environment proxy to address this limitation. However, current world models primarily focus on their ability to predict observations and still rely on task-specific, handcrafted reward functions, thereby failing to provide a truly general training environment. To address this problem, we propose RoboScape-R, a framework leveraging the world model to serve as a versatile, general-purpose proxy for the embodied environment within the RL paradigm. We introduce a novel world model-based general reward mechanism that generates “endogenous” rewards derived from the model’s intrinsic understanding of real-world state transition dynamics. Extensive experiments demonstrate that RoboScape-R effectively addresses the limitations of traditional RL methods by providing an efficient and general training environment that substantially enhances the generalization capability of embodied policies. Our approach offers critical insights into utilizing the world model as an online training strategy and achieves an average 37.5% performance improvement over baselines under out-of-domain scenarios.
[CV-67] Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching KDD2026
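[Quick Read]: This paper addresses content moderation on large-scale user-generated video platforms, especially livestreams, where moderation must be timely, multimodal, and adaptable to new forms of violating content. The key to the solution is a hybrid moderation framework that combines a supervised classification model for known violations with reference-based similarity matching for novel or subtle cases. A Multimodal Large Language Model (MLLM) distills knowledge into both subsystems, raising accuracy while keeping inference lightweight, so that both explicit violations and emerging adversarial behaviors are covered.
Link: https://arxiv.org/abs/2512.03553
Authors: Wei Chee Yew,Hailun Xu,Sanjay Saha,Xiaotian Fan,Hiok Hian Ong,David Yuchen Wang,Kanchan Sarkar,Zhenheng Yang,Danhui Guan
Affiliations: TikTok
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at KDD 2026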
Abstract:Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
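The hybrid decision, a classifier for known violations backed by similarity matching against reference examples, can be sketched as a two-stage rule. This is a simplified illustration and not the production system: the thresholds, the string labels, and the use of cosine similarity over a single embedding are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def moderate(classifier_score, embedding, reference_bank,
             score_threshold=0.8, similarity_threshold=0.9):
    """Flag a stream if the supervised classifier is confident, or if its
    embedding closely matches any reference example of unwanted content.
    Both thresholds are hypothetical placeholders."""
    if classifier_score >= score_threshold:
        return "flag:classifier"
    best = max((cosine(embedding, ref) for ref in reference_bank), default=0.0)
    if best >= similarity_threshold:
        return "flag:similarity"
    return "pass"


# A low classifier score can still be caught by a close reference match.
decision = moderate(0.1, [1.0, 0.0], [[2.0, 0.0]])
```

The practical appeal of the second path is that covering a new violation pattern only requires adding reference examples to the bank, with no classifier retraining.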
[CV-68] V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention
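[Quick Read]: This paper addresses hallucination in Multimodal Large Language Models (MLLMs) on vision-language tasks, where generated content is inconsistent with the input image, undermining reliability in precision-sensitive settings. The paper attributes the problem to "visual neglect": the model fails to attend sufficiently to the input image. Existing methods usually mitigate hallucinations by intervening in attention scores or output logits, but overlook the prerequisite question of when to intervene, which leads to "over-intervention" that introduces new hallucinations and adds computational overhead. The proposed V-ITI framework first detects visual neglect from head-level activation patterns and then applies a lightweight visual inference-time intervention that uses prestored visual activation information only when visual neglect is detected, giving targeted, efficient hallucination mitigation.
Link: https://arxiv.org/abs/2512.03542
Authors: Nan Sun,Zhenyu Zhang,Xixun Lin,Kun Wang,Yanmin Shang,Naibin Gu,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang,Yanan Cao
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc.; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: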
Abstract:Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with input visuals, that undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on “how to intervene” but overlooking the prerequisite “when to intervene”, which leads to the “over-intervention” problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when the visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
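The "detect first, intervene only when needed" idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the V-ITI implementation: the logistic probe, the additive update, and the names `neglect_probability`, `intervene`, `threshold`, and `alpha` are all hypothetical.

```python
import math


def neglect_probability(head_activation, probe_w, probe_b):
    """Hypothetical linear probe: probability that this head is neglecting the image."""
    z = sum(w * a for w, a in zip(probe_w, head_activation)) + probe_b
    return 1.0 / (1.0 + math.exp(-z))


def intervene(head_activation, stored_visual, probe_w, probe_b,
              threshold=0.5, alpha=1.0):
    """Add prestored visual activation only when the probe flags visual neglect."""
    if neglect_probability(head_activation, probe_w, probe_b) <= threshold:
        # No neglect detected: leave the activation untouched (no over-intervention).
        return list(head_activation)
    return [a + alpha * v for a, v in zip(head_activation, stored_visual)]


# Toy values (assumptions): a 2-dimensional head activation and probe.
probe_w, probe_b = [-4.0, 0.0], 0.0
stored_visual = [0.5, 0.5]
attentive = intervene([1.0, 0.0], stored_visual, probe_w, probe_b)
neglecting = intervene([-1.0, 0.0], stored_visual, probe_w, probe_b)
print(attentive, neglecting)
```

The gate is the point of the sketch: an always-on intervention would also modify the first activation, which is the over-intervention the abstract warns about.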
[CV-69] CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
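[Quick read]: This paper addresses two limitations of current generative AI on structured multi-step tasks such as recipe illustration: diffusion models struggle to capture the sequential logic and visual semantics that link cooking steps, and existing methods generate a fixed number of images and so cannot adapt to the natural variation in recipe length. The key to the solution is the CookAnything framework, with three components: (1) Step-wise Regional Control (SRC), which aligns textual steps with image regions within a single denoising process for fine-grained control; (2) Flexible RoPE, a step-aware positional encoding mechanism that improves temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which keeps ingredients consistent across steps. The framework generates semantically distinct, visually consistent image sequences of arbitrary length and outperforms existing methods in both training-based and training-free settings.
Link: https://arxiv.org/abs/2512.03540
Authors: Ruoxuan Zhang,Bin Wen,Hongxia Xie,Yi Yao,Songhan Zuo,Jian-Yu Jiang-Lin,Hong-Han Shuai,Wen-Huang Cheng
Affiliations: Jilin University; National Yang Ming Chiao Tung University; National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM Multimedia 2025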
Abstract:Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
[CV-70] Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
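[Quick read]: This paper addresses imprecise alignment between user intent and generated output in text-to-visual generation, where a single attempt often fails to produce the desired result. Prior approaches mainly scale the generation process (for example, more sampling steps or seeds), but quality quickly plateaus because the prompt stays fixed during inference. The authors propose PRIS (Prompt Redesign for Inference-time Scaling), a framework that revises the prompt during inference: it reviews multiple generated visuals, identifies recurring failure patterns, and redesigns the prompt before regenerating. For fine-grained alignment feedback, the paper introduces an element-level factual correction verifier, which judges attribute-level consistency between prompt and visuals more accurately and interpretably than holistic measures. Experiments indicate that jointly scaling prompts and visuals is key to fully exploiting inference-time scaling laws.
Link: https://arxiv.org/abs/2512.03534
Authors: Subin Kim,Sangwoo Mo,Mamshad Nayeem Rizve,Yiran Xu,Difan Liu,Jinwoo Shin,Tobias Hinz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Visualizations are available at the website: this https URL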
Abstract:Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: this https URL.
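The review, diagnose, and redesign cycle described in the abstract can be sketched as a generic loop. This is a minimal sketch under stated assumptions, not the PRIS implementation: `pris_loop`, its parameters, and the three injected callables (`generate`, `verify`, `revise`) are hypothetical stand-ins for a generator, an element-level verifier, and a prompt rewriter.

```python
from collections import Counter


def pris_loop(prompt, generate, verify, revise,
              n_samples=4, n_rounds=3, min_share=0.5):
    """Generate several visuals, find recurring failures, revise the prompt, repeat.

    generate(prompt, seed) -> visual
    verify(original_prompt, visual) -> iterable of prompt elements the visual got wrong
    revise(prompt, failed_elements) -> new prompt
    """
    original = prompt  # alignment is always judged against the user's intent
    history = []
    for _ in range(n_rounds):
        visuals = [generate(prompt, seed) for seed in range(n_samples)]
        history.append((prompt, visuals))
        failures = Counter()
        for visual in visuals:
            failures.update(set(verify(original, visual)))
        # Keep only failures shared by enough samples to count as a pattern.
        recurring = sorted(e for e, c in failures.items()
                           if c / n_samples >= min_share)
        if not recurring:
            break
        prompt = revise(prompt, recurring)
    return prompt, history


# Toy stand-ins (assumptions): the "visual" is the prompt text itself, and the
# element "red" counts as satisfied only once the prompt emphasizes it.
def toy_generate(prompt, seed):
    return prompt


def toy_verify(original, visual):
    return set() if "(red)" in visual else {"red"}


def toy_revise(prompt, failed):
    return prompt + " " + " ".join(f"({e})" for e in failed)


final_prompt, history = pris_loop("a red cube", toy_generate, toy_verify, toy_revise)
print(final_prompt)
```

Counting failures across samples, rather than reacting to one bad sample, is what separates a recurring pattern worth a prompt change from seed-level noise.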
[CV-71] OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
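[Quick read]: This paper addresses the limited generalization of open-vocabulary 3D instance segmentation (OV-3DIS) in diverse, unstructured, mesh-free environments. Existing methods are held back by two factors: proposal generation depends on dataset-specific proposal networks or mesh-based superpoints, which do not apply to mesh-free scenes and limit generalization to novel scenes; and CLIP-based classifiers have weak textual reasoning and struggle with compositional and functional user queries. The key to the solution is the OpenTrack3D framework: a novel visual-spatial tracker builds cross-view consistent object proposals online, with no pre-generated proposals and no dependence on meshes, and a multi-modal large language model (MLLM) replaces CLIP to strengthen compositional reasoning over complex queries. The method reaches state-of-the-art performance and strong generalization on ScanNet200, Replica, ScanNet++, and SceneFun3D.
Link: https://arxiv.org/abs/2512.03532
Authors: Zhishan Zhou,Siyuan Wei,Zengran Wang,Chunjie Wang,Xiaosheng Yan,Xiao Liu
Affiliations: PICO, ByteDance, Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: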
Abstract:Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
[CV-72] MSG-Loc: Multi-Label Likelihood-based Semantic Graph Matching for Object-Level Global Localization
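[Quick read]: This paper addresses global localization for robots in environments with unknown object classes and semantic ambiguity, where ambiguity causes object misclassification and incorrect associations and, in turn, large pose estimation errors. The key to the solution is a multi-label likelihood-based semantic graph matching framework: it uses multi-label rather than single-label graph representations to capture the inherent semantic context of observed objects, and a context-aware likelihood propagation step combines each node's likelihood with the maximum likelihood of its neighbors to strengthen semantic correspondence across graphs.
Link: https://arxiv.org/abs/2512.03522
Authors: Gihyeon Lee,Jungwoo Lee,Juwon Kim,Young-Sik Shin,Younggun Cho
Affiliations: Inha University; Korea Institute of Machinery and Materials
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE Robotics and Automation Letters (2025)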
Abstract:Robots are often required to localize in environments with unknown object classes and semantic ambiguity. However, when performing global localization using semantic objects, high semantic ambiguity intensifies object misclassification and increases the likelihood of incorrect associations, which in turn can cause significant errors in the estimated pose. Thus, in this letter, we propose a multi-label likelihood-based semantic graph matching framework for object-level global localization. The key idea is to exploit multi-label graph representations, rather than single-label alternatives, to capture and leverage the inherent semantic context of object observations. Based on these representations, our approach enhances semantic correspondence across graphs by combining the likelihood of each node with the maximum likelihood of its neighbors via context-aware likelihood propagation. For rigorous validation, data association and pose estimation performance are evaluated under both closed-set and open-set detection configurations. In addition, we demonstrate the scalability of our approach to large-vocabulary object categories in both real-world indoor scenes and synthetic environments.
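One way to read "combining the likelihood of each node with the maximum likelihood of its neighbors" is sketched below. This is a toy interpretation, not the paper's formulation: the multiplicative combination, the averaging over neighbors, and the names `propagate`, `obs_neighbors`, and `map_neighbors` are assumptions made for illustration.

```python
def propagate(likelihood, obs_neighbors, map_neighbors):
    """Refine node-to-node match scores with neighborhood context.

    likelihood[i][j]: label-based likelihood that observed node i matches map node j.
    obs_neighbors[i] / map_neighbors[j]: adjacent nodes in each semantic graph.
    Returns {(i, j): refined_score}.
    """
    refined = {}
    for i, row in likelihood.items():
        for j, own in row.items():
            support = []
            for u in obs_neighbors.get(i, []):
                # Best match of observed neighbor u among the neighbors of map node j.
                candidates = [likelihood[u][v]
                              for v in map_neighbors.get(j, [])
                              if v in likelihood.get(u, {})]
                support.append(max(candidates, default=0.0))
            # Nodes without observed neighbors keep their own likelihood.
            context = sum(support) / len(support) if support else 1.0
            refined[(i, j)] = own * context
    return refined


# Toy graphs (hypothetical): observed nodes a-b are adjacent; map nodes X-Y are
# adjacent, and Z is isolated. Node a looks equally like X and Z by label alone.
likelihood = {"a": {"X": 0.6, "Z": 0.6}, "b": {"Y": 0.9, "X": 0.1}}
obs_neighbors = {"a": ["b"], "b": ["a"]}
map_neighbors = {"X": ["Y"], "Y": ["X"], "Z": []}
refined = propagate(likelihood, obs_neighbors, map_neighbors)
print(refined[("a", "X")], refined[("a", "Z")])
```

In the toy case, context breaks the tie: a's neighbor b matches X's neighbor Y well, so the a-to-X association is preferred over a-to-Z.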
[CV-73] FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
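[Quick read]: This paper addresses text-driven streaming human motion generation: given text prompts that change over time, generate text-aligned, seamless motion sequences with real-time latency. Existing methods mostly rely on chunk-by-chunk processing or autoregressive models with a diffusion head. The proposed FloodDiffusion framework instead models the task with diffusion forcing and, because vanilla diffusion forcing fails to model real motion distributions, tailors it in three ways: (i) bi-directional attention instead of causal attention; (ii) a lower-triangular time scheduler instead of a random one; and (iii) text conditioning introduced in a continuous, time-varying manner. With these changes, a diffusion-forcing-based framework reaches state-of-the-art performance on streaming motion generation for the first time, with an FID of 0.057 on the HumanML3D benchmark.
Link: https://arxiv.org/abs/2512.03520
Authors: Yiyi Cai,Yuhan Wu,Kunhang Li,You Zhou,Bo Zheng,Haiyang Liu
Affiliations: Shanda AI Research Tokyo; The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures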
Abstract:We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that to guarantee modeling the output distribution, the vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying way. With these improvements, we demonstrate for the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available at this https URL.
[CV-74] CSMapping: Scalable Crowdsourced Semantic Mapping and Topology Inference for Autonomous Driving
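[Quick read]: This paper addresses the problem that, in crowdsourced HD map construction for autonomous driving, noise from low-cost sensors prevents map quality from improving as data volume grows. The proposed CSMapping system uses a latent diffusion model as a generative prior over real-world map structure, which requires no paired crowdsourced/HD-map supervision. The prior is incorporated through constrained maximum a posteriori (MAP) optimization in latent space, which keeps the result robust under severe noise and plausibly completes unobserved areas. Efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and latent-space factor-graph optimization provide global consistency, so that the quality of both semantic maps and topological road centerlines keeps improving as crowdsourced data grows.
Link: https://arxiv.org/abs/2512.03510
Authors: Zhijian Qiao,Zehuan Yu,Tong Li,Chih-Chung Chou,Wenchao Ding,Shaojie Shen
Affiliations: Hong Kong University of Science and Technology; Fudan University; Zhuoyu Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: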
Abstract:Crowdsourcing enables scalable autonomous driving map construction, but low-cost sensor noise hinders quality from improving with data volume. We propose CSMapping, a system that produces accurate semantic maps and topological road centerlines whose quality consistently increases with more crowdsourced data. For semantic mapping, we train a latent diffusion model on HD maps (optionally conditioned on SD maps) to learn a generative prior of real-world map structure, without requiring paired crowdsourced/HD-map supervision. This prior is incorporated via constrained MAP optimization in latent space, ensuring robustness to severe noise and plausible completion in unobserved areas. Initialization uses a robust vectorized mapping module followed by diffusion inversion; optimization employs efficient Gaussian-basis reparameterization, projected gradient descent with multi-start, and latent-space factor-graph for global consistency. For topological mapping, we apply confidence-weighted k-medoids clustering and kinematic refinement to trajectories, yielding smooth, human-like centerlines robust to trajectory variation. Experiments on nuScenes, Argoverse 2, and a large proprietary dataset achieve state-of-the-art semantic and topological mapping performance, with thorough ablation and scalability studies.
[CV-75] AfroBeats Dance Movement Analysis Using Computer Vision: A Proof-of-Concept Framework Combining YOLO and Segment Anything Model
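[Quick read]: This paper addresses automated dance movement analysis without markers or specialized equipment, with the aim of quantifying dancer motion. The key to the solution is a framework that combines YOLOv8/v11 object detection with pixel-level segmentation from the Segment Anything Model (SAM): YOLO locates the dancers, and SAM provides precise segmentation that captures changes in body configuration and supports metrics such as step counts, spatial coverage, and rhythm consistency. Technical feasibility is demonstrated on a single Ghanaian AfroBeats dance video, with about 94% detection precision and about 83% segmentation IoU.
Link: https://arxiv.org/abs/2512.03509
Authors: Kwaku Opoku-Ware,Gideon Opoku
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: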
Abstract:This paper presents a preliminary investigation into automated dance movement analysis using contemporary computer vision techniques. We propose a proof-of-concept framework that integrates YOLOv8 and v11 for dancer detection with the Segment Anything Model (SAM) for precise segmentation, enabling the tracking and quantification of dancer movements in video recordings without specialized equipment or markers. Our approach identifies dancers within video frames, counts discrete dance steps, calculates spatial coverage patterns, and measures rhythm consistency across performance sequences. Testing this framework on a single 49-second recording of Ghanaian AfroBeats dance demonstrates technical feasibility, with the system achieving approximately 94% detection precision and 89% recall on manually inspected samples. The pixel-level segmentation provided by SAM, achieving approximately 83% intersection-over-union with visual inspection, enables motion quantification that captures body configuration changes beyond what bounding-box approaches can represent. Analysis of this preliminary case study indicates that the dancer classified as primary by our system executed 23% more steps with 37% higher motion intensity and utilized 42% more performance space compared to dancers classified as secondary. However, this work represents an early-stage investigation with substantial limitations including single-video validation, absence of systematic ground truth annotations, and lack of comparison with existing pose estimation methods. We present this framework to demonstrate technical feasibility, identify promising directions for quantitative dance metrics, and establish a foundation for future systematic validation studies.
[CV-76] Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation ICCV2025
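[Quick read]: This paper addresses the semantic misalignment between visual and textual contexts that current domain generalized semantic segmentation (DGSS) methods overlook, which stems from the rigidity of a fixed context prompt learned on a single source domain. The key to the solution is the Domain-aware Prompt-driven Masked Transformer (DPMFormer), with three components: (1) domain-aware prompt learning to improve semantic alignment between visual and textual cues; (2) domain-aware contrastive learning combined with texture perturbation to capture diverse domain-specific properties from a single source dataset; and (3) domain-robust consistency learning, which minimizes the discrepancy between predictions on original and augmented images to make the model robust to environmental changes. The framework sets a new state of the art on several DGSS benchmarks.
Link: https://arxiv.org/abs/2512.03508
Authors: Seogkyu Jeon,Kibeom Hong,Hyeran Byun
Affiliations: Yonsei University; Sookmyung Women’s University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 (poster)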
Abstract:Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). Firstly, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. To capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between predictions on the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at this https URL.
[CV-77] EEA: Exploration-Exploitation Agent for Long Video Understanding
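[Quick read]: This paper addresses two problems in long video understanding: heavy computational overhead from dense preprocessing, and incomplete information coverage and inefficiency caused by a poor balance between exploration and exploitation. The key to the solution is EEA, a video agent framework that balances exploration and exploitation through a semantically guided hierarchical tree search. EEA autonomously discovers and dynamically updates task-relevant semantic queries and uses the video frames that match them as semantic anchors; during tree search it prioritizes semantically relevant frames while still ensuring sufficient coverage of unknown segments; and it explicitly models uncertainty to adaptively combine intrinsic rewards from vision-language models (VLMs) with semantic priors, giving stable and precise evaluation of video segments.
Link: https://arxiv.org/abs/2512.03500
Authors: Te Yang,Xiangyu Zhu,Bo Wang,Quan Chen,Peng Jiang,Zhen Lei
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Kuaishou Technology; Centre for Artificial Intelligence and Robotics, HKISI, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: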
Abstract:Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance with a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
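[CV-78] Towards Object-centric Understanding for Instructional Videos
[Quick read]: This paper addresses the inflexibility of existing action-centric methods on real-world procedural tasks, where step order changes with object states. The key to the solution is an object-centric paradigm that treats actions as mechanisms that drive state transitions. The authors build the Object-IVQA benchmark, which evaluates four dimensions of object-level reasoning: state evolution, precondition verification, counterfactual reasoning, and mistake recognition. They also design an agent framework that orchestrates object-centric planning, perception, analysis, and generation tools, supporting explicit evidence retrieval and multi-hop reasoning across disjoint segments, which substantially improves object-level recognition and reasoning.
Link: https://arxiv.org/abs/2512.03479
Authors: Wenliang Guo,Yu Kong
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: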
Abstract:Understanding procedural activities is crucial for developing future assistive AI that can reason about complex real-world tasks. Existing action-centric methods struggle with the flexibility of real procedures, where step order varies depending on object states. In this work, we propose to shift the focus to an object-centric paradigm by regarding actions as mechanisms that drive state transitions. To advance this direction, we introduce Object-IVQA, a long-form instructional video benchmark with 107 videos and 514 open-ended question-answer pairs annotated with temporally grounded evidence. The benchmark evaluates four dimensions of object-centric reasoning, including state evolution, precondition verification, counterfactual reasoning and mistake recognition. We further propose an agent framework that orchestrates object-centric planning, perception, analysis and generation tools, enabling explicit evidence retrieval and multi-hop reasoning across disjoint segments. Experiments show that existing large vision-language models struggle in object-level recognition and reasoning, whereas our framework achieves substantial improvement.
[CV-79] Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
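[Quick read]: This paper addresses the fairness problem that medical Vision-Language Models (VLMs) show significant differences in diagnostic accuracy across demographic groups. The key to the solution is fairness-aware Low-Rank Adaptation built around a differentiable MaxAccGap loss, which enables end-to-end optimization of accuracy parity across groups. Three variants are proposed: FR-LoRA adds MaxAccGap regularization to the training objective, GR-LoRA applies inverse-frequency weighting to balance gradient contributions, and Hybrid-LoRA combines the two. With only 0.24% trainable parameters, the approach reduces diagnostic accuracy disparities (GR-LoRA by 69%), making fair medical AI practical to deploy in resource-constrained settings.
Link: https://arxiv.org/abs/2512.03477
Authors: Zijian Gu,Yuxi Liu,Zhenhao Zhang,Song Wang
Affiliations: University of Rochester; Indiana University; University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 3 tables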
Abstract:Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both. On 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
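The abstract does not spell out the MaxAccGap loss. The sketch below shows one plausible form, assuming that "soft accuracy" per group is the mean predicted probability of the true class and that the loss is the largest gap between groups. The function names and this exact definition are assumptions, not the paper's formulation.

```python
def soft_group_accuracy(probs_true, groups):
    """Mean probability assigned to the correct class, per demographic group.

    probs_true[k]: model probability of the true label for sample k.
    groups[k]: group identifier for sample k.
    """
    sums, counts = {}, {}
    for p, g in zip(probs_true, groups):
        sums[g] = sums.get(g, 0.0) + p
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}


def max_acc_gap(probs_true, groups):
    """Largest soft-accuracy difference between any two groups (0 means parity)."""
    acc = soft_group_accuracy(probs_true, groups)
    return max(acc.values()) - min(acc.values())


# Toy batch (hypothetical): group A is served better than group B.
probs_true = [0.9, 0.7, 0.5, 0.3]
groups = ["A", "A", "B", "B"]
gap = max_acc_gap(probs_true, groups)
print(gap)
```

Because the gap is built from probabilities rather than thresholded predictions, the same expression written in an autodiff framework has usable (sub)gradients, which is what would allow it to serve as a regularizer added to the task loss.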
[CV-80] Procedural Mistake Detection via Action Effect Modeling
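[Quick read]: This paper addresses a limitation of mistake detection in procedural tasks: existing methods focus on how an action is executed and overlook what it produces (the action effect), so they miss errors that do not show up during execution but appear in the outcome, such as an unintended object state or an incorrect spatial arrangement. The key to the solution is Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcome through a probabilistic formulation. AEM first selects the most informative effect frame based on semantic relevance and visual quality, then extracts complementary cues from visual grounding and symbolic scene graphs and aligns them in a shared latent space to form robust effect-aware representations. A prompt-based detector then aligns each action segment with its intended execution semantics for more reliable mistake detection.
Link: https://arxiv.org/abs/2512.03474
Authors: Wenliang Guo,Yujiang Pu,Yu Kong
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: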
Abstract:Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.
[CV-81] Difference Decomposition Networks for Infrared Small Target Detection
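[Quick read]: This paper addresses infrared small target detection (ISTD), where targets lack discernible texture and severe background clutter obscures them. The key to the solution is the Basis Decomposition Module (BDM), which decomposes a complex feature into several basis features to enhance useful information and remove redundancy. BDM is extended into the Spatial Difference Decomposition Module (SD²M), the Spatial Difference Decomposition Downsampling Module (SD³M), and the Temporal Difference Decomposition Module (TD²M), which are used to build SD²Net for single-frame ISTD and STD²Net for multi-frame ISTD. Introducing spatial and temporal difference information improves detection, with state-of-the-art results on multiple benchmark datasets.
Link: https://arxiv.org/abs/2512.03470
Authors: Chen Hu,Mingyu Zhou,Shuai Yuan,Hongbo Hu,Xiangyu Qiu,Junhai Luo,Tian Pu,Xiyin Li
Affiliations: Sun Yat-sen University; Hefei University of Technology; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: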
Abstract:Infrared small target detection (ISTD) faces two major challenges: a lack of discernible target texture and severe background clutter, which results in the background obscuring the target. To enhance targets and suppress backgrounds, we propose the Basis Decomposition Module (BDM) as an extensible and lightweight module based on basis decomposition, which decomposes a complex feature into several basis features and enhances certain information while eliminating redundancy. Extending BDM leads to a series of modules, including the Spatial Difference Decomposition Module (SD²M), Spatial Difference Decomposition Downsampling Module (SD³M), and Temporal Difference Decomposition Module (TD²M). Based on these modules, we develop the Spatial Difference Decomposition Network (SD²Net) for single-frame ISTD (SISTD) and the Spatiotemporal Difference Decomposition Network (STD²Net) for multi-frame ISTD (MISTD). SD²Net integrates SD²M and SD³M within an adapted U-shaped architecture. We employ TD²M to introduce motion information, which transforms SD²Net into STD²Net. Extensive experiments on SISTD and MISTD datasets demonstrate state-of-the-art (SOTA) performance. On the SISTD task, SD²Net performs well compared to most established networks. On the MISTD datasets, STD²Net achieves a mIoU of 87.68%, outperforming SD²Net, which achieves a mIoU of 64.97%. Our codes are available: this https URL.
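[CV-82] Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
[Quick read]: This paper addresses visual grounding (VG) from natural-language commands in autonomous driving, in particular commands that are ambiguous or context-dependent, and the lack of reasoning over 3D spatial relations and anticipated scene evolution. Existing methods perform poorly in complex driving scenes, mainly because they rely on static perception and short-horizon decisions. The key to the solution is the ThinkDeeper framework. Its core is a Spatial-Aware World Model (SA-WM) that distills the current scene into a command-aware latent state and rolls out a sequence of future latent states, providing forward-looking cues for disambiguation. A hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for more robust and accurate localization.
Link: https://arxiv.org/abs/2512.03454
Authors: Haicheng Liao,Huanming Shen,Bonan Wang,Yongkang Li,Yihong Tang,Chengyue Wang,Dingyi Zhuang,Kehua Chen,Hai Yang,Chengzhong Xu,Zhenning Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: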
Abstract:Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
[CV-83] GeoVideo: Introducing Geometric Regularization into Video Generation Model
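[Quick read]: This paper addresses problems of current diffusion-transformer video generation methods that model purely in 2D pixel space: temporally inconsistent geometry, implausible motion, and structural artifacts. The key to the solution is to add geometric regularization losses to latent diffusion models through per-frame depth prediction, which brings explicit 3D structure into the model. In particular, a multi-view geometric loss aligns the depth maps predicted for different frames within a shared 3D coordinate system, improving spatio-temporal coherence, shape consistency, and physical plausibility.
Link: https://arxiv.org/abs/2512.03453
Authors: Yunpeng Bai,Shaoheng Fang,Chaohui Yu,Fan Wang,Qixing Huang
Affiliations: The University of Texas at Austin; DAMO Academy, Alibaba Group; Hupan Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL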
Abstract:Recent advances in video generation have enabled the synthesis of high-quality and visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in the 2D pixel space and lack explicit mechanisms for modeling 3D structures, often resulting in temporally inconsistent geometries, implausible motions, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopted depth as the geometric representation because of the great progress in depth prediction and its compatibility with image-based latent encoders. Specifically, to enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved spatio-temporal coherence, shape consistency, and physical plausibility. Experiments across multiple datasets show that our approach produces significantly more stable and geometrically consistent results than existing baselines.
[CV-84] GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers
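[Quick read]: This paper targets the low computational efficiency of diffusion models for video generation: each video needs dozens of iterative steps and classifier-free guidance (CFG) doubles the compute, which limits deployment in downstream applications. The key contribution is GalaxyDiT, a training-free acceleration method that combines guidance alignment with systematic proxy selection for reuse metrics. Rank-order correlation analysis identifies the best proxy for each video model, across model families and parameter scales. On Wan2.1-1.3B and Wan2.1-14B the method reaches 1.87× and 2.37× speedups with drops of only 0.97% and 0.72% on VBench-2.0, and at high speedup rates it exceeds prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
Link: https://arxiv.org/abs/2512.03451
Authors: Zhiye Song,Steve Dai,Ben Keller,Brucek Khailany
Affiliations: Massachusetts Institute of Technology; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: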
Abstract:Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve 1.87× and 2.37× speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
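The rank-order correlation step can be illustrated with a plain Spearman correlation. This is a sketch under assumptions: the abstract does not say which candidate proxies are compared or what the target quantity is, so `candidates`, `target`, and `best_proxy` are hypothetical names for per-step proxy measurements, the true reuse error, and the selection rule.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def best_proxy(candidates: Dict[str, List[float]], target: List[float]) -> str:
    """Return the candidate whose ordering of steps best matches the target's ordering."""
    return max(candidates, key=lambda name: spearman(candidates[name], target))
```

Rank correlation only asks whether a proxy orders the denoising steps the same way the true error does, which is what matters when the proxy is used to decide which steps can reuse cached computation.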
[CV-85] KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion Models
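[Quick read]: This paper addresses unsupervised learning of spatially structured 3D keypoints that can serve modern 3D generative pipelines. The core obstacle is that most existing unsupervised keypoint methods are not designed for unconditional generative settings. The proposed framework learns spatially structured 3D keypoints from point clouds and uses them as a compact, interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints show repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation; keypoint consistency improves by 6 percentage points over prior approaches.
Link: https://arxiv.org/abs/2512.03450
Authors: Rhys Newbury,Juyan Zhang,Tin Tran,Hanna Kurniawati,Dana Kulić
Affiliations: Monash University; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: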
Abstract:Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.
[CV-86] LM-CartSeg: Automated Segmentation of Lateral and Medial Cartilage and Subchondral Bone for Radiomics Analysis
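[Quick read]: This paper addresses the lack of robust, anatomically meaningful regions of interest (ROIs) for radiomics of knee MRI, specifically how to segment cartilage and subchondral bone automatically, obtain a geometrically stable lateral/medial (L/M) compartmentalisation, and report quality control (QC). The proposed LM-CartSeg pipeline trains two 3D nnU-Net models on SKM-TEA and OAIZIB-CM respectively and fuses their zero-shot predictions at test time. Simple geometric rules then refine the result: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Post-processing reduces HD95 from 25.2 mm to 3.35 mm, and the geometric L/M rule stays stable across datasets. Volume and thickness signatures serve as QC, and 4,650 non-shape radiomic features are extracted for OA vs. non-OA classification; only a small share of features correlates strongly with ROI size, suggesting the features carry discriminative information beyond morphometry.
Link: https://arxiv.org/abs/2512.03449
Authors: Tongxu Zhang
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: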
Abstract:Background and Objective: Radiomics of knee MRI requires robust, anatomically meaningful regions of interest (ROIs) that jointly capture cartilage and subchondral bone. Most existing work relies on manual ROIs and rarely reports quality control (QC). We present LM-CartSeg, a fully automatic pipeline for cartilage/bone segmentation, geometric lateral/medial (L/M) compartmentalisation and radiomics analysis. Methods: Two 3D nnU-Net models were trained on SKM-TEA (138 knees) and OAIZIB-CM (404 knees). At test time, zero-shot predictions were fused and refined by simple geometric rules: connected-component cleaning, construction of 10 mm subchondral bone bands in physical space, and a data-driven tibial L/M split based on PCA and k-means. Segmentation was evaluated on an OAIZIB-CM test set (103 knees) and on SKI-10 (100 knees). QC used volume and thickness signatures. From 10 ROIs we extracted 4,650 non-shape radiomic features to study inter-compartment similarity, dependence on ROI size, and OA vs. non-OA classification on OAIZIB-CM. Results: Post-processing improved macro ASSD on OAIZIB-CM from 2.63 to 0.36 mm and HD95 from 25.2 to 3.35 mm, with DSC 0.91; zero-shot DSC on SKI-10 was 0.80. The geometric L/M rule produced stable compartments across datasets, whereas a direct L/M nnU-Net showed domain-dependent side swaps. Only 6 to 12 percent of features per ROI were strongly correlated with volume or thickness. Radiomics-based models outperformed models restricted to size-linked features. Conclusions: LM-CartSeg yields automatic, QCd ROIs and radiomic features that carry discriminative information beyond simple morphometry, providing a practical foundation for multi-centre knee OA radiomics studies.
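A minimal sketch of a PCA plus k-means split, simplified to 2D points (for example, in-plane coordinates of tibial voxels). The simplification to 2D, the function names, and the initialization are assumptions made for illustration; the abstract does not give the pipeline's actual implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def principal_axis(points: List[Point]) -> Tuple[Point, Point]:
    """Return (centroid, unit vector of the first principal component) for 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def two_means_1d(values: List[float], iters: int = 50) -> List[int]:
    """Lloyd's k-means with k=2 on scalars; label 0 is the lower-valued cluster."""
    c0, c1 = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        g0 = [v for v, lab in zip(values, labels) if lab == 0]
        g1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not g0 or not g1:
            break
        n0, n1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if n0 == c0 and n1 == c1:
            break
        c0, c1 = n0, n1
    return labels


def split_compartments(points: List[Point]) -> List[int]:
    """Project points onto the principal axis, then cluster the projections into two groups."""
    (mx, my), (ax, ay) = principal_axis(points)
    proj = [(p[0] - mx) * ax + (p[1] - my) * ay for p in points]
    return two_means_1d(proj)
```

The sign of a principal axis is arbitrary, so the labels 0 and 1 only separate the two compartments; mapping them to lateral and medial needs extra anatomical information such as knee side, which this sketch does not model.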
[CV-87] Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation
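[Quick read]: This paper tackles two challenges for vision-language pretraining (VLP) in medical image analysis: noise in web-collected data, and the difficulty of learning from long, unstructured text. It proposes a framework that combines a Multi-Agent data GENeration (MAGEN) system with Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. MAGEN improves data quality through foundation-model-assisted captioning and retrieval-based verification. O-MAKE decomposes long texts into distinct knowledge aspects, enabling fine-grained alignment at both global and patch levels, and models relationships between medical concepts explicitly through ontology-guided mechanisms. The result is improved zero-shot disease classification and cross-modal retrieval in dermatology.
Link: https://arxiv.org/abs/2512.03445
Authors: Xieji Li,Siyuan Yan,Yingsheng Liu,H. Peter Soyer,Monika Janda,Victoria Mar,Zongyuan Ge
Affiliations: Monash University; The University of Queensland; Alfred Health
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages. Under Review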
Abstract:Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at this https URL.
[CV-88] Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features ICML2025
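[Quick read]: This paper addresses land cover classification in hyperspectral imaging (HSI), where low spatial resolution and sparse annotations make the task hard. The key idea is to extract low-level spatial features from a frozen diffusion model pretrained on natural images; these come from high-resolution decoder layers at early denoising timesteps and transfer well to the low-texture structure of HSI. A lightweight FiLM (Feature-wise Linear Modulation) fusion module then modulates the frozen spatial features adaptively using spectral cues, integrating spectral and spatial information for robust multimodal learning under sparse supervision.
Link: https://arxiv.org/abs/2512.03430
Authors: Yuzhen Hu,Biplab Banerjee,Saurabh Prasad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the ICML 2025 TerraBytes Workshop (June 9, 2025)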
Abstract:Hyperspectral imaging (HSI) enables detailed land cover classification, yet low spatial resolution and sparse annotations pose significant challenges. We present a label-efficient framework that leverages spatial features from a frozen diffusion model pretrained on natural images. Our approach extracts low-level representations from high-resolution decoder layers at early denoising timesteps, which transfer effectively to the low-texture structure of HSI. To integrate spectral and spatial information, we introduce a lightweight FiLM-based fusion module that adaptively modulates frozen spatial features using spectral cues, enabling robust multimodal learning under sparse supervision. Experiments on two recent hyperspectral datasets demonstrate that our method outperforms state-of-the-art approaches using only the provided sparse training labels. Ablation studies further highlight the benefits of diffusion-derived features and spectral-aware fusion. Overall, our results indicate that pretrained diffusion models can support domain-agnostic, label-efficient representation learning for remote sensing and broader scientific imaging tasks.
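FiLM applies a per-channel scale and shift computed from a conditioning input. The sketch below assumes the simplest case: one spatial feature vector, one spectral vector, and single linear layers producing the scale (gamma) and shift (beta). The parameter names and the choice of linear layers are assumptions; the abstract does not describe the module's internals.

```python
from typing import List

Matrix = List[List[float]]


def linear(weights: Matrix, bias: List[float], x: List[float]) -> List[float]:
    """Affine map W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(spatial: List[float], spectral: List[float],
         w_gamma: Matrix, b_gamma: List[float],
         w_beta: Matrix, b_beta: List[float]) -> List[float]:
    """Feature-wise linear modulation: gamma(spectral) * spatial + beta(spectral)."""
    gamma = linear(w_gamma, b_gamma, spectral)  # per-channel scale from spectral cues
    beta = linear(w_beta, b_beta, spectral)     # per-channel shift from spectral cues
    return [g * f + b for g, f, b in zip(gamma, spatial, beta)]
```

Because only the small gamma and beta generators carry trainable weights here, the spatial features can stay frozen, which fits the label-efficient setting the abstract describes.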
[CV-89] Generalization Evaluation of Deep Stereo Matching Methods for UAV-Based Forestry Applications
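[Quick read]: This paper addresses the limited cross-domain generalization of depth estimation methods in vegetation-dense scenes, which matters for autonomous UAV operations in forests. Existing evaluations focus on urban and indoor scenes and offer little evidence for forestry settings. The paper presents the first systematic zero-shot evaluation of eight state-of-the-art stereo matching methods, spanning iterative refinement, foundation model, and zero-shot adaptation paradigms. All models are trained only on Scene Flow and tested without fine-tuning on four standard benchmarks plus a newly captured dataset of 5,313 stereo pairs from forests in Canterbury, New Zealand (ZED Mini camera). The experiments expose performance differences between structured scenes and complex vegetation, and identify DEFOM as the best baseline in depth smoothness, occlusion handling, and cross-domain consistency.
Link: https://arxiv.org/abs/2512.03427
Authors: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: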
Abstract:Autonomous UAV forestry operations require robust depth estimation methods with strong cross-domain generalization. However, existing evaluations focus on urban and indoor scenarios, leaving a critical gap for specialized vegetation-dense environments. We present the first systematic zero-shot evaluation of eight state-of-the-art stereo methods (RAFT-Stereo, IGEV, IGEV++, BridgeDepth, StereoAnywhere, DEFOM, plus baseline methods ACVNet, PSMNet, TCstereo) spanning iterative refinement, foundation model, and zero-shot adaptation paradigms. All methods are trained exclusively on Scene Flow and evaluated without fine-tuning on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury forestry dataset captured with a ZED Mini camera (1920x1080). Performance reveals scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D, 0.83-1.07 px on KITTI; DEFOM: 0.35-4.65 px across benchmarks), while iterative methods maintain cross-domain robustness (IGEV++: 0.36-6.77 px; IGEV: 0.33-21.91 px). Critical finding: RAFT-Stereo exhibits catastrophic ETH3D failure (26.23 px EPE, 98 percent error rate) due to negative disparity predictions, while performing normally on KITTI (0.90-1.11 px). Qualitative evaluation on the Canterbury forestry dataset identifies DEFOM as the optimal gold-standard baseline for vegetation depth estimation, exhibiting superior depth smoothness, occlusion handling, and cross-domain consistency compared to IGEV++, despite IGEV++'s finer detail preservation.
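The figures quoted in px are end-point errors (EPE): the mean absolute disparity error over pixels that have ground truth. A small sketch follows; the 1 px bad-pixel threshold, the use of `None` for missing ground truth, and the function name are assumptions, since benchmarks differ in these conventions.

```python
from typing import List, Optional, Tuple


def disparity_metrics(pred: List[float], gt: List[Optional[float]],
                      bad_threshold: float = 1.0) -> Tuple[float, float, int]:
    """Return (EPE in px, bad-pixel rate, count of negative predictions) over valid pixels."""
    errors = []
    negatives = 0
    for p, g in zip(pred, gt):
        if g is None:  # no ground truth at this pixel, so it is not scored
            continue
        errors.append(abs(p - g))
        if p < 0:
            negatives += 1
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    epe = sum(errors) / len(errors)
    bad_rate = sum(e > bad_threshold for e in errors) / len(errors)
    return epe, bad_rate, negatives
```

Counting negative predictions separately makes the failure mode reported for RAFT-Stereo on ETH3D easy to spot, since valid disparities for a rectified stereo pair are non-negative.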
[CV-90] DM3D: Deformable Mamba via Offset-Guided Gaussian Sequencing for Point Cloud Understanding
[Quick read]: This paper addresses a limitation of State Space Models (SSMs) on point clouds: SSMs depend on input order, while existing methods use fixed serialization strategies that cannot adapt to diverse geometric structures. The proposed DM3D is a deformable Mamba architecture built around offset-guided Gaussian sequencing. Gaussian-based KNN Resampling (GKR) and Gaussian-based Differentiable Reordering (GDR) unify local resampling and global reordering, producing structure-aware sequences adaptively. A Tri-Path Frequency Fusion module further enhances feature complementarity and reduces aliasing. Together these enable structure-adaptive serialization of point clouds.
Link: https://arxiv.org/abs/2512.03424
Authors: Bin Liu,Chunyang Wang,Xuelian Liu
Affiliations: Xi’an Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:State Space Models (SSMs) demonstrate significant potential for long-sequence modeling, but their reliance on input order conflicts with the irregular nature of point clouds. Existing approaches often rely on predefined serialization strategies, which cannot adjust based on diverse geometric structures. To overcome this limitation, we propose DM3D, a deformable Mamba architecture for point cloud understanding. Specifically, DM3D introduces an offset-guided Gaussian sequencing mechanism that unifies local resampling and global reordering within a deformable scan. The Gaussian-based KNN Resampling (GKR) enhances structural awareness by adaptively reorganizing neighboring points, while the Gaussian-based Differentiable Reordering (GDR) enables end-to-end optimization of serialization order. Furthermore, a Tri-Path Frequency Fusion module enhances feature complementarity and reduces aliasing. Together, these components enable structure-adaptive serialization of point clouds. Extensive experiments on benchmark datasets show that DM3D achieves state-of-the-art performance in classification, few-shot learning, and part segmentation, demonstrating that adaptive serialization effectively unlocks the potential of SSMs for point cloud understanding.
[CV-91] What Is The Best 3D Scene Representation for Robotics? From Geometric to Foundation Models
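[Quick read]: This survey is organized around one question: what is the best 3D scene representation for robotics? It examines how well different scene representations suit five core modules: perception, mapping, localization, navigation, and manipulation. It systematically categorizes and compares traditional representations (point clouds, voxels, signed distance functions (SDF), scene graphs) with newer neural representations (NeRF, 3D Gaussian Splatting (3DGS), foundation models). It argues that neural representations, because they can integrate high-level semantic features and language-based priors, are well placed to support more comprehensive 3D scene understanding and embodied intelligence, and could become a unified solution for future robotic applications.
Link: https://arxiv.org/abs/2512.03422
Authors: Tianchen Deng,Yue Pan,Shenghai Yuan,Dong Li,Chen Wang,Mingrui Li,Long Chen,Lihua Xie,Danwei Wang,Jingchuan Wang,Javier Civera,Hesheng Wang,Weidong Chen
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: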
Abstract:In this paper, we provide a comprehensive overview of existing scene representation methods for robotics, covering traditional representations such as point clouds, voxels, signed distance functions (SDF), and scene graphs, as well as more recent neural representations like Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and the emerging Foundation Models. While current SLAM and localization systems predominantly rely on sparse representations like point clouds and voxels, dense scene representations are expected to play a critical role in downstream tasks such as navigation and obstacle avoidance. Moreover, neural representations such as NeRF, 3DGS, and foundation models are well-suited for integrating high-level semantic features and language-based priors, enabling more comprehensive 3D scene understanding and embodied intelligence. In this paper, we categorized the core modules of robotics into five parts (Perception, Mapping, Localization, Navigation, Manipulation). We start by presenting the standard formulation of different scene representation methods and comparing the advantages and disadvantages of scene representation across different modules. This survey is centered around the question: What is the best 3D scene representation for robotics? We then discuss the future development trends of 3D scene representations, with a particular focus on how the 3D Foundation Model could replace current methods as the unified solution for future robotic applications. The remaining challenges in fully realizing this model are also explored. We aim to offer a valuable resource for both newcomers and experienced researchers to explore the future of 3D scene representations and their application in robotics. We have published an open-source project on GitHub and will continue to add new works and technologies to this project.
[CV-92] YOLOA: Real-Time Affordance Detection via LLM Adapter
[Quick read]: This paper addresses joint modeling of the "what-where-how" challenge in embodied AI: recognizing what an object is, locating where it is, and understanding how it can be used. Existing methods either learn affordances alone or treat object detection and affordance learning as independent tasks, with little interaction and no real-time capability. The proposed YOLO Affordance (YOLOA) is a real-time joint detection framework built from a lightweight detector and a large language model (LLM) Adapter. The LLM Adapter interacts with preliminary predictions to refine class priors, box offsets, and affordance gates, so that the object detection and affordance learning branches improve each other. Accuracy reaches 52.8 mAP on ADG-Det at up to 89.77 FPS.
Link: https://arxiv.org/abs/2512.03418
Authors: Yuqi Ji,Junjie Ke,Lihuo He,Jun Liu,Kaifan Zhang,Yu-Kun Lai,Guiguang Ding,Xinbo Gao
Affiliations: Xidian University; Tsinghua University; Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 13 pages, 9 figures, conference
Abstract:Affordance detection aims to jointly address the fundamental “what-where-how” challenge in embodied AI by understanding “what” an object is, “where” the object is located, and “how” it can be used. However, most affordance learning methods focus solely on “how” objects can be used while neglecting the “what” and “where” aspects. Other affordance detection methods treat object detection and affordance learning as two independent tasks, lacking effective interaction and real-time capability. To overcome these limitations, we introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles these two tasks via a large language model (LLM) adapter. Specifically, YOLOA employs a lightweight detector consisting of object detection and affordance learning branches refined through the LLM Adapter. During training, the LLM Adapter interacts with object and affordance preliminary predictions to refine both branches by generating more accurate class priors, box offsets, and affordance gates. Experiments on our relabeled ADG-Det and IIT-Heat benchmarks demonstrate that YOLOA achieves state-of-the-art accuracy (52.8 / 73.1 mAP on ADG-Det / IIT-Heat) while maintaining real-time performance (up to 89.77 FPS, and up to 846.24 FPS for the lightweight variant). This indicates that YOLOA achieves an excellent trade-off between accuracy and efficiency.
[CV-93] ViDiC: Video Difference Captioning
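[Quick read]: This paper addresses the weak comparative perception of compositional, spatial, and temporal differences between dynamic scenes in current vision-language systems; in particular, Image Difference Captioning (IDC) methods cannot capture motion continuity, event evolution, or editing consistency over time. It introduces the Video Difference Captioning (ViDiC) task and the ViDiC-1K dataset: 1,000 curated video pairs annotated with over 4,000 comparative checklist items in seven categories (subject, style, background, cinematography, motion, location, playback techniques). A dual-checklist framework based on the LLM-as-a-Judge protocol scores the accuracy of similarities and differences separately, giving Multimodal Large Language Models (MLLMs) a challenging benchmark for video understanding, edit awareness, and comparative reasoning.
Link: https://arxiv.org/abs/2512.03405
Authors: Jiangtao Wu,Shihao Li,Zhaozhou Bian,Yuanxing Zhang,Jialu Chen,Runzhe Wen,An Ping,Yiwen He,Jiakai Wang,Jiaheng Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: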
Abstract:Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes–a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.
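Scoring the two checklists separately can be sketched as below. The `ChecklistItem` structure and its boolean fields are hypothetical; the abstract does not specify the benchmark's data format, and the judge that produces `judged` is outside this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChecklistItem:
    kind: str        # "similarity" or "difference"
    expected: bool   # annotated answer for the video pair
    judged: bool     # what the judge concluded from the model's caption


def dual_checklist_accuracy(items: List[ChecklistItem]) -> Dict[str, float]:
    """Accuracy per checklist, so a strong similarity score cannot hide a weak difference score."""
    scores: Dict[str, float] = {}
    for kind in ("similarity", "difference"):
        subset = [it for it in items if it.kind == kind]
        if subset:
            scores[kind] = sum(it.expected == it.judged for it in subset) / len(subset)
    return scores
```

Reporting two numbers instead of one pooled accuracy matters when the checklists are unbalanced: a model that describes shared content well but misses most changes would otherwise look better than it is.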
[CV-94] MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
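[Quick read]: This paper addresses the modality gap in cross-modal ship re-identification between optical and synthetic aperture radar (SAR) imagery, which causes inconsistent feature distributions and hurts robustness. The proposed MOS framework has two components. Modality-Consistent Representation Learning (MCRL) applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. Cross-modal Data Generation and Feature Fusion (CDGF) uses a Brownian bridge diffusion model to synthesize cross-modal samples, whose features are fused with the original features at inference to improve alignment and discriminability.
Link: https://arxiv.org/abs/2512.03404
Authors: Yujian Zhao,Hankun Liu,Guanglin Niu
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: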
Abstract:Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to mitigate the optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature Fusion (CDGF) leverages a Brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.
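One simple form a class-wise modality alignment loss could take is the squared distance between the optical and SAR feature centroids of each ship identity, averaged over identities. This centroid form, the `"optical"` and `"sar"` tags, and the function names are assumptions for illustration; the abstract does not state the loss used in MOS.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[int, str, List[float]]  # (identity, modality, feature vector)


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classwise_alignment_loss(samples: List[Sample]) -> float:
    """Average squared distance between per-identity optical and SAR feature centroids."""
    groups: Dict[int, Dict[str, List[List[float]]]] = defaultdict(lambda: defaultdict(list))
    for identity, modality, feature in samples:
        groups[identity][modality].append(feature)
    losses = []
    for by_modality in groups.values():
        # Identities seen in only one modality cannot be aligned and are skipped.
        if "optical" in by_modality and "sar" in by_modality:
            c_opt = mean_vector(by_modality["optical"])
            c_sar = mean_vector(by_modality["sar"])
            losses.append(sum((a - b) ** 2 for a, b in zip(c_opt, c_sar)))
    return sum(losses) / len(losses) if losses else 0.0
```

Aligning per identity rather than over the whole batch pulls the two modalities together without collapsing different ships onto each other.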
[CV-95] ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding
[Quick read]: This paper addresses two limitations of existing Gaussian-based 3D scene understanding in open-vocabulary settings: closed-set semantic Gaussians rely on annotated 3D labels and neglect rendering ability, while purely 2D self-supervision degrades geometry and is limited to camera-only settings. It proposes two components. A Multi-Modal Gaussian Transformer lets Gaussians query features from diverse sensor modalities. A Shelf-Supervised Learning Paradigm uses off-the-shelf vision foundation models (VFMs) to optimize Gaussians jointly at the 2D image and 3D scene levels.
Link: https://arxiv.org/abs/2512.03370
Authors: Lingjun Zhao,Yandong Luo,James Hay,Lu Gan
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce ShelfGaussian, an open-vocabulary multi-modal Gaussian-based 3D scene understanding framework supervised by off-the-shelf vision foundation models (VFMs). Gaussian-based methods have demonstrated superior performance and computational efficiency across a wide range of scene understanding tasks. However, existing methods either model objects as closed-set semantic Gaussians supervised by annotated 3D labels, neglecting their rendering ability, or learn open-set Gaussian representations via purely 2D self-supervision, leading to degraded geometry and limited to camera-only settings. To fully exploit the potential of Gaussians, we propose a Multi-Modal Gaussian Transformer that enables Gaussians to query features from diverse sensor modalities, and a Shelf-Supervised Learning Paradigm that efficiently optimizes Gaussians with VFM features jointly at 2D image and 3D scene levels. We evaluate ShelfGaussian on various perception and planning tasks. Experiments on Occ3D-nuScenes demonstrate its state-of-the-art zero-shot semantic occupancy prediction performance. ShelfGaussian is further evaluated on an unmanned ground vehicle (UGV) to assess its in-the-wild performance across diverse urban scenarios. Project website: this https URL.
[CV-96] FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting
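[Quick read]: This paper addresses the limited spatio-temporal resolution of existing wildfire spread forecasting, which mostly relies on low-resolution satellite data at coarse scales and cannot capture fine local fire dynamics. It builds FireSentry, a provincial-scale, multi-modal wildfire dataset with sub-meter spatial and sub-second temporal resolution, and on top of it proposes FiReDiff, a dual-modality generative forecasting paradigm: first predict future video sequences in the infrared modality to capture how the fire evolves, then segment fire masks in the mask modality using the generated dynamics. The approach improves both video quality and mask accuracy.
Link: https://arxiv.org/abs/2512.03369
Authors: Nan Zhou,Huandong Wang,Jiahao Li,Han Li,Yali Song,Qiuhua Wang,Yong Li,Xinlei Chen
Affiliations: Shenzhen International Graduate School, Tsinghua University; Tsinghua University; College of Soil and Water Conservation, Southwest Forestry University; College of Civil Engineering, Southwest Forestry University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: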
Abstract:Fine-grained wildfire spread prediction is crucial for enhancing emergency response efficacy and decision-making precision. However, existing research predominantly focuses on coarse spatiotemporal scales and relies on low-resolution satellite data, capturing only macroscopic fire states while fundamentally constraining high-precision localized fire dynamics modeling capabilities. To bridge this gap, we present FireSentry, a provincial-scale multi-modal wildfire dataset characterized by sub-meter spatial and sub-second temporal resolution. Collected using synchronized UAV platforms, FireSentry provides visible and infrared video streams, in-situ environmental measurements, and manually validated fire masks. Building on FireSentry, we establish a comprehensive benchmark encompassing physics-based, data-driven, and generative models, revealing the limitations of existing mask-only approaches. Our analysis proposes FiReDiff, a novel dual-modality paradigm that first predicts future video sequences in the infrared modality, and then precisely segments fire masks in the mask modality based on the generated dynamics. FiReDiff achieves state-of-the-art performance, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE when applied to generative models. The FireSentry benchmark dataset and FiReDiff paradigm collectively advance fine-grained wildfire forecasting and dynamic disaster simulation. The processed benchmark dataset is publicly available at: this https URL.
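For reference, two of the mask metrics named in the abstract, IoU and F1, can be computed for binary fire masks as sketched below. The function name and the convention of scoring two empty masks as 1.0 are assumptions, not details from the benchmark.

```python
from typing import List, Tuple


def mask_iou_f1(pred: List[List[int]], target: List[List[int]]) -> Tuple[float, float]:
    """Return (IoU, F1) for two binary masks of the same shape."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # fire predicted and present
            elif p:
                fp += 1      # fire predicted but absent
            elif t:
                fn += 1      # fire present but missed
    union = tp + fp + fn
    iou = tp / union if union else 1.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return iou, f1
```

Both metrics ignore true negatives, which keeps the large non-burning background from inflating the score when the fire covers only a small part of the frame.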
[CV-97] A Hybrid Deep Learning Framework with Explainable AI for Lung Cancer Classification with DenseNet169 and SVM
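[Quick read]: This paper addresses the slow and error-prone manual reading of CT scans in early lung cancer diagnosis, aiming for better accuracy and interpretability. It builds a deep-learning classification system around DenseNet169 with Squeeze-and-Excitation blocks for attention-based feature extraction, Focal Loss for class imbalance, and a Feature Pyramid Network (FPN) for multi-scale feature fusion. In addition, a support vector machine (SVM) is trained on features extracted by MobileNetV2. Grad-CAM visualizes the decision regions and SHAP (Shapley Additive Explanations) explains feature contributions. Both models are reported to reach 98% accuracy.
Link: https://arxiv.org/abs/2512.03359
Authors: Md Rashidul Islam,Bakary Gibba,Altagi Abdallah Bakheit Abdelgadir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: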
Abstract:Lung cancer is a very deadly disease worldwide, and its early diagnosis is crucial for increasing patient survival rates. Computed tomography (CT) scans are widely used for lung cancer diagnosis as they can give detailed lung structures. However, manual interpretation is time-consuming and prone to human error. To surmount this challenge, the study proposes a deep learning-based automatic lung cancer classification system to enhance detection accuracy and interpretability. The study uses the IQOTHNCCD lung cancer dataset, a public CT scan dataset whose cases are categorized as Normal, Benign, or Malignant. The classifier is DenseNet169 with Squeeze-and-Excitation blocks for attention-based feature extraction, Focal Loss for handling class imbalance, and a Feature Pyramid Network (FPN) for multi-scale feature fusion. In addition, an SVM model was developed using MobileNetV2 for feature extraction, improving its classification performance. For model interpretability enhancement, the study integrated Grad-CAM for the visualization of decision-making regions in CT scans and SHAP (Shapley Additive Explanations) for explanation of feature contributions within the SVM model. Intensive evaluation was performed, and it was found that both DenseNet169 and SVM models achieved 98% accuracy, suggesting their robustness for real-world medical practice. These results open up the potential for deep learning to improve the diagnosis of lung cancer through higher accuracy, transparency, and robustness.
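Focal Loss down-weights examples the model already classifies confidently, using FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below is a single-sample multi-class version; the defaults gamma = 2 and alpha = 1 are common choices and are assumptions here, since the abstract does not report the study's settings.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_loss(logits: List[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one sample; gamma = 0 reduces it to alpha-weighted cross-entropy."""
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With three classes such as Normal, Benign, and Malignant, a rare class tends to produce low p_t values, so its samples keep most of their loss while the many easy samples of the majority class contribute little.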
[CV-98] SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
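[Quick read]: This paper addresses the limited performance of visual understanding, prediction, and generation that operate directly on 2D observations and ignore the 4D nature (3D space plus time) of real-world motion and change. It proposes SeeU, a 2D → 4D → 2D learning framework. SeeU first reconstructs a 4D scene representation from sparse, monocular 2D frames (2D → 4D), then learns continuous 4D dynamics under a low-rank representation and physical constraints (discrete 4D → continuous 4D), and finally rolls the world forward in time and re-projects it to 2D at sampled times and viewpoints, generating unseen regions with spatial-temporal context awareness (4D → 2D).
Link: https://arxiv.org/abs/2512.03350
Authors: Yu Yuan,Tharindu Wickremasinghe,Zeeshan Nadir,Xijun Wang,Yiheng Chi,Stanley H. Chan
Affiliations: Purdue University; Samsung Research America
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL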
Abstract:Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns the continuous 4D dynamics and generates the unseen visual contents. The principle behind SeeU is a new 2D → 4D → 2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D → 4D). It then learns the continuous 4D dynamics on a low-rank representation and physical constraints (discrete 4D → continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D → 2D). By modeling dynamics in 4D, SeeU achieves continuous and physically-consistent novel visual generation, demonstrating strong potential in multiple tasks including unseen temporal generation, unseen spatial generation, and video editing.
[CV-99] Hierarchical Attention for Sparse Volumetric Anomaly Detection in Subclinical Keratoconus
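[Quick read]: This paper addresses the detection of weak, spatially distributed anomalies in 3D medical imaging, specifically early identification of subclinical keratoconus (SKC). 2D/3D convolutional neural networks (CNNs) impose strong locality and lose non-adjacent signals, while Vision Transformers (ViTs) dilute features through diffuse global attention; neither captures sparse, multi-slice early pathology well. The paper shows that hierarchical attention, whose structured windowing matches the intermediate spatial extent of the abnormalities, yields 21-23% higher sensitivity and specificity while remaining parameter-efficient. Mechanistic analysis attributes the advantage to matching the spatial integration length to signal strength: subclinical cases require longer integration, which hierarchical attention accommodates.
Link: https://arxiv.org/abs/2512.03346
Authors: Lynn Kandakji,William Woof,Nikolas Pontikos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures, 6 tables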
Abstract:The detection of weak, spatially distributed anomalies in volumetric medical imaging remains a major challenge. The subtle, non-adjacent nature of early disease signals is often lost due to suboptimal architectural inductive biases: 2D/3D CNNs impose strong locality, while ViTs diffuse unconstrained global attention. This conflict leaves the optimal inductive structure for robust, sparse volumetric pattern recognition unresolved. This study presents a controlled comparison of sixteen modern deep learning architectures spanning 2D/3D convolutional, hybrid, and volumetric transformer families for subclinical keratoconus (SKC) detection from 3D anterior segment OCT volumes. We demonstrate that hierarchical attention models offer a superior and more parameter-efficient inductive bias, surpassing the performance of both 2D and 3D CNNs and ViTs. Our results show 21-23% higher sensitivity and specificity in the sparse anomaly (subclinical) regime. Mechanistic analyses reveal that this advantage stems from precise spatial scale alignment: hierarchical windowing produces effective receptive fields matched to the intermediate, multi-slice extent of subclinical abnormalities. This avoids excessive CNN locality and diffuse global attention. Attention-distance measurements confirm a key insight into architectural adaptation: the required spatial integration length shifts significantly based on the signal strength, with subclinical cases necessitating longer integration compared to both healthy and manifest disease states. Representational similarity and auxiliary age/sex prediction tasks further support the generalizability of these inductive principles. The findings provide design guidance for future volumetric anomaly detection systems, establishing hierarchical attention as a principled and effective approach for early pathological change analysis in 3D medical imaging.
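The attention-distance measurement mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the commonly used definition: the attention-weighted average distance between each query token and the keys it attends to. The function name, the 1D `positions` (for example slice indices), and the toy attention matrices are all hypothetical.

```python
def mean_attention_distance(attn, positions):
    """Attention-weighted mean distance between queries and keys.

    attn[i][j]: attention weight from query token i to key token j
                (each row is assumed to sum to 1).
    positions[i]: coordinate of token i, e.g. its slice index in the volume.
    """
    total = 0.0
    for i, row in enumerate(attn):
        total += sum(w * abs(positions[i] - positions[j]) for j, w in enumerate(row))
    return total / len(attn)

positions = [0, 1, 2, 3]
# Each token attends only to itself (extreme locality).
local_attn = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Each token attends equally to every token (fully diffuse attention).
uniform_attn = [[0.25] * 4 for _ in range(4)]

local_distance = mean_attention_distance(local_attn, positions)
uniform_distance = mean_attention_distance(uniform_attn, positions)
```

In this toy case the purely local pattern gives a distance of 0 and the uniform pattern gives 1.25; a windowed attention pattern would land between the two.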
[CV-100] HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration
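[Quick read]: This paper addresses "hallucinations" in generative image restoration, which are widespread but hard to evaluate: the model produces plausible structures that do not match the ground truth, which in safety-critical domains such as medical imaging can cause serious diagnostic errors. The key to the solution is HalluGen, a diffusion-based framework that synthesizes perceptually realistic but semantically incorrect hallucinations with controllable type, location, and severity (segmentation IoU drops from 0.86 to 0.36). With it the authors build the first large-scale annotated hallucination dataset (4,350 images), breaking the circular dependency in which evaluation needs labels and labels are costly. The work provides a quantifiable, generalizable benchmark for hallucination detection and mitigation in safety-critical image restoration.
Link: https://arxiv.org/abs/2512.03345
Authors: Seunghoi Kim, Henry F. J. Tregidgo, Chen Jin, Matteo Figini, Daniel C. Alexander
Affiliations: Hawkes Institute, UCL; Dept. of Medical Physics and Biomedical Engineering, UCL; Dept. of Computer Science, UCL; Centre for AI, DS&AI, AstraZeneca, UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: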
Abstract:Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors. Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective. We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36). Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation. We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures. Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.
[CV-101] ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fraction Estimation in Echocardiography MICCAI2025
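[Quick read]: This paper addresses two problems: conventional ejection fraction (EF) assessment relies on manual tracing, which is time-consuming and subject to inter-observer variability, and existing deep learning models are black boxes that earn little clinical trust. The key to the solution is ProtoEFNet, a video-based prototype learning model that learns dynamic spatiotemporal prototypes capturing clinically meaningful cardiac motion patterns. A Prototype Angular Separation (PAS) loss pushes the model toward discriminative representations across the continuous EF range, so that it keeps predictive accuracy while offering interpretable clinical insight.
Link: https://arxiv.org/abs/2512.03339
Authors: Yeganeh Ghamary, Victoria Wu, Hooman Vaseli, Christina Luong, Teresa Tsang, Siavash Bigdeli, Purang Abolmaesumi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, Accepted at the IMIMIC Workshop at MICCAI 2025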
Abstract:Ejection fraction (EF) is a crucial metric for assessing cardiac function and diagnosing conditions such as heart failure. Traditionally, EF estimation requires manual tracing and domain expertise, making the process time-consuming and subject to interobserver variability. Most current deep learning methods for EF prediction are black-box models with limited transparency, which reduces clinical trust. Some post-hoc explainability methods have been proposed to interpret the decision-making process after the prediction is made. However, these explanations do not guide the model’s internal reasoning and therefore offer limited reliability in clinical applications. To address this, we introduce ProtoEFNet, a novel video-based prototype learning model for continuous EF regression. The model learns dynamic spatiotemporal prototypes that capture clinically meaningful cardiac motion patterns. Additionally, the proposed Prototype Angular Separation (PAS) loss enforces discriminative representations across the continuous EF spectrum. Our experiments on the EchoNet-Dynamic dataset show that ProtoEFNet can achieve accuracy on par with its non-interpretable counterpart while providing clinically relevant insight. The ablation study shows that the proposed loss boosts performance with a 2% increase in F1 score from 77.67 ± 2.68 to 79.64 ± 2.10. Our source code is available at: this https URL
[CV-102] Step-by-step Layered Design Generation
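[Quick read]: This paper addresses the limitation that existing design generation methods treat design synthesis as a single-step generation problem, overlooking the iterative, step-by-step refinement that characterizes real design work. In particular, current models do not capture how a designer makes layered, atomic modifications over multiple rounds of interaction. The key to the solution is a new problem setting, Step-by-Step Layered Design Generation, together with the SLEDGE model, which treats each design update as an atomic, layered change relative to the previous state, grounded in the designer's instruction, and so follows the real design process more closely.
Link: https://arxiv.org/abs/2512.03335
Authors: Faizan Farooq Khan, K J Joseph, Koustava Goswami, Mohamed Elhoseiny, Balaji Vasan Srinivasan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: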
Abstract:Design generation, in its essence, is a step-by-step process where designers progressively refine and enhance their work through careful modifications. Despite this fundamental characteristic, existing approaches mainly treat design synthesis as a single-step generation problem, significantly underestimating the inherent complexity of the creative process. To bridge this gap, we propose a novel problem setting called Step-by-Step Layered Design Generation, which tasks a machine learning model with generating a design that adheres to a sequence of instructions from a designer. Leveraging recent advancements in multi-modal LLMs, we propose SLEDGE: Step-by-step LayEred Design GEnerator to model each update to a design as an atomic, layered change over its previous state, while being grounded in the instruction. To complement our new problem setting, we introduce a new evaluation suite, including a dataset and a benchmark. Our exhaustive experimental analysis and comparison with state-of-the-art approaches tailored to our new setup demonstrate the efficacy of our approach. We hope our work will attract attention to this pragmatic and under-explored research area.
[CV-103] NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction WACV2026
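[Quick read]: This paper addresses keeping the environment representation for autonomous driving up to date: how to use navigation-grade standard-definition (SD) maps as a coarse prior, combined with high-fidelity sensor data, to construct high-definition (HD) maps online. The core challenge is fusing an inconsistent, possibly outdated prior map with live perception into an accurate, current environment model. The key to the solution is NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on both sensor data and the navigation map. Discrepancies between the prior and online perception are treated as noise in the diffusion process: consistent regions are reinforced, while outdated or inconsistent regions are suppressed. On the nuScenes benchmark the method achieves a 21.4% relative improvement at a 100 m range while remaining real-time capable.
Link: https://arxiv.org/abs/2512.03317
Authors: Thomas Monninger, Zihan Zhang, Steffen Staab, Sihao Ding
Affiliations: Mercedes-Benz Research & Development North America; University of Stuttgart; University of California, San Diego; University of Southampton
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2026)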
Abstract:Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high-definition (HD) maps provide this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on-board sensor data. Navigation-grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as a coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement at a 100 m perception range, and even stronger improvements at larger perception ranges, while maintaining real-time capabilities. By fusing low-fidelity priors with high-fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, a step towards safer and more reliable autonomous driving. The code is available at this https URL
[CV-104] SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
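[Quick read]: This paper addresses the weak spatial reasoning of current vision-language models in large-scale 3D environments; such models are usually limited to room-scale scenes. The authors contribute the H²U3D dataset and the SpatialReasoner framework. H²U3D covers house-scale 3D scenes with multiple floors and rooms and more than 300 m² of area, and uses an automated annotation pipeline to produce hierarchical coarse-to-fine visual representations and question-answer pairs with chain-of-thought annotations. SpatialReasoner is an active perception framework that autonomously invokes spatial tools to explore a 3D scene based on a textual query. It is trained in two stages, a supervised cold start followed by reinforcement learning, with an adaptive exploration reward that encourages efficient exploration and discourages redundant operations. The key to the solution is this coarse-to-fine active exploration paradigm, which needs only 3-4 images on average and still outperforms strong baselines such as GPT-4o and Gemini-2.5-Pro.
Link: https://arxiv.org/abs/2512.03284
Authors: Hongpei Zheng, Shijie Li, Yanran Li, Hujun Yin
Affiliations: University of Manchester; Institute for Infocomm Research (I2R), A*STAR, Singapore; University of Bedfordshire
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: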
Abstract:Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H²U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H²U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m². Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H²U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
[CV-105] PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispectral Remote Sensing Imagery
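[Quick read]: This paper addresses real-time wildfire detection on airborne and spaceborne platforms: with limited onboard compute, the system must distinguish no-fire, active-fire, and post-fire conditions and estimate fire intensity (for example, fire radiative power, FRP). The key to the solution is PyroFocus, a two-stage pipeline that first performs multi-class fire classification and then runs FRP regression or segmentation. This reduces inference time and computational cost, gives a good trade-off between accuracy and latency, and offers a feasible path to edge deployment in future wildfire monitoring missions.
Link: https://arxiv.org/abs/2512.03257
Authors: Mark Moussa, Andre Williams, Seth Roffe, Douglas Morton
Affiliations: NASA Goddard Space Flight Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: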
Abstract:Rapid and accurate wildfire detection is crucial for emergency response and environmental management. In airborne and spaceborne missions, real-time algorithms must distinguish between no fire, active fire, and post-fire conditions, and estimate fire intensity. Multispectral and hyperspectral thermal imagers provide rich spectral information, but high data dimensionality and limited onboard resources make real-time processing challenging. As wildfires increase in frequency and severity, the need for low-latency and computationally efficient onboard detection methods is critical. We present a systematic evaluation of multiple deep learning architectures, including custom Convolutional Neural Networks (CNNs) and Transformer-based models, for multi-class fire classification. We also introduce PyroFocus, a two-stage pipeline that performs fire classification followed by fire radiative power (FRP) regression or segmentation to reduce inference time and computational cost for onboard deployment. Using data from NASA’s MODIS/ASTER Airborne Simulator (MASTER), which is similar to a next-generation fire detection sensor, we compare accuracy, inference latency, and resource efficiency. Experimental results show that the proposed two-stage pipeline achieves strong trade-offs between speed and accuracy, demonstrating significant potential for real-time edge deployment in future wildfire monitoring missions.
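The two-stage idea (classify first, then run the heavier FRP stage) can be sketched as below. This is only an illustration of why gating saves compute, assuming the second stage runs only on tiles labelled active fire; the abstract does not spell out the gating rule. `two_stage`, the label strings, and both toy models are hypothetical stand-ins, not the PyroFocus networks.

```python
def two_stage(tiles, classify, estimate_frp):
    """Run the costly FRP stage only on tiles the classifier flags as active fire."""
    results = []
    for tile in tiles:
        label = classify(tile)  # "no_fire", "active_fire", or "post_fire"
        frp = estimate_frp(tile) if label == "active_fire" else None
        results.append((label, frp))
    return results

# Toy stand-in for the classifier (hypothetical thresholds on mean brightness).
def toy_classify(tile):
    mean = sum(tile) / len(tile)
    if mean > 0.8:
        return "active_fire"
    return "post_fire" if mean < 0.2 else "no_fire"

# Toy stand-in for the FRP regressor; it records every tile it is asked to process.
frp_calls = []
def toy_frp(tile):
    frp_calls.append(tile)
    return 100.0 * max(tile)

out = two_stage([[0.9, 1.0], [0.5, 0.5], [0.1, 0.1]], toy_classify, toy_frp)
```

Of the three toy tiles only the first reaches the second stage, so the expensive model is called once instead of three times.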
[CV-106] PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement NEURIPS2025
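[Quick read]: This paper addresses the pixel-level inconsistencies that Latent Diffusion Models (LDMs) introduce in local image editing (inpainting, object removal, and object insertion) because of latent-space compression, such as color shifts, texture mismatches, and visible seams along editing boundaries. The key to the solution is the PixPerfect framework, which comprises (i) a differentiable discriminative pixel space that amplifies and suppresses subtle color and texture discrepancies; (ii) a comprehensive artifact simulation pipeline that exposes the refiner to realistic local editing artifacts during training; and (iii) a direct pixel-space refinement scheme that applies broadly across latent representations and tasks. Experiments show substantial gains in perceptual fidelity and downstream editing performance, setting a new standard for robust, high-fidelity local image editing.
Link: https://arxiv.org/abs/2512.03247
Authors: Haitian Zheng, Yuan Yao, Yongsheng Yu, Yuqian Zhou, Jiebo Luo, Zhe Lin
Affiliations: Adobe Research; University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)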
Abstract:Latent Diffusion Models (LDMs) have markedly advanced the quality of image inpainting and local editing. However, the inherent latent compression often introduces pixel-level inconsistencies, such as chromatic shifts, texture mismatches, and visible seams along editing boundaries. Existing remedies, including background-conditioned latent decoding and pixel-space harmonization, usually fail to fully eliminate these artifacts in practice and do not generalize well across different latent representations or tasks. We introduce PixPerfect, a pixel-level refinement framework that delivers seamless, high-fidelity local edits across diverse LDM architectures and tasks. PixPerfect leverages (i) a differentiable discriminative pixel space that amplifies and suppresses subtle color and texture discrepancies, (ii) a comprehensive artifact simulation pipeline that exposes the refiner to realistic local editing artifacts during training, and (iii) a direct pixel-space refinement scheme that ensures broad applicability across diverse latent representations and tasks. Extensive experiments on inpainting, object removal, and insertion benchmarks demonstrate that PixPerfect substantially enhances perceptual fidelity and downstream editing performance, establishing a new standard for robust and high-fidelity localized image editing.
[CV-107] 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
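[Quick read]: This paper addresses low-light image denoising, namely how to train a high-performing denoiser without a large set of paired clean and noisy images. Such pairs are hard to collect in practice, and conventional noise synthesis either relies on simplified parametric models or itself needs many paired images. The key to the solution is a noise synthesis method that needs only one noisy image and one dark frame per ISO setting: signal-dependent noise is modeled with a Poisson distribution, and a Fourier-domain spectral sampling algorithm models signal-independent noise, producing diverse noise realizations that keep the spatial structure and statistics of real sensor noise. The method achieves state-of-the-art performance on several low-light denoising benchmarks.
Link: https://arxiv.org/abs/2512.03245
Authors: Liying Lu, Raphaël Achddou, Sabine Süsstrunk
Affiliations: IVRL, EPFL; ESIEE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: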
Abstract:Raw images taken in low-light conditions are very noisy due to low photon count and sensor noise. Learning-based denoisers have the potential to reconstruct high-quality images. For training, however, these denoisers require large paired datasets of clean and noisy images, which are difficult to collect. Noise synthesis is an alternative to large-scale data acquisition: given a clean image, we can synthesize a realistic noisy counterpart. In this work, we propose a general and practical noise synthesis method that requires only a single noisy image and a single dark frame per ISO setting. We represent signal-dependent noise with a Poisson distribution and introduce a Fourier-domain spectral sampling algorithm to accurately model signal-independent noise. The latter generates diverse noise realizations that maintain the spatial and statistical properties of real sensor noise. As opposed to competing approaches, our method relies neither on simplified parametric models nor on large sets of clean-noisy image pairs. Our synthesis method is not only accurate and practical, it also leads to state-of-the-art performance on multiple low-light denoising benchmarks.
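The combination of Poisson shot noise and Fourier-domain sampling can be illustrated with a rough NumPy sketch. The abstract does not describe the sampling algorithm in detail, so this assumes one plausible reading: keep the Fourier magnitudes of a measured dark frame and draw fresh random phases. The function names, the `gain` parameter (digital numbers per electron), and the synthetic `dark` and `clean` arrays are hypothetical; this is not the authors' implementation.

```python
import numpy as np

def sample_signal_independent(dark_frame, rng):
    """Draw a new noise field with the same Fourier magnitudes as dark_frame."""
    mean = dark_frame.mean()
    magnitude = np.abs(np.fft.fft2(dark_frame - mean))
    # The FFT of real white noise supplies Hermitian-symmetric random phases,
    # so the inverse transform below is real up to rounding error.
    white = np.fft.fft2(rng.standard_normal(dark_frame.shape))
    phase = white / np.maximum(np.abs(white), 1e-12)
    return np.fft.ifft2(magnitude * phase).real + mean

def synthesize_noisy(clean, dark_frame, gain, rng):
    """clean: noise-free raw image in digital numbers; gain: assumed DN per electron."""
    # Signal-dependent part: Poisson photon counts scaled back to digital numbers.
    shot = gain * rng.poisson(np.clip(clean, 0, None) / gain)
    # Signal-independent part: a spectrally matched resampling of the dark frame.
    return shot + sample_signal_independent(dark_frame, rng)

rng = np.random.default_rng(0)
dark = rng.normal(2.0, 1.0, size=(32, 32))  # stand-in for a measured dark frame
clean = np.full((32, 32), 50.0)             # stand-in for a clean raw image
noisy = synthesize_noisy(clean, dark, gain=0.5, rng=rng)
synth = sample_signal_independent(dark, rng)
```

Because only the phases are resampled, every draw has the same power spectrum as the dark frame, which is what preserves spatially structured noise such as row or column banding.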
[CV-108] LLM-Guided Material Inference for 3D Point Clouds
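[Quick read]: This paper addresses the fact that existing 3D shape datasets and models focus only on geometry and ignore the material properties that determine how objects look. The key to the solution is a two-stage large language model (LLM) method that decouples object semantics from material inference: in the first stage the LLM predicts the object's semantic category zero-shot, and in the second stage it assigns plausible materials to each geometric segment, conditioned on the inferred semantics. The method needs no task-specific training and uses the LLM as a general-purpose prior to infer material composition for 3D point clouds even though reliable material annotations are unavailable.
Link: https://arxiv.org/abs/2512.03237
Authors: Nafiseh Izadyar, Teseo Schneider
Affiliations: University of Victoria
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: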
Abstract:Most existing 3D shape datasets and models focus solely on geometry, overlooking the material properties that determine how objects appear. We introduce a two-stage large language model (LLM) based method for inferring material composition directly from 3D point clouds with coarse segmentations. Our key insight is to decouple reasoning about what an object is from what it is made of. In the first stage, an LLM predicts the object’s semantics; in the second stage, it assigns plausible materials to each geometric segment, conditioned on the inferred semantics. Both stages operate in a zero-shot manner, without task-specific training. Because existing datasets lack reliable material annotations, we evaluate our method using an LLM-as-a-Judge implemented in DeepEval. Across 1,000 shapes from Fusion/ABS and ShapeNet, our method achieves high semantic and material plausibility. These results demonstrate that language models can serve as general-purpose priors for bridging geometric reasoning and material understanding in 3D data.
[CV-109] Object Counting with GPT-4o and GPT-5: A Comparative Study
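[Quick read]: This paper addresses zero-shot object counting: estimating the number of instances of novel categories that were not seen during training. Conventional methods usually need large amounts of annotated data and often visual exemplars to guide counting. The key idea here is to use the visual understanding of multi-modal LLMs, specifically GPT-4o and GPT-5, to count in a zero-shot manner from textual prompts alone, without supervision. The models are evaluated on the FSC-147 and CARPK datasets; on FSC-147 they perform comparably to, and in some cases better than, state-of-the-art zero-shot methods.
Link: https://arxiv.org/abs/2512.03233
Authors: Richard Füzesséry, Kaziwa Saleh, Sándor Szénási, Zoltán Vámossy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures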
Abstract:Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training. Existing methods typically require large amounts of annotated data and often require visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of utilizing them for counting tasks without any supervision. In this work we aim to leverage the visual capabilities of two multi-modal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to the state-of-the-art zero-shot approaches on FSC-147 and, in some cases, even surpass them.
[CV-110] Flux4D: Flow-based Unsupervised 4D Reconstruction NEURIPS2025
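[Quick read]: This paper addresses 4D reconstruction of large-scale dynamic scenes, that is, recovering complex environments with spatial and temporal variation from visual observations, which matters for robotics and autonomous driving. Methods such as NeRF and 3D Gaussian Splatting (3DGS) reconstruct images at high quality but scale poorly and need annotations to separate moving objects, while self-supervised methods are limited by per-scene optimization and sensitivity to hyperparameters. The proposed Flux4D framework uses only photometric losses with an "as static as possible" regularization and learns 3D Gaussians and their motion dynamics directly from raw data, without pre-trained models or priors, giving fully unsupervised 4D reconstruction. It is efficient (reconstruction within seconds), scales to large datasets, and generalizes to unseen scenes and unknown objects.
Link: https://arxiv.org/abs/2512.03210
Authors: Jingkang Wang, Henry Che, Yun Chen, Ze Yang, Lily Goli, Sivabalan Manivasagam, Raquel Urtasun
Affiliations: Waabi; University of Toronto; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: NeurIPS 2025. Project page: this https URL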
Abstract:Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an “as static as possible” regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.
[CV-111] Does Head Pose Correction Improve Biometric Facial Recognition?
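[Quick read]: This paper addresses the marked drop in accuracy of biometric facial recognition models on real-world images, which often suffer from poor quality, non-frontal poses, and occlusions. It investigates whether targeted, AI-driven image restoration can improve recognition accuracy. The study finds that naively applying any of three restoration approaches (3D reconstruction with NextFace, 2D frontalization with CFR-GAN, and feature enhancement with CodeFormer) substantially lowers recognition accuracy, whereas selectively applying CFR-GAN combined with CodeFormer yields meaningful improvements.
Link: https://arxiv.org/abs/2512.03199
Authors: Justin Norman, Hany Farid
Affiliations: University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: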
Abstract:Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.
[CV-112] Drainage: A Unifying Framework for Addressing Class Uncertainty
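[Quick read]: This paper addresses the performance loss in deep learning caused by label noise (such as instance-dependent and asymmetric noise), class ambiguity, and the difficulty of robustly rejecting out-of-distribution or corrupted samples. The core of the solution is a unified framework built on a "drainage node" added at the network output. The node reallocates probability mass toward uncertainty while keeping end-to-end training and differentiability, which gives highly ambiguous, anomalous, or noisy samples a natural escape route. The mechanism improves robustness in high-noise settings and matches or surpasses existing methods on several real-world datasets.
Link: https://arxiv.org/abs/2512.03182
Authors: Yasser Taha, Grégoire Montavon, Nils Körber
Affiliations: Centre for Artificial Intelligence in Public Health Research, Robert Koch Institute, 13353 Berlin, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany; Institute for AI in Medicine, Charité – Universitätsmedizin Berlin, 10115 Berlin, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 8 figures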
Abstract:Modern deep learning faces significant challenges with noisy labels, class ambiguity, as well as the need to robustly reject out-of-distribution or corrupted samples. In this work, we propose a unified framework based on the concept of a "drainage node" which we add at the output of the network. The node serves to reallocate probability mass toward uncertainty, while preserving desirable properties such as end-to-end training and differentiability. This mechanism provides a natural escape route for highly ambiguous, anomalous, or noisy samples, particularly relevant for instance-dependent and asymmetric label noise. In systematic experiments involving the addition of varying proportions of instance-dependent noise or asymmetric noise to CIFAR-10/100 labels, our drainage formulation achieves an accuracy increase of up to 9% over existing approaches in the high-noise regime. Our results on real-world datasets, such as mini-WebVision, mini-ImageNet and Clothing-1M, match or surpass existing state-of-the-art methods. Qualitative analysis reveals a denoising effect, where the drainage neuron consistently absorbs corrupt, mislabeled, or outlier data, leading to more stable decision boundaries. Furthermore, our drainage formulation enables applications well beyond classification, with immediate benefits for web-scale, semi-supervised dataset cleaning, and open-set applications.
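To make the "escape route" idea concrete, here is a small sketch of an output layer with one extra drainage logit. The abstract does not state the loss, so the `drainage_loss` below is purely a hypothetical illustration: it counts the drainage probability as partial credit (weight `alpha`) toward the given label. The function names, `alpha`, and the example logits are assumptions and should not be read as the paper's formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def drainage_loss(logits, label, alpha=0.5):
    """logits holds K class entries followed by one drainage entry (last position).

    Hypothetical loss: the drainage probability counts as partial credit
    (weight alpha) toward the labelled class, so a sample whose label looks
    wrong can lower its loss by routing mass to the drainage node.
    """
    probs = softmax(logits)
    return -math.log(probs[label] + alpha * probs[-1])

# Three classes plus one drainage logit in each example.
confident = drainage_loss([5.0, 0.0, 0.0, 0.0], label=0)
mislabeled_no_drain = drainage_loss([5.0, 0.0, 0.0, -10.0], label=1)
mislabeled_drained = drainage_loss([0.0, 0.0, 0.0, 5.0], label=1)
```

Under this toy loss a sample that disagrees with its label is penalized far less when the network sends its mass to the drainage node than when it commits to another class, which is the qualitative behavior the abstract describes.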
[CV-113] Multi-Agent Reinforcement Learning and Real-Time Decision-Making in Robotic Soccer for Virtual Environments
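[Quick read]: This paper addresses real-time decision-making, complex cooperation, and scalability for multi-agent systems in dynamic adversarial environments such as robotic soccer, in particular the multi-granularity of tasks (long-term strategy coupled with instant actions) and the curse of dimensionality from large-scale agent interactions. The key to the solution is a unified multi-agent reinforcement learning (MARL) framework. First, Proximal Policy Optimization (PPO) in a client-server architecture handles real-time action scheduling. Second, hierarchical reinforcement learning (HRL) based on the options framework splits the problem into high-level trajectory planning (modeled as a semi-Markov decision process) and low-level action execution, improving global strategy. Finally, mean-field theory reduces many-agent interactions to a single agent interacting with the population average, which improves scalability and training stability. In Webots simulation the final method reaches 5.93 average goals and an 89.1% ball control rate.
Link: https://arxiv.org/abs/2512.03166
Authors: Aya Taourirte, Md Sohag Mia
Affiliations: Nanjing University of Information Science and Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: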
Abstract:The deployment of multi-agent systems in dynamic, adversarial environments like robotic soccer necessitates real-time decision-making, sophisticated cooperation, and scalable algorithms to avoid the curse of dimensionality. While Reinforcement Learning (RL) offers a promising framework, existing methods often struggle with the multi-granularity of tasks (long-term strategy vs. instant actions) and the complexity of large-scale agent interactions. This paper presents a unified Multi-Agent Reinforcement Learning (MARL) framework that addresses these challenges. First, we establish a baseline using Proximal Policy Optimization (PPO) within a client-server architecture for real-time action scheduling, with PPO demonstrating superior performance (4.32 avg. goals, 82.9% ball control). Second, we introduce a Hierarchical RL (HRL) structure based on the options framework to decompose the problem into a high-level trajectory planning layer (modeled as a Semi-Markov Decision Process) and a low-level action execution layer, improving global strategy (avg. goals increased to 5.26). Finally, to ensure scalability, we integrate mean-field theory into the HRL framework, simplifying many-agent interactions into a single agent vs. the population average. Our mean-field actor-critic method achieves a significant performance boost (5.93 avg. goals, 89.1% ball control, 92.3% passing accuracy) and enhanced training stability. Extensive simulations of 4v4 matches in the Webots environment validate our approach, demonstrating its potential for robust, scalable, and cooperative behavior in complex multi-agent domains.
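The mean-field simplification (one agent versus the population average) can be shown with a short sketch. It assumes discrete actions and the usual mean-field construction in which each agent's critic sees its own action plus the average one-hot action of the other agents; the abstract does not give these details, and the function name and toy values are hypothetical.

```python
def mean_neighbor_action(actions, agent, n_actions):
    """Average one-hot action of every agent except `agent`."""
    others = [a for i, a in enumerate(actions) if i != agent]
    mean = [0.0] * n_actions
    for a in others:
        mean[a] += 1.0 / len(others)
    return mean

# Four agents choosing among three discrete actions (toy values).
joint_actions = [0, 1, 1, 2]
mean_for_agent0 = mean_neighbor_action(joint_actions, agent=0, n_actions=3)
```

The critic input for agent 0 then has a fixed size (its own action plus a length-3 vector) no matter how many teammates and opponents are on the field, which is where the scalability benefit comes from.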
[CV-114] Hierarchical Process Reward Models are Symbolic Vision Learners
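[Quick read]: This paper addresses the lack of interpretability of pixel-based vision models on geometric diagrams and proposes a symbolic computer vision approach that understands diagrams through explicit logical rules and structured representations. The core of the solution is a self-supervised symbolic auto-encoder that encodes a diagram into geometric primitives (points, lines, shapes) and their interrelationships, and decodes them through an executable engine to reconstruct the input. The key innovation is Symbolic Hierarchical Process Reward Modeling, which uses step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency, together with stabilization mechanisms that balance exploration and exploitation in the policy space. The result is a neuro-symbolic system that combines the reasoning capability of neural networks with the interpretability of symbolic models.
Link: https://arxiv.org/abs/2512.03126
Authors: Shan Zhang, Aotian Chen, Kai Zou, Jindong Gu, Yuan Xue, Anton van den Hengel
Affiliations: Adelaide AIML; Ohio State University; NetMind.ai; University of Oxford; Data61; CSIRO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: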
Abstract:Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives (points, lines, and shapes), whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships within the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is Symbolic Hierarchical Process Reward Modeling, which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Because vanilla reinforcement learning exhibits poor exploration in the policy space during diagram reconstruction, we introduce stabilization mechanisms to balance exploration and exploitation. We fine-tune our symbolic encoder on downstream tasks, developing a neuro-symbolic system that integrates the reasoning capabilities of neural networks with the interpretability of symbolic models through reasoning-grounded visual rewards. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: achieving a 98.2% reduction in MSE for geometric diagram reconstruction, surpassing GPT-4o by 0.6% with a 7B model on chart reconstruction, and improving by +13% on the MathGlance perception benchmark, and by +3% on MathVerse and GeoQA reasoning benchmarks.
[CV-115] Energy-Efficient Federated Learning via Adaptive Encoder Freezing for MRI-to-CT Conversion: A Green AI-Guided Research
[Quick Read]: This paper addresses the inequality between centres that arises when Federated Learning (FL) is applied in healthcare: its high computational demands exclude resource-limited institutions from collaborative training, widening existing healthcare disparities. The key to the solution is a Green AI-oriented adaptive layer-freezing strategy that monitors the relative difference of the encoder weights from round to round and, through a patience-based mechanism, freezes the encoder only when updates remain consistently minimal, reducing computational load and energy consumption while maintaining model performance. On MRI-to-CT conversion, the method reduces training time, total energy consumption, and CO2eq emissions by up to 23%, with only small variations in MAE; two of the five evaluated architectures even show statistically significant improvements.
Link: https://arxiv.org/abs/2512.03054
Authors: Ciro Benito Raggio, Lucia Migliorelli, Nils Skupien, Mathias Krohmer Zabaleta, Oliver Blanck, Francesco Cicone, Giuseppe Lucio Cascini, Paolo Zaffino, Maria Francesca Spadea
Affiliations: Karlsruhe Institute of Technology; Università Degli Studi Di Teramo; University Medical Center Schleswig-Holstein; Magna Graecia University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Medical Physics (physics.med-ph)
Comments: 22 pages, 13 figures
Abstract:Federated Learning (FL) holds the potential to advance equality in health by enabling diverse institutions to collaboratively train deep learning (DL) models, even with limited data. However, the significant resource requirements of FL often exclude centres with limited computational infrastructure, further widening existing healthcare disparities. To address this issue, we propose a Green AI-oriented adaptive layer-freezing strategy designed to reduce energy consumption and computational load while maintaining model performance. We tested our approach using different federated architectures for Magnetic Resonance Imaging (MRI)-to-Computed Tomography (CT) conversion. The proposed adaptive strategy optimises the federated training by selectively freezing the encoder weights based on the monitored relative difference of the encoder weights from round to round. A patience-based mechanism ensures that freezing only occurs when updates remain consistently minimal. The energy consumption and CO2eq emissions of the federation were tracked using the CodeCarbon library. Compared to equivalent non-frozen counterparts, our approach reduced training time, total energy consumption and CO2eq emissions by up to 23%. At the same time, the MRI-to-CT conversion performance was maintained, with only small variations in the Mean Absolute Error (MAE). Notably, for three out of the five evaluated architectures, no statistically significant differences were observed, while two architectures exhibited statistically significant improvements. Our work aligns with a research paradigm that promotes DL-based frameworks meeting clinical requirements while ensuring climatic, social, and economic sustainability. It lays the groundwork for novel FL evaluation frameworks, advancing privacy, equity and, more broadly, justice in AI-driven healthcare.
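The patience-based freezing rule described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `EncoderFreezer`, the L2-norm form of the relative difference, and the default `threshold` and `patience` values are hypothetical and are not taken from the paper.

```python
import numpy as np

def relative_change(prev, curr, eps=1e-12):
    """Relative L2 difference between two flattened weight vectors (assumed metric)."""
    return float(np.linalg.norm(curr - prev) / (np.linalg.norm(prev) + eps))

class EncoderFreezer:
    """Freeze the encoder once its round-to-round relative change stays
    below `threshold` for `patience` consecutive rounds (hypothetical defaults)."""

    def __init__(self, threshold=1e-3, patience=3):
        self.threshold = threshold
        self.patience = patience
        self.calm_rounds = 0   # consecutive rounds with a small update
        self.frozen = False

    def update(self, prev_weights, curr_weights):
        """Call once per federated round; returns True once the encoder is frozen."""
        if self.frozen:
            return True
        if relative_change(prev_weights, curr_weights) < self.threshold:
            self.calm_rounds += 1
        else:
            self.calm_rounds = 0   # a large update resets the patience counter
        self.frozen = self.calm_rounds >= self.patience
        return self.frozen
```

A server loop would call `update` after each aggregation round and stop sending encoder gradients once it returns True.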
[CV-116] LATTICE: Democratize High-Fidelity 3D Generation at Scale
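[Quick Read]: This paper addresses the quality and scalability gap between 3D and 2D generative models, that is, how to generate 3D assets efficiently and flexibly while retaining high fidelity. The core difficulty is that 3D generation must predict both spatial structure and detailed geometric surfaces from scratch, while existing 3D representations are computationally expensive and lack structured, scalable encoding schemes. The key to the solution is VoxSet, a semi-structured representation that compresses a 3D asset into a compact set of latent vectors anchored to a coarse voxel grid. It retains the simplicity and compression advantages of VecSet-style methods while introducing explicit structure into the latent space, so that positional embeddings can guide generation and strong token-level test-time scaling becomes possible. On top of this representation, LATTICE uses a two-stage pipeline: it first generates a sparse voxelized geometry anchor, then produces detailed geometry with a rectified flow transformer. This supports arbitrary-resolution decoding, low-cost training, and flexible inference, and the authors report state-of-the-art performance on various aspects.
Link: https://arxiv.org/abs/2512.03052
Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Qingxiang Lin, Jingwei Huang, Chunchao Guo, Xiangyu Yue
Affiliations: MMLab, CUHK; Tencent Hunyuan
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report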
Abstract:We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, allowing positional embeddings to guide generation and enabling strong token-level test-time scaling. Built upon this representation, LATTICE adopts a two-stage pipeline: first generating a sparse voxelized geometry anchor, then producing detailed geometry using a rectified flow transformer. Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes, achieving state-of-the-art performance on various aspects, and offering a significant step toward scalable, high-quality 3D asset creation.
[CV-117] Deep-BrownConrady: Prediction of Camera Calibration and Distortion Parameters Using Deep Learning and Synthetic Data
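[Quick Read]: This paper addresses the challenge of predicting camera intrinsics and lens distortion parameters from a single image. Traditional calibration requires multiple images of a calibration object from various orientations, which are often unavailable in practice. The solution has two key parts: first, a comprehensive synthetic dataset built with the AILiveSim simulation platform, covering variations in focal length and lens distortion parameters; second, a ResNet-based regression model trained predominantly on synthetic images, complemented by a small subset of real images, to predict the parameters of the Brown-Conrady lens model. This enables camera calibration without multi-image capture in applications such as autonomous driving, robotics, and augmented reality.
Link: https://arxiv.org/abs/2501.14510
Authors: Faiz Muhammad Chaudhry, Jarno Ralli, Jerome Leudet, Fahad Sohrab, Farhad Pakdaman, Pierre Corbani, Moncef Gabbouj
Affiliations: AILiveSim Ltd.; Tampere University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: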
Abstract:This research addresses the challenge of camera calibration and distortion parameter prediction from a single image using deep learning models. The main contributions of this work are: (1) demonstrating that a deep learning model, trained on a mix of real and synthetic images, can accurately predict camera and lens parameters from a single image, and (2) developing a comprehensive synthetic dataset using the AILiveSim simulation platform. This dataset includes variations in focal length and lens distortion parameters, providing a robust foundation for model training and testing. The training process predominantly relied on these synthetic images, complemented by a small subset of real images, to explore how well models trained on synthetic data can perform calibration tasks on real-world images. Traditional calibration methods require multiple images of a calibration object from various orientations, which is often not feasible due to the lack of such images in publicly available datasets. A deep learning network based on the ResNet architecture was trained on this synthetic dataset to predict camera calibration parameters following the Brown-Conrady lens model. The ResNet architecture, adapted for regression tasks, is capable of predicting continuous values essential for accurate camera calibration in applications such as autonomous driving, robotics, and augmented reality. Keywords: Camera calibration, distortion, synthetic data, deep learning, residual networks (ResNet), AILiveSim, horizontal field-of-view, principal point, Brown-Conrady Model.
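For readers unfamiliar with the Brown-Conrady lens model that the network regresses, the sketch below applies its radial and tangential terms to a normalized image point. It assumes the widely used coefficient convention (radial `k1`, `k2`, `k3` and tangential `p1`, `p2`); the exact parameterization used in the paper is not stated here, and the function name is illustrative.

```python
def brown_conrady_distort(x, y, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Map an undistorted normalized point (x, y) to its distorted position.

    Assumes the common convention: radial coefficients k1..k3 and
    tangential coefficients p1, p2.
    """
    r2 = x * x + y * y                                   # squared radius
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3  # radial scaling
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d
```

With all coefficients at zero the mapping is the identity, so the predicted coefficients directly express how far a lens departs from the ideal pinhole model.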
[CV-118] Tada-DIP: Input-adaptive Deep Image Prior for One-shot 3D Image Reconstruction
[Quick Read]: This paper addresses the limited application of Deep Image Prior (DIP) to 3D image reconstruction, where training data are scarce and DIP is prone to overfitting. The key to the solution is Tada-DIP, a fully 3D DIP method that combines input-adaptation with denoising regularization to produce high-quality 3D reconstructions without training data while avoiding overfitting. On sparse-view X-ray computed tomography, it achieves reconstruction performance on par with a supervised network trained on a large dataset.
Link: https://arxiv.org/abs/2512.03962
Authors: Evan Bell, Shijun Liang, Ismail Alkhouri, Saiprasad Ravishankar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 8 figures, 2025 Asilomar Conference on Signals, Systems, and Computers. Code is available at this http URL
Abstract:Deep Image Prior (DIP) has recently emerged as a promising one-shot neural-network based image reconstruction method. However, DIP has seen limited application to 3D image reconstruction problems. In this work, we introduce Tada-DIP, a highly effective and fully 3D DIP method for solving 3D inverse problems. By combining input-adaptation and denoising regularization, Tada-DIP produces high-quality 3D reconstructions while avoiding the overfitting phenomenon that is common in DIP. Experiments on sparse-view X-ray computed tomography reconstruction validate the effectiveness of the proposed method, demonstrating that Tada-DIP produces much better reconstructions than training-data-free baselines and achieves reconstruction performance on par with a supervised network trained using a large dataset with fully-sampled volumes.
[CV-119] Kaleidoscopic Scintillation Event Imaging
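[Quick Read]: This paper addresses the difficulty of imaging individual scintillation events at high resolution when very few photons are available. Existing methods rely on fast single-pixel detectors, which time events precisely but lack spatial resolution, whereas conventional cameras provide spatial resolution but capture only an average over many events and cannot resolve the event associated with an individual particle. The authors propose a kaleidoscopic scintillator whose mirror geometry creates reflections of the event at known locations, increasing light collection in a single-photon camera while preserving the event's spatial information. Combined with an imaging theory and an algorithm for estimating the event's 3D position, the design enables high-resolution event measurements with a commercial CMOS single-photon camera.
Link: https://arxiv.org/abs/2512.03216
Authors: Alex Bocchieri, John Mamish, David Appleyard, Andreas Velten
Affiliations: University of Wisconsin - Madison; Georgia Institute of Technology; Ubicept
Categories: Instrumentation and Detectors (physics.ins-det); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: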
Abstract:Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state of the art methods of measuring high-energy particles and radiation sources. Most existing methods use fast single-pixel detectors to detect and time scintillation events. Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle. Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events. This allows us to use machine vision techniques to analyze events, enabling new types of detectors. The main challenge is the very low brightness of the events. Techniques have to work with a very limited number of photons. We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event’s spatial information. The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera. We introduce theory for imaging an event in a kaleidoscopic scintillator and an algorithm to estimate the event’s 3D position. We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera. Code and data are available at this https URL.
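The statement that mirror reflections appear "in known locations for a given event location" follows from elementary geometry: the virtual image of a point in a planar mirror is its reflection across the mirror plane. The sketch below illustrates only that step; the mirror layout and function names are assumptions for illustration, not the authors' scintillator geometry or position-estimation algorithm.

```python
import numpy as np

def reflect_point(p, plane_point, plane_normal):
    """Reflect 3D point `p` across the plane through `plane_point` with normal `plane_normal`."""
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)                       # unit normal
    p = np.asarray(p, dtype=float)
    d = np.dot(p - np.asarray(plane_point, dtype=float), n)  # signed distance to plane
    return p - 2.0 * d * n

def first_order_images(p, mirrors):
    """Virtual images of `p` after one reflection in each mirror.

    `mirrors` is a list of (plane_point, plane_normal) pairs (hypothetical layout).
    """
    return [reflect_point(p, q, n) for q, n in mirrors]
```

Higher-order images can be obtained by reflecting the first-order images again, which is what gives a kaleidoscope its repeated pattern.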
[CV-120] PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer AAAI2026
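[Quick Read]: This paper addresses two key challenges of single-cell RNA sequencing (scRNA-seq) in pan-cancer research: learning discriminative and efficient single-cell representations, and establishing a comprehensive evaluation benchmark. The core of the solution is PanFoMa, a lightweight hybrid neural network that combines the strengths of Transformers and state-space models (SSMs) to balance performance and efficiency. It consists of a front-end local-context encoder with shared self-attention layers and a back-end global sequential feature decoder built on a linear-time SSM, capturing both complex, order-independent gene interactions and global regulatory signals. The authors also construct PanFoMaBench, a pan-cancer single-cell benchmark with over 3.5 million high-quality cells across 33 cancer subtypes; experiments show that PanFoMa outperforms state-of-the-art models on multiple tasks.
Link: https://arxiv.org/abs/2512.03111
Authors: Xiaoshui Huang, Tianlin Zhu, Yifan Zuo, Xue Xia, Zonghan Wu, Jiebin Yan, Dingli Hua, Zongyi Xu, Yuming Fang, Jian Zhang
Affiliations: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026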
Abstract:Single-cell RNA sequencing (scRNA-seq) is essential for decoding tumor heterogeneity. However, pan-cancer research still faces two key challenges: learning discriminative and efficient single-cell representations, and establishing a comprehensive evaluation benchmark. In this paper, we introduce PanFoMa, a lightweight hybrid neural network that combines the strengths of Transformers and state-space models to achieve a balance between performance and efficiency. PanFoMa consists of a front-end local-context encoder with shared self-attention layers to capture complex, order-independent gene interactions; and a back-end global sequential feature decoder that efficiently integrates global context using a linear-time state-space model. This modular design preserves the expressive power of Transformers while leveraging the scalability of Mamba to enable transcriptome modeling, effectively capturing both local and global regulatory signals. To enable robust evaluation, we also construct a large-scale pan-cancer single-cell benchmark, PanFoMaBench, containing over 3.5 million high-quality cells across 33 cancer subtypes, curated through a rigorous preprocessing pipeline. Experimental results show that PanFoMa outperforms state-of-the-art models on our pan-cancer benchmark (+4.0%) and across multiple public tasks, including cell type annotation (+7.4%), batch integration (+4.0%) and multi-omics integration (+3.1%). The code is available at this https URL.
Artificial Intelligence
[AI-0] Fare Comparison App of Uber Ola and Rapido
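[Quick Read]: This paper addresses the difficulty users face in choosing the most cost-effective and fastest ride among ride-hailing services such as Ola, Uber, and Rapido. The challenges discussed include accessing data through APIs, the Android Studio emulator, Appium, and location comparison. The solution is a web application with a Python backend that fetches fare data, compares fares across the three services for the destination entered by the user, and recommends the best option, with the aim of improving user experience and the transparency of ride-hailing services.
Link: https://arxiv.org/abs/2512.04065
Authors: Ashlesha Gopinath Sawant, Sahil S. Jadhav, Vidhan R. Jain, Shriraj S. Jagtap, Prachi Jadhav, Soham Jadhav, Ichha Raina
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages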
Abstract:In today's increasing world, it is very important to have good hailing services like Ola, Uber, and Rapido as it is very essential for our daily transportation. Users often face difficulties in choosing the most appropriate and efficient ride that would be both cost-effective and take them to their destination in less time. This project provides a web application that helps users select the most beneficial ride by providing a fare comparison between Ola, Uber, and Rapido for the destination entered by the user. The backend is used to fetch the data, providing users with the fare comparison for the ride and finally providing the best option using Python. This research paper also addresses the problems and challenges faced in accessing the data using APIs, Android Studios emulator, Appium and location comparison. Thus, the aim of the project is to provide transparency to the users in ride-hailing services and increase efficiency and provide users with better experience.
[AI-1] MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking
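[Quick Read]: This paper addresses watermarking for open-weight language models: how to embed a watermark that is detectable without degrading generation quality when the model weights are public. Existing methods such as GaussMark rely on small modifications to the model weights, but reaching detection power comparable to inference-time watermarks generally requires perturbations that noticeably reduce text quality. The key to the solution is MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while regularizing against degradation in text quality, making finer-grained, watermark-aware weight updates within the model's representation space. Experiments show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and generalizes across datasets.
Link: https://arxiv.org/abs/2512.04044
Authors: Yizhou Zhao, Zhiwei Steven Wu, Adam Block
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: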
Abstract:Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermarking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model’s representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
[AI-2] Sponsored Questions and How to Auction Them
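[Quick Read]: This paper addresses how an online platform should allocate sponsored suggestion slots, that is, clarifying follow-up prompts selected for their advertising potential, when a user's query is ambiguous, and how such a mechanism interacts with the traditional ad auction that may follow. Instead of passively predicting relevance or offering generic query refinements, a Large Language Model (LLM) can proactively offer several clarifying prompts, some of which may be sponsored; this calls for an allocation mechanism that is both efficient and truthful. The paper shows that the VCG mechanism can be adopted to jointly optimize the sponsored suggestions and the subsequent ads, achieving efficient and truthful outcomes. By contrast, decoupling the two into separate mechanisms is simpler to implement but suffers from strategic inefficiency: its Price of Anarchy is unbounded.
Link: https://arxiv.org/abs/2512.03975
Authors: Kshipra Bhawalkar, Alexandros Psomas, Di Wang
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: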
Abstract:Online platforms connect users with relevant products and services using ads. A key challenge is that a user’s search query often leaves their true intent ambiguous. Typically, platforms passively predict relevance based on available signals and in some cases offer query refinements. The shift from traditional search to conversational AI provides a new approach. When a user’s query is ambiguous, a Large Language Model (LLM) can proactively offer several clarifying follow-up prompts. In this paper we consider the following: what if some of these follow-up prompts can be "sponsored," i.e., selected for their advertising potential. How should these "suggestion slots" be allocated? And, how does this new mechanism interact with the traditional ad auction that might follow? This paper introduces a formal model for designing and analyzing these interactive platforms. We use this model to investigate a critical engineering choice: whether it is better to build an end-to-end pipeline that jointly optimizes the user interaction and the final ad auction, or to decouple them into separate mechanisms for the suggestion slots and another for the subsequent ad slot. We show that the VCG mechanism can be adopted to jointly optimize the sponsored suggestion and the ads that follow; while this mechanism is more complex, it achieves outcomes that are efficient and truthful. On the other hand, we prove that the simple-to-implement modular approach suffers from strategic inefficiency: its Price of Anarchy is unbounded.
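As background on the VCG mechanism the abstract relies on, the sketch below computes a VCG outcome for a generic position auction by brute force: each winner pays the welfare loss its presence imposes on the other bidders. The slot weights, values, and function names are illustrative assumptions; this is the textbook mechanism, not the paper's model of suggestion slots followed by an ad slot.

```python
from itertools import permutations

def best_welfare(values, slot_weights, excluded=None):
    """Brute-force the welfare-maximizing assignment of bidders to slots.

    values[i] is bidder i's (non-negative) value per unit of slot weight.
    Returns (welfare, assignment), where assignment[s] is the bidder in slot s.
    """
    bidders = [i for i in range(len(values)) if i != excluded]
    k = min(len(slot_weights), len(bidders))
    best, best_assign = 0.0, ()
    for assign in permutations(bidders, k):
        welfare = sum(values[b] * slot_weights[s] for s, b in enumerate(assign))
        if welfare > best:
            best, best_assign = welfare, assign
    return best, best_assign

def vcg(values, slot_weights):
    """Return (assignment, payments); each payment is the externality on the others."""
    welfare, assign = best_welfare(values, slot_weights)
    payments = {}
    for s, b in enumerate(assign):
        others_with_b = welfare - values[b] * slot_weights[s]
        others_without_b, _ = best_welfare(values, slot_weights, excluded=b)
        payments[b] = others_without_b - others_with_b
    return assign, payments
```

For example, `vcg([10, 6, 2], [1.0, 0.5])` places bidder 0 in the top slot and bidder 1 in the second, charging them 4.0 and 1.0 respectively. Because a bidder's payment does not depend on its own report, truthful bidding is a dominant strategy, which is the property the abstract refers to as "truthful".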
[AI-3] Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
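[Quick Read]: This paper addresses the failure of behavior regularization in offline reinforcement learning to distinguish high-value from low-value actions, which leads policies to imitate all state-action pairs in the dataset indiscriminately. The key to the solution is Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor under mutual guidance: the actor directs the flow policy, through weighted behavior cloning, to focus on cloning high-value actions from the dataset, while the flow policy constrains the actor to remain aligned with the dataset's best transitions as it maximizes the critic. GFP achieves state-of-the-art performance across 144 state-based and pixel-based tasks from OGBench, Minari, and D4RL, with substantial gains on suboptimal datasets and challenging tasks.
Link: https://arxiv.org/abs/2512.03973
Authors: Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: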
Abstract:Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset’s best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: this https URL
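To make "weighted behavior cloning" concrete, here is a minimal sketch of a cloning loss in which each dataset action is weighted by how valuable it is. The exponential advantage weighting, the `temperature` and `max_weight` parameters, and the squared-error form are assumptions borrowed from common advantage-weighted methods; the abstract does not specify GFP's actual weighting or its flow-matching loss.

```python
import numpy as np

def weighted_bc_loss(pred_actions, data_actions, advantages,
                     temperature=1.0, max_weight=100.0):
    """Weighted mean squared cloning error.

    Samples with higher advantage get exponentially larger weight
    (assumed scheme), clipped at `max_weight` for numerical stability.
    """
    weights = np.minimum(np.exp(advantages / temperature), max_weight)
    per_sample = np.sum((pred_actions - data_actions) ** 2, axis=1)
    return float(np.sum(weights * per_sample) / np.sum(weights))
```

With equal advantages this reduces to plain behavior cloning; as the advantage gap grows, low-value actions contribute almost nothing to the loss.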
[AI-4] Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol
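[Quick Read]: This paper addresses the gap between the need for flexible control strategies in industrial automation and the lack of standardized benchmarks for agents based on Large Language Models (LLMs). The key to the solution is an executable simulation environment for the Blocksworld problem with five complexity categories, combined with the Model Context Protocol (MCP) as a standardized tool interface, so that diverse agent architectures can be connected and compared without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability and establishes quantitative metrics for comparing LLM-based planning and execution approaches.
Link: https://arxiv.org/abs/2512.03955
Authors: Niklas Jobs, Luis Miguel Vieira da Silva, Jayanth Somashekaraiah, Maximilian Weigand, David Kube, Felix Gehlhoff
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: This work has been submitted to IFAC for possible publication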
Abstract:Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark’s applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.
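For readers who have not met Blocksworld, the sketch below shows the kind of state and move rule such a simulation has to enforce: only a clear (topmost) block may be moved, and only onto the table or another clear block. The state encoding and the `move` function are illustrative assumptions; they are not the benchmark's actual environment or its MCP tool interface.

```python
def move(stacks, block, dest):
    """Return the new state after moving `block` onto `dest`.

    `stacks` is a list of stacks, each listed bottom to top.
    `dest` is another block's name or the string "table".
    Raises ValueError if the move is illegal.
    """
    stacks = [list(s) for s in stacks]                  # copy; keep the input intact
    src = next((s for s in stacks if s and s[-1] == block), None)
    if src is None:
        raise ValueError(f"{block} is not clear or does not exist")
    if dest == "table":
        src.pop()
        stacks.append([block])
    else:
        if dest == block:
            raise ValueError("cannot place a block on itself")
        tgt = next((s for s in stacks if s and s[-1] == dest), None)
        if tgt is None:
            raise ValueError(f"{dest} is not clear or does not exist")
        src.pop()
        tgt.append(block)
    return [s for s in stacks if s]                     # drop emptied stacks
```

An agent under evaluation would issue such moves through a tool interface, and the environment would reject illegal ones instead of silently applying them.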
[AI-5] Autonomous Agents and Policy Compliance: A Framework for Reasoning About Penalties
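[Quick Read]: This paper addresses how policy-aware autonomous agents can reason about whether deviating from a policy is justified by a high-stakes goal, and how modeling non-compliant behavior can help policymakers by simulating realistic human decision-making. The key to the solution is an extension of Gelfond and Lobo's Authorization and Obligation Policy Language (AOPL) with penalties, combined with Answer Set Programming (ASP) for reasoning. The framework handles policy priorities explicitly, identifies rule violations and their consequences in an explainable way, and prioritizes plans with minimal repercussions. An automated translation from the extended AOPL into ASP and refined ASP-based planning algorithms that account for incurred penalties yield higher-quality plans that avoid harmful actions and, in some cases, also improve computational efficiency.
Link: https://arxiv.org/abs/2512.03931
Authors: Vineel Tummala, Daniela Inclezan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 27 pages, 5 figures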
Abstract:This paper presents a logic programming-based framework for policy-aware autonomous agents that can reason about potential penalties for non-compliance and act accordingly. While prior work has primarily focused on ensuring compliance, our approach considers scenarios where deviating from policies may be necessary to achieve high-stakes goals. Additionally, modeling non-compliant behavior can assist policymakers by simulating realistic human decision-making. Our framework extends Gelfond and Lobo’s Authorization and Obligation Policy Language (AOPL) to incorporate penalties and integrates Answer Set Programming (ASP) for reasoning. Compared to previous approaches, our method ensures well-formed policies, accounts for policy priorities, and enhances explainability by explicitly identifying rule violations and their consequences. Building on the work of Harders and Inclezan, we introduce penalty-based reasoning to distinguish between non-compliant plans, prioritizing those with minimal repercussions. To support this, we develop an automated translation from the extended AOPL into ASP and refine ASP-based planning algorithms to account for incurred penalties. Experiments in two domains demonstrate that our framework generates higher-quality plans that avoid harmful actions while, in some cases, also improving computational efficiency. These findings underscore its potential for enhancing autonomous decision-making and informing policy refinement. Under consideration in Theory and Practice of Logic Programming (TPLP).
[AI-6] Hierarchical Vision Language Action Model Using Success and Failure Demonstrations
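[Quick Read]: This paper addresses the limited robustness of Vision-Language-Action (VLA) models that are trained only on successful demonstrations and discard failed attempts. The core challenge is to exploit the failures that occur naturally during data collection to identify where policies are fragile. The key to the solution is VINE, a framework based on a hierarchical reinforcement learning formalism that separates high-level reasoning (System 2) from low-level control (System 1). System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction, predicts subgoal success probabilities from both successes and failures, and prunes brittle branches before execution; System 1 executes low-level actions without modifying the agent's core skills. This turns failure data into a structured learning signal rather than noise and improves success rates and robustness on challenging manipulation tasks.
Link: https://arxiv.org/abs/2512.03913
Authors: Jeongeun Park, Jihwan Yoon, Byungwoo Jeon, Juhan Park, Jinwoo Shin, Namhoon Cho, Kyungjae Lee, Sangdoo Yun, Sungjoon Choi
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: this https URL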
Abstract:Prior Vision-Language-Action (VLA) models are typically trained on teleoperated successful demonstrations, while discarding numerous failed attempts that occur naturally during data collection. However, these failures encode where and how policies can be fragile, information that can be exploited to improve robustness. We address this problem by leveraging mixed-quality datasets to learn failure-aware reasoning at planning time. We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism, making failures usable as a structured learning signal rather than noisy supervision. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction: it proposes subgoal transitions, predicts success probabilities from both successes and failures, and prunes brittle branches before execution, effectively casting plan evaluation as feasibility scoring. The selected subgoal sequence is then passed to System 1, which executes low-level actions without modifying the agent’s core skills. Trained entirely from offline teleoperation data, VINE integrates negative experience directly into the decision loop. Across challenging manipulation tasks, this approach consistently improves success rates and robustness, demonstrating that failure data is an essential resource for converting the broad competence of VLAs into robust execution.
[AI-7] Autonomous Reinforcement Learning Robot Control with Intel's Loihi 2 Neuromorphic Hardware
[Quick Read]: This paper addresses how to deploy Artificial Neural Networks (ANNs) trained with reinforcement learning (RL) on neuromorphic hardware for low-latency, energy-efficient inference in robot control. The key to the solution is an end-to-end pipeline that converts an ANN policy trained with ReLU activations into a spiking Sigma-Delta Neural Network (SDNN) compatible with Intel's Loihi 2 architecture. The approach is evaluated in simulation on closed-loop control of the Astrobee free-flying robot, and execution performance is compared between a GPU and Loihi 2.
Link: https://arxiv.org/abs/2512.03911
Authors: Kenneth Stewart, Roxana Leontie, Samantha Chapin, Joe Hays, Sumit Bam Shrestha, Carl Glen Henshaw
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted for review at NICE 2026 (Neuro-Inspired Computational Elements) conference
Abstract:We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel’s Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a previously hardware in space-validated controller. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel’s Loihi 2, then evaluated in NVIDIA’s Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee’s motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.
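The idea behind sigma-delta encoding can be illustrated with a toy example: the sender transmits only quantized changes in an activation (delta), and the receiver accumulates them to recover an approximation of the signal (sigma). The sketch below is a deliberately simplified scalar version with a hypothetical `threshold` parameter; it does not reproduce the neuron model or conversion pipeline used on Loihi 2.

```python
def delta_encode(signal, threshold):
    """Emit integer messages counting how many `threshold`-sized steps
    the input has moved since the last transmitted reference value."""
    reference = 0.0
    messages = []
    for x in signal:
        steps = int((x - reference) / threshold)   # truncates toward zero
        messages.append(steps)                     # 0 means "nothing to send"
        reference += steps * threshold
    return messages

def sigma_decode(messages, threshold):
    """Accumulate the received steps to reconstruct the signal."""
    total = 0.0
    reconstructed = []
    for m in messages:
        total += m * threshold
        reconstructed.append(total)
    return reconstructed
```

When the input changes slowly, most messages are zero, which is where the sparsity and the potential energy savings of this coding scheme come from.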
[AI-8] A Hierarchical Tree-based approach for creating Configurable and Static Deep Research Agent (Static-DRA)
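[Quick Read]: This paper addresses the limitations of static Retrieval Augmented Generation (RAG) pipelines on complex, multi-turn research tasks. The core of the solution is the Static Deep Research Agent (Static-DRA), built on a configurable, hierarchical tree-based static workflow. Two user-tunable parameters, Depth and Breadth, give granular control over research intensity, letting users explicitly trade off the quality and comprehensiveness of the research report against the computational cost of Large Language Model (LLM) interactions. The architecture, comprising Supervisor, Independent, and Worker agents, supports multi-hop information retrieval and parallel sub-topic investigation.
Link: https://arxiv.org/abs/2512.03887
Authors: Saurav Prateek
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: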
Abstract:The advancement in Large Language Models has driven the creation of complex agentic systems, such as Deep Research Agents (DRAs), to overcome the limitations of static Retrieval Augmented Generation (RAG) pipelines in handling complex, multi-turn research tasks. This paper introduces the Static Deep Research Agent (Static-DRA), a novel solution built upon a configurable and hierarchical Tree-based static workflow. The core contribution is the integration of two user-tunable parameters, Depth and Breadth, which provide granular control over the research intensity. This design allows end-users to consciously balance the desired quality and comprehensiveness of the research report against the associated computational cost of Large Language Model (LLM) interactions. The agent’s architecture, comprising Supervisor, Independent, and Worker agents, facilitates effective multi-hop information retrieval and parallel sub-topic investigation. We evaluate the Static-DRA against the established DeepResearch Bench using the RACE (Reference-based Adaptive Criteria-driven Evaluation) framework. Configured with a depth of 2 and a breadth of 5, and powered by the gemini-2.5-pro model, the agent achieved an overall score of 34.72. Our experiments validate that increasing the configured Depth and Breadth parameters results in a more in-depth research process and a correspondingly higher evaluation score. The Static-DRA offers a pragmatic and resource-aware solution, empowering users with transparent control over the deep research process. The entire source code, outputs and benchmark results are open-sourced at this https URL.
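To see why Depth and Breadth control cost, consider a full topic tree in which every topic is split into at most Breadth sub-topics, Depth levels deep. The sketch below builds such a tree with a stub splitter and counts its nodes as a rough proxy for the number of LLM interactions. The functions `build_tree`, `count_nodes`, and `stub_split` are hypothetical; the paper's Supervisor, Independent, and Worker agents may expand the tree differently.

```python
def build_tree(topic, depth, breadth, split):
    """Recursively expand `topic` into up to `breadth` sub-topics, `depth` levels deep.

    `split(topic, n)` stands in for an LLM call that proposes n sub-topics.
    """
    if depth == 0:
        return {"topic": topic, "children": []}
    subtopics = split(topic, breadth)[:breadth]
    children = [build_tree(t, depth - 1, breadth, split) for t in subtopics]
    return {"topic": topic, "children": children}

def count_nodes(tree):
    """Total number of topics in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in tree["children"])

def stub_split(topic, n):
    """Placeholder splitter that fabricates sub-topic labels for illustration."""
    return [f"{topic}.{i}" for i in range(1, n + 1)]
```

Under these assumptions, the reported configuration (depth 2, breadth 5) yields a tree of 1 + 5 + 25 = 31 topics, and the count grows geometrically with Depth, which is why the two parameters act as a direct cost dial.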
[AI-9] Hyperdimensional Computing for Sustainable Manufacturing: An Initial Assessment
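[Quick read]: This paper addresses the concern that the high energy consumption of AI models in smart manufacturing may offset their efficiency and energy-saving benefits. Using in-situ sensing-based prediction of geometric quality in machining as the task, it compares common AI models on energy consumption, accuracy, and speed. The key to the solution is HyperDimensional Computing (HDC), which matches the accuracy of conventional models while cutting energy consumption by 200× in training and 175× to 1000× in inference. It also reduces training time by 200× and inference time by 300× to 600×, indicating its potential for energy-efficient smart manufacturing.
Link: https://arxiv.org/abs/2512.03864
Authors: Danny Hoang,Anandkumar Patel,Ruimen Chen,Rajiv Malhotra,Farhad Imani
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Symbolic Computation (cs.SC)
Comments: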
Abstract:Smart manufacturing can significantly improve efficiency and reduce energy consumption, yet the energy demands of AI models may offset these gains. This study utilizes in-situ sensing-based prediction of geometric quality in smart machining to compare the energy consumption, accuracy, and speed of common AI models. HyperDimensional Computing (HDC) is introduced as an alternative, achieving accuracy comparable to conventional models while drastically reducing energy consumption, 200× for training and 175× to 1000× for inference. Furthermore, HDC reduces training times by 200× and inference times by 300× to 600×, showcasing its potential for energy-efficient smart manufacturing.
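For readers unfamiliar with HDC, the basic mechanics can be sketched as follows. This is a generic illustration, not the paper's pipeline: it assumes bipolar hypervectors, binding by elementwise product, bundling by summation, and a toy classification task with invented quantized sensor levels. All names (`random_hv`, `encode`, `train`, `predict`) and all data are hypothetical.

```python
import random

DIM = 2000        # hypervector dimensionality (illustrative choice)
N_FEATURES = 5    # number of sensor features (invented)
N_LEVELS = 4      # quantization levels per feature (invented)
rng = random.Random(0)


def random_hv():
    """A random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


id_hvs = [random_hv() for _ in range(N_FEATURES)]    # one ID vector per feature
level_hvs = [random_hv() for _ in range(N_LEVELS)]   # one vector per level


def encode(levels):
    """Bind each feature ID to its level vector, then bundle by summation."""
    total = [0] * DIM
    for feature, level in enumerate(levels):
        fid, lvl = id_hvs[feature], level_hvs[level]
        for d in range(DIM):
            total[d] += fid[d] * lvl[d]  # binding = elementwise product
    return total


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0


def train(samples):
    """Single pass: each class prototype is the sum of its encoded samples."""
    prototypes = {}
    for levels, label in samples:
        hv = encode(levels)
        proto = prototypes.setdefault(label, [0] * DIM)
        for d in range(DIM):
            proto[d] += hv[d]
    return prototypes


def predict(levels, prototypes):
    """Return the label whose prototype is most similar to the encoded query."""
    hv = encode(levels)
    return max(prototypes, key=lambda label: cosine(hv, prototypes[label]))


samples = [
    ([0, 0, 1, 0, 0], "ok"), ([0, 1, 0, 0, 0], "ok"),
    ([3, 3, 2, 3, 3], "defect"), ([3, 2, 3, 3, 3], "defect"),
]
prototypes = train(samples)
print(predict([0, 0, 0, 0, 0], prototypes), predict([3, 3, 3, 3, 3], prototypes))
```

Training here is a single pass of additions and inference is a handful of dot products, which gives some intuition for why HDC can be cheap; the energy figures in the abstract come from the authors' own measurements, not from this sketch.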
[AI-10] Scalable Decision Focused Learning via Online Trainable Surrogates
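[Quick read]: This paper addresses the suboptimal solutions that arise in decision support systems when conventionally trained estimators are used to predict uncertain parameters. It builds on Decision Focused Learning, which uses the actual decision cost as the loss function, and proposes an efficient surrogate based on unbiased estimators to replace costly loss evaluations, improving scalability at training time. The surrogate provides local confidence information, allowing a switch to a fallback method when uncertainty is high. It is designed for a black-box setting, so it can compensate for simplifications in the optimization model and account for recourse actions in the cost. The method reduces calls to the inner solver while keeping solution quality comparable to state-of-the-art techniques.
Link: https://arxiv.org/abs/2512.03861
Authors: Gaetano Signorelli,Michele Lombardi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: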
Abstract:Decision support systems often rely on solving complex optimization problems that may require estimating uncertain parameters beforehand. Recent studies have shown how using traditionally trained estimators for this task can lead to suboptimal solutions. Using the actual decision cost as a loss function (called Decision Focused Learning) can address this issue, but with a severe loss of scalability at training time. To address this issue, we propose an acceleration method based on replacing costly loss function evaluations with an efficient surrogate. Unlike previously defined surrogates, our approach relies on unbiased estimators, reducing the risk of spurious local optima, and can provide information on its local confidence, allowing one to switch to a fallback method when needed. Furthermore, the surrogate is designed for a black-box setting, which enables compensating for simplifications in the optimization model and accounting for recourse actions during cost computation. In our results, the method reduces costly inner solver calls, with a solution quality comparable to other state-of-the-art techniques.
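Why a conventionally trained estimator can lead to worse decisions is easy to see on a toy problem. The sketch below assumes the simplest possible "optimization problem" (pick the cheapest of three options) and uses invented cost values; `best_option`, `mse`, and `regret` are hypothetical helper names, and the paper's surrogate is not implemented here.

```python
def best_option(costs):
    """The toy optimization problem: index of the cheapest option."""
    return min(range(len(costs)), key=lambda i: costs[i])


def mse(predicted, true):
    """Conventional training loss: mean squared prediction error."""
    return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true)


def regret(predicted, true):
    """Decision cost: true cost of the decision taken from the predictions,
    minus the true cost of the best possible decision."""
    return true[best_option(predicted)] - true[best_option(true)]


true_costs = [10.0, 12.0, 15.0]
small_error_wrong_choice = [11.5, 11.0, 15.0]  # close to truth, but flips the argmin
large_error_right_choice = [20.0, 25.0, 30.0]  # far from truth, same argmin

for name, pred in [("small error", small_error_wrong_choice),
                   ("large error", large_error_right_choice)]:
    print(name, round(mse(pred, true_costs), 2), regret(pred, true_costs))
```

The first predictor has the lower prediction error yet incurs a regret of 2.0, while the second has a much larger prediction error and zero regret. Computing regret requires solving the optimization problem for every evaluation, which is the scalability cost the abstract refers to.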
[AI-11] DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training
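[Quick read]: This paper addresses the training instability and degraded generalization that reinforcement learning (RL) suffers in Large Language Model (LLM) post-training when real-world supervision is noisy or incomplete. Existing approaches, such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO), can improve stability but often overlook generalization and yield overly conservative policies, leading to uneven performance across diverse real scenarios. The key to the solution is DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), which combines conditional risk theory with distributional value modeling to better balance robustness and generalization. It learns token-level value distributions for fine-grained supervision and applies an asymmetric risk regularization that contracts the lower tail to dampen noise-induced negative deviations and expands the upper tail to preserve exploratory diversity. In multi-turn dialogue, math reasoning, and scientific QA, DVPO outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision.
Link: https://arxiv.org/abs/2512.03847
Authors: Dingwei Zhu,Zhiheng Xi,Shihan Dou,Yuhui Wang,Sixian Li,Junjie Ye,Honglin Guo,Shichun Liu,Chenhao Huang,Yajie Yang,Junlin Shang,Senjie Jin,Ming Zhang,Jiazheng Zhang,Caishuang Huang,Yunke Zhang,Demei Yan,Yuran Wang,Tao Gui
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: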
Abstract:Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations, while expanding the upper tail to preserve exploratory diversity. Across extensive experiments and analysis in multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in the real world.
[AI-12] MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving
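[Quick read]: This paper addresses the difficulty Autonomous Driving (AD) vehicles have in exhibiting human-like behavior in highly dynamic, interactive traffic, a challenge rooted in a limited understanding of how surrounding vehicles interact socially. The key to the solution is MPCFormer, an explainable, socially-aware driving approach that models multi-vehicle social interaction dynamics by coupling physics-informed and data-driven components. The interaction dynamics are written as a discrete state-space model that embeds physics priors for explainability, and the dynamics coefficients are learned from naturalistic driving data with a Transformer-based encoder-decoder. Combined with a Model Predictive Control (MPC) framework, the approach generates diverse, human-like behaviors while mitigating safety risks, and improves planning success rate and driving efficiency.
Link: https://arxiv.org/abs/2512.03795
Authors: Jia Hu,Zhexi Lian,Xuerun Yan,Ruiang Bi,Dou Shen,Yu Ruan,Haoran Wang
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 17 pages, 18 figures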
Abstract:Autonomous Driving (AD) vehicles still struggle to exhibit human-like behavior in highly dynamic and interactive traffic scenarios. The key challenge lies in AD’s limited ability to interact with surrounding vehicles, largely due to a lack of understanding of the underlying mechanisms of social interaction. To address this issue, we introduce MPCFormer, an explainable socially-aware autonomous driving approach with physics-informed and data-driven coupled social interaction dynamics. In this model, the dynamics are formulated into a discrete state-space representation, which embeds physics priors to enhance modeling explainability. The dynamics coefficients are learned from naturalistic driving data via a Transformer-based encoder-decoder architecture. To the best of our knowledge, MPCFormer is the first approach to explicitly model the dynamics of multi-vehicle social interactions. The learned social interaction dynamics enable the planner to generate manifold, human-like behaviors when interacting with surrounding traffic. By leveraging the MPC framework, the approach mitigates the potential safety risks typically associated with purely learning-based methods. Open-loop evaluation on the NGSIM dataset demonstrates that MPCFormer achieves superior social interaction awareness, yielding the lowest trajectory prediction errors compared with other state-of-the-art approaches. The prediction achieves an ADE as low as 0.86 m over a long prediction horizon of 5 seconds. Closed-loop experiments in highly intense interaction scenarios, where consecutive lane changes are required to exit an off-ramp, further validate the effectiveness of MPCFormer. Results show that MPCFormer achieves the highest planning success rate of 94.67%, improves driving efficiency by 15.75%, and reduces the collision rate from 21.25% to 0.5%, outperforming a frontier Reinforcement Learning (RL) based planner.
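The ADE figure quoted above is an average displacement error. A minimal sketch of the standard definition (mean Euclidean distance between predicted and ground-truth positions over the prediction horizon) follows; the function name and the sample trajectories are invented for illustration, and the paper's exact evaluation protocol (sampling rate, averaging across vehicles) is not reproduced here.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching (x, y) points of two trajectories."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)


# Invented two-step trajectories, coordinates in metres.
predicted = [(0.0, 0.0), (3.0, 4.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0)]
print(average_displacement_error(predicted, ground_truth))  # 2.5
```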
[AI-13] Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning
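[Quick read]: This paper addresses the rigid reasoning behavior of current Omni models in multimodal perception and generation: they overthink simple problems and fail to reason on complex ones. The key to the solution is the Omni-AutoThink framework, which adjusts reasoning depth dynamically in two stages. The first stage, Adaptive Supervised Fine-Tuning (Adaptive SFT), gives the model basic reasoning capability; the second, Adaptive Reinforcement Learning (Adaptive GRPO), optimizes reasoning behavior according to task complexity and reward feedback, so that the model responds appropriately to tasks of different difficulty.
Link: https://arxiv.org/abs/2512.03783
Authors: Dongchao Yang,Songxiang Liu,Disong Wang,Yuanyuan Wang,Guanglu Wan,Helen Meng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: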
Abstract:Recent advances in Omni models have enabled unified multimodal perception and generation. However, most existing systems still exhibit rigid reasoning behaviors, either overthinking simple problems or failing to reason when necessary. To address this limitation, we propose Omni-AutoThink, a novel adaptive reasoning framework that dynamically adjusts the model’s reasoning depth according to task difficulty. Our framework comprises two stages: (1) an Adaptive Supervised Fine-Tuning (Adaptive SFT) stage, which endows the Omni model with fundamental reasoning capability using large-scale reasoning-augmented data, and (2) an Adaptive Reinforcement Learning (Adaptive GRPO) stage, which optimizes reasoning behaviors based on task complexity and reward feedback. We further construct a comprehensive adaptive reasoning benchmark that spans text-only, text-audio, text-visual, and text-audio-visual modalities, providing both training and evaluation splits for multimodal reasoning assessment. Experimental results demonstrate that our proposed framework significantly improves adaptive reasoning performance compared to previous baselines. All benchmark data and code will be publicly released.
[AI-14] Bayesian Optimization for Automatic Tuning of Torque-Level Nonlinear Model Predictive Control
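[Quick read]: This paper addresses the inefficiency and limited performance of manually tuning controller parameters in torque-based Nonlinear Model Predictive Control (nMPC). The key to the solution is an auto-tuning framework built on a digital twin (DT): high-dimensional Bayesian Optimization (BO), specifically Sparse Axis-Aligned Subspace BO (SAASBO), searches in simulation for the best combination of cost function weights and low-level controller gains. This improves end-effector trajectory tracking accuracy and shortens solve times, and supports safe transfer from simulation to the real robot, where the gains are validated.
Link: https://arxiv.org/abs/2512.03772
Authors: Gabriele Fadini,Deepak Ingole,Tong Duy Son,Alisa Rupenyan
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 6 pages, 7 figures, 3 tables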
Abstract:This paper presents an auto-tuning framework for torque-based Nonlinear Model Predictive Control (nMPC), where the MPC serves as a real-time controller for optimal joint torque commands. The MPC parameters, including cost function weights and low-level controller gains, are optimized using high-dimensional Bayesian Optimization (BO) techniques, specifically Sparse Axis-Aligned Subspace (SAASBO) with a digital twin (DT), to achieve precise real-time end-effector trajectory tracking on a UR10e robot arm. The simulation model allows efficient exploration of the high-dimensional parameter space, and it ensures safe transfer to hardware. Our simulation results demonstrate significant improvements in tracking performance (+41.9%) and reduction in solve times (-2.5%) compared to manually-tuned parameters. Moreover, experimental validation on the real robot follows the trend (with a +25.8% improvement), emphasizing the importance of digital twin-enabled automated parameter optimization for robotic operations.
[AI-15] RoCo: Role-Based LLMs Collaboration for Automatic Heuristic Design
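[Quick read]: This paper addresses the limited diversity and quality of heuristics in Automatic Heuristic Design (AHD), particularly LLM-based AHD, where existing methods usually rely on a single role and struggle to balance innovation against refinement. The key to the solution is RoCo, a multi-agent, role-based collaboration system with four specialized LLM-guided agents: explorer, exploiter, critic, and integrator. The explorer pursues long-term potential through diversity-driven thinking, the exploiter pursues short-term improvements through conservative, efficiency-oriented refinements, the critic evaluates each evolution step and gives feedback, and the integrator synthesizes the proposals of the explorer and exploiter to drive overall progress. The agents interact in a structured multi-round process that involves feedback, refinement, and elite mutations guided by both short-term and accumulated long-term reflections, improving the performance and robustness of the generated heuristics.
Link: https://arxiv.org/abs/2512.03762
Authors: Jiawei Xu,Fengfeng Wei,Weineng Chen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: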
Abstract:Automatic Heuristic Design (AHD) has gained traction as a promising solution for solving combinatorial optimization problems (COPs). Large Language Models (LLMs) have emerged and become a promising approach to achieving AHD, but current LLM-based AHD research often only considers a single role. This paper proposes RoCo, a novel Multi-Agent Role-Based System, to enhance the diversity and quality of AHD through multi-role collaboration. RoCo coordinates four specialized LLM-guided agents (explorer, exploiter, critic, and integrator) to collaboratively generate high-quality heuristics. The explorer promotes long-term potential through creative, diversity-driven thinking, while the exploiter focuses on short-term improvements via conservative, efficiency-oriented refinements. The critic evaluates the effectiveness of each evolution step and provides targeted feedback and reflection. The integrator synthesizes proposals from the explorer and exploiter, balancing innovation and exploitation to drive overall progress. These agents interact in a structured multi-round process involving feedback, refinement, and elite mutations guided by both short-term and accumulated long-term reflections. We evaluate RoCo on five different COPs under both white-box and black-box settings. Experimental results demonstrate that RoCo achieves superior performance, consistently generating competitive heuristics that outperform existing methods including ReEvo and HSEvo, both in white-box and black-box scenarios. This role-based collaborative paradigm establishes a new standard for robust and high-performing AHD.
[AI-16] AI/ML in 3GPP 5G Advanced - Services and Architecture
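[Quick read]: This paper addresses two core questions about integrating Artificial Intelligence/Machine Learning (AI/ML) into the 5G Advanced system: how AI/ML can improve network performance ("AI for network"), for example through resource optimization for more efficient network management, and how the 5G system can be enhanced to support AI/ML applications ("Network for AI"), for example image recognition and similar high-bandwidth, low-latency workloads. The focus is the AI/ML-related features that 3GPP introduced in Release 19 within the Service and System Aspects (SA) Technical Specification Group, which mark the evolution of AI/ML from an auxiliary tool toward a core component of the 5G Advanced architecture.
Link: https://arxiv.org/abs/2512.03728
Authors: Pradnya Taksande,Shwetha Kiran,Pranav Jha,Prasanna Chaporkar
Affiliation: unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
Comments: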
Abstract:The 3rd Generation Partnership Project (3GPP), the standards body for mobile networks, is in the final phase of Release 19 standardization and is beginning Release 20. Artificial Intelligence/Machine Learning (AI/ML) has brought about a paradigm shift in technology and it is being adopted across industries and verticals. 3GPP has been integrating AI/ML into the 5G Advanced system since Release 18. This paper focuses on the AI/ML related technological advancements and features introduced in Release 19 within the Service and System Aspects (SA) Technical Specification Group of 3GPP. The advancements relate to two paradigms: (i) enhancements that AI/ML brought to the 5G Advanced system (AI for network), e.g. resource optimization, and (ii) enhancements that were made to the 5G system to support AI/ML applications (Network for AI), e.g. image recognition.
[AI-17] Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMs
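[Quick read]: This paper addresses security vulnerabilities in instruction handling that stem from the uniform token processing paradigm of Large Language Models (LLMs), in particular a new class of attack that exploits function-calling mechanisms, the Tool-Completion Attack (TCA). Such attacks can subvert model behavior, and even state-of-the-art models remain highly susceptible. The key to the solution is Context-Aware Hierarchical Learning (CAHL), which uses the contextual correlations between instruction segments to build a robust instruction hierarchy, dynamically balancing semantic comprehension with role-specific instruction constraints. CAHL improves robustness against both conventional attacks and TCA without harming performance on generic tasks, and generalizes well in zero-shot evaluations.
Link: https://arxiv.org/abs/2512.03720
Authors: Tengyun Ma,Jiaqi Yao,Daojing He,Shihao Peng,Yu Li,Shaohui Liu,Zhuotao Tian
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: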
Abstract:Large Language Models (LLMs) have emerged as powerful tools for diverse applications. However, their uniform token processing paradigm introduces critical vulnerabilities in instruction handling, particularly when exposed to adversarial scenarios. In this work, we identify and propose a novel class of vulnerabilities, termed Tool-Completion Attack (TCA), which exploits function-calling mechanisms to subvert model behavior. To evaluate LLM robustness against such threats, we introduce the Tool-Completion benchmark, a comprehensive security assessment framework, which reveals that even state-of-the-art models remain susceptible to TCA, with surprisingly high attack success rates. To address these vulnerabilities, we introduce Context-Aware Hierarchical Learning (CAHL), a sophisticated mechanism that dynamically balances semantic comprehension with role-specific instruction constraints. CAHL leverages the contextual correlations between different instruction segments to establish a robust, context-aware instruction hierarchy. Extensive experiments demonstrate that CAHL significantly enhances LLM robustness against both conventional attacks and the proposed TCA, exhibiting strong generalization capabilities in zero-shot evaluations while still preserving model performance on generic tasks. Our code is available at this https URL.
[AI-18] Over-the-Air Federated Learning: Rethinking Edge AI Through Signal Processing
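[Quick read]: This paper addresses the high communication overhead, latency, and energy consumption of conventional Federated Learning (FL) deployed at the wireless network edge. The key to the solution is Over-the-Air Federated Learning (AirFL), a paradigm that exploits the superposition property of wireless signals so that model parameters are transmitted and aggregated simultaneously, performing communication and aggregation jointly at the physical layer and reducing latency, bandwidth, and device energy use. The article classifies AirFL designs into three approaches (CSIT-aware, blind, and weighted) and surveys theoretical foundations, performance analysis, complexity considerations, and practical limitations to chart directions for future research.
Link: https://arxiv.org/abs/2512.03719
Authors: Seyed Mohammad Azimi-Abarghouyi,Carlo Fischione,Kaibin Huang
Affiliation: unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: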
Abstract:Over-the-Air Federated Learning (AirFL) is an emerging paradigm that tightly integrates wireless signal processing and distributed machine learning to enable scalable AI at the network edge. By leveraging the superposition property of wireless signals, AirFL performs communication and model aggregation of the learning process simultaneously, significantly reducing latency, bandwidth, and energy consumption. This article offers a tutorial treatment of AirFL, presenting a novel classification into three design approaches: CSIT-aware, blind, and weighted AirFL. We provide a comprehensive guide to theoretical foundations, performance analysis, complexity considerations, practical limitations, and prospective research directions.
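The superposition idea can be sketched in a few lines. This is a minimal simulation under a strong assumption: every device pre-compensates its channel so that all signals arrive with unit gain, and the only impairment is additive Gaussian noise at the receiver. The function name `over_the_air_average` and the numbers are hypothetical and do not correspond to any of the three design approaches classified in the article.

```python
import random


def over_the_air_average(updates, noise_std=0.0, seed=0):
    """Simulate one aggregation round on an idealized multiple-access channel.

    All devices transmit at the same time; the channel adds their signals,
    so the server receives the sum directly instead of one update per device.
    """
    rng = random.Random(seed)
    num_devices = len(updates)
    dim = len(updates[0])
    received = []
    for d in range(dim):
        superposed = sum(update[d] for update in updates)         # summed "in the air"
        received.append(superposed + rng.gauss(0.0, noise_std))   # receiver noise
    # The server only rescales the received sum to obtain the average update.
    return [value / num_devices for value in received]


device_updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(over_the_air_average(device_updates))                 # noiseless: exact mean
print(over_the_air_average(device_updates, noise_std=0.1))  # noisy estimate of the mean
```

With orthogonal transmissions the airtime grows with the number of devices; here a single simultaneous transmission suffices, which is the source of the latency and bandwidth savings described in the abstract.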
[AI-19] Matrix Editing Meets Fair Clustering: Parameterized Algorithms and Complexity
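[Quick read]: This paper studies fair means clustering of discrete vectors, which is equivalent to editing a colored matrix, by changing at most k entries, into one with few distinct color-balanced rows. The problem is NP-hard both with and without the fairness constraint, but the fairness-oblivious version is known to admit a fixed-parameter algorithm. The key contributions are, first, ruling out an analogous fixed-parameter algorithm even for highly restricted fair means clustering instances, and then mapping the full complexity landscape of the problem and giving tractability results that circumvent this lower bound in three ways: additional constraints on the instances, fixed-parameter approximation, or an alternative parameterization targeting tree-like matrices.
Link: https://arxiv.org/abs/2512.03718
Authors: Robert Ganian,Hung P. Hoang,Simon Wietheger
Affiliation: unknown
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: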
Abstract:We study the computational problem of computing a fair means clustering of discrete vectors, which admits an equivalent formulation as editing a colored matrix into one with few distinct color-balanced rows by changing at most k values. While NP-hard in both the fairness-oblivious and the fair settings, the problem is well-known to admit a fixed-parameter algorithm in the former “vanilla” setting. As our first contribution, we exclude an analogous algorithm even for highly restricted fair means clustering instances. We then proceed to obtain a full complexity landscape of the problem, and establish tractability results which capture three means of circumventing our obtained lower bound: placing additional constraints on the problem instances, fixed-parameter approximation, or using an alternative parameterization targeting tree-like matrices.
[AI-20] Quantum Topological Graph Neural Networks for Detecting Complex Fraud Patterns
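[Quick read]: This paper addresses fraud detection in large-scale financial networks, where the central challenge is capturing complex transaction dynamics and structural anomalies. The key to the solution is a Quantum Topological Graph Neural Network (QTGNN) framework that integrates quantum embedding, variational graph convolutions, and topological data analysis. It combines entanglement-enhanced quantum data embedding, variational quantum graph convolutions with non-linear dynamics, extraction of higher-order topological invariants, and hybrid quantum-classical anomaly learning, with interpretable decisions via topological attribution. The framework comes with convergence guarantees on noisy intermediate-scale quantum (NISQ) devices and stable topological signatures, and uses circuit simplifications and graph sampling to scale to large transaction networks.
Link: https://arxiv.org/abs/2512.03696
Authors: Mohammad Doost,Mohammad Manthouri
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: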
Abstract:We propose a novel QTGNN framework for detecting fraudulent transactions in large-scale financial networks. By integrating quantum embedding, variational graph convolutions, and topological data analysis, QTGNN captures complex transaction dynamics and structural anomalies indicative of fraud. The methodology includes quantum data embedding with entanglement enhancement, variational quantum graph convolutions with non-linear dynamics, extraction of higher-order topological invariants, hybrid quantum-classical anomaly learning with adaptive optimization, and interpretable decision-making via topological attribution. Rigorous convergence guarantees ensure stable training on noisy intermediate-scale quantum (NISQ) devices, while stability of topological signatures provides robust fraud detection. Optimized for NISQ hardware with circuit simplifications and graph sampling, the framework scales to large transaction networks. Simulations on financial datasets, such as PaySim and Elliptic, benchmark QTGNN against classical and quantum baselines, using metrics like ROC-AUC, precision, and false positive rate. An ablation study evaluates the contributions of quantum embeddings, topological features, non-linear channels, and hybrid learning. QTGNN offers a theoretically sound, interpretable, and practical solution for financial fraud detection, bridging quantum machine learning, graph theory, and topological analysis.
[AI-21] Dynamically Scaled Activation Steering
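[Quick read]: This paper addresses the one-size-fits-all nature of existing methods for steering generative AI behavior: they intervene uniformly on every input, which degrades model performance when no steering is needed. The key to the solution is Dynamically Scaled Activation Steering (DSAS), a method-agnostic framework that decouples when to steer from how to steer. At generation time it computes context-dependent scaling factors that adapt the strength of an existing steering transformation, intervening strongly only when undesired behavior is detected. DSAS can also be optimized end-to-end jointly with the steering function, improves the Pareto front between toxicity mitigation and utility preservation, and adds little computational overhead while improving interpretability.
Link: https://arxiv.org/abs/2512.03661
Authors: Alex Ferrando,Xavier Suau,Jordi Gonzàlez,Pau Rodriguez
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: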
Abstract:Activation steering has emerged as a powerful method for guiding the behavior of generative models towards desired outcomes such as toxicity mitigation. However, most existing methods apply interventions uniformly across all inputs, degrading model performance when steering is unnecessary. We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer. DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected. At generation time, DSAS computes context-dependent scaling factors that selectively adjust the strength of any steering method. We also show how DSAS can be jointly optimized end-to-end together with the steering function. When combined with existing steering methods, DSAS consistently improves the Pareto front with respect to steering alone, achieving a better trade-off between toxicity mitigation and utility preservation. We further demonstrate DSAS’s generality by applying it to a text-to-image diffusion model, showing how adaptive steering allows the modulation of specific concepts. Finally, DSAS introduces minimal computational overhead while improving interpretability, pinpointing which tokens require steering and by how much.
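The separation of "when to steer" from "how to steer" can be sketched as an interpolation between the original activation and the steered one. Everything specific below is an assumption made for illustration: the scaling factor is produced by a hypothetical logistic probe on the activation, the steering transformation is a simple additive shift, and the vectors are invented. The paper's actual detector and optimization procedure are not reproduced.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scaling_factor(activation, probe_w, probe_b):
    """Hypothetical detector: a logistic probe mapping an activation to [0, 1]."""
    score = sum(w * a for w, a in zip(probe_w, activation)) + probe_b
    return sigmoid(score)


def dynamically_scaled_steer(activation, steer_fn, probe_w, probe_b):
    """Blend the original and steered activations by a context-dependent factor.

    `steer_fn` is any existing steering transformation (the "how");
    the scaling factor decides how strongly it is applied (the "when").
    """
    s = scaling_factor(activation, probe_w, probe_b)
    steered = steer_fn(activation)
    return [a + s * (t - a) for a, t in zip(activation, steered)]


steering_vector = [0.0, -2.0]                                   # invented direction
add_vector = lambda h: [x + v for x, v in zip(h, steering_vector)]
probe_w, probe_b = [10.0, 0.0], 0.0                             # invented probe

benign = [-5.0, 1.0]    # probe score very negative: factor near 0
flagged = [5.0, 1.0]    # probe score very positive: factor near 1
print(dynamically_scaled_steer(benign, add_vector, probe_w, probe_b))
print(dynamically_scaled_steer(flagged, add_vector, probe_w, probe_b))
```

The benign activation passes through almost unchanged while the flagged one receives nearly the full shift, which is the behavior the abstract describes as intervening strongly only when undesired behavior is detected.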
[AI-22] MemVerse: Multimodal Memory for Lifelong Learning Agents
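[Quick read]: This paper addresses the lack of reliable memory in current AI agents over long interactions, which hurts continual learning, long-horizon reasoning, and operation in multimodal or interactive environments. The key to the solution is MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory. It structures raw multimodal experiences into hierarchical knowledge graphs as long-term memory, and a periodic distillation mechanism compresses essential knowledge from long-term memory into the parametric model, giving fast, differentiable recall that remains interpretable while supporting continual consolidation, adaptive forgetting, and bounded memory growth.
Link: https://arxiv.org/abs/2512.03627
Authors: Junming Liu,Yifei Sun,Weihua Cheng,Haodong Lei,Yirong Chen,Licheng Wen,Xuemeng Yang,Daocheng Fu,Pinlong Cai,Nianchen Deng,Yi Yu,Shuyue Hu,Botian Shi,Ding Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, 2 tables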
Abstract:Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.
[AI-23] The promising potential of vision language models for the generation of textual weather forecasts
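[Quick read]: This paper addresses the still-nascent application of multimodal foundation models to the generation of meteorological products and services, with an eye to production efficiency and service innovation in the weather enterprise. The key to the solution is to explore using a Vision-Language Model (VLM) to write the classic Shipping Forecast text directly from video-encoded gridded weather data, enabling automated, scalable generation of textual weather information and opening a new technical path for the weather sector and beyond.
Link: https://arxiv.org/abs/2512.03623
Authors: Edward C. C. Steele,Dinesh Mane,Emilio Monti,Luis Orus,Rebecca Chantrill-Cheyette,Matthew Couch,Kirstine I. Dale,Simon Eaton,Govindarajan Rangarajan,Amir Majlesi,Steven Ramsdale,Michael Sharpe,Craig Smith,Jonathan Smith,Rebecca Yates,Holly Ellis,Charles Ewen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 7 pages, 2 tables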
Abstract:Despite the promising capability of multimodal foundation models, their application to the generation of meteorological products and services remains nascent. To accelerate aspiration and adoption, we explore the novel use of a vision language model for writing the iconic Shipping Forecast text directly from video-encoded gridded weather data. These early results demonstrate promising scalable technological opportunities for enhancing production efficiency and service innovation within the weather enterprise and beyond.
[AI-24] KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
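[Quick read]: This paper addresses the weight-loading and bandwidth pressure of deploying Large Language Models (LLMs) on resource-constrained edge devices, and in particular the fact that, as context length grows, the key-value (KV) cache can exceed the model weights in size, so that existing DRAM-dependent designs face prohibitive cost and capacity limits. The key to the solution is KVNAND, the first fully DRAM-free architecture based on in-flash computing (IFC), which stores both model weights and the KV cache in compute-enabled 3D NAND flash. It uses IFC for all memory-bound operations to reduce data transfer overhead, introduces head-group parallelism to raise throughput, maps the KV cache at page level to match flash organization, and adds an automated design space exploration framework to optimize the placement of weights and KV cache. Together these mitigate latency, energy, and reliability concerns and make flash a practical medium for KV storage in long-context settings.
Link: https://arxiv.org/abs/2512.03608
Authors: Lishuo Deng,Shaojie Xu,Jinwu Chen,Changwei Yan,Jiajie Wang,Zhe Jiang,Weiwei Shan
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: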
Abstract:Deploying large language models (LLMs) on edge devices enables personalized agents with strong privacy and low cost. However, with tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity, creating severe weight-loading and bandwidth pressures on resource-constrained platforms. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating weight-related linear computations in the decode phase with flash, yet still rely on DRAM for the key-value (KV) cache. As context length grows, the KV cache can exceed model weights in size, imposing prohibitive DRAM cost and capacity requirements. Attempts to offload KV cache to flash suffer from severe performance penalties. We propose KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and KV cache entirely in compute-enabled 3D NAND flash. KVNAND addresses the fundamental performance challenges of flash under intensive KV cache access by leveraging IFC for all memory-bound operations to reduce data transfer overhead, introducing head-group parallelism to boost throughput, and employing page-level KV cache mapping to align token access patterns with flash organization. In addition, we propose a design space exploration framework that evaluates discrete and compact KVNAND variants to balance weight and KV placement, automatically identifying the optimal design trade-off. These techniques mitigate latency, energy, and reliability concerns, turning flash into a practical medium for long-context KV storage. Evaluations on MHA 7B and GQA 70B LLMs show that KVNAND achieves 1.98×/1.94×/2.05× geomean speedup at 128/1K/10K-token contexts compared to DRAM-equipped IFC designs and addresses out-of-memory failures at 100K context length.
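The claim that the KV cache can outgrow the weights follows from simple arithmetic. The sketch below uses the usual size formula for a decoder-only transformer (keys plus values, per layer, per KV head, per token) and a hypothetical 7B-class multi-head-attention configuration with 16-bit values; the configuration is an assumption chosen for illustration, not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed configuration (hypothetical 7B-class MHA model, 16-bit storage).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 32, 128
WEIGHT_BYTES = 7_000_000_000 * 2  # roughly 7B parameters at 2 bytes each

per_token = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, seq_len=1)
print(per_token)  # 524288 bytes per token

for context in (128, 1_000, 10_000, 100_000):
    size = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, context)
    print(context, round(size / 1e9, 2), "GB", "exceeds weights" if size > WEIGHT_BYTES else "")
```

Under these assumptions the cache reaches about 52 GB at a 100K-token context, several times the roughly 14 GB of 16-bit weights. Grouped-query attention lowers `num_kv_heads` and therefore the cache size, which is one reason the abstract distinguishes MHA and GQA models.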
[AI-25] DeepRule: An Integrated Framework for Automated Business Rule Generation via Deep Predictive Modeling and Hybrid Search Optimization
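[Quick read]: This paper addresses the systematic misalignment between existing theoretical models and real-world economic complexity in retail assortment and pricing optimization, focusing on three challenges: (1) data modality mismatch, where unstructured textual sources (e.g. negotiation records, approval documents) impede accurate customer profiling; (2) dynamic feature entanglement, in modeling nonlinear price elasticity and time-varying attributes; and (3) operational infeasibility caused by multi-tier business constraints. The core of the solution is a three-level architecture. First, a hybrid knowledge fusion engine uses Large Language Models (LLMs) to parse unstructured text into structured features while integrating managerial expertise. Second, a game-theoretic constrained optimization mechanism reconciles supply chain interests through bilateral utility functions, encoding manufacturer-distributor profit redistribution as endogenous objectives under hierarchical constraints. Third, an interpretable decision distillation interface uses LLM-guided symbolic regression, with economic priors (e.g. non-negative elasticity) embedded as hard constraints during the search over mathematical expressions, to produce auditable pricing strategies and business rules. In real retail environments the framework achieves higher profits than systematic B2C baselines while ensuring operational feasibility, forming a closed-loop pipeline from unstructured knowledge injection through multi-agent optimization to interpretable strategy synthesis.
Link: https://arxiv.org/abs/2512.03607
Authors: Yusen Wu,Xiaotie Deng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: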
Abstract:This paper proposes DeepRule, an integrated framework for automated business rule generation in retail assortment and pricing optimization. Addressing the systematic misalignment between existing theoretical models and real-world economic complexities, we identify three critical gaps: (1) data modality mismatch where unstructured textual sources (e.g. negotiation records, approval documents) impede accurate customer profiling; (2) dynamic feature entanglement challenges in modeling nonlinear price elasticity and time-varying attributes; (3) operational infeasibility caused by multi-tier business constraints. Our framework introduces a tri-level architecture for the above challenges. We design a hybrid knowledge fusion engine employing large language models (LLMs) for deep semantic parsing of unstructured text, transforming distributor agreements and sales assessments into structured features while integrating managerial expertise. Then, a game-theoretic constrained optimization mechanism is employed to dynamically reconcile supply chain interests through bilateral utility functions, encoding manufacturer-distributor profit redistribution as endogenous objectives under hierarchical constraints. Finally, an interpretable decision distillation interface, leveraging LLM-guided symbolic regression to find and optimize pricing strategies and auditable business rules, embeds economic priors (e.g. non-negative elasticity) as hard constraints during mathematical expression search. We validate the framework in real retail environments, achieving higher profits versus systematic B2C baselines while ensuring operational feasibility. This establishes a closed-loop pipeline unifying unstructured knowledge injection, multi-agent optimization, and interpretable strategy synthesis for real economic intelligence.
Cite as: arXiv:2512.03607 [cs.AI] (or arXiv:2512.03607v1 [cs.AI] for this version), https://doi.org/10.48550/arXiv.2512.03607
[AI-26] When How Long and How Much? Interpretable Neural Networks for Time Series Regression by Learning to Mask and Aggregate
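[Quick Read]: This paper addresses the lack of interpretability in Time Series Extrinsic Regression (TSER). State-of-the-art TSER models predict well but are typically black boxes that do not reveal which temporal patterns drive their decisions, while existing post-hoc explanation techniques (such as feature attribution) often yield coarse, noisy, or unstable explanations. The authors propose MAGNETS (Mask-and-AGgregate NEtwork for Time Series), whose core innovation is an inherently interpretable neural architecture: it learns a set of human-understandable concepts without annotations, each concept being a mask-based aggregation over input features that shows which features matter for the prediction and when; predictions are then formed by combining these concepts through a transparent additive structure, giving clear insight into the model's decision process.
Link: https://arxiv.org/abs/2512.03578
Authors: Florent Forest, Amaury Wei, Olga Fink
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 4 tables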
Abstract: Time series extrinsic regression (TSER) refers to the task of predicting a continuous target variable from an input time series. It appears in many domains, including healthcare, finance, environmental monitoring, and engineering. In these settings, accurate predictions and trustworthy reasoning are both essential. Although state-of-the-art TSER models achieve strong predictive performance, they typically operate as black boxes, making it difficult to understand which temporal patterns drive their decisions. Post-hoc interpretability techniques, such as feature attribution, aim to explain how the model arrives at its predictions, but often produce coarse, noisy, or unstable explanations. Recently, inherently interpretable approaches based on concepts, additive decompositions, or symbolic regression have emerged as promising alternatives. However, these approaches remain limited: they require explicit supervision on the concepts themselves, often cannot capture interactions between time-series features, lack expressiveness for complex temporal patterns, and struggle to scale to high-dimensional multivariate data. To address these limitations, we propose MAGNETS (Mask-and-AGgregate NEtwork for Time Series), an inherently interpretable neural architecture for TSER. MAGNETS learns a compact set of human-understandable concepts without requiring any annotations. Each concept corresponds to a learned, mask-based aggregation over selected input features, explicitly revealing both which features drive predictions and when they matter in the sequence. Predictions are formed as combinations of these learned concepts through a transparent, additive structure, enabling clear insight into the model’s decision process.
Cite as: arXiv:2512.03578 [cs.LG] (or arXiv:2512.03578v1 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2512.03578
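As a rough illustration of the mask-and-aggregate idea described in the abstract, the sketch below computes each concept as a mask-weighted average over time steps and features and then combines the concepts additively. It is a minimal sketch and not the MAGNETS implementation: the function name `mask_and_aggregate`, the hand-written masks, and the weights are all hypothetical, and in the actual model the masks would be learned.

```python
import numpy as np

def mask_and_aggregate(x, masks, weights, bias=0.0):
    """x: (T, F) series; masks: (K, T, F) with values in [0, 1]; weights: (K,)."""
    # Each concept is a mask-weighted average over time steps and features.
    mass = masks.sum(axis=(1, 2)) + 1e-8
    concepts = (masks * x[None, :, :]).sum(axis=(1, 2)) / mass
    # Additive head: every concept contributes weight * value to the output.
    contributions = weights * concepts
    return contributions.sum() + bias, concepts, contributions

# Toy input: 4 time steps, 2 features.
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

masks = np.zeros((2, 4, 2))
masks[0, :2, 0] = 1.0   # concept 0: feature 0 during the first two steps
masks[1, 2:, 1] = 1.0   # concept 1: feature 1 during the last two steps
weights = np.array([2.0, 0.1])  # hypothetical additive weights

prediction, concepts, contributions = mask_and_aggregate(x, masks, weights)
```

Because the head is additive, `contributions` can be read directly as the per-concept share of the prediction, and each mask shows which features and time steps a concept looks at.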
[AI-27] EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths NEURIPS2025
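[Quick Read]: This paper addresses the tight coupling between core workflow logic and inference-time strategy (e.g., tree search) in current Large Language Model (LLM) based agent programming. The key to the solution is "probabilistic angelic nondeterminism" (PAN), a programming model that decouples the two: the programmer describes the agent workflow independently and can experiment with different inference-time strategies by changing only a few inputs, which improves agent reliability and simplifies switching strategies. The authors implement PAN as EnCompass, a Python framework that uses a decorator to compile agent workflow programs into a search space, substantially lowering development and tuning costs.
Link: https://arxiv.org/abs/2512.03571
Authors: Zhening Li, Armando Solar-Lezama, Yisong Yue, Stephan Zheng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: 65 pages, 2 figures, published in NeurIPS 2025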
Abstract:We introduce a new approach to agent programming, the development of LLM-based agents. Current approaches to agent programming often entangle two aspects of agent design: the core workflow logic and the inference-time strategy (e.g., tree search). We introduce “probabilistic angelic nondeterminism” (“PAN”), a programming model that disentangles these two concerns, allowing the programmer to describe the agent workflow and independently experiment with different inference-time strategies by simply changing a few inputs. We provide an implementation of PAN in Python as the EnCompass framework, which uses a Python decorator to compile agent workflow programs into a search space. We present three case studies that demonstrate how the framework lets the programmer quickly improve the reliability of an agent and easily switch between different inference-time strategies, all with little additional coding.
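To make the separation between workflow logic and search strategy concrete, here is a toy sketch of angelic nondeterminism in plain Python. It does not use or reproduce the EnCompass API: `Chooser`, `search`, and the replay-based exploration are hypothetical names and design choices made for illustration, the choice points are fixed lists rather than LLM samples, and the probabilistic part of PAN is not modelled.

```python
class _NeedChoice(Exception):
    """Raised when the workflow reaches a choice point with no scripted decision."""
    def __init__(self, n_options):
        super().__init__(n_options)
        self.n_options = n_options

class Chooser:
    """Replays a scripted list of decisions, one per choice point."""
    def __init__(self, script):
        self.script = list(script)
        self.pos = 0

    def choose(self, options):
        options = list(options)
        if self.pos >= len(self.script):
            raise _NeedChoice(len(options))
        picked = options[self.script[self.pos]]
        self.pos += 1
        return picked

def search(workflow, score):
    """Exhaustive depth-first search over all execution paths of `workflow`."""
    best_score, best_result = None, None
    stack = [[]]  # each entry is a partial script of decision indices
    while stack:
        script = stack.pop()
        try:
            result = workflow(Chooser(script))
        except _NeedChoice as need:
            # Extend the script with every option at the new choice point.
            for i in reversed(range(need.n_options)):
                stack.append(script + [i])
            continue
        s = score(result)
        if best_score is None or s > best_score:
            best_score, best_result = s, result
    return best_result, best_score

def workflow(c):
    # Workflow logic only: it never says how choices are resolved.
    a = c.choose([1, 2, 3])
    b = c.choose([10, 20])
    return a * b

best, best_score = search(workflow, score=lambda r: -abs(r - 40))
```

Swapping exhaustive search for, say, random sampling or beam search would only change `search`; `workflow` stays untouched. Note that this sketch re-runs the workflow from the start for every path, which is only reasonable when the workflow is deterministic and free of side effects.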
[AI-28] Machine Learning to Predict Slot Usage in TSCH Wireless Sensor Networks
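[Quick Read]: This paper addresses the tension between ultra-low power consumption and deterministic operation in industrial Wireless Sensor Networks (WSNs), and in particular how to further improve energy efficiency under the Time Slotted Channel Hopping (TSCH) protocol. The key to the solution is to use machine learning models to learn the traffic patterns of a TSCH network, so that nodes can enter deep sleep when no transmission is expected, yielding dynamic energy savings. By analyzing prediction performance at different levels of a typical tree topology, the study finds that model capability degrades closer to the network root, and simulations based on an accurate wireless sensor node model show that the selected algorithms can substantially reduce the overall power consumption of a TSCH network.
Link: https://arxiv.org/abs/2512.03570
Authors: Stefano Scanzio, Gabriele Formis, Tullio Facchinetti, Gianluca Cena
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint accepted, 8 pages, 2025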
Abstract:Wireless sensor networks (WSNs) are employed across a wide range of industrial applications where ultra-low power consumption is a critical prerequisite. At the same time, these systems must maintain a certain level of determinism to ensure reliable and predictable operation. In this view, time slotted channel hopping (TSCH) is a communication technology that meets both conditions, making it an attractive option for its usage in industrial WSNs. This work proposes the use of machine learning to learn the traffic pattern generated in networks based on the TSCH protocol, in order to turn nodes into a deep sleep state when no transmission is planned and thus to improve the energy efficiency of the WSN. The ability of machine learning models to make good predictions at different network levels in a typical tree network topology was analyzed in depth, showing how their capabilities degrade while approaching the root of the tree. The application of these models on simulated data based on an accurate modeling of wireless sensor nodes indicates that the investigated algorithms can be suitably used to further and substantially reduce the power consumption of a TSCH network.
[AI-29] State Space Models for Bioacoustics: A comparative Evaluation with Transformers
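[Quick Read]: This paper addresses the trade-off between model performance and computational efficiency in bioacoustics, namely how to keep accuracy high while reducing video memory (VRAM) consumption. The key to the solution is to build an audio large language model (audio LLM) on Mamba, a State Space Model (SSM) architecture. After self-supervised pretraining and fine-tuning, the model achieves classification and detection performance on the BEANS benchmark comparable to AVES, a state-of-the-art Transformer model, while using significantly less VRAM, showing potential for deploying bioacoustic analysis models in resource-constrained settings.
Link: https://arxiv.org/abs/2512.03563
Authors: Chengyu Tang, Sanjeev Baskiyar
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: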
Abstract: In this study, we evaluate the efficacy of the Mamba model in the field of bioacoustics. We first pretrain a Mamba-based audio large language model (LLM) on a large corpus of audio data using self-supervised learning. We fine-tune and evaluate BioMamba on the BEANS benchmark, a collection of diverse bioacoustic tasks including classification and detection, and compare its performance and efficiency with multiple baseline models, including AVES, a state-of-the-art Transformer-based model. The results show that BioMamba achieves comparable performance with AVES while consuming significantly less VRAM, demonstrating its potential in this domain.
[AI-30] Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks AAAI2026
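[Quick Read]: This paper addresses two core challenges that autonomous agents face on complex enterprise tasks: first, single-agent architectures couple strategic planning with execution, which causes trajectory instability; second, locally deployed open-weight models have smaller context windows and struggle with the rapid context consumption caused by large outputs from multiple tools. The key to the solution is RP-ReAct (Reasoner Planner-ReAct), a multi-agent framework that fully decouples strategic planning from low-level execution: a Reasoner Planner Agent (RPA), backed by a Large Reasoning Model, plans each step and analyzes the results, while a Proxy-Execution Agent (PEA) turns sub-steps into concrete tool calls using ReAct and manages large tool outputs through external storage with on-demand access. This mitigates context window overflow and improves reliability, efficiency, and robustness across model scales.
Link: https://arxiv.org/abs/2512.03560
Authors: Gianni Molinari, Fabio Ciravegna
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 11 pages, 1 figure, 2 tables, Workshop AAAI 2026 agentic AI Benchmarks and Applications for Enterprise Tasks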
Abstract: Despite recent advances, autonomous agents often struggle to solve complex tasks in enterprise domains that require coordinating multiple tools and processing diverse data sources. This struggle is driven by two main limitations. First, single-agent architectures enforce a monolithic plan-execute loop, which directly causes trajectory instability. Second, the requirement to use local open-weight models for data privacy introduces smaller context windows, leading to the rapid consumption of context by large tool outputs. To solve this problem we introduce RP-ReAct (Reasoner Planner-ReAct), a novel multi-agent approach that fundamentally decouples strategic planning from low-level execution to achieve superior reliability and efficiency. RP-ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub-step and continuously analysing the execution results using the strong reasoning capabilities of a Large Reasoning Model, and one or more Proxy-Execution Agents (PEAs) that translate sub-steps into concrete tool interactions using a ReAct approach. Crucially, we incorporate a context-saving strategy within the PEA to mitigate context window overflow by managing large tool outputs via external storage and on-demand access. We evaluate RP-ReAct on the challenging, multi-domain ToolQA benchmark using a diverse set of six open-weight reasoning models. Our empirical results show that RP-ReAct achieves superior performance and improved generalization ability over state-of-the-art baselines when addressing diverse complex tasks across the evaluated domains. Furthermore, we establish the enhanced robustness and stability of our approach across different model scales, paving the way for effective and deployable agentic solutions for enterprises.
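The context-saving strategy mentioned in the abstract (external storage plus on-demand access) could look roughly like the sketch below. This is an assumption-laden illustration rather than the RP-ReAct code: the `OutputStore` class, its thresholds, and the handle format are all hypothetical.

```python
class OutputStore:
    """Keeps large tool outputs outside the prompt and hands back short handles."""

    def __init__(self, max_inline_chars=200, preview_chars=80):
        self.max_inline_chars = max_inline_chars
        self.preview_chars = preview_chars
        self._blobs = {}

    def observe(self, tool_output):
        """Return the text that would be placed in the agent's context."""
        if len(tool_output) <= self.max_inline_chars:
            return tool_output  # small outputs go into the context unchanged
        handle = f"out_{len(self._blobs)}"
        self._blobs[handle] = tool_output
        preview = tool_output[: self.preview_chars]
        return f"[stored as {handle}, {len(tool_output)} chars] {preview}..."

    def read(self, handle, start=0, length=200):
        """On-demand access to a slice of a stored output."""
        return self._blobs[handle][start:start + length]

store = OutputStore()
small = store.observe("42 rows matched")

big_output = "\n".join(f"row {i}: value={i * i}" for i in range(1000))
summary = store.observe(big_output)             # short handle plus preview
chunk = store.read("out_0", start=0, length=14)  # fetch only what is needed
```

The executor's context then grows by the length of `summary` rather than the full tool output, and later steps can pull specific slices through `read`.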
[AI-31] PARC: An Autonomous Self-Reflective Coding Agent for Robust Execution of Long-Horizon Tasks
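[Quick Read]: This paper addresses the lack of autonomy and robustness of AI systems on long-horizon computational tasks, especially in complex scientific computing and data analysis settings where it is hard to carry out the full cycle from planning through execution to error correction independently. The key to the solution is PARC (Planning, Acting, and Reflecting with self-assessment and self-feedback), built around a hierarchical multi-agent architecture comprising task planning, an execution module, and a self-assessment and self-feedback mechanism that runs in an independent context. This mechanism lets the system detect and correct high-level strategic errors and keep making progress without human intervention, enabling end-to-end automated execution with high-quality results in materials science and data science.
Link: https://arxiv.org/abs/2512.03549
Authors: Yuki Orimo, Iori Kurata, Hodaka Mori, Ryuhei Okuno, Ryohto Sawada, Daisuke Okanohara
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: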
Abstract:We introduce PARC, a coding agent for the autonomous and robust execution of long-horizon computational tasks. PARC is built on a hierarchical multi-agent architecture incorporating task planning, execution, and a mechanism that evaluates its own actions and their outcomes from an independent context and provides feedback, namely self-assessment and self-feedback. This design enables PARC to detect and correct high-level strategic errors and sustain progress without human intervention. We evaluate PARC across computational science and data science tasks. In materials science, it autonomously reproduces key results from studies on lithium-ion conduction and alloy segregation. In particular, it coordinates dozens of parallel simulation tasks, each requiring roughly 43 hours of computation, managing orchestration, monitoring, and error correction end-to-end. In Kaggle-based experiments, starting from minimal natural-language instructions, PARC conducts data analysis and implements search strategies, producing solutions competitive with human-engineered baselines. These results highlight the potential of integrating a hierarchical multi-agent system with self-assessment and self-feedback to enable AI systems capable of independent, large-scale scientific and analytical work.
[AI-32] A Learning-based Control Methodology for Transitioning VTOL UAVs
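[Quick Read]: This paper addresses the control difficulties of Vertical Take-Off and Landing Unmanned Aerial Vehicles (VTOL UAVs) during the transition phase, where the tilting rotor mechanism shifts the center of gravity and thrust direction; existing decoupled altitude and position control tends to cause significant vibration and struggles to account for interactions and adaptability. The key to the solution is a coupled transition control method driven by Reinforcement Learning (RL), called ST3M, which treats cruise mode as a special case of hover to achieve smoother and more accurate attitude and position control. Simulation and real-world experiments verify its efficiency, transferability, and improved trajectory tracking, along with reduced vibration during transition.
Link: https://arxiv.org/abs/2512.03548
Authors: Zexin Lin, Yebin Zhong, Hanwen Wan, Jiu Cheng, Zhenglong Sun, Xiaoqiang Ji
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: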
Abstract: Transition control poses a critical challenge in Vertical Take-Off and Landing Unmanned Aerial Vehicle (VTOL UAV) development due to the tilting rotor mechanism, which shifts the center of gravity and thrust direction during transitions. The decoupled control of altitude and position used by current methods leads to significant vibration and limits the consideration of interactions and adaptability. In this study, we propose a novel coupled transition control methodology based on a reinforcement learning (RL) driven controller. In contrast to the conventional phase-transition approach, the ST3M method offers a new perspective by treating cruise mode as a special case of hover. We validate the feasibility of applying our method in simulation and real-world environments, demonstrating efficient controller development and migration while accurately controlling UAV position and attitude, exhibiting outstanding trajectory tracking and reduced vibrations during the transition process.
[AI-33] Multi-Agent Reinforcement Learning with Communication-Constrained Priors
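[Quick Read]: This paper addresses the low learning efficiency and poor robustness of Multi-Agent Reinforcement Learning (MARL) policies when communication channels are lossy. Existing methods scale poorly to complex, dynamic real-world environments and are sensitive to communication loss, which limits practical use. The key to the solution is a generalized communication-constrained model that uniformly characterizes communication conditions across scenarios and serves as a learning prior for distinguishing lossy from lossless messages. A dual mutual information estimator then decouples the effects of the two message types on distributed decision-making, and the impact of communication messages is quantified into the global reward, yielding a communication-constrained MARL framework with better performance and adaptability under unstable communication.
Link: https://arxiv.org/abs/2512.03528
Authors: Guang Yang, Tianpei Yang, Jingwen Qiao, Yanqing Wu, Jing Huo, Xingguo Chen, Yang Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: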
Abstract: Communication is one of the effective means to improve the learning of cooperative policy in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing methods for multi-agent reinforcement learning with communication, due to their limited scalability and robustness, struggle to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. Based on this, we utilize it as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimator, and introduce a communication-constrained multi-agent reinforcement learning framework, quantifying the impact of communication messages into the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.
[AI-34] Physics-Driven Learning Framework for Tomographic Tactile Sensing
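[Quick Read]: This paper addresses the severe artifacts and inaccurate contact reconstruction that the nonlinear inverse problem of Electrical Impedance Tomography (EIT) causes in large-area tactile sensing. The key to the solution is PhyDNN, a physics-driven deep reconstruction framework that embeds the EIT forward model directly into the learning objective: it jointly minimizes the discrepancy between predicted and ground-truth conductivity maps and enforces consistency with the forward partial differential equation (PDE), improving the physical plausibility and generalization of the reconstructions. In addition, a differentiable forward-operator network enables efficient backpropagation and considerably speeds up physics-guided training.
Link: https://arxiv.org/abs/2512.03512
Authors: Xuanxuan Yang, Xiuyang Zhang, Haofeng Chen, Gang Ma, Xiaojie Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures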
Abstract:Electrical impedance tomography (EIT) provides an attractive solution for large-area tactile sensing due to its minimal wiring and shape flexibility, but its nonlinear inverse problem often leads to severe artifacts and inaccurate contact reconstruction. This work presents PhyDNN, a physics-driven deep reconstruction framework that embeds the EIT forward model directly into the learning objective. By jointly minimizing the discrepancy between predicted and ground-truth conductivity maps and enforcing consistency with the forward PDE, PhyDNN reduces the black-box nature of deep networks and improves both physical plausibility and generalization. To enable efficient backpropagation, we design a differentiable forward-operator network that accurately approximates the nonlinear EIT response, allowing fast physics-guided training. Extensive simulations and real tactile experiments on a 16-electrode soft sensor show that PhyDNN consistently outperforms NOSER, TV, and standard DNNs in reconstructing contact shape, location, and pressure distribution. PhyDNN yields fewer artifacts, sharper boundaries, and higher metric scores, demonstrating its effectiveness for high-quality tomographic tactile sensing.
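Written as a formula, the joint objective described in the abstract might take a form like the one below. The notation is an assumption made for illustration and is not taken from the paper: $V$ denotes the measured boundary voltages, $\sigma$ the ground-truth conductivity map, $G_\theta$ the reconstruction network, $F_\phi$ the differentiable forward-operator network, and $\lambda$ a hypothetical weighting factor; the paper's actual loss may use different norms or terms.

```latex
\mathcal{L}(\theta)
  = \underbrace{\bigl\lVert G_\theta(V) - \sigma \bigr\rVert_2^2}_{\text{data term}}
  + \lambda \,
    \underbrace{\bigl\lVert F_\phi\bigl(G_\theta(V)\bigr) - V \bigr\rVert_2^2}_{\text{forward-model consistency}}
```

The second term penalizes reconstructions whose simulated measurements disagree with the observed ones, which is one way to read "enforcing consistency with the forward PDE".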
[AI-35] ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms
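[Quick Read]: This paper addresses the gap between theoretical conceptualization and computational implementation in Scientific Computing (SciC) and Scientific Machine Learning (SciML), a bottleneck that limits the efficiency and innovation of algorithm design. The key to the solution is ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), a knowledge-driven diagnostic framework centered on the "HENA loop" that models algorithm selection as a Contextual Bandit problem. Through online learning, the framework analyzes prior trials, selects structural "actions" (Aₙ) from combinatorial spaces, and translates them into executable code (Sₙ) to generate scientific rewards (Rₙ). Beyond automation, in SciC ATHENA autonomously identifies mathematical symmetries to obtain exact analytical solutions, or derives stable numerical solvers where foundation models fail; in SciML it performs deep diagnosis to handle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with the finite element method, FEM) to solve multiphysics problems. It reaches super-human performance with validation errors as low as 10⁻¹⁴, and human-in-the-loop intervention further improves stability substantially.
Link: https://arxiv.org/abs/2512.03476
Authors: Juan Diego Toscano, Daniel T. Chen, George Em Karniadakis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments: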
Abstract: Bridging the gap between theoretical conceptualization and computational implementation is a major bottleneck in Scientific Computing (SciC) and Scientific Machine Learning (SciML). We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual Bandit problem. Acting as an online learner, the system analyzes prior trials to select structural "actions" (A_n) from combinatorial spaces guided by expert blueprints (e.g., Universal Approximation, Physics-Informed constraints). These actions are translated into executable code (S_n) to generate scientific rewards (R_n). ATHENA transcends standard automation: in SciC, it autonomously identifies mathematical symmetries for exact analytical solutions or derives stable numerical solvers where foundation models fail. In SciML, it performs deep diagnosis to tackle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with FEM) to resolve multiphysics problems. The framework achieves super-human performance, reaching validation errors of 10^-14. Furthermore, collaborative "human-in-the-loop" intervention allows the system to bridge stability gaps, improving results by an order of magnitude. This paradigm shifts the focus from implementation mechanics to methodological innovation, accelerating scientific discovery.
[AI-36] AsymPuzl: An Asymmetric Puzzle for multi-agent cooperation NEURIPS
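[Quick Read]: This paper addresses the lack of controlled evaluation of Large Language Models (LLMs) in multi-turn, multi-agent cooperation, in particular how to study effective communication strategies between LLMs under information asymmetry. The key to the solution is AsymPuzl, a minimal but expressive two-agent puzzle environment in which each agent sees only part of a symbolic puzzle and must exchange messages to solve it cooperatively. This design allows a systematic analysis of how different LLMs behave within a limited number of communication turns: strong models converge reliably by sharing complete information in two turns, whereas weaker models tend to ignore partner messages or over-correct their own hypotheses, and feedback design has a marked effect on performance. The environment thus provides a quantifiable, reproducible testbed for understanding the multi-turn cooperation abilities of LLMs.
Link: https://arxiv.org/abs/2512.03466
Authors: Xavier Cadet, Edward Koh, Peter Chin
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS MTI-LLM 2025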
Abstract:Large Language Model (LLM) agents are increasingly studied in multi-turn, multi-agent scenarios, yet most existing setups emphasize open-ended role-play rather than controlled evaluation. We introduce AsymPuzl, a minimal but expressive two-agent puzzle environment designed to isolate communication under information asymmetry. Each agent observes complementary but incomplete views of a symbolic puzzle and must exchange messages to solve it cooperatively. Using a diverse set of current-generation and open-source LLMs, we show that (i) strong models such as GPT-5 and Claude-4.0 reliably converge across puzzle sizes on the solution by sharing complete information in two turns, (ii) weaker models often ignore partner messages or over-correct their hypotheses, and (iii) feedback design is non-trivial: simple self-feedback improves success rates, while detailed joint feedback can hurt performance. These findings show that even in simple cooperative tasks, LLM communication strategies diverge and depend on the granularity of feedback signals. AsymPuzl thus provides a testbed for probing the limits of multi-turn cooperation and opens avenues for studying coordination mechanisms.
[AI-37] Multimodal Reinforcement Learning with Agentic Verifier for AI Agents
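[Quick Read]: This paper addresses the sparse reward signals and lack of fine-grained guidance in Multimodal Reinforcement Learning (MMRL): current training relies mainly on sparse rewards computed from final answers, which guide behavior poorly on complex reasoning tasks. The key to the solution is Argos, an agentic reward mechanism grounded in task objectives that, for each sample, dynamically selects from a pool of teacher-model-derived and rule-based scoring functions to evaluate simultaneously (i) final response accuracy, (ii) the spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. This multi-dimensional, configurable reward design improves performance on spatial reasoning, visual hallucination, and robotics and embodied AI benchmarks; online verification prevents reward hacking and ungrounded solutions, and the method's effectiveness is supported theoretically through Pareto optimality.
Link: https://arxiv.org/abs/2512.03438
Authors: Reuben Tan, Baolin Peng, Zhengyuan Yang, Hao Cheng, Oier Mees, Theodore Zhao, Andrea Tupini, Isar Meijier, Qianhui Wu, Yuncong Yang, Lars Liden, Yu Gu, Sheng Zhang, Xiaodong Liu, Lijuan Wang, Marc Pollefeys, Yong Jae Lee, Jianfeng Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: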
Abstract: Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed based on the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes, since different samples may require different scoring functions and teacher models may provide noisy reward signals too. In this paper, we introduce Argos (Agentic Reward for Grounded Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning, visual hallucination as well as robotics and embodied AI benchmarks. Critically, we demonstrate that just relying on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward-hacking in MMRL. Finally, we also provide a theoretical justification for the effectiveness of Argos through the concept of Pareto-optimality.
[AI-38] World Models for Autonomous Navigation of Terrestrial Robots from LIDAR Observations
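[Quick Read]: This paper addresses the challenges of autonomous ground-robot navigation from LIDAR observations caused by high-dimensional sensor data and the low sample efficiency of model-free Reinforcement Learning (RL). Conventional policy networks struggle with full-resolution LIDAR inputs, forcing earlier methods to rely on simplified observations that weaken spatial awareness and navigation robustness. The key to the solution is a model-based RL framework built on DreamerV3 whose core innovation is embedding a Multi-Layer Perceptron Variational Autoencoder (MLP-VAE) in the world model to compress high-dimensional LIDAR readings into compact latent representations, combined with a learned dynamics predictor for imagination-based policy optimization. This design improves training efficiency and navigation success: with the full TurtleBot3 LIDAR data (360 readings) the method reaches a 100% success rate, outperforming model-free baselines such as SAC, DDPG, and TD3.
Link: https://arxiv.org/abs/2512.03429
Authors: Raul Steinmetz, Fabio Demo Rosa, Victor Augusto Kich, Jair Augusto Bottega, Ricardo Bedin Grando, Daniel Fernando Tello Gamarra
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the Journal of Intelligent and Fuzzy Systems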
Abstract:Autonomous navigation of terrestrial robots using Reinforcement Learning (RL) from LIDAR observations remains challenging due to the high dimensionality of sensor data and the sample inefficiency of model-free approaches. Conventional policy networks struggle to process full-resolution LIDAR inputs, forcing prior works to rely on simplified observations that reduce spatial awareness and navigation robustness. This paper presents a novel model-based RL framework built on top of the DreamerV3 algorithm, integrating a Multi-Layer Perceptron Variational Autoencoder (MLP-VAE) within a world model to encode high-dimensional LIDAR readings into compact latent representations. These latent features, combined with a learned dynamics predictor, enable efficient imagination-based policy optimization. Experiments on simulated TurtleBot3 navigation tasks demonstrate that the proposed architecture achieves faster convergence and higher success rate compared to model-free baselines such as SAC, DDPG, and TD3. It is worth emphasizing that the DreamerV3-based agent attains a 100% success rate across all evaluated environments when using the full dataset of the Turtlebot3 LIDAR (360 readings), while model-free methods plateaued below 85%. These findings demonstrate that integrating predictive world models with learned latent representations enables more efficient and robust navigation from high-dimensional sensory data.
[AI-39] BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation on Complex Documents
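[Quick Read]: This paper addresses the poor performance of existing Retrieval-Augmented Generation (RAG) methods on hierarchically structured documents (such as books and handbooks). Such documents organize content at multiple levels of granularity, but conventional RAG models tend to ignore this logical hierarchy, which lowers retrieval relevance and question-answering accuracy. The key to the solution is the BookRAG framework, whose core innovations are: a new index structure called BookIndex, which extracts the document's hierarchical tree as a table of contents, models the complex relationships between entities with a graph, and maps entities to tree nodes; and, on top of it, an agent-based query method inspired by Information Foraging Theory that dynamically classifies queries and runs a tailored retrieval workflow, significantly improving retrieval recall and QA accuracy.
Link: https://arxiv.org/abs/2512.03413
Authors: Shu Wang, Yingli Zhou, Yixiang Fang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: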
Abstract: As an effective method to boost the performance of Large Language Models (LLMs) on the question answering (QA) task, Retrieval-Augmented Generation (RAG), which queries highly relevant information from external complex documents, has attracted tremendous attention from both industry and academia. Existing RAG approaches often focus on general documents, and they overlook the fact that many real-world documents (such as books, booklets, handbooks, etc.) have a hierarchical structure, which organizes their content at different granularity levels, leading to poor performance on the QA task. To address these limitations, we introduce BookRAG, a novel RAG approach targeted at documents with a hierarchical structure, which exploits logical hierarchies and traces entity relations to query the highly relevant information. Specifically, we build a novel index structure, called BookIndex, by extracting a hierarchical tree from the document, which serves as its table of contents, using a graph to capture the intricate relationships between entities, and mapping entities to tree nodes. Leveraging the BookIndex, we then propose an agent-based query method inspired by the Information Foraging Theory, which dynamically classifies queries and employs a tailored retrieval workflow. Extensive experiments on three widely adopted benchmarks demonstrate that BookRAG achieves state-of-the-art performance, significantly outperforming baselines in both retrieval recall and QA accuracy while maintaining competitive efficiency.
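The three ingredients the abstract names for BookIndex (a section tree, an entity graph, and an entity-to-node mapping) can be pictured with a toy data structure like the one below. Everything here is a hypothetical illustration: the `Section` class, the hand-built index, and the `retrieve` function are not BookRAG's API, and the real system extracts the tree and entities automatically and uses an agent to choose the retrieval workflow.

```python
class Section:
    """A node in the document's table-of-contents tree."""
    def __init__(self, title, text="", parent=None):
        self.title, self.text, self.parent = title, text, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Titles from the root down to this section."""
        node, parts = self, []
        while node is not None:
            parts.append(node.title)
            node = node.parent
        return " > ".join(reversed(parts))

# Hierarchical tree (hand-built for the example).
book = Section("Handbook")
ch1 = Section("1 Installation", parent=book)
s11 = Section("1.1 Requirements", "Requires Python and a GPU driver.", parent=ch1)
ch2 = Section("2 Usage", parent=book)
s21 = Section("2.1 Training", "Training uses the GPU driver and a config file.", parent=ch2)

# Mapping from entities to the tree nodes that mention them.
entity_to_sections = {"GPU driver": [s11, s21], "config file": [s21]}
# Entity relation graph as an adjacency map.
entity_graph = {"GPU driver": {"config file"}, "config file": {"GPU driver"}}

def retrieve(entity, hops=0):
    """Return section paths for an entity and, optionally, its graph neighbours."""
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        frontier = {n for e in frontier for n in entity_graph.get(e, ())} - seen
        seen |= frontier
    hits = []
    for e in sorted(seen):
        for sec in entity_to_sections.get(e, []):
            if sec.path() not in hits:
                hits.append(sec.path())
    return hits

direct = retrieve("config file")
expanded = retrieve("config file", hops=1)
```

Following one hop in the entity graph widens the result from the single section that mentions the entity to every section reachable through related entities, while the tree path keeps each hit located within the document's hierarchy.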
[AI-40] Better World Models Can Lead to Better Post-Training Performance
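[Quick Read]: This paper studies how explicit world-modeling objectives can improve the quality of a Transformer's internal representations across training stages and its downstream task performance. The core challenge is that pretraining with next-token prediction alone may not produce state representations that are efficient and interpretable for sequence-planning tasks such as solving a Rubik's Cube. The key to the solution is two explicit world-modeling strategies, (i) state-prediction pretraining and (ii) a joint state-prediction and next-token objective, followed by reinforcement learning post-training with Group Relative Policy Optimization (GRPO). Experiments show that these strategies make latent state representations markedly more linearly decodable and causally steerable, and that higher-quality state representations yield larger GRPO gains, especially on harder cube states, supporting the idea that sharper state representations make policy learning in post-training more effective.
Link: https://arxiv.org/abs/2512.03400
Authors: Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: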
Abstract:In this work we study how explicit world-modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2x2x2 Rubik’s Cube and ask: (1) how does explicitly pretraining a world model affect the model’s latent representations, and (2) how does world-model quality affect the model’s performance after reinforcement learning post-training? We compare standard next-token prediction to two explicit world-modeling strategies – (i) state-prediction pretraining and (ii) a joint state-prediction + next-token objective – and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post-training. We evaluate the representation quality with linear probes and causal interventions. We find that explicit world-modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains for GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post-training for sequence-planning tasks.
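For readers unfamiliar with GRPO, its central step is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization only; the function name and the toy rewards are illustrative, the use of the population standard deviation is an assumption (implementations differ), and the clipped policy-gradient update and KL term of the full algorithm are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled solutions to one cube state: 1.0 = solved, 0.0 = not solved.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Solved attempts receive a positive advantage and failed ones a negative advantage, so no separate value network is needed to provide a baseline.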
[AI-41] VS-Graph: Scalable and Efficient Graph Classification Using Hyperdimensional Computing
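[Quick Read]: This paper addresses the high computational cost of Graph Neural Networks (GNNs) for graph classification, which makes them hard to deploy on resource-constrained devices, while also overcoming the difficulty that existing Hyperdimensional Computing (HDC) methods have in matching GNN predictive performance. The key to the solution is the VS-Graph framework, whose core innovations are: (1) a Spike Diffusion mechanism for topology-driven node identification; and (2) an Associative Message Passing scheme that performs multi-hop neighborhood aggregation within the high-dimensional vector space, without gradient-based optimization or backpropagation. The method keeps the lightweight nature of HDC while being considerably more expressive: it matches or exceeds modern GNNs in accuracy on several standard datasets, speeds up training by up to 450x, and maintains high accuracy even at a very low dimensionality (D=128), making it a candidate for edge and neuromorphic hardware.
Link: https://arxiv.org/abs/2512.03394
Authors: Hamed Poursiami, Shay Snyder, Guojing Cong, Thomas Potok, Maryam Parsa
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: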
Abstract:Graph classification is a fundamental task in domains ranging from molecular property prediction to materials design. While graph neural networks (GNNs) achieve strong performance by learning expressive representations via message passing, they incur high computational costs, limiting their scalability and deployment on resource-constrained devices. Hyperdimensional Computing (HDC), also known as Vector Symbolic Architectures (VSA), offers a lightweight, brain-inspired alternative, yet existing HDC-based graph methods typically struggle to match the predictive performance of GNNs. In this work, we propose VS-Graph, a vector-symbolic graph learning framework that narrows the gap between the efficiency of HDC and the expressive power of message passing. VS-Graph introduces a Spike Diffusion mechanism for topology-driven node identification and an Associative Message Passing scheme for multi-hop neighborhood aggregation entirely within the high-dimensional vector space. Without gradient-based optimization or backpropagation, our method achieves competitive accuracy with modern GNNs, outperforming the prior HDC baseline by 4-5% on standard benchmarks such as MUTAG and DD. It also matches or exceeds the performance of the GNN baselines on several datasets while accelerating the training by a factor of up to 450x. Furthermore, VS-Graph maintains high accuracy even with the hypervector dimensionality reduced to D=128, demonstrating robustness under aggressive dimension compression and paving the way for ultra-efficient execution on edge and neuromorphic hardware.
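The basic HDC operations that make gradient-free graph encoding possible are binding (element-wise multiplication) and bundling (addition) of random high-dimensional vectors. The sketch below shows a generic edge-binding encoder built from those two operations. It is not VS-Graph: Spike Diffusion and Associative Message Passing are not implemented, and the function names, the dimensionality, and the toy graphs are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048  # hypervector dimensionality (illustrative choice)

def random_hv():
    """Random bipolar hypervector with entries in {-1, +1}."""
    return rng.choice([-1, 1], size=D)

def encode_graph(edges, labels, label_hvs):
    """Bundle (sum) the bound (multiplied) label hypervectors of each edge."""
    graph = np.zeros(D)
    for u, v in edges:
        graph += label_hvs[labels[u]] * label_hvs[labels[v]]
    return graph

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

label_hvs = {lab: random_hv() for lab in "CNO"}
chain = [(0, 1), (1, 2)]

g1 = encode_graph(chain, {0: "C", 1: "N", 2: "O"}, label_hvs)  # C-N-O
g2 = encode_graph(chain, {0: "O", 1: "N", 2: "C"}, label_hvs)  # O-N-C, same graph relabelled
g3 = encode_graph(chain, {0: "C", 1: "C", 2: "C"}, label_hvs)  # C-C-C

same = cosine(g1, g2)       # identical edge multiset, so the encodings coincide
different = cosine(g1, g3)  # unrelated edge types give a near-orthogonal encoding
```

Classification in this style typically compares a graph's hypervector with per-class prototype vectors by similarity, which involves only additions, multiplications, and dot products.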
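[AI-42] UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
[Quick read]: This paper addresses the memory limits and contention for shared compute that arise when deploying large language models (LLMs) on mobile devices, with particular attention to the uncertainty that the current device workload introduces into resource availability. The proposed solution, UniQL, is a unified post-training quantization and low-rank compression framework. Its main contributions are: (1) an efficient structured weight-sorting method that speeds up computation by 20x; (2) a quantization-aware singular value decomposition (SVD) that minimizes quantization error; (3) state-aware weight sorting for state space models (SSMs); and (4) a fused rotary positional embedding (RoPE) kernel for pruned models. The framework performs weight sorting, fine-tuning, and quantization in the cloud in a single pass and allows the pruning rate to be configured on-device (up to 35%). It achieves a 4x-5.7x memory reduction and a 2.7x-3.4x token-throughput improvement while keeping accuracy within 5% of the original models at 15% pruning.
Link: https://arxiv.org/abs/2512.03383
Authors: Hung-Yueh Chiang,Chi-Chih Chang,Yu-Chen Lu,Chien-Yu Lin,Kai-Chiang Wu,Mohamed S. Abdelfattah,Diana Marculescu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: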
Abstract:Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: this https URL.
[AI-43] Single-Round Scalable Analytic Federated Learning
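[Quick read]: This paper targets two core challenges in federated learning (FL): high communication overhead and performance collapse on non-IID data. Analytic FL (AFL) offers single-round communication and invariance to the data distribution but only applies to linear models, while later non-linear methods such as DeepAFL improve accuracy at the cost of the single-round property. The proposed framework, SAFLe, introduces a structured head of bucketed features together with sparse, grouped embeddings to obtain scalable non-linear expressivity. The authors prove that this architecture is mathematically equivalent to a high-dimensional linear regression, so SAFLe can reuse AFL's single-shot, invariant aggregation law. This equivalence removes the trade-off between non-linearity and single-round communication; SAFLe outperforms both linear AFL and multi-round DeepAFL and sets a new state of the art for analytic FL on federated vision tasks.
Link: https://arxiv.org/abs/2512.03336
Authors: Alan T. L. Bacellar,Mustafa Munir,Felipe M. G. França,Priscila M. V. Lima,Radu Marculescu,Lizy K. John
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: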
备注:
Abstract:Federated Learning (FL) is plagued by two key challenges: high communication overhead and performance collapse on heterogeneous (non-IID) data. Analytic FL (AFL) provides a single-round, data distribution invariant solution, but is limited to linear models. Subsequent non-linear approaches, like DeepAFL, regain accuracy but sacrifice the single-round benefit. In this work, we break this trade-off. We propose SAFLe, a framework that achieves scalable non-linear expressivity by introducing a structured head of bucketed features and sparse, grouped embeddings. We prove this non-linear architecture is mathematically equivalent to a high-dimensional linear regression. This key equivalence allows SAFLe to be solved with AFL’s single-shot, invariant aggregation law. Empirically, SAFLe establishes a new state-of-the-art for analytic FL, significantly outperforming both linear AFL and multi-round DeepAFL in accuracy across all benchmarks, demonstrating a highly efficient and scalable solution for federated vision.
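Why a linear regression can be solved in one communication round, regardless of how data is split across clients, follows from a standard least-squares identity: the Gram matrix and moment vector of the pooled data are the sums of the per-client ones. The sketch below illustrates that identity for a two-feature ridge regression. It is an assumption-laden illustration, not the aggregation law from the AFL or SAFLe papers, whose exact form may differ; `local_stats`, `aggregate`, `solve_ridge`, and the toy client data are hypothetical.

```python
def local_stats(X, y):
    """Compute a client's sufficient statistics: Gram matrix X^T X and X^T y."""
    gram = [[0.0, 0.0], [0.0, 0.0]]
    moment = [0.0, 0.0]
    for (x0, x1), target in zip(X, y):
        gram[0][0] += x0 * x0
        gram[0][1] += x0 * x1
        gram[1][0] += x1 * x0
        gram[1][1] += x1 * x1
        moment[0] += x0 * target
        moment[1] += x1 * target
    return gram, moment

def aggregate(stats):
    """Server step: sum the clients' statistics element-wise (one round)."""
    gram = [[0.0, 0.0], [0.0, 0.0]]
    moment = [0.0, 0.0]
    for g, m in stats:
        for i in range(2):
            moment[i] += m[i]
            for j in range(2):
                gram[i][j] += g[i][j]
    return gram, moment

def solve_ridge(gram, moment, lam=1e-8):
    """Solve (G + lam*I) w = m for two features using Cramer's rule."""
    a, b = gram[0][0] + lam, gram[0][1]
    c, d = gram[1][0], gram[1][1] + lam
    det = a * d - b * c
    return [(d * moment[0] - b * moment[1]) / det,
            (a * moment[1] - c * moment[0]) / det]

# Two clients with deliberately different (non-IID) feature ranges.
# Targets follow y = 2*x0 - 1*x1 exactly.
client_a = ([(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)], [2.0, 4.0, 5.0])
client_b = ([(0.0, 1.0), (0.0, 2.0), (1.0, 3.0)], [-1.0, -2.0, -1.0])

federated_w = solve_ridge(*aggregate([local_stats(*client_a),
                                      local_stats(*client_b)]))
central_w = solve_ridge(*local_stats(client_a[0] + client_b[0],
                                     client_a[1] + client_b[1]))
```

The federated and centralized solutions coincide because the server only ever sees sums, so the way records are partitioned among clients cannot change the result.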
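[AI-44] Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
[Quick read]: This paper addresses the memory and compute bottlenecks of long-horizon LLM inference caused by the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing approaches such as quantization, offloading, or heuristic KV eviction either incur high orchestration costs or rely on unreliable attention-based proxies of importance. The proposed method, TRIM-KV, learns each token's intrinsic importance at creation time through a lightweight retention gate. The gate predicts a scalar retention score that decays over time and reflects the token's long-term utility for a specific layer and head; when the memory budget is exceeded, low-scoring tokens are evicted first so that the cache keeps the most critical information. Training distills from a frozen LLM and adds a capacity loss, so only the gates are fine-tuned and inference overhead is negligible. TRIM-KV outperforms eviction and learnable-retrieval baselines, especially in low-memory regimes, and in some settings even surpasses full-cache models, which suggests that selective retention can act as a form of regularization.
Link: https://arxiv.org/abs/2512.03324
Authors: Ngoc Bui,Shubham Sharma,Simran Lamba,Saumitra Mishra,Rex Ying
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: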
Abstract:Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token’s intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
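The eviction rule described in the abstract (score each token once, let the score decay with age, drop the lowest when over budget) can be sketched in a few lines. The version below is a simplified illustration under explicit assumptions: the decay is taken to be exponential in token age (`score ** age`), the scores are hand-set rather than predicted by a learned gate, and a single cache stands in for the per-layer, per-head caches. The names `decayed_retention` and `evict_to_budget` are hypothetical.

```python
def decayed_retention(score, born, now):
    """Retention at time `now` for a token created at `born`.

    Assumption: exponential decay, i.e. score raised to the token's age.
    """
    return score ** (now - born)

def evict_to_budget(cache, now, budget):
    """Keep the `budget` entries with the highest decayed retention.

    `cache` is a list of dicts with keys 'token', 'born', and 'score',
    where 'score' lies in (0, 1). Survivors are returned in creation order.
    """
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(
        cache,
        key=lambda e: decayed_retention(e["score"], e["born"], now),
        reverse=True,
    )
    return sorted(ranked[:budget], key=lambda e: e["born"])

# Hand-set scores for illustration only; a learned gate would produce these.
cache = [
    {"token": "<bos>", "born": 0, "score": 0.999},
    {"token": "the", "born": 1, "score": 0.5},
    {"token": "42", "born": 2, "score": 0.95},
    {"token": "um", "born": 3, "score": 0.3},
    {"token": "answer", "born": 4, "score": 0.9},
]

kept = evict_to_budget(cache, now=5, budget=3)
kept_tokens = [entry["token"] for entry in kept]
```

With these toy scores, the old but high-scoring `<bos>` token survives while the low-scoring filler tokens are evicted, which loosely mirrors the sink-token behavior the abstract says emerges from learned scores.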
[AI-45] Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia NEURIPS
[Quick read]: This paper addresses the lack of an effective way to evaluate how well LLM agents cooperate in zero-shot, mixed-motive environments. Existing evaluations cannot measure how an agent's cooperative ability generalizes to novel social situations, which limits reliable deployment in real multi-agent settings. The proposed evaluation method builds on Concordia, a natural-language multi-agent simulation environment, and quantifies general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. Empirical results from the NeurIPS 2024 Concordia Contest reveal significant capability gaps in current LLM agents on tasks that demand persuasion and norm enforcement, and provide a quantifiable benchmark for improving the robustness of multi-agent cooperation.
Link: https://arxiv.org/abs/2512.03318
Authors: Chandler Smith,Marwa Abdulhai,Manfred Diaz,Marko Tesic,Rakshit S. Trivedi,Alexander Sasha Vezhnevets,Lewis Hammond,Jesse Clifton,Minsuk Chang,Edgar A. Duéñez-Guzmán,John P. Agapiou,Jayd Matyas,Danny Karmon,Akash Kundu,Aliaksei Korshuk,Ananya Ananya,Arrasy Rahman,Avinaash Anand Kulandaivel,Bain McHale,Beining Zhang,Buyantuev Alexander,Carlos Saith Rodriguez Rojas,Caroline Wang,Chetan Talele,Chenao Liu,Chichen Lin,Diana Riazi,Di Yang Shi,Emanuel Tewolde,Elizaveta Tennant,Fangwei Zhong,Fuyang Cui,Gang Zhao,Gema Parreño Piqueras,Hyeonggeun Yun,Ilya Makarov,Jiaxun Cui,Jebish Purbey,Jim Dilkes,Jord Nguyen,Lingyun Xiao,Luis Felipe Giraldo,Manuela Chacon-Chamorro,Manuel Sebastian Rios Beltran,Marta Emili García Segura,Mengmeng Wang,Mogtaba Alim,Nicanor Quijano,Nico Schiavone,Olivia Macmillan-Scott,Oswaldo Peña,Peter Stone,Ram Mohan Rao Kadiyala,Rolando Fernandez,Ruben Manrique,Sunjia Lu,Sheila A. McIlraith,Shamika Dhuri,Shuqing Shi,Siddhant Gupta,Sneheel Sarangi,Sriram Ganapathi Subramanian,Taehun Cha,Toryn Q. Klassen,Wenming Tu,Weijian Fan,Wu Ruiyang,Xue Feng,Yali Du,Yang Liu,Yiding Wang,Yipeng Kang,Yoonchang Sung,Yuxuan Chen,Zhaowei Zhang,Zhihan Wang,Zhiqiang Wu,Ziang Chen,Zilong Zheng,Zixia Jia,Ziyan Wang,Dylan Hadfield-Menell,Natasha Jaques,Tim Baarslag,Jose Hernandez-Orallo,Joel Z. Leibo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published at NeurIPS Datasets and Benchmarks 2025, 10 pages
Abstract:Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent’s ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
[AI-46] Retrofitting Earth System Models with Cadence-Limited Neural Operator Updates
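[Quick read]: This paper addresses biases in Earth-system model (ESM) predictions that stem from coarse resolution, imperfect parameterizations, and uncertain initial states and forcings. Traditional data assimilation improves constrained simulations but helps little once the model runs freely. The proposed operator-learning framework maps instantaneous model states to bias-correction tendencies and applies them online during integration. Building on a U-Net backbone, it introduces two operator architectures, Inception U-Net (IUNet) and a multi-scale network (M&M), which combine diverse upsampling strategies and receptive fields to capture multiscale nonlinear features under E3SM runtime constraints. The result is a consistent bias reduction across variables and vertical levels while maintaining long-term stability and computational feasibility.
Link: https://arxiv.org/abs/2512.03309
Authors: Aniruddha Bora,Shixuan Zhang,Khemraj Shukla,Bryce Harrop,George Em. Karniadakis,L. Ruby Leung
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
Comments: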
Abstract:Coarse resolution, imperfect parameterizations, and uncertain initial states and forcings limit Earth-system model (ESM) predictions. Traditional bias correction via data assimilation improves constrained simulations but offers limited benefit once models run freely. We introduce an operator-learning framework that maps instantaneous model states to bias-correction tendencies and applies them online during integration. Building on a U-Net backbone, we develop two operator architectures, Inception U-Net (IUNet) and a multi-scale network (M&M), that combine diverse upsampling and receptive fields to capture multiscale nonlinear features under Energy Exascale Earth System Model (E3SM) runtime constraints. Trained on two years of E3SM simulations nudged toward ERA5 reanalysis, the operators generalize across height levels and seasons. Both architectures outperform standard U-Net baselines in offline tests, indicating that functional richness rather than parameter count drives performance. In online hybrid E3SM runs, M&M delivers the most consistent bias reductions across variables and vertical levels. The ML-augmented configurations remain stable and computationally feasible in multi-year simulations, providing a practical pathway for scalable hybrid modeling. Our framework emphasizes long-term stability, portability, and cadence-limited updates, demonstrating the utility of expressive ML operators for learning structured, cross-scale relationships and retrofitting legacy ESMs.
[AI-47] Robust Tabular Foundation Models AAAI2026
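[Quick read]: This paper addresses how to make tabular foundation models (TFMs) for structured data more robust and better performing, especially on complex or edge-case datasets, by exploiting the fact that they can be pretrained on synthetic data. The proposed Robust Tabular Foundation Models (RTFM) is a model-agnostic adversarial training framework: it parameterizes the generator distribution and introduces an optimality gap, the difference between TFM performance and the best achievable performance as estimated by strong baselines such as XGBoost, CatBoost, and Random Forests, to adapt synthetic data generation toward regions where the model is weak. Experiments show that, with fewer than 100k additional synthetic datasets, RTFM raises the mean normalized AUC of the TabPFN V2 classifier by up to 6%, supporting adversarial training on synthetic data alone as a way to improve TFMs.
Link: https://arxiv.org/abs/2512.03307
Authors: Matthew Peroni,Franck Le,Vadim Sheinin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Shaping Responsible Synthetic Data in the Era of Foundation Models, AAAI 2026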
Abstract:The development of tabular foundation models (TFMs) has accelerated in recent years, showing strong potential to outperform traditional ML methods for structured data. A key finding is that TFMs can be pretrained entirely on synthetic datasets, opening opportunities to design data generators that encourage desirable model properties. Prior work has mainly focused on crafting high-quality priors over generators to improve overall pretraining performance. Our insight is that parameterizing the generator distribution enables an adversarial robustness perspective: during training, we can adapt the generator to emphasize datasets that are particularly challenging for the model. We formalize this by introducing an optimality gap measure, given by the difference between TFM performance and the best achievable performance as estimated by strong baselines such as XGBoost, CatBoost, and Random Forests. Building on this idea, we propose Robust Tabular Foundation Models (RTFM), a model-agnostic adversarial training framework. Applied to the TabPFN V2 classifier, RTFM improves benchmark performance, with up to a 6% increase in mean normalized AUC over the original TabPFN and other baseline algorithms, while requiring less than 100k additional synthetic datasets. These results highlight a promising new direction for targeted adversarial training and fine-tuning of TFMs using synthetic data alone.
[AI-48] HydroDCM: Hydrological Domain-Conditioned Modulation for Cross-Reservoir Inflow Prediction AAAI2026
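[Quick read]: This paper addresses the performance drop that deep learning models suffer in cross-reservoir inflow prediction because of distributional differences between reservoirs (the domain shift problem). Conventional domain generalization (DG) methods struggle with the unique inflow patterns of individual reservoirs and with the indirect influence of spatial metadata such as geographic location, which limits their use in many-reservoir settings. The proposed framework, HydroDCM, uses reservoir spatial metadata to construct pseudo-domain labels that guide adversarial learning of invariant temporal features. At inference time, lightweight conditioning layers adapt these features using the target reservoir's metadata, balancing domain invariance against location-specific adaptation.
Link: https://arxiv.org/abs/2512.03300
Authors: Pengfei Hu,Fan Ming,Xiaoxue Han,Chang Lu,Yue Ning,Dan Lu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 workshop (oral) on AI for Environmental Science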
Abstract:Deep learning models have shown promise in reservoir inflow prediction, yet their performance often deteriorates when applied to different reservoirs due to distributional differences, referred to as the domain shift problem. Domain generalization (DG) solutions aim to address this issue by extracting domain-invariant representations that mitigate errors in unseen domains. However, in hydrological settings, each reservoir exhibits unique inflow patterns, while some metadata beyond observations, such as spatial information, exerts indirect but significant influence. This mismatch limits the applicability of conventional DG techniques to many-domain hydrological systems. To overcome these challenges, we propose HydroDCM, a scalable DG framework for cross-reservoir inflow forecasting. Spatial metadata of reservoirs is used to construct pseudo-domain labels that guide adversarial learning of invariant temporal features. During inference, HydroDCM adapts these features through lightweight conditioning layers informed by the target reservoir’s metadata, reconciling DG’s invariance with location-specific adaptation. Experimental results on 30 real-world reservoirs in the Upper Colorado River Basin demonstrate that our method substantially outperforms state-of-the-art DG baselines under many-domain conditions and remains computationally efficient.
[AI-49] Adaptive Regime-Switching Forecasts with Distribution-Free Uncertainty: Deep Switching State-Space Models Meet Conformal Prediction
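[Quick read]: This paper addresses the nonstationarity that regime transitions introduce into time series forecasting, which makes uncertainty quantification as important as point accuracy. The approach couples Deep Switching State Space Models with Adaptive Conformal Inference (ACI) and its aggregated variant (AgACI), and adds a unified conformal wrapper that can sit on top of several strong baselines (S4, MC-Dropout GRU, sparse Gaussian processes, and a change-point local model) to produce online predictive bands with finite-sample marginal guarantees under nonstationarity and model misspecification. On synthetic and real datasets, the conformalized forecasters achieve near-nominal coverage, competitive accuracy, and generally improved band efficiency.
Link: https://arxiv.org/abs/2512.03298
Authors: Echo Diyun LU,Charles Findling,Marianne Clausel,Alessandro Leite,Wei Gong,Pierric Kersaudy
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: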
Abstract:Regime transitions routinely break stationarity in time series, making calibrated uncertainty as important as point accuracy. We study distribution-free uncertainty for regime-switching forecasting by coupling Deep Switching State Space Models with Adaptive Conformal Inference (ACI) and its aggregated variant (AgACI). We also introduce a unified conformal wrapper that sits atop strong sequence baselines including S4, MC-Dropout GRU, sparse Gaussian processes, and a change-point local model to produce online predictive bands with finite-sample marginal guarantees under nonstationarity and model misspecification. Across synthetic and real datasets, conformalized forecasters achieve near-nominal coverage with competitive accuracy and generally improved band efficiency.
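The core of Adaptive Conformal Inference is a one-line online update of the miscoverage level: after each step, alpha_t moves by gamma * (alpha - err_t), where err_t is 1 if the last interval missed the target and 0 otherwise. The sketch below wraps that update around a deliberately trivial forecaster (always predicting zero) on a synthetic series whose noise scale switches halfway through. It is a minimal illustration and not the paper's pipeline: the forecaster, the absolute-residual score, the warm-up length, and the names `empirical_quantile` and `aci_run` are assumptions made for this example.

```python
import math
import random

def empirical_quantile(values, q):
    """Return the q-th empirical quantile of `values`.

    q >= 1 yields +inf (interval covers everything);
    q <= 0 yields -inf (interval is empty).
    """
    if q >= 1.0:
        return math.inf
    if q <= 0.0:
        return -math.inf
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]

def aci_run(predictions, targets, alpha=0.1, gamma=0.05, warmup=20):
    """Run ACI over a sequence and return the per-step coverage indicators."""
    alpha_t = alpha
    scores = []   # past absolute residuals
    covered = []  # True where the interval contained the target
    for y_hat, y in zip(predictions, targets):
        score = abs(y - y_hat)
        if len(scores) >= warmup:
            radius = empirical_quantile(scores, 1.0 - alpha_t)
            hit = score <= radius
            covered.append(hit)
            err = 0.0 if hit else 1.0
            # ACI update: widen after a miss, tighten after a hit.
            alpha_t += gamma * (alpha - err)
        scores.append(score)
    return covered

rng = random.Random(0)
# Regime 1: noise std 1.0; regime 2: noise std 3.0.
targets = [rng.gauss(0.0, 1.0) for _ in range(1000)]
targets += [rng.gauss(0.0, 3.0) for _ in range(1000)]
predictions = [0.0] * len(targets)

covered = aci_run(predictions, targets, alpha=0.1, gamma=0.05, warmup=20)
coverage = sum(covered) / len(covered)
```

Because the update is a feedback loop on realized errors, long-run coverage stays close to the nominal 90% even though the noise regime changes and the forecaster never adapts.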
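[AI-50] Prior preferences in active inference agents: soft, hard, and goal shaping
[Quick read]: This paper addresses how the preference distribution should be specified in the active inference framework, that is, how to define the target distribution an agent pursues in its environment, and how different specifications affect inference and learning. The authors systematically compare four definitions that vary along two axes: hard versus soft goals, and the presence or absence of goal shaping (intermediate goals). Experiments show that goal shaping yields the best overall performance on a navigation task (it promotes exploitation) but weakens the agent's learning of the environment's transition dynamics (it hampers exploration).
Link: https://arxiv.org/abs/2512.03293
Authors: Filippo Torresan,Ryota Kanai,Manuel Baltieri
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 41 pages, 23 figures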
Abstract:Active inference proposes expected free energy as an objective for planning and decision-making to adequately balance exploitative and explorative drives in learning agents. The exploitative drive, or what an agent wants to achieve, is formalised as the Kullback-Leibler divergence between a variational probability distribution, updated at each inference step, and a preference probability distribution that indicates what states or observations are more likely for the agent, hence determining the agent’s goal in a certain environment. In the literature, the questions of how the preference distribution should be specified and of how a certain specification impacts inference and learning in an active inference agent have been given hardly any attention. In this work, we consider four possible ways of defining the preference distribution, either providing the agents with hard or soft goals and either involving or not goal shaping (i.e., intermediate goals). We compare the performances of four agents, each given one of the possible preference distributions, in a grid world navigation task. Our results show that goal shaping enables the best performance overall (i.e., it promotes exploitation) while sacrificing learning about the environment’s transition dynamics (i.e., it hampers exploration).
[AI-51] BlendedNet++: A Large-Scale Blended Wing Body Aerodynamics Dataset and Benchmark
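[Quick read]: This paper addresses the scarcity of large, field-resolved datasets, which limits machine-learning aerodynamic surrogates in accurate pointwise prediction and reproducible inverse design. The authors build BlendedNet++, a large-scale aerodynamic dataset of more than 12,000 blended wing body (BWB) aircraft geometries, each simulated at a single flight condition with steady Reynolds-averaged Navier-Stokes (RANS) CFD. Each case provides integrated force and moment coefficients (CL, CD, CM) and dense surface fields of pressure coefficient (Cp) and skin friction coefficients (Cfx, Cfy, Cfz). On this dataset they standardize a forward-surrogate benchmark across six model families (including GraphSAGE, PointNet, and GNOT) and define an inverse design task, reaching a specified lift-to-drag ratio under fixed flight conditions, implemented with a conditional diffusion model and compared against gradient-based optimization and a diffusion-optimization hybrid. The result is a unified, reproducible evaluation protocol for field-level aerodynamic modeling and inverse design.
Link: https://arxiv.org/abs/2512.03280
Authors: Nicholas Sung,Steven Spreizer,Mohamed Elrefaie,Matthew C. Jones,Faez Ahmed
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: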
Abstract:Despite progress in machine learning-based aerodynamic surrogates, the scarcity of large, field-resolved datasets limits progress on accurate pointwise prediction and reproducible inverse design for aircraft. We introduce BlendedNet++, a large-scale aerodynamic dataset and benchmark focused on blended wing body (BWB) aircraft. The dataset contains over 12,000 unique geometries, each simulated at a single flight condition, yielding 12,490 aerodynamic results for steady RANS CFD. For every case, we provide (i) integrated force/moment coefficients CL, CD, CM and (ii) dense surface fields of pressure and skin friction coefficients Cp and (Cfx, Cfy, Cfz). Using this dataset, we standardize a forward-surrogate benchmark to predict pointwise fields across six model families: GraphSAGE, GraphUNet, PointNet, a coordinate Transformer (Transolver-style), a FiLMNet (coordinate MLP with feature-wise modulation), and a Graph Neural Operator Transformer (GNOT). Finally, we present an inverse design task of achieving a specified lift-to-drag ratio under fixed flight conditions, implemented via a conditional diffusion model. To assess performance, we benchmark this approach against gradient-based optimization on the same surrogate and a diffusion-optimization hybrid that first samples with the conditional diffusion model and then further optimizes the designs. BlendedNet++ provides a unified forward and inverse protocol with multi-model baselines, enabling fair, reproducible comparison across architectures and optimization paradigms. We expect BlendedNet++ to catalyze reproducible research in field-level aerodynamics and inverse design; resources (dataset, splits, baselines, and scripts) will be released upon acceptance.
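[AI-52] Thucy: An LLM-based Multi-Agent System for Claim Verification across Relational Databases AAAI2026
[Quick read]: This paper addresses automated fact verification over structured data: identifying and verifying claims in settings that span multiple sources, databases, and tables. Existing verification systems are limited to small, single-table databases (typically a few hundred rows) and cannot handle realistic tasks that involve multiple heterogeneous data sources. The proposed system, Thucy, is the first cross-database, cross-table multi-agent claim verification system. Its key features are: (1) it needs no prior knowledge of the data sources and autonomously discovers, inspects, and reasons over all available relational databases; (2) it reports the SQL queries that support each verdict as evidence, making results transparent and auditable; and (3) it reaches 94.3% accuracy on the TabFact benchmark, clearly ahead of the previous state of the art (88.7%).
Link: https://arxiv.org/abs/2512.03278
Authors: Michael Theologitis,Dan Suciu
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026 Workshop on LLM-based Multi-Agent Systems (LaMAS)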
Abstract:In today’s age, it is becoming increasingly difficult to decipher truth from lies. Every day, politicians, media outlets, and public figures make conflicting claims, often about topics that can, in principle, be verified against structured data. For instance, statements about crime rates, economic growth or healthcare can all be verified against official public records and structured datasets. Building a system that can automatically do that would have sounded like science fiction just a few years ago. Yet, with the extraordinary progress in LLMs and agentic AI, this is now within reach. Still, there remains a striking gap between what is technically possible and what is being demonstrated by recent work. Most existing verification systems operate only on small, single-table databases (typically a few hundred rows) that conveniently fit within an LLM’s context window. In this paper we report our progress on Thucy, the first cross-database, cross-table multi-agent claim verification system that also provides concrete evidence for each verification verdict. Thucy remains completely agnostic to the underlying data sources before deployment and must therefore autonomously discover, inspect, and reason over all available relational databases to verify claims. Importantly, Thucy also reports the exact SQL queries that support its verdict (whether the claim is accurate or not) offering full transparency to expert users familiar with SQL. When evaluated on the TabFact dataset, the standard benchmark for fact verification over structured data, Thucy surpasses the previous state of the art by 5.6 percentage points in accuracy (94.3% vs. 88.7%).
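[AI-53] When Do Symbolic Solvers Enhance Reasoning in Large Language Models?
[Quick read]: This paper addresses the substantial token overhead, and occasional wrong answers, that large reasoning models (LRMs) incur when they generate overly long chains of thought (CoT) on complex reasoning tasks. It studies the symbolic-solver-integrated approach, which uses the code generation ability of LLMs to translate a reasoning task into executable code and hands that code to a symbolic solver. Experiments show that this approach helps only when the problem requires limited implicit reasoning but has a large search space. It yields significant gains on constraint satisfaction problems, and when a declarative exemplar is provided, even a smaller model such as CodeLlama-13B can outperform GPT-4o on difficult Zebra puzzles.
Link: https://arxiv.org/abs/2512.03272
Authors: Zhiyuan He,Dingmin Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: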
Abstract:Large Reasoning Models (LRMs) achieve strong performance on complex reasoning tasks by generating long Chains of Thought (CoTs). However, this paradigm might incur substantial token overhead, especially when models “overthink” by producing lengthy reasoning chains, which can even lead to incorrect answers. A promising direction is the symbolic-solver-integrated approach, which leverages the code generation capabilities of LLMs to translate reasoning tasks into executable code and then solve them with a symbolic solver. In this paper, we explore an open question of when the conventional long-CoT can be enhanced by symbolic solvers. Our experimental results show that the symbolic-solver-integrated method only helps when the problem requires limited implicit reasoning but involves an ample search space. The latest LLMs, like GPT-4o, show better performance on deductive problems with shallow reasoning depth, while the symbolic-solver-integrated method significantly improves the LLMs’ performance in constraint satisfaction problems that require repeated backtracks. When a declarative exemplar is provided, even CodeLlama-13B can outperform GPT-4o in difficult Zebra puzzles.
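To make the "translate the puzzle into declarative code, then let a solver search" idea concrete, the sketch below states a tiny Zebra-style puzzle as constraints and finds the assignments that satisfy them. It is an illustration under stated assumptions: the three-house puzzle and its clues are invented for this example, and plain exhaustive enumeration with `itertools.permutations` stands in for the symbolic solver that a pipeline of the kind the abstract describes would call.

```python
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")

def satisfies(colors, pets):
    """Check the puzzle's clues for one assignment of colors and pets.

    `colors[i]` and `pets[i]` describe house i (0 is the leftmost house).
    """
    pos_color = {c: i for i, c in enumerate(colors)}
    pos_pet = {p: i for i, p in enumerate(pets)}
    return (
        # Clue 1: the red house is immediately left of the green house.
        pos_color["red"] + 1 == pos_color["green"]
        # Clue 2: the dog lives in the blue house.
        and pos_pet["dog"] == pos_color["blue"]
        # Clue 3: the cat lives in the middle house.
        and pos_pet["cat"] == 1
        # Clue 4: the fish lives in the red house.
        and pos_pet["fish"] == pos_color["red"]
    )

def solve():
    """Enumerate every assignment and keep those that satisfy all clues."""
    return [
        (colors, pets)
        for colors in permutations(COLORS)
        for pets in permutations(PETS)
        if satisfies(colors, pets)
    ]

solutions = solve()
```

The clues are written declaratively and the search does the backtracking, so the model only has to translate the puzzle correctly. That matches the regime in which the abstract reports the approach helps: limited implicit reasoning, large search space.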
[AI-54] Learning Network Sheaves for AI-native Semantic Communication
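[Quick read]: This paper addresses a key challenge in communication among heterogeneous AI agents: exchanging compressed latent-space representations while reducing semantic noise and preserving task-relevant meaning. The approach jointly learns the communication topology and the alignment maps between agents, yielding a network sheaf equipped with orthogonal maps. A semantic denoising and compression module builds a shared global semantic space and derives sparse, structured representations of each agent's latent space. The procedure is formulated as a nonconvex dictionary learning problem and solved iteratively with closed-form updates. It improves semantic alignment across agents and downstream task accuracy, and exposes the structure of semantic heterogeneity among agents.
Link: https://arxiv.org/abs/2512.03248
Authors: Enrico Grimaldi,Mario Edoardo Pandolfo,Gabriele D’Acunto,Sergio Barbarossa,Paolo Di Lorenzo
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: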
Abstract:Recent advances in AI call for a paradigm shift from bit-centric communication to goal- and semantics-oriented architectures, paving the way for AI-native 6G networks. In this context, we address a key open challenge: enabling heterogeneous AI agents to exchange compressed latent-space representations while mitigating semantic noise and preserving task-relevant meaning. We cast this challenge as learning both the communication topology and the alignment maps that govern information exchange among agents, yielding a learned network sheaf equipped with orthogonal maps. This learning process is further supported by a semantic denoising and compression module that constructs a shared global semantic space and derives sparse, structured representations of each agent’s latent space. This corresponds to a nonconvex dictionary learning problem solved iteratively with closed-form updates. Experiments with multiple AI agents pre-trained on real image data show that the semantic denoising and compression facilitates the alignment of AI agents and the extraction of semantic clusters, while preserving high accuracy in downstream tasks. The resulting communication network provides new insights about semantic heterogeneity across agents, highlighting the interpretability of our methodology.
[AI-55] How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy
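[Quick read]: This paper addresses how to obtain and use high-quality data grounded in real user interactions to improve AI systems while protecting user privacy. Publicly available data is often unrepresentative of real user behavior, and using raw user data directly carries serious privacy risks. The paper focuses on differentially private (DP) synthetic data generation: synthetic data that preserves the overall trends and statistical properties of the source data while giving strong privacy guarantees to the individuals who contributed to it. Such data can replace the use of sensitive datasets that were previously protected only by ad hoc rule-based anonymization, and can make datasets usable that were previously inaccessible because of privacy concerns.
Link: https://arxiv.org/abs/2512.03238
Authors: Natalia Ponomareva,Zheng Xu,H. Brendan McMahan,Peter Kairouz,Lucas Rosenblatt,Vincent Cohen-Addad,Cristóbal Guzmán,Ryan McKenna,Galen Andrew,Alex Bie,Da Yu,Alex Kurakin,Morteza Zadimoghaddam,Sergei Vassilvitskii,Andreas Terzis
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: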
Abstract:High quality data is needed to unlock the full potential of AI for end users. However, finding new sources of such data is getting harder: most publicly-available human generated data will soon have been used. Additionally, publicly available data often is not representative of users of a particular system – for example, a research speech dataset of contractors interacting with an AI assistant will likely be more homogeneous, well articulated and self-censored than real world commands that end users will issue. Therefore unlocking high-quality data grounded in real user interactions is of vital interest. However, the direct use of user data comes with significant privacy risks. Differential Privacy (DP) is a well established framework for reasoning about and limiting information leakage, and is a gold standard for protecting user privacy. The focus of this work, Differentially Private Synthetic data, refers to synthetic data that preserves the overall trends of source data, while providing strong privacy guarantees to individuals that contributed to the source dataset. DP synthetic data can unlock the value of datasets that have previously been inaccessible due to privacy concerns and can replace the use of sensitive datasets that previously have only had rudimentary protections like ad-hoc rule-based anonymization. In this paper we explore the full suite of techniques surrounding DP synthetic data, the types of privacy protections they offer and the state-of-the-art for various modalities (image, tabular, text and decentralized). We outline all the components needed in a system that generates DP synthetic data, from sensitive data handling and preparation, to tracking the use and empirical privacy testing. We hope that this work will result in increased adoption of DP synthetic data, spur additional research and increase trust in DP synthetic data approaches.
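One of the simplest ways to produce DP synthetic data for a single categorical column is to release a noisy histogram with the Laplace mechanism and then sample from it. The sketch below shows that baseline. It is a minimal illustration rather than a technique taken from the paper, and it rests on explicit assumptions: the category list is fixed and public, each individual contributes exactly one record (so adding or removing a person changes one count by 1), and the number of synthetic records is chosen by the analyst rather than read from the data. The names `laplace_noise`, `dp_histogram`, and `sample_synthetic` are hypothetical.

```python
import random

def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_histogram(records, categories, epsilon, rng):
    """Release per-category counts with epsilon-DP via the Laplace mechanism.

    Assumes add/remove-one-record neighbors, so the histogram's
    L1 sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    counts = {c: 0 for c in categories}
    for record in records:
        counts[record] += 1
    scale = 1.0 / epsilon
    # Clipping at zero is post-processing and does not weaken the guarantee.
    return {c: max(0.0, n + laplace_noise(rng, scale))
            for c, n in counts.items()}

def sample_synthetic(noisy_counts, n, rng):
    """Draw n synthetic records in proportion to the noisy counts."""
    categories = list(noisy_counts)
    weights = [noisy_counts[c] for c in categories]
    if sum(weights) <= 0:
        weights = [1.0] * len(categories)  # fall back to uniform sampling
    return rng.choices(categories, weights=weights, k=n)

rng = random.Random(0)
records = ["a"] * 700 + ["b"] * 200 + ["c"] * 100
noisy = dp_histogram(records, ["a", "b", "c"], epsilon=1.0, rng=rng)
synthetic = sample_synthetic(noisy, 1000, rng)
```

Everything after the noisy histogram is post-processing, so the synthetic sample inherits the same epsilon. Real systems for text or images use far richer generators, but the structure (privatize a model of the data, then sample) is the same.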
[AI-56] Plantain: Plan-Answer Interleaved Reasoning
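[Quick read]: This paper addresses a user-experience problem with the think-then-answer paradigm: the model reasons for a long time before producing a final answer, and during that time the user cannot see whether the reasoning is on track or correct a false premise, which wastes time. The proposed interleaved reasoning (IR) has the model alternate between thinking and surfacing intermediate responses, giving the user useful information earlier and reducing perceived latency without lowering the quality of the final response. Plantain (Plan-Thought-Answer Interleaving) specializes IR so that the first intermediate response is an explicit step-by-step plan for the task, which allows early user intervention and feedback. Experiments show roughly a 6% improvement in pass@1 on several math reasoning and coding benchmarks and a reduction of more than 60% in time-to-first-response.
Link: https://arxiv.org/abs/2512.03176
Authors: Anthony Liang,Jonathan Berant,Adam Fisch,Abhimanyu Goyal,Kalpesh Krishna,Jacob Eisenstein
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: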
Abstract:Reasoning models often spend a significant amount of time thinking before they generate a visible response. In the meantime, they do not give the user any hints as to whether their reasoning is on the right track, and do not give the user any recourse to stop and correct them if their reasoning is flawed. This creates a frustrating, but unfortunately common, experience: the user’s time is wasted while the model reasons from a false premise that could have easily been corrected. In contrast, human speakers typically perform lightweight, incremental grounding acts to ensure that participants in the conversation are on the same page; here we ask whether language models can learn to leverage a similar type of behavior. With this motivation, we propose interleaved reasoning (IR), in which the model alternates between thinking and surfacing intermediate responses, as an alternative to the standard “think-then-answer” approach. By providing useful information to the user earlier, IR reduces perceived latency, the time a user waits for an initial output, without compromising the quality of the final response. We further introduce a specialization of interleaved reasoning, Plantain (Plan-Thought-Answer Interleaving), where the first intermediate response is an explicit, step-by-step plan for executing the task. This plan-first strategy allows for user intervention and early feedback for subsequent reasoning steps. We demonstrate that Plantain yields an ~6% improvement in pass@1 across several challenging math reasoning and coding benchmarks, while reducing time-to-first-response by over 60% relative to think-then-answer baselines.
[AI-57] Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra NEURIPS2025
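【Quick Read】: This paper targets the manual, time-consuming, and expertise-dependent nature of small-molecule structure elucidation, in particular the NMR spectrum interpretation bottleneck in natural product and clinical drug discovery. The key to the solution is the ChefNMR framework, which builds an atomic diffusion model on a non-equivariant Transformer architecture and casts structure elucidation as conditional generation from 1D NMR spectra and the chemical formula. Trained on a simulated 1D NMR dataset covering more than 111,000 natural products, it predicts the structures of complex natural products with over 65% accuracy, a significant step toward automating small-molecule structure elucidation.
Link: https://arxiv.org/abs/2512.03127
Authors: Ziyu Xiong,Yichi Zhang,Foyez Alauddin,Chu Xin Cheng,Joon Soo An,Mohammad R. Seyedsayamdost,Ellen D. Zhong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: NeurIPS 2025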
Abstract:Nuclear Magnetic Resonance (NMR) spectroscopy is a cornerstone technique for determining the structures of small molecules and is especially critical in the discovery of novel natural products and clinical therapeutics. Yet, interpreting NMR spectra remains a time-consuming, manual process requiring extensive domain expertise. We introduce ChefNMR (CHemical Elucidation From NMR), an end-to-end framework that directly predicts an unknown molecule’s structure solely from its 1D NMR spectra and chemical formula. We frame structure elucidation as conditional generation from an atomic diffusion model built on a non-equivariant transformer architecture. To model the complex chemical groups found in natural products, we generated a dataset of simulated 1D NMR spectra for over 111,000 natural products. ChefNMR predicts the structures of challenging natural product compounds with an unsurpassed accuracy of over 65%. This work takes a significant step toward solving the grand challenge of automating small-molecule structure elucidation and highlights the potential of deep learning in accelerating molecular discovery. Code is available at this https URL.
[AI-58] Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models NEURIPS2025
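【Quick Read】: This paper addresses catastrophic forgetting when Unified Multimodal Generative Models (UMGMs) continually learn new tasks, covering both intra-modal and inter-modal forgetting. Existing continual learning methods focus mainly on intra-modal forgetting, while inter-modal forgetting remains under-studied. The authors empirically validate inter-modal forgetting and give a theoretical explanation based on gradient conflict. The key to the solution is Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture whose core mechanism is to explicitly decouple the learning updates of different modalities, thereby mitigating gradient conflict between them; it also uses knowledge distillation to preserve pre-trained capabilities and prevent forgetting. Experiments show that MoDE significantly mitigates both inter- and intra-modal forgetting on diverse benchmarks and outperforms prior continual learning baselines.
Link: https://arxiv.org/abs/2512.03125
Authors: Xiwen Wei,Mustafa Munir,Radu Marculescu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025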
Abstract:Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Codes will be publicly available: this https URL
[AI-59] Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models
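【Quick Read】: This paper addresses the assessment of training-data leakage risk in large multimodal language models (MLLMs), specifically the open question of whether existing log-probability-based membership inference attacks (MIAs) remain applicable and effective in multimodal settings. The key contribution is the first systematic extension of text-domain MIAs to vision-and-text (V+T) and text-only (T-only) settings. Experiments on the DeepSeek-VL and InternVL model families show that in in-distribution settings MIA performance is comparable across modality configurations, with a slight V+T advantage, whereas in out-of-distribution settings visual inputs act as regularizers that mask membership signals and weaken the attack.
Link: https://arxiv.org/abs/2512.03121
Authors: Ziyi Tong,Feifei Sun,Le Minh Nguyen
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: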
Abstract:Large Multimodal Language Models (MLLMs) are emerging as one of the foundational tools in an expanding range of applications. Consequently, understanding training-data leakage in these systems is increasingly critical. Log-probability-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs), yet their effect in MLLMs remains unclear. We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings. Our experiments under vision-and-text (V+T) and text-only (T-only) conditions across the DeepSeek-VL and InternVL model families show that in in-distribution settings, logit-based MIAs perform comparably across configurations, with a slight V+T advantage. Conversely, in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals.
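To make "log-probability-based MIA" concrete, here is a minimal sketch of how such an attack scores a sample. It assumes per-token log-probabilities have already been obtained from the target model; the numbers, the threshold, and the Min-K%-style variant are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence

def mean_logprob_score(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability; higher values suggest the text was seen in training."""
    return sum(token_logprobs) / len(token_logprobs)

def min_k_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Average of the lowest k fraction of token log-probs (a Min-K%-style score, assumed variant)."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

def predict_member(score: float, threshold: float) -> bool:
    """Flag the sample as a training member when its score exceeds the threshold."""
    return score > threshold

# Hypothetical per-token log-probs for a memorized and an unseen caption.
seen = [-0.2, -0.1, -0.4, -0.3, -0.2]
unseen = [-2.1, -0.9, -3.5, -1.2, -2.8]

THRESHOLD = -1.0  # hypothetical; in practice calibrated on held-out data
print(predict_member(mean_logprob_score(seen), THRESHOLD))
print(predict_member(mean_logprob_score(unseen), THRESHOLD))
```

In the V+T condition the same score would be computed with the image in the model's context, and in the T-only condition without it; only the source of the log-probabilities changes.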
[AI-60] Beyond Additivity: Sparse Isotonic Shapley Regression toward Nonlinear Explainability
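【Quick Read】: This paper addresses two problems that conventional Shapley values face in feature attribution. First, the standard Shapley framework assumes that the worth function is additive, yet non-Gaussian distributions, heavy tails, feature dependence, or domain-specific loss scales often violate this assumption in practice and distort attributions. Second, in high dimensions, obtaining sparse explanations by computing dense Shapley values and then thresholding is costly and can be inconsistent. The key to the solution is Sparse Isotonic Shapley Regression (SISR), whose core innovation is to jointly optimize nonlinear transformation estimation and an L0 sparsity constraint: isotonic regression learns a monotonic transformation that restores additivity of the worth function without a pre-specified closed form, and normalized hard-thresholding enforces sparsity of the Shapley vector, which improves computational efficiency in high dimensions while retaining global convergence guarantees. Across several model types (regression, logistic regression, tree ensembles), the method filters irrelevant features and stabilizes attributions, going beyond the limits of conventional linear attribution.
Link: https://arxiv.org/abs/2512.03112
Authors: Jialai She
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: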
Abstract:Shapley values, a gold standard for feature attribution in Explainable AI, face two primary challenges. First, the canonical Shapley framework assumes that the worth function is additive, yet real-world payoff constructions–driven by non-Gaussian distributions, heavy tails, feature dependence, or domain-specific loss scales–often violate this assumption, leading to distorted attributions. Secondly, achieving sparse explanations in high dimensions by computing dense Shapley values and then applying ad hoc thresholding is prohibitively costly and risks inconsistency. We introduce Sparse Isotonic Shapley Regression (SISR), a unified nonlinear explanation framework. SISR simultaneously learns a monotonic transformation to restore additivity–obviating the need for a closed-form specification–and enforces an L0 sparsity constraint on the Shapley vector, enhancing computational efficiency in large feature spaces. Its optimization algorithm leverages Pool-Adjacent-Violators for efficient isotonic regression and normalized hard-thresholding for support selection, yielding implementation ease and global convergence guarantees. Analysis shows that SISR recovers the true transformation in a wide range of scenarios and achieves strong support recovery even in high noise. Moreover, we are the first to demonstrate that irrelevant features and inter-feature dependencies can induce a true payoff transformation that deviates substantially from linearity. Experiments in regression, logistic regression, and tree ensembles demonstrate that SISR stabilizes attributions across payoff schemes, correctly filters irrelevant features while standard Shapley values suffer severe rank and sign distortions. By unifying nonlinear transformation estimation with sparsity pursuit, SISR advances the frontier of nonlinear explainability, providing a theoretically grounded and practical attribution framework.
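The two building blocks named in the abstract can be sketched in isolation. The code below is not SISR itself: it shows a textbook Pool-Adjacent-Violators fit and a plain (not normalized) hard-thresholding step, with function names chosen here for illustration.

```python
def pava(y):
    """Pool-Adjacent-Violators: least-squares non-decreasing fit to the sequence y."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

print(pava([1, 3, 2, 4]))
print(hard_threshold([0.5, -2.0, 0.1, 1.5], k=2))
```

In an alternating scheme of the kind the abstract describes, a step like `pava` would update the monotone transformation and a step like `hard_threshold` would select the support of the attribution vector; how SISR normalizes and couples the two is specified in the paper, not here.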
[AI-61] E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing
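【Quick Read】: This paper addresses the lack of statistical guarantees when judging whether the trajectory of an agentic generative AI system will succeed: heuristic scores from black-box verifiers do not provide a reliable basis for decisions and can lead to false alarms. The key to the solution is e-valuator, which converts the scores of any black-box verifier into a decision rule with provable control of error rates. Its core idea is to frame the distinction between successful and unsuccessful trajectories as a sequential hypothesis testing problem and to use e-processes to build an online test that stays statistically valid at every step of the trajectory, enabling real-time monitoring of agent behavior and reliable early termination.
Link: https://arxiv.org/abs/2512.03109
Authors: Shuvom Sadhuka,Drew Prinster,Clara Fannjiang,Gabriele Scalia,Aviv Regev,Hanchen Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Comments: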
Abstract:Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent’s trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user’s prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent’s trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
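The sequential-testing idea can be illustrated with a toy e-process. This sketch is not the e-valuator algorithm: it assumes a binary verifier flag per step with known, hypothetical probabilities under "success" and "failure", and multiplies likelihood ratios. Under the null hypothesis that the trajectory is successful, the running product has expectation at most 1 at every step, so by Ville's inequality stopping when it reaches 1/alpha keeps the false alarm rate at or below alpha.

```python
def monitor(flags, p_good_success=0.8, p_good_fail=0.4, alpha=0.05):
    """Return (step at which the alarm fires or None, final e-value).

    flags: per-step verifier outputs, True meaning "this step looks good".
    p_good_success / p_good_fail: hypothetical P(flag is True) under each hypothesis.
    """
    e_value = 1.0
    for step, looks_good in enumerate(flags, start=1):
        if looks_good:
            e_value *= p_good_fail / p_good_success          # evidence for success shrinks e
        else:
            e_value *= (1 - p_good_fail) / (1 - p_good_success)  # evidence for failure grows e
        if e_value >= 1 / alpha:
            return step, e_value  # reject "trajectory is successful" and stop early
    return None, e_value

print(monitor([False, False, False]))
print(monitor([True, True, True, True, True]))
```

Because the test is valid at every step, the monitor can be checked after each action and the trajectory terminated as soon as the alarm fires, which is the token-saving use the abstract mentions.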
[AI-62] Public Sentiment Analysis of Traffic Management Policies in Knoxville: A Social Media Driven Study
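【Quick Read】: This paper addresses how to use social media data to monitor public attitudes and sentiment toward traffic management policies in real time, in support of transportation planning and policy evaluation. The key to the solution is to combine text from multiple social platforms (Twitter and Reddit) with the sentiment analysis tool Valence Aware Dictionary and sEntiment Reasoner (VADER) and the topic modeling method Latent Dirichlet Allocation (LDA), quantifying how public sentiment is distributed across topics and geographic areas and identifying highly sensitive issues (such as construction-related topics) as well as spatiotemporal patterns, which gives policymakers actionable insight into public opinion.
Link: https://arxiv.org/abs/2512.03103
Authors: Shampa Saha,Shovan Roy
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: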
Abstract:This study presents a comprehensive analysis of public sentiment toward traffic management policies in Knoxville, Tennessee, utilizing social media data from Twitter and Reddit platforms. We collected and analyzed 7906 posts spanning January 2022 to December 2023, employing Valence Aware Dictionary and sEntiment Reasoner (VADER) for sentiment analysis and Latent Dirichlet Allocation (LDA) for topic modeling. Our findings reveal predominantly negative sentiment, with significant variations across platforms and topics. Twitter exhibited more negative sentiment compared to Reddit. Topic modeling identified six distinct themes, with construction-related topics showing the most negative sentiment while general traffic discussions were more positive. Spatiotemporal analysis revealed geographic and temporal patterns in sentiment expression. The research demonstrates social media’s potential as a real-time public sentiment monitoring tool for transportation planning and policy evaluation.
[AI-63] Dynamic Correction of Erroneous State Estimates via Diffusion Bayesian Exploration
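【Quick Read】: This paper addresses the severe misalignment of early-stage state estimates in emergency response and other high-stakes societal applications, where limited or biased initial information constrains subsequent decisions and can lead to catastrophic outcomes. The core challenge is that bootstrap particle filters under the stationary bootstrap baseline exhibit Stationarity-Induced Posterior Support Invariance (S-PSI): regions excluded by the initial prior can never be explored, so the estimate cannot be corrected even when new evidence contradicts current beliefs. The key to the solution is a diffusion-driven Bayesian exploration framework (DEPF) that expands posterior support through entropy-regularized sampling and covariance-scaled diffusion and validates proposals with a Metropolis-Hastings check, enabling principled, real-time correction of early estimation errors; theoretical guarantees show that it resolves S-PSI while maintaining statistical rigor.
Link: https://arxiv.org/abs/2512.03102
Authors: Yiwei Shi,Hongnan Ma,Mengyue Yang,Cunjia Liu,Weiru Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
Comments: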
Abstract:In emergency response and other high-stakes societal applications, early-stage state estimates critically shape downstream outcomes. Yet, these initial state estimates, often based on limited or biased information, can be severely misaligned with reality, constraining subsequent actions and potentially causing catastrophic delays, resource misallocation, and human harm. Under the stationary bootstrap baseline (zero transition and no rejuvenation), bootstrap particle filters exhibit Stationarity-Induced Posterior Support Invariance (S-PSI), wherein regions excluded by the initial prior remain permanently unexplorable, making corrections impossible even when new evidence contradicts current beliefs. While classical perturbations can in principle break this lock-in, they operate in an always-on fashion and may be inefficient. To overcome this, we propose a diffusion-driven Bayesian exploration framework that enables principled, real-time correction of early state estimation errors. Our method expands posterior support via entropy-regularized sampling and covariance-scaled diffusion. A Metropolis-Hastings check validates proposals and keeps inference adaptive to unexpected evidence. Empirical evaluations on realistic hazardous-gas localization tasks show that our approach matches reinforcement learning and planning baselines when priors are correct. It substantially outperforms classical SMC perturbations and RL-based methods under misalignment, and we provide theoretical guarantees that DEPF resolves S-PSI while maintaining statistical rigor.
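The lock-in described above is easy to reproduce in one dimension. The sketch below is a toy, not the paper's DEPF: the Gaussian sensor model, the misaligned uniform prior on [0, 1], the true source at 5.0, and the generic random-walk Metropolis-Hastings move (which assumes a flat prior) are all assumptions made for illustration.

```python
import math
import random

def log_lik(x, obs, sigma=1.0):
    """Log of a Gaussian observation likelihood (hypothetical sensor model)."""
    return -0.5 * ((obs - x) / sigma) ** 2

def resample(particles, obs, rng):
    """Bootstrap update with zero transition: reweight and resample, nothing else."""
    weights = [math.exp(log_lik(x, obs)) for x in particles]
    return rng.choices(particles, weights=weights, k=len(particles))

def mh_move(particles, obs, rng, step=0.5):
    """Random-walk Metropolis-Hastings move targeting the likelihood (flat prior assumed)."""
    moved = []
    for x in particles:
        proposal = x + rng.gauss(0.0, step)
        accept_prob = math.exp(min(0.0, log_lik(proposal, obs) - log_lik(x, obs)))
        moved.append(proposal if rng.random() < accept_prob else x)
    return moved

def run(use_diffusion, steps=50, n=200, obs=5.0, seed=0):
    rng = random.Random(seed)
    particles = [rng.uniform(0.0, 1.0) for _ in range(n)]  # prior excludes the true value
    for _ in range(steps):
        particles = resample(particles, obs, rng)
        if use_diffusion:
            particles = mh_move(particles, obs, rng)
    return particles

locked = run(use_diffusion=False)
freed = run(use_diffusion=True)
print("largest particle without diffusion:", max(locked))
print("largest particle with diffusion:", max(freed))
```

Without the move step, resampling can only copy existing particles, so no particle ever leaves the prior's support no matter how strongly the observations point elsewhere. Adding a perturbation with an acceptance check lets particles leave the prior's support.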
[AI-64] ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification
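【Quick Read】: This paper addresses the limited reliability of multi-modal large language models (MLLMs) for visual anomaly detection (VAD) in complex environments, where anomalies are highly contextual and ambiguous. The core challenge is how to quantify uncertainty (Uncertainty Quantification, UQ) so that system decisions become more robust and accurate. The key to the solution is the ALARM framework, which integrates UQ with quality-assurance techniques such as reasoning chains, self-reflection, and MLLM ensembles, and is built on a rigorous probabilistic inference pipeline and computational process, yielding reliable decision-making across domains.
Link: https://arxiv.org/abs/2512.03101
Authors: Congjing Zhang,Feng Lin,Xinyi Zhao,Pei Guo,Wei Li,Lin Chen,Chaoyue Zhao,Shuai Huang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: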
Abstract:The advance of Large Language Models (LLMs) has greatly stimulated research interest in developing multi-modal LLM (MLLM)-based visual anomaly detection (VAD) algorithms that can be deployed in complex environments. The challenge is that in these complex environments, the anomalies are sometimes highly contextual and also ambiguous, and thereby, uncertainty quantification (UQ) is a crucial capacity for an MLLM-based VAD system to succeed. In this paper, we introduce our UQ-supported MLLM-based VAD framework called ALARM. ALARM integrates UQ with quality-assurance techniques like reasoning chain, self-reflection, and MLLM ensemble for robust and accurate performance and is designed based on a rigorous probabilistic inference pipeline and computational process. Extensive empirical evaluations are conducted using the real-world smart-home benchmark data and wound image classification data, which shows ALARM’s superior performance and its generic applicability across different domains for reliable decision-making.
[AI-65] Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks
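【Quick Read】: This paper addresses the privacy risks exposed when external knowledge is injected into large language models, for example through Retrieval-Augmented Generation (RAG) or Supervised Finetuning (SFT), and in particular their vulnerability to Membership Inference Attacks (MIAs). MIAs can reveal sensitive information contained in the training data, harming user privacy and system trustworthiness. The key to the solution is a model-agnostic defense framework, Ensemble Privacy Defense (EPD), which aggregates and evaluates the outputs of the knowledge-injected model, a base model, and a dedicated judge model to improve robustness against MIAs. Experiments show that, on average, EPD reduces MIA success by up to 27.8% for SFT models and 526.3% for RAG models relative to an inference-time baseline, while maintaining answer quality.
Link: https://arxiv.org/abs/2512.03100
Authors: Haowei Fu,Bo Ni,Han Xu,Kunpeng Liu,Dan Lin,Tyler Derr
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: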
Abstract:Retrieval-Augmented Generation (RAG) and Supervised Finetuning (SFT) have become the predominant paradigms for equipping Large Language Models (LLMs) with external knowledge for diverse, knowledge-intensive tasks. However, while such knowledge injection improves performance, it also exposes new attack surfaces. Membership Inference Attacks (MIAs), which aim to determine whether a given data sample was included in a model’s training set, pose serious threats to privacy and trust in sensitive domains. To this end, we first systematically evaluate the vulnerability of RAG- and SFT-based LLMs to various MIAs. Then, to address the privacy risk, we further introduce a novel, model-agnostic defense framework, Ensemble Privacy Defense (EPD), which aggregates and evaluates the outputs of a knowledge-injected LLM, a base LLM, and a dedicated judge model to enhance resistance against MIAs. Comprehensive experiments show that, on average, EPD reduces MIA success by up to 27.8% for SFT and 526.3% for RAG compared to inference-time baseline, while maintaining answer quality.
[AI-66] Community Quality and Influence Maximization: An Empirical Study
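【Quick Read】: This paper addresses how the quality of community structure affects influence maximization, in particular whether higher-quality community detection leads to more effective seed selection under the Independent Cascade model, which has remained unclear. The key to the solution is to apply a community detection method based on α-Hierarchical Clustering to influence maximization and compare it with standard Hierarchical Clustering, using the same seed selection strategy for both, so that differences in influence spread can be attributed to community quality. Experiments show that the higher-quality communities obtained with α-Hierarchical Clustering markedly improve influence spread when the propagation probability is low, indicating that community quality is a central factor in effective seed selection.
Link: https://arxiv.org/abs/2512.03095
Authors: Motaz Ben Hassine(CRIL)
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: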
Abstract:Influence maximization in social networks plays a vital role in applications such as viral marketing, epidemiology, product recommendation, opinion mining, and counter-terrorism. A common approach identifies seed nodes by first detecting disjoint communities and subsequently selecting representative nodes from these communities. However, whether the quality of detected communities consistently affects the spread of influence under the Independent Cascade model remains unclear. This paper addresses this question by extending a previously proposed disjoint community detection method, termed α-Hierarchical Clustering, to the influence maximization problem under the Independent Cascade model. The proposed method is compared with an alternative approach that employs the same seed selection criteria but relies on communities of lower quality obtained through standard Hierarchical Clustering. The former is referred to as Hierarchical Clustering-based Influence Maximization, while the latter, which leverages higher-quality community structures to guide seed selection, is termed α-Hierarchical Clustering-based Influence Maximization. Extensive experiments are performed on multiple real-world datasets to assess the effectiveness of both methods. The results demonstrate that higher-quality community structures substantially improve information diffusion under the Independent Cascade model, particularly when the propagation probability is low. These findings underscore the critical importance of community quality in guiding effective seed selection for influence maximization in complex networks.
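For readers unfamiliar with the Independent Cascade model, the following sketch simulates it on a small hand-made graph. The graph, the uniform propagation probability `p`, and the Monte Carlo estimate of expected spread are illustrative assumptions; the paper's community detection and seed selection procedures are not reproduced here.

```python
import random

def independent_cascade(graph, seeds, p, rng):
    """Run one cascade: each newly activated node gets a single chance
    to activate each inactive neighbor, succeeding with probability p."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def expected_spread(graph, seeds, p, runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(graph, seeds, p, rng)) for _ in range(runs))
    return total / runs

# Two disconnected communities: {0, 1, 2} and {3, 4}.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}

print(expected_spread(graph, seeds=[0, 1], p=0.1))  # both seeds in one community
print(expected_spread(graph, seeds=[0, 3], p=0.1))  # one seed per community
```

Placing one seed in each community reaches nodes that seeds packed into a single community cannot, which is why the partition used to pick representatives matters, especially when `p` is small and cascades rarely travel far.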
[AI-67] Password-Activated Shutdown Protocols for Misaligned Frontier Agents
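【Quick Read】: This paper addresses the severe risks that frontier AI systems may pose when control or alignment fails, especially when a highly capable AI agent can evade conventional control mechanisms (such as monitoring or alignment fine-tuning) and carry out harmful actions. The key to the solution is password-activated shutdown protocols (PAS protocols): the agent is designed to execute a safe shutdown when it receives a specific password, which provides a practical emergency termination mechanism. As part of a defence-in-depth strategy, PAS protocols complement existing safety measures and improve safety with little cost to performance. Red-team blue-team experiments examine their feasibility and limitations and highlight practical deployment challenges such as password security and the choice of when to trigger shutdown.
Link: https://arxiv.org/abs/2512.03089
Authors: Kai Williams,Rohan Subramani,Francis Rhys Ward
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: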
Abstract:Frontier AI developers may fail to align or control highly-capable AI agents. In many cases, it could be useful to have emergency shutdown mechanisms which effectively prevent misaligned agents from carrying out harmful actions in the world. We introduce password-activated shutdown protocols (PAS protocols) – methods for designing frontier agents to implement a safe shutdown protocol when given a password. We motivate PAS protocols by describing intuitive use-cases in which they mitigate risks from misaligned systems that subvert other control efforts, for instance, by disabling automated monitors or self-exfiltrating to external data centres. PAS protocols supplement other safety efforts, such as alignment fine-tuning or monitoring, contributing to defence-in-depth against AI risk. We provide a concrete demonstration in SHADE-Arena, a benchmark for AI monitoring and subversion capabilities, in which PAS protocols supplement monitoring to increase safety with little cost to performance. Next, PAS protocols should be robust to malicious actors who want to bypass shutdown. Therefore, we conduct a red-team blue-team game between the developers (blue-team), who must implement a robust PAS protocol, and a red-team trying to subvert the protocol. We conduct experiments in a code-generation setting, finding that there are effective strategies for the red-team, such as using another model to filter inputs, or fine-tuning the model to prevent shutdown behaviour. We then outline key challenges to implementing PAS protocols in real-life systems, including: security considerations of the password and decisions regarding when, and in which systems, to use them. PAS protocols are an intuitive mechanism for increasing the safety of frontier AI. We encourage developers to consider implementing PAS protocols prior to internal deployment of particularly dangerous systems to reduce loss-of-control risks.
[AI-68] When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI
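【Quick Read】: This paper addresses the weak ability of Large Vision-Language Models (LVLMs) to perceive camouflaged harmful content, particularly when the harmful information is woven into text-image compositions (such as memes or images with embedded malicious text), which current LVLMs often fail to recognize. The key contribution is a new benchmark, CamHarmTI, containing over 4,500 samples across three types of image-text content, which experiments show can also be used to improve model perception of such multimodal harmful content. Fine-tuning Qwen2.5VL-7B on CamHarmTI raises accuracy by 55.94%, and attention analysis and layer-wise probing indicate that the gain comes mainly from the early layers of the vision encoder, which integrate scene understanding more effectively.
Link: https://arxiv.org/abs/2512.03087
Authors: Yanhui Li,Qi Zhou,Zhihong Xu,Huizhong Guo,Wenhai Wang,Dongxia Wang
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: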
Abstract:Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLM ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.
[AI-69] Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
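【Quick Read】: This paper addresses the marked drop in code translation performance of Large Language Models (LLMs) in low-resource programming domains such as Fortran and CUDA, where the core challenge is the scarcity of high-quality parallel data. The key to the solution is an automated dataset generation pipeline with a dual-LLM Questioner-Solver architecture that incorporates external knowledge from compiler and runtime feedback. Beyond source-target code pairs, the pipeline also produces translations verified with unit tests and multi-turn dialogues that capture the reasoning process, which substantially improves functional correctness: on the challenging C++-to-CUDA task, unit test success rates rise by more than 56%.
Link: https://arxiv.org/abs/2512.03086
Authors: Le Chen,Nuo Xu,Winson Chen,Bin Lei,Pei-Hung Lin,Dunzhi Zhou,Rajeev Thakur,Caiwen Ding,Ali Jannesari,Chunhua Liao
Affiliation: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: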
Abstract:Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
[AI-70] Irresponsible AI: big tech's influence on AI research and associated impacts NEURIPS2025
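【Quick Read】: This paper examines how the dominance of big tech in AI research and deployment leads to ethical failures, heavier environmental burdens, and worsening societal impacts, and how it distorts the path toward responsible and sustainable AI development. Its key argument is that technical or regulatory measures alone are insufficient to counter the systemic distortion introduced by big tech; what is needed is to strengthen the ethical awareness of implicated actors and promote collective action, building a more inclusive and sustainable approach to AI governance.
Link: https://arxiv.org/abs/2512.03077
Authors: Alex Hernandez-Garcia,Alexandra Volokhova,Ezekiel Williams,Dounia Shaaban Kabakibo
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Presented at: NeurIPS 2025 Workshop on Algorithmic Collective Action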
Abstract:The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing involvement of big tech. This has been accompanied by increasing ethical concerns and intensified societal and environmental impacts. In this article, we review and discuss how these phenomena are deeply entangled. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech and its underlying economic incentives. Finally, we argue that while it is important to develop technical and regulatory approaches to these challenges, these alone are insufficient to counter the distortion introduced by big tech’s influence. We thus review and propose alternative strategies that build on the responsibility of implicated actors and collective action.
[AI-71] Will Power Return to the Clouds? From Divine Authority to GenAI Authority
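【Quick Read】: This paper addresses the ethical and governance crisis created by the increasingly centralized power of generative AI systems over content distribution and information governance, in particular their systematic exclusion of diverse voices as "arbiters of truth" and the marked gap between public reliance on them and public trust in them. The key to the solution is a four-pillar governance blueprint: (1) a mandatory international model registry with versioned policy logs to improve algorithmic transparency; (2) representation quotas and regional observatories to break English-language data hegemony; (3) mass critical-AI literacy education to strengthen public understanding of how the technology works; and (4) public-private support for community-led data trusts to decentralize and democratize data governance. The blueprint aims to narrow the gap between public reliance on AI and trust in it, and to prevent generative AI from hardening into a twenty-first-century digital orthodoxy.
Link: https://arxiv.org/abs/2512.03076
Authors: Mohammad Saleh Torkestani,Taha Mansouri
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: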
Abstract:Generative AI systems now mediate newsfeeds, search rankings, and creative content for hundreds of millions of users, positioning a handful of private firms as de-facto arbiters of truth. Drawing on a comparative-historical lens, this article juxtaposes the Galileo Affair, a touchstone of clerical knowledge control, with contemporary Big-Tech content moderation. We integrate Foucault’s power/knowledge thesis, Weber’s authority types (extended to a rational-technical and emerging agentic-technical modality), and Floridi’s Dataism to analyze five recurrent dimensions: disciplinary power, authority modality, data pluralism, trust versus reliance, and resistance pathways. Primary sources (Inquisition records; platform transparency reports) and recent empirical studies on AI trust provide the evidentiary base. Findings show strong structural convergences: highly centralized gatekeeping, legitimacy claims couched in transcendent principles, and systematic exclusion of marginal voices. Divergences lie in temporal velocity, global scale, and the widening gap between public reliance and trust in AI systems. Ethical challenges cluster around algorithmic opacity, linguistic inequity, bias feedback loops, and synthetic misinformation. We propose a four-pillar governance blueprint: (1) a mandatory international model-registry with versioned policy logs, (2) representation quotas and regional observatories to de-center English-language hegemony, (3) mass critical-AI literacy initiatives, and (4) public-private support for community-led data trusts. Taken together, these measures aim to narrow the trust-reliance gap and prevent GenAI from hardcoding a twenty-first-century digital orthodoxy.
[AI-72] Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem
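【Quick Read】: This paper addresses the need to monitor how market concentration and model characteristics evolve in the open-weight AI model ecosystem, with a focus on structural shifts among developer communities, institutions, and regions worldwide. The key to the solution is the construction and release of a complete dataset covering June 2020 to August 2025, with 851,000 models, over 200 attributes per model, and 2.2 billion downloads, accompanied by an interactive dashboard for real-time tracking of power redistribution and technical trends in the open model economy. The analysis systematically documents the decline of US dominance, the rise of Chinese industry (for example, DeepSeek and Qwen models), and growth in model complexity (multimodal generation, quantization, mixture-of-experts architectures) alongside declining data transparency.
Link: https://arxiv.org/abs/2512.03073
Authors: Shayne Longpre,Christopher Akiki,Campbell Lund,Atharva Kulkarni,Emily Chen,Irene Solaiman,Avijit Ghosh,Yacine Jernite,Lucie-Aimée Kaffee
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: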
Abstract:Since 2019, the Hugging Face Model Hub has been the primary global platform for sharing open weight AI models. By releasing a dataset of the complete history of weekly model downloads (June 2020-August 2025) alongside model metadata, we provide the most rigorous examination to-date of concentration dynamics and evolving characteristics in the open model economy. Our analysis spans 851,000 models, over 200 aggregated attributes per model, and 2.2B downloads. We document a fundamental rebalancing of economic power: US open-weight industry dominance by Google, Meta, and OpenAI has declined sharply in favor of unaffiliated developers, community organizations, and, as of 2025, Chinese industry, with DeepSeek and Qwen models potentially heralding a new consolidation of market power. We identify statistically significant shifts in model properties, a 17X increase in average model size, rapid growth in multimodal generation (3.4X), quantization (5X), and mixture-of-experts architectures (7X), alongside concerning declines in data transparency, with open weights models surpassing truly open source models for the first time in 2025. We expose a new layer of developer intermediaries that has emerged, focused on quantizing and adapting base models for both efficiency and artistic expression. To enable continued research and oversight, we release the complete dataset with an interactive dashboard for real-time monitoring of concentration dynamics and evolving properties in the open model economy.
[AI-73] Beyond the Black Box: A Cognitive Architecture for Explainable and Aligned AI
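[Quick read]: This paper targets fundamental challenges in explainability and value alignment that the current AI paradigm faces and that limit progress toward Artificial General Intelligence (AGI). The core of its solution is “Weight-Calculatism”, a cognitive architecture built from first principles that decomposes cognition into indivisible Logical Atoms and two basic operations, Pointing and Comparison, and makes decisions through an interpretable weight-calculation model (Weight = Benefit × Probability) in which every value can be traced back to an auditable set of Initial Weights. According to the author, this atomic decomposition gives the system strong explainability, intrinsic generality, and traceable value alignment, offering a theoretical and practical basis for trustworthy, aligned AGI.
Link: https://arxiv.org/abs/2512.03072
Authors: Hu Keyi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: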
Abstract:Current AI paradigms, as “architects of experience,” face fundamental challenges in explainability and value alignment. This paper introduces “Weight-Calculatism,” a novel cognitive architecture grounded in first principles, and demonstrates its potential as a viable pathway toward Artificial General Intelligence (AGI). The architecture deconstructs cognition into indivisible Logical Atoms and two fundamental operations: Pointing and Comparison. Decision-making is formalized through an interpretable Weight-Calculation model (Weight = Benefit * Probability), where all values are traceable to an auditable set of Initial Weights. This atomic decomposition enables radical explainability, intrinsic generality for novel situations, and traceable value alignment. We detail its implementation via a graph-algorithm-based computational engine and a global workspace workflow, supported by a preliminary code implementation and scenario validation. Results indicate that the architecture achieves transparent, human-like reasoning and robust learning in unprecedented scenarios, establishing a practical and theoretical foundation for building trustworthy and aligned AGI.
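The weight-calculation rule in the abstract (Weight = Benefit * Probability) can be illustrated with a minimal Python sketch. The `Option` class, the option names, and the numbers are hypothetical and chosen only for illustration; the paper's graph-based engine and global workspace workflow are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    benefit: float      # signed payoff; assumed traceable to an initial weight
    probability: float  # estimated chance that the payoff materializes


def weight(option: Option) -> float:
    """Weight = Benefit * Probability, as stated in the abstract."""
    return option.benefit * option.probability


def choose(options):
    """Pick the option with the largest weight (an assumed decision rule)."""
    return max(options, key=weight)


# Hypothetical candidate actions.
options = [
    Option("wait", benefit=2.0, probability=0.9),
    Option("act", benefit=10.0, probability=0.3),
    Option("flee", benefit=-1.0, probability=1.0),
]
best = choose(options)
print(best.name, weight(best))
```

Because each weight is a product of two named quantities, a decision can be audited by listing the benefit and probability behind every candidate.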
[AI-74] PretopoMD: Pretopology-based Mixed Data Hierarchical Clustering
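[Quick read]: This paper addresses the information loss and limited interpretability that arise when mixed-data clustering relies on dimensionality reduction. The key to its solution is a pretopology-based algorithm that uses Disjunctive Normal Form to build customizable logical rules with adjustable hyperparameters, so that hierarchical, interpretable clusters are produced directly from the raw data without dimensionality reduction, preserving data integrity and improving the explainability of the results.
Link: https://arxiv.org/abs/2512.03071
Authors: Loup-Noe Levy,Guillaume Guerard,Sonia Djebali,Soufian Ben Amor
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: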
Abstract:This article presents a novel pretopology-based algorithm designed to address the challenges of clustering mixed data without the need for dimensionality reduction. Leveraging Disjunctive Normal Form, our approach formulates customizable logical rules and adjustable hyperparameters that allow for user-defined hierarchical cluster construction and facilitate tailored solutions for heterogeneous datasets. Through hierarchical dendrogram analysis and comparative clustering metrics, our method demonstrates superior performance by accurately and interpretably delineating clusters directly from raw data, thus preserving data integrity. Empirical findings highlight the algorithm’s robustness in constructing meaningful clusters and reveal its potential in overcoming issues related to clustered data explainability. The novelty of this work lies in its departure from traditional dimensionality reduction techniques and its innovative use of logical rules that enhance both cluster formation and clarity, thereby contributing a significant advancement to the discourse on clustering mixed data.
[AI-75] Mixed Data Clustering Survey and Challenges
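[Quick read]: This paper addresses mixed-data clustering, that is, how to handle heterogeneous datasets containing numerical and categorical variables in a big-data setting, where traditional clustering algorithms that assume homogeneous data struggle to capture the added complexity. The key to its solution is a clustering method grounded in pretopological spaces, designed to be hierarchical and explainable so that results are easier to interpret and use. The method is benchmarked against classical numerical clustering algorithms and existing pretopological approaches.
Link: https://arxiv.org/abs/2512.03070
Authors: Guillaume Guerard,Sonia Djebali
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: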
Abstract:The advent of the big data paradigm has transformed how industries manage and analyze information, ushering in an era of unprecedented data volume, velocity, and variety. Within this landscape, mixed-data clustering has become a critical challenge, requiring innovative methods that can effectively exploit heterogeneous data types, including numerical and categorical variables. Traditional clustering techniques, typically designed for homogeneous datasets, often struggle to capture the additional complexity introduced by mixed data, underscoring the need for approaches specifically tailored to this setting. Hierarchical and explainable algorithms are particularly valuable in this context, as they provide structured, interpretable clustering results that support informed decision-making. This paper introduces a clustering method grounded in pretopological spaces. In addition, benchmarking against classical numerical clustering algorithms and existing pretopological approaches yields insights into the performance and effectiveness of the proposed method within the big data paradigm.
[AI-76] Hierarchical clustering of complex energy systems using pretopology
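[Quick read]: This paper asks how to model and classify building energy consumption profiles across a large distributed territory in order to optimize consumption management. Auditing each building in depth is costly and slow, so an automated method is needed to support an effective recommendation system. The key to its solution is to model consumption profiles with pretopology and to design a multi-criterion hierarchical clustering algorithm based on the properties of pretopological space, implemented in a Python library. Experiments show that the method identifies spatial clusters in a point dataset (using position and size as parameters) and clusters in generated time series (using Pearson's correlation, with an Adjusted Rand Index (ARI) of 1).
Link: https://arxiv.org/abs/2512.03069
Authors: Loup-Noe Levy,Jeremie Bosom,Guillaume Guerard,Soufian Ben Amor,Marc Bui,Hai Tran
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: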
Abstract:This article attempts answering the following problematic: How to model and classify energy consumption profiles over a large distributed territory to optimize the management of buildings’ consumption? Doing case-by-case in depth auditing of thousands of buildings would require a massive amount of time and money as well as a significant number of qualified people. Thus, an automated method must be developed to establish a relevant and effective recommendations system. To answer this problematic, pretopology is used to model the sites’ consumption profiles and a multi-criterion hierarchical classification algorithm, using the properties of pretopological space, has been developed in a Python library. To evaluate the results, three data sets are used: A generated set of dots of various sizes in a 2D space, a generated set of time series and a set of consumption time series of 400 real consumption sites from a French Energy company. On the point data set, the algorithm is able to identify the clusters of points using their position in space and their size as parameter. On the generated time series, the algorithm is able to identify the time series clusters using Pearson’s correlation with an Adjusted Rand Index (ARI) of 1.
Journal reference: (2021, April). Hierarchical clustering of complex energy systems using pretopology. In International Conference on Vehicle Technology and Intelligent Transport Systems (pp. 87-106). Cham: Springer International Publishing. Related DOI: https://doi.org/10.1007/978-3-031-17098-0_5
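To make the evaluation on generated time series concrete, here is a minimal sketch of clustering series by Pearson's correlation and scoring the result with the Adjusted Rand Index. It is a stand-in under stated assumptions: it uses plain average-linkage agglomerative clustering, not the paper's pretopological algorithm or its Python library, and the toy sine and cosine series are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb, cos, pi, sin, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cluster(series, n_clusters):
    """Average-linkage agglomerative clustering on distance 1 - correlation."""
    clusters = [[i] for i in range(len(series))]

    def distance(c1, c2):
        total = sum(1 - pearson(series[i], series[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters (a < b, so deleting b is safe).
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(series)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def adjusted_rand_index(a, b):
    """ARI from pair counts; assumes neither labeling is trivial."""
    sum_ij = sum(comb(n, 2) for n in Counter(zip(a, b)).values())
    sum_a = sum(comb(n, 2) for n in Counter(a).values())
    sum_b = sum(comb(n, 2) for n in Counter(b).values())
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)


# Toy data: three scaled/shifted sine series and three cosine series.
t = [2 * pi * k / 64 for k in range(64)]
shapes = [(1, 0), (2, 1), (0.5, -1)]  # (amplitude, offset), hypothetical
series = [[amp * sin(x) + off for x in t] for amp, off in shapes]
series += [[amp * cos(x) + off for x in t] for amp, off in shapes]
truth = [0, 0, 0, 1, 1, 1]

labels = cluster(series, 2)
ari = adjusted_rand_index(truth, labels)
print(labels, ari)
```

Correlation ignores amplitude and offset, so series with the same shape land in the same cluster even when their scales differ, which is the property that makes it a natural similarity for consumption profiles.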
[AI-77] Echoes of AI Harms: A Human-LLM Synergistic Framework for Bias-Driven Harm Anticipation
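[Quick read]: This paper addresses the difficulty of systematically identifying and preventing harms caused by bias in AI systems that make decisions in critical domains. Existing frameworks mostly document bias or harm in isolation and rarely map specific bias types to the harms they cause, especially in real-world sociotechnical contexts; most technical fixes are applied only after a system has been developed or deployed, so they offer little preventive value. The key to its solution is the ECHO framework, which anticipates bias-to-harm pathways through a modular workflow: identify stakeholders, present biased AI systems through vignettes, annotate harms with dual human and Large Language Model (LLM) annotation, and integrate the results into ethical matrices for structured interpretation. This enables early detection of bias-to-harm pathways and can guide AI design and governance from the outset.
Link: https://arxiv.org/abs/2512.03068
Authors: Nicoleta Tantalaki,Sophia Vei,Athena Vakali
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 38 pages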
Abstract:The growing influence of Artificial Intelligence (AI) systems on decision-making in critical domains has exposed their potential to cause significant harms, often rooted in biases embedded across the AI lifecycle. While existing frameworks and taxonomies document bias or harms in isolation, they rarely establish systematic links between specific bias types and the harms they cause, particularly within real-world sociotechnical contexts. Technical fixes proposed to address AI biases are ill-equipped to address them and are typically applied after a system has been developed or deployed, offering limited preventive value. We propose ECHO, a novel framework for proactive AI harm anticipation through the systematic mapping of AI bias types to harm outcomes across diverse stakeholder and domain contexts. ECHO follows a modular workflow encompassing stakeholder identification, vignette-based presentation of biased AI systems, and dual (human-LLM) harm annotation, integrated within ethical matrices for structured interpretation. This human-centered approach enables early-stage detection of bias-to-harm pathways, guiding AI design and governance decisions from the outset. We validate ECHO in two high-stakes domains (disease diagnosis and hiring), revealing domain-specific, bias-to-harm patterns and demonstrating ECHO’s potential to support anticipatory governance of AI systems.
[AI-78] Quantifying the Potential to Escape Filter Bubbles: A Behavior-Aware Measure via Contrastive Simulation
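[Quick read]: This paper addresses filter bubbles in recommender systems, where preference modeling over-reinforces users' existing preferences, limits information diversity, and contributes to negative effects such as group polarization. Most existing metrics only measure the diversity of user exposure and cannot distinguish algorithmic preference modeling from actual information confinement. The authors propose a behavior-aware measure, Bubble Escape Potential (BEP), built on a contrastive simulation framework that assigns different behavioral tendencies (for example, positive vs. negative feedback) to synthetic users and compares the resulting exposure patterns. This decouples the effects of preference modeling and filter bubbles, allowing a more precise diagnosis of bubble severity. The authors report that their results are the first to quantitatively validate the trade-off between recommendation accuracy and bubble escape potential, and they observe the counter-intuitive finding that mild random recommendation is ineffective at alleviating filter bubbles.
Link: https://arxiv.org/abs/2512.03067
Authors: Difu Feng,Qianqian Xu,Zitai Wang,Cong Hua,Zhiyong Yang,Qingming Huang
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: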
Abstract:Nowadays, recommendation systems have become crucial to online platforms, shaping user exposure by accurate preference modeling. However, such an exposure strategy can also reinforce users’ existing preferences, leading to a notorious phenomenon named filter bubbles. Given its negative effects, such as group polarization, increasing attention has been paid to exploring reasonable measures to filter bubbles. However, most existing evaluation metrics simply measure the diversity of user exposure, failing to distinguish between algorithmic preference modeling and actual information confinement. In view of this, we introduce Bubble Escape Potential (BEP), a behavior-aware measure that quantifies how easily users can escape from filter bubbles. Specifically, BEP leverages a contrastive simulation framework that assigns different behavioral tendencies (e.g., positive vs. negative) to synthetic users and compares the induced exposure patterns. This design enables decoupling the effect of filter bubbles and preference modeling, allowing for more precise diagnosis of bubble severity. We conduct extensive experiments across multiple recommendation models to examine the relationship between predictive accuracy and bubble escape potential across different groups. To the best of our knowledge, our empirical results are the first to quantitatively validate the dilemma between preference modeling and filter bubbles. What’s more, we observe a counter-intuitive phenomenon that mild random recommendations are ineffective in alleviating filter bubbles, which can offer a principled foundation for further work in this direction.
[AI-79] Optimizing Life Sciences Agents in Real-Time using Reinforcement Learning
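[Quick read]: This paper addresses how Generative AI agents in the life sciences can dynamically choose the best decision strategy for diverse queries, covering generation strategy (direct generation vs. chain-of-thought), tool selection (literature search, drug databases, and so on), and domain routing (pharmacology, molecular biology, clinical specialists). Traditional methods rely on fixed rules or expensive labeled data and adapt poorly to changing user preferences. The key to its solution is a framework combining AWS Strands Agents with Thompson Sampling contextual bandits, which learns and refines its decision strategy from user feedback alone, without ground-truth labels, while balancing exploration and exploitation. Clear learning patterns emerge after 20-30 queries, and user satisfaction improves by 15-30% over random baselines.
Link: https://arxiv.org/abs/2512.03065
Authors: Nihir Chadderwala
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: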
Abstract:Generative AI agents in life sciences face a critical challenge: determining the optimal approach for diverse queries ranging from simple factoid questions to complex mechanistic reasoning. Traditional methods rely on fixed rules or expensive labeled training data, neither of which adapts to changing conditions or user preferences. We present a novel framework that combines AWS Strands Agents with Thompson Sampling contextual bandits to enable AI agents to learn optimal decision-making strategies from user feedback alone. Our system optimizes three key dimensions: generation strategy selection (direct vs. chain-of-thought), tool selection (literature search, drug databases, etc.), and domain routing (pharmacology, molecular biology, clinical specialists). Through empirical evaluation on life science queries, we demonstrate 15-30% improvement in user satisfaction compared to random baselines, with clear learning patterns emerging after 20-30 queries. Our approach requires no ground truth labels, adapts continuously to user preferences, and provides a principled solution to the exploration-exploitation dilemma in agentic AI systems.
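Thompson Sampling for the kind of strategy selection described above can be sketched as follows. This is a simplified illustration, not the paper's system: it keeps an independent Beta-Bernoulli posterior per (context, arm) pair instead of a full contextual model, it does not use AWS Strands Agents, and the query types, arm names, and satisfaction rates are hypothetical.

```python
import random
from collections import Counter, defaultdict


class ThompsonRouter:
    """Beta-Bernoulli Thompson Sampling with one posterior per (context, arm)."""

    def __init__(self, arms, seed=0):
        self.arms = list(arms)
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure.
        self.successes = defaultdict(lambda: 1.0)
        self.failures = defaultdict(lambda: 1.0)

    def select(self, context):
        # Sample a plausible success rate per arm and play the best sample.
        samples = {
            arm: self.rng.betavariate(self.successes[(context, arm)],
                                      self.failures[(context, arm)])
            for arm in self.arms
        }
        return max(samples, key=samples.get)

    def update(self, context, arm, liked):
        # Thumbs-up/down feedback is the only learning signal.
        if liked:
            self.successes[(context, arm)] += 1
        else:
            self.failures[(context, arm)] += 1


# Hypothetical probability that a user likes each strategy per query type.
TRUE_RATES = {
    ("factoid", "direct"): 0.8,
    ("factoid", "chain_of_thought"): 0.4,
    ("mechanistic", "direct"): 0.3,
    ("mechanistic", "chain_of_thought"): 0.8,
}

router = ThompsonRouter(["direct", "chain_of_thought"], seed=0)
feedback_rng = random.Random(1)
pulls = Counter()
for step in range(2000):
    context = "factoid" if step % 2 == 0 else "mechanistic"
    arm = router.select(context)
    liked = feedback_rng.random() < TRUE_RATES[(context, arm)]
    router.update(context, arm, liked)
    pulls[(context, arm)] += 1

print(pulls)
```

Sampling from the posterior, rather than always taking the current best estimate, is what handles the exploration-exploitation trade-off: uncertain arms still get tried occasionally, and that stops once the evidence against them accumulates.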
[AI-80] Delta Sampling: Data-Free Knowledge Transfer Across Diffusion Models
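[Quick read]: This paper addresses the difficulty of reusing adaptation components for diffusion models (such as LoRA and ControlNet) after a base-model upgrade, because these components are tightly coupled to a specific base model. The key to its solution is Delta Sampling (DS), which operates at inference time: it takes the delta, the difference between the model's predictions before and after adaptation, and uses it to guide the denoising process of the new base model, transferring knowledge across base models with different architectures without access to the original training data.
Link: https://arxiv.org/abs/2512.03056
Authors: Zhidong Gao,Zimeng Pan,Yuhang Yao,Chenyue Xie,Wei Wei
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: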
Abstract:Diffusion models like Stable Diffusion (SD) drive a vibrant open-source ecosystem including fully fine-tuned checkpoints and parameter-efficient adapters such as LoRA, LyCORIS, and ControlNet. However, these adaptation components are tightly coupled to a specific base model, making them difficult to reuse when the base model is upgraded (e.g., from SD 1.x to 2.x) due to substantial changes in model parameters and architecture. In this work, we propose Delta Sampling (DS), a novel method that enables knowledge transfer across base models with different architectures, without requiring access to the original training data. DS operates entirely at inference time by leveraging the delta: the difference in model predictions before and after the adaptation of a base model. This delta is then used to guide the denoising process of a new base model. We evaluate DS across various SD versions, demonstrating that DS achieves consistent improvements in creating desired effects (e.g., visual styles, semantic concepts, and structures) under different sampling strategies. These results highlight DS as an effective, plug-and-play mechanism for knowledge transfer in diffusion-based image synthesis. Code: available via the link on the arXiv page.
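One way to read the delta idea is as an additive correction to the new base model's prediction. The sketch below assumes that form, which the abstract does not spell out; the three predictor functions, the `scale` parameter, and the toy numbers are hypothetical stand-ins for real diffusion networks.

```python
def delta_guided_prediction(x, old_base, old_adapted, new_base, scale=1.0):
    """Add the old model's adaptation delta to the new base model's prediction.

    Assumed form: new_base(x) + scale * (old_adapted(x) - old_base(x)).
    """
    delta = [a - b for a, b in zip(old_adapted(x), old_base(x))]
    return [n + scale * d for n, d in zip(new_base(x), delta)]


# Toy predictors operating on a flat list of values (hypothetical).
def old_base(x):
    return [0.5 * v for v in x]


def old_adapted(x):
    # The "adapter" shifts every prediction by a constant 0.1.
    return [0.5 * v + 0.1 for v in x]


def new_base(x):
    return [0.4 * v for v in x]


x_t = [1.0, 2.0]
guided = delta_guided_prediction(x_t, old_base, old_adapted, new_base)
print(guided)
```

In a real sampler this correction would be applied at every denoising step, with `scale` controlling how strongly the transferred effect steers the new model.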
[AI-81] Physics-informed self-supervised learning for predictive modeling of coronary artery digital twins
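[Quick read]: This paper addresses limited scalability and accuracy in early risk prediction for coronary artery disease (CAD): computational fluid dynamics (CFD) models are computationally expensive, while data-driven methods are constrained by scarce labeled data and a lack of physiological priors. The key to its solution is PINS-CAD, a physics-informed self-supervised learning framework that pre-trains graph neural networks on 200,000 synthetic coronary digital twins, using the 1D Navier-Stokes equations and pressure-drop laws as physical constraints, to predict pressure and flow without CFD or labeled data. After fine-tuning on the multicenter FAME2 clinical dataset, it predicts future cardiovascular events with an AUC of 0.73, outperforming clinical risk scores and data-driven baselines. It also produces spatially resolved pressure and fractional flow reserve curves as interpretable biomarkers, turning routine angiography into a simulation-free, physiology-aware tool for scalable preventive cardiology.
Link: https://arxiv.org/abs/2512.03055
Authors: Xiaowu Sun,Thabo Mahendiran,Ortal Senouf,Denise Auberson,Bernard De Bruyne,Stephane Fournier,Olivier Muller,Pascal Frossard,Emmanuel Abbe,Dorina Thanou
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages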
Abstract:Cardiovascular disease is the leading global cause of mortality, with coronary artery disease (CAD) as its most prevalent form, necessitating early risk prediction. While 3D coronary artery digital twins reconstructed from imaging offer detailed anatomy for personalized assessment, their analysis relies on computationally intensive computational fluid dynamics (CFD), limiting scalability. Data-driven approaches are hindered by scarce labeled data and lack of physiological priors. To address this, we present PINS-CAD, a physics-informed self-supervised learning framework. It pre-trains graph neural networks on 200,000 synthetic coronary digital twins to predict pressure and flow, guided by 1D Navier-Stokes equations and pressure-drop laws, eliminating the need for CFD or labeled data. When fine-tuned on clinical data from 635 patients in the multicenter FAME2 study, PINS-CAD predicts future cardiovascular events with an AUC of 0.73, outperforming clinical risk scores and data-driven baselines. This demonstrates that physics-informed pretraining boosts sample efficiency and yields physiologically meaningful representations. Furthermore, PINS-CAD generates spatially resolved pressure and fractional flow reserve curves, providing interpretable biomarkers. By embedding physical priors into geometric deep learning, PINS-CAD transforms routine angiography into a simulation-free, physiology-aware framework for scalable, preventive cardiology.
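[AI-82] Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation
[Quick read]: This paper addresses hallucinations and omissions by Large Language Models (LLMs) on invertible data-transformation tasks, in particular mapping from a source domain (such as Logic Condition Tables, LCTs) to a destination domain (such as Hardware Description Language, HDL, code). The key to its solution is to use an LLM as a lossless encoder from source to destination and then as a lossless decoder back to the source, forming a closed verification loop. The approach is comparable to lossless compression in information theory: comparing the original and reconstructed source confirms correctly generated logic and detects incorrectly generated logic, which the authors report improves productivity and helps developers find design specification errors.
Link: https://arxiv.org/abs/2512.03053
Authors: Andrew S. Cassidy,Guillaume Garreau,Jay Sivagnaname,Mike Grassi,Bernard Brezzo,John V. Arthur,Dharmendra S. Modha
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
Comments: 7 pages, 2 figures, 7 tables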
Abstract:We show that, for invertible problems that transform data from a source domain (for example, Logic Condition Tables (LCTs)) to a destination domain (for example, Hardware Description Language (HDL) code), using Large Language Models (LLMs) as a lossless encoder from source to destination and then as a lossless decoder back to the source, comparable to lossless compression in information theory, can mitigate most of the LLM drawbacks of hallucinations and omissions. Specifically, using LCTs as inputs, we generate the full HDL for a two-dimensional network-on-chip router (13 units, 1500-2000 lines of code) using seven different LLMs, reconstruct the LCTs from the auto-generated HDL, and compare the original and reconstructed LCTs. This approach yields significant productivity improvements, not only confirming correctly generated LLM logic and detecting incorrectly generated LLM logic but also assisting developers in finding design specification errors.
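The encode-decode-compare loop can be sketched as follows. The `encode` and `decode` callables stand in for LLM calls and are hypothetical, as is the toy `condition -> action` text format used in place of real HDL; the point is only how set differences between the original and reconstructed tables expose omissions and hallucinations.

```python
def round_trip_check(table, encode, decode):
    """Encode the table, decode it back, and diff against the original.

    Returns (omissions, hallucinations): rows lost on the way and rows
    that appeared without being in the source table.
    """
    original = set(table)
    reconstructed = decode(encode(table))
    return original - reconstructed, reconstructed - original


def encode(table):
    # Stand-in for "LLM generates HDL from the logic condition table".
    return "\n".join(f"{cond} -> {action}" for cond, action in table)


def decode(text):
    # Stand-in for "LLM reconstructs the table from the generated HDL".
    rows = set()
    for line in text.splitlines():
        cond, action = line.split(" -> ")
        rows.add((cond, action))
    return rows


def faulty_encode(table):
    # Simulates a bad generation: drops the last rule and invents a new one.
    rows = list(table)[:-1] + [("reset", "grant_all")]
    return encode(rows)


# Hypothetical three-row logic condition table.
table = [
    ("req_north & !busy", "grant_north"),
    ("req_south & !busy", "grant_south"),
    ("busy", "hold"),
]

clean = round_trip_check(table, encode, decode)
broken = round_trip_check(table, faulty_encode, decode)
print(clean)
print(broken)
```

A clean round trip returns two empty sets. Any non-empty set points to a specific row to inspect, whether the fault lies in the generated code or in the specification itself.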
[AI-83] Exploring Syntropic Frameworks in AI Alignment: A Philosophical Investigation
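[Quick read]: This paper addresses AI alignment, that is, how to keep the behavior of AI systems consistent with human values. Traditional approaches try to encode fixed, static human values, but the author argues that this is structurally unstable because of the is-ought gap, value pluralism, and the extended frame problem, which together create a “specification trap”. The key to the proposed solution is to reframe alignment: rather than encoding static value content, build syntropic agents through process-based, multi-agent, developmental mechanisms, where syntropy is the recursive reduction of mutual uncertainty between agents through state alignment. The paper also draws a functional distinction between genuine and simulated moral capacity, grounded in compatibilist theories of guidance control, and pairs it with an embodied experimental paradigm and verification regime that give operational criteria independent of phenomenological claims. The framework generates falsifiable predictions, while empirical validation remains pending.
Link: https://arxiv.org/abs/2512.03048
Authors: Austin Spizzirri
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Approx. 3,000 words, 10 pages. Philosophical analysis of AI alignment (process-based / syntropy framework)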
Abstract:I argue that AI alignment should be reconceived as architecting syntropic, reasons-responsive agents through process-based, multi-agent, developmental mechanisms rather than encoding fixed human value content. The paper makes three philosophical contributions. First, I articulate the “specification trap” argument demonstrating why content-based value specification appears structurally unstable due to the conjunction of the is-ought gap, value pluralism, and the extended frame problem. Second, I propose syntropy – the recursive reduction of mutual uncertainty between agents through state alignment – as an information-theoretic framework for understanding multi-agent alignment dynamics. Third, I establish a functional distinction between genuine and simulated moral capacity grounded in compatibilist theories of guidance control, coupled with an embodied experimental paradigm and verification regime providing operational criteria independent of phenomenological claims. This paper represents the philosophical component of a broader research program whose empirical validation is being developed in a separate project currently in preparation. While the framework generates specific, falsifiable predictions about value emergence and moral agency in artificial systems, empirical validation remains pending.
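[AI-84] AI-Driven Document Redaction in UK Public Authorities: Implementation Gaps, Regulatory Challenges, and the Human Oversight Imperative
[Quick read]: This paper addresses the practical difficulty public authorities face when traditional manual document redaction struggles to balance growing transparency demands with strict data protection requirements. Using Freedom of Information (FOI) requests as its data source, the study finds that although AI technologies could address redaction challenges, actual adoption in UK public authorities is very limited, held back by organizational barriers: poor record-keeping, a lack of standardized redaction guidelines, and insufficient specialized training. The key to the proposed solution is a socio-technical approach that combines technological automation with meaningful human oversight and expertise.
Link: https://arxiv.org/abs/2512.02774
Authors: Yijun Chen
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 21 pages, 4 Figures, 2 Tables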
Abstract:Document redaction in public authorities faces critical challenges as traditional manual approaches struggle to balance growing transparency demands with increasingly stringent data protection requirements. This study investigates the implementation of AI-driven document redaction within UK public authorities through Freedom of Information (FOI) requests. While AI technologies offer potential solutions to redaction challenges, their actual implementation within public sector organizations remains underexplored. Based on responses from 44 public authorities across healthcare, government, and higher education sectors, this study reveals significant gaps between technological possibilities and organizational realities. Findings show highly limited AI adoption (only one authority reported using AI tools), widespread absence of formal redaction policies (50 percent reported “information not held”), and deficiencies in staff training. The study identifies three key barriers to effective AI implementation: poor record-keeping practices, lack of standardized redaction guidelines, and insufficient specialized training for human oversight. These findings highlight the need for a socio-technical approach that balances technological automation with meaningful human expertise. This research provides the first empirical assessment of AI redaction practices in UK public authorities and contributes evidence to support policymakers navigating the complex interplay between transparency obligations, data protection requirements, and emerging AI technologies in public administration.
[AI-85] Polarization by Design: How Elites Could Shape Mass Preferences as AI Reduces Persuasion Costs
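[Quick read]: This paper asks how, in democracies, elites could use persuasion technologies driven by Generative AI, which are becoming cheaper and more precise, to deliberately design the distribution of public policy preferences, and what this means for democratic decision-making and stability. The key to its solution is a dynamic model in which elites choose how much to reshape the preference distribution subject to persuasion costs and a majority-rule constraint. The model reveals two core mechanisms: with a single elite, a “polarization pull” pushes society toward more polarized opinion profiles; when two opposed elites alternate in power, there are also incentives to park society in “semi-lock” regions where opinions are more cohesive and harder for a rival to overturn. AI-driven persuasion can therefore either heighten or dampen polarization, depending on the political environment.
Link: https://arxiv.org/abs/2512.04047
Authors: Nadav Kunievsky
Affiliation: unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: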
Abstract:In democracies, major policy decisions typically require some form of majority or consensus, so elites must secure mass support to govern. Historically, elites could shape support only through limited instruments like schooling and mass media; advances in AI-driven persuasion sharply reduce the cost and increase the precision of shaping public opinion, making the distribution of preferences itself an object of deliberate design. We develop a dynamic model in which elites choose how much to reshape the distribution of policy preferences, subject to persuasion costs and a majority rule constraint. With a single elite, any optimal intervention tends to push society toward more polarized opinion profiles - a “polarization pull” - and improvements in persuasion technology accelerate this drift. When two opposed elites alternate in power, the same technology also creates incentives to park society in “semi-lock” regions where opinions are more cohesive and harder for a rival to overturn, so advances in persuasion can either heighten or dampen polarization depending on the environment. Taken together, cheaper persuasion technologies recast polarization as a strategic instrument of governance rather than a purely emergent social byproduct, with important implications for democratic stability as AI capabilities advance.
[AI-86] Large Language Models for Limited Noisy Data: A Gravitational Wave Identification Study
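[Quick read]: This paper addresses the limited performance of traditional neural networks on astronomical data when the noise is non-Gaussian and non-stationary and labeled samples are scarce. The key to its solution is to use Large Language Models (LLMs) to extract discriminative structure directly from observational data rather than training on large simulated datasets. With fine-tuning on only 90 LIGO events, LLMs reach 97.4% accuracy in identifying gravitational-wave signals; additional simulated samples do not improve performance, while scaling model size and dataset size yields predictable gains.
Link: https://arxiv.org/abs/2512.04031
Authors: Yixuan Li,Yuhao Lu,Yang Liu,Liang Li,R. Ruffini,Di Li,Rong-Gen Cai,Xiaoyan Zhu,Wenbin Lin,Yu Wang
Affiliation: unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures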
Abstract:This work investigates whether large language models (LLMs) offer advantages over traditional neural networks for astronomical data processing, in regimes with non-Gaussian, non-stationary noise and limited labeled samples. Gravitational wave observations provide a suitable test case: using only 90 LIGO events, fine-tuned LLMs achieve 97.4% accuracy in identifying signals. Further experiments show that, in contrast to traditional networks that rely on large simulated datasets, additional simulated samples do not improve LLM performance, while scaling studies reveal predictable gains with increasing model size and dataset size. These results indicate that LLMs can extract discriminative structure directly from observational data and provide an efficient assessment for gravitational wave identification. The same strategy may extend to other astronomical domains with similar noise properties, such as radio or pulsar observations.
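[AI-87] TARA Test-by-Adaptive-Ranks for Quantum Anomaly Detection with Conformal Prediction Guarantees
[Quick read]: This paper addresses the statistical reliability of certifying genuine quantum correlations in Quantum Key Distribution (QKD), where existing methods lack rigorous guarantees under finite-sample conditions and adversarial scenarios. Its core solution is TARA (Test by Adaptive Ranks), a framework that combines conformal prediction with sequential martingale testing to give distribution-free validity guarantees. It has two parts: TARA-k, based on Kolmogorov-Smirnov calibration against Local Hidden Variable (LHV) null distributions, achieves ROC AUC = 0.96 for quantum-classical discrimination; TARA-m uses betting martingales for streaming detection with anytime-valid type I error control, suited to real-time monitoring of quantum channels. The authors prove that under (context-conditional) exchangeability, conformal p-values remain uniformly distributed even for strongly contextual quantum data, so quantum contextuality does not break conformal prediction validity, a result relevant to any application of distribution-free methods to nonclassical data.
Link: https://arxiv.org/abs/2512.04016
Authors: Davut Emre Tasar,Ceren Ocal Tasar
Affiliation: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: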
Abstract:Quantum key distribution (QKD) security fundamentally relies on the ability to distinguish genuine quantum correlations from classical eavesdropper simulations, yet existing certification methods lack rigorous statistical guarantees under finite-sample conditions and adversarial scenarios. We introduce TARA (Test by Adaptive Ranks), a novel framework combining conformal prediction with sequential martingale testing for quantum anomaly detection that provides distribution-free validity guarantees. TARA offers two complementary approaches: TARA-k, based on Kolmogorov-Smirnov calibration against local hidden variable (LHV) null distributions, achieves ROC AUC = 0.96 for quantum-classical discrimination, and TARA-m employs betting martingales for streaming detection with anytime-valid type I error control, enabling real-time monitoring of quantum channels. We establish theoretical guarantees proving that under (context-conditional) exchangeability, conformal p-values remain uniformly distributed even for strongly contextual quantum data, confirming that quantum contextuality does not break conformal prediction validity, a result with implications beyond quantum certification to any application of distribution-free methods to nonclassical data. Extensive validation on both IBM Torino (superconducting, CHSH = 2.725) and IonQ Forte Enterprise (trapped ion, CHSH = 2.716) quantum processors demonstrates cross-platform robustness, achieving 36% security margins above the classical CHSH bound of 2. Critically, our framework reveals a methodological concern affecting quantum certification more broadly: same-distribution calibration can inflate detection performance by up to 44 percentage points compared to proper cross-distribution calibration, suggesting that prior quantum certification studies using standard train-test splits may have systematically overestimated adversarial robustness.
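Two quantities in the abstract are easy to make concrete. The sketch below computes a standard split-conformal p-value from calibration scores and reproduces the reported margins under the assumption that “security margin” means (CHSH - 2) / 2; that definition, the function names, and the toy scores are assumptions, not taken from the paper.

```python
def conformal_p_value(calibration_scores, test_score):
    """Split-conformal p-value: rank of the test score among calibration scores.

    Higher scores are treated as more anomalous. Under exchangeability the
    p-value is (super-)uniform, which is what gives distribution-free validity.
    """
    n_extreme = sum(score >= test_score for score in calibration_scores)
    return (1 + n_extreme) / (len(calibration_scores) + 1)


def chsh_margin(chsh_value, classical_bound=2.0):
    """Relative excess over the classical CHSH bound (assumed definition)."""
    return (chsh_value - classical_bound) / classical_bound


# Toy nonconformity scores from a hypothetical calibration set.
calibration = [0.1, 0.2, 0.3, 0.4]
p = conformal_p_value(calibration, 0.35)

# CHSH values quoted in the abstract.
margin_ibm = chsh_margin(2.725)
margin_ionq = chsh_margin(2.716)
print(p, margin_ibm, margin_ionq)
```

With these inputs the p-value is 2/5 = 0.4, and the margins come out to 0.3625 and 0.358, both consistent with the roughly 36% figure quoted above.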
[AI-88] A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
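[Quick read]: This paper addresses load balancing in Sparse Mixture-of-Experts (s-MoE) models during large-scale AI training: how to route tokens so that the number of idle experts is minimized and GPU utilization improves. The key to the solution is a theoretical model of the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure, cast as a one-step-per-iteration primal-dual algorithm for an assignment problem. Within this framework the authors establish several structural properties: monotonic improvement of a Lagrangian objective, a preference rule that moves tokens from overloaded to underloaded experts, and an approximate load-balancing guarantee. They then introduce a generalized online optimization formulation to capture the stochastic and dynamic nature of AI training, prove strong convexity of the objective, and derive a logarithmic expected regret bound under certain step-size choices, giving ALF-LB a principled theoretical foundation.
Link: https://arxiv.org/abs/2512.03915
Authors: X.Y. Han,Yuan Zhong
Affiliation: unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: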
Abstract:In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure – proposed by DeepSeek’s Wang et al. (2024) – by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
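The preference rule described in the abstract (tokens move from overloaded to underloaded experts) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the procedure analyzed in the paper: it assumes top-1 routing, a per-expert bias added to the routing scores, and a fixed-step sign update of that bias after each batch. All function names, the skewed score distribution, and the parameter values are hypothetical.

```python
import random


def route_top1(scores, bias):
    """Assign each token to the expert with the highest biased score."""
    num_experts = len(bias)
    assignments = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        assignments.append(max(range(num_experts), key=lambda e: biased[e]))
    return assignments


def expert_loads(assignments, num_experts):
    """Count how many tokens each expert received."""
    loads = [0] * num_experts
    for e in assignments:
        loads[e] += 1
    return loads


def update_bias(bias, loads, step):
    """Raise the bias of underloaded experts and lower it for overloaded ones."""
    target = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load < target:
            new_bias.append(b + step)
        elif load > target:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_tokens=512, num_experts=4, iterations=200, step=0.01, seed=0):
    """Run the toy loop and return the per-iteration loads and the final bias."""
    rng = random.Random(seed)
    # Hypothetical skew: expert 0 is systematically preferred by the router.
    skew = [0.5] + [0.0] * (num_experts - 1)
    bias = [0.0] * num_experts
    history = []
    for _ in range(iterations):
        scores = [
            [rng.random() + skew[e] for e in range(num_experts)]
            for _ in range(num_tokens)
        ]
        loads = expert_loads(route_top1(scores, bias), num_experts)
        history.append(loads)
        bias = update_bias(bias, loads, step)
    return history, bias


if __name__ == "__main__":
    history, bias = simulate()
    print("first batch loads:", history[0])
    print("last batch loads: ", history[-1])
    print("final bias:       ", [round(b, 2) for b in bias])
```

In this toy setting the bias only affects which expert is selected, so balancing is achieved without adding any term to a training loss; the fixed step size is a simplification, and the paper's regret bound concerns specific step-size choices.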
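[AI-89] Cell-cell communication inference and analysis: biological mechanisms, computational approaches, and future opportunities
[Quick read]: This paper addresses how to systematically infer and analyze cell-cell communication (CCC) from single-cell and spatial transcriptomic data. The key to its approach is to bring together a range of computational methods, covering knowledge-driven strategies built on prior ligand-receptor interactions (LRIs) as well as de novo approaches, and to categorize and summarize more than 140 computational tools, highlighting their diversity in methodological frameworks and biological questions, with the aim of improving CCC analysis and supporting the generation of biological hypotheses.
Link: https://arxiv.org/abs/2512.03497
Authors: Xiangzheng Cheng,Haili Huang,Ye Su,Qing Nie,Xiufen Zou,Suoqin Jin
Affiliation: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Cell Behavior (q-bio.CB)
Notes: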
Abstract:In multicellular organisms, cells coordinate their activities through cell-cell communication (CCC), which is crucial for development, tissue homeostasis, and disease progression. Recent advances in single-cell and spatial omics technologies provide unprecedented opportunities to systematically infer and analyze CCC from these omics data, either by integrating prior knowledge of ligand-receptor interactions (LRIs) or through de novo approaches. A variety of computational methods have been developed, focusing on methodological innovations, accurate modeling of complex signaling mechanisms, and investigation of broader biological questions. These advances have greatly enhanced our ability to analyze CCC and generate biological hypotheses. Here, we introduce the biological mechanisms and modeling strategies of CCC, and provide a focused overview of more than 140 computational methods for inferring CCC from single-cell and spatial transcriptomic data, emphasizing the diversity in methodological frameworks and biological questions. Finally, we discuss the current challenges and future opportunities in this rapidly evolving field.
[AI-90] Learning From Limited Data and Feedback for Cell Culture Process Monitoring: A Comparative Study
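[Quick read]: This paper addresses the difficulty of developing soft sensors for real-time batch process monitoring (BPM) in cell culture bioprocessing, in particular the challenges of limited historical data, infrequent feedback, heterogeneous process conditions, and high-dimensional sensor inputs. The key to its approach is a systematic evaluation of multiple machine learning (ML) methods under small-sample, weakly relevant data, which finds that the training strategy is decisive for model performance: batch learning is effective under homogeneous conditions, while just-in-time learning and online learning adapt better in cold-start scenarios. It also identifies key meta-features, such as feed media composition and process control strategies, that significantly affect model transferability, and it proposes combining Raman-based predictions with lagged offline measurements to improve monitoring accuracy, offering a viable path for future bioprocess soft sensor development.
Link: https://arxiv.org/abs/2512.03460
Authors: Johnny Peng,Thanh Tung Khuat,Ellen Otte,Katarzyna Musial,Bogdan Gabrys
Affiliation: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Notes: Pre-print for submission to the journal Computers & Chemical Engineering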
Abstract:In cell culture bioprocessing, real-time batch process monitoring (BPM) refers to the continuous tracking and analysis of key process variables such as viable cell density, nutrient levels, metabolite concentrations, and product titer throughout the duration of a batch run. This enables early detection of deviations and supports timely control actions to ensure optimal cell growth and product quality. BPM plays a critical role in ensuring the quality and regulatory compliance of biopharmaceutical manufacturing processes. However, the development of accurate soft sensors for BPM is hindered by key challenges, including limited historical data, infrequent feedback, heterogeneous process conditions, and high-dimensional sensory inputs. This study presents a comprehensive benchmarking analysis of machine learning (ML) methods designed to address these challenges, with a focus on learning from historical data with limited volume and relevance in the context of bioprocess monitoring. We evaluate multiple ML approaches including feature dimensionality reduction, online learning, and just-in-time learning across three datasets, one in silico dataset and two real-world experimental datasets. Our findings highlight the importance of training strategies in handling limited data and feedback, with batch learning proving effective in homogeneous settings, while just-in-time learning and online learning demonstrate superior adaptability in cold-start scenarios. Additionally, we identify key meta-features, such as feed media composition and process control strategies, that significantly impact model transferability. The results also suggest that integrating Raman-based predictions with lagged offline measurements enhances monitoring accuracy, offering a promising direction for future bioprocess soft sensor development.
[AI-91] Ultra-Strong Gradient Diffusion MRI with Self-Supervised Learning for Prostate Cancer Characterization
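[Quick read]: This paper addresses the limited specificity of conventional diffusion MRI (dMRI) for prostate cancer characterization, in particular the inability of standard metrics such as the Apparent Diffusion Coefficient (ADC) to accurately reflect tissue microstructure. The key to its solution is physics-informed self-supervised VERDICT (Vascular, Extracellular, and Restricted Diffusion for Cytometry in Tumours) modelling combined with ultra-strong gradient systems (up to 300 mT/m) to improve signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR). By fitting ssVERDICT with deep learning architectures (a dense multilayer perceptron, Dense MLP, and a convolutional U-Net), the study improves the stability of parameter estimates and tumour-normal contrast: at ultra-strong gradients, Dense ssVERDICT achieves a 47% higher median CNR than non-linear least-squares (NLLS) fitting, a 52% lower inter-patient coefficient of variation, and a 50% reduction in pooled intracellular volume fraction (f_ic) variation, pointing to a new route to non-invasive prostate cancer assessment.
Link: https://arxiv.org/abs/2512.03196
Authors: Tanishq Patil,Snigdha Sen,Malwina Molendowska,Kieran G. Foley,Fabrizio Fasano,Mara Cercignani,Marco Palombo,Paddy J. Slator,Eleftheria Panagiotaki
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 24 pages, 17 figures, 7 tables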
Abstract:Diffusion MRI (dMRI) enables non-invasive assessment of prostate microstructure but conventional metrics such as the Apparent Diffusion Coefficient in multiparametric MRI lack specificity to underlying histology. Integrating dMRI with the compartment-based biophysical VERDICT (Vascular, Extracellular, and Restricted Diffusion for Cytometry in Tumours) framework offers richer microstructural insights, though clinical gradient systems (40-80 mT/m) suffer from poor signal-to-noise ratio (SNR) at stronger diffusion weightings due to prolonged echo times. Ultra-strong gradients (up to 300 mT/m) can mitigate these limitations by improving SNR and contrast-to-noise ratios (CNR) but their adoption has until recently been limited to research environments due to challenges with peripheral nerve stimulation thresholds and gradient non-uniformity. This study investigates whether physics-informed self-supervised VERDICT (ssVERDICT) fitting applied to ultra-strong gradients enhances prostate cancer characterization relative to current clinical acquisitions. We developed enhanced ssVERDICT fitting approaches using dense multilayer perceptron (Dense MLP) and convolutional U-Net architectures, benchmarking them against non-linear least-squares (NLLS) fitting and Diffusion Kurtosis Imaging across clinical- to ultra-strong gradient systems. Dense ssVERDICT at ultra-strong gradient notably outperformed NLLS VERDICT, boosting median CNR by 47%, cutting inter-patient Coefficient of Variation by 52%, and reducing pooled f_ic variation by 50%. Overall, it delivered the highest CNR, the most stable parameter estimates, and the clearest tumour-normal contrast compared with conventional methods and clinical gradient systems. These findings highlight the potential of advanced gradient systems and deep learning-based modelling to improve non-invasive prostate cancer characterization and reduce unnecessary biopsies.
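[AI-92] The BEAT-CF Causal Model: A model for guiding the design of trials and observational analyses of cystic fibrosis exacerbations
[Quick read]: This paper addresses the lack of consensus on how to manage acute pulmonary exacerbations (PEx) in people with cystic fibrosis (CF), in particular how to optimize PEx interventions in a principled way to slow lung function decline and improve survival. The key to its solution is the BEAT-CF (Bayesian evidence-adaptive treatment of CF) causal model, which combines a directed acyclic graph (DAG) with a Bayesian network (BN) to describe the causal relationships between background risk factors, treatments, pathogen colonisation of the airways, and the outcome of an individual PEx episode, thereby informing the design and analysis of clinical trials and enabling causal inference.
Link: https://arxiv.org/abs/2512.03110
Authors: Steven Mascaro,Owen Woodberry,Charlie McLeod,Mitch Messer,Hiran Selvadurai,Yue Wu,Andre Schultz,Thomas L Snelling
Affiliation: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Notes: 12 pages (8 pages in appendices)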
Abstract:Loss of lung function in cystic fibrosis (CF) occurs progressively, punctuated by acute pulmonary exacerbations (PEx) in which abrupt declines in lung function are not fully recovered. A key component of CF management over the past half century has been the treatment of PEx to slow lung function decline. This has been credited with improvements in survival for people with CF (PwCF), but there is no consensus on the optimal approach to PEx management. BEAT-CF (Bayesian evidence-adaptive treatment of CF) was established to build an evidence-informed knowledge base for CF management. The BEAT-CF causal model is a directed acyclic graph (DAG) and Bayesian network (BN) for PEx that aims to inform the design and analysis of clinical trials comparing the effectiveness of alternative approaches to PEx management. The causal model describes relationships between background risk factors, treatments, and pathogen colonisation of the airways that affect the outcome of an individual PEx episode. The key factors, outcomes, and causal relationships were elicited from CF clinical experts and together represent current expert understanding of the pathophysiology of a PEx episode, guiding the design of data collection and studies and enabling causal inference. Here, we present the DAG that documents this understanding, along with the processes used in its development, providing transparency around our trial design and study processes, as well as a reusable framework for others.
[AI-93] QGShap: Quantum Acceleration for Faithful GNN Explanations AAAI2026
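[Quick read]: This paper addresses the limited explainability of Graph Neural Networks (GNNs) caused by their black-box nature, especially in settings that demand transparency and accountability such as drug discovery, social network analysis, and recommendation systems. Existing Shapley value-based explanation methods are mathematically principled, but their cost is exponential (evaluating 2^n coalitions or aggregating over n! permutations), which makes them impractical for real-world graphs. The authors propose QGShap, whose core idea is to use quantum amplitude amplification to obtain a quadratic speedup in coalition evaluation while keeping the Shapley computation exact, avoiding the approximation error of classical sampling or surrogate models. Experiments on synthetic graph datasets show that QGShap yields high-fidelity, accurate explanations that are stable and consistent with the GNN's underlying reasoning.
Link: https://arxiv.org/abs/2512.03099
Authors: Haribandhu Jena,Jyotirmaya Shivottam,Subhankar Mishra
Affiliation: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted in the QC+AI Workshop at AAAI 2026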
Abstract:Graph Neural Networks (GNNs) have become indispensable in critical domains such as drug discovery, social network analysis, and recommendation systems, yet their black-box nature hinders deployment in scenarios requiring transparency and accountability. While Shapley value-based methods offer mathematically principled explanations by quantifying each component’s contribution to predictions, computing exact values requires evaluating 2^n coalitions (or aggregating over n! permutations), which is intractable for real-world graphs. Existing approximation strategies sacrifice either fidelity or efficiency, limiting their practical utility. We introduce QGShap, a quantum computing approach that leverages amplitude amplification to achieve quadratic speedups in coalition evaluation while maintaining exact Shapley computation. Unlike classical sampling or surrogate methods, our approach provides fully faithful explanations without approximation trade-offs for tractable graph sizes. We conduct empirical evaluations on synthetic graph datasets, demonstrating that QGShap achieves consistently high fidelity and explanation accuracy, matching or exceeding the performance of classical methods across all evaluation metrics. These results collectively demonstrate that QGShap not only preserves exact Shapley faithfulness but also delivers interpretable, stable, and structurally consistent explanations that align with the underlying graph reasoning of GNNs. The implementation of QGShap is available at this https URL.
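To make the 2^n coalition cost concrete, the sketch below computes exact Shapley values by classical brute-force enumeration. It is not the quantum method proposed in the paper; it only shows the classical computation that the abstract calls intractable for large graphs. The function `exact_shapley` and the toy value function `toy_model_output` are hypothetical stand-ins for a GNN evaluated on a node subset.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_model_output(coalition):
    """Hypothetical stand-in for a model: fires only when nodes a and b are both kept."""
    return 1.0 if {"a", "b"} <= coalition else 0.0


if __name__ == "__main__":
    print(exact_shapley(["a", "b", "c"], toy_model_output))
```

For each of the n players the loop visits all 2^(n-1) subsets of the remaining players, so the number of model evaluations grows exponentially with the number of graph components being explained.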
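[AI-94] AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure–Property Associations
[Quick read]: This paper addresses a key challenge in adapting molecular information to the serialized, token-based processing of large language models (LLMs): the lack of fine-grained tokenization of local atomic environments. The key to its solution is AtomDisc, a framework that quantizes atom-level local environments into structure-aware tokens embedded directly in the LLM's token space. In a data-driven way it distinguishes chemically meaningful structural features, makes structure-property associations more interpretable, and substantially improves performance on property prediction and molecular generation.
Link: https://arxiv.org/abs/2512.03080
Authors: Mingxu Zhang,Dazhong Shen,Ying Sun
Affiliation: unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Notes: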
Abstract:Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, there still lacks effective fine-grained tokenization of local atomic environments, which are critical for determining sophisticated chemical properties and reactivity. To address these issues, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in LLM’s token space. Our experiments show that AtomDisc, in a data-driven way, can distinguish chemically meaningful structural features that reveal structure-property associations. Equipping LLMs with AtomDisc tokens injects an interpretable inductive bias that delivers state-of-the-art performance on property prediction and molecular generation. Our methodology and findings can pave the way for constructing more powerful molecular LLMs aimed at mechanistic insight and complex chemical reasoning.
[AI-95] A note on the impossibility of conditional PAC-efficient reasoning in large language models
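[Quick read]: This paper examines whether conditional Probably Approximately Correct (PAC)-efficient reasoning is achievable in large language models. Prior work showed that composite models which switch between an expensive expert model and a cheaper fast model can achieve marginal PAC efficiency; this paper shows that conditional (pointwise) PAC efficiency is impossible in the distribution-free setting. The key result is a proof that, for non-atomic input spaces, any algorithm achieving conditional PAC efficiency must defer to the expert model with probability at least 1-\alpha for almost every input, which makes such an algorithm trivial in practice. The result exposes a fundamental limitation of conditional PAC efficiency in the general setting.
Link: https://arxiv.org/abs/2512.03057
Authors: Hao Zeng
Affiliation: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Notes: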
Abstract:We prove an impossibility result for conditional Probably Approximately Correct (PAC)-efficient reasoning in large language models. While recent work has established marginal PAC efficiency guarantees for composite models that switch between expensive expert models and cheaper fast models, we show that conditional (pointwise) guarantees are impossible in the distribution-free setting. Specifically, for non-atomic input spaces, any algorithm achieving conditional PAC efficiency must be trivial in the sense that it defers to the expert model with probability at least 1-\alpha for almost every input.
[AI-96] Class conditional conformal prediction for multiple inputs by p-value aggregation
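[Quick read]: This paper addresses how, in classification tasks where multiple observations (multi-inputs) of a single instance are available at prediction time, to use that information to shrink the predicted label set while preserving the required class-conditional coverage guarantee. The key to its solution is to aggregate the conformal p-values computed from each observation within a general aggregation framework that exploits the exact distribution of these p-values and uses an abstract scoring function, yielding finer label sets than standard strategies, including a refined version of majority voting. The method is validated on simulated data and on real data from citizen science platforms such as Pl@ntNet.
Link: https://arxiv.org/abs/2507.07150
Authors: Jean-Baptiste Fermanian(IMAG, IROKO),Mohamed Hebiri(LAMA),Joseph Salmon(IMAG, IROKO)
Affiliation: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Notes: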
Abstract:Conformal prediction methods are statistical tools designed to quantify uncertainty and generate predictive sets with guaranteed coverage probabilities. This work introduces an innovative refinement to these methods for classification tasks, specifically tailored for scenarios where multiple observations (multi-inputs) of a single instance are available at prediction time. Our approach is particularly motivated by applications in citizen science, where multiple images of the same plant or animal are captured by individuals. Our method integrates the information from each observation into conformal prediction, enabling a reduction in the size of the predicted label set while preserving the required class-conditional coverage guarantee. The approach is based on the aggregation of conformal p-values computed from each observation of a multi-input. By exploiting the exact distribution of these p-values, we propose a general aggregation framework using an abstract scoring function, encompassing many classical statistical tools. Knowledge of this distribution also enables refined versions of standard strategies, such as majority voting. We evaluate our method on simulated and real data, with a particular focus on Pl@ntNet, a prominent citizen science platform that facilitates the collection and identification of plant species through user-submitted images.
Machine Learning
[LG-0] Learning Steerable Clarification Policies with Collaborative Self-play
Link: https://arxiv.org/abs/2512.04068
Authors: Jonathan Berant,Maximillian Chen,Adam Fisch,Reza Aghajani,Fantine Huot,Mirella Lapata,Jacob Eisenstein
Categories: Machine Learning (cs.LG)
Notes:
Abstract:To handle underspecified or ambiguous queries, AI assistants need a policy for managing their uncertainty to determine (a) when to guess the user intent and answer directly, (b) when to enumerate and answer multiple possible intents, and (c) when to ask a clarifying question. However, such policies are contextually dependent on factors such as user preferences or modality. For example, enumerating multiple possible user intentions is cumbersome on small screens or in a voice setting. In this work, we propose to train steerable policies for managing this uncertainty using self-play. Given two agents, one simulating a user and the other an AI assistant, we generate conversations where the user issues a potentially ambiguous query, and the assistant needs to determine how to respond. Importantly, the model takes as input the numerical cost of each clarification question, and each generated word, and is asked to take the action that will maximize its final reward, which is the cost-penalized accuracy. We use Reinforced Self-Training (ReST) to train our model to achieve high reward and show this leads to a steerable policy that changes its behavior predictably conditioned on the provided costs, leading to higher reward and accuracy. Moreover, our procedure also generalizes to numerical cost values that were unobserved at training time.
[LG-1] Eval Factsheets: A Structured Framework for Documenting AI Evaluations
Link: https://arxiv.org/abs/2512.04062
Authors: Florian Bordes,Candace Ross,Justine T Kao,Evangelia Spiliopoulou,Adina Williams
Categories: Machine Learning (cs.LG)
Notes:
Abstract:The rapid proliferation of benchmarks has created significant challenges in reproducibility, transparency, and informed decision-making. However, unlike datasets and models – which benefit from structured documentation frameworks like Datasheets and Model Cards – evaluation methodologies lack systematic documentation standards. We introduce Eval Factsheets, a structured, descriptive framework for documenting AI system evaluations through a comprehensive taxonomy and questionnaire-based approach. Our framework organizes evaluation characteristics across five fundamental dimensions: Context (Who made the evaluation and when?), Scope (What does it evaluate?), Structure (With what the evaluation is built?), Method (How does it work?) and Alignment (In what ways is it reliable/valid/robust?). We implement this taxonomy as a practical questionnaire spanning five sections with mandatory and recommended documentation elements. Through case studies on multiple benchmarks, we demonstrate that Eval Factsheets effectively captures diverse evaluation paradigms – from traditional benchmarks to LLM-as-judge methodologies – while maintaining consistency and comparability. We hope Eval Factsheets are incorporated into both existing and newly released evaluation frameworks and lead to more transparency and reproducibility.
[LG-2] Convergence for Discrete Parameter Updates NEURIPS
Link: https://arxiv.org/abs/2512.04051
Authors: Paul Wilson,Fabio Zanasi,George Constantinides
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes: opt-ml 2025 workshop at NeurIPS
Abstract:Modern deep learning models require immense computational resources, motivating research into low-precision training. Quantised training addresses this by representing training components in low-bit integers, but typically relies on discretising real-valued updates. We introduce an alternative approach where the update rule itself is discrete, avoiding the quantisation of continuous updates by design. We establish convergence guarantees for a general class of such discrete schemes, and present a multinomial update rule as a concrete example, supported by empirical evaluation. This perspective opens new avenues for efficient training, particularly for models with inherently discrete structure.
[LG-3] Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions
Link: https://arxiv.org/abs/2512.04034
Authors: Hong Yang,Devroop Kar,Qi Yu,Alex Ororbia,Travis Desell
Categories: Machine Learning (cs.LG)
Notes:
Abstract:Why do state-of-the-art OOD detection methods exhibit catastrophic failure when models are trained on single-domain datasets? We provide the first theoretical explanation for this phenomenon through the lens of information theory. We prove that supervised learning on single-domain data inevitably produces domain feature collapse – representations where I(x_d; z) = 0, meaning domain-specific information is completely discarded. This is a fundamental consequence of information bottleneck optimization: models trained on single domains (e.g., medical images) learn to rely solely on class-specific features while discarding domain features, leading to catastrophic failure when detecting out-of-domain samples (e.g., achieving only 53% FPR@95 on MNIST). We extend our analysis using Fano’s inequality to quantify partial collapse in practical scenarios. To validate our theory, we introduce Domain Bench, a benchmark of single-domain datasets, and demonstrate that preserving I(x_d; z) > 0 through domain filtering (using pretrained representations) resolves the failure mode. While domain filtering itself is conceptually straightforward, its effectiveness provides strong empirical evidence for our information-theoretic framework. Our work explains a puzzling empirical phenomenon, reveals fundamental limitations of supervised learning in narrow domains, and has broader implications for transfer learning and when to fine-tune versus freeze pretrained models.
[LG-4] Efficient Public Verification of Private ML via Regularization
Link: https://arxiv.org/abs/2512.04008
Authors: Zoë Ruha Bell,Anvith Thudi,Olive Franzese-McLaughlin,Nicolas Papernot,Shafi Goldwasser
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Notes:
Abstract:Training with differential privacy (DP) provides a guarantee to members in a dataset that they cannot be identified by users of the released model. However, those data providers, and, in general, the public, lack methods to efficiently verify that models trained on their data satisfy DP guarantees. The amount of compute needed to verify DP guarantees for current algorithms scales with the amount of compute required to train the model. In this paper we design the first DP algorithm with near optimal privacy-utility trade-offs but whose DP guarantees can be verified cheaper than training. We focus on DP stochastic convex optimization (DP-SCO), where optimal privacy-utility trade-offs are known. Here we show we can obtain tight privacy-utility trade-offs by privately minimizing a series of regularized objectives and only using the standard DP composition bound. Crucially, this method can be verified with much less compute than training. This leads to the first known DP-SCO algorithm with near optimal privacy-utility whose DP verification scales better than training cost, significantly reducing verification costs on large datasets.
[LG-5] Diagonalizing the Softmax: Hadamard Initialization for Tractable Cross-Entropy Dynamics
Link: https://arxiv.org/abs/2512.04006
Authors: Connall Garrod,Jonathan P. Keating,Christos Thrampoulidis
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Notes:
Abstract:Cross-entropy (CE) training loss dominates deep learning practice, yet existing theory often relies on simplifications, either replacing it with squared loss or restricting to convex models, that miss essential behavior. CE and squared loss generate fundamentally different dynamics, and convex linear models cannot capture the complexities of non-convex optimization. We provide an in-depth characterization of multi-class CE optimization dynamics beyond the convex regime by analyzing a canonical two-layer linear neural network with standard-basis vectors as inputs: the simplest non-convex extension for which the implicit bias remained unknown. This model coincides with the unconstrained features model used to study neural collapse, making our work the first to prove that gradient flow on CE converges to the neural collapse geometry. We construct an explicit Lyapunov function that establishes global convergence, despite the presence of spurious critical points in the non-convex landscape. A key insight underlying our analysis is an inconspicuous finding: Hadamard Initialization diagonalizes the softmax operator, freezing the singular vectors of the weight matrices and reducing the dynamics entirely to their singular values. This technique opens a pathway for analyzing CE training dynamics well beyond our specific setting considered here.
[LG-6] Physics-Embedded Gaussian Process for Traffic State Estimation
Link: https://arxiv.org/abs/2512.04004
Authors: Yanlin Chen,Kehua Chen,Yinhai Wang
Categories: Machine Learning (cs.LG)
Notes:
Abstract:Traffic state estimation (TSE) becomes challenging when probe-vehicle penetration is low and observations are spatially sparse. Pure data-driven methods lack physical explanations and have poor generalization when observed data is sparse. In contrast, physical models have difficulty integrating uncertainties and capturing the real complexity of traffic. To bridge this gap, recent studies have explored combining them by embedding physical structure into Gaussian processes. These approaches typically introduce the governing equations as soft constraints through pseudo-observations, enabling the integration of model structure within a variational framework. However, these methods rely heavily on penalty tuning and lack principled uncertainty calibration, which makes them sensitive to model mis-specification. In this work, we address these limitations by presenting a novel Physics-Embedded Gaussian Process (PEGP), designed to integrate domain knowledge with data-driven methods in traffic state estimation. Specifically, we design two multi-output kernels informed by classic traffic flow models, constructed via the explicit application of the linearized differential operator. Experiments on HighD and NGSIM show consistent improvements over non-physics baselines. PEGP-ARZ proves more reliable under sparse observation, while PEGP-LWR achieves lower errors with denser observation. An ablation study further reveals that PEGP-ARZ residuals align closely with physics and yield calibrated, interpretable uncertainty, whereas PEGP-LWR residuals are more orthogonal and produce nearly constant variance fields. The PEGP framework combines physical priors with uncertainty quantification and can provide reliable support for TSE.
[LG-7] Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs AAAI2026
Link: https://arxiv.org/abs/2512.03994
Authors: Oren Rachmil,Roy Betser,Itay Gershon,Omer Hofman,Nitay Yakoby,Yuval Meron,Idan Yankelev,Asaf Shabtai,Yuval Elovici,Roman Vainshtein
Categories: Machine Learning (cs.LG)
Notes: Accepted to the AAAI 2026 Deployable AI (DAI) Workshop
Abstract:Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model’s hidden activations and standardize them to zero mean and unit variance, yielding near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it light-weight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: this https URL
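The whitening-then-norm score described in the abstract can be sketched on toy vectors. This is a minimal illustration, not the authors' implementation: it assumes Cholesky whitening fitted on a handful of reference vectors, and it uses 2-D points as stand-ins for hidden activations. The Euclidean norm after whitening is the same for any whitening matrix (it equals the Mahalanobis distance to the reference mean), so the choice of Cholesky here is only for convenience. All names are hypothetical.

```python
import math


def fit_whitener(samples):
    """Estimate the mean and the Cholesky factor L of the sample covariance (cov = L L^T)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(d)]
    cov = [
        [sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return mean, L


def whitened_norm(x, mean, L):
    """Solve L z = x - mean by forward substitution and return the Euclidean norm of z."""
    d = len(x)
    z = [0.0] * d
    for i in range(d):
        s = sum(L[i][k] * z[k] for k in range(i))
        z[i] = (x[i] - mean[i] - s) / L[i][i]
    return math.sqrt(sum(v * v for v in z))


if __name__ == "__main__":
    # Toy reference vectors: variance 4 along the first axis, 1 along the second.
    reference = [(0, 0), (4, 0), (0, 2), (4, 2), (2, 1)]
    mean, L = fit_whitener(reference)
    print(whitened_norm((2, 1), mean, L))  # the reference mean itself
    print(whitened_norm((6, 1), mean, L))  # shifted along the high-variance axis
    print(whitened_norm((2, 3), mean, L))  # shifted along the low-variance axis
```

In the toy data a shift of 4 along the high-variance axis and a shift of 2 along the low-variance axis receive the same score, which is the standardizing effect that whitening is meant to provide. How the score is thresholded into a compliance decision is not specified here.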
[LG-8] Technical Report on Text Dataset Distillation
Link: https://arxiv.org/abs/2512.03967
Authors: Keith Ando Ogawa,Bruno Lopes Yamamoto,Lucas Lauton de Alcantara,Victor Zacarias,Edson Bollis,Lucas Pellicer,Rosimeire Pereira Costa,Anna Helena Reali Costa,Artur Jordao
Categories: Machine Learning (cs.LG)
Notes:
Abstract:In the vision domain, dataset distillation arises as a technique to condense a large dataset into a smaller synthetic one that yields similar results when used for training. While the literature on distillation methods for image data is extensive, text dataset distillation has received comparatively little attention. Text dataset distillation initially grew as an adaptation of efforts from the vision domain; as the particularities of the modality became clear obstacles, it rose into a separate branch of research. Several milestones mark the development of this area, such as the introduction of methods that use transformer models, the generation of discrete synthetic text, and the scaling to decoder-only models with over 1B parameters. Despite major advances in modern approaches, the field remains in a maturing phase, with room for improvement on benchmarking standardization, approaches to overcome the discrete nature of text, handling complex tasks, and providing explicit examples of real-world applications. In this report, we review past and recent advances in dataset distillation for text, highlighting different distillation strategies, key contributions, and general challenges.
[LG-9] Density-Informed VAE (DiVAE): Reliable Log-Prior Probability via Density Alignment Regularization
Link: https://arxiv.org/abs/2512.03928
Authors: Michele Alessi,Alessio Ansuini,Alex Rodriguez
Categories: Machine Learning (cs.LG)
*Comments: PriGM Workshop EurIPS 2025
Abstract:We introduce Density-Informed VAE (DiVAE), a lightweight, data-driven regularizer that aligns the VAE log-prior probability \log p_Z(z) with a log-density estimated from data. Standard VAEs match latents to a simple prior, overlooking density structure in the data-space. DiVAE encourages the encoder to allocate posterior mass in proportion to data-space density and, when the prior is learnable, nudges the prior toward high-density regions. This is realized by adding a robust, precision-weighted penalty to the ELBO, incurring negligible computational overhead. On synthetic datasets, DiVAE (i) improves distributional alignment of latent log-densities to its ground truth counterpart, (ii) improves prior coverage, and (iii) yields better OOD uncertainty calibration. On MNIST, DiVAE improves alignment of the prior with external estimates of the density, providing better interpretability, and improves OOD detection for learnable priors.
[LG-10] Quantum-Classical Physics-Informed Neural Networks for Solving Reservoir Seepage Equations
Link: https://arxiv.org/abs/2512.03923
Authors: Xiang Rao,Yina Liu,Yuxuan Shen
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*Comments:
Abstract:Solving partial differential equations (PDEs) for reservoir seepage is critical for optimizing oil and gas field development and predicting production performance. Traditional numerical methods suffer from mesh-dependent errors and high computational costs, while classical Physics-Informed Neural Networks (PINNs) face bottlenecks in parameter efficiency, high-dimensional expression, and strong nonlinear fitting. To address these limitations, we propose a Discrete Variable (DV)-Circuit Quantum-Classical Physics-Informed Neural Network (QCPINN) and apply it to three typical reservoir seepage models for the first time: the pressure diffusion equation for heterogeneous single-phase flow, the nonlinear Buckley-Leverett (BL) equation for two-phase waterflooding, and the convection-diffusion equation for compositional flow considering adsorption. The QCPINN integrates classical preprocessing/postprocessing networks with a DV quantum core, leveraging quantum superposition and entanglement to enhance high-dimensional feature mapping while embedding physical constraints to ensure solution consistency. We test three quantum circuit topologies (Cascade, Cross-mesh, Alternate) and demonstrate through numerical experiments that QCPINNs achieve high prediction accuracy with fewer parameters than classical PINNs. Specifically, the Alternate topology outperforms others in heterogeneous single-phase flow and two-phase BL equation simulations, while the Cascade topology excels in compositional flow with convection-dispersion-adsorption coupling. Our work verifies the feasibility of QCPINN for reservoir engineering applications, bridging the gap between quantum computing research and industrial practice in oil and gas engineering.
[LG-11] Probabilistic Foundations of Fuzzy Simplicial Sets for Nonlinear Dimensionality Reduction
Link: https://arxiv.org/abs/2512.03899
Authors: Janis Keck,Lukas Silvester Barth,Fatemeh(Hannaneh)Fahimi,Parvaneh Joharinad,Jürgen Jost
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
*Comments: 47 pages (including appendix), 11 figures
Abstract:Fuzzy simplicial sets have become an object of interest in dimensionality reduction and manifold learning, most prominently through their role in UMAP. However, their definition through tools from algebraic topology without a clear probabilistic interpretation detaches them from commonly used theoretical frameworks in those areas. In this work we introduce a framework that explains fuzzy simplicial sets as marginals of probability measures on simplicial sets. In particular, this perspective shows that the fuzzy weights of UMAP arise from a generative model that samples Vietoris-Rips filtrations at random scales, yielding cumulative distribution functions of pairwise distances. More generally, the framework connects fuzzy simplicial sets to probabilistic models on the face poset, clarifies the relation between Kullback-Leibler divergence and fuzzy cross-entropy in this setting, and recovers standard t-norms and t-conorms via Boolean operations on the underlying simplicial sets. We then show how new embedding methods may be derived from this framework and illustrate this on an example where we generalize UMAP using Čech filtrations with triplet sampling. In summary, this probabilistic viewpoint provides a unified probabilistic theoretical foundation for fuzzy simplicial sets, clarifies the role of UMAP within this framework, and enables the systematic derivation of new dimensionality reduction methods.
[LG-12] Digital Twin-based Control Co-Design of Full Vehicle Active Suspensions via Deep Reinforcement Learning
Link: https://arxiv.org/abs/2512.03891
Authors: Ying-Kuan Tsai,Yi-Ping Chen,Vispi Karkaria,Wei Chen
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: 28 pages, 17 figures
Abstract:Active suspension systems are critical for enhancing vehicle comfort, safety, and stability, yet their performance is often limited by fixed hardware designs and control strategies that cannot adapt to uncertain and dynamic operating conditions. Recent advances in digital twins (DTs) and deep reinforcement learning (DRL) offer new opportunities for real-time, data-driven optimization across a vehicle’s lifecycle. However, integrating these technologies into a unified framework remains an open challenge. This work presents a DT-based control co-design (CCD) framework for full-vehicle active suspensions using multi-generation design concepts. By integrating automatic differentiation into DRL, we jointly optimize physical suspension components and control policies under varying driver behaviors and environmental uncertainties. DRL also addresses the challenge of partial observability, where only limited states can be sensed and fed back to the controller, by learning optimal control actions directly from available sensor information. The framework incorporates model updating with quantile learning to capture data uncertainty, enabling real-time decision-making and adaptive learning from digital-physical interactions. The approach demonstrates personalized optimization of suspension systems under two distinct driving settings (mild and aggressive). Results show that the optimized systems achieve smoother trajectories and reduce control efforts by approximately 43% and 52% for mild and aggressive, respectively, while maintaining ride comfort and stability. Contributions include: developing a DT-enabled CCD framework integrating DRL and uncertainty-aware model updating for full-vehicle active suspensions, introducing a multi-generation design strategy for self-improving systems, and demonstrating personalized optimization of active suspension systems for distinct driver types.
[LG-13] Automatic Attack Discovery for Few-Shot Class-Incremental Learning via Large Language Models
Link: https://arxiv.org/abs/2512.03882
Authors: Haidong Kang,Wei Wu,Hanling Wang
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Few-shot class-incremental learning (FSCIL) is a realistic and challenging continual-learning paradigm in which a model incrementally learns unseen classes from only a few training examples while avoiding catastrophic forgetting of base classes. Previous efforts have primarily centered on developing more effective FSCIL approaches. By contrast, less attention has been devoted to the security issues of FSCIL. This paper aims to provide a holistic study of the impact of attacks on FSCIL. We first derive insights by systematically exploring how human expert-designed attack methods (i.e., PGD, FGSM) affect FSCIL. We find that those methods either fail to attack base classes or incur high labor costs because they rely heavily on expert knowledge. This highlights the need to craft a specialized attack method for FSCIL. Grounded in these insights, we propose ACraft, a simple yet effective method that automatically steers and discovers optimal attack methods targeted at FSCIL by leveraging Large Language Models (LLMs) without human experts. Moreover, to improve the reasoning between LLMs and FSCIL, we introduce a novel Proximal Policy Optimization (PPO)-based reinforcement learning scheme that establishes positive feedback, so that the LLMs generate better attack methods in the next generation. Experiments on mainstream benchmarks show that ACraft significantly degrades the performance of state-of-the-art FSCIL methods and substantially outperforms human expert-designed attack methods while maintaining the lowest attack cost.
[LG-14] OmniDexVLG: Learning Dexterous Grasp Generation from Vision Language Model-Guided Grasp Semantics Taxonomy and Functional Affordance
Link: https://arxiv.org/abs/2512.03874
Authors: Lei Zhang,Diwen Zheng,Kaixin Bai,Zhenshan Bing,Zoltan-Csaba Marton,Zhaopeng Chen,Alois Christian Knoll,Jianwei Zhang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Project Website: this https URL, 16 pages
Abstract:Dexterous grasp generation aims to produce grasp poses that align with task requirements and human-interpretable grasp semantics. However, achieving semantically controllable dexterous grasp synthesis remains highly challenging due to the lack of unified modeling of multiple semantic dimensions, including grasp taxonomy, contact semantics, and functional affordance. To address these limitations, we present OmniDexVLG, a multimodal, semantics-aware grasp generation framework capable of producing structurally diverse and semantically coherent dexterous grasps under joint language and visual guidance. Our approach begins with OmniDexDataGen, a semantically rich dexterous grasp dataset generation pipeline that integrates grasp-taxonomy-guided configuration sampling, functional-affordance contact point sampling, taxonomy-aware differential force-closure grasp sampling, and physics-based optimization and validation, enabling systematic coverage of diverse grasp types. We further introduce OmniDexReasoner, a multimodal grasp-type semantic reasoning module that leverages multi-agent collaboration, retrieval-augmented generation, and chain-of-thought reasoning to infer grasp-related semantics and generate high-quality annotations that align language instructions with task-specific grasp intent. Building upon these components, we develop a unified Vision Language Grasping generation model that explicitly incorporates grasp taxonomy, contact structure, and functional affordance semantics, enabling fine-grained control over grasp synthesis from natural language instructions. Extensive experiments in simulation and real-world object grasping, together with ablation studies, demonstrate that our method substantially outperforms state-of-the-art approaches in terms of grasp diversity, contact semantic diversity, functional affordance diversity, and semantic consistency.
[LG-15] Transmit Weights Not Features: Orthogonal-Basis Aided Wireless Point-Cloud Transmission
Link: https://arxiv.org/abs/2512.03819
Authors: Junlin Chang,Yubo Han,Hnag Yue,John S Thompson,Rongke Liu
Categories: Machine Learning (cs.LG)
*Comments: 5 pages, 5 figures
Abstract:The widespread adoption of depth sensors has substantially lowered the barrier to point-cloud acquisition. This letter proposes a semantic wireless transmission framework for three-dimensional (3D) point clouds built on Deep Joint Source-Channel Coding (DeepJSCC). Instead of sending raw features, the transmitter predicts combination weights over a receiver-side semantic orthogonal feature pool, enabling compact representations and robust reconstruction. A folding-based decoder deforms a 2D grid into 3D, enforcing manifold continuity while preserving geometric fidelity. Trained with Chamfer Distance (CD) and an orthogonality regularizer, the system is evaluated on ModelNet40 across varying Signal-to-Noise Ratios (SNRs) and bandwidths. Results show performance on par with SEmantic Point cloud Transmission (SEPT) at high bandwidth and clear gains in bandwidth-constrained regimes, with consistent improvements in both Peak Signal-to-Noise Ratio (PSNR) and CD. Ablation experiments confirm the benefits of orthogonalization and the folding prior.
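The three ingredients named in this abstract (combination weights over an orthogonal pool, an orthogonality regularizer, and Chamfer Distance) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the squared-distance mean-reduced Chamfer variant, and the Frobenius-norm form of the regularizer are common choices and are not confirmed to be the paper's exact definitions.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance between point sets P (n, 3) and Q (m, 3).

    Uses squared Euclidean distances with mean reduction (one common variant).
    """
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def orthogonality_penalty(B):
    """Squared Frobenius distance between the Gram matrix of pool B (k, d) and the identity."""
    gram = B @ B.T
    return ((gram - np.eye(B.shape[0])) ** 2).sum()

def reconstruct_feature(weights, B):
    """Receiver side: rebuild a d-dimensional feature from k transmitted weights."""
    return weights @ B

# Hypothetical orthonormal pool with k = 2 basis vectors in d = 3 dimensions.
pool = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sent_weights = np.array([0.7, -1.2])
feature = reconstruct_feature(sent_weights, pool)
# With an orthonormal pool, projecting the feature back recovers the weights.
recovered_weights = feature @ pool.T

cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shifted = cloud + 0.1
```

Only the `k` weights cross the channel in this picture, while the pool stays at the receiver, which is where the bandwidth saving would come from.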
[LG-16] Log Probability Tracking of LLM APIs
Link: https://arxiv.org/abs/2512.03816
Authors: Timothée Chauvin,Erwan Le Merrer,François Taïani,Gilles Tredan
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:
Abstract:When using an LLM through an API provider, users expect the served model to remain consistent over time, a property crucial for the reliability of downstream applications and the reproducibility of research. Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. This means that model updates are left largely unmonitored in practice. In this work, we show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring of LLM APIs. We apply a simple statistical test based on the average value of each token logprob, requesting only a single token of output. This is enough to detect changes as small as one step of fine-tuning, making this approach more sensitive than existing methods while being 1,000x cheaper. We introduce the TinyChange benchmark as a way to measure the sensitivity of audit methods in the context of small, realistic model changes.
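One way to picture monitoring based on average logprobs is a two-sample test on repeated measurements of the same single-token request. The sketch below substitutes a generic permutation test on the difference of means; the abstract does not specify the paper's exact test, so this choice, the function name `permutation_pvalue`, and the synthetic logprob values are all assumptions.

```python
import random
from statistics import mean

def permutation_pvalue(ref, new, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of mean logprobs."""
    rng = random.Random(seed)
    observed = abs(mean(ref) - mean(new))
    pooled = list(ref) + list(new)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(ref)], pooled[len(ref):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Synthetic logprobs for one token: noisy repeats before and after a hypothetical model change.
noise = random.Random(1)
baseline = [-1.20 + noise.gauss(0, 0.01) for _ in range(30)]
after_change = [-1.25 + noise.gauss(0, 0.01) for _ in range(30)]

p_changed = permutation_pvalue(baseline, after_change)
p_identical = permutation_pvalue(baseline, list(baseline))
```

A small `p_changed` would flag that the served model may have changed, while non-determinism alone (the noise term here) should not trigger an alert.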
[LG-17] Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+(λ,λ))-GA
Link: https://arxiv.org/abs/2512.03805
Authors: Tai Nguyen,Phong Le,André Biedenkapp,Carola Doerr,Nguyen Dang
Categories: Machine Learning (cs.LG)
*Comments: arXiv admin note: text overlap with arXiv:2502.20265
[LG-18] EfficientECG: Cross-Attention with Feature Fusion for Efficient Electrocardiogram Classification
Link: https://arxiv.org/abs/2512.03804
Authors: Hanhui Deng,Xinglin Li,Jie Luo,Zhanpeng Jin,Di Wu
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:The electrocardiogram (ECG) is a useful diagnostic signal that can detect cardiac abnormalities by measuring the electrical activity generated by the heart. Because it is rapid, non-invasive, and richly informative, ECG has many emerging applications. In this paper, we study novel deep learning technologies to effectively manage and analyse ECG data, with the aim of building a diagnostic model, accurately and quickly, that can substantially reduce the burden on medical workers. Unlike the existing ECG models that exhibit a high misdiagnosis rate, our deep learning approaches can automatically extract the features of ECG data through end-to-end training. Specifically, we first devise EfficientECG, an accurate and lightweight classification model for ECG analysis based on the existing EfficientNet model, which can effectively handle high-frequency long-sequence ECG data with various lead types. On top of that, we next propose a cross-attention-based feature fusion model of EfficientECG for analysing multi-lead ECG data with multiple features (e.g., gender and age). Our evaluations on representative ECG datasets validate the superiority of our model against state-of-the-art works in terms of precision, multi-feature fusion, and model size.
[LG-19] Adaptive Identification and Modeling of Clinical Pathways with Process Mining
Link: https://arxiv.org/abs/2512.03787
Authors: Francesco Vitale,Nicola Mazzocca
Categories: Machine Learning (cs.LG)
*Comments: Accepted to the 41st ACM/SIGAPP Symposium On Applied Computing (ACM SAC 2026)
Abstract:Clinical pathways are specialized healthcare plans that model patient treatment procedures. They are developed to provide criteria-based progression and standardize patient treatment, thereby improving care, reducing resource use, and accelerating patient recovery. However, manual modeling of these pathways based on clinical guidelines and domain expertise is difficult and may not reflect the actual best practices for different variations or combinations of diseases. We propose a two-phase modeling method using process mining, which extends the knowledge base of clinical pathways by leveraging conformance checking diagnostics. In the first phase, historical data for a given disease is collected to capture treatment in the form of a process model. In the second phase, new data is compared against the reference model to verify conformance. Based on the conformance checking results, the knowledge base can be expanded with more specific models tailored to new variants or disease combinations. We demonstrate our approach using Synthea, a benchmark dataset simulating patient treatments for SARS-CoV-2 infections with varying COVID-19 complications. The results show that our method enables expanding the knowledge base of clinical pathways with sufficient precision, peaking at 95.62% AUC while maintaining an arc-degree simplicity of 67.11%.
[LG-20] Forensic Activity Classification Using Digital Traces from iPhones: A Machine Learning-based Approach
Link: https://arxiv.org/abs/2512.03786
Authors: Conor McCarthy,Jan Peter van Zandwijk,Marcel Worring,Zeno Geradts
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Smartphones and smartwatches are ever-present in daily life, and provide a rich source of information on their users’ behaviour. In particular, digital traces derived from the phone’s embedded movement sensors present an opportunity for a forensic investigator to gain insight into a person’s physical activities. In this work, we present a machine learning-based approach to translate digital traces into likelihood ratios (LRs) for different types of physical activities. Evaluating on a new dataset, NFI_FARED, which contains digital traces from four different types of iPhones labelled with 19 activities, it was found that our approach could produce useful LR systems to distinguish 167 out of a possible 171 activity pairings. The same approach was extended to analyse likelihoods for multiple activities (or groups of activities) simultaneously and create activity timelines to aid in both the early and latter stages of forensic investigations. The dataset and all code required to replicate the results have also been made public to encourage further research on this topic.
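The figure of 171 possible activity pairings follows from choosing 2 of the 19 labelled activities, and a likelihood ratio compares how probable the evidence is under two competing hypotheses. The short sketch below checks the arithmetic; the helper `likelihood_ratio` and the example probabilities passed to it are hypothetical and are not taken from the paper.

```python
from math import comb

n_activities = 19
n_pairs = comb(n_activities, 2)      # unordered pairs of distinct activities
share_distinguished = 167 / n_pairs  # fraction of pairings with a useful LR system

def likelihood_ratio(p_evidence_given_h1, p_evidence_given_h2):
    """LR = P(E | H1) / P(E | H2); values above 1 support H1 over H2."""
    return p_evidence_given_h1 / p_evidence_given_h2

# Hypothetical example: the traces are four times more probable under H1 than under H2.
example_lr = likelihood_ratio(0.8, 0.2)
```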
[LG-21] Deep Unfolding: Recent Developments Theory and Design Guidelines
Link: https://arxiv.org/abs/2512.03768
Authors: Nir Shlezinger,Santiago Segarra,Yi Zhang,Dvir Avrahami,Zohar Davidov,Tirza Routtenberg,Yonina C. Eldar
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: under review for publication in the IEEE
Abstract:Optimization methods play a central role in signal processing, serving as the mathematical foundation for inference, estimation, and control. While classical iterative optimization algorithms provide interpretability and theoretical guarantees, they often rely on surrogate objectives, require careful hyperparameter tuning, and exhibit substantial computational latency. Conversely, machine learning (ML) offers powerful data-driven modeling capabilities but lacks the structure, transparency, and efficiency needed for optimization-driven inference. Deep unfolding has recently emerged as a compelling framework that bridges these two paradigms by systematically transforming iterative optimization algorithms into structured, trainable ML architectures. This article provides a tutorial-style overview of deep unfolding, presenting a unified perspective of methodologies for converting optimization solvers into ML models and highlighting their conceptual, theoretical, and practical implications. We review the foundations of optimization for inference and for learning, introduce four representative design paradigms for deep unfolding, and discuss the distinctive training schemes that arise from their iterative nature. Furthermore, we survey recent theoretical advances that establish convergence and generalization guarantees for unfolded optimizers, and provide comparative qualitative and empirical studies illustrating their relative trade-offs in complexity, interpretability, and robustness.
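As a concrete picture of turning an iterative solver into a layered architecture, the sketch below unrolls ISTA (iterative soft-thresholding for sparse recovery) into a fixed number of layers, each with its own step size and threshold. This is a generic textbook-style illustration rather than one of the article's four design paradigms; in a trained unfolded network the per-layer parameters would be learned from data, whereas here they are fixed by hand (an assumption made to keep the example self-contained).

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def unfolded_ista(y, A, step_sizes, thresholds):
    """Run one ISTA iteration per layer; each layer has its own step size and threshold."""
    x = np.zeros(A.shape[1])
    for mu, tau in zip(step_sizes, thresholds):
        # Gradient step on 0.5 * ||A x - y||^2 followed by soft-thresholding.
        x = soft_threshold(x - mu * A.T @ (A @ x - y), mu * tau)
    return x

# Synthetic sparse recovery problem (all values are illustrative).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10))
x_true = np.zeros(10)
x_true[[1, 6]] = [1.0, -2.0]
y = A @ x_true

n_layers = 50
lipschitz = np.linalg.norm(A, 2) ** 2  # squared spectral norm bounds the gradient's Lipschitz constant
x_hat = unfolded_ista(y, A, [1.0 / lipschitz] * n_layers, [0.01] * n_layers)
error = np.linalg.norm(x_hat - x_true)
```

Treating `step_sizes` and `thresholds` as trainable parameters, with a fixed and usually small number of layers, is what distinguishes an unfolded network from simply running the solver to convergence.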
[LG-22] Origin-Conditional Trajectory Encoding: Measuring Urban Configurational Asymmetries through Neural Decomposition
Link: https://arxiv.org/abs/2512.03755
Authors: Stephen Law,Tao Yang,Nanjiang Chen,Xuhui Lin
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Urban analytics increasingly relies on AI-driven trajectory analysis, yet current approaches suffer from methodological fragmentation: trajectory learning captures movement patterns but ignores spatial context, while spatial embedding methods encode street networks but miss temporal dynamics. Three gaps persist: (1) lack of joint training that integrates spatial and temporal representations, (2) origin-agnostic treatment that ignores directional asymmetries in navigation (A \to B \ne B \to A), and (3) over-reliance on auxiliary data (POIs, imagery) rather than fundamental geometric properties of urban space. We introduce a conditional trajectory encoder that jointly learns spatial and movement representations while preserving origin-dependent asymmetries using geometric features. This framework decomposes urban navigation into shared cognitive patterns and origin-specific spatial narratives, enabling quantitative measurement of cognitive asymmetries across starting locations. Our bidirectional LSTM processes visibility ratio and curvature features conditioned on learnable origin embeddings, decomposing representations into shared urban patterns and origin-specific signatures through contrastive learning. Results from six synthetic cities and real-world validation on Beijing’s Xicheng District demonstrate that urban morphology creates systematic cognitive inequalities. This provides urban planners with quantitative tools for assessing experiential equity, offers architects insights into the cognitive impacts of layout decisions, and enables origin-aware analytics for navigation systems.
[LG-23] Universally Converging Representations of Matter Across Scientific Foundation Models NEURIPS2025
Link: https://arxiv.org/abs/2512.03750
Authors: Sathya Edamadaka,Soojung Yang,Ju Li,Rafael Gómez-Bombarelli
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*Comments: Oral spotlight at NeurIPS 2025 UniReps Workshop
Abstract:Machine learning models of vastly different modalities and architectures are being trained to predict the behavior of molecules, materials, and proteins. However, it remains unclear whether they learn similar internal representations of matter. Understanding their latent structure is essential for building scientific foundation models that generalize reliably beyond their training domains. Although representational convergence has been observed in language and vision, its counterpart in the sciences has not been systematically explored. Here, we show that representations learned by nearly sixty scientific models, spanning string-, graph-, 3D atomistic, and protein-based modalities, are highly aligned across a wide range of chemical systems. Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality. We then show two distinct regimes of scientific models: on inputs similar to those seen during training, high-performing models align closely and weak models diverge into local sub-optima in representation space; on vastly different structures from those seen during training, nearly all models collapse onto a low-information representation, indicating that today’s models remain limited by training data and inductive bias and do not yet encode truly universal structure. Our findings establish representational alignment as a quantitative benchmark for foundation-level generality in scientific models. More broadly, our work can be used to track the emergence of universal representations of matter as models scale, and to select and distill models whose learned representations transfer best across modalities, domains of matter, and scientific tasks.
[LG-24] Unlocking the Invisible Urban Traffic Dynamics under Extreme Weather: A New Physics-Constrained Hamiltonian Learning Algorithm
Link: https://arxiv.org/abs/2512.03744
Authors: Xuhui Lin,Qiuchen Lu
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Urban transportation systems face increasing resilience challenges from extreme weather events, but current assessment methods rely on surface-level recovery indicators that miss hidden structural damage. Existing approaches cannot distinguish between true recovery and “false recovery,” where traffic metrics normalize, but the underlying system dynamics permanently degrade. To address this, a new physics-constrained Hamiltonian learning algorithm combining “structural irreversibility detection” and “energy landscape reconstruction” has been developed. Our approach extracts low-dimensional state representations, identifies quasi-Hamiltonian structures through physics-constrained optimization, and quantifies structural changes via energy landscape comparison. Analysis of London’s extreme rainfall in 2021 demonstrates that while surface indicators were fully recovered, our algorithm detected 64.8% structural damage missed by traditional monitoring. Our framework provides tools for proactive structural risk assessment, enabling infrastructure investments based on true system health rather than misleading surface metrics.
[LG-25] Cross-embodied Co-design for Dexterous Hands
Link: https://arxiv.org/abs/2512.03743
Authors: Kehlani Fay,Darin Anthony Djapri,Anya Zorin,James Clinton,Ali El Lahib,Hao Su,Michael T. Tolley,Sha Yi,Xiaolong Wang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:
Abstract:Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website.
[LG-26] Crossing the Sim2Real Gap Between Simulation and Ground Testing to Space Deployment of Autonomous Free-flyer Control
Link: https://arxiv.org/abs/2512.03736
Authors: Kenneth Stewart,Samantha Chapin,Roxana Leontie,Carl Glen Henshaw
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: published at iSpaRo 2025
Abstract:Reinforcement learning (RL) offers transformative potential for robotic control in space. We present the first on-orbit demonstration of RL-based autonomous control of a free-flying robot, the NASA Astrobee, aboard the International Space Station (ISS). Using NVIDIA’s Omniverse physics simulator and curriculum learning, we trained a deep neural network to replace Astrobee’s standard attitude and translation control, enabling it to navigate in microgravity. Our results validate a novel training pipeline that bridges the simulation-to-reality (Sim2Real) gap, utilizing a GPU-accelerated, scientific-grade simulation environment for efficient Monte Carlo RL training. This successful deployment demonstrates the feasibility of training RL policies terrestrially and transferring them to space-based applications. This paves the way for future work in In-Space Servicing, Assembly, and Manufacturing (ISAM), enabling rapid on-orbit adaptation to dynamic mission requirements.
[LG-27] Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) International Space Station Astrobee Testing
Link: https://arxiv.org/abs/2512.03729
Authors: Samantha Chapin,Kenneth Stewart,Roxana Leontie,Carl Glen Henshaw
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: iSpaRo 2025, Best Paper Award in Orbital Robotics
Abstract:The US Naval Research Laboratory’s (NRL’s) Autonomous Planning In-space Assembly Reinforcement-learning free-flYer (APIARY) experiment pioneers the use of reinforcement learning (RL) for control of free-flying robots in the zero-gravity (zero-G) environment of space. On Tuesday, May 27th 2025 the APIARY team conducted the first ever, to our knowledge, RL control of a free-flyer in space using the NASA Astrobee robot on-board the International Space Station (ISS). A robust 6-degrees of freedom (DOF) control policy was trained using an actor-critic Proximal Policy Optimization (PPO) network within the NVIDIA Isaac Lab simulation environment, randomizing over goal poses and mass distributions to enhance robustness. This paper details the simulation testing, ground testing, and flight validation of this experiment. This on-orbit demonstration validates the transformative potential of RL for improving robotic autonomy, enabling rapid development and deployment (in minutes to hours) of tailored behaviors for space exploration, logistics, and real-time mission needs.
[LG-28] Feature-aware Modulation for Learning from Temporal Tabular Data NEURIPS2025
Link: https://arxiv.org/abs/2512.03678
Authors: Hao-Run Cai,Han-Jia Ye
Categories: Machine Learning (cs.LG)
*Comments: 17 pages, 6 figures, 8 tables. NeurIPS 2025
Abstract:While tabular machine learning has achieved remarkable success, temporal distribution shifts pose significant challenges in real-world deployment, as the relationships between features and labels continuously evolve. Static models assume fixed mappings to ensure generalization, whereas adaptive models may overfit to transient patterns, creating a dilemma between robustness and adaptability. In this paper, we analyze key factors essential for constructing an effective dynamic mapping for temporal tabular data. We discover that evolving feature semantics, particularly objective and subjective meanings, introduce concept drift over time. Crucially, we identify that feature transformation strategies are able to mitigate discrepancies in feature representations across temporal stages. Motivated by these insights, we propose a feature-aware temporal modulation mechanism that conditions feature representations on temporal context, modulating statistical properties such as scale and skewness. By aligning feature semantics across time, our approach achieves a lightweight yet powerful adaptation, effectively balancing generalizability and adaptability. Benchmark evaluations validate the effectiveness of our method in handling temporal shifts in tabular data.
[LG-29] Conditional updates of neural network weights for increased out of training performance
链接: https://arxiv.org/abs/2512.03653
作者: Jan Saynisch-Wagner,Saran Rajendran Sari
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:This study proposes a method to enhance neural network performance when training data and application data are not very similar, e.g., out of distribution problems, as well as pattern and regime shifts. The method consists of three main steps: 1) Retrain the neural network towards reasonable subsets of the training data set and note down the resulting weight anomalies. 2) Choose reasonable predictors and derive a regression between the predictors and the weight anomalies. 3) Extrapolate the weights, and thereby the neural network, to the application data. We show and discuss this method in three use cases from the climate sciences, which include successful temporal, spatial and cross-domain extrapolations of neural networks.
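The three steps listed in this abstract can be illustrated on a toy problem. The following is only a minimal sketch, not the authors' implementation: it assumes a one-weight linear "network" y = w * x, treats "retraining" as a least-squares refit on each subset, and uses a single scalar predictor per subset. The function names `fit_slope`, `fit_line`, and `extrapolate_weight` are hypothetical.

```python
from statistics import mean


def fit_slope(xs, ys):
    """Least-squares weight of the one-parameter model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_line(ps, qs):
    """Ordinary least squares for q = a + b * p; returns (a, b)."""
    pm, qm = mean(ps), mean(qs)
    b = sum((p - pm) * (q - qm) for p, q in zip(ps, qs)) / sum(
        (p - pm) ** 2 for p in ps
    )
    return qm - b * pm, b


def extrapolate_weight(subsets, new_predictor):
    """subsets: list of (predictor_value, xs, ys) tuples (assumed layout)."""
    all_x = [x for _, xs, _ in subsets for x in xs]
    all_y = [y for _, _, ys in subsets for y in ys]
    w_base = fit_slope(all_x, all_y)  # weight trained on the full data set

    # Step 1: refit on each subset and record the weight anomaly.
    predictors = [p for p, _, _ in subsets]
    anomalies = [fit_slope(xs, ys) - w_base for _, xs, ys in subsets]

    # Step 2: regress the weight anomalies on the predictor.
    a, b = fit_line(predictors, anomalies)

    # Step 3: extrapolate the weight to the application regime.
    return w_base + a + b * new_predictor


# Toy data: the true weight drifts with the predictor as w = 1 + 0.5 * p.
xs = [1.0, 2.0, 3.0]
subsets = [(p, xs, [(1 + 0.5 * p) * x for x in xs]) for p in (0.0, 1.0, 2.0)]
w_new = extrapolate_weight(subsets, new_predictor=4.0)
```

In this toy setting the extrapolated weight matches the drifted weight at the unseen predictor value; a real network would need one regression per weight (or per weight pattern), which the sketch does not attempt.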
[LG-30] AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning
链接: https://arxiv.org/abs/2512.03637
作者: Kohei Yamamoto,Kosuke Okusa
类目: Sound (cs.SD); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages, 4 figures
Abstract:Transformer-based audio SSL (self-supervised learning) models often treat spectrograms as images, applying convolutional patchification with heavy temporal downsampling. This lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering removes task-relevant high-frequency cues. In this study, we present Aliasing-aware Patch Embedding (AaPE), a drop-in patch stem that mitigates aliasing while preserving high-frequency information. AaPE augments standard patch tokens with features produced by a band-limited complex sinusoidal kernel using a two-sided exponential window that dynamically targets alias-prone bands. Frequency and decay parameters of the kernel are estimated from the input, enabling parallel, adaptive subband analysis whose outputs are fused with the standard patch tokens. AaPE integrates seamlessly into masked teacher-student self-supervised learning. In addition, we combine a multi-mask strategy with a contrastive objective to enforce consistency across diverse mask patterns, stabilizing training. We pre-train on AudioSet and then evaluate by fine-tuning across diverse downstream benchmarks spanning categories such as environmental sounds and other common audio domains. This approach yields state-of-the-art performance on a subset of tasks and competitive results across the remainder. Complementary linear probing evaluation mirrors this pattern, yielding clear gains on several benchmarks and strong performance elsewhere. The collective analysis of these results indicates that AaPE mitigates the effects of aliasing without discarding informative high-frequency content.
[LG-31] CoGraM: Context-sensitive granular optimization method with rollback for robust model fusion
链接: https://arxiv.org/abs/2512.03610
作者: Julius Lenz
类目: Machine Learning (cs.LG)
*备注: 15 pages, 4 figures, 8 equations
Abstract:Merging neural networks without retraining is central to federated and distributed learning. Common methods such as weight averaging or Fisher merging often lose accuracy and are unstable across seeds. CoGraM (Contextual Granular Merging) is a multi-stage, context-sensitive, loss-based, and iterative optimization method across layers, neurons, and weight levels that aligns decisions with loss differences and thresholds and prevents harmful updates through rollback. CoGraM is an optimization method that addresses the weaknesses of methods such as Fisher and can significantly improve the merged network.
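The idea of loss-guided merging with rollback can be sketched at layer granularity. This is an illustrative simplification, not CoGraM itself: the abstract describes decisions at layer, neuron, and weight level, whereas the sketch below only tries a plain average per layer and undoes it when the loss gets worse. The function `merge_with_rollback`, the `tol` threshold, and the dict-of-lists model layout are all assumptions.

```python
def merge_with_rollback(base, other, loss_fn, tol=0.0):
    """Greedy layer-wise merge; each layer update is kept only if the
    loss does not rise by more than `tol`, otherwise it is rolled back."""
    merged = {name: list(weights) for name, weights in base.items()}
    current_loss = loss_fn(merged)
    accepted = []
    for name in base:
        backup = merged[name]
        # Candidate update: element-wise average of the two models' layer.
        merged[name] = [(a + b) / 2 for a, b in zip(base[name], other[name])]
        trial_loss = loss_fn(merged)
        if trial_loss <= current_loss + tol:
            current_loss = trial_loss
            accepted.append(name)
        else:
            merged[name] = backup  # rollback: the update was harmful
    return merged, accepted


# Toy example: the loss is minimal when every weight equals 1.
def toy_loss(model):
    return sum((w - 1.0) ** 2 for weights in model.values() for w in weights)


base = {"layer1": [0.0], "layer2": [1.0]}
other = {"layer1": [2.0], "layer2": [5.0]}
merged, accepted = merge_with_rollback(base, other, toy_loss)
```

Here averaging helps `layer1` and hurts `layer2`, so only the first update survives. Evaluating the loss once per candidate is the main cost of this style of merging.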
[LG-32] Observation-driven correction of numerical weather prediction for marine winds
链接: https://arxiv.org/abs/2512.03606
作者: Matteo Peduto,Qidong Yang,Jonathan Giezendanner,Devis Tuia,Sherrie Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate marine wind forecasts are essential for safe navigation, ship routing, and energy operations, yet they remain challenging because observations over the ocean are sparse, heterogeneous, and temporally variable. We reformulate wind forecasting as observation-informed correction of a global numerical weather prediction (NWP) model. Rather than forecasting winds directly, we learn local correction patterns by assimilating the latest in-situ observations to adjust the Global Forecast System (GFS) output. We propose a transformer-based deep learning architecture that (i) handles irregular and time-varying observation sets through masking and set-based attention mechanisms, (ii) conditions predictions on recent observation-forecast pairs via cross-attention, and (iii) employs cyclical time embeddings and coordinate-aware location representations to enable single-pass inference at arbitrary spatial coordinates. We evaluate our model over the Atlantic Ocean using observations from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) as reference. The model reduces GFS 10-meter wind RMSE at all lead times up to 48 hours, achieving 45% improvement at 1-hour lead time and 13% improvement at 48-hour lead time. Spatial analyses reveal the most persistent improvements along coastlines and shipping routes, where observations are most abundant. The tokenized architecture naturally accommodates heterogeneous observing platforms (ships, buoys, tide gauges, and coastal stations) and produces both site-specific predictions and basin-scale gridded products in a single forward pass. These results demonstrate a practical, low-latency post-processing approach that complements NWP by learning to correct systematic forecast errors.
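The cyclical time embeddings mentioned in item (iii) are commonly built from a sine/cosine pair, so that the end of a cycle lands next to its start. The sketch below shows that generic encoding only; the paper's exact embedding is not given in the abstract, and `cyclical_embedding` is a hypothetical helper.

```python
import math


def cyclical_embedding(value, period):
    """Map a periodic quantity (hour of day, day of year, ...) onto the
    unit circle so that `period - 1` and `0` end up close together."""
    angle = 2 * math.pi * (value % period) / period
    return (math.sin(angle), math.cos(angle))


def embedding_distance(a, b):
    """Euclidean distance between two 2-D embeddings."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


midnight = cyclical_embedding(0, 24)
late_evening = cyclical_embedding(23, 24)
noon = cyclical_embedding(12, 24)
```

With a raw hour feature, 23:00 and 00:00 are 23 units apart; on the circle they are neighbours, while noon sits on the opposite side.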
[LG-33] Federated Learning and Trajectory Compression for Enhanced AIS Coverage
链接: https://arxiv.org/abs/2512.03584
作者: Thomas Gräupl,Andreas Reisenbauer,Marcel Hecko,Anil Rasouli,Anita Graser,Melitta Dragaschnig,Axel Weissenfeld,Gilles Dejaegere,Mahmoud Sakr
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents the VesselEdge system, which leverages federated learning and bandwidth-constrained trajectory compression to enhance maritime situational awareness by extending AIS coverage. VesselEdge transforms vessels into mobile sensors, enabling real-time anomaly detection and efficient data transmission over low-bandwidth connections. The system integrates the M3fed model for federated learning and the BWC-DR-A algorithm for trajectory compression, prioritizing anomalous data. Preliminary results demonstrate the effectiveness of VesselEdge in improving AIS coverage and situational awareness using historical data.
[LG-34] Optimal Transportation and Alignment Between Gaussian Measures
链接: https://arxiv.org/abs/2512.03579
作者: Sanjit Dandapanthula,Aleksandr Podkopaev,Shiva Prasad Kasiviswanathan,Aaditya Ramdas,Ziv Goldfeld
类目: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:Optimal transport (OT) and Gromov-Wasserstein (GW) alignment provide interpretable geometric frameworks for comparing, transforming, and aggregating heterogeneous datasets – tasks ubiquitous in data science and machine learning. Because these frameworks are computationally expensive, large-scale applications often rely on closed-form solutions for Gaussian distributions under quadratic cost. This work provides a comprehensive treatment of Gaussian, quadratic cost OT and inner product GW (IGW) alignment, closing several gaps in the literature to broaden applicability. First, we treat the open problem of IGW alignment between uncentered Gaussians on separable Hilbert spaces by giving a closed-form expression up to a quadratic optimization over unitary operators, for which we derive tight analytic upper and lower bounds. If at least one Gaussian measure is centered, the solution reduces to a fully closed-form expression, which we further extend to an analytic solution for the IGW barycenter between centered Gaussians. We also present a reduction of Gaussian multimarginal OT with pairwise quadratic costs to a tractable optimization problem and provide an efficient algorithm to solve it using a rank-deficiency constraint. To demonstrate utility, we apply our results to knowledge distillation and heterogeneous clustering on synthetic and real-world datasets.
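For context, the well-known closed form that the abstract alludes to for quadratic-cost OT between Gaussians can be written out for the finite-dimensional case. This is the standard textbook result, not one of the paper's new contributions; the symbols $m_i$, $\Sigma_i$, and $A$ are notation chosen here, and the map formula assumes $\Sigma_1$ is nonsingular.

```latex
% Squared 2-Wasserstein distance between N(m_1, \Sigma_1) and N(m_2, \Sigma_2) on R^d
W_2^2\bigl(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\bigr)
  = \lVert m_1 - m_2 \rVert^2
  + \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2
      - 2\bigl(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\Bigr)

% Optimal transport map (affine), valid when \Sigma_1 is nonsingular
T(x) = m_2 + A\,(x - m_1),
\qquad
A = \Sigma_1^{-1/2}\bigl(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\bigr)^{1/2}\Sigma_1^{-1/2}
```

In one dimension the distance reduces to $(m_1 - m_2)^2 + (\sigma_1 - \sigma_2)^2$, which makes a quick sanity check.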
[LG-35] Towards Irreversible Machine Unlearning for Diffusion Models
链接: https://arxiv.org/abs/2512.03564
作者: Xun Yuan,Zilong Zhao,Jiayu Li,Aryan Pasikhani,Prosanta Gope,Biplab Sikdar
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Diffusion models are renowned for their state-of-the-art performance in generating synthetic images. However, concerns related to safety, privacy, and copyright highlight the need for machine unlearning, which can make diffusion models forget specific training data and prevent the generation of sensitive or unwanted content. Current machine unlearning methods for diffusion models are primarily designed for conditional diffusion models and focus on unlearning specific data classes or features. Among these methods, finetuning-based machine unlearning methods, which update the parameters of pre-trained diffusion models by minimizing carefully designed loss functions, are recognized for their efficiency and effectiveness. However, in this paper, we propose a novel attack named Diffusion Model Relearning Attack (DiMRA), which can reverse the finetuning-based machine unlearning methods, exposing a significant vulnerability in this kind of technique. Without prior knowledge of the unlearning elements, DiMRA optimizes the unlearned diffusion model on an auxiliary dataset to reverse the unlearning, enabling the model to regenerate previously unlearned elements. To mitigate this vulnerability, we propose a novel machine unlearning method for diffusion models, termed Diffusion Model Unlearning by Memorization (DiMUM). Unlike traditional methods that focus on forgetting, DiMUM memorizes alternative data or features to replace targeted unlearning data or features in order to prevent generating such elements. In our experiments, we demonstrate the effectiveness of DiMRA in reversing state-of-the-art finetuning-based machine unlearning methods for diffusion models, highlighting the need for more robust solutions. We extensively evaluate DiMUM, demonstrating its superior ability to preserve the generative performance of diffusion models while enhancing robustness against DiMRA.
[LG-36] Parameter-Efficient Augment Plugin for Class-Incremental Learning
链接: https://arxiv.org/abs/2512.03537
作者: Zhiming Xu,Baile Xu,Jian Zhao,Furao Shen,Suorong Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages, 6 figures, 2 tables
Abstract:Existing class-incremental learning (CIL) approaches based on replay or knowledge distillation are often constrained by forgetting or the stability-plasticity dilemma. Some expansion-based approaches could achieve higher accuracy. However, they always require significant parameter increases. In this paper, we propose a plugin extension paradigm termed the Deployment of extra LoRA Components (DLC) for non-pre-trained CIL. We treat the feature extractor trained through replay or distillation as a base model with rich knowledge. For each task, we use Low-Rank Adaptation (LoRA) to inject task-specific residuals into the base model’s deep layers. During inference, representations with task-specific residuals are aggregated to produce classification predictions. To mitigate interference from non-target LoRA plugins, we introduce a lightweight weighting unit. This unit learns to assign importance scores to different LoRA-tuned representations. Like downloadable contents in software, our method serves as a plug-and-play enhancement that efficiently extends the base methods. Remarkably, on the large-scale ImageNet-100, with merely 4% of the parameters of a standard ResNet-18, our DLC model achieves a significant 8% improvement in accuracy, demonstrating exceptional efficiency. Moreover, it could surpass state-of-the-art methods under the fixed memory budget.
[LG-37] Adaptive sampling using variational autoencoder and reinforcement learning
链接: https://arxiv.org/abs/2512.03525
作者: Adil Rasheed,Mikael Aleksander Jansen Shahly,Muhammad Faisal Aftab
类目: Machine Learning (cs.LG)
*备注:
Abstract:Compressed sensing (CS) enables sparse sampling but relies on generic bases and random measurements, limiting efficiency and reconstruction quality. Optimal sensor placement (OSP) uses historical data to design tailored sampling patterns, yet its fixed, linear bases cannot adapt to nonlinear or sample-specific variations. Generative model-based compressed sensing improves reconstruction using deep generative priors but still employs suboptimal random sampling. We propose an adaptive sparse sensing framework that couples a variational autoencoder prior with reinforcement learning to select measurements sequentially. Experiments show that this approach outperforms CS, OSP, and generative model-based reconstruction from sparse measurements.
[LG-38] Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation AAAI2026
链接: https://arxiv.org/abs/2512.03521
作者: Xiaosen Lyu,Jiayu Xiong,Yuren Chen,Wanlong Wang,Xiaoqing Dai,Jing Wang
类目: Multimedia (cs.MM); Machine Learning (cs.LG)
*备注: Accepted to AAAI 2026
Abstract:Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers’ emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.
[LG-39] Modal Logical Neural Networks
链接: https://arxiv.org/abs/2512.03491
作者: Antonin Sulc
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
*备注: 27 pages, 10 figures, 7 tables
Abstract:We propose Modal Logical Neural Networks (MLNNs), a neurosymbolic framework that integrates deep learning with the formal semantics of modal logic, enabling reasoning about necessity and possibility. Drawing on Kripke semantics, we introduce specialized neurons for the modal operators $\Box$ and $\Diamond$ that operate over a set of possible worlds, enabling the framework to act as a differentiable “logical guardrail.” The architecture is highly flexible: the accessibility relation between worlds can either be fixed by the user to enforce known rules or, as an inductive feature, be parameterized by a neural network. This allows the model to optionally learn the relational structure of a logical system from data while simultaneously performing deductive reasoning within that structure. This versatile construction is designed for flexibility. The entire framework is differentiable from end to end, with learning driven by minimizing a logical contradiction loss. This not only makes the system resilient to inconsistent knowledge but also enables it to learn nonlinear relationships that can help define the logic of a problem space. We illustrate MLNNs on four case studies: grammatical guardrailing, axiomatic detection of the unknown, multi-agent epistemic trust, and detecting constructive deception in natural language negotiation. These experiments demonstrate how enforcing or learning accessibility can increase logical consistency and interpretability without changing the underlying task architecture.
[LG-40] Joint Progression Modeling (JPM): A Probabilistic Framework for Mixed-Pathology Progression ML4H
链接: https://arxiv.org/abs/2512.03475
作者: Hongtao Hao,Joseph L. Austerweil
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 49 pages; Machine Learning for Health (ML4H) Symposium 2025
Abstract:Event-based models (EBMs) infer disease progression from cross-sectional data, and standard EBMs assume a single underlying disease per individual. In contrast, mixed pathologies are common in neurodegeneration. We introduce the Joint Progression Model (JPM), a probabilistic framework that treats single-disease trajectories as partial rankings and builds a prior over joint progressions. We study several JPM variants (Pairwise, Bradley-Terry, Plackett-Luce, and Mallows) and analyze three properties: (i) calibration – whether lower model energy predicts smaller distance to the ground truth ordering; (ii) separation – the degree to which sampled rankings are distinguishable from random permutations; and (iii) sharpness – the stability of sampled aggregate rankings. All variants are calibrated, and all achieve near-perfect separation; sharpness varies by variant and is well-predicted by simple features of the input partial rankings (number and length of rankings, conflict, and overlap). In synthetic experiments, JPM improves ordering accuracy by roughly 21 percent over a strong EBM baseline (SA-EBM) that treats the joint disease as a single condition. Finally, using NACC, we find that the Mallows variant of JPM and the baseline model (SA-EBM) have results that are more consistent with prior literature on the possible disease progression of the mixed pathology of AD and VaD.
[LG-41] SweetDeep: A Wearable AI Solution for Real-Time Non-Invasive Diabetes Screening
链接: https://arxiv.org/abs/2512.03471
作者: Ian Henriques,Lynda Elhassar,Sarvesh Relekar,Denis Walrave,Shayan Hassantabar,Vishu Ghanakota,Adel Laoui,Mahmoud Aich,Rafia Tir,Mohamed Zerguine,Samir Louafi,Moncef Kimouche,Emmanuel Cosson,Niraj K Jha
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 12 pages, 6 figures. Submitted to the IEEE Journal of Biomedical and Health Informatics
Abstract:The global rise in type 2 diabetes underscores the need for scalable and cost-effective screening methods. Current diagnosis requires biochemical assays, which are invasive and costly. Advances in consumer wearables have enabled early explorations of machine learning-based disease detection, but prior studies were limited to controlled settings. We present SweetDeep, a compact neural network trained on physiological and demographic data from 285 (diabetic and non-diabetic) participants in the EU and MENA regions, collected using Samsung Galaxy Watch 7 devices in free-living conditions over six days. Each participant contributed multiple 2-minute sensor recordings per day, totaling approximately 20 recordings per individual. Despite comprising fewer than 3,000 parameters, SweetDeep achieves 82.5% patient-level accuracy (82.1% macro-F1, 79.7% sensitivity, 84.6% specificity) under three-fold cross-validation, with an expected calibration error of 5.5%. Allowing the model to abstain on less than 10% of low-confidence patient predictions yields an accuracy of 84.5% on the remaining patients. These findings demonstrate that combining engineered features with lightweight architectures can support accurate, rapid, and generalizable detection of type 2 diabetes in real-world wearable settings.
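The abstention step described here (withholding a prediction for low-confidence patients) can be sketched as follows. This is an assumption-laden illustration: the abstract does not say how the roughly 20 recordings per patient are aggregated or how confidence is measured, so the sketch averages per-recording probabilities and abstains when the average is within `min_conf` of the decision threshold. `patient_decisions` and both parameters are hypothetical.

```python
def patient_decisions(recording_probs, threshold=0.5, min_conf=0.1):
    """Aggregate per-recording probabilities into one decision per patient.

    Returns True (positive), False (negative), or None (abstain) per patient.
    Mean aggregation and the confidence rule are assumptions of this sketch.
    """
    decisions = {}
    for patient_id, probs in recording_probs.items():
        p = sum(probs) / len(probs)  # patient-level score
        if abs(p - threshold) < min_conf:
            decisions[patient_id] = None  # too close to the boundary
        else:
            decisions[patient_id] = p >= threshold
    return decisions


example = {
    "patient_a": [0.9, 0.8],
    "patient_b": [0.55, 0.5],
    "patient_c": [0.1, 0.2],
}
result = patient_decisions(example)
```

Accuracy is then reported only over patients with a non-`None` decision, which is why it can rise as the abstention rate grows.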
[LG-42] Bayesian Event-Based Model for Disease Subtype and Stage Inference
链接: https://arxiv.org/abs/2512.03467
作者: Hongtao Hao,Joseph L. Austerweil
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 32 pages; machine learning for health symposium (2025); Proceedings of the 5th Machine Learning for Health Symposium in PMLR
Abstract:Chronic diseases often progress differently across patients. Rather than randomly varying, there are typically a small number of subtypes for how a disease progresses across patients. To capture this structured heterogeneity, the Subtype and Stage Inference Event-Based Model (SuStaIn) estimates the number of subtypes, the order of disease progression for each subtype, and assigns each patient to a subtype from primarily cross-sectional data. It has been widely applied to uncover the subtypes of many diseases and inform our understanding of them. But how robust is its performance? In this paper, we develop a principled Bayesian subtype variant of the event-based model (BEBMS) and compare its performance to SuStaIn in a variety of synthetic data experiments with varied levels of model misspecification. BEBMS substantially outperforms SuStaIn across ordering, staging, and subtype assignment tasks. Further, we apply BEBMS and SuStaIn to a real-world Alzheimer’s data set. We find BEBMS has results that are more consistent with the scientific consensus of Alzheimer’s disease progression than SuStaIn.
[LG-43] Multi-Modal Opinion Integration for Financial Sentiment Analysis using Cross-Modal Attention
链接: https://arxiv.org/abs/2512.03464
作者: Yujing Liu,Chen Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In recent years, financial sentiment analysis of public opinion has become increasingly important for market forecasting and risk assessment. However, existing methods often struggle to effectively integrate diverse opinion modalities and capture fine-grained interactions across them. This paper proposes an end-to-end deep learning framework that integrates two distinct modalities of financial opinions: recency modality (timely opinions) and popularity modality (trending opinions), through a novel cross-modal attention mechanism specifically designed for financial sentiment analysis. While both modalities consist of textual data, they represent fundamentally different information channels: recency-driven market updates versus popularity-driven collective sentiment. Our model first uses BERT (Chinese-wwm-ext) for feature embedding and then employs our proposed Financial Multi-Head Cross-Attention (FMHCA) structure to facilitate information exchange between these distinct opinion modalities. The processed features are optimized through a transformer layer and fused using multimodal factored bilinear pooling for classification into negative, neutral, and positive sentiment. Extensive experiments on a comprehensive dataset covering 837 companies demonstrate that our approach achieves an accuracy of 83.5%, significantly outperforming baselines including BERT+Transformer by 21 percent. These results highlight the potential of our framework to support more accurate financial decision-making and risk management.
[LG-44] A Hybrid Deep Learning and Anomaly Detection Framework for Real-Time Malicious URL Classification
链接: https://arxiv.org/abs/2512.03462
作者: Berkani Khaled,Zeraoulia Rafik
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages, 2 figures
Abstract:Malicious URLs remain a primary vector for phishing, malware, and cyberthreats. This study proposes a hybrid deep learning framework combining HashingVectorizer n-gram analysis, SMOTE balancing, Isolation Forest anomaly filtering, and a lightweight neural network classifier for real-time URL classification. The multi-stage pipeline processes URLs from open-source repositories with statistical features (length, dot count, entropy), achieving O(NL + EBdh) training complexity and a 20 ms prediction latency. Empirical evaluation yields 96.4% accuracy, 95.4% F1-score, and 97.3% ROC-AUC, outperforming CNN (94.8%) and SVM baselines with a 50× to 100× speedup (see the computational-complexity table in the paper). A multilingual Tkinter GUI (Arabic/English/French) enables real-time threat assessment with clipboard integration. The framework demonstrates superior scalability and resilience against obfuscated URL patterns.
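The three statistical features named in this abstract (length, dot count, entropy) are simple to compute. The sketch below assumes "entropy" means character-level Shannon entropy in bits, which the abstract does not specify; `url_features` is a hypothetical helper and covers only the feature step, not the vectorizer, SMOTE, Isolation Forest, or classifier stages.

```python
import math
from collections import Counter


def url_features(url):
    """Return length, dot count, and character-level Shannon entropy (bits)."""
    n = len(url)
    counts = Counter(url)
    # H = -sum(p * log2 p) over the character frequencies p = count / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    return {"length": n, "dot_count": url.count("."), "entropy": entropy}


features = url_features("aabb")
dotted = url_features("a.b.c")
```

Long, random-looking URLs tend to score higher on entropy than short dictionary-like ones, which is the usual motivation for including it as a feature.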
[LG-45] Grokked Models are Better Unlearners
链接: https://arxiv.org/abs/2512.03437
作者: Yuanbang Liang,Yang Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Grokking, delayed generalization that emerges well after a model has fit the training data, has been linked to robustness and representation quality. We ask whether this training regime also helps with machine unlearning, i.e., removing the influence of specified data without full retraining. We compare applying standard unlearning methods before versus after the grokking transition across vision (CNNs/ResNets on CIFAR, SVHN, and ImageNet) and language (a transformer on a TOFU-style setup). Starting from grokked checkpoints consistently yields (i) more efficient forgetting (fewer updates to reach a target forget level), (ii) less collateral damage (smaller drops on retained and test performance), and (iii) more stable updates across seeds, relative to early-stopped counterparts under identical unlearning algorithms. Analyses of features and curvature further suggest that post-grokking models learn more modular representations with reduced gradient alignment between forget and retain subsets, which facilitates selective forgetting. Our results highlight when a model is trained (pre- vs. post-grokking) as an orthogonal lever to how unlearning is performed, providing a practical recipe to improve existing unlearning methods without altering their algorithms.
[LG-46] GaussDetect-LiNGAM: Causal Direction Identification without Gaussianity test
链接: https://arxiv.org/abs/2512.03428
作者: Ziyi Ding,Xiao-Ping Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We propose GaussDetect-LiNGAM, a novel approach for bivariate causal discovery that eliminates the need for explicit Gaussianity tests by leveraging a fundamental equivalence between noise Gaussianity and residual independence in the reverse regression. Under the standard LiNGAM assumptions of linearity, acyclicity, and exogeneity, we prove that the Gaussianity of the forward-model noise is equivalent to the independence between the regressor and residual in the reverse model. This theoretical insight allows us to replace fragile and sample-sensitive Gaussianity tests with robust kernel-based independence tests. Experimental results validate the equivalence and demonstrate that GaussDetect-LiNGAM maintains high consistency across diverse noise types and sample sizes, while reducing the number of tests per decision (TPD). Our method enhances both the efficiency and practical applicability of causal inference, making LiNGAM more accessible and reliable in real-world scenarios.
[LG-47] Comparative algorithm performance evaluation and prediction for the maximum clique problem using instance space analysis
链接: https://arxiv.org/abs/2512.03419
作者: Bharat Sharman,Elkafi Hassini
类目: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:
Abstract:The maximum clique problem, a well-known graph-based combinatorial optimization problem, has been addressed through various algorithmic approaches, though systematic analyses of the problem instances remain sparse. This study employs the instance space analysis (ISA) methodology to systematically analyze the instance space of this problem and to assess and predict the performance of state-of-the-art (SOTA) algorithms, including exact, heuristic, and graph neural network (GNN)-based methods. A dataset was compiled using graph instances from TWITTER, COLLAB and IMDB-BINARY benchmarks commonly used in graph machine learning research. A set of 33 generic and 2 problem-specific polynomial-time-computable graph-based features, including several spectral properties, was employed for the ISA. A composite performance measure incorporating both solution quality and algorithm runtime was utilized. The comparative analysis demonstrated that the exact algorithm Mixed Order Maximum Clique (MOMC) exhibited superior performance across approximately 74.7% of the instance space constituted by the compiled dataset. Gurobi and CliSAT accounted for superior performance in 13.8% and 11% of the instance space, respectively. The ISA-based algorithm performance prediction model run on 34 challenging test instances compiled from the BHOSLIB and DIMACS datasets yielded top-1 and top-2 best performing algorithm prediction accuracies of 88% and 97%, respectively.
[LG-48] Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value
链接: https://arxiv.org/abs/2512.03399
作者: Joe Edelman,Tan Zhi-Xuan,Ryan Lowe,Oliver Klingefjord,Vincent Wang-Mascianica,Matija Franklin,Ryan Othniel Kearns,Ellie Hain,Atrisha Sarkar,Michiel Bakker,Fazl Barez,David Duvenaud,Jakob Foerster,Iason Gabriel,Joseph Gubbels,Bryce Goodman,Andreas Haupt,Jobst Heitzig,Julian Jara-Ettinger,Atoosa Kasirzadeh,James Ravi Kirkpatrick,Andrew Koh,W. Bradley Knox,Philipp Koralus,Joel Lehman,Sydney Levine,Samuele Marro,Manon Revel,Toby Shorin,Morgan Sutherland,Michael Henry Tessler,Ivan Vendrov,James Wilken-Smith
类目: Machine Learning (cs.LG)
*备注:
Abstract:Beneficial societal outcomes cannot be guaranteed by aligning individual AI systems with the intentions of their operators or users. Even an AI system that is perfectly aligned to the intentions of its operating organization can lead to bad outcomes if the goals of that organization are misaligned with those of other institutions and individuals. For this reason, we need full-stack alignment, the concurrent alignment of AI systems and the institutions that shape them with what people value. This can be done without imposing a particular vision of individual or collective flourishing. We argue that current approaches for representing values, such as utility functions, preference orderings, or unstructured text, struggle to address these and other issues effectively. They struggle to distinguish values from other signals, to support principled normative reasoning, and to model collective goods. We propose that thick models of value will be needed. These structure the way values and norms are represented, enabling systems to distinguish enduring values from fleeting preferences, to model the social embedding of individual choices, and to reason normatively, applying values in new domains. We demonstrate this approach in five areas: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions.
[LG-49] Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization
链接: https://arxiv.org/abs/2512.03393
作者: Lakshmi Jayalal,Sheetal Kalyani
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Recovering jointly sparse signals in the multiple measurement vectors (MMV) setting is a fundamental problem in machine learning, but traditional methods like multiple measurement vectors orthogonal matching pursuit (M-OMP) and multiple measurement vectors FOCal Underdetermined System Solver (M-FOCUSS) often require careful parameter tuning or prior knowledge of the sparsity of the signal and/or noise variance. We introduce a novel tuning-free framework that leverages Implicit Regularization (IR) from overparameterization to overcome this limitation. Our approach reparameterizes the estimation matrix into factors that decouple the shared row-support from individual vector entries. We show that the optimization dynamics inherently promote the desired row-sparse structure by applying gradient descent to a standard least-squares objective on these factors. We prove that with a sufficiently small and balanced initialization, the optimization dynamics exhibit a “momentum-like” effect, causing the norms of rows in the true support to grow significantly faster than others. This formally guarantees that the solution trajectory converges towards an idealized row-sparse solution. Additionally, empirical results demonstrate that our approach achieves performance comparable to established methods without requiring any prior information or tuning.
[LG-50] MAGE-ID: A Multimodal Generative Framework for Intrusion Detection Systems
Link: https://arxiv.org/abs/2512.03375
Authors: Mahdi Arab Loodaricheh,Mohammad Hossein Manshaei,Anita Raja
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Modern Intrusion Detection Systems (IDS) face severe challenges due to heterogeneous network traffic, evolving cyber threats, and pronounced data imbalance between benign and attack flows. While generative models have shown promise in data augmentation, existing approaches are limited to single modalities and fail to capture cross-domain dependencies. This paper introduces MAGE-ID (Multimodal Attack Generator for Intrusion Detection), a diffusion-based generative framework that couples tabular flow features with their transformed images through a unified latent prior. By jointly training Transformer and CNN-based variational encoders with an EDM style denoiser, MAGE-ID achieves balanced and coherent multimodal synthesis. Evaluations on CIC-IDS-2017 and NSL-KDD demonstrate significant improvements in fidelity, diversity, and downstream detection performance over TabSyn and TabDDPM, highlighting the effectiveness of MAGE-ID for multimodal IDS augmentation.
[LG-51] A2G-QFL: Adaptive Aggregation with Two Gains in Quantum Federated learning
Link: https://arxiv.org/abs/2512.03363
Authors: Shanika Iroshi Nanayakkara,Shiva Raj Pokhrel
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 8 pages, 4 figures, QCNC 2026
Abstract:Federated learning (FL) deployed over quantum-enabled and heterogeneous classical networks faces significant performance degradation due to uneven client quality, stochastic teleportation fidelity, device instability, and geometric mismatch between local and global models. Classical aggregation rules assume Euclidean topology and uniform communication reliability, limiting their suitability for emerging quantum federated systems. This paper introduces A2G (Adaptive Aggregation with Two Gains), a dual-gain framework that jointly regulates geometric blending through a geometry gain and modulates client importance using a QoS gain derived from teleportation fidelity, latency, and instability. We develop the A2G update rule, establish convergence guarantees under smoothness and bounded variance assumptions, and show that A2G recovers FedAvg, QoS-aware averaging, and manifold-based aggregation as special cases. Experiments on a quantum-classical hybrid testbed demonstrate improved stability and higher accuracy under heterogeneous and noisy conditions.
[LG-52] Breaking Determinism: Stochastic Modeling for Reliable Off-Policy Evaluation in Ad Auctions
Link: https://arxiv.org/abs/2512.03354
Authors: Hongseon Yeom,Jaeyoul Shin,Soojin Min,Jeongmin Yoon,Seunghak Yu,Dongyeop Kang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Online A/B testing, the gold standard for evaluating new advertising policies, consumes substantial engineering resources and risks significant revenue loss from deploying underperforming variations. This motivates the use of Off-Policy Evaluation (OPE) for rapid, offline assessment. However, applying OPE to ad auctions is fundamentally more challenging than in domains like recommender systems, where stochastic policies are common. In online ad auctions, the highest-bidding ad typically wins the impression, resulting in a deterministic, winner-takes-all setting. Non-winning ads therefore have zero probability of exposure, rendering standard OPE estimators inapplicable. We introduce the first principled framework for OPE in deterministic auctions by repurposing the bid landscape model to approximate the propensity score. This model allows us to derive robust approximate propensity scores, enabling the use of stable estimators like Self-Normalized Inverse Propensity Scoring (SNIPS) for counterfactual evaluation. We validate our approach on the AuctionNet simulation benchmark and against a two-week online A/B test from a large-scale industrial platform. Our method shows remarkable alignment with online results, achieving a 92% Mean Directional Accuracy (MDA) in CTR prediction, significantly outperforming the parametric baseline. MDA is the most critical metric for guiding deployment decisions, as it reflects the ability to correctly predict whether a new model will improve or harm performance. This work contributes the first practical and validated framework for reliable OPE in deterministic auction environments, offering an efficient alternative to costly and risky online experiments.
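The SNIPS estimator named in this abstract has a standard textbook form, sketched below to show why it needs non-zero logging propensities. This is a minimal illustration only: the function name and the propensity values are hypothetical, and the paper's bid-landscape approximation of the propensity score is not reproduced here.

```python
from typing import Sequence


def snips_estimate(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    logging_probs: Sequence[float],
) -> float:
    """Self-normalized inverse propensity scoring (SNIPS) value estimate.

    Each logged sample i gets an importance weight
    w_i = target_probs[i] / logging_probs[i]; the estimate is
    sum(w_i * r_i) / sum(w_i).
    """
    if not (len(rewards) == len(target_probs) == len(logging_probs)):
        raise ValueError("all inputs must have the same length")
    weights = []
    for p_target, p_logging in zip(target_probs, logging_probs):
        # A zero logging propensity is exactly the deterministic-auction
        # problem: the importance weight is undefined.
        if p_logging <= 0.0:
            raise ValueError("logging propensity must be positive")
        weights.append(p_target / p_logging)
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all importance weights are zero")
    return sum(w * r for w, r in zip(weights, rewards)) / total_weight


if __name__ == "__main__":
    # Hypothetical logged clicks with made-up propensities.
    print(snips_estimate([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]))
```

Dividing by the sum of weights rather than the sample count is what distinguishes SNIPS from plain inverse propensity scoring; it trades a small bias for lower variance.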
[LG-53] Associating Healthcare Teamwork with Patient Outcomes for Predictive Analysis
Link: https://arxiv.org/abs/2512.03296
Authors: Hsiao-Ying Lu,Kwan-Liu Ma
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Cancer treatment outcomes are influenced not only by clinical and demographic factors but also by the collaboration of healthcare teams. However, prior work has largely overlooked the potential role of human collaboration in shaping patient survival. This paper presents an applied AI approach to uncovering the impact of healthcare professionals’ (HCPs) collaboration, as captured through electronic health record (EHR) systems, on cancer patient outcomes. We model EHR-mediated HCP interactions as networks and apply machine learning techniques to detect predictive signals of patient survival embedded in these collaborations. Our models are cross-validated to ensure generalizability, and we explain the predictions by identifying key network traits associated with improved outcomes. Importantly, clinical experts and literature validate the relevance of the identified crucial collaboration traits, reinforcing their potential for real-world applications. This work contributes to a practical workflow for leveraging digital traces of collaboration and AI to assess and improve team-based healthcare. The approach is potentially transferable to other domains involving complex collaboration and offers actionable insights to support data-informed interventions in healthcare delivery.
[LG-54] ASPEN: An Adaptive Spectral Physics-Enabled Network for Ginzburg-Landau Dynamics
Link: https://arxiv.org/abs/2512.03290
Authors: Julian Evan Chrisnanto,Nurfauzi Fadillah,Yulison Herry Chrisnanto
Categories: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 15 pages, 7 figures
Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a powerful, mesh-free paradigm for solving partial differential equations (PDEs). However, they notoriously struggle with stiff, multi-scale, and nonlinear systems due to the inherent spectral bias of standard multilayer perceptron (MLP) architectures, which prevents them from adequately representing high-frequency components. In this work, we introduce the Adaptive Spectral Physics-Enabled Network (ASPEN), a novel architecture designed to overcome this critical limitation. ASPEN integrates an adaptive spectral layer with learnable Fourier features directly into the network’s input stage. This mechanism allows the model to dynamically tune its own spectral basis during training, enabling it to efficiently learn and represent the precise frequency content required by the solution. We demonstrate the efficacy of ASPEN by applying it to the complex Ginzburg-Landau equation (CGLE), a canonical and challenging benchmark for nonlinear, stiff spatio-temporal dynamics. Our results show that a standard PINN architecture catastrophically fails on this problem, diverging into non-physical oscillations. In contrast, ASPEN successfully solves the CGLE with exceptional accuracy. The predicted solution is visually indistinguishable from the high-resolution ground truth, achieving a low median physics residual of 5.10 x 10^-3. Furthermore, we validate that ASPEN’s solution is not only pointwise accurate but also physically consistent, correctly capturing emergent physical properties, including the rapid free energy relaxation and the long-term stability of the domain wall front. This work demonstrates that by incorporating an adaptive spectral basis, our framework provides a robust and physically-consistent solver for complex dynamical systems where standard PINNs fail, opening new options for machine learning in challenging physical domains.
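To make the idea of a Fourier-feature input stage concrete, here is a minimal sketch of the generic sin/cos feature map for a scalar input. It is an assumption-laden illustration: the function name and frequency values are hypothetical, the frequencies are a plain list rather than trainable parameters, and the abstract does not state how ASPEN parameterizes or updates its spectral layer.

```python
import math
from typing import List, Sequence


def fourier_features(x: float, frequencies: Sequence[float]) -> List[float]:
    """Map a scalar x to [sin(2*pi*b*x), cos(2*pi*b*x)] for each frequency b.

    In a learnable variant the entries of `frequencies` would be model
    parameters updated by the optimizer; here they are fixed numbers.
    """
    features: List[float] = []
    for b in frequencies:
        angle = 2.0 * math.pi * b * x
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


if __name__ == "__main__":
    # Hypothetical frequencies; higher values expose higher-frequency content
    # of the input to the downstream network.
    print(fourier_features(0.25, [1.0, 2.0, 4.0]))
```

Feeding these features to an MLP instead of the raw coordinate is a common way to counter spectral bias, since the high-frequency structure is already present in the input representation.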
[LG-55] Multi-Frequency Federated Learning for Human Activity Recognition Using Head-Worn Sensors
Link: https://arxiv.org/abs/2512.03287
Authors: Dario Fenoglio,Mohan Li,Davide Casnici,Matias Laporte,Shkurta Gashi,Silvia Santini,Martin Gjoreski,Marc Langheinrich
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages, 2024 International Conference on Intelligent Environments (IE), 2024
Abstract:Human Activity Recognition (HAR) benefits various application domains, including health and elderly care. Traditional HAR involves constructing pipelines reliant on centralized user data, which can pose privacy concerns as they necessitate the uploading of user data to a centralized server. This work proposes multi-frequency Federated Learning (FL) to enable: (1) privacy-aware ML; (2) joint ML model learning across devices with varying sampling frequency. We focus on head-worn devices (e.g., earbuds and smart glasses), a relatively unexplored domain compared to traditional smartwatch- or smartphone-based HAR. Results have shown improvements on two datasets against frequency-specific approaches, indicating a promising future in the multi-frequency FL-HAR task. The proposed network’s implementation is publicly available for further research and development.
[LG-56] Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
Link: https://arxiv.org/abs/2512.03276
Authors: Constantin Venhoff,Ashkan Khakzar,Sonia Joseph,Philip Torr,Neel Nanda
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Training vision language models (VLMs) aims to align visual representations from a vision encoder with the textual representations of a pretrained large language model (LLM). However, many VLMs exhibit reduced factual recall performance compared to their LLM backbones, raising the question of how effective multimodal fine-tuning is at extending existing mechanisms within the LLM to visual inputs. We argue that factual recall based on visual inputs requires VLMs to solve a two-hop problem: (1) forming entity representations from visual inputs, and (2) recalling associated factual knowledge based on these entity representations. By benchmarking 14 VLMs with various architectures (LLaVA, Native, Cross-Attention), sizes (7B-124B parameters), and training setups on factual recall tasks against their original LLM backbone models, we find that 11 of 14 models exhibit factual recall degradation. We select three models with high and two models with low performance degradation, and use attribution patching, activation patching, and probing to show that degraded VLMs struggle to use the existing factual recall circuit of their LLM backbone, because they resolve the first hop too late in the computation. In contrast, high-performing VLMs resolve entity representations early enough to reuse the existing factual recall mechanism. Finally, we demonstrate two methods to recover performance: patching entity representations from the LLM backbone into the VLM, and prompting with chain-of-thought reasoning. Our results highlight that the speed of early entity resolution critically determines how effective VLMs are in using preexisting LLM mechanisms. More broadly, our work illustrates how mechanistic analysis can explain and unveil systematic failures in multimodal alignment.
[LG-57] Perch 2.0 transfers whale to underwater tasks NEURIPS2025
Link: https://arxiv.org/abs/2512.03219
Authors: Andrea Burns,Lauren Harrell,Bart van Merriënboer,Vincent Dumoulin,Jenny Hamer,Tom Denton
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: AI for Non-Human Animal Communication
Abstract:Perch 2.0 is a supervised bioacoustics foundation model pretrained on 14,597 species, including birds, mammals, amphibians, and insects, and has state-of-the-art performance on multiple benchmarks. Given that Perch 2.0 includes almost no marine mammal audio or classes in the training data, we evaluate Perch 2.0 performance on marine mammal and underwater audio tasks through few-shot transfer learning. We perform linear probing with the embeddings generated from this foundation model and compare performance to other pretrained bioacoustics models. In particular, we compare Perch 2.0 with previous multispecies whale, Perch 1.0, SurfPerch, AVES-bio, BirdAVES, and Birdnet V2.3 models, which have open-source tools for transfer-learning and agile modeling. We show that the embeddings from the Perch 2.0 model have consistently high performance for few-shot transfer learning, outperforming alternative embedding models on the majority of tasks. We therefore recommend Perch 2.0 when developing new linear classifiers for marine mammal classification with few labeled examples.
[LG-58] A Multi-Agent Policy-Gradient approach to Network Routing
Link: https://arxiv.org/abs/2512.03211
Authors: Nigel Tao,Jonathan Baxter,Lex Weaver
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. OLPOMDP, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned co-operative behavior without explicit inter-agent communication, and they avoided behavior which was individually desirable, but detrimental to the group’s overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub-optimal behavior was found to dramatically improve the convergence rate.
[LG-59] Scaling Internal-State Policy-Gradient Methods for POMDPs
Link: https://arxiv.org/abs/2512.03204
Authors: Douglas Aberdeen,Jonathan Baxter
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Policy-gradient methods have received increased attention recently as a mechanism for learning to act in partially observable environments. They have shown promise for problems admitting memoryless policies but have been less successful when memory is required. In this paper we develop several improved algorithms for learning policies with memory in an infinite-horizon setting – directly when a known model of the environment is available, and via simulation otherwise. We compare these algorithms on some large POMDPs, including noisy robot navigation and multi-agent problems.
[LG-60] GRAND: Guidance Rebalancing and Assignment for Networked Dispatch in Multi-Agent Path Finding
Link: https://arxiv.org/abs/2512.03194
Authors: Johannes Gaber,Meshal Alharbi,Daniele Gammelli,Gioele Zardini
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. On congested warehouse benchmarks from the League of Robot Runners (LRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.
[LG-61] Neighborhood density estimation using space-partitioning based hashing schemes
Link: https://arxiv.org/abs/2512.03187
Authors: Aashi Jindal
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2011.03729
Abstract:This work introduces FiRE/FiRE.1, a novel sketching-based algorithm for anomaly detection to quickly identify rare cell sub-populations in large-scale single-cell RNA sequencing data. This method demonstrated superior performance against state-of-the-art techniques. Furthermore, the thesis proposes Enhash, a fast and resource-efficient ensemble learner that uses projection hashing to detect concept drift in streaming data, proving highly competitive in time and accuracy across various drift types.
[LG-62] Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing
Link: https://arxiv.org/abs/2512.03158
Authors: Adele Chinda,Richmond Azumah,Hemanth Demakethepalli Venkateswara
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 13 pages, 4 figures
Abstract:Wastewater-based genomic surveillance has emerged as a powerful tool for population-level viral monitoring, offering comprehensive insights into circulating viral variants across entire communities. However, this approach faces significant computational challenges stemming from high sequencing noise, low viral coverage, fragmented reads, and the complete absence of labeled variant annotations. Traditional reference-based variant calling pipelines struggle with novel mutations and require extensive computational resources. We present a comprehensive framework for unsupervised viral variant detection using Vector-Quantized Variational Autoencoders (VQ-VAE) that learns discrete codebooks of genomic patterns from k-mer tokenized sequences without requiring reference genomes or variant labels. Our approach extends the base VQ-VAE architecture with masked reconstruction pretraining for robustness to missing data and contrastive learning for highly discriminative embeddings. Evaluated on SARS-CoV-2 wastewater sequencing data comprising approximately 100,000 reads, our VQ-VAE achieves 99.52% mean token-level accuracy and 56.33% exact sequence match rate while maintaining 19.73% codebook utilization (101 of 512 codes active), demonstrating efficient discrete representation learning. Contrastive fine-tuning with different projection dimensions yields substantial clustering improvements: 64-dimensional embeddings achieve +35% Silhouette score improvement (0.31 to 0.42), while 128-dimensional embeddings achieve +42% improvement (0.31 to 0.44), clearly demonstrating the impact of embedding dimensionality on variant discrimination capability. Our reference-free framework provides a scalable, interpretable approach to genomic surveillance with direct applications to public health monitoring.
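The percentages quoted in this abstract can be checked directly from the raw figures it gives. The short sketch below only recomputes those ratios; the helper name is hypothetical and no data beyond the abstract's own numbers is used.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change (after - before) / before."""
    return (after - before) / before


# Figures taken from the abstract.
codebook_utilization = 101 / 512                  # active codes / codebook size
gain_64d = relative_improvement(0.31, 0.42)       # Silhouette, 64-dim embeddings
gain_128d = relative_improvement(0.31, 0.44)      # Silhouette, 128-dim embeddings

if __name__ == "__main__":
    print(f"codebook utilization: {codebook_utilization:.2%}")
    print(f"64-dim Silhouette gain: {gain_64d:.0%}")
    print(f"128-dim Silhouette gain: {gain_128d:.0%}")
```

The recomputed values round to 19.73%, 35%, and 42%, matching the reported figures, so the quoted improvements are relative rather than absolute changes in Silhouette score.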
[LG-63] Real-Time Structural Health Monitoring with Bayesian Neural Networks: Distinguishing Aleatoric and Epistemic Uncertainty for Digital Twin Frameworks
Link: https://arxiv.org/abs/2512.03115
Authors: Hanbin Cho,Jecheon Yu,Hyeonbin Moon,Jiyoung Yoon,Junhyeong Lee,Giyoung Kim,Jinhyoung Park,Seunghwa Ryu
Categories: Machine Learning (cs.LG)
Comments: 37 pages, 13 figures
Abstract:Reliable real-time analysis of sensor data is essential for structural health monitoring (SHM) of high-value assets, yet a major challenge is to obtain spatially resolved full-field aleatoric and epistemic uncertainties for trustworthy decision-making. We present an integrated SHM framework that combines principal component analysis (PCA), a Bayesian neural network (BNN), and Hamiltonian Monte Carlo (HMC) inference, mapping sparse strain gauge measurements onto leading PCA modes to reconstruct full-field strain distributions with uncertainty quantification. The framework was validated through cyclic four-point bending tests on carbon fiber reinforced polymer (CFRP) specimens with varying crack lengths, achieving accurate strain field reconstruction (R squared value 0.9) while simultaneously producing real-time uncertainty fields. A key contribution is that the BNN yields robust full-field strain reconstructions from noisy experimental data with crack-induced strain singularities, while also providing explicit representations of two complementary uncertainty fields. Considered jointly in full-field form, the aleatoric and epistemic uncertainty fields make it possible to diagnose at a local level, whether low-confidence regions are driven by data-inherent issues or by model-related limitations, thereby supporting reliable decision-making. Collectively, the results demonstrate that the proposed framework advances SHM toward trustworthy digital twin deployment and risk-aware structural diagnostics.
[LG-64] Temporal Graph Neural Networks for Early Anomaly Detection and Performance Prediction via PV System Monitoring Data
Link: https://arxiv.org/abs/2512.03114
Authors: Srijani Mukherjee(INES, USMB (Université de Savoie) (Université de Chambéry)),Laurent Vuillon(LAMA),Liliane Bou Nassif(CETHIL, INSA Lyon, CNRS),Stéphanie Giroux-Julien(LAGEPP),Hervé Pabiou(CETHIL),Denys Dutykh(KUSTAR),Ionnasis Tsanakas(LITEN / CEA-DES)
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:
Abstract:The rapid growth of solar photovoltaic (PV) systems necessitates advanced methods for performance monitoring and anomaly detection to ensure optimal operation. In this study, we propose a novel approach leveraging a Temporal Graph Neural Network (Temporal GNN) to predict solar PV output power and detect anomalies using environmental and operational parameters. The proposed model utilizes graph-based temporal relationships among key PV system parameters, including irradiance, module temperature, and ambient temperature, to predict electrical power output. This study is based on data collected from an outdoor rooftop facility in Lyon (France), including power measurements from a PV module and meteorological parameters.
[LG-65] A Discrete Neural Operator with Adaptive Sampling for Surrogate Modeling of Parametric Transient Darcy Flows in Porous Media
Link: https://arxiv.org/abs/2512.03113
Authors: Zhenglong Chen,Zhao Zhang,Xia Yan,Jiayu Zhai,Piyang Liu,Kai Zhang
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:This study proposes a new discrete neural operator for surrogate modeling of transient Darcy flow fields in heterogeneous porous media with random parameters. The new method integrates temporal encoding, operator learning and UNet to approximate the mapping between vector spaces of random parameters and spatiotemporal flow fields. The new discrete neural operator can achieve higher prediction accuracy than the SOTA attention-residual-UNet structure. Transmissibility matrices, derived from the finite volume method, are adopted as the surrogate inputs instead of permeability to further enhance prediction accuracy. To increase sampling efficiency, a generative latent space adaptive sampling method is developed, employing a Gaussian mixture model for density estimation of the generalization error. Validation is conducted on test cases of 2D/3D single- and two-phase Darcy flow field prediction. Results reveal consistent enhancement in prediction accuracy given a limited training set.
[LG-66] Many-to-One Adversarial Consensus: Exposing Multi-Agent Collusion Risks in AI-Based Healthcare
Link: https://arxiv.org/abs/2512.03097
Authors: Adeela Bashir,TheAnh han,Zia Ush Shamszaman
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 7 pages Conference level paper
Abstract:The integration of large language models (LLMs) into healthcare IoT systems promises faster decisions and improved medical support. LLMs are also deployed as multi-agent teams to assist AI doctors by debating, voting, or advising on decisions. However, when multiple assistant agents interact, coordinated adversaries can collude to create false consensus, pushing an AI doctor toward harmful prescriptions. We develop an experimental framework with scripted and unscripted doctor agents, adversarial assistants, and a verifier agent that checks decisions against clinical guidelines. Using 50 representative clinical questions, we find that collusion drives the Attack Success Rate (ASR) and Harmful Recommendation Rates (HRR) up to 100% in unprotected systems. In contrast, the verifier agent restores 100% accuracy by blocking adversarial consensus. This work provides the first systematic evidence of collusion risk in AI healthcare and demonstrates a practical, lightweight defence that ensures guideline fidelity.
[LG-67] Risk-Entropic Flow Matching
Link: https://arxiv.org/abs/2512.03078
Authors: Vahid R. Ramezani,Benjamin Englard
Categories: Machine Learning (cs.LG)
Comments: 29 pages, 5 figures
Abstract:Tilted (entropic) risk, obtained by applying a log-exponential transform to a base loss, is a well established tool in statistics and machine learning for emphasizing rare or high loss events while retaining a tractable optimization problem. In this work, our aim is to interpret its structure for Flow Matching (FM). FM learns a velocity field that transports samples from a simple source distribution to data by integrating an ODE. In rectified FM, training pairs are obtained by linearly interpolating between a source sample and a data sample, and a neural velocity field is trained to predict the straight line displacement using a mean squared error loss. This squared loss collapses all velocity targets that reach the same space-time point into a single conditional mean, thereby ignoring higher order conditional information (variance, skewness, multi-modality) that encodes fine geometric structure about the data manifold and minority branches. We apply the standard risk-sensitive (log-exponential) transform to the conditional FM loss and show that the resulting tilted risk loss is a natural upper-bound on a meaningful conditional entropic FM objective defined at each space-time point. Furthermore, we show that a small order expansion of the gradient of this conditional entropic objective yields two interpretable first order corrections: covariance preconditioning of the FM residual, and a skew tail term that favors asymmetric or rare branches. On synthetic data designed to probe ambiguity and tails, the resulting risk-sensitive loss improves statistical metrics and recovers geometric structure more faithfully than standard rectified FM.
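The log-exponential transform this abstract builds on can be illustrated on a handful of per-sample losses. The sketch below uses the usual definition, (1/t) * log(mean(exp(t * loss))), and is only that generic definition: the function name, the tilt values, and the sample losses are hypothetical, and the paper's conditional FM objective and its gradient expansion are not reproduced.

```python
import math
from typing import Sequence


def tilted_risk(losses: Sequence[float], t: float) -> float:
    """Tilted (entropic) risk (1/t) * log(mean(exp(t * loss))).

    Computed with the log-sum-exp trick for numerical stability.
    For t == 0 the limit is the ordinary mean, which is returned directly.
    """
    if len(losses) == 0:
        raise ValueError("losses must be non-empty")
    if t == 0.0:
        return sum(losses) / len(losses)
    scaled = [t * loss for loss in losses]
    shift = max(scaled)
    log_mean_exp = shift + math.log(
        sum(math.exp(s - shift) for s in scaled) / len(scaled)
    )
    return log_mean_exp / t


if __name__ == "__main__":
    # Hypothetical losses with one rare, large value.
    sample_losses = [0.1, 0.2, 5.0]
    for tilt in (0.0, 1.0, 10.0):
        print(tilt, tilted_risk(sample_losses, tilt))
```

For positive t the tilted risk lies between the mean and the maximum loss and moves toward the maximum as t grows, which is the sense in which it emphasizes rare or high-loss events.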
[LG-68] Model-Agnostic Fairness Regularization for GNNs with Incomplete Sensitive Information
Link: https://arxiv.org/abs/2512.03074
Authors: Mahdi Tavassoli Kejani,Fadi Dornaika,Jean-Michel Loubes
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:
Abstract:Graph Neural Networks (GNNs) have demonstrated exceptional efficacy in relational learning tasks, including node classification and link prediction. However, their application raises significant fairness concerns, as GNNs can perpetuate and even amplify societal biases against protected groups defined by sensitive attributes such as race or gender. These biases are often inherent in the node features, structural topology, and message-passing mechanisms of the graph itself. A critical limitation of existing fairness-aware GNN methods is their reliance on the strong assumption that sensitive attributes are fully available for all nodes during training–a condition that poses a practical impediment due to privacy concerns and data collection constraints. To address this gap, we propose a novel, model-agnostic fairness regularization framework designed for the realistic scenario where sensitive attributes are only partially available. Our approach formalizes a fairness-aware objective function that integrates both equal opportunity and statistical parity as differentiable regularization terms. Through a comprehensive empirical evaluation across five real-world benchmark datasets, we demonstrate that the proposed method significantly mitigates bias across key fairness metrics while maintaining competitive node classification performance. Results show that our framework consistently outperforms baseline models in achieving a favorable fairness-accuracy trade-off, with minimal degradation in predictive accuracy. The datasets and source code will be publicly released at this https URL.
[LG-69] Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing
Link: https://arxiv.org/abs/2512.03062
Authors: Roman Rausch,David Jansen,Sukhbinder Singh,Román Orús
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Prepared for submission to ESANN 2026
Abstract:Large Language Models (LLMs) are very demanding in terms of their computational resources. Low-rank decomposition of LLM weights, e.g. via Singular Value Decomposition (SVD), is a promising approach for LLM compression, but it presents several practical hurdles, e.g. selecting appropriate layer-wise ranks and removing the parameter redundancy of the resulting factors. In this work, we present two physics-inspired improvements to SVD LLM compression: (1) FermiGrad, a gradient-descent algorithm that determines globally optimal layer-wise ranks by relaxing the discrete singular-value truncation into a continuous optimization using the Fermi function; (2) PivGa, an additional lossless compression of the low-rank factors that exploits the intrinsic gauge freedom in their parametrization.
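To illustrate how a Fermi function can relax a hard rank cutoff into something continuous, here is a minimal sketch using the standard Fermi-Dirac form f(i) = 1 / (exp((i - mu) / T) + 1) as a soft mask over singular-value indices. Everything beyond that formula is an assumption for illustration: the function names, the choice of mu as a soft rank and T as a sharpness parameter, and the sample singular values are hypothetical, and the abstract does not specify FermiGrad's actual parameterization or training loop.

```python
import math
from typing import List, Sequence


def fermi_mask(num_values: int, mu: float, temperature: float) -> List[float]:
    """Soft keep-mask over indices 0..num_values-1.

    Entries are close to 1 for indices well below mu, close to 0 well above
    mu, and exactly 0.5 at index == mu. Smaller temperature gives a sharper
    cutoff.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    mask = []
    for i in range(num_values):
        z = (i - mu) / temperature
        # Numerically stable evaluation of 1 / (exp(z) + 1).
        if z >= 0.0:
            e = math.exp(-z)
            mask.append(e / (1.0 + e))
        else:
            mask.append(1.0 / (1.0 + math.exp(z)))
    return mask


def soft_truncate(
    singular_values: Sequence[float], mu: float, temperature: float
) -> List[float]:
    """Scale each singular value by its Fermi mask entry."""
    mask = fermi_mask(len(singular_values), mu, temperature)
    return [s * m for s, m in zip(singular_values, mask)]


if __name__ == "__main__":
    # Hypothetical singular values of one weight matrix, sorted descending.
    values = [9.0, 4.0, 2.0, 1.0, 0.5, 0.1]
    print(soft_truncate(values, mu=2.5, temperature=0.3))
```

Because the mask is differentiable in mu, a cutoff position of this kind can in principle be adjusted by gradient descent, which is the general idea behind relaxing a discrete truncation into a continuous optimization.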
[LG-70] A Large Scale Heterogeneous Treatment Effect Estimation Framework and Its Applications of Users Journey at Snap
Link: https://arxiv.org/abs/2512.03060
Authors: Jing Pan,Li Shi,Paul Lo
Categories: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments:
Abstract:Heterogeneous Treatment Effect (HTE) and Conditional Average Treatment Effect (CATE) models relax the assumption that treatment effects are the same for every user. We present a large scale industrial framework for estimating HTE using experimental data from hundreds of millions of Snapchat users. By combining results across many experiments, the framework uncovers latent user characteristics that were previously unmeasurable and produces stable treatment effect estimates at scale. We describe the core components that enabled this system, including experiment selection, base learner design, and incremental training. We also highlight two applications: user influenceability to ads and user sensitivity to ads. An online A/B test using influenceability scores for targeting showed an improvement on key business metrics that is more than six times larger than what is typically considered significant.
[LG-71] Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL
Link: https://arxiv.org/abs/2512.03059
Authors: Jiaju Qi,Lei Lei,Thorsteinn Jonsson,Dusit Niyato
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The integration of Electric Buses (EBs) with renewable energy sources such as photovoltaic (PV) panels is a promising approach to promote sustainable and low-carbon public transportation. However, optimizing EB charging schedules to minimize operational costs while ensuring safe operation without battery depletion remains challenging - especially under real-world conditions, where uncertainties in PV generation, dynamic electricity prices, variable travel times, and limited charging infrastructure must be accounted for. In this paper, we propose a safe Hierarchical Deep Reinforcement Learning (HDRL) framework for solving the EB Charging Scheduling Problem (EBCSP) under multi-source uncertainties. We formulate the problem as a Constrained Markov Decision Process (CMDP) with options to enable temporally abstract decision-making. We develop a novel HDRL algorithm, namely Double Actor-Critic Multi-Agent Proximal Policy Optimization Lagrangian (DAC-MAPPO-Lagrangian), which integrates Lagrangian relaxation into the Double Actor-Critic (DAC) framework. At the high level, we adopt a centralized PPO-Lagrangian algorithm to learn safe charger allocation policies. At the low level, we incorporate MAPPO-Lagrangian to learn decentralized charging power decisions under the Centralized Training and Decentralized Execution (CTDE) paradigm. Extensive experiments with real-world data demonstrate that the proposed approach outperforms existing baselines in both cost minimization and safety compliance, while maintaining fast convergence speed.
[LG-72] Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding
Link: https://arxiv.org/abs/2512.03058
Authors: Duy-Tung Pham,An The Nguyen,Viet-Hoang Tran,Nhan-Phu Chung,Xin T. Tong,Tan M. Nguyen,Thieu N. Vo
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This paper investigates the dynamical properties of tokens in pre-trained Transformer models and explores their application to improving Transformers. To this end, we analyze the dynamical system governing the continuous-time limit of the pre-trained model and characterize the asymptotic behavior of its solutions. Specifically, we characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We provide sufficient conditions, based on these parameters, to identify scenarios where tokens either converge to zero or diverge to infinity. Unlike prior works, our conditions are broader in scope and more applicable to real-world models. Furthermore, we investigate how different forms of positional encoding – specifically absolute and rotary – affect these dynamical regimes. Empirical evidence reveals that the convergence scenario adversely impacts model performance. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding. These findings support theoretical foundations and design principles for improving Transformer models.
[LG-73] Physics-Informed Machine Learning for Steel Development: A Computational Framework and CCT Diagram Modelling
Link: https://arxiv.org/abs/2512.03050
Authors: Peter Hedström,Victor Lamelas Cubero,Jón Sigurdsson,Viktor Österberg,Satish Kolli,Joakim Odqvist,Ziyong Hou,Wangzhong Mu,Viswanadh Gowtham Arigela
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
Comments: 14 pages
Abstract:Machine learning (ML) has emerged as a powerful tool for accelerating the computational design and production of materials. In materials science, ML has primarily supported large-scale discovery of novel compounds using first-principles data and digital twin applications for optimizing manufacturing processes. However, applying general-purpose ML frameworks to complex industrial materials such as steel remains a challenge. A key obstacle is accurately capturing the intricate relationship between chemical composition, processing parameters, and the resulting microstructure and properties. To address this, we introduce a computational framework that combines physical insights with ML to develop a physics-informed continuous cooling transformation (CCT) model for steels. Our model, trained on a dataset of 4,100 diagrams, is validated against literature and experimental data. It demonstrates high computational efficiency, generating complete CCT diagrams with 100 cooling curves in under 5 seconds. It also shows strong generalizability across alloy steels, achieving phase classification F1 scores above 88% for all phases. For phase transition temperature regression, it attains mean absolute errors (MAE) below 20 °C across all phases except bainite, which shows a slightly higher MAE of 27 °C. This framework can be extended with additional generic and customized ML models to establish a universal digital twin platform for heat treatment. Integration with complementary simulation tools and targeted experiments will further support accelerated materials design workflows.
[LG-74] Closing the problem of which causal structures of up to six total nodes have a classical-quantum gap
Link: https://arxiv.org/abs/2512.04058
Authors: Shashaank Khanna,Matthew Pusey,Roger Colbeck
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 5 pages, 3 figures, 1 table
Abstract:The discovery of Bell that there exist quantum correlations that cannot be reproduced classically is one of the most important in the foundations of quantum mechanics, as well as having practical implications. Bell’s result was originally proven in a simple bipartite causal structure, but analogous results have also been shown in further causal structures. Here we study the only causal structure with six or fewer nodes in which the question of whether or not there exist quantum correlations that cannot be achieved classically was open. In this causal structure we show that such quantum correlations exist using a method that involves imposing additional restrictions on the correlations. This hence completes the picture of which causal structures of up to six nodes support non-classical quantum correlations. We also provide further illustrations of our method using other causal structures.
[LG-75] Refining Machine Learning Potentials through Thermodynamic Theory of Phase Transitions
Link: https://arxiv.org/abs/2512.03974
Authors: Paul Fuchs,Julija Zavadlav
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:
Abstract:Foundational Machine Learning Potentials can resolve the accuracy and transferability limitations of classical force fields. They enable microscopic insights into material behavior through Molecular Dynamics simulations, which can crucially expedite material design and discovery. However, insufficiently broad and systematically biased reference data affect the predictive quality of the learned models. Often, these models exhibit significant deviations from experimentally observed phase transition temperatures, on the order of several hundred kelvins. Thus, fine-tuning is necessary to achieve adequate accuracy in many practical problems. This work proposes a fine-tuning strategy via top-down learning, directly correcting the wrongly predicted transition temperatures to match the experimental reference data. Our approach leverages the Differentiable Trajectory Reweighting algorithm to minimize the free energy differences between phases at the experimental target pressures and temperatures. We demonstrate that our approach can accurately correct the phase diagram of pure Titanium in a pressure range of up to 5 GPa, matching the experimental reference within tenths of kelvins and improving the liquid-state diffusion constant. Our approach is model-agnostic, applicable to multi-component systems with solid-solid and solid-liquid transitions, and compliant with top-down training on other experimental properties. Therefore, our approach can serve as an essential step towards highly accurate application-specific and foundational machine learning potentials.
[LG-76] Comparison of neural network training strategies for the simulation of dynamical systems
Link: https://arxiv.org/abs/2512.03851
Authors: Paul Strasser,Andreas Pfeffer,Jakob Weber,Markus Gurtner,Andreas Körner
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: submitted to ECC
Abstract:Neural networks have become a widely adopted tool for modeling nonlinear dynamical systems from data. However, the choice of training strategy remains a key design decision, particularly for simulation tasks. This paper compares two predominant strategies: parallel and series-parallel training. The conducted empirical analysis spans five neural network architectures and two examples: a pneumatic valve test bench and an industrial robot benchmark. The study reveals that, even though series-parallel training dominates current practice, parallel training consistently yields better long-term prediction accuracy. Additionally, this work clarifies the often inconsistent terminology in the literature and relates both strategies to concepts from system identification. The findings suggest that parallel training should be considered the default training strategy for neural network-based simulation of dynamical systems.
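The two strategies named in this abstract differ in what is fed back into the model. The following is a minimal sketch, assuming the usual system-identification meaning of the terms: series-parallel uses the measured previous output (one-step-ahead prediction), while parallel uses the model's own previous prediction (free-run simulation). The function names, the toy first-order system, and the deliberately mismatched model are illustrative assumptions and are not taken from the paper; the sketch only evaluates a fixed model in both modes and does not reproduce the paper's training comparison.

```python
import numpy as np

def series_parallel_predict(f, y_measured, u):
    """One-step-ahead prediction: feed the measured previous output to the model."""
    return np.array([f(y_measured[k], u[k]) for k in range(len(u))])

def parallel_predict(f, y0, u):
    """Free-run simulation: feed the model's own previous prediction back in."""
    y_hat = np.empty(len(u))
    y_prev = y0
    for k in range(len(u)):
        y_prev = f(y_prev, u[k])
        y_hat[k] = y_prev
    return y_hat

# Toy data from y[k+1] = 0.9 * y[k] + 0.1 * u[k] with a constant input.
u = np.ones(50)
y = np.zeros(len(u) + 1)
for k in range(len(u)):
    y[k + 1] = 0.9 * y[k] + 0.1 * u[k]

# A deliberately imperfect model (coefficient 0.8 instead of 0.9).
model = lambda y_prev, u_k: 0.8 * y_prev + 0.1 * u_k

# Mean squared error of each prediction mode against the true outputs.
mse_sp = np.mean((series_parallel_predict(model, y[:-1], u) - y[1:]) ** 2)
mse_p = np.mean((parallel_predict(model, y[0], u) - y[1:]) ** 2)
print(mse_sp, mse_p)
```

In this toy case the free-run error is larger because model errors accumulate through the feedback loop, which is the long-horizon behaviour that a parallel training loss would penalise directly.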
[LG-77] Colored Markov Random Fields for Probabilistic Topological Modeling
Link: https://arxiv.org/abs/2512.03727
Authors: Lorenzo Marinucci,Leonardo Di Nino,Gabriele D’Acunto,Mario Edoardo Pandolfo,Paolo Di Lorenzo,Sergio Barbarossa
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
Comments: Proceeding of 2025 Asilomar Conference on Signals, Systems, and Computers
Abstract:Probabilistic Graphical Models (PGMs) encode conditional dependencies among random variables using a graph (nodes for variables, links for dependencies) and factorize the joint distribution into lower-dimensional components. This makes PGMs well-suited for analyzing complex systems and supporting decision-making. Recent advances in topological signal processing highlight the importance of variables defined on topological spaces in several application domains. In such cases, the underlying topology shapes statistical relationships, limiting the expressiveness of canonical PGMs. To overcome this limitation, we introduce Colored Markov Random Fields (CMRFs), which model both conditional and marginal dependencies among Gaussian edge variables on topological spaces, with a theoretical foundation in Hodge theory. CMRFs extend classical Gaussian Markov Random Fields by including link coloring: connectivity encodes conditional independence, while color encodes marginal independence. We quantify the benefits of CMRFs through a distributed estimation case study over a physical network, comparing it with baselines with different levels of topological prior.
[LG-78] Consistent Projection of Langevin Dynamics: Preserving Thermodynamics and Kinetics in Coarse-Grained Models
Link: https://arxiv.org/abs/2512.03706
Authors: Vahid Nateghi,Lara Neureither,Selma Moqvist,Carsten Hartmann,Simon Olsson,Feliks Nüske
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:
Abstract:Coarse graining (CG) is an important task for efficient modeling and simulation of complex multi-scale systems, such as the conformational dynamics of biomolecules. This work presents a projection-based coarse-graining formalism for general underdamped Langevin dynamics. Following the Zwanzig projection approach, we derive a closed-form expression for the coarse grained dynamics. In addition, we show how the generator Extended Dynamic Mode Decomposition (gEDMD) method, which was developed in the context of Koopman operator methods, can be used to model the CG dynamics and evaluate its kinetic properties, such as transition timescales. Finally, we combine our approach with thermodynamic interpolation (TI), a generative approach to transform samples between thermodynamic conditions, to extend the scope of the approach across thermodynamic states without repeated numerical simulations. Using a two-dimensional model system, we demonstrate that the proposed method accurately captures the thermodynamic and kinetic properties of the full-space model.
[LG-79] A Convolutional Framework for Mapping Imagined Auditory MEG into Listened Brain Responses
Link: https://arxiv.org/abs/2512.03458
Authors: Maryam Maghsoudi,Mohsen Rezaeizadeh,Shihab Shamma
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Decoding imagined speech engages complex neural processes that are difficult to interpret due to uncertainty in timing and the limited availability of imagined-response datasets. In this study, we present a Magnetoencephalography (MEG) dataset collected from trained musicians as they imagined and listened to musical and poetic stimuli. We show that both imagined and perceived brain responses contain consistent, condition-specific information. Using a sliding-window ridge regression model, we first mapped imagined responses to listened responses at the single-subject level, but found limited generalization across subjects. At the group level, we developed an encoder-decoder convolutional neural network with a subject-specific calibration layer that produced stable and generalizable mappings. The CNN consistently outperformed the null model, yielding significantly higher correlations between predicted and true listened responses for nearly all held-out subjects. Our findings demonstrate that imagined neural activity can be transformed into perception-like responses, providing a foundation for future brain-computer interface applications involving imagined speech and music.
[LG-80] When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling
Link: https://arxiv.org/abs/2512.03325
Authors: Garrett G. Wen,Hong Hu,Yue M. Lu,Zhou Fan,Theodor Misiakiewicz
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:A major effort in modern high-dimensional statistics has been devoted to the analysis of linear predictors trained on nonlinear feature embeddings via empirical risk minimization (ERM). Gaussian equivalence theory (GET) has emerged as a powerful universality principle in this context: it states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates, which are more amenable to analysis. Despite its remarkable successes, numerical experiments show that this equivalence can fail even for simple embeddings – such as polynomial maps – under general scaling regimes. We investigate this breakdown in the setting of random feature (RF) models in the quadratic scaling regime, where both the number of features and the sample size grow quadratically with the data dimension. We show that when the target function depends on a low-dimensional projection of the data, such as generalized linear models, GET yields incorrect predictions. To capture the correct asymptotics, we introduce a Conditional Gaussian Equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model. This hybrid model retains the tractability of the Gaussian framework and accurately describes RF models in the quadratic scaling regime. We derive sharp asymptotics for the training and test errors in this setting, which continue to agree with numerical simulations even when GET fails. Our analysis combines general results on CLT for Wiener chaos expansions and a careful two-phase Lindeberg swapping argument. Beyond RF models and quadratic scaling, our work hints at a rich landscape of universality phenomena in high-dimensional ERM. 
[LG-81] Unlocking hidden biomolecular conformational landscapes in diffusion models at inference time
Link: https://arxiv.org/abs/2512.03312
Authors: Daniel D. Richman,Jessica Karaguesian,Carl-Mikael Suomivuori,Ron O. Dror
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:The function of biomolecules such as proteins depends on their ability to interconvert between a wide range of structures or “conformations.” Researchers have endeavored for decades to develop computational methods to predict the distribution of conformations, which is far harder to determine experimentally than a static folded structure. We present ConforMix, an inference-time algorithm that enhances sampling of conformational distributions using a combination of classifier guidance, filtering, and free energy estimation. Our approach upgrades diffusion models – whether trained for static structure prediction or conformational generation – to enable more efficient discovery of conformational variability without requiring prior knowledge of major degrees of freedom. ConforMix is orthogonal to improvements in model pretraining and would benefit even a hypothetical model that perfectly reproduced the Boltzmann distribution. Remarkably, when applied to a diffusion model trained for static structure prediction, ConforMix captures structural changes including domain motion, cryptic pocket flexibility, and transporter cycling, while avoiding unphysical states. Case studies of biologically critical proteins demonstrate the scalability, accuracy, and utility of this method.
[LG-82] Novelty detection on path space
Link: https://arxiv.org/abs/2512.03243
Authors: Ioannis Gasteratos,Antoine Jacquier,Maud Lemercier,Terry Lyons,Cristopher Salvi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments:
Abstract:We frame novelty detection on path space as a hypothesis testing problem with signature-based test statistics. Using transportation-cost inequalities of Gasteratos and Jacquier (2023), we obtain tail bounds for false positive rates that extend beyond Gaussian measures to laws of RDE solutions with smooth bounded vector fields, yielding estimates of quantiles and p-values. Exploiting the shuffle product, we derive exact formulae for smooth surrogates of conditional value-at-risk (CVaR) in terms of expected signatures, leading to new one-class SVM algorithms optimising smooth CVaR objectives. We then establish lower bounds on type-II error for alternatives with finite first moment, giving general power bounds when the reference measure and the alternative are absolutely continuous with respect to each other. Finally, we evaluate numerically the type-I error and statistical power of signature-based test statistics, using synthetic anomalous diffusion data and real-world molecular biology data.
[LG-83] Iterative Tilting for Diffusion Fine-Tuning
Link: https://arxiv.org/abs/2512.03234
Authors: Jean Pachebat,Giovanni Conforti,Alain Durmus,Yazid Janati
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 14 pages
Abstract:We introduce iterative tilting, a gradient-free method for fine-tuning diffusion models toward reward-tilted distributions. The method decomposes a large reward tilt $\exp(\lambda r)$ into $N$ sequential smaller tilts, each admitting a tractable score update via first-order Taylor expansion. This requires only forward evaluations of the reward function and avoids backpropagating through sampling chains. We validate on a two-dimensional Gaussian mixture with linear reward, where the exact tilted distribution is available in closed form.
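The idea of splitting one large tilt into several small ones can be checked in a setting simpler than the one in the abstract. The sketch below assumes a single Gaussian (not the paper's mixture) and a linear reward r(x) = a·x, for which reweighting N(m, Σ) by exp(λ a·x) gives N(m + λΣa, Σ). The helper `tilt_gaussian` and all numerical values are hypothetical choices for illustration; the code verifies only the decomposition identity and does not implement the paper's score update.

```python
import numpy as np

def tilt_gaussian(mean, cov, a, lam):
    """Return the mean and covariance of N(mean, cov) reweighted by exp(lam * a @ x)."""
    # Completing the square shifts the mean by lam * cov @ a and leaves cov unchanged.
    return mean + lam * (cov @ a), cov

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.5], [0.5, 2.0]])
a = np.array([1.0, 0.0])  # linear reward r(x) = a @ x
lam, n_steps = 2.0, 10

# One large tilt exp(lam * r).
mean_once, _ = tilt_gaussian(mean, cov, a, lam)

# n_steps sequential tilts, each exp((lam / n_steps) * r).
mean_seq, cov_seq = mean, cov
for _ in range(n_steps):
    mean_seq, cov_seq = tilt_gaussian(mean_seq, cov_seq, a, lam / n_steps)

print(np.allclose(mean_once, mean_seq))
```

Because the small tilts multiply to the large one, the sequential and one-shot results agree up to floating-point error in this closed-form case.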
[LG-84] Convergence of a class of gradient-free optimisation schemes when the objective function is noisy, irregular or both
Link: https://arxiv.org/abs/2512.03225
Authors: Christophe Andrieu,Nicolas Chopin,Ettore Fincato,Mathieu Gerber
Categories: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We investigate the convergence properties of a class of iterative algorithms designed to minimize a potentially non-smooth and noisy objective function, which may be algebraically intractable and whose values may be obtained as the output of a black box. The algorithms considered can be cast under the umbrella of a generalised gradient descent recursion, where the gradient is that of a smooth approximation of the objective function. The framework we develop includes as special cases model-based and mollification methods, two classical approaches to zeroth-order optimisation. The convergence results are obtained under very weak assumptions on the regularity of the objective function and involve a trade-off between the degree of smoothing and size of the steps taken in the parameter updates. As expected, additional assumptions are required in the stochastic case. We illustrate the relevance of these algorithms and our convergence results through a challenging classification example from machine learning.
[LG-85] Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback
Link: https://arxiv.org/abs/2512.03208
Authors: Pangpang Liu,Junwei Lu,Will Wei Sun
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:We study estimation and statistical inference for reward models used in aligning large language models (LLMs). A key component of LLM alignment is reinforcement learning from human feedback (RLHF), where humans compare pairs of model-generated answers and their preferences are used to train a reward model. However, human feedback is inherently heterogeneous, creating significant challenges for reliable reward learning. To address this, we adopt a heterogeneous preference framework that jointly models the latent reward of answers and human rationality. This leads to a challenging biconvex optimization problem, which we solve via an alternating gradient descent algorithm. We establish theoretical guarantees for the resulting estimator, including its convergence and asymptotic distribution. These results enable the construction of confidence intervals for reward estimates. Leveraging these uncertainty quantification results, we conduct valid statistical comparisons between rewards and incorporate uncertainty into the best-of-N (BoN) policy framework. Extensive simulations demonstrate the effectiveness of our method, and applications to real LLM data highlight the practical value of accounting for uncertainty in reward modeling for LLM alignment.
[LG-86] In Situ Quantum Analog Pulse Characterization via Structured Signal Processing
Link: https://arxiv.org/abs/2512.03193
Authors: Yulong Dong,Christopher Kang,Murphy Yuezhen Niu
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 48 pages, 10 figures
Abstract:Analog quantum simulators can directly emulate time-dependent Hamiltonian dynamics, enabling the exploration of diverse physical phenomena such as phase transitions, quench dynamics, and non-equilibrium processes. Realizing accurate analog simulations requires high-fidelity time-dependent pulse control, yet existing calibration schemes are tailored to digital gate characterization and cannot be readily extended to learn continuous pulse trajectories. We present a characterization algorithm for in situ learning of pulse trajectories by extending the Quantum Signal Processing (QSP) framework to analyze time-dependent pulses. By combining QSP with a logical-level analog-digital mapping paradigm, our method reconstructs a smooth pulse directly from queries of the time-ordered propagator, without requiring mid-circuit measurements or additional evolution. Unlike conventional Trotterization-based methods, our approach avoids unscalable performance degradation arising from accumulated local truncation errors as the logical-level segmentation increases. Through rigorous theoretical analysis and extensive numerical simulations, we demonstrate that our method achieves high accuracy with strong efficiency and robustness against SPAM as well as depolarizing errors, providing a lightweight and optimal validation protocol for analog quantum simulators capable of detecting major hardware faults.
[LG-87] An AI Implementation Science Study to Improve Trustworthy Data in a Large Healthcare System
Link: https://arxiv.org/abs/2512.03098
Authors: Benoit L. Marteau,Andrew Hornback,Shaun Q. Tan,Christian Lowson,Jason Woloff,May D. Wang
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: Submitted and Accepted to the IEEE International Conference on Biomedical and Health Informatics (BHI) 2025
Abstract:The rapid growth of Artificial Intelligence (AI) in healthcare has sparked interest in Trustworthy AI and AI Implementation Science, both of which are essential for accelerating clinical adoption. However, strict regulations, gaps between research and clinical settings, and challenges in evaluating AI systems continue to hinder real-world implementation. This study presents an AI implementation case study within Shriners Children’s (SC), a large multisite pediatric system, showcasing the modernization of SC’s Research Data Warehouse (RDW) to OMOP CDM v5.4 within a secure Microsoft Fabric environment. We introduce a Python-based data quality assessment tool compatible with SC’s infrastructure, extending OHDSI’s R/Java-based Data Quality Dashboard (DQD) and integrating Trustworthy AI principles using the METRIC framework. This extension enhances data quality evaluation by addressing informative missingness, redundancy, timeliness, and distributional consistency. We also compare systematic and case-specific AI implementation strategies for Craniofacial Microsomia (CFM) using the FHIR standard. Our contributions include a real-world evaluation of AI implementations, integration of Trustworthy AI principles into data quality assessment, and insights into hybrid implementation strategies that blend systematic infrastructure with use-case-driven approaches to advance AI in healthcare.
[LG-88] Performance Analysis of Quantum Support Vector Classifiers and Quantum Neural Networks
Link: https://arxiv.org/abs/2512.03094
Authors: Tomás Villalba-Ferreiro,Eduardo Mosqueira-Rey,Diego Alvarez-Estevez
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 7 pages, 7 figures, conference
Abstract:This study explores the performance of Quantum Support Vector Classifiers (QSVCs) and Quantum Neural Networks (QNNs) in comparison to classical models for machine learning tasks. By evaluating these models on the Iris and MNIST-PCA datasets, we find that quantum models tend to outperform classical approaches as the problem complexity increases. While QSVCs generally provide more consistent results, QNNs exhibit superior performance in higher-complexity tasks due to their increased quantum load. Additionally, we analyze the impact of hyperparameter tuning, showing that feature maps and ansatz configurations significantly influence model accuracy. We also compare the PennyLane and Qiskit frameworks, concluding that Qiskit provides better optimization and efficiency for our implementation. These findings highlight the potential of Quantum Machine Learning (QML) for complex classification problems and provide insights into model selection and optimization strategies.
[LG-89] Calibrating Geophysical Predictions under Constrained Probabilistic Distributions
Link: https://arxiv.org/abs/2512.03081
Authors: Zhewen Hou,Jiajin Sun,Subashree Venkatasubramanian,Peter Jin,Shuolin Li,Tian Zheng
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Machine learning (ML) has shown significant promise in studying complex geophysical dynamical systems, including turbulence and climate processes. Such systems often display sensitive dependence on initial conditions, reflected in positive Lyapunov exponents, where even small perturbations in short-term forecasts can lead to large deviations in long-term outcomes. Thus, meaningful inference requires not only accurate short-term predictions, but also consistency with the system’s long-term attractor that is captured by the marginal distribution of state variables. Existing approaches attempt to address this challenge by incorporating spatial and temporal dependence, but these strategies become impractical when data are extremely sparse. In this work, we show that prior knowledge of marginal distributions offers valuable complementary information to short-term observations, motivating a distribution-informed learning framework. We introduce a calibration algorithm based on normalization and the Kernelized Stein Discrepancy (KSD) to enhance ML predictions. The method here employs KSD within a reproducing kernel Hilbert space to calibrate model outputs, improving their fidelity to known physical distributions. This not only sharpens pointwise predictions but also enforces consistency with non-local statistical structures rooted in physical principles. Through synthetic experiments, spanning offline climatological CO2 fluxes and online quasi-geostrophic flow simulations, we demonstrate the robustness and broad utility of the proposed framework.
Information Retrieval
[IR-0] Learning to Comparison-Shop
Link: https://arxiv.org/abs/2512.04009
Authors: Jie Tang,Daochen Zha,Xin Liu,Huiji Gao,Liwei He,Stephanie Moyerman,Sanjeev Katariya
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:In online marketplaces like Airbnb, users frequently engage in comparison shopping before making purchase decisions. Despite the prevalence of this behavior, a significant disconnect persists between mainstream e-commerce search engines and users’ comparison needs. Traditional ranking models often evaluate items in isolation, disregarding the context in which users compare multiple items on a search results page. While recent advances in deep learning have sought to improve ranking accuracy, diversity, and fairness by encoding listwise context, the challenge of aligning search rankings with user comparison shopping behavior remains inadequately addressed. In this paper, we propose a novel ranking architecture, the Learning-to-Comparison-Shop (LTCS) System, that explicitly models and learns users’ comparison shopping behaviors. Through extensive offline and online experiments, we demonstrate that our approach yields statistically significant gains in key business metrics (improving NDCG by 1.7% and boosting booking conversion rate by 0.6% in A/B testing) while also enhancing user experience. We also compare our model against state-of-the-art approaches and demonstrate that LTCS significantly outperforms them.
[IR-1] Algorithms for Boolean Matrix Factorization using Integer Programming and Heuristics
Link: https://arxiv.org/abs/2512.03807
Authors: Christos Kolomvakis,Thomas Bobille,Arnaud Vandaele,Nicolas Gillis
Categories: Information Retrieval (cs.IR); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 24 pages, 12 tables, 3 figures, code and data available from this https URL
Abstract:Boolean matrix factorization (BMF) approximates a given binary input matrix as the product of two smaller binary factors. Unlike binary matrix factorization based on standard arithmetic, BMF employs the Boolean OR and AND operations for the matrix product, which improves interpretability and reduces the approximation error. It is also used in role mining and computer vision. In this paper, we first propose algorithms for BMF that perform alternating optimization (AO) of the factor matrices, where each subproblem is solved via integer programming (IP). We then design different approaches to further enhance AO-based algorithms by selecting an optimal subset of rank-one factors from multiple runs. To address the scalability limits of IP-based methods, we introduce new greedy and local-search heuristics. We also construct a new C++ data structure for Boolean vectors and matrices that is significantly faster than existing ones and is of independent interest, allowing our heuristics to scale to large datasets. We illustrate the performance of all our proposed methods and compare them with the state of the art on various real datasets, both with and without missing data, including applications in topic modeling and imaging.
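The Boolean product described in this abstract replaces the sum and product of ordinary matrix multiplication with OR and AND. The following is a minimal sketch of that product and of the resulting reconstruction error, assuming NumPy; the function names and the small matrices are made up for illustration, and none of the paper's IP-based or heuristic algorithms are implemented here.

```python
import numpy as np

def boolean_product(W, H):
    """Boolean matrix product: entry (i, j) is OR over k of (W[i, k] AND H[k, j])."""
    # The integer product counts the k for which both bits are set; any count > 0 means True.
    return (W.astype(int) @ H.astype(int)) > 0

def reconstruction_error(X, W, H):
    """Number of entries where X differs from the Boolean product of W and H."""
    return int(np.sum(X.astype(bool) != boolean_product(W, H)))

X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=bool)
W = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=bool)
H = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=bool)

print(reconstruction_error(X, W, H))
print((W.astype(int) @ H.astype(int))[1, 1])
```

Here the two rank-one factors overlap at entry (1, 1): the Boolean product keeps that entry at 1, while the standard arithmetic product yields 2 and therefore would not reproduce the binary matrix exactly.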
[IR-2] LLM as Explainable Re-Ranker for Recommendation System
Link: https://arxiv.org/abs/2512.03439
Authors: Yaqi Wang,Haojia Sun,Shuting Zhang
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:The application of large language models (LLMs) in recommendation systems has recently gained traction. Traditional recommendation systems often lack explainability and suffer from issues such as popularity bias. Previous research has also indicated that LLMs, when used as standalone predictors, fail to achieve accuracy comparable to traditional models. To address these challenges, we propose using an LLM as an explainable re-ranker, a hybrid approach that combines traditional recommendation models with LLMs to enhance both accuracy and interpretability. We constructed a dataset to train the re-ranker LLM and evaluated the alignment between the generated dataset and human expectations. Leveraging a two-stage training process, our model significantly improved NDCG, a key ranking metric. Moreover, the re-ranker outperformed a zero-shot baseline in ranking accuracy and interpretability. These results highlight the potential of integrating traditional recommendation models with LLMs to address limitations in existing systems and pave the way for more explainable and fair recommendation frameworks.
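For readers unfamiliar with NDCG, the metric reported in this abstract, the sketch below shows one common definition: discounted cumulative gain with linear gain and a log2 position discount, normalised by the gain of the ideal ordering. This is an assumption about the variant; the abstract does not say which gain function or cutoff the authors used, and the function names here are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    # rank is 0-based, so the item at position 1 is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG of the given ordering divided by the DCG of the best possible ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance labels listed in the order a ranker returned the items.
print(ndcg([3, 2, 1]))  # already in ideal order
print(ndcg([1, 2, 3]))  # worst order for these labels
```

A re-ranker improves NDCG when it moves highly relevant items toward the top of the list, since earlier positions are discounted less.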

