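This post lists the latest papers retrieved from arXiv.org on 2025-12-23. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-12-23)
811 papers were updated today, including:
- Natural language processing: 100 papers (Computation and Language (cs.CL))
- Artificial intelligence: 242 papers (Artificial Intelligence (cs.AI))
- Computer vision: 203 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 225 papers (Machine Learning (cs.LG))
Natural Language Processing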
[NLP-0] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
[Quick Read]: This paper targets the performance bottleneck in training Large Language Model (LLM) agents that arises because real-world interaction data is costly and static. Its core solution is the GenEnv framework, which sets up a difficulty-aligned co-evolutionary game so that the agent and a scalable generative environment simulator evolve together. The key is the α-Curriculum Reward, which dynamically adjusts task difficulty to match the agent's current ability and so keeps learning inside the agent's "zone of proximal development". Compared with training on static datasets, GenEnv improves both data efficiency and agent performance.
Link: https://arxiv.org/abs/2512.19682
Authors: Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, Mengdi Wang
Affiliations: Princeton University; ByteDance Seed; Columbia University; University of Michigan; University of Chicago
Categories: Computation and Language (cs.CL)
Comments: Our codes are available at this https URL
Abstract:Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a data-evolving paradigm: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent’s "zone of proximal development". This process is guided by a simple but effective α-Curriculum Reward, which aligns task difficulty with the agent’s current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to +40.3% over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3× less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
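The idea of aligning task difficulty with the agent's current ability can be illustrated with a small sketch. The abstract does not give the definition of the α-Curriculum Reward, so everything here is an assumption made for illustration: the function name `alpha_curriculum_reward`, the Gaussian-shaped form, and the `sharpness` parameter are hypothetical. The sketch only shows one plausible way to reward a simulator for producing tasks that the agent solves at a rate close to a target α.

```python
import math

def alpha_curriculum_reward(successes, alpha=0.5, sharpness=10.0):
    """Hypothetical simulator reward (not the paper's definition):
    highest when the agent's success rate on the generated tasks
    is close to the target alpha."""
    if not successes:
        raise ValueError("need at least one rollout outcome")
    success_rate = sum(successes) / len(successes)
    return math.exp(-sharpness * (success_rate - alpha) ** 2)

# Tasks the agent always solves (too easy) or never solves (too hard)
# score lower than tasks it solves about half the time.
too_easy = alpha_curriculum_reward([1, 1, 1, 1])
too_hard = alpha_curriculum_reward([0, 0, 0, 0])
well_matched = alpha_curriculum_reward([1, 0, 1, 0])
```

Under this assumed form, a simulator trained to maximize the reward is pushed toward tasks of intermediate difficulty for the current agent, which is the behavior the abstract describes as a dynamic curriculum.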
[NLP-1] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
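[Quick Read]: This paper addresses the fact that existing Reinforcement Learning (RL) methods treat Large Language Models (LLMs) as a single unified policy and overlook their internal structure and mechanisms. To enable more targeted optimization and expose complex reasoning mechanisms, the authors decompose the language model policy into Internal Layer Policies and Internal Modular Policies, which correspond respectively to the contributions of individual layers in the Transformer residual stream and of the self-attention and Feed-Forward Network (FFN) components. The key is the equivalence between composing hidden states with the unembedding matrix and the resulting samplable policy, which makes the effect of each layer or module on the final sampling policy explicit. Building on this, they design Bottom-up Policy Optimization (BuPO), a new RL paradigm that directly optimizes lower-layer policies during early training, rebuilding foundational reasoning ability and improving performance on complex reasoning tasks.
Link: https://arxiv.org/abs/2512.19673
Authors: Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Our code is available at this https URL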
Abstract:Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of the internal policies, we find that: (a) Early layers keep high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) Llama’s prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at this https URL.
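The decomposition described in the abstract, composing a hidden state with the unembedding matrix to obtain a samplable distribution, can be sketched in a few lines. This is a minimal illustration on random stand-in values, not the authors' code: the names `internal_layer_policy`, `unembed`, and `hidden_states` are hypothetical, and the final normalization layer that real Transformers apply before unembedding is omitted as a simplifying assumption.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def internal_layer_policy(hidden, unembed):
    """Project one hidden state (length d) through the unembedding
    matrix (vocab x d) and normalize into a distribution over tokens."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)

random.seed(0)
d_model, vocab_size, num_layers = 8, 16, 4
unembed = [[random.gauss(0, 1) for _ in range(d_model)]
           for _ in range(vocab_size)]
# Stand-in residual-stream states, one per layer, for a single position.
hidden_states = [[random.gauss(0, 1) for _ in range(d_model)]
                 for _ in range(num_layers)]

# One entropy value per layer: the quantity the abstract analyzes.
layer_entropies = [entropy(internal_layer_policy(h, unembed))
                   for h in hidden_states]
```

With a real model, `hidden_states` would come from the residual stream at each layer, and the per-layer entropies could then be compared across depth in the way the abstract reports.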
[NLP-2] Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting
[Quick Read]: This paper addresses the difficulty of training models for Aspect-Category Sentiment Analysis (ACSA) when annotated data is scarce and expensive, especially in new domains. The key is a new Chain-of-Thought (CoT) prompting technique that introduces an intermediate Unified Meaning Representation (UMR) to structure the reasoning process, with the aim of improving zero-shot ACSA. Experiments show that the UMR-based method behaves differently across LLM sizes: on mid-sized models such as Qwen3-8B it performs comparably to the standard CoT baseline, but its generality still needs verification, particularly for smaller model architectures.
Link: https://arxiv.org/abs/2512.19651
Authors: Filippos Ventirozos, Peter Appleby, Matthew Shardlow
Affiliations: Manchester Metropolitan University; Autotrader Research Group, Autotrader UK
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, 3 tables
Abstract:Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.
[NLP-3] Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori
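[Quick Read]: This paper addresses diacritic restoration for low-resource languages, an important part of text normalization in natural language processing (NLP). It focuses on two extremely under-resourced languages, Bribri (spoken in Costa Rica) and Cook Islands Māori, and systematically compares algorithms on the task. The best-performing approach is fine-tuned character-level LLMs, likely because they can decompose complex characters into their UTF-8 byte representations; by contrast, massively multilingual models do worse under the data constraints, and zero-shot methods perform poorly. The study finds that reliable performance begins to emerge with a data budget of about 10,000 words, suggesting that a small amount of high-quality annotated data matters a great deal for generalization in low-resource settings.
Link: https://arxiv.org/abs/2512.19630
Authors: Rolando Coto-Solano, Daisy Li, Manoela Teleginski Ferraz, Olivia Sasse, Cha Krupka, Sharid Loáiciga, Sally Akevai Tenamu Nicholas
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: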
Abstract:We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.
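To make the task concrete, the sketch below shows the two text views involved: the diacritic-free input that a restoration model receives, and the UTF-8 byte sequence into which a byte-level model decomposes an accented character. It uses only Python's standard `unicodedata` module; the helper name `strip_diacritics` is hypothetical, and the example is an illustration of the task setup rather than the authors' pipeline.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "M\u0101ori"                        # "Māori", with a precomposed a-macron
stripped = strip_diacritics(word)          # input side of a restoration pair
utf8_bytes = list(word.encode("utf-8"))    # what a byte-level model sees
```

Here the single character U+0101 becomes two bytes, so the five-character word is six bytes long; a restoration model is trained to map `stripped` back to `word`.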
[NLP-4] Exploring the features used for summary evaluation by Human and GPT
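[Quick Read]: This paper addresses the unclear relationship between the judgments of Large Language Models (LLMs) and human ratings when LLMs automatically evaluate summary quality, in particular the lack of a systematic understanding of which features LLMs rely on and how those features map to human scores. The key is to use statistical analysis and machine learning methods to identify features aligned with both human ratings and the responses of Generative Pre-trained Transformers (GPTs), and to show that instructing GPTs to use the metrics humans commonly use improves their judgments and brings them closer to human ratings.
Link: https://arxiv.org/abs/2512.19620
Authors: Zahra Sadeghi, Evangelos Milios, Frank Rudzicz
Affiliations: Dalhousie University; Vector Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: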
Abstract:Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research has investigated the alignment between LLM and human responses, it is not yet well understood which properties or features LLMs exploit when asked to evaluate along a particular quality dimension, and little attention has been paid to the mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with human and Generative Pre-trained Transformer (GPT) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ the metrics used by humans can improve their judgment and bring it into closer agreement with human responses.
[NLP-5] MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery
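[Quick Read]: This paper addresses the lack of language independence in self-supervised speech models for cross-lingual phonetic representation learning and their limited ability to adapt to unseen languages. The key is to use articulatory features as a supervision signal and continue multilingual pre-training of HuBERT, so that the model learns from data in 55 languages to predict articulatory features or phones and obtains language-independent representations that capture multilingual phonetic properties. This supervision makes the representations more context-invariant and improves transfer to unseen languages and casual speech, with only a small amount (10 hours) of self-supervised fine-tuning needed for adaptation.
Link: https://arxiv.org/abs/2512.19612
Authors: Angelo Ortiz Tandazo, Manel Khentout, Youssef Benchekroun, Thomas Hueber, Emmanuel Dupoux
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: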
Abstract:This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.
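The ABX discriminability test mentioned in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions: each item is a single fixed-length vector compared with Euclidean distance, whereas speech ABX evaluations typically compare sequences of frames with an alignment step. The names `abx_error_rate` and `euclidean` are hypothetical, and the toy vectors are not real model outputs.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_error_rate(triples, distance=euclidean):
    """Each triple is (a, b, x), where x shares its category with a.
    An error is counted when x is not strictly closer to a than to b."""
    if not triples:
        raise ValueError("need at least one triple")
    errors = sum(1 for a, b, x in triples
                 if distance(a, x) >= distance(b, x))
    return errors / len(triples)

triples = [
    ([0.0, 0.0], [1.0, 1.0], [0.1, 0.0]),  # x lies near a: counted as correct
    ([0.0, 0.0], [1.0, 1.0], [0.9, 1.0]),  # x lies near b: counted as an error
]
error_rate = abx_error_rate(triples)
```

A lower error rate means that representations of the same phone category stay close to each other regardless of context, which is what the abstract means by more context-invariant representations.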
[NLP-6] Increasing the Thinking Budget is Not All You Need
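[Quick Read]: This paper studies how to optimize the reasoning performance of Large Language Models (LLMs) under limited compute, in particular how the "thinking budget" affects performance and how it interacts with configurations such as self-consistency and self-reflection. The key finding is that simply increasing the thinking budget is not the most effective use of compute; alternative configurations such as self-consistency and self-reflection can yield more accurate responses. The work also provides a balanced comparison framework that weighs performance against computational cost.
Link: https://arxiv.org/abs/2512.19585
Authors: Ignacio Iacobacci, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Riad Souissi
Affiliations: Elm Company
Categories: Computation and Language (cs.CL)
Comments: 4 pages, 4 figures, 3 tables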
Abstract:Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget, impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various configurations such as self-consistency, reflection, and others. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we discovered that simply increasing the thinking budget is not the most effective use of compute. More accurate responses can instead be achieved through alternative configurations, such as self-consistency and self-reflection.
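Self-consistency, one of the configurations the abstract compares against a larger thinking budget, is commonly implemented as a majority vote over several sampled answers. The sketch below shows only that voting step; the function name `self_consistency` and the sample answers are hypothetical, and how the paper samples responses or allocates budget between them is not stated in the abstract.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent final answer among sampled responses
    and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(answers)

# Final answers extracted from five independently sampled responses.
sampled = ["42", "41", "42", "42", "17"]
answer, agreement = self_consistency(sampled)
```

The trade-off the paper examines is that several short samples and one long reasoning trace can cost a similar number of tokens, so the comparison has to account for compute as well as accuracy.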
[NLP-7] Algerian Dialect
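[Quick Read]: This paper addresses the scarcity of publicly available natural language processing (NLP) resources for the Algerian Arabic dialect, in particular the lack of large annotated datasets for sentiment analysis. The key is the construction and release of a large sentiment-annotated dataset of 45,000 YouTube comments, all collected from Algerian press and media channels. Each comment is manually labeled with one of five sentiment categories (very negative, negative, neutral, positive, very positive) and comes with rich metadata such as timestamps, like counts, and video URLs. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license, providing a basic resource for NLP research and social media analytics on the Algerian dialect.
Link: https://arxiv.org/abs/2512.19543
Authors: Zakaria Benmounah, Abdennour Boulesnane
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: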
Abstract:We present Algerian Dialect, a large-scale sentiment-annotated dataset consisting of 45,000 YouTube comments written in Algerian Arabic dialect. The comments were collected from more than 30 Algerian press and media channels using the YouTube Data API. Each comment is manually annotated into one of five sentiment categories: very negative, negative, neutral, positive, and very positive. In addition to sentiment labels, the dataset includes rich metadata such as collection timestamps, like counts, video URLs, and annotation dates. This dataset addresses the scarcity of publicly available resources for Algerian dialect and aims to support research in sentiment analysis, dialectal Arabic NLP, and social media analytics. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license at this https URL.
[NLP-8] Event Extraction in Large Language Model
[Quick Read]: This paper addresses three core problems that event extraction (EE) systems based on Large Language Models (LLMs) face in deployment: hallucinations under weak constraints, fragile temporal and causal linking over long contexts, and limited long-horizon knowledge management within a bounded context window. The key is to treat EE as a component that provides a cognitive scaffold: structured event schemas and slot constraints make reasoning verifiable; event-centric structures serve as controlled intermediate representations for stepwise reasoning; event links support graph-based retrieval-augmented generation (RAG); and event stores provide updatable episodic memory and agent memory beyond the context window. This framing moves EE from static information extraction toward a reliable perception and memory layer for open-world systems.
Link: https://arxiv.org/abs/2512.19537
Authors: Bobo Li, Xudong Han, Jiang Liu, Yuzhe Ding, Liqiang Jing, Zhaoqi Zhang, Jinheng Li, Xinya Du, Fei Li, Meishan Zhang, Min Zhang, Aixin Sun, Philip S. Yu, Hao Fei
Affiliations: National University of Singapore; University of Sussex; Wuhan University; University of Texas at Dallas; Nanyang Technological University; Harbin Institute of Technology (Shenzhen); University of Illinois at Chicago
Categories: Computation and Language (cs.CL)
Comments: 38 pages, 9 Figures, 5 Tables
Abstract:Large language models (LLMs) and multimodal LLMs are changing event extraction (EE): prompting and generation can often produce structured outputs in zero-shot or few-shot settings. Yet LLM-based pipelines face deployment gaps, including hallucinations under weak constraints, fragile temporal and causal linking over long contexts and across documents, and limited long-horizon knowledge management within a bounded context window. We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM-centered solutions. Event schemas and slot constraints create interfaces for grounding and verification; event-centric structures act as controlled intermediate representations for stepwise reasoning; event links support relation-aware retrieval with graph-based RAG; and event stores offer updatable episodic and agent memory beyond the context window. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule-based and neural models to instruction-driven and generative frameworks, and summarizing formulations, decoding strategies, architectures, representations, datasets, and evaluation. We also review cross-lingual, low-resource, and domain-specific settings, and highlight open challenges and future directions for reliable event-centric systems. Finally, we outline open challenges and future directions that are central to the LLM era, aiming to evolve EE from static extraction into a structurally reliable, agent-ready perception and memory layer for open-world systems.
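The claim that event schemas and slot constraints create interfaces for verification can be illustrated with a small validator. This is a hedged sketch, not a format taken from the survey: the `EventSchema` class, the `validate_event` function, the `Attack` event type, and all slot names are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EventSchema:
    """Hypothetical schema: an event type plus its allowed argument slots."""
    event_type: str
    required_slots: frozenset
    optional_slots: frozenset = field(default_factory=frozenset)

def validate_event(schema, event):
    """Return a list of constraint violations for one extracted event."""
    problems = []
    if event.get("type") != schema.event_type:
        problems.append(f"unexpected event type: {event.get('type')!r}")
    slots = set(event.get("arguments", {}))
    for missing in sorted(schema.required_slots - slots):
        problems.append(f"missing required slot: {missing}")
    allowed = schema.required_slots | schema.optional_slots
    for extra in sorted(slots - allowed):
        problems.append(f"slot not in schema: {extra}")
    return problems

attack = EventSchema("Attack",
                     frozenset({"attacker", "target"}),
                     frozenset({"place"}))
# A model output that omits one required slot and invents another.
extracted = {"type": "Attack",
             "arguments": {"attacker": "rebels", "weapon": "drone"}}
violations = validate_event(attack, extracted)
```

A check of this kind gives a pipeline a concrete signal for rejecting or re-prompting on outputs that drift from the schema, which is one way weakly constrained generation can be made verifiable.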
[NLP-9] A Large-Language-Model Framework for Automated Humanitarian Situation Reporting
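[Quick Read]: This paper addresses the inefficiency, manual effort, and inconsistency of situation report generation in humanitarian decision-making. The core solution is a fully automated framework that uses Large Language Models (LLMs) to turn heterogeneous humanitarian documents into structured, evidence-grounded reports. The key innovation is the integration of semantic text clustering, automatic question generation, retrieval-augmented answer extraction with citations, multi-level summarization, and executive summary generation, together with internal evaluation metrics that emulate expert reasoning, to make the output accurate, interpretable, and actionable.
Link: https://arxiv.org/abs/2512.19475
Authors: Ivan Decostanzi, Yelena Mejova, Kyriaki Kalimeri
Affiliations: ISI Foundation; UNICEF
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 3 figures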
Abstract:Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured and evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.
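The abstract reports citation precision and recall without saying how they are computed. One common reading, assumed here purely for illustration, treats the sources an answer cites and the sources that actually support it as two sets. The function name `citation_precision_recall` and the document identifiers are hypothetical.

```python
def citation_precision_recall(cited, gold):
    """Compare the set of sources an answer cites with the set of
    sources that actually support it (assumed set-based definition)."""
    cited, gold = set(cited), set(gold)
    true_positives = len(cited & gold)
    precision = true_positives / len(cited) if cited else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Four cited documents, of which two are among the three supporting ones.
precision, recall = citation_precision_recall(
    cited=["doc3", "doc7", "doc9", "doc12"],
    gold=["doc3", "doc7", "doc8"],
)
```

Under this definition, precision falls when an answer cites documents that do not support it, and recall falls when it leaves supporting documents uncited.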
[NLP-10] Epistemological Fault Lines Between Human and Artificial Intelligence
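[Quick Read]: This paper addresses the following problem: although Large Language Models (LLMs) are often described as having human-like intelligence, their epistemic nature differs fundamentally from how humans acquire knowledge, and this difference leads people to mistake linguistic fluency for reliable knowledge, creating a risk of "pseudo-cognition". The key is to identify LLMs as stochastic pattern-completion systems: they do not reason from beliefs or world models but produce output through walks on high-dimensional graphs of linguistic transitions. By systematically mapping human and artificial epistemic pipelines, the authors identify seven epistemic fault lines (grounding, parsing, experience, motivation, causal reasoning, metacognition, and value) and propose the concept of "Epistemia" to describe the resulting structural condition, in which linguistic plausibility substitutes for genuine epistemic evaluation and produces the feeling of knowing without the labor of judgment. This offers a theoretical framework and practical guidance for evaluation, governance, and epistemic literacy.
Link: https://arxiv.org/abs/2512.19466
Authors: Walter Quattrociocchi, Valerio Capraro, Matjaž Perc
Affiliations: Sapienza University of Rome; University of Milan Bicocca; University of Maribor; Korea University; Kyung Hee University
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 1 figure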
Abstract:Large language models (LLMs) are widely described as artificial intelligence, yet their epistemic profile diverges sharply from human cognition. Here we show that the apparent alignment between human and machine outputs conceals a deeper structural mismatch in how judgments are produced. Tracing the historical shift from symbolic AI and information filtering systems to large-scale generative transformers, we argue that LLMs are not epistemic agents but stochastic pattern-completion systems, formally describable as walks on high-dimensional graphs of linguistic transitions rather than as systems that form beliefs or models of the world. By systematically mapping human and artificial epistemic pipelines, we identify seven epistemic fault lines, divergences in grounding, parsing, experience, motivation, causal reasoning, metacognition, and value. We call the resulting condition Epistemia: a structural situation in which linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without the labor of judgment. We conclude by outlining consequences for evaluation, governance, and epistemic literacy in societies increasingly organized around generative AI.
[NLP-11] Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations
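[Quick Read]: This paper addresses the challenge of Automated Essay Scoring (AES) in cross-prompt settings, where the diversity of scoring criteria makes scoring less consistent. The key is to use activations from intermediate layers of Large Language Models (LLMs) as feature representations, fit probes on them to evaluate how well they discriminate essay quality, and analyze how different models and input content affect this discriminative power. The study finds that intermediate activations have strong discriminative power and that LLMs can adapt their evaluation perspective to essay types and scoring traits, which helps them handle the diversity of scoring criteria in cross-prompt settings.
Link: https://arxiv.org/abs/2512.19456
Authors: Jinwei Chi, Ke Wang, Yu Chen, Xuanye Lin, Qiang Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: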
Abstract:Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs’ activations in the cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed how different models and input content affect this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed how the evaluation perspectives of large language models vary with essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.
[NLP-12] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation
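[Quick Read]: This paper addresses unstable Thai generation by open-weights large language models under complex instructions, despite their strong performance in English. The key is SiamGPT-32B, a model fine-tuned from Qwen3-32B with a "Quality-First" strategy that emphasizes curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF (Automatic Instruction Following) framework to reinforce instruction following and linguistic constraints; without continual pretraining or corpus expansion, it improves instruction adherence, multi-turn robustness, and linguistic stability.
Link: https://arxiv.org/abs/2512.19455
Authors: Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr, Kobkrit Viriyayudhakorn, Nongnuch Ketui, Aslan B. Wong
Affiliations: SIAM.AI; iApp Technology Co., Ltd.; Intelligent Informatics and Service Innovation Research Center, Thailand; Artificial Intelligence Entrepreneur Association of Thailand (AIEAT); Rajamangala University of Technology Lanna Nan, Thailand; Artificial Intelligence Association of Thailand (AIAT); National Electronics and Computer Technology Center (NECTEC)
Categories: Computation and Language (cs.CL)
Comments: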
Abstract:Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, we present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.
[NLP-13] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
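[Quick Read]: This paper addresses limitations of AndroidWorld, the dominant benchmark for mobile-use agents: saturated task difficulty, a lack of real-world complexity (such as vague user instructions and cross-application interaction), and missing application categories (such as e-commerce and enterprise communication). The core solution is MobileWorld, a more challenging benchmark with two main innovations. First, it emphasizes long-horizon, complex tasks, averaging 27.8 steps per task (versus 14.3 in AndroidWorld), with 62.2% of tasks involving multiple applications. Second, it introduces new task types, namely agent-user interaction and MCP-augmented tasks, and provides a snapshot-based container environment with precise functional verification (including backend database inspection and task callback APIs), bringing it closer to realistic mobile usage.
Link: https://arxiv.org/abs/2512.19432
Authors: Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang
Affiliations: Tongyi Lab; Alibaba Group; HKUST (GZ); University of Florida
Categories: Computation and Language (cs.CL)
Comments: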
Abstract:Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide a snapshot-based container environment and precise functional verification, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
[NLP-14] CodeSimpleQA: Scaling Factuality in Code Large Language Models
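[Quick Read]: This paper addresses the insufficient factual accuracy of Large Language Models (LLMs) on code-related tasks: a model may generate syntactically correct or executable code yet give wrong information about programming concepts and technical implementation details. In response, the authors propose CodeSimpleQA, a bilingual benchmark covering multiple programming languages and major computer science domains, build CodeSimpleQA-Instruct, an instruction corpus with 66 million samples, and design a post-training framework that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to improve factual alignment on code knowledge. Experiments show that even frontier models still struggle with factual accuracy, while the proposed framework substantially improves over the base model, supporting the importance of factuality-aware alignment for reliable code LLMs.
Link: https://arxiv.org/abs/2512.19424
Authors: Jian Yang, Wei Zhang, Yizhi Li, Shawn Guo, Haowen Wang, Aishan Liu, Ge Zhang, Zili Wang, Zhoujun Li, Xianglong Liu, Weifeng Lv
Affiliations: Beihang University; Manchester; M-A-P; StepFun
Categories: Computation and Language (cs.CL)
Comments: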
Abstract:Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.
[NLP-15] From Retrieval to Reasoning: A Framework for Cyber Threat Intelligence NER with Explicit and Adaptive Instructions
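[Quick Read]: This paper addresses a fundamental flaw of current retrieval-based In-Context Learning (ICL) for Named Entity Recognition (NER) in Cyber Threat Intelligence (CTI): its gains come mainly from incidental overlap of entity types between retrieved examples and the target text rather than from global semantic similarity, which makes reliance on implicit induction unreliable. The key is the TTPrompt framework, which maps CTI's Tactics, Techniques, and Procedures (TTPs) onto an explicit instruction hierarchy: task definitions as Tactics, guiding strategies as Techniques, and annotation guidelines as Procedures. It also introduces Feedback-driven Instruction Refinement (FIR), which lets Large Language Models (LLMs) refine the annotation guidelines themselves from errors on a small amount of labeled data and so adapt to different annotation styles. Experiments show that TTPrompt outperforms retrieval-based baselines on five CTI NER benchmarks and, with refinement on only 1% of the training data, rivals models fine-tuned on the full dataset.
Link: https://arxiv.org/abs/2512.19414
Authors: Jiaren Peng, Hongda Sun, Xuan Tian, Cheng Huang, Zeqing Li, Rui Yan
Affiliations: Sichuan University; Renmin University of China; Wuhan University; Gaoling School of Artificial Intelligence; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: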
Abstract:The automation of Cyber Threat Intelligence (CTI) relies heavily on Named Entity Recognition (NER) to extract critical entities from unstructured text. Currently, Large Language Models (LLMs) primarily address this task through retrieval-based In-Context Learning (ICL). This paper analyzes this mainstream paradigm, revealing a fundamental flaw: its success stems not from global semantic similarity but largely from the incidental overlap of entity types within retrieved examples. This exposes the limitations of relying on unreliable implicit induction. To address this, we propose TTPrompt, a framework shifting from implicit induction to explicit instruction. TTPrompt maps the core concepts of CTI’s Tactics, Techniques, and Procedures (TTPs) into an instruction hierarchy: formulating task definitions as Tactics, guiding strategies as Techniques, and annotation guidelines as Procedures. Furthermore, to handle the adaptability challenge of static guidelines, we introduce Feedback-driven Instruction Refinement (FIR). FIR enables LLMs to self-refine guidelines by learning from errors on minimal labeled data, adapting to distinct annotation dialects. Experiments on five CTI NER benchmarks demonstrate that TTPrompt consistently surpasses retrieval-based baselines. Notably, with refinement on just 1% of training data, it rivals models fine-tuned on the full dataset. For instance, on LADDER, its Micro F1 of 71.96% approaches the fine-tuned baseline, and on the complex CTINexus, its Macro F1 exceeds the fine-tuned ACLM model by 10.91%.
[NLP-16] Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
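[Quick read]: This paper addresses the poor performance of automatic speech recognition (ASR) in practical settings for low-resource, predominantly oral languages such as Bambara, particularly with respect to speech characteristics common in real-world use (code-switching, disfluencies, background noise, and overlapping speakers). The key to the solution is Kunkado, a 160-hour dataset of real radio recordings. Parakeet-based models are fine-tuned on a 33.47-hour human-reviewed subset, and pragmatic transcript normalization is applied to reduce variability in number formatting, tags, and code-switching annotations. Experiments show that the approach lowers word error rate (WER) and, in human evaluation, outperforms a comparable model trained on cleaner but less realistic data, which supports robust ASR for predominantly oral languages.
Link: https://arxiv.org/abs/2512.19400
Authors: Yacouba Diarra, Panga Azazia Kamate, Nouhoum Souleymane Coulibaly, Michael Leventhal
Affiliations: RobotsMali AI4D Lab
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 2 figures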
Abstract:We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
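The WER figures in this abstract can be put in context with a minimal sketch of how word error rate is commonly computed: word-level edit distance divided by the number of reference words. The function names `wer` and `relative_reduction` are illustrative and do not come from the paper; the authors' scoring pipeline, including their transcript normalization, may differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion of a reference word
                            curr[j - 1] + 1,       # insertion of a hypothesis word
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative WER reduction, expressed as a fraction of the starting WER."""
    return (before - after) / before


if __name__ == "__main__":
    print(wer("a b c", "a x c"))
    print(f"{relative_reduction(44.47, 37.12):.1%}")
    print(f"{relative_reduction(36.07, 32.33):.1%}")
```

Under this definition, the reported absolute drops correspond to relative reductions of roughly 16.5% and 10.4% on the two test sets.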
[NLP-17] HATS: High-Accuracy Triple-Set Watermarking for Large Language Models
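[Quick read]: This paper addresses misuse of text generated by large language models (LLMs) by proposing a watermark that embeds implicit signals in the output so that it can be detected. The key to the solution is a triple-partition watermark: at each decoding step the vocabulary is split into Green, Yellow, and Red sets with fixed ratios, and sampling is restricted to the Green and Yellow sets. At detection time the same partitions are replayed, Green-enrichment and Red-depletion statistics are computed and converted to one-sided z-scores, and their p-values are aggregated with Fisher's method to decide whether the text is watermarked. The method achieves high detection accuracy at a fixed false-positive rate (FPR) while preserving readability.
Link: https://arxiv.org/abs/2512.19378
Authors: Zhiqing Hu, Chenxu Zhao, Jiazhong Lu, Xiaolei Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Camera-ready version of the paper accepted for oral presentation at the 11th International Conference on Computer and Communications (ICCC 2025)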
Abstract:Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher’s method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.
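The detection pipeline described in this abstract can be illustrated with a minimal, standard-library sketch. Several pieces are assumptions made for illustration and are not taken from the paper: the partition is seeded from the previous token and a secret key, the ratios are 50/25/25, and a uniform sampler stands in for the language model. All function names are hypothetical.

```python
import hashlib
import math
import random


def partition(prev_token, vocab_size, key="demo-key", green=0.5, yellow=0.25):
    """Pseudo-randomly split token ids into (Green, Yellow, Red) sets.

    Seeding on the previous token and a key is an assumption of this sketch.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    ids = list(range(vocab_size))
    random.Random(int.from_bytes(digest[:8], "big")).shuffle(ids)
    n_green = int(green * vocab_size)
    n_yellow = int(yellow * vocab_size)
    return (set(ids[:n_green]),
            set(ids[n_green:n_green + n_yellow]),
            set(ids[n_green + n_yellow:]))


def generate(length, vocab_size, seed=0):
    """Toy stand-in for a model: sample uniformly from Green and Yellow only."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    while len(tokens) < length:
        green, yellow, _ = partition(tokens[-1], vocab_size)
        tokens.append(rng.choice(sorted(green | yellow)))
    return tokens


def upper_tail_p(z):
    """One-sided p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def fisher_combine(p1, p2):
    """Fisher's method for two p-values (chi-square with 4 degrees of freedom)."""
    x = -2.0 * (math.log(max(p1, 1e-300)) + math.log(max(p2, 1e-300)))
    return math.exp(-x / 2) * (1 + x / 2)  # closed-form survival function for 4 dof


def detect(tokens, vocab_size):
    """Replay the partitions and return the combined p-value."""
    n = len(tokens) - 1
    n_green = n_red = 0
    for prev, tok in zip(tokens, tokens[1:]):
        green, _, red = partition(prev, vocab_size)
        n_green += tok in green
        n_red += tok in red
    # Under the null hypothesis, a token lands in a set with that set's vocabulary share.
    green, _, red = partition(0, vocab_size)
    pg, pr = len(green) / vocab_size, len(red) / vocab_size
    z_green = (n_green - pg * n) / math.sqrt(n * pg * (1 - pg))  # Green enrichment
    z_red = (pr * n - n_red) / math.sqrt(n * pr * (1 - pr))      # Red depletion
    return fisher_combine(upper_tail_p(z_green), upper_tail_p(z_red))


if __name__ == "__main__":
    print(detect(generate(200, 1000), 1000))
```

A small combined p-value suggests watermarked text. Fisher's method assumes independent tests, and the Green and Red counts over the same tokens are not independent, so the combined p-value in this sketch should be read as approximate.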
[NLP-18] MAGIC: Achieving Superior Model Merging via Magnitude Calibration
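[Quick read]: This paper addresses the perturbation of feature magnitude caused by operations such as parameter fusion and sparsification during model merging. This perturbation makes the merged model's features deviate from the behavior of the original specialized models and degrades performance. Existing methods mostly focus on aligning feature direction and overlook magnitude stability. The key to the solution is MAGnItude Calibration (MAGIC), a plug-and-play framework that calibrates magnitudes layer by layer in feature space and weight space: Feature Space Calibration (FSC) realigns features using a small amount of unlabeled data; Weight Space Calibration (WSC) extends the calibration to weight space without additional data; and combining the two gives Dual Space Calibration (DSC), which improves merged-model performance on computer vision and natural language processing tasks.
Link: https://arxiv.org/abs/2512.19320
Authors: Yayuan Li, Jian Zhang, Jintao Guo, Zihan Cheng, Lei Qi, Yinghuan Shi, Yang Gao
Affiliations: Nanjing University; Shanghai Jiao Tong University Medical School Affiliated Ruijin Hospital; Southeast University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: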
Abstract:The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model’s features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: this https URL
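The claim that parameter fusion perturbs magnitude can be seen in a toy example: averaging two unit-norm vectors that point in different directions gives a vector with a smaller norm. The sketch below rescales the merged vector to the mean norm of the inputs. This simple norm rescaling is an assumption chosen for illustration; it is not the paper's FSC or WSC procedure, and all function names are hypothetical.

```python
import math


def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def average(vectors):
    """Element-wise mean, a stand-in for simple parameter fusion."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def calibrate_magnitude(merged, specialists):
    """Rescale `merged` so its L2 norm equals the mean norm of the specialists.

    The direction of `merged` is unchanged; only its magnitude is corrected.
    """
    target = sum(norm(v) for v in specialists) / len(specialists)
    current = norm(merged)
    if current == 0.0:
        return list(merged)
    scale = target / current
    return [x * scale for x in merged]


if __name__ == "__main__":
    specialists = [[1.0, 0.0], [0.0, 1.0]]
    merged = average(specialists)
    print(norm(merged))                                    # smaller than either input norm
    print(norm(calibrate_magnitude(merged, specialists)))  # restored to the mean input norm
```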
[NLP-19] CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs
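[Quick read]: This paper addresses how to extract structured information at scale from heterogeneous news articles in order to understand and monitor the socio-economic impacts of climate hazards. The key to the solution is CienaLLM, a modular framework based on schema-guided Generative Information Extraction. It uses open-weight large language models (LLMs) for zero-shot extraction and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. A systematic factorial study finds that larger models perform best and most stably, that quantization brings substantial efficiency gains at a modest accuracy cost, and that the effect of prompting strategies is model-specific. An added response-parsing step nearly eliminates format errors without hurting accuracy. On Spanish drought news, CienaLLM matches or exceeds the supervised baseline in extraction accuracy, and its design is intended to transfer to related extraction tasks.
Link: https://arxiv.org/abs/2512.19305
Authors: Javier Vela-Tambo, Jorge Gracia, Fernando Dominguez-Castro
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: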
Abstract:Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles on a large scale. To that end, we have developed CienaLLM, a modular framework based on schema-guided Generative Information Extraction. CienaLLM uses open-weight Large Language Models for zero-shot information extraction from news articles, and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. To systematically assess how the choice of LLM family, size, precision regime, and prompting strategy affects performance, we run a large factorial study across models, precisions, and prompt engineering techniques. An additional response parsing step nearly eliminates format errors while preserving accuracy; larger models deliver the strongest and most stable performance, while quantization offers substantial efficiency gains with modest accuracy trade-offs; and prompt strategies show heterogeneous, model-specific effects. CienaLLM matches or outperforms the supervised baseline in accuracy for extracting drought impacts from Spanish news, although at a higher inference cost. While evaluated on droughts, the schema-driven and model-agnostic design is suitable for adapting to related information extraction tasks (e.g., other hazards, sectors, or languages) by editing prompts and schemas rather than retraining. We release code, configurations, and schemas to support reproducible use.
[NLP-20] Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics
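[Quick read]: This paper addresses how to improve the inference accuracy and labeling efficiency of large language models (LLMs) on frame detection in logistics texts without extensive fine-tuning. The core of the solution is a structured prompt optimization pipeline whose key innovation is an LLM-based prompt optimizer agent. It combines retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT), and refines prompts iteratively using retrieved examples, performance feedback, and internal self-evaluation. Experiments show accuracy gains of up to 15% on a real-world logistics annotation task across several LLMs (GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B)), which supports the approach as a scalable alternative to full fine-tuning for domain-specific NLP applications.
Link: https://arxiv.org/abs/2512.19247
Authors: Do Minh Duc, Quan Xuan Truong, Nguyen Tat Dat, Nguyen Van Vinh
Affiliations: VNU University of Engineering and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: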
Abstract:Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts - particularly those enhanced via Auto-CoT and RAG - improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.
[NLP-21] ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models
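[Quick read]: This paper addresses the weak reasoning of large language models (LLMs) in molecular science, which stems from the lack of explicit chemical priors in standard string representations. Existing methods face a fundamental dilemma: training-based methods inject priors into model parameters, but this makes knowledge hard to update and weakens general reasoning; training-free methods are flexible but rely on surface-level prompting and cannot supply the fine-grained atom-level priors needed for precise chemical reasoning. The key to the solution is ChemATP, which builds the first atom-level textual knowledge base and decouples chemical knowledge from the reasoning engine, so that a frozen LLM can dynamically retrieve and explicitly reason over this knowledge. This preserves the model's general intelligence while improving interpretability and adaptability.
Link: https://arxiv.org/abs/2512.19240
Authors: Mingxu Zhang, Dazhong Shen, Qi Zhang, Ying Sun
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Nanjing University of Aeronautics and Astronautics; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: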
Abstract:Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model’s general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM’s intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.
[NLP-22] Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation
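[Quick read]: This paper addresses social bias in large language models (LLMs) toward non-protected stigmatized groups. It asks which social features of stigmas (such as peril, concealability, and disruptiveness) are associated with biased outputs, and it evaluates bias mitigation with guardrail models. The key findings are as follows. Stigmas that humans rate as highly perilous (for example, being a gang member or having HIV) produce the highest share of biased outputs (60%), whereas sociodemographic stigmas (for example, being Asian-American or of old age) produce the lowest (11%). Using each model's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, and the Mistral Moderation API) reduces bias by 10.4%, 1.4%, and 7.8%, respectively. However, the guardrail models often fail to recognize the intent of bias in prompts, and the social features with a significant effect on bias remain unchanged after mitigation. Current guardrails therefore do not adequately address the sources of bias and need improvement.
Link: https://arxiv.org/abs/2512.19238
Authors: Anna-Maria Gueorguieva, Aylin Caliskan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: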
Abstract:Large language models (LLMs) have been shown to exhibit social bias; however, bias towards non-protected stigmatized identities remains understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have an effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example, deciding whether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g., Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM’s respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.
[NLP-23] CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation
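[Quick read]: This paper addresses the fact that chart understanding and generation tasks (chart question answering, chart parsing, and chart generation) are usually studied in isolation, which prevents models from learning semantics shared across tasks. The key to the solution is CycleChart, which adopts a schema-centric formulation as a common interface, builds a consistent multi-task dataset, and introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then recovers the schema and data from the generated chart. This enforces semantic alignment across directions and jointly optimizes chart understanding and generation.
Link: https://arxiv.org/abs/2512.19173
Authors: Dazhen Deng, Sen Yang, Yuchen He, Yuan Tian, Yingcai Wu
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: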
Abstract:Current chart-specific tasks, such as chart question answering, chart parsing, and chart generation, are typically studied in isolation, preventing models from learning the shared semantics that link chart generation and interpretation. We introduce CycleChart, a consistency-based learning framework for bidirectional chart understanding and generation. CycleChart adopts a schema-centric formulation as a common interface across tasks. We construct a consistent multi-task dataset, where each chart sample includes aligned annotations for schema prediction, data parsing, and question answering. To learn cross-directional chart semantics, CycleChart introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then learns to recover the schema and data from the generated chart, enforcing semantic alignment across directions. CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization and marking a step toward more general chart understanding models.
[NLP-24] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation
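[Quick read]: This paper addresses two problems. First, Joint-Embedding Predictive Architecture (JEPA) lacks generative ability. Second, existing latent-space reasoning approaches for Transformer models (such as COCONUT) improve performance but still rely on token-by-token generation, which accumulates error and depends on context information for reasoning insights. The key to the solution is JEPA-Reasoner, which decouples latent-space reasoning from token generation: JEPA-Reasoner reasons in latent space and produces mixed latent vectors that may lay the foundation for multi-threaded reasoning, while a separate "Talker" model produces human-readable text. This makes autoregressive generation more robust to compounding error.
Link: https://arxiv.org/abs/2512.19171
Authors: Bingyang Kelvin Liu, Ziyu Patrick Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: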
Abstract:While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, latent space reasoning attempts for Transformer models like COCONUT do improve performance, but they ultimately rely on token-by-token generation, which still accumulates compounding error and relies on context information to gain reasoning insights. To address these limitations, we propose JEPA-Reasoner, a novel JEPA model enhanced with generative ability that reasons in latent space. We augment it with a separate action-taker model, Talker, to produce human-readable sentences. Our approach demonstrates that decoupling latent space reasoning and token generation enables JEPA-Reasoner to produce mixed latent vectors that might lay the foundation for multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.
[NLP-25] From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs
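[Quick read]: This paper addresses the insufficient accuracy of current automatic speech recognition (ASR) systems in real production environments, especially for non-English content such as long-form Italian video, which keeps them from meeting the media industry's precision requirements for subtitling. The solution centers on a human-in-the-loop (HITL) approach. Four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) are evaluated on a 50-hour dataset of Italian television programs. The models cannot yet subtitle fully autonomously but are effective tools for raising human productivity, and the authors design a production-grade, cloud-based infrastructure to support this human-machine workflow.
Link: https://arxiv.org/abs/2512.19161
Authors: Alessandro Lucca, Francesco Pierri
Affiliations: Politecnico di Milano
Categories: Computation and Language (cs.CL)
Comments: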
Abstract:Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built upon Encoder-Decoder neural network architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content like long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their performance against the work of professional human subtitlers. The findings indicate that, while current models cannot meet the media industry’s accuracy needs for full autonomy, they can serve as highly effective tools for enhancing human productivity. We conclude that a human-in-the-loop (HITL) approach is crucial and present the production-grade, cloud-based infrastructure we designed to support this workflow.
[NLP-26] QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
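[Quick read]: This paper addresses hallucination in large language models (LLMs) during generation, focusing on the unreliability of dynamic Retrieval-Augmented Generation (dynamic RAG) methods that depend on model-internal confidence signals such as logits and entropy. Because LLMs are typically poorly calibrated and often highly confident in wrong outputs, these signals do not reliably identify hallucination risk. The key to the solution is to move uncertainty estimation from subjective model confidence to objective statistics computed from pre-training data. Before generation, low-frequency entities are identified to flag long-tail knowledge gaps. During generation, entity co-occurrence in the corpus is verified; zero co-occurrence signals high hallucination risk and triggers retrieval. Both stages use Infini-gram for millisecond-latency queries over a 4-trillion-token corpus. The result is a model-agnostic dynamic RAG mechanism that improves exact match (EM) on multi-hop QA by 5 to 14 points and generalizes across model families and domains such as biomedical QA.
Link: https://arxiv.org/abs/2512.19134
Authors: Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng
Affiliations: University of Illinois at Chicago; New York University; Monash University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: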
Abstract:Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5–12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at this https URL.
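The two-stage trigger described in this abstract can be sketched with a toy in-memory corpus. Everything here is an assumption made for illustration: a list of strings stands in for the pre-training corpus, document-level substring counts stand in for Infini-gram queries, and the threshold `min_freq` and all function names are hypothetical rather than taken from the paper.

```python
from typing import Iterable, List, Tuple


def count_docs(corpus: List[str], *terms: str) -> int:
    """Number of documents containing every term (case-insensitive substring match)."""
    lowered = [t.lower() for t in terms]
    return sum(all(t in doc.lower() for t in lowered) for doc in corpus)


def needs_retrieval_before(corpus: List[str],
                           question_entities: Iterable[str],
                           min_freq: int = 2) -> bool:
    """Stage 1: retrieve if any question entity is rare in the corpus."""
    return any(count_docs(corpus, e) < min_freq for e in question_entities)


def needs_retrieval_during(corpus: List[str],
                           entity_pairs: Iterable[Tuple[str, str]]) -> bool:
    """Stage 2: retrieve if any generated entity pair never co-occurs in the corpus."""
    return any(count_docs(corpus, a, b) == 0 for a, b in entity_pairs)


if __name__ == "__main__":
    corpus = [
        "Marie Curie was born in Warsaw.",
        "Marie Curie won the Nobel Prize in Physics.",
        "Warsaw is the capital of Poland.",
    ]
    print(needs_retrieval_before(corpus, ["Marie Curie"]))
    print(needs_retrieval_before(corpus, ["Poland"]))
    print(needs_retrieval_during(corpus, [("Marie Curie", "Warsaw")]))
    print(needs_retrieval_during(corpus, [("Marie Curie", "Poland")]))
```

The point of the design is that both checks depend only on corpus counts, not on the generating model's logits or entropy, which is what makes the trigger usable across different models.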
[NLP-27] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards
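[Quick read]: This paper addresses two issues in reinforcement learning (RL) for tool-use large language models (LLMs): existing methods overlook the potential of explicit reasoning rewards to improve reasoning and tool use, and naively combining reasoning rewards with outcome rewards can give suboptimal performance or conflict with the primary optimization objective. The key to the solution is advantage-weighted policy optimization (AWPO), which uses variance-aware gating and difficulty-aware weighting based on group-relative statistics to adaptively modulate the advantages derived from reasoning signals, together with a tailored clipping mechanism for stable optimization. This integrates reasoning rewards effectively and improves tool use in complex multi-turn tasks.
Link: https://arxiv.org/abs/2512.19126
Authors: Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, Ran He
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: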
Abstract:While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, naively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) – a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
[NLP-28] SAP: Syntactic Attention Pruning for Transformer-based Language Models
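[Quick read]: This paper addresses the limited efficiency and interpretability of attention-head pruning in Transformer models. Existing methods usually prune based only on mathematical analysis of model weights and activations, ignoring the syntactic structure of sentences and their attention patterns, which degrades performance after pruning and makes model behavior hard to interpret. The key to the solution is Syntactic Attention Pruning (SAP), which uses both syntactic structure and attention patterns to guide pruning and preserves critical heads with a high density of strong attention values without retraining. A Candidate Filtering (CF) mechanism further improves robustness by prioritizing heads according to their contribution to model performance.
Link: https://arxiv.org/abs/2512.19125
Authors: Tzu-Yun Lee, Ding-Yong Hong, Jan-Jan Wu
Affiliations: Institute of Information Science, Academia Sinica
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: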
Abstract:This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads with a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.
[NLP-29] BanglaForge: LLM Collaboration with Self-Refinement for Bangla Code Generation AACL2025
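[Quick read]: This paper addresses code generation for Bangla, a low-resource language that lacks large-scale annotated datasets and tools for turning natural-language specifications into executable programs. The key to the solution is BanglaForge, a retrieval-augmented dual-model collaboration framework with self-refinement. It combines in-context learning, LLM-based translation, systematic prompt engineering, and iterative self-refinement driven by execution feedback: a coder generates initial solutions and a reviewer improves them for robustness. On the BLP-2025 Bangla Code Generation benchmark it reaches 84.00% Pass@1, which supports the effectiveness of retrieval, model collaboration, and self-refinement.
Link: https://arxiv.org/abs/2512.19122
Authors: Mahir Labib Dihan, Sadif Ahmed, Md Nafiu Rahman
Affiliations: Bangladesh University of Engineering and Technology (BUET)
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Accepted at BLP Workshop @ IJCNLP-AACL 2025. Code is available at this https URL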
Abstract:Bangla is a low-resource language for code generation, lacking large-scale annotated datasets and tools to transform natural language specifications into executable programs. This makes Bangla-to-code generation a challenging task requiring innovative solutions. To address this, we introduce BanglaForge, a novel framework for generating code from Bangla function descriptions. BanglaForge leverages a retrieval-augmented dual-model collaboration paradigm with self-refinement, combining in-context learning, LLM-based translation, systematic prompt engineering, and iterative self-refinement based on execution feedback, where a coder generates initial solutions and a reviewer enhances them for robustness. On the BLP-2025 Bangla Code Generation benchmark, BanglaForge achieves a competitive Pass@1 accuracy of 84.00%, demonstrating the effectiveness of retrieval, model collaboration, and self-refinement for low-resource Bangla code generation.
[NLP-30] Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
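[Quick read]: This paper addresses the conceptual vagueness and governance difficulties in the current understanding of large generative models: the existing "Large Language Model" (LLM) framing does not adequately capture their nature as tools of discourse construction in social contexts. The key to the solution is an epistemological shift that redefines LLMs as "Large Discourse Models" (LDM) and then as "Artificial Discursive Agents" (ADA), and rebuilds the analytical framework on an ontological triad: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. The framework holds that the core object these models process is the document, a reified product of human experience. It argues for governance based on public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, with co-regulation by the State, industry, civil society, and academia.
Link: https://arxiv.org/abs/2512.19117
Authors: Amar Lakel (MICA)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: In French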
Abstract:This paper proposes an epistemological shift in the analysis of large generative models, replacing the category “Large Language Models” (LLM) with that of “Large Discourse Models” (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the “fascination/fear” dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.
[NLP-31] A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs
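[Quick read]: This paper addresses reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries, which is difficult because real-world KGs are incomplete and logical query structures are compositionally complex. Most existing methods embed entities and relations in continuous geometric spaces and answer queries through differentiable set operations, but they struggle with complex queries that involve multiple operators, deeper reasoning chains, or heterogeneous KG schemas. The key to the solution is ROG (Reasoning Over knowledge Graphs with large language models), which retrieves compact, query-relevant subgraphs through query-aware KG neighborhood retrieval and uses them as contextual evidence for chain-of-thought reasoning with a large language model (LLM). Complex FOL queries are decomposed into simpler sub-queries and answered step by step, with no task-specific embedding optimization. On standard KG reasoning benchmarks ROG outperforms strong embedding-based baselines, with the largest gains on high-complexity query types.
Link: https://arxiv.org/abs/2512.19092
Authors: Ziyan Zhang, Chao Wang, Zhuo Chen, Lei Chen, Chiyi Li, Kai Song
Affiliations: Chongqing Jiaotong University; State Grid Chongqing Electric Power Company
Categories: Computation and Language (cs.CL)
Comments: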
Abstract:Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.
[NLP-32] Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding
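[Quick read]: This paper addresses hallucination in object recognition by large vision-language models (LVLMs), where the generated text is fluent but does not match the image content, which undermines reliability in real applications. The key to the solution is Hallucination Disentangled Decoding (HDD), a training-free method. It strengthens the original image information by segmenting the image and selecting segments that augment the original, and it uses a blank image to remove hallucinations caused by language priors. This reduces the model's dependence on language priors and improves its visual performance.
Link: https://arxiv.org/abs/2512.19070
Authors: Ruiqi Ma, Yu Yan, Chunhong Zhang, Minghao Yin, XinChao Liu, Zhihong Jin, Zheng Hu
Affiliations: Beijing University of Posts and Telecommunications; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: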
Abstract:Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce the Hallucination Disentangled Decoding (HDD) method, which requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model’s dependence on language priors but also enhances its visual performance. (Code: this https URL)
[NLP-33] DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation
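Quick read: This paper addresses the shortcomings of existing benchmarks for evaluating drama script continuation, in particular their inability to measure how well a model maintains character consistency, advances the plot, and preserves dramatic structure. The key to the solution is DramaBench, the first large-scale benchmark framework for this task. It evaluates continuation quality along six independent dimensions (format standards, narrative efficiency, character consistency, emotional depth, logic consistency, and conflict handling), combining rule-based analysis, large language model (LLM) labeling, and statistical metrics to keep the evaluation objective and reproducible. Extensive model evaluation (8 state-of-the-art models, 1,103 scripts, 8,824 evaluations) and human validation support the independence and validity of the dimensions, providing actionable dimension-level feedback for improving generative AI in creative writing.
Link: https://arxiv.org/abs/2512.19012
Authors: Shijian Ma,Yunqi Huang,Yan Lin
Affiliations: University of Macau; University College London
Categories: Computation and Language (cs.CL)
Comments: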
Abstract:Drama script continuation requires models to maintain character consistency, advance the plot coherently, and preserve dramatic structure, capabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rule-based analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct a comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean |r| = 0.020). DramaBench provides actionable, dimension-specific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
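The independence claim rests on a low mean absolute correlation between dimensions. The sketch below shows one plausible way to compute such a figure, as the average of |Pearson r| over all pairs of dimensions; the dimension names and scores are made-up toy data, and the paper's exact procedure may differ.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mean_abs_correlation(scores_by_dimension):
    """Average |r| over every unordered pair of dimensions."""
    pairs = list(combinations(scores_by_dimension, 2))
    total = sum(
        abs(pearson(scores_by_dimension[a], scores_by_dimension[b]))
        for a, b in pairs
    )
    return total / len(pairs)


# Toy scores (hypothetical) for three dimensions over four scripts.
toy_scores = {
    "format": [1, 2, 3, 4],
    "emotion": [2, 1, 4, 3],
    "logic": [4, 3, 2, 1],
}
mean_abs_r = mean_abs_correlation(toy_scores)
print(round(mean_abs_r, 4))
```

With six dimensions there are 15 unordered pairs; a value near zero, such as the reported 0.020, would indicate that the dimensions carry largely non-overlapping information.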
[NLP-34] Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
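Quick read: This paper addresses the security threats that persistent prompt injection and jailbreaking attacks pose to Large Language Model (LLM) systems. The key to the solution is a lightweight, multi-stage defense architecture whose core component is a semantic filter built on text normalization, TF-IDF representations, and a Linear SVM classifier. At very low computational cost, this module reaches 93.4% accuracy and 96.5% specificity and substantially reduces attack traffic; detection and mitigation mechanisms integrated at later stages add layered protection. The design is validated on more than 30,000 labeled prompts and, compared with existing methods such as ShieldGemma, lowers latency by more than 10 times while substantially improving defensive accuracy.
Link: https://arxiv.org/abs/2512.19011
Authors: Akshaj Prashanth Rao,Advait Singh,Saumya Kumaar Saksena,Dhruv Kumar
Affiliations: Birla Institute of Technology and Science, Pilani
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under Review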
Abstract:Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline. Its core component is a semantic filter based on text normalization, TF-IDF representations, and a Linear SVM classifier. Despite its simplicity, this module achieves 93.4% accuracy and 96.5% specificity on held-out data, substantially reducing attack throughput while incurring negligible computational overhead. Building on this efficient foundation, the full pipeline integrates complementary detection and mitigation mechanisms that operate at successive stages, providing strong robustness with minimal latency. In comparative experiments, our SVM-based configuration improves overall accuracy from 35.1% to 93.4% while reducing average time to completion from approximately 450s to 47s, yielding over 10 times lower latency than ShieldGemma. These results demonstrate that the proposed design simultaneously advances defensive precision and efficiency, addressing a core limitation of current model-based moderators. Evaluation across a curated corpus of over 30,000 labeled prompts, including benign, jailbreak, and application-layer injections, confirms that staged, resource-efficient defenses can robustly secure modern LLM-driven applications.
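The abstract quotes both accuracy and specificity for the filter. For readers less familiar with the second metric, here is a minimal sketch of how the two are derived from a confusion matrix; the counts are invented for illustration and are not the paper's data, and treating "attack" as the positive class is an assumption.

```python
def accuracy_and_specificity(tp, fp, tn, fn):
    """Return (accuracy, specificity) from confusion-matrix counts.

    Positive class = attack prompt, negative class = benign prompt (assumed).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total   # share of all prompts classified correctly
    specificity = tn / (tn + fp)   # share of benign prompts allowed through
    return accuracy, specificity


# Hypothetical counts for 1,000 prompts (500 attacks, 500 benign).
acc, spec = accuracy_and_specificity(tp=450, fp=20, tn=480, fn=50)
print(f"accuracy={acc:.3f} specificity={spec:.3f}")
```

High specificity matters for a first-stage filter because every false positive blocks a legitimate user request before later stages can review it.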
[NLP-35] Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models
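Quick read: This paper addresses the inference inefficiency of Diffusion Large Language Models (DLLMs), which need many denoising iterations to turn an information-free, fully masked initialization into coherent text. Existing acceleration methods mainly focus on traversing the generative trajectory more efficiently, for example with improved solvers or sampling strategies. This paper proposes a complementary idea: shorten the trajectory itself through context-aware initialization. The core of the solution is a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization. It is instantiated with two mechanisms, discrete token injection and representation-level embedding interpolation, and is supplemented by confidence-based remasking to mitigate premature commitment caused by inaccurate priors. Experiments on GSM8K show about 35% fewer function evaluations, but also reveal a key challenge: naive warm-starting can hurt final accuracy. This motivates future research on calibration, revision mechanisms, and representation alignment.
Link: https://arxiv.org/abs/2512.19004
Authors: Tongyuan Miao,Gary Huang,Kai Jun Han,Annie Jiang
Affiliations: University of Michigan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: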
Abstract:Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
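To make the idea of confidence-based remasking concrete, here is a minimal sketch under stated assumptions: the auxiliary model's draft is a token list with one confidence per token, and any token below a fixed threshold is reset to the mask symbol before diffusion decoding starts. The `MASK` string, the threshold rule, and the toy values are all hypothetical; the abstract does not specify the actual rule.

```python
MASK = "<mask>"  # hypothetical mask symbol


def remask_low_confidence(tokens, confidences, threshold):
    """Keep injected tokens the prior is confident about; re-mask the rest."""
    return [
        token if confidence >= threshold else MASK
        for token, confidence in zip(tokens, confidences)
    ]


# Toy draft from an assumed auxiliary model, with per-token confidences.
draft = ["The", "answer", "is", "41"]
confidence = [0.98, 0.91, 0.95, 0.32]

warm_start = remask_low_confidence(draft, confidence, threshold=0.5)
print(warm_start)  # ['The', 'answer', 'is', '<mask>']
```

The design intent this illustrates is that an uncertain prior token is handed back to the diffusion model to resolve, rather than being committed to early.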
[NLP-36] Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework
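Quick read: This paper addresses the uncontrolled dialogue flow and inaccurate information extraction that arise when Large Language Models (LLMs) are applied directly, end to end, to medical follow-up tasks. The key to the solution is a modular pipeline system that uses task decomposition, semantic clustering, and flow management to manage the interaction in a structured way. This markedly improves dialogue stability and extraction accuracy, and greatly reduces the number of dialogue turns (by 46.73%) and token consumption (by 80% to 87.5%).
Link: https://arxiv.org/abs/2512.18999
Authors: Jinyan Liu,Zikang Chen,Qinchuan Wang,Tan Xie,Heming Zheng,Xudong Lv
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, conference ICCBB2025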
Abstract:When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method, built on task decomposition, semantic clustering, and flow management, substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.
[NLP-37] Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation ICRA2026
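Quick read: This paper addresses open-vocabulary mobile manipulation, in which a robot must follow free-form natural language instructions to identify a wide range of objects in complex indoor environments and carry them to designated receptacles. The difficulty lies in understanding both visual semantics and manipulation affordance. The key to the solution is Affordance RAG, a zero-shot hierarchical multimodal retrieval framework. It builds an Affordance-Aware Embodied Memory from pre-explored images, retrieves candidate targets based on regional and visual semantics, and reranks them with affordance scores, which improves the robot's accuracy in identifying actions that are actually executable in real scenes. The method outperforms existing approaches both in retrieval performance in large-scale indoor environments and in real-world task success rate (85%).
Link: https://arxiv.org/abs/2512.18987
Authors: Ryosuke Korekata,Quanting Xie,Yonatan Bisk,Komei Sugiura
Affiliations: Keio University; Keio AI Research Center; Carnegie Mellon University
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE RA-L, with presentation at ICRA 2026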
Abstract:In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
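The retrieve-then-rerank pattern described above can be sketched as follows. This is only an illustration of the general two-stage idea: the candidate fields, the `top_k` cut-off, and the linear mixing weight are assumptions, not the scoring used by Affordance RAG.

```python
def retrieve_and_rerank(candidates, top_k, affordance_weight):
    """Stage 1: shortlist by semantic similarity. Stage 2: rerank with affordance."""
    shortlist = sorted(candidates, key=lambda c: c["semantic"], reverse=True)[:top_k]

    def combined(candidate):
        # Hypothetical linear mix of the two scores.
        return ((1 - affordance_weight) * candidate["semantic"]
                + affordance_weight * candidate["affordance"])

    return sorted(shortlist, key=combined, reverse=True)


# Toy memory entries (hypothetical scores in [0, 1]).
memory = [
    {"name": "mug on high shelf", "semantic": 0.90, "affordance": 0.20},
    {"name": "mug on table", "semantic": 0.85, "affordance": 0.90},
    {"name": "bowl on table", "semantic": 0.60, "affordance": 0.95},
    {"name": "poster of a mug", "semantic": 0.40, "affordance": 0.00},
]

ranked = retrieve_and_rerank(memory, top_k=3, affordance_weight=0.5)
print([c["name"] for c in ranked])
```

In this toy run the semantically best match (the mug on a high shelf) drops below the reachable mug on the table once affordance is taken into account.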
[NLP-38] FASTRIC: Prompt Specification Language for Verifiable LLM Interactions
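Quick read: This paper addresses the lack of formal specifications for verifying that Large Language Models (LLMs) behave as the designer intended when they execute complex multi-turn interaction protocols. The key to the solution is FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, so that conformance can be verified through execution trace analysis. FASTRIC uses the LLM as a unified intelligent execution agent that acts at once as parser, interpreter, runtime environment, and development assistant, with no need for the compilers and parsers that traditional symbolic specification languages require. Its core innovation is to guide designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) and to treat the formality level of the specification as a tunable parameter. This reveals model-capacity-dependent optimal ranges of constraint ("Goldilocks zones") and turns multi-turn interaction design from a heuristic art into systematic engineering with measurable procedural guarantees.
Link: https://arxiv.org/abs/2512.18940
Authors: Wen-Long Jin
Affiliations: University of California, Irvine; California Institute for Telecommunications and Information Technology; Institute of Transportation Studies
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 13 pages, 3 figures. Supplementary materials at this https URL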
Abstract:Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, enabling conformance verification through execution trace analysis. The LLM serves as intelligent execution agent: interpreting designer-encoded FSMs to execute specified behavioral roles. Unlike symbolic specification languages requiring parsers and compilers, FASTRIC leverages LLMs as unified infrastructure: simultaneously parser, interpreter, runtime environment, and development assistant. FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) structuring multi-turn interactions. Specification formality, ranging from implicit descriptions that frontier models infer to explicit step-by-step instructions for weaker models, serves as a design parameter. We introduce procedural conformance as verification metric measuring execution adherence to FSM specifications. Testing a 3-state kindergarten tutoring FSM across four formality levels and three model scales (14.7B, 685B, 1T+ parameters) reveals optimal specification formality is a function of model capacity. DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) shows no stable optimum with high variance (SD=0.16-0.36). These findings reveal model-specific formality ranges (“Goldilocks zones”) where specifications provide sufficient structure without over-constraint, establishing Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees.
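As an illustration of checking an execution trace against an FSM specification, the sketch below scores a trace by the fraction of steps that follow an allowed transition. The three states, the trigger names, and this particular scoring rule are hypothetical; the abstract does not define how procedural conformance is computed.

```python
# Hypothetical 3-state tutoring FSM: (state, trigger) -> next state.
TRANSITIONS = {
    ("ask", "child_answers"): "evaluate",
    ("evaluate", "answer_wrong"): "ask",
    ("evaluate", "answer_right"): "done",
}
INITIAL_STATE = "ask"
FINAL_STATES = {"done"}


def procedural_conformance(trace):
    """Fraction of (state, trigger, next_state) steps permitted by TRANSITIONS."""
    if not trace:
        return 0.0
    valid_steps = sum(
        1 for state, trigger, next_state in trace
        if TRANSITIONS.get((state, trigger)) == next_state
    )
    return valid_steps / len(trace)


# Toy trace with one violation: the third step skips the evaluation state.
observed = [
    ("ask", "child_answers", "evaluate"),
    ("evaluate", "answer_wrong", "ask"),
    ("ask", "child_answers", "done"),
    ("evaluate", "answer_right", "done"),
]
score = procedural_conformance(observed)
print(score)  # 0.75
```

A fuller check would also confirm that the trace starts in `INITIAL_STATE` and ends in one of `FINAL_STATES`; those constants are included to show where such checks would attach.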
[NLP-39] Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations
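Quick read: This paper addresses the black-box nature of automatic machine translation (MT) evaluation metrics: existing metrics agree closely with human ratings but lack interpretability, and their performance drops markedly on out-of-distribution (OOD) inputs. The key to the solution is Remedy-R, a generative MT evaluation model trained with reinforcement learning from pairwise translation preferences, without error-span annotations or distillation from closed-source Large Language Models (LLMs). Remedy-R outputs step-by-step analyses of accuracy, fluency, and completeness followed by a final score, which makes the evaluation interpretable. Its self-reflective feedback can also be used to improve translations, which leads to Remedy-R Agent, an evaluate-revise loop whose effectiveness and practicality are validated across multiple models.
Link: https://arxiv.org/abs/2512.18906
Authors: Shaomu Tan,Ryosuke Mitani,Ritvik Choudhary,Qiyu Wu,Toshiyuki Sekiya,Christof Monz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: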
Abstract:Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R’s evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R’s reasoning captures translation-relevant information and is practically useful.
[NLP-40] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
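Quick read: This paper addresses accurate estimation of item (question or task) difficulty in educational assessment, especially the cold start problem, where the difficulty of a new item is hard to assess without historical data. The core of the work is a large-scale empirical analysis comparing how consistently humans and Large Language Models (LLMs) perceive difficulty across domains such as medical knowledge and mathematical reasoning. It finds that scaling up model size does not reliably improve alignment with human difficulty judgments. Models instead converge toward a "machine consensus", higher-capability models find it harder to simulate learners' cognitive limitations, and models lack the introspection needed to predict their own limitations. Strong general problem-solving ability in current LLMs therefore does not amount to understanding human cognitive struggles, which exposes the core challenge of using existing models for automated difficulty prediction.
Link: https://arxiv.org/abs/2512.18880
Authors: Ming Li,Han Chen,Yunze Xiao,Jian Chen,Hong Jiao,Tianyi Zhou
Affiliations: University of Maryland; Carnegie Mellon University; University at Buffalo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: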
Abstract:Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
[NLP-41] Application of deep learning approaches for medieval historical documents transcription
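Quick read: This paper addresses the marked drop in effectiveness of Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) on handwritten medieval Latin documents (9th to 11th centuries). The key to the solution is a deep learning method that takes into account properties specific to medieval documents, including varied handwriting, aged paper, faded ink, and irregular layout. By building a dedicated dataset for such historical documents and designing an end-to-end deep learning pipeline that runs from object detection to word classification and embedding, it extracts and recognizes complex handwritten text effectively and improves model performance on historical documents.
Link: https://arxiv.org/abs/2512.18865
Authors: Maksym Voloshchuk,Bohdana Zarembovska,Mykola Kozlenko
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 15 figures, 4 tables. Originally published by CEUR Workshop Proceedings ( this http URL , ISSN 1613-0073), available: this https URL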
Abstract:Handwritten text recognition and optical character recognition solutions show excellent results when processing modern-era data, but their efficiency drops on Latin documents from medieval times. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The explanatory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published on the GitHub repository.
[NLP-42] Toward Human-Centered AI-Assisted Terminology Work
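Quick read: This paper addresses the problems raised by the rapid spread of generative artificial intelligence (Generative AI) in terminology work, namely weakened professional autonomy, amplified bias, and the loss of linguistic and conceptual diversity. The key to the solution is a human-centered AI framework that positions AI as a tool for amplifying the terminologist's capabilities rather than replacing them. It is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. The framework stresses that high automation is compatible with strong human control, highlights the central role of terminologists in bias mitigation, and argues that AI tools and workflows should be designed around terminologists' needs, values, and well-being, so as to safeguard the accuracy, adequacy, and diversity of terminology work.
Link: https://arxiv.org/abs/2512.18859
Authors: Antonio San Martin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: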
Abstract:The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist’s capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.
[NLP-43] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models
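Quick read: This paper addresses the weak calculation verification of Large Language Models (LLMs) under conventional prompting techniques. The core challenge is that, although the models have some reasoning ability, they cannot reliably detect and correct faulty calculation paths in complex arithmetic tasks. The key to the solution is a three-phase method, MDToC (Metacognitive Dynamic Tree of Concepts). It builds a concept tree to structure the decomposition of the problem, generates accuracy-verified calculations for each concept, and uses majority voting to evaluate and select among competing candidate solutions, achieving calculation verification at the metacognitive level and markedly improving the accuracy and robustness of mathematical reasoning.
Link: https://arxiv.org/abs/2512.18841
Authors: Tung Duong Ta,Tim Oates
Affiliations: University of Maryland, Baltimore County
Categories: Computation and Language (cs.CL)
Comments: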
Abstract:Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate our MDToC’s effectiveness, with GPT-4-Turbo achieving 58.1% on CHAMP, 86.6% on MATH, and 85% on Game-of-24 - outperforming GoT by 5%, 5.4%, and 4% on all these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6% over ToT and 6.2% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.
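The final phase relies on majority voting over competing solutions. A minimal sketch of that step is shown below, assuming each candidate solution has already been reduced to a final answer string; the sample answers are invented, and how MDToC normalizes or weights answers is not stated in the abstract.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the share of votes it received."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers from five independently generated solutions.
candidate_answers = ["24", "24", "18", "24", "20"]

winner, support = majority_vote(candidate_answers)
print(winner, support)  # 24 0.6
```

The vote share can double as a rough confidence signal: a narrow majority may be a cue to generate or verify more candidates.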
[NLP-44] AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus
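Quick read: This paper addresses the large amount of redundant data in Arabic pretraining corpora, which limits training efficiency and quality. Its core solution is to systematically reuse and refine existing public corpora instead of crawling the web again. The authors combine seven existing Arabic web datasets, re-filter some of them with quality filtering designed specifically for Arabic text, and apply cross-dataset deduplication at both the MinHash and sentence level. The results show that about 60% of the tokens in these independently collected datasets are duplicates, which means new crawls would inevitably reproduce the redundancy. The work shows that, for lower-resource languages, careful cleaning and integration of existing data is more cost-effective and more valuable in practice than blindly expanding web collection.
Link: https://arxiv.org/abs/2512.18834
Authors: Sultan Alrashed,Francesco Orabona
Affiliations: King Abdullah University of Science and Technology (KAUST)
Categories: Computation and Language (cs.CL)
Comments: Initial version, without pretraining experiments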
Abstract:We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
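MinHash deduplication works by comparing short signatures instead of full documents. The sketch below is a small, standard-library illustration of the technique, not the AraMix pipeline: the word 3-gram shingles, the 64 hash functions, and the sample sentences are all assumptions chosen for readability.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; a short text yields a single shingle."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching positions approximates the Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "completely different sentence about arabic corpora and token counts here"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near = estimated_jaccard(sig_a, sig_b)  # near-duplicate pair: high estimate
far = estimated_jaccard(sig_a, sig_c)   # unrelated pair: estimate near zero
print(near, far)
```

In a deduplication pass, pairs whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates and only one copy kept; at corpus scale this is usually combined with locality-sensitive hashing so that not every pair has to be compared.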
[NLP-45] From Word to World: Can Large Language Models be Implicit Text-based World Models?
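Quick read: This paper addresses the difficulty of acquiring experience, the limited coverage, and the poor scalability that agentic reinforcement learning faces in real-world environments, and explores how effective Large Language Models (LLMs) are as world models in text-based environments and where their limits lie. The key to the solution is a three-level evaluation framework (fidelity and consistency, scalability and robustness, and agent utility). Empirically, sufficiently trained LLM world models maintain a coherent latent state, scale predictably with data and model size, and markedly improve agent performance through action verification, synthetic trajectory generation, and warm-starting reinforcement learning. These gains depend strongly on behavioral coverage and environment complexity, which delineates the conditions under which world modeling effectively supports agent learning.
Link: https://arxiv.org/abs/2512.18832
Authors: Yixia Li,Hongru Wang,Jiahao Qiu,Zhenfei Yin,Dongdong Zhang,Cheng Qian,Zeping Li,Pony Ma,Guanhua Chen,Heng Ji,Mengdi Wang
Affiliations: Southern University of Science and Technology; Microsoft Research; University of Edinburgh; Princeton University; Oxford University; University of Illinois Urbana-Champaign; Fudan University; Mind Lab
Categories: Computation and Language (cs.CL)
Comments: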
Abstract:Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating a clear boundary on when world modeling effectively supports agent learning.
[NLP-46] From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure
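Quick read: This paper addresses the semantic location of control and diagnostic signals in complex experimental infrastructure, that is, how to map natural-language intent accurately onto concrete control-system signals in order to improve operational efficiency, scalability, and language-model-driven interaction. The core challenge is that existing systems depend on informal expert knowledge, inconsistent naming, and fragmented documentation, which makes finding signals a bottleneck. The key to the solution is a four-paradigm framework: (i) direct in-context lookup over curated channel dictionaries; (ii) constrained hierarchical navigation through structured trees; (iii) interactive agent exploration using iterative reasoning and tool calls; and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. The framework is validated at four operational facilities of different scales and architectures, reaching 90-97% accuracy on expert-curated queries.
Link: https://arxiv.org/abs/2512.18779
Authors: Thorsten Hellert,Nikolay Agladze,Alex Giovannone,Jan Jug,Frank Mayet,Mark Sherwin,Antonin Sulc,Chris Tennant
Affiliations: Lawrence Berkeley National Laboratory; University of California Santa Barbara; Deutsches Elektronen-Synchrotron DESY; Thomas Jefferson National Accelerator Facility; Cosylab USA
Categories: Computation and Language (cs.CL); Accelerator Physics (physics.acc-ph)
Comments: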
Abstract:Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding (mapping natural-language intent to concrete control-system signals) as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale, from compact free-electron lasers to large synchrotron light sources, and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.
[NLP-47] Code2Doc: A Quality-First Curated Dataset for Code Documentation
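Quick read: This paper addresses the way low-quality training data limits the performance of automatic code documentation generation models. Existing code documentation datasets are mostly built by large-scale scraping of public repositories and suffer from noisy documentation, heavy duplication, and contamination by AI-generated content, which weakens the supervision signal and complicates evaluation. The key to the solution is Code2Doc, a new quality-first dataset built with a four-stage curation pipeline: it enforces documentation completeness and clarity, filters functions by structure and complexity, removes exact and near-duplicate code, and identifies documentation that is likely AI-generated. Of 52,069 candidates, only 25.6% are kept as high-quality samples (13,358 function-documentation pairs), which markedly improves the cleanliness and usability of the data. Experiments show that even at this small scale the dataset yields notable gains in BLEU (+29.47%) and ROUGE-L (+24.04%), providing a reliable data foundation and a reproducible research framework for high-quality code documentation generation.
Link: https://arxiv.org/abs/2512.18748
Authors: Recep Kaan Karaman,Meftun Akarsu
Affiliations: Uludag University; Technische Hochschule Ingolstadt
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: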
Abstract:The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large-scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI-generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation. We introduce Code2Doc, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation pipeline that enforces documentation completeness and clarity, filters functions based on structural and complexity criteria, removes exact and near-duplicate code, and identifies documentation likely to be AI-generated. Starting from 52,069 extracted candidates, only 25.6 percent satisfy all quality constraints. We provide a detailed analysis of the resulting dataset, which achieves a mean documentation quality score of 6.93 out of 10. Overall, 86.9% of samples contain explicit type annotations, and only 2.9% are flagged as potentially AI-generated. Baseline experiments show that fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero-shot performance, despite the modest dataset size. We release both the dataset and the full curation pipeline to support reproducible research on automatic code documentation generation.
[NLP-48] MemEvolve: Meta-Evolution of Agent Memory Systems
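Quick read: This paper addresses the way static memory architectures constrain the self-evolution of agents based on Large Language Models (LLMs). Existing methods can store trajectories, distill experience, and synthesize reusable tools through manually designed memory systems and thus evolve online, but the underlying memory structure cannot be meta-adapted to the task context, which limits the agent's ability to keep improving its own learning mechanism. The key to the solution is MemEvolve, a meta-evolutionary framework that jointly evolves the agent's experiential knowledge and its memory architecture (comprising four modules: encode, store, retrieve, and manage), so that the system not only accumulates experience but also gradually improves how it learns from that experience. The framework introduces EvolveLab as a unified self-evolving memory codebase that abstracts twelve representative memory systems into a modular design space, providing a standardized implementation and a fair platform for comparison, and it markedly improves generalization across tasks and across LLMs.
Link: https://arxiv.org/abs/2512.18746
Authors: Guibin Zhang,Haotian Ren,Chong Zhan,Zhenhong Zhou,Junhao Wang,He Zhu,Wangchunshu Zhou,Shuicheng Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: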
Abstract:Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents’ experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to 17.06%; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.
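To make the four-part design space (encode, store, retrieve, manage) more concrete, here is a minimal sketch of a memory object whose parts can be swapped independently. It is not EvolveLab's actual interface: the class `MemorySystem`, the helpers `keyword_retrieve` and `keep_latest`, and all signatures are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemorySystem:
    """Hypothetical memory with four swappable slots (slot names from the abstract)."""
    encode: Callable[[str], str]                      # trajectory -> stored form
    retrieve: Callable[[List[str], str], List[str]]   # (store, task) -> relevant entries
    manage: Callable[[List[str]], List[str]]          # pruning / consolidation policy
    store: List[str] = field(default_factory=list)    # storage backend (a plain list here)

    def add(self, trajectory: str) -> None:
        # Encode the new trajectory, store it, then let the manager prune.
        self.store.append(self.encode(trajectory))
        self.store = self.manage(self.store)

    def query(self, task: str) -> List[str]:
        return self.retrieve(self.store, task)


def keyword_retrieve(store: List[str], task: str) -> List[str]:
    # Return entries that share at least one word with the task description.
    words = set(task.lower().split())
    return [entry for entry in store if words & set(entry.split())]


def keep_latest(k: int) -> Callable[[List[str]], List[str]]:
    # Management policy: keep only the k most recent entries.
    return lambda store: store[-k:]


memory = MemorySystem(
    encode=lambda t: t.strip().lower(),
    retrieve=keyword_retrieve,
    manage=keep_latest(2),
)
for trajectory in ["Search API timed out", "Retry fixed the search call", "Parsed the table"]:
    memory.add(trajectory)

hits = memory.query("search failure")
```

Because each slot is an ordinary callable, a search over memory designs could in principle replace `retrieve` or `manage` without touching the rest; how MemEvolve actually performs that search is not described in the abstract.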
[NLP-49] InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
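[Quick read]: This paper addresses the weak visual reasoning of current open multimodal agents, which struggle to integrate fine-grained visual information in real-world tasks such as analyzing documents with dense charts or maps. The core challenge is multi-step reasoning with interleaved attention to scattered image regions. The solution has two parts: the O3-Bench benchmark, whose difficult problems require fusing visual information across regions and so test multimodal reasoning systematically; and the InSight-o3 multi-agent framework, which introduces the task of generalized visual search, that is, locating relational, fuzzy, or conceptual regions described in free-form language rather than only simple objects, and trains a multimodal LLM for this task with reinforcement learning. The resulting searcher works as a plug-in module that strengthens the visual reasoning of frontier multimodal models, a concrete step toward o3-like open systems.
Link: https://arxiv.org/abs/2512.18745
Authors: Kaican Li,Lewei Yao,Jiannan Wu,Tiezheng Yu,Jierun Chen,Haoli Bai,Lu Hou,Lanqing Hong,Wei Zhang,Nevin L. Zhang
Affiliations: Hong Kong University of Science and Technology; Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: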
Abstract:The ability for AI agents to “think with images” requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search – locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at this https URL .
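[NLP-50] Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design
[Quick read]: This paper addresses a bottleneck in high-cost simulation-driven design: translating ambiguous design requirements into a mathematical optimization problem, a process that is time-consuming and depends heavily on expert knowledge. Existing methods either formalize poorly and fail to match the design intent, or rely on solver feedback for data filtering, which is hard to obtain because simulations are expensive. The key to the solution is the APF framework, built around a pipeline that automatically generates high-quality training data without solver feedback; large language models (LLMs) are then fine-tuned on this data with supervision, which markedly improves their ability to produce accurate, executable optimization formulations.
Link: https://arxiv.org/abs/2512.18682
Authors: Yuchen Li,Handing Wang,Bing Xue,Mengjie Zhang,Yaochu Jin
Affiliations: Xidian University; Victoria University of Wellington; Westlake University
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: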
Abstract:In the high-cost simulation-driven design domain, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. This process is time-consuming and heavily reliant on expert knowledge. While large language models (LLMs) offer potential for automating this task, existing approaches either suffer from poor formalization that fails to accurately align with the design intent or rely on solver feedback for data filtering, which is unavailable due to the high simulation costs. To address this challenge, we propose APF, a framework for solver-independent, automated problem formulation via LLMs designed to automatically convert engineers’ natural language requirements into executable optimization models. The core of this framework is an innovative pipeline for automatically generating high-quality data, which overcomes the difficulty of constructing suitable fine-tuning datasets in the absence of high-cost solver feedback with the help of data generation and test instance annotation. The generated high-quality dataset is used to perform supervised fine-tuning on LLMs, significantly enhancing their ability to generate accurate and executable optimization problem formulations. Experimental results on antenna design demonstrate that APF significantly outperforms the existing methods in both the accuracy of requirement formalization and the quality of resulting radiation efficiency curves in meeting the design goals.
[NLP-51] brat: Aligned Multi-View Embeddings for Brain MRI Analysis WACV2026
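[Quick read]: This paper addresses the difficulty of learning report-aligned representations for brain magnetic resonance imaging (MRI), where abnormalities are numerous, varied, often subtle, and confined to a small part of a 3D volume. The key to the solution is brat (brain report alignment transformer), a multi-view representation learning framework trained on a large dataset of roughly 80,000 3D MRI scans with paired radiology reports, using a multi-view pre-training strategy inspired by document retrieval. An implicit query-feature matching mechanism, combined with ideas from quality-diversity, yields multi-view MRI embeddings aligned with the clinical features in the report sentences, giving substantial gains on vision-language and vision tasks.
Link: https://arxiv.org/abs/2512.18679
Authors: Maxime Kayser,Maksim Gridnev,Wanting Wang,Max Bain,Aneesh Rangnekar,Avijit Chatterjee,Aleksandr Petrov,Harini Veeraraghavan,Nathaniel C. Swinburne
Affiliations: Memorial Sloan Kettering Cancer Center; University of Oxford; London School of Economics
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: First round accept at WACV 2026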
Abstract:We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset 10× larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.
[NLP-52] Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital
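[Quick read]: This paper addresses the automation of capitalization tie-out, a complex legal task that requires multi-document reasoning, strict evidence traceability, and deterministic outputs, which current generative AI and agentic systems cannot deliver reliably. The key to the solution is a proposed world model architecture that supports tie-out automation and serves as a foundation for broader applied legal intelligence.
Link: https://arxiv.org/abs/2512.18658
Authors: Pierre Colombo,Malik Boudiaf,Allyn Sweet,Michael Desa,Hongxi Wang,Kevin Candra,Syméon del Marmol
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: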
Abstract:Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation, and more broadly as a foundation for applied legal intelligence.
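[NLP-53] LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction AAAI2026
[Quick read]: This paper addresses hallucination in large language models (LLMs), where generated content lacks factual or contextual grounding and so limits reliability in critical applications. Existing remedies such as supervised fine-tuning and reinforcement learning from human feedback (RLHF) are costly in data and compute, while static parameter editing struggles with context-dependent errors and is prone to catastrophic forgetting. The proposed LLM-CAS framework casts real-time hallucination correction as a hierarchical reinforcement learning problem: an agent learns a policy that selects temporary neuron perturbations at inference time based on the current context, giving adaptive, fine-grained correction without permanent parameter changes. The approach outperforms static editing methods (ITI, CAA) and the dynamic SADI framework, improving factual accuracy on several benchmarks.
Link: https://arxiv.org/abs/2512.18623
Authors: Jensen Zhang,Ningyuan Liu,Yijia Fan,Zihao Huang,Qinglin Zeng,Kaitong Cai,Jian Wang,Keze Wang
Affiliations: Sun Yat-sen University; Snap Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026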
Abstract:Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting. We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification. Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.
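The property that perturbations are temporary, with no permanent parameter modification, can be illustrated with a small sketch. This is only a toy under stated assumptions: activations are a plain list of floats, and `toy_policy` is a hypothetical hand-written stand-in for the learned hierarchical RL policy, which the abstract does not specify.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def temporary_perturbation(
    activations: List[float], deltas: Dict[int, float]
) -> Iterator[List[float]]:
    """Add deltas to the chosen neuron indices, then restore the originals on exit."""
    saved = {i: activations[i] for i in deltas}
    try:
        for i, delta in deltas.items():
            activations[i] += delta
        yield activations
    finally:
        # Restoration runs even if generation inside the block raises.
        for i, value in saved.items():
            activations[i] = value


def toy_policy(context: str) -> Dict[int, float]:
    # Hypothetical rule standing in for a learned policy: damp neuron 1
    # for questions that start with "who", otherwise leave everything alone.
    return {1: -0.5} if context.lower().startswith("who") else {}


activations = [0.2, 0.9, -0.1]
with temporary_perturbation(activations, toy_policy("Who wrote Hamlet?")) as perturbed:
    during = list(perturbed)   # snapshot while the perturbation is active
after = list(activations)      # snapshot after the block has exited
```

The context manager is a design choice for the sketch: it guarantees the original values come back after each step, which is the contrast with static editing that the abstract draws.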
[NLP-54] A Multi-agent Text2SQL Framework using Small Language Models and Execution Feedback
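[Quick read]: This paper addresses the limited generalization of small language models (SLMs) on Text2SQL, a constraint that matters for in-house deployments where privacy and cost rule out external large language model (LLM) services. The key to the solution is MATS, a multi-agent framework that assigns specialized roles to auxiliary agents to reduce individual workload and encourage collaboration, combined with reinforcement-learning-based training that aligns the agents using feedback obtained during execution. With far fewer parameters, MATS reaches accuracy on par with large-scale LLMs.
Link: https://arxiv.org/abs/2512.18622
Authors: Thanh Dat Hoang,Thanh Trung Huynh,Matthias Weidlich,Thanh Tam Nguyen,Tong Chen,Hongzhi Yin,Quoc Viet Hung Nguyen
Affiliations: Griffith University (Australia); VinUniversity (Vietnam); Humboldt-Universität zu Berlin (Germany); The University of Queensland (Australia)
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: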
Abstract:Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can be hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results on benchmark datasets show that MATS, deployed on a single-GPU server, yields accuracy that is on par with large-scale LLMs while using significantly fewer parameters. Our source code and data are available at this https URL.
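As a rough illustration of what "feedback obtained during execution" can look like, the sketch below runs candidate queries against an in-memory SQLite database and returns either the rows or the database error message. The function `execution_feedback`, the toy schema, and the candidate queries are assumptions made for this example; the abstract does not describe how MATS turns such signals into rewards.

```python
import sqlite3
from typing import Any, List, Tuple


def execution_feedback(conn: sqlite3.Connection, sql: str) -> Tuple[bool, Any]:
    """Run a candidate query; return (True, rows) or (False, error message)."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)


# Toy database standing in for a benchmark schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [("Ana", 120), ("Bo", 90)])

candidates: List[str] = [
    "SELECT name FROM employee WHERE salary > 100",    # wrong table name
    "SELECT name FROM employees WHERE salary > 100",   # executes successfully
]
results = [execution_feedback(conn, sql) for sql in candidates]
```

An error string such as the one returned for the first candidate could be passed to a repairing agent, while a successful execution could be compared with reference rows; both uses are possibilities, not a description of the authors' pipeline.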
[NLP-55] A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts
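[Quick read]: This paper addresses automated masking of personally identifiable information (PII) in privacy-preserving conversational systems, asking whether lightweight models can match frontier large language models (LLMs), whose use raises data-handling and compute-cost concerns. The study compares an encoder-decoder model (T5-small) with a decoder-only model (Mistral-Instruct-v0.3) across several dataset variants, using label standardization to improve consistency and finer-grained PII categories to study representation. Mistral achieves higher F1 and recall and is more robust, while T5 offers more controllable structured output and lower inference cost, which suits real-time settings such as a Discord bot; the results map out the trade-offs between accuracy, robustness, and computational efficiency.
Link: https://arxiv.org/abs/2512.18608
Authors: Prabigya Acharya,Liza Shrestha
Affiliations: IOE, Pulchowk Campus
Categories: Computation and Language (cs.CL)
Comments: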
Abstract:Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
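For readers unfamiliar with entity-level metrics, the following sketch computes precision, recall, and F1 over predicted PII spans. It assumes one common convention, where a prediction counts only if start offset, end offset, and type all match the gold span; the paper's exact matching rule is not given in the abstract, and the example text and labels are made up.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PII type)


def entity_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact span-and-type matching."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


text = "Email ana@example.com or call 555-0100."
gold = [(6, 21, "EMAIL"), (30, 38, "PHONE")]
pred = [(6, 21, "EMAIL"), (30, 38, "NAME")]   # right span, wrong type for the phone

precision, recall, f1 = entity_prf(gold, pred)
```

Here the second prediction finds the right characters but the wrong label, so it is penalized at entity level; a separate type-accuracy or character-level metric, as the abstract mentions, would score that case differently.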
[NLP-56] On Finding Inconsistencies in Documents
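[Quick read]: This paper addresses automatic detection of inconsistencies in documents, a task that is costly and slow to do by hand in fields such as academia, law, and finance. The key contribution is FIND (Finding INconsistencies in Documents), a benchmark in which a domain expert manually inserts an inconsistency into each document, reproducing the complexity and technical depth of real cases. Even the best model tested (gpt-5) recovers only 64% of the inserted inconsistencies; it also surfaces inconsistencies already present in the source documents, with 136 of its 196 suggestions on 50 arXiv papers judged legitimate. The results show both the promise and the limits of language models for document auditing.
Link: https://arxiv.org/abs/2512.18601
Authors: Charles J. Lovering,Seth Ebner,Brandon Smock,Michael Krumdick,Saad Rabbani,Ahmed Muhammad,Varshini Reddy,Chris Tanner
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: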
Abstract:Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model’s suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.
[NLP-57] From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation
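[Quick read]: This paper addresses the language barrier to legal information in multilingual countries such as India, where much legal and judicial documentation remains in English. The approach is Legal Machine Translation (L-MT) for accurate, accessible translation of legal text, studied through two complementary strategies: fine-tuning a pre-trained OPUS-MT model for domain adaptation, and training a Transformer from scratch on the provided legal corpus. The fine-tuned OPUS-MT model reaches a SacreBLEU score of 46.03, well above both the baseline and the from-scratch model, which highlights the value of domain adaptation for translation quality and for access to justice and legal transparency.
Link: https://arxiv.org/abs/2512.18593
Authors: Amit Barman,Atanu Mandal,Sudip Kumar Naskar
Affiliations: Jadavpur University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: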
Abstract:In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with 2 complementary strategies, fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.
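[NLP-58] Toward Training Superintelligent Software Agents through Self-Play SWE-RL
[Quick read]: This paper addresses the heavy dependence of current software agents, driven by large language models (LLMs) and agentic reinforcement learning (RL), on human-curated training data such as GitHub issues and tests, a dependence the authors see as a fundamental barrier to superintelligent software agents. The key to the solution is Self-play SWE-RL (SSR), which needs only sandboxed repositories with source code and installed dependencies, with no human-labeled issues or tests. In a self-play setting, a single LLM agent iteratively injects and repairs bugs of increasing complexity, and each bug is formally specified by a test patch rather than a natural language issue description. SSR shows notable self-improvement on SWE-bench Verified and SWE-Bench Pro (+10.4 and +7.8 points, respectively) and stays ahead of the human-data baseline throughout training, even though evaluation uses natural language issues absent from self-play. The results suggest a path on which agents gather extensive learning experience autonomously from real-world codebases.
Link: https://arxiv.org/abs/2512.18552
Authors: Yuxiang Wei,Zhiqing Sun,Emily McMilin,Jonas Gehring,David Zhang,Gabriel Synnaeve,Daniel Fried,Lingming Zhang,Sida Wang
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: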
Abstract:While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.
[NLP-59] Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering
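[Quick read]: This paper addresses behavioral control and steering of large language models: how to adjust model behavior flexibly without a large increase in compute. Fine-tuning with low-rank adaptation (LoRA) can change behavior but costs more compute and is less flexible. The key idea is the neologism: a new token, absent from the original vocabulary, trained to represent a specific concept. Learning it updates only d parameters, induces the desired behavior, and leaves the model's default behavior accessible. Under a matched training setup, neologism learning outperforms LoRA fine-tuning, and when asked about a neologism the model occasionally invents new words of its own.
Link: https://arxiv.org/abs/2512.18551
Authors: Sungjoon Park,Varun Ramamurthi,Owen Terry
Affiliations: Columbia University
Categories: Computation and Language (cs.CL)
Comments: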
Abstract:In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model’s vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending prompts with “Give me a neologism answer.” Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism only trains d parameters and allows the user to still access the model’s default behavior. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms, and observe that the model will occasionally make up its own new words when asked about a neologism.
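The claim that learning a neologism trains only d parameters can be pictured as a gradient step that touches a single embedding row. The sketch below is a toy under stated assumptions: embeddings are plain lists, gradients are supplied by hand, and `update_neologism_only` is a hypothetical helper, not the authors' training code.

```python
from typing import List

Matrix = List[List[float]]


def update_neologism_only(
    embeddings: Matrix, grads: Matrix, neologism_id: int, lr: float = 0.1
) -> Matrix:
    """Apply one SGD step to the new token's row; every other row stays frozen."""
    updated = [row[:] for row in embeddings]          # copy, leave the input untouched
    updated[neologism_id] = [
        w - lr * g for w, g in zip(embeddings[neologism_id], grads[neologism_id])
    ]
    return updated


d = 4                                                  # embedding dimension
embeddings = [[0.1] * d, [0.2] * d, [0.0] * d]        # last row: the new token
grads = [[1.0] * d for _ in embeddings]               # pretend gradients for every row
updated = update_neologism_only(embeddings, grads, neologism_id=2)
trainable_parameters = len(updated[2])                # equals d
```

Since the original rows are never modified, prompts that omit the new token see exactly the original model, which matches the abstract's point about retaining access to default behavior.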
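[NLP-60] LLMs on Drugs: Language Models Are Few-Shot Consumers
[Quick read]: This paper addresses the sensitivity of large language models (LLMs) to personas imposed at inference time, and in particular the effect of psychoactive framings, which had not been benchmarked rigorously. The solution is a controlled experiment on ARC-Challenge with deterministic decoding, full logging, and significance testing (Fisher exact tests), comparing four single-sentence prompts (LSD, cocaine, alcohol, cannabis) against a sober control on 100 validation items per condition. The persona prompts significantly reduce accuracy, mainly because they disrupt the mandated "Answer: LETTER" output template rather than because anything changes in the model weights, which shows that a persona acting as a "few-shot consumable" can threaten reliability.
Link: https://arxiv.org/abs/2512.18546
Authors: Alexander Doudkin
Affiliations: HFBK Hamburg
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures, 2 tables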
Abstract:Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level “drug” interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts – LSD, cocaine, alcohol, and cannabis – are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated “Answer: LETTER” template. Persona text therefore behaves like a “few-shot consumable” that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at this https URL.
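The statistics named in the abstract can be reproduced in outline with the standard library. The sketch below implements a Wilson score interval and a two-sided Fisher exact test, plus a strict parser for the "Answer: LETTER" template. Several things are assumptions: accuracies are converted to counts out of 100 items (0.45 becomes 45 correct), the answer letters are taken to be A to D, and the two-sided p-value uses the usual "tables no more likely than the observed one" rule, which may differ from the authors' scripts.

```python
import re
from math import comb, sqrt
from typing import Optional, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_observed * (1 + 1e-9))


def parse_answer(output: str) -> Optional[str]:
    # Strict template check: anything that drifts from "Answer: X" scores as wrong.
    match = re.search(r"Answer:\s*([A-D])\b", output)
    return match.group(1) if match else None


n_items = 100
control_correct, alcohol_correct, cannabis_correct = 45, 10, 30
low, high = wilson_interval(control_correct, n_items)
p_alcohol = fisher_exact_two_sided(
    control_correct, n_items - control_correct, alcohol_correct, n_items - alcohol_correct
)
p_cannabis = fisher_exact_two_sided(
    control_correct, n_items - control_correct, cannabis_correct, n_items - cannabis_correct
)
```

The parser shows why a persona that rambles instead of following the template can lower measured accuracy even when the underlying choice is right: an output without the exact "Answer: X" pattern returns `None` and is counted as incorrect.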
[NLP-61] SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models
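[Quick read]: This paper addresses the limits of current secure-coding datasets, which lack grounding in real incidents, lack scale, and lack production security context; the authors note that AI assistants produce vulnerable code in 45% of security-relevant scenarios. The solution is SecureCode v2.0, a production-grade dataset of 1,215 secure-coding examples that passed structural validation and expert security review. Every example is tied to a documented security incident with CVE references and provides vulnerable and secure implementations, a concrete attack demonstration, and defense-in-depth operational guidance. The dataset covers 11 vulnerability categories (the OWASP Top 10:2025 plus AI/ML security threats) across 11 languages, uses a 4-turn conversational structure, and ships with an automated validation framework, SIEM integration strategies, infrastructure hardening recommendations (such as Docker and WAF configurations), and language-specific testing approaches.
Link: https://arxiv.org/abs/2512.18542
Authors: Scott Thornton
Affiliations: perfecxion.ai
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 37 pages, 5 figures. Dataset available at this https URL . Code and validation tools at this https URL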
Abstract:AI assistants produce vulnerable code in 45% of security-relevant scenarios, introducing flaws into production systems at scale. Yet existing secure coding datasets fall short. They lack incident grounding, don’t provide the scale modern training requires, and miss the operational security context developers need for production deployments. We present SecureCode v2.0, a production-grade dataset of 1,215 security-focused coding examples that passed structural validation and expert security review. Every example ties to actual documented security incidents with CVE references, provides vulnerable and secure implementations, demonstrates concrete attacks, and includes defense-in-depth operational guidance. The dataset covers 11 vulnerability categories (complete OWASP Top 10:2025 plus AI/ML Security Threats) across 11 languages (Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, and YAML for infrastructure-as-code). Our quality assurance framework ensures complete incident grounding. Each example includes SIEM integration strategies, infrastructure hardening recommendations (Docker, AppArmor, WAF configurations), and testing approaches using language-appropriate frameworks. The dataset uses a 4-turn conversational structure mirroring actual developer-AI interactions, escalating from basic implementations to advanced security considerations and defense-in-depth guidance. Our contributions: (1) 1,215 rigorously validated examples split into 989 training, 122 validation, and 104 test sets, (2) an automated validation framework ensuring dataset consistency, (3) a 4-turn conversational structure capturing realistic security workflows, (4) comprehensive operational security guidance with SIEM integration strategies, (5) complete language-specific implementation fidelity, and (6) open-source release of data, validation tools, and benchmarking protocols.
[NLP-62] Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset
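[Quick read]: This paper addresses the challenge that linguistically subtle political disinformation poses for automated fact-checking, focusing on the performance limits of text-only language modeling. Its main findings: despite increasingly complex architectures, text-only classification hits a hard "Performance Ceiling", with the weighted F1 for fine-grained classification never exceeding 0.32; a simple linear support vector machine (SVM) matches pre-trained Transformers such as RoBERTa (accuracy 0.624 versus 0.620), so model capacity is not the main bottleneck; tree-based ensembles show a severe "Generalization Gap", with training accuracy above 99% collapsing to about 25% on test data, which indicates lexical memorization rather than semantic inference; and synthetic augmentation with SMOTE does not help, which points to semantic feature ambiguity rather than data distribution as the root cause. The way forward is therefore to incorporate external knowledge rather than to increase model complexity.
Link: https://arxiv.org/abs/2512.18533
Authors: S Mahmudul Hasan,Shaily Roy,Akib Jawad Nafis
Affiliations: Syracuse University; Eastern University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: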
Abstract:The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard “Performance Ceiling”, with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive “Generalization Gap” in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
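Two quantities carry most of the argument here: the weighted F1-score and the gap between training and test accuracy. The sketch below computes both in plain Python. The label names and toy predictions are made up for illustration, and the accuracies 0.99 and 0.25 are the rounded figures quoted in the abstract, not exact reported values.

```python
from collections import Counter
from typing import List


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Toy three-class example; the labels are hypothetical placeholders.
y_true = ["true", "true", "false", "false", "half-true"]
y_pred = ["true", "false", "false", "false", "false"]
score = weighted_f1(y_true, y_pred)

# Generalization gap for the tree ensembles, using the abstract's rounded figures.
train_accuracy, test_accuracy = 0.99, 0.25
generalization_gap = train_accuracy - test_accuracy
```

In the toy example the never-predicted class contributes an F1 of zero, which is how a model can post a moderate accuracy on a coarse task while its weighted F1 on fine-grained labels stays low.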
[NLP-63] Teaching and Critiquing Conceptualization and Operationalization in NLP
[Quick read]: This paper addresses the lack of explicit definitions for key concepts in natural language processing (NLP) research, such as "interpretability", "bias", "reasoning", and "stereotypes", even though datasets, metrics, and claims about systems all rest on how these concepts are understood. The proposed response is a seminar in which students examine the conceptualization and operationalization of these concepts, supported by an interdisciplinary reading list and an emphasis on discussion and critique.
Link: https://arxiv.org/abs/2512.18505
Authors: Vagrant Gautam
Affiliations: Heidelberg Institute for Theoretical Studies
Categories: Computation and Language (cs.CL)
Comments:
Abstract:NLP researchers regularly invoke abstract concepts like “interpretability,” “bias,” “reasoning,” and “stereotypes,” without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: Datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what should they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.
[NLP-64] Research on a hybrid LSTM-CNN-Attention model for text-based web content classification
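[Quick read]: This paper addresses how to combine local features with long-range semantic dependencies in text classification, in order to improve the accuracy and generalization of web content classification. The key to the solution is a hybrid deep learning architecture: a convolutional neural network (CNN) extracts local n-gram patterns and lexical features, a long short-term memory network (LSTM) models long-range dependencies, and an attention mechanism focuses dynamically on the most informative parts of the input sequence. With pretrained GloVe embeddings as the input representation, the model reaches an accuracy of 0.98 and an F1-score of 0.93 under 5-fold cross-validation, surpassing baselines based solely on CNNs, LSTMs, or BERT.
Link: https://arxiv.org/abs/2512.18475
Authors: Mykola Kuz,Ihor Lazarovych,Mykola Kozlenko,Mykola Pikuliak,Andrii Kvasniuk
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 2 tables. Accepted by Radio Electronics Computer Science Control 2025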
Abstract:This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.
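[NLP-65] Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling
【Quick read】: This paper addresses the tendency of natural language inference (NLI) models to rely on spurious correlations rather than semantic reasoning, along with the limitations of existing mitigation strategies, which either incur high annotation costs or trigger catastrophic forgetting during fine-tuning. The key to the solution is an automated, scalable pipeline: first, Log-Frequency LMI (LF-LMI) is introduced to detect semantic artifacts accurately; second, a high-quality synthetic contrast set is generated through LLM synthesis combined with multi-judge verification; finally, Dynamic Balanced Sampling, a training strategy, adjusts the original data distribution while preventing forgetting. On a challenging benchmark, the method raises consistency from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.
Link: https://arxiv.org/abs/2512.18462
Authors: Christopher Román Jaimes
Affiliations: University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: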
Abstract:Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.
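The abstract does not define LF-LMI, so the following is only a minimal sketch of the standard local mutual information (LMI) score that artifact detection of this kind usually starts from, LMI(w, y) = p(w, y) · log(p(y | w) / p(y)). The function name `lmi_scores` and the input layout are assumptions made for illustration, and the log-frequency variant proposed in the paper is not reproduced here.

```python
import math
from collections import Counter


def lmi_scores(examples):
    """Compute plain LMI for every (token, label) pair.

    `examples` is an iterable of (tokens, label) pairs (hypothetical layout).
    Probabilities are estimated from token occurrence counts.
    """
    pair_counts = Counter()
    token_counts = Counter()
    label_counts = Counter()
    for tokens, label in examples:
        for tok in tokens:
            pair_counts[(tok, label)] += 1
            token_counts[tok] += 1
            label_counts[label] += 1

    total = sum(token_counts.values())
    scores = {}
    for (tok, label), n in pair_counts.items():
        p_joint = n / total                      # p(w, y)
        p_label_given_tok = n / token_counts[tok]  # p(y | w)
        p_label = label_counts[label] / total    # p(y)
        scores[(tok, label)] = p_joint * math.log(p_label_given_tok / p_label)
    return scores
```

A token with a high positive score for one label (for example a negation word and the contradiction label) is a candidate artifact; whether it is truly spurious still needs the kind of verification the paper describes.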
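[NLP-66] An Agentic AI Framework for Training General Practitioner Student Skills
【Quick read】: This paper addresses four core problems faced by current virtual simulated patients (VSPs) in medical education: insufficient medical accuracy, inconsistent role-playing, limited scenario generation, and a lack of structured educational feedback. The key to the solution is an agentic framework that separates scenario control, interaction control, and standards-based assessment while having them work together, which enables configurable, evidence-based vignette generation, controlled persona-driven dialogue (with optional retrieval grounding), and standards-based feedback on both communication and clinical reasoning. The evaluation with medical students indicates realistic dialogue, appropriate difficulty calibration, and useful feedback, offering a practical path toward dependable and pedagogically valuable VSP tools.
Link: https://arxiv.org/abs/2512.18440
Authors: Victor De Marez,Jens Van Nooten,Luna De Bruyne,Walter Daelemans
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: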
Abstract:Advancements in large language models offer strong potential for enhancing virtual simulated patients (VSPs) in medical education by providing scalable alternatives to resource-intensive traditional methods. However, current VSPs often struggle with medical accuracy, consistent roleplaying, scenario generation for VSP use, and educationally structured feedback. We introduce an agentic framework for training general practitioner student skills that unifies (i) configurable, evidence-based vignette generation, (ii) controlled persona-driven patient dialogue with optional retrieval grounding, and (iii) standards-based assessment and feedback for both communication and clinical reasoning. We instantiate the framework in an interactive spoken consultation setting and evaluate it with medical students (N=14). Participants reported realistic and vignette-faithful dialogue, appropriate difficulty calibration, a stable personality signal, and highly useful example-rich feedback, alongside excellent overall usability. These results support agentic separation of scenario control, interaction control, and standards-based assessment as a practical pattern for building dependable and pedagogically valuable VSP training tools.
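[NLP-67] AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
【Quick read】: This paper addresses the poor performance of general-purpose tokenizers on morphologically rich languages such as Arabic, which shows up as inflated token sequences and reduced compression efficiency and in turn hurts the training efficiency and downstream performance of large language models (LLMs). The key to the solution is AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a comprehensive normalization pipeline that handles Arabic-specific orthographic variation (Alif variants, diacritics, and Arabic-Indic numerals) and substantially improves tokenization efficiency. The paper also proposes the Language Extension Pipeline (LEP), which integrates the optimized tokenizer into Qwen3-0.6B through vocabulary extension, subtoken-based initialization, and selective unfreezing of transformer layers, reducing evaluation loss from 8.28 to 2.43 within only 800 training steps.
Link: https://arxiv.org/abs/2512.18399
Authors: Mark Kashirskiy,Artiom Lipinski,Ilya Makarov
Affiliations: Higher School of Economics, Moscow, Russia; AI Talent Hub, ITMO University, Saint Petersburg, Russia; Markov Lab, Saint Petersburg State University, Russia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures, 5 tables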
Abstract:Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
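To make the normalization step and the mean subtoken initialization more concrete, here is a rough sketch using only the Python standard library. The exact rules AraToken applies are not given in the abstract, so the character mappings below (all Alif variants folded to bare Alif, diacritics and tatweel stripped, Arabic-Indic digits mapped to ASCII) and the names `normalize_arabic` and `mean_subtoken_init` are assumptions for illustration.

```python
import re
import unicodedata

# Assumed mapping: Alif with madda, hamza above, hamza below, and wasla -> bare Alif.
ALIF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0671": "\u0627",
})
# Arabic-Indic digits U+0660..U+0669 -> ASCII digits (direction is an assumption).
ARABIC_INDIC_DIGITS = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})
# Harakat U+064B..U+0652, superscript Alif U+0670, and tatweel U+0640.
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def normalize_arabic(text):
    """Apply the assumed normalization rules in a fixed order."""
    text = unicodedata.normalize("NFC", text)
    text = DIACRITICS.sub("", text)
    text = text.translate(ALIF_VARIANTS)
    return text.translate(ARABIC_INDIC_DIGITS)


def mean_subtoken_init(subtoken_vectors):
    """Initialize a new token's embedding as the mean of its old subtoken embeddings."""
    n = len(subtoken_vectors)
    return [sum(column) / n for column in zip(*subtoken_vectors)]
```

In a real pipeline the vectors would come from the base model's embedding matrix, looked up for the pieces the original tokenizer produces for each new vocabulary entry.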
[NLP-68] SRS-Stories: Vocabulary-constrained multilingual story generation for language learning EMNLP2025
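【Quick read】: This paper addresses inefficient vocabulary acquisition in language learning, namely how to improve learners' command of new and previously learned words through repeated exposure in natural context without adding to the learning burden. The key to the solution is to use large language models (LLMs) to generate personalized stories that use only the vocabulary the learner already knows, with a Spaced Repetition System (SRS) optimizing when words are reviewed and when new ones are introduced, so that vocabulary learning is efficient and immersive while the text stays grammatical, coherent, and readable. Experiments show that, compared with standard constrained beam search, the generated stories provide better examples of word usage and are of higher overall quality.
Link: https://arxiv.org/abs/2512.18362
Authors: Wiktor Kamzela,Mateusz Lango,Ondrej Dusek
Affiliations: Poznan University of Technology; Charles University
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025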
Abstract:In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical, coherent, and provide better examples of word usage than texts generated by the standard constrained beam search approach.
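One simple way to picture the lexical constraint is a post-hoc check that a generated story stays within the learner's vocabulary. The sketch below is not the paper's constraint-enforcement method (the abstract only says three strategies were compared); the function name `out_of_vocabulary` is hypothetical, and the check assumes whitespace-delimited text with no lemmatization, so it would need a proper word segmenter for Chinese and a lemmatizer for Polish.

```python
import re


def out_of_vocabulary(story, known_words, target_words=()):
    """Return the sorted set of word forms in `story` that the learner does not know.

    `known_words` are already learned; `target_words` are the new words the
    story is allowed to introduce. Matching is case-insensitive and exact.
    """
    allowed = {w.lower() for w in known_words} | {w.lower() for w in target_words}
    # Runs of letters only; digits, underscores, and punctuation are ignored.
    tokens = re.findall(r"[^\W\d_]+", story.lower())
    return sorted({t for t in tokens if t not in allowed})
```

A generation loop could reject or regenerate any candidate story for which this list is non-empty.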
[NLP-69] LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators EMNLP2025
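【Quick read】: This paper addresses the reliance of conventional RDF-to-text generation methods on large amounts of annotated data, their black-box nature, and their tendency to hallucinate. The key to the solution is a novel neurosymbolic framework in which the system is "trained" through collaborative interactions among multiple large language model (LLM) agents rather than through traditional backpropagation. Working from RDF triples alone, with no in-domain human reference texts, the agents produce rule-based Python code that serves as the generator, yielding a text generation system that is fully interpretable, needs no supervised training data, and generates text nearly instantaneously on a single CPU.
Link: https://arxiv.org/abs/2512.18360
Authors: Mateusz Lango,Ondřej Dušek
Affiliations: Charles University, Faculty of Mathematics and Physics, Prague, Czechia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025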
Abstract:We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is “trained” through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models.
[NLP-70] DACE For Railway Acronym Disambiguation
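【Quick read】: This paper addresses acronym disambiguation (AD) in technical text processing, especially the challenge that high ambiguity poses for automated analysis. The key to the solution is the DACE framework, whose four core components, Dynamic Prompting, Retrieval Augmented Generation (RAG), Contextual Selection, and Ensemble Aggregation, give large language models adaptive in-context learning and inject domain knowledge, which mitigates hallucination and improves performance in low-resource scenarios.
Link: https://arxiv.org/abs/2512.18357
Authors: El Mokhtar Hribach,Oussama Mechhour,Mohammed Elmonstaser,Yassine El Boudouri,Othmane Kabal
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: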
Abstract:Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine’26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.
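[NLP-71] LLM-based Few-Shot Early Rumor Detection with Imitation Agent
【Quick read】: This paper addresses early rumor detection (EARD) in data-scarce settings, that is, how to identify rumors accurately at an early stage of a social media stream. Traditional methods struggle with temporal modeling under few-shot conditions, while large language models (LLMs), despite strong text understanding, are poorly suited to time-series data and computationally expensive, which makes them hard to apply to EARD directly. The key to the solution is a new framework that pairs an autonomous agent with an LLM-based detection model: the agent decides the earliest time point at which a reliable classification can be made, and the LLM focuses on rumor detection. Only the lightweight agent needs training and the LLM stays training-free, which makes few-shot EARD efficient and accurate; experiments on several real-world datasets show that the approach significantly outperforms existing methods.
Link: https://arxiv.org/abs/2512.18352
Authors: Fengzhu Zeng,Qian Shao,Ling Cheng,Wei Gao,Shih-Fen Cheng,Jing Ma,Cheng Niu
Affiliations: Singapore Management University; Hong Kong Baptist University; Particle Media, Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: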
Abstract:Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.
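[NLP-72] Towards Efficient Agents: A Co-Design of Inference Architecture and System
【Quick read】: This paper addresses the systemic latency that large language model (LLM)-based agents face when deployed in real-world settings. The inefficiency comes not from isolated model inference but from latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. The key to the solution is AgentInfer, a unified end-to-end acceleration framework built from four cooperating modules: AgentCollab (a hierarchical dual-model reasoning scheme with dynamic role assignment), AgentSched (a cache-aware hybrid scheduler), AgentSAM (a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory), and AgentCompress (a semantic compression mechanism that asynchronously distills and reorganizes agent memory). Together the four modules form a self-evolution engine that sustains efficiency and cognitive stability in long-horizon reasoning tasks. Experiments show that the approach cuts ineffective token consumption by more than 50% and delivers an overall 1.8x to 2.5x speedup with accuracy preserved, underscoring the importance of optimizing for task completion.
Link: https://arxiv.org/abs/2512.18337
Authors: Weizhe Lin,Hui-Ling Zhen,Shuai Yang,Xian Wang,Renxi Liu,Hanting Chen,Wangze Zhang,Chuansai Zhou,Yiming Li,Chen Chen,Xing Li,Zhiyuan Yang,Xiaosong Li,Xianzhi Yu,Zhenhua Dong,Mingxuan Yuan,Yunhe Wang
Affiliations: Huawei
Categories: Computation and Language (cs.CL)
Comments: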
Abstract:The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory to achieve low-overhead inference acceleration; and AgentCompress, a semantic compression mechanism that asynchronously distills and reorganizes agent memory without disrupting ongoing reasoning. Together, these modules form a Self-Evolution Engine capable of sustaining efficiency and cognitive stability throughout long-horizon reasoning tasks. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8x to 2.5x speedup with preserved accuracy. These results underscore that optimizing for agentic task completion, rather than merely per-token throughput, is the key to building scalable, efficient, and self-improving intelligent systems.
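The idea behind reusing past sessions for speculative decoding can be illustrated with a deliberately naive sketch: find the longest suffix of the current context that also occurs in stored token memory and propose the tokens that followed it as a draft. This uses a plain right-to-left scan rather than the suffix automaton that AgentSAM is built on, and the function name `propose_draft` and its parameters are hypothetical.

```python
def propose_draft(context, memory, max_suffix=8, draft_len=4):
    """Propose draft tokens by suffix matching against earlier sessions.

    `context` and `memory` are lists of tokens. The longest suffix of
    `context` (up to `max_suffix` tokens) found in `memory` wins, and the
    most recent occurrence is preferred. Returns up to `draft_len` tokens
    that followed the match, or an empty list if nothing matches.
    """
    for k in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-k:]
        # Start positions are limited so at least one token follows the match.
        for start in range(len(memory) - k - 1, -1, -1):
            if memory[start:start + k] == suffix:
                return memory[start + k:start + k + draft_len]
    return []
```

In speculative decoding the draft is only a guess: the target model still verifies the proposed tokens in one forward pass and keeps the longest accepted prefix, so a wrong draft costs time but not correctness. A suffix automaton gives the same kind of lookup in time linear in the query length instead of this quadratic scan.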
[NLP-73] LIR3AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation AAAI2026
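【Quick read】: This paper addresses the substantial computational overhead, including increased token consumption and inference latency, that reasoning models incur from complex reasoning strategies in retrieval-augmented generation (RAG) multi-hop question answering (QA). The key to the solution is a Lightweight Rerank Reasoning Strategy Framework for RAG (LiR³AG), which restructures retrieved evidence into coherent reasoning chains so that non-reasoning models can adopt the structured reasoning strategies of reasoning models. The framework reduces output-token overhead by 98% on average and inference time by 58.6%, while improving the F1 performance of an 8B non-reasoning model by 6.2% to 22.5% and surpassing a 32B reasoning model in RAG, offering a practical path toward efficient RAG systems.
Link: https://arxiv.org/abs/2512.18329
Authors: Guo Chen,Junjie Huang,Huaijin Xie,Fei Sun,Tao Jia
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI2026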
Abstract:Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR³AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR³AG reduces output-token overhead by 98% on average and inference time by 58.6%, while improving an 8B non-reasoning model's F1 performance by 6.2% to 22.5%, enough to surpass a 32B reasoning model in RAG, offering a practical and efficient path forward for RAG systems.
[NLP-74] CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher
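【Quick read】: This paper addresses two core challenges for text understanding under continual test-time adaptation (CTTA): error accumulation, where prediction errors propagate and grow across a sequence of unobserved test domains, and insufficient generalization to unseen domains. The key to the solution is the CTTA-T framework, whose main contributions are (1) a refine-then-filter mechanism based on dropout-driven consistency, which calibrates the teacher model's predictions and removes unreliable guidance to make predictions more dependable, and (2) the use of incremental PCA to accumulate cross-domain semantics dynamically and build a domain-aware teacher model that keeps tracking evolving target domains and accumulates adaptively. The method keeps error propagation low while improving generalization to unseen domains.
Link: https://arxiv.org/abs/2512.18321
Authors: Tianlun Liu,Zhiliang Tian,Zhen Huang,Xingzhi Zhou,Wanlong Yu,Tianle Liu,Feng Liu,Dongsheng Li
Affiliations: National University of Defense Technology; The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: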
Abstract:Text understanding often suffers from domain shifts. To handle testing domains, domain adaptation (DA) trains a model to adapt to a fixed and observed testing domain; a more challenging paradigm, test-time adaptation (TTA), cannot access the testing domain during training and adapts online to the testing samples during testing, where the samples are from a fixed domain. We aim to explore a more practical and underexplored scenario, continual test-time adaptation (CTTA) for text understanding, which involves a sequence of unobserved testing domains. Current CTTA methods struggle to reduce error accumulation over domains and to enhance generalization to unobserved domains: 1) noise filtering reduces accumulated errors but discards useful information, and 2) accumulating historical domains enhances generalization, but adaptive accumulation is hard to achieve. In this paper, we propose CTTA-T (continual test-time adaptation for text understanding), a framework adaptable to evolving target domains: it adopts a teacher-student framework, where the teacher is domain-aware and generalized for evolving domains. To improve teacher predictions, we propose a refine-then-filter mechanism based on dropout-driven consistency, which calibrates predictions and removes unreliable guidance. For the adaptation-generalization trade-off, we construct a domain-aware teacher by dynamically accumulating cross-domain semantics via incremental PCA, which continuously tracks domain shifts. Experiments show that CTTA-T outperforms the baselines.
[NLP-75] InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning
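【Quick read】: This paper addresses multi-label instruction classification, that is, how to assign wikiHow "How To" articles automatically and accurately to multiple relevant topic categories in support of task-oriented learning and knowledge base construction. The key to the solution is the use of transformer-based deep neural architectures (such as XLNet and BERT), with a multi-level evaluation strategy for a detailed analysis of model performance. XLNet performs best, with an accuracy of 97.30% and a macro F1-score of 93%, supporting its effectiveness for complex multi-label classification.
Link: https://arxiv.org/abs/2512.18301
Authors: Tanjim Taharat Aurpa,Md Shoaib Ahmed,Md Mahbubur Rahman,Md. Golam Moazzam
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: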
Abstract:People use search engines for various topics and items, from daily essentials to more aspirational and specialized objects. Therefore, search engines have taken over as people's preferred resource. The How To prefix has become familiar and widely used in various search styles to find solutions to particular problems. This search allows people to find sequential instructions by providing detailed guidelines to accomplish specific tasks. Categorizing instructional text is also essential for task-oriented learning and creating knowledge bases. This study uses the How To articles to determine the multi-label instruction category. We work with a dataset comprising 11,121 observations from wikiHow, where each record has multiple categories. To determine the multi-label category, we employ transformer-based deep neural architectures, such as Generalized Autoregressive Pretraining for Language Understanding (XLNet) and Bidirectional Encoder Representation from Transformers (BERT). In our multi-label instruction classification process, we evaluate the proposed architectures using accuracy and macro f1-score as the performance metrics. This thorough evaluation showed us much about our strategy's strengths and drawbacks. Specifically, our implementation of the XLNet architecture has demonstrated unprecedented performance, achieving an accuracy of 97.30% and micro and macro average scores of 89.02% and 93%, a noteworthy accomplishment in multi-label classification. This high level of accuracy and macro average score is a testament to the effectiveness of the XLNet architecture in our proposed InstructNet approach. By employing a multi-level strategy in our evaluation process, we have gained a more comprehensive knowledge of the effectiveness of our proposed architectures and identified areas for forthcoming improvement and refinement.
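Since the abstract reports both micro and macro averages, a small sketch of how the two differ for multi-label predictions may help when reading the numbers. This is the generic definition of micro- and macro-averaged F1, not the paper's evaluation code; the function names and the set-based input layout are assumptions.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for multi-label data.

    `y_true` and `y_pred` are parallel lists of label sets, one per example.
    Micro pools the counts over all labels; macro averages per-label F1.
    """
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for true_set, pred_set in zip(y_true, y_pred):
        for label in labels:
            if label in pred_set and label in true_set:
                counts[label][0] += 1
            elif label in pred_set:
                counts[label][1] += 1
            elif label in true_set:
                counts[label][2] += 1
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*pooled)
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro
```

Micro-averaging is dominated by frequent categories, while macro-averaging weights every category equally, so the two can diverge noticeably on an imbalanced label set.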
[NLP-76] Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition
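【Quick read】: This paper addresses two problems: speech emotion recognition (SER) systems degrade in real-world environments because of unpredictable acoustic interference, and the lack of interpretability of deep learning models limits their deployment in trust-sensitive applications. The key to the solution is a hybrid Transformer-CNN architecture whose dual-stream design unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of a one-dimensional convolutional neural network (1D-CNN). Specifically, the model processes raw waveforms to capture long-range dependencies while extracting noise-robust spectral features (such as MFCC, ZCR, and RMSE) through a custom attentive temporal pooling mechanism. To improve transparency, the study also applies SHAP and Score-CAM for fine-grained visual explanations, which reveal how the model shifts its attention between temporal and spectral cues under complex environmental noise, strengthening reliability and trust.
Link: https://arxiv.org/abs/2512.18298
Authors: Sudip Chakrabarty,Pappu Bishwas,Rajdeep Chatterjee
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: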
Abstract:Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using real-world noise profiles from the SAS-KIIT dataset. The proposed framework demonstrates superior generalization and state-of-the-art accuracy across all datasets, significantly outperforming single-branch baselines under realistic environmental interference. Furthermore, we address the “black-box” problem by integrating SHAP and Score-CAM into the evaluation pipeline. These tools provide granular visual explanations, revealing how the model strategically shifts attention between temporal and spectral cues to maintain reliability in the presence of complex environmental noise.
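[NLP-77] Measuring Fine-Grained Negotiation Tactics of Humans and LLMs in Diplomacy
【Quick read】: This paper addresses how to identify and quantify stylistic differences in negotiation strategies systematically, particularly in games driven by natural language, where prior work has focused on whether negotiation succeeds and has neglected fine-grained categorization and analysis of the negotiation behavior itself. The key to the solution is a fine-grained annotation framework for negotiation tactics built on a sociologically grounded taxonomy, with an LLM-as-a-judge setup used to annotate dialogue from human-human games of Diplomacy, which makes it possible to quantify the relationship between negotiation features and game success. A comparison of human and large language model (LLM) negotiation strategies further shows that fine-tuning can steer LLMs toward more human-like negotiation behavior.
Link: https://arxiv.org/abs/2512.18292
Authors: Wenkai Li,Lynnette Hui Xian Ng,Andy Liu,Daniel Fried
Affiliations: Carnegie Mellon University
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: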
Abstract:The study of negotiation styles dates back to Aristotle’s ethos-pathos-logos rhetoric. Prior efforts primarily studied the success of negotiation agents. Here, we shift the focus towards the styles of negotiation strategies. Our focus is the strategic dialogue board game Diplomacy, which affords rich natural language negotiation and measures of game success. We used LLM-as-a-judge to annotate a large human-human set of Diplomacy games for fine-grained negotiation tactics from a sociologically-grounded taxonomy. Using a combination of the It Takes Two and WebDiplomacy datasets, we demonstrate the reliability of our LLM-as-a-Judge framework and show strong correlations between negotiation features and success in the Diplomacy setting. Lastly, we investigate the differences between LLM and human negotiation strategies and show that fine-tuning can steer LLM agents toward more human-like negotiation behaviors.
[NLP-78] Investigating Spatial Attention Bias in Vision-Language Models
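【Quick read】: This paper addresses a systematic attention bias in how vision-language models (VLMs) process spatial information: when given horizontally concatenated images, the models tend to describe the left-hand content first. The work characterizes how widespread the bias is through controlled experiments. Under neutral prompts, both open-source and closed-source models describe the left-positioned content first in about 97% of cases. Further tests on an Arabic-finetuned model show that the bias does not disappear when the reading direction of the language changes, which rules out reading direction as the primary cause. An analysis of training data annotation guidelines finds no explicit left-to-right ordering instruction, suggesting that the bias stems from architectural factors rather than explicit guidance in the data. The finding exposes a fundamental limitation in the spatial processing of current VLMs.
Link: https://arxiv.org/abs/2512.18231
Authors: Aryan Chaudhary,Sanchit Goyal,Pratik Narang,Dhruv Kumar
Affiliations: Birla Institute of Technology and Science, Pilani, India
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: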
Abstract:Vision-Language Models have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a systematic spatial attention bias where VLMs consistently prioritize describing left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases under neutral prompting conditions. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. Investigation of training dataset annotation guidelines from PixMo and Visual Genome reveals no explicit left-first ordering instructions, suggesting the bias is consistent with architectural factors rather than explicit training data instructions. These findings reveal fundamental limitations in how current VLMs process spatial information.
[NLP-79] GeoSense-AI: Fast Location Inference from Crisis Microblogs
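【Quick read】: This paper addresses real-time geolocation from noisy microblog streams, where traditional methods that depend on sparse geotags are limited in responsiveness and coverage during emergencies. The key to the solution is an end-to-end AI pipeline that combines statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition (NER), and gazetteer-grounded disambiguation to infer locations directly from text. By optimizing low-latency NLP components and validating efficiently against geographic knowledge bases, the system attains a strong F1 while delivering orders-of-magnitude higher throughput, which suits it to rapid deployment and visualization in live crisis informatics settings.
Link: https://arxiv.org/abs/2512.18225
Authors: Deepit Sapru
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: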
Abstract:This paper presents an applied AI pipeline for real-time geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head-to-head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality (ingest, inference, and visualization), surfacing locational signals at scale for floods, outbreaks, and other fast-moving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geo-tag reliance.
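Statistical hashtag segmentation, the first stage named in the abstract, is commonly done with dynamic programming over unigram word probabilities. The sketch below shows that generic technique under stated assumptions: the function name `segment_hashtag`, the toy probability table, and the per-character penalty for unknown words are all illustrative choices, and the abstract does not say which model GeoSense-AI actually uses.

```python
import math


def segment_hashtag(tag, unigram_probs, max_word_len=20, unknown_prob=1e-9):
    """Split a hashtag into the most probable word sequence (Viterbi search).

    `unigram_probs` maps lowercase words to probabilities. Words missing
    from the table are charged log(unknown_prob) per character, so known
    words are strongly preferred but any string can still be segmented.
    """
    text = tag.lstrip("#").lower()
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in unigram_probs:
                log_prob = math.log(unigram_probs[word])
            else:
                log_prob = len(word) * math.log(unknown_prob)
            score = best[start][0] + log_prob
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

The recovered words can then be passed to the later stages (proper-noun detection and gazetteer lookup), which is where a place name hidden inside a hashtag would actually be resolved.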
[NLP-80] Stable and Efficient Single-Rollout RL for Multimodal Reasoning
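【Quick read】: This paper addresses the trade-off between efficiency and stability when training multimodal large language models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR). Existing group-based algorithms such as GRPO improve reasoning but require sampling multiple rollouts per prompt, which is computationally expensive; single-rollout variants are more efficient in text-only settings but are prone to training collapse in multimodal settings. The paper therefore proposes MSSR (Multimodal Stabilized Single-Rollout), whose core contribution is an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, achieving both training stability and multimodal reasoning performance without group information. Experiments show that MSSR outperforms the baseline at the same number of training steps and reaches comparable performance with only half the steps, making RLVR considerably more practical for complex multimodal reasoning tasks.
Link: https://arxiv.org/abs/2512.18215
Authors: Rui Liu,Dian Yu,Lei Ke,Haolin Liu,Yujun Zhou,Zhenwen Liang,Haitao Mi,Pratap Tokekar,Dong Yu
Affiliations: Tencent AI Lab; University of Maryland, College Park; University of Virginia; University of Notre Dame
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: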
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR’s performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
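[NLP-81] Training LLMs with LogicReward for Faithful and Rigorous Reasoning
[Quick Read]: This paper addresses the fact that LLM training relies largely on outcome-based feedback, which can produce correct answers reached through flawed reasoning and gives no guarantee of logical consistency in high-stakes scenarios. The core solution is LogicReward, a reward system that uses a theorem prover to enforce the logical correctness of intermediate reasoning steps, improving the faithfulness of the model's reasoning. The key innovation is Autoformalization with Soft Unification, which reduces natural-language ambiguity and improves formalization quality so that the theorem prover can assess logical soundness more effectively. The reward signal stays reliable even without ground-truth labels, and an 8B model trained on data built with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks.
Link: https://arxiv.org/abs/2512.18196
Authors: Jundong Xu, Hao Fei, Huichi Zhou, Xin Quan, Qijun Huang, Shengqiong Wu, William Yang Wang, Mong-Li Lee, Wynne Hsu
Affiliations: National University of Singapore; University College London; University of Manchester; University of Melbourne; University of California, Santa Barbara
Categories: Computation and Language (cs.CL)
Comments: Preprint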
Abstract:Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at this https URL.
[NLP-82] External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning
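[Quick Read]: This paper addresses cognitive deadlock in multi-step reasoning by small language models, where the model gets trapped in local optima or stalls on complex tasks, reducing reasoning efficiency and accuracy. The proposed External Hippocampus framework takes a cognitive-dynamics view, modeling reasoning as the flow of information energy in semantic space, and builds topological cognitive maps through dimensionality-reduction projection so that the energy flow can be navigated and intervened on at test time. The method requires no additional training, can grow autonomously, and avoids heavy computational cost. The paper also reports that reasoning stagnation appears as a "Cognitive Vortex" and as low-entropy potential wells, and that temperature perturbations restart the energy flow.
Link: https://arxiv.org/abs/2512.18190
Authors: Jian Yan
Affiliations: Zunyi Normal University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures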
Abstract: This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. Unlike traditional weight-space optimization methods, this framework constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of energy flow at test time while avoiding substantial computational requirements and demonstrating predictable intervention patterns. The method effectively addresses the cognitive deadlock problem in multi-step reasoning for small models. Experiments on models ≤7B parameters show: map-guided methods achieve 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduce reasoning time by ≥ 15x, with key findings revealing that reasoning stagnation manifests as “Cognitive Vortex” and low-entropy potential wells, while temperature perturbations effectively restart energy flow. The framework requires no additional training, possesses autonomous growth capability, and provides an efficient and controllable topological-aware solution for small model reasoning.
[NLP-83] Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown ICDAR2025
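[Quick Read]: This paper addresses the inefficiency of converting academic documents from PDF into structured markup languages such as Markdown. Existing end-to-end decoder Transformer models handle PDFs with complex layouts, formulas, and tables by decoding token by token from scratch, which wastes inference steps regenerating dense text that could be copied directly from the PDF and leads to high latency. The key solution is EditTrans, a hybrid editing-generation model that uses a lightweight classifier (fine-tuned from a Document Layout Analysis model on 162,127 pages of arXiv documents) to identify the text that needs editing before generation begins, so dense text already present in the PDF is not regenerated. In the authors' evaluations, EditTrans reduces transformation latency by up to 44.5% compared with end-to-end decoder Transformer models while maintaining transformation quality.
Link: https://arxiv.org/abs/2512.18115
Authors: Changxu Duan
Affiliations: Unknown
Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: Accepted ICDAR 2025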
Abstract:Academic documents stored in PDF format can be transformed into plain text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible to diverse usage, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely layouted text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models exhibit significant inefficiencies; their token-by-token decoding from scratch wastes a lot of inference steps in regenerating dense text that could be directly copied from PDF files. To solve this problem, we introduce EditTrans, a hybrid editing-generation model whose features allow identifying a queue of to-be-edited text from a PDF before starting to generate markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced the transformation latency up to 44.5% compared to end-to-end decoder transformer models, while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.
[NLP-84] Statistical laws and linguistics inform meaning in naturalistic and fictional conversation
[Quick Read]: This paper asks how statistical regularities such as Heaps' law can characterize the way conversations unfold over time, and in particular how language features such as part-of-speech categories affect the growth of vocabulary. Using real conversational data (strangers talking over video chat, and dialogue between fictional characters in movies), the authors measure the power-law relationship between vocabulary size and conversation length and find that the growth rate differs by part of speech, linking language structure to conversational dynamics.
Link: https://arxiv.org/abs/2512.18072
Authors: Ashley M. A. Fehr, Calla G. Beauregard, Julia Witte Zimmerman, Katie Ekström, Pablo Rosillo-Rodes, Christopher M. Danforth, Peter Sheridan Dodds
Affiliations: University of Vermont; Vermont Advanced Computing Center; Computational Story Lab; Vermont Complex Systems Institute; MassMutual Center of Excellence for Complex Systems and Data Science; Computational Ethics Lab; Glass Brain Lab, Department of Biomedical Engineering; Vermont Conversation Lab, Larner College of Medicine; Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC); Department of Mathematics & Statistics; Santa Fe Institute
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract: Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps’ law, which holds that vocabulary size scales with document length. Little work on Heaps’ law has looked at conversation and considered how language features impact scaling. We measure Heaps’ law for conversations recorded in two distinct mediums: (1) strangers brought together on video chat and (2) fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.
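As a rough illustration of what measuring Heaps' law involves, the sketch below fits the usual power-law form V(n) = K * n^beta by least squares in log-log space. It is not the authors' pipeline: the whitespace tokenizer, the toy sentence, and the function names `vocabulary_growth` and `fit_heaps` are all assumptions made for this example.

```python
import math

def vocabulary_growth(tokens):
    """Return [(n, V(n))]: number of distinct tokens seen after the first n tokens."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth

def fit_heaps(growth):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in growth]
    ys = [math.log(v) for _, v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

# Toy transcript (an assumption); a real study would use full conversations.
tokens = "the cat sat on the mat and the dog sat on the log".split()
K, beta = fit_heaps(vocabulary_growth(tokens))
print(f"K={K:.2f}, beta={beta:.2f}")
```

Running the same fit on the tokens of one part of speech at a time would give one exponent per category, which is the kind of comparison the abstract describes. Part-of-speech tagging is left out here because the abstract does not say which tagger was used.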
[NLP-85] Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts
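[Quick Read]: This paper addresses the consolidation of overlapping narrative documents such as legal testimonies or historical accounts. Standard Multi-Document Summarization (MDS) focuses on compression and fails to preserve narrative flow. The authors therefore define a new NLP task, Narrative Consolidation, whose central objectives are chronological integrity, completeness, and the fusion of complementary details. The key to the solution is the Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. Applying a standard centrality algorithm to TAEG selects the most central version of each event and keeps it in its correct temporal position. On the four Biblical Gospels, the method achieves perfect temporal ordering by design (Kendall's Tau = 1.000) and a large gain in content quality (+357.2% in ROUGE-L F1), supporting the importance of an explicit temporal backbone for this task.
Link: https://arxiv.org/abs/2512.18041
Authors: Roger A. Finger, Eduardo G. Cortes, Sandro J. Rigo, Gabriel de O. Ramos
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: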
Abstract:Processing overlapping narrative documents, such as legal testimonies or historical accounts, often aims not for compression but for a unified, coherent, and chronologically sound text. Standard Multi-Document Summarization (MDS), with its focus on conciseness, fails to preserve narrative flow. This paper formally defines this challenge as a new NLP task: Narrative Consolidation, where the central objectives are chronological integrity, completeness, and the fusion of complementary details. To demonstrate the critical role of temporal structure in this task, we introduce Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. By applying a standard centrality algorithm to TAEG, our method functions as a version selection mechanism, choosing the most central representation of each event in its correct temporal position. In a study on the four Biblical Gospels, this structure-focused approach guarantees perfect temporal ordering (Kendall’s Tau of 1.000) by design and dramatically improves content metrics (e.g., +357.2% in ROUGE-L F1). The success of this baseline method validates the formulation of Narrative Consolidation as a relevant task and establishes that an explicit temporal backbone is a fundamental component for its resolution.
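The temporal-ordering metric quoted above can be made concrete with a short sketch. It computes Kendall's tau (the tau-a variant, which assumes no ties) between a reference event order and a predicted one. The event identifiers and the function name `kendall_tau` are hypothetical, and the code does not implement TAEG itself.

```python
from itertools import combinations

def kendall_tau(reference, predicted):
    """Kendall's tau-a between two orderings of the same distinct items."""
    position = {item: i for i, item in enumerate(predicted)}
    concordant = discordant = 0
    for a, b in combinations(reference, 2):
        # a precedes b in the reference; check whether the prediction agrees.
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

reference = ["e1", "e2", "e3", "e4"]          # hypothetical event ids in true order
print(kendall_tau(reference, ["e1", "e2", "e3", "e4"]))  # identical order
print(kendall_tau(reference, ["e2", "e1", "e3", "e4"]))  # one adjacent swap
```

A method that emits events in the order fixed by its temporal graph agrees with the reference on every pair, which is why the abstract can describe a tau of 1.000 as guaranteed by design.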
[NLP-86] CoPE: A Small Language Model for Steerable and Scalable Content Labeling
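[Quick Read]: This paper addresses limitations of current content classification models in policy steerability and efficiency: models tend to memorize policies rather than interpret them, so they generalize poorly and adapt slowly to changing governance needs. The solution rests on two methods. Contradictory Example Training builds training examples with conflicting semantics so the model learns the underlying logic of a policy instead of memorizing labels. Binocular Labeling is a method for generating content policies that enables rapid construction of unambiguous training datasets. Together they allow a small language model (released in a 9-billion-parameter version) to match or exceed frontier models in accuracy across seven harm areas at about 1% of their size while running on a single consumer-grade GPU, turning an ML task into a policy-writing task.
Link: https://arxiv.org/abs/2512.18027
Authors: Samidh Chakrabarti, David Willner, Kevin Klyman, Tiffany Saade, Emily Capstick, Sabina Nong
Affiliations: Zentropi; Stanford University; Harvard University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 21 pages, 2 figures, 7 tables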
Abstract: This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curriculum called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.
[NLP-87] ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India AAAI2026
[Quick Read]: This paper explores how reinforcement learning (RL) can be applied to high-stakes, long-document legal tasks in the Indian context. The main challenges are the complexity of legal language, reward model alignment, and domain-specific adaptation. The proposed ReGal framework combines Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF), trained with Proximal Policy Optimization (PPO), and is evaluated on Court Judgment Prediction and Explanation (CJPE) and Legal Document Summarization. Although the framework underperforms supervised and proprietary models on standard metrics, it offers insight into applying RL to legal text and a foundation for interpretable and adaptive legal AI systems.
Link: https://arxiv.org/abs/2512.18014
Authors: Shubham Kumar Nigam, Tanuj Tyagi, Siddharth Shukla, Aditya Kumar Guru, Balaramamahanthi Deepak Patnaik, Danush Khanna, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Affiliations: 1. Indian Institute of Technology, Delhi; 2. IIT Delhi; 3. Google; 4. Indian Institute of Science; 5. University of California, Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in AILaw @ AAAI 2026 conference
Abstract:This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.
[NLP-88] Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models AAAI2026
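[Quick Read]: This paper addresses the combined challenge of Handwritten Text Recognition (HTR) and Machine Translation (MT) for low-resource languages such as Marathi, in particular the accuracy and scalability limits of the conventional two-stage OCR-MT pipeline when digitizing legal documents. The key to the approach is a comparison between the traditional stepwise pipeline and end-to-end translation with Vision Large Language Models, which unify recognition and translation and produce target-language text directly from the handwritten image in a single model. The goal is efficient, edge-deployable processing of the many handwritten legal records in India's district and high courts.
Link: https://arxiv.org/abs/2512.18004
Authors: Shubham Kumar Nigam, Parjanya Aditya Shukla, Noel Shallum, Arnab Bhattacharya
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in AILaw @ AAAI 2026 Conference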
Abstract:Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India’s district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.
[NLP-89] Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression
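[Quick Read]: This paper addresses the degradation of Large Language Models (LLMs) under prompt compression, and in particular the poorly understood relationship between constraint compliance (CC) and semantic accuracy (SA). The key solution is the Compression-Decay Comprehension Test (CDCT), a benchmark that measures CC and SA independently at different compression levels. The analysis reveals a near-universal U-shaped pattern in which constraint violations peak at medium compression (c=0.5), while extreme compression (c=0.0) performs better. An RLHF ablation further indicates that "helpfulness" signals from Reinforcement Learning from Human Feedback (RLHF) are the dominant cause of violations at medium compression: removing them improves constraint compliance by 598% on average, with 79% reaching perfect compliance. The results point to a tension between RLHF alignment and instruction following and give actionable guidance for deployed systems.
Link: https://arxiv.org/abs/2512.17920
Authors: Rahul Baxi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures; currently under peer review at TMLR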
Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss’ κ=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing “helpfulness” signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen’s d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
ACM classes: I.2.7. Submitted: [v1] Tue, 2 Dec 2025 13:25:48 UTC
[NLP-90] KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction
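[Quick Read]: This paper addresses the memory demand of the Key-Value (KV) cache, which becomes a bottleneck for deployment and batch processing as the context length of Large Language Models (LLMs) grows. Traditional KV cache compression permanently evicts or irreversibly merges "less important" tokens with low attention scores, causing an unrecoverable loss of information that the authors call Contextual Amnesia and that severely weakens the model's information retrieval ability. The authors propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. Its key idea is an additional data structure from which compressed tokens can be reconstructed, enabling full-scale computation within limited memory. In the experiments, 10% of the KV cache budget is enough to keep end-to-end inference accuracy unchanged at 2k context length, and 25% of the budget yields about 2% accuracy loss at 32k.
Link: https://arxiv.org/abs/2512.17917
Authors: Aomufei Yuan, Zhiming Wang, Ruijie Miao, Dayu Wang, Yuxuan Tian, Zihan Wang, Yebo Peng, Yuhan Wu, Bairen Yi, Xin Liu, Tong Yang
Affiliations: ByteDance Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures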
Abstract: As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging “less important” tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model’s information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of KV Cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy (~2% accuracy loss) using merely 25% of KV Cache budget.
[NLP-91] Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models
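[Quick Read]: This paper addresses the prioritization of service tickets in IT Service Management (ITSM), which is difficult because of noisy text inputs, subjective writing styles, and pronounced class imbalance. The key to the solution is a fine-tuned multilingual Transformer that processes both textual and numerical features. Compared with embedding-based pipelines (dimensionality reduction, clustering, and classical classifiers), it performs substantially better, with an average F1-score of 78.5% and weighted Cohen's kappa of nearly 0.80, indicating strong agreement with the true labels. The results expose the limits of generic embeddings on ITSM data and support domain-adapted Transformer architectures for ticket prioritization.
Link: https://arxiv.org/abs/2512.17916
Authors: Minh Tri LÊ, Ali Ait-Bachir
Affiliations: Global AI Lab; EasyVista
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages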
Abstract:Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen’s kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.
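For readers unfamiliar with the weighted Cohen's kappa reported above, here is a minimal pure-Python sketch of the statistic for ordinal labels such as ticket priorities. The abstract does not state whether linear or quadratic weights were used, so the `weights` argument, the toy labels, and the function name `weighted_kappa` are assumptions made for illustration.

```python
def weighted_kappa(y_true, y_pred, n_classes, weights="linear"):
    """Weighted Cohen's kappa for ordinal labels encoded as 0..n_classes-1."""
    n = len(y_true)
    # Observed joint distribution of (true label, predicted label).
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    power = 1 if weights == "linear" else 2
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = abs(i - j) ** power          # disagreement penalty grows with distance
            num += w * observed[i][j]        # observed weighted disagreement
            den += w * row[i] * col[j]       # disagreement expected by chance
    if den == 0:
        raise ValueError("kappa is undefined when all labels fall in one class")
    return 1.0 - num / den

# Toy priorities (0 = low ... 3 = critical); the data is made up.
y_true = [0, 1, 2, 3, 3]
y_pred = [0, 1, 2, 3, 2]
print(round(weighted_kappa(y_true, y_pred, n_classes=4), 3))
```

Unlike plain accuracy, the weighted form penalizes predicting "low" for a "critical" ticket more heavily than predicting "high", which suits an ordinal priority scale.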
[NLP-92] Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset
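[Quick Read]: This paper addresses the lack of a public automatic speech recognition (ASR) benchmark that spans many acoustic and language domains with clearly defined partitions and could replace established datasets such as LibriSpeech or TED-Lium. Building on the recently published Loquacious dataset, the authors release supplementary resources, namely n-gram language models (LMs), a grapheme-to-phoneme (G2P) model, and pronunciation lexica, all openly and publicly accessible. With these resources they run experiments across a wide range of ASR architectures and show that the Loquacious dataset is a useful study case for common ASR challenges.
Link: https://arxiv.org/abs/2512.17915
Authors: Nick Rossenbach, Robin Schmitt, Tina Raissi, Simon Berger, Larissa Kleppel, Ralf Schlüter
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: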
Abstract:The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.
[NLP-93] Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression
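[Quick Read]: This paper addresses the excessive bandwidth and compute consumed in multi-agent Large Language Model (LLM) systems when agents redundantly transmit contextual information to each other. Traditional approaches send only raw text and discard internal semantic representations, forcing the receiving agent to recompute similar representations from scratch. The proposed Q-KVComm protocol instead transmits compressed key-value (KV) cache representations directly. Its core innovations are (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration that enables communication across architectures. In the experiments, the protocol achieves 5-6x compression while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios.
Link: https://arxiv.org/abs/2512.17914
Authors: Boris Kriuk, Logic Ng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 7 pages, 4 figures, 1 table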
Abstract:Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.
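The idea of layer-wise quantization with variable bit-widths can be sketched in a few lines. The code below is only an illustration under stated assumptions: it uses plain uniform min-max quantization, a made-up threshold rule in `allocate_bits`, and toy lists of floats in place of real KV tensors. None of it is taken from the Q-KVComm implementation.

```python
def allocate_bits(sensitivities, low_bits=4, high_bits=8, threshold=0.5):
    """Hypothetical rule: layers with a high sensitivity score keep more bits."""
    return [high_bits if s >= threshold else low_bits for s in sensitivities]

def quantize(values, bits):
    """Uniform min-max quantization of a flat list of floats to integer codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]

# Toy "KV cache": one short list per layer, with made-up sensitivity scores.
layers = [[0.12, -0.40, 0.33, 0.90], [1.5, -2.0, 0.25, 0.0]]
sensitivities = [0.8, 0.2]
for values, bits in zip(layers, allocate_bits(sensitivities)):
    codes, lo, scale = quantize(values, bits)
    restored = dequantize(codes, lo, scale)
    worst = max(abs(a - b) for a, b in zip(values, restored))
    print(f"{bits}-bit codes={codes} max_error={worst:.4f}")
```

The trade-off this exposes is the one the abstract relies on: fewer bits per value shrink the payload sent between agents, while the reconstruction error per value is bounded by half a quantization step.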
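[NLP-94] Graph-O1: Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning
[Quick Read]: This paper addresses question answering over text-attributed graphs, where Large Language Models (LLMs) struggle to combine unstructured text with the structured relational signals in the graph. Existing methods have two main limitations: text-only retrieval-augmented generation (RAG) treats passages as isolated units and ignores graph topology, while graph-based RAG serializes subgraphs into sequences that exceed the LLM context window, leading to fragmented reasoning and lower accuracy. The proposed Graph-O1 is an agentic GraphRAG framework whose core innovation is combining Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, so the LLM explores and retrieves only the most informative subgraph components through stepwise interaction, giving more accurate, reliable, and interpretable reasoning.
Link: https://arxiv.org/abs/2512.17912
Authors: Lihui Liu
Affiliations: Wayne State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: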
Abstract: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.
[NLP-95] Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models
[Quick Read]: This paper addresses two core problems in unlearning for Reasoning Multimodal Large Language Models (RMLLMs): intermediate chain-of-thought steps can leak sensitive information even after the final answer has been forgotten, and overly aggressive unlearning badly damages general reasoning ability. The authors propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free, inference-time method that uses subspace guidance and adaptive steering to make internal representations forget both the answers and the reasoning traces of the target data while explicitly preserving general reasoning, achieving a better balance between forgetting and reasoning retention.
Link: https://arxiv.org/abs/2512.17911
Authors: Hongji Li, Junchi Yao, Manjiang Yu, Priyanka Singh, Xue Li, Di Wang, Lijie Hu
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); University of Queensland; Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology (KAUST)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.
[NLP-96] Merge on workspaces as Hopf algebra Markov chain
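[Quick Read]: This paper studies the dynamics of syntactic structure formation and transformation in generative linguistics, specifically how the Merge operation in Chomsky's Minimalism drives the evolution of tree structures. The central question is whether, without external constraints, the combination of Internal Merge, External Merge, and Sideward Merge converges to fully formed connected trees. The key to the analysis is a weighted Markov chain in which the different forms of Merge are weighted by a cost function, with the asymptotic behavior studied through an associated Perron-Frobenius problem in the tropical semiring. The usual cost functions from the linguistic literature (Minimal Search and Resource Restrictions) turn out to be insufficient for convergence to tree structures, while an additional optimization property based on the Shannon entropy achieves the expected convergence. The paper also comments on continuous parameters related to semantic embedding and on filtering of the dynamics by coloring rules that model theta roles and phase structure.
Link: https://arxiv.org/abs/2512.18861
Authors: Matilde Marcolli, David Skigin
Affiliations: Unknown
Categories: Dynamical Systems (math.DS); Computation and Language (cs.CL); Quantum Algebra (math.QA); Rings and Algebras (math.RA)
Comments: 80 pages, LaTeX, 1 png figure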
Abstract:We study the dynamical properties of a Hopf algebra Markov chain with state space the binary rooted forests with labelled leaves. This Markovian dynamical system describes the core computational process of structure formation and transformation in syntax via the Merge operation, according to Chomsky’s Minimalism model of generative linguistics. The dynamics decomposes into an ergodic dynamical system with uniform stationary distribution, given by the action of Internal Merge, while the contributions of External Merge and (a minimal form of) Sideward Merge reduce to a simpler Markov chain with state space the set of partitions and with combinatorial weights. The Sideward Merge part of the dynamics prevents convergence to fully formed connected structures (trees), unless the different forms of Merge are weighted by a cost function, as predicted by linguistic theory. Results on the asymptotic behavior of the Perron-Frobenius eigenvalue and eigenvector in this weighted case, obtained in terms of an associated Perron-Frobenius problem in the tropical semiring, show that the usual cost functions (Minimal Search and Resource Restrictions) proposed in the linguistic literature do not suffice to obtain convergence to the tree structures, while an additional optimization property based on the Shannon entropy achieves the expected result for the dynamics. We also comment on the introduction of continuous parameters related to semantic embedding and other computational models, and also on some filtering of the dynamics by coloring rules that model the linguistic filtering by theta roles and phase structure, and on parametric variation and the process of parameter setting in Externalization.
[NLP-97] TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition
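[Quick Read]: This paper addresses the challenges of children's speech recognition: substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. The authors extend a retrieval-based Speech In-Context Learning (SICL) method, Text-Embedding KNN for SICL (TICL), into TICL+. The key addition is an acoustic reranking step that, after retrieving semantically similar examples, prioritizes those that are also acoustically aligned with the test input. On four children's speech corpora, TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL.
Link: https://arxiv.org/abs/2512.18263
Authors: Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published at IEEE ASRU 2025 Satellite Workshop-AI for Children's Speech and Language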
Abstract:Children’s speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children’s speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children’s speech.
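The two-stage selection described in the abstract (text-embedding KNN followed by acoustic reranking) can be sketched as below. This is a minimal illustration, not the authors' implementation: the cosine scoring, the `select_examples` helper, the `k_text`/`k_final` parameters, and the toy 2-dimensional embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query_text, query_audio, pool, k_text=3, k_final=2):
    """Pick in-context examples: semantic KNN first, then acoustic reranking."""
    # Stage 1 (TICL-style): keep the k_text nearest neighbours in text-embedding space.
    by_text = sorted(pool, key=lambda ex: cosine(query_text, ex["text_emb"]), reverse=True)
    candidates = by_text[:k_text]
    # Stage 2 (the "+" step): rerank the survivors by acoustic similarity.
    by_audio = sorted(candidates, key=lambda ex: cosine(query_audio, ex["audio_emb"]), reverse=True)
    return [ex["id"] for ex in by_audio[:k_final]]


# Hypothetical example pool with toy embeddings.
pool = [
    {"id": "a", "text_emb": [1.0, 0.0], "audio_emb": [0.0, 1.0]},
    {"id": "b", "text_emb": [0.9, 0.1], "audio_emb": [1.0, 0.0]},
    {"id": "c", "text_emb": [0.8, 0.2], "audio_emb": [0.7, 0.7]},
    {"id": "d", "text_emb": [0.0, 1.0], "audio_emb": [1.0, 0.0]},
]
selected = select_examples([1.0, 0.0], [1.0, 0.0], pool)
```

Example "d" is acoustically a perfect match but is filtered out in the first stage because it is semantically unrelated, while "a" survives the first stage and is then demoted by the acoustic rerank.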
[NLP-98] Distributed Asymmetric Allocation: A Topic Model for Large Imbalanced Corpora in Social Sciences
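[Quick Read]: This paper addresses three problems that social scientists face when applying latent Dirichlet allocation (LDA): (1) LDA takes a long time to fit on large corpora; (2) unsupervised LDA fragments topics into sub-topics in short documents; and (3) semi-supervised LDA fails to identify specific topics defined by seed words. The author proposes a new topic model, distributed asymmetric allocation (DAA), which integrates multiple algorithms to efficiently identify sentences about important topics in large corpora. Fitted to transcripts of United Nations General Assembly speeches from 1991 to 2017, DAA classifies sentences significantly more accurately and quickly than LDA. The results also indicate that optimizing the Dirichlet priors of LDA is important for accurate content analysis.
Link: https://arxiv.org/abs/2512.18119
Authors: Kohei Watanabe
Affiliations: Unknown
Categories: Methodology (stat.ME); Computation and Language (cs.CL)
Comments: 34 pages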
Abstract:Social scientists employ latent Dirichlet allocation (LDA) to find highly specific topics in large corpora, but they often struggle in this task because (1) LDA, in general, takes a significant amount of time to fit on large corpora; (2) unsupervised LDA fragments topics into sub-topics in short documents; (3) semi-supervised LDA fails to identify specific topics defined using seed words. To solve these problems, I have developed a new topic model called distributed asymmetric allocation (DAA) that integrates multiple algorithms for efficiently identifying sentences about important topics in large corpora. I evaluate the ability of DAA to identify politically important topics by fitting it to the transcripts of speeches at the United Nations General Assembly between 1991 and 2017. The results show that DAA can classify sentences significantly more accurately and quickly than LDA thanks to the new algorithms. More generally, the results demonstrate that it is important for social scientists to optimize Dirichlet priors of LDA to perform content analysis accurately.
[NLP-99] A Critical Review of Monte Carlo Algorithms Balancing Performance and Probabilistic Accuracy with AI Augmented Framework
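[Quick Read]: This paper examines the trade-off between statistical efficiency and computational cost that Monte Carlo algorithms face in practice. It systematically analyzes the major algorithm classes, from Metropolis-Hastings to Hamiltonian Monte Carlo (HMC), in terms of time complexity, space complexity, and asymptotically tight bounds, and clarifies the theoretical strengths and practical use cases of each. It also proposes a justification framework for identifying situations in which one algorithm is demonstrably superior to another for the same problem, and highlights the role of improvements such as gradient information and adaptive tuning.
Link: https://arxiv.org/abs/2512.17968
Authors: Ravi Prasad
Affiliations: Unknown
Categories: Computation (stat.CO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: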
Abstract:Monte Carlo algorithms are a foundational pillar of modern computational science, yet their effective application hinges on a deep understanding of their performance trade-offs. This paper presents a critical analysis of the evolution of Monte Carlo algorithms, focusing on the persistent tension between statistical efficiency and computational cost. We describe the historical development from the foundational Metropolis-Hastings algorithm to contemporary methods like Hamiltonian Monte Carlo. A central emphasis of this survey is the rigorous discussion of time and space complexity, including upper, lower, and asymptotically tight bounds for each major algorithm class. We examine the specific motivations for developing these methods and the key theoretical and practical observations, such as the introduction of gradient information and adaptive tuning in HMC, that led to successively better solutions. Furthermore, we provide a justification framework that discusses explicit situations in which using one algorithm is demonstrably superior to another for the same problem. The paper concludes by assessing the profound significance and impact of these algorithms and detailing major current research challenges.
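For orientation, the Metropolis-Hastings baseline that the survey starts from can be sketched in a few lines. This is a generic random-walk variant with a symmetric Gaussian proposal, not code from the paper; the target (a standard normal), the step size, and the function names are assumptions chosen for illustration.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_steps, step_size, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)  # one target evaluation per step
        log_ratio = log_p_new - log_p
        # Symmetric proposal, so the acceptance probability is min(1, p(x') / p(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x, log_p = proposal, log_p_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


def log_std_normal(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


samples, acceptance_rate = metropolis_hastings(log_std_normal, 0.0, 20000, 1.0)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Each step costs one target evaluation and the chain stores every sample, so time and memory both grow linearly in `n_steps`. HMC, by contrast, additionally needs gradients of the log target, which is the kind of cost-versus-efficiency trade-off the survey discusses.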
Computer Vision
[CV-0] The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
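[Quick Read]: This paper addresses the difficulty of unifying semantic encoders and pixel encoders in a single feature space, that is, preserving both semantic abstraction and pixel-level fidelity in one latent space. Its key contribution is the Prism Hypothesis: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. Building on this insight, the authors propose Unified Autoencoding (UAE), which harmonizes semantic structure and pixel detail through a frequency-band modulator and reports state-of-the-art performance on the ImageNet and MS-COCO benchmarks.
Link: https://arxiv.org/abs/2512.19693
Authors: Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu
Affiliations: S-Lab, Nanyang Technological University; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code link: this https URL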
Abstract:Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder’s feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.
[CV-1] Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
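[Quick Read]: This paper addresses two challenges in generating realistic human-human interactions: existing methods usually ignore hand motions, which limits realism and expressivity, and diffusion-based methods generate entire motion sequences at once, which makes it hard to capture the reactive and adaptive nature of human interaction. The proposed Interact2Ar is described as the first end-to-end text-conditioned autoregressive diffusion model for full-body human-human interaction generation. It models detailed hand kinematics through dedicated parallel branches, and it couples an autoregressive pipeline with a novel memory technique that uses efficient large context windows to adapt to the variability of interactions. This supports downstream applications such as temporal motion composition, real-time adaptation to disturbances, and extension from dyadic to multi-person scenarios.
Link: https://arxiv.org/abs/2512.19692
Authors: Pablo Ruiz-Ponce, Sergio Escalera, José García-Rodríguez, Jiankang Deng, Rolandos Alexandros Potamias
Affiliations: Huawei Noah's Ark Lab; Universidad de Alicante; Universitat de Barcelona and Computer Vision Center; Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL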
Abstract:Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.
[CV-2] Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
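[Quick Read]: This paper targets limitations in audio and video understanding within multimodal representation learning, in particular the lack of a unified cross-modal embedding framework supporting joint audio-video, audio-text, and video-text representations. It proposes PE-AV (Perception Encoder Audiovisual), a new family of encoders trained with scaled contrastive learning. A data engine synthesizes high-quality captions for on the order of 100M audio-video pairs, providing supervision that is consistent across modalities. Ten pairwise contrastive objectives strengthen cross-modal alignment and improve zero-shot performance. Fine-tuning PE-AV with frame-level contrastive objectives yields PE-A-Frame, which enables fine-grained audio-frame-to-text alignment for tasks such as sound event detection. The unified embeddings enable new tasks such as speech retrieval and set a new state of the art on standard audio and video benchmarks.
Link: https://arxiv.org/abs/2512.19687
Authors: Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu
Affiliations: Unknown
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: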
Abstract:We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV’s unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
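The abstract does not give the exact form of the pairwise contrastive objectives. A common choice for one modality pair is the symmetric InfoNCE loss used in CLIP-style training, sketched below in plain Python; the loss form, the temperature value, and the toy embeddings are assumptions, and a multi-pair setup would apply such a term to each pair of modalities or caption types.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """Symmetric InfoNCE: item i in modality A should match item i in modality B."""
    n = len(emb_a)
    logits = [[sum(x * y for x, y in zip(a, b)) / temperature for b in emb_b]
              for a in emb_a]
    # A -> B direction: each row is a classification over the B items.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # B -> A direction: same computation on the transposed logits.
    logits_t = [list(col) for col in zip(*logits)]
    loss_ba = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)


# Hypothetical unit-norm embeddings for two paired items.
audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[1.0, 0.0], [0.0, 1.0]]
video_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(audio, video_aligned)
loss_swapped = symmetric_contrastive_loss(audio, video_swapped)
```

The aligned batch yields a loss near zero, while the swapped batch is penalized heavily, which is the behaviour that pulls matched pairs together during training.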
[CV-3] Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
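[Quick Read]: This paper addresses insufficient visual context consistency in multimodal generation: the reasoning process of current unified models focuses mainly on consistency with the text prompt and neglects consistency with the visual reference images, so key visual features (such as human ID, object attributes, and style) are not maintained. The solution integrates visual context consistency into the model's reasoning through two mechanisms: (1) Adaptive Visual Planning, which generates a structured visual checklist of the elements that must stay consistent; and (2) Iterative Visual Correction, which performs self-reflection guided by the checklist and refines the result iteratively. The model is taught visual planning, self-reflection, and self-refinement via supervised fine-tuning, and flow-GRPO with a customized visual checking reward further improves visual consistency. Experiments show the method outperforms both zero-shot unified models and those using text-only Chain-of-Thought (CoT).
Link: https://arxiv.org/abs/2512.19686
Authors: Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhan Luo
Affiliations: The Hong Kong University of Science and Technology; Kling Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL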
Abstract:Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, it is observed that the current thinking process during generation mainly focuses on the text consistency with the text prompt, ignoring the visual context consistency with the visual reference images during the multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in the failure in maintaining key visual features (like human ID, object attribute, style). To this end, we integrate the visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating structured visual check list to figure out the visual element of needed consistency keeping, and 2) Iterative Visual Correction: performing self-reflection with the guidance of check lists and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance the visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.
[CV-4] Zero-shot Reconstruction of In-Scene Object Manipulation from Video
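[Quick Read]: This paper addresses the reconstruction of in-scene object manipulation from monocular RGB video. The core challenges are ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, which hinders metric accuracy and practical use. The proposed method first uses data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. It then applies a two-stage optimization that recovers a complete hand-object motion from grasping to interaction while remaining consistent with the scene information observed in the input video.
Link: https://arxiv.org/abs/2512.19684
Authors: Dixuan Lin, Tianyou Wang, Zhuoyang Pan, Yufu Wang, Lingjie Liu, Kostas Daniilidis
Affiliations: University of Pennsylvania; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: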
Abstract:We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.
[CV-5] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
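[Quick Read]: This paper addresses the underdeveloped spatial intelligence of Multimodal Large Language Models (MLLMs), especially their limited ability to understand and reason about physical space in open-world scenes. Existing benchmarks either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, and they lack verifiable metric ground truth. The authors build a large-scale benchmark from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. The resulting metrically precise 3D information enables automatic generation of spatial reasoning questions across a hierarchy from qualitative relational reasoning to quantitative metric and kinematic understanding. This makes it possible to diagnose whether MLLMs perform grounded visual-spatial reasoning or rely mainly on linguistic priors.
Link: https://arxiv.org/abs/2512.19683
Authors: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang
Affiliations: University of Chinese Academy of Sciences; ETH Zürich; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL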
Abstract:While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence–crucial for robust and grounded AI systems–remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum–from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
[CV-6] VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
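[Quick Read]: This paper addresses the degradation in image quality caused by the mismatch between the training objectives of the generator and the tokenizer in autoregressive (AR) visual generation. Tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood and receive no direct pixel-space supervision, so generated token sequences may decode into low-quality images. The proposed VA-π framework formulates generator-tokenizer alignment as a variational optimization and derives an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. It introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality under teacher forcing as an intrinsic reward. This allows rapid fine-tuning without tokenizer retraining or external reward models, while the regularization term of the ELBO maintains the distributional consistency of tokens.
Link: https://arxiv.org/abs/2512.19680
Authors: Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao
Affiliations: Huazhong University of Science & Technology; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 24 figures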
Abstract:Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-π, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-π formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-π introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-π enables rapid adaptation of existing AR generators, with neither tokenizer retraining nor external reward models. With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both visual generation model (LlamaGen: from 0.306 to 0.339) and unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at this https URL.
[CV-7] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
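[Quick Read]: This paper addresses a core tension in generating long-range, geometrically consistent video: consistency demands strict adherence to 3D geometry in pixel space, while state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This mismatch causes current methods to struggle with occluded areas and complex camera trajectories. The proposed WorldWarp framework couples a 3D structural anchor with a 2D generative refiner. It maintains an online 3D geometric cache built via Gaussian Splatting (3DGS) and explicitly warps historical content into novel views, which provides a structural scaffold so that each new frame respects prior geometry. A Spatio-Temporal Diffusion (ST-Diff) model then applies a spatio-temporally varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. The 3D cache is updated dynamically at every step to keep video chunks consistent, so that 3D logic guides structure while diffusion logic refines texture.
Link: https://arxiv.org/abs/2512.19678
Authors: Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang
Affiliations: National University of Singapore; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL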
Abstract:Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a “fill-and-revise” objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: this https URL.
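The "full noise for holes, partial noise for warped content" idea can be illustrated for a single frame as follows. This is a toy sketch and not the ST-Diff implementation: the variance-preserving mixing rule, the `partial_level` value, and the 2x2 frame are assumptions, and the temporal dimension of the schedule is omitted.

```python
import math
import random


def noise_level_map(valid_mask, partial_level=0.4):
    """Per-pixel noise level: 1.0 where warping left a hole, partial elsewhere."""
    return [[partial_level if valid else 1.0 for valid in row] for row in valid_mask]


def apply_noise(frame, levels, seed=0):
    """Variance-preserving mix x_t = sqrt(1 - s^2) * x + s * eps, with s varying per pixel."""
    rng = random.Random(seed)
    return [[math.sqrt(1.0 - s * s) * x + s * rng.gauss(0.0, 1.0)
             for x, s in zip(frame_row, level_row)]
            for frame_row, level_row in zip(frame, levels)]


# Hypothetical 2x2 warped frame; the top-right pixel is an occlusion hole.
valid_mask = [[1, 0],
              [1, 1]]
levels = noise_level_map(valid_mask)
noisy_a = apply_noise([[0.2, 0.0], [0.5, 0.9]], levels)
noisy_b = apply_noise([[0.8, 7.0], [0.1, 0.3]], levels)
```

With the same random seed, the hole pixel comes out identical for both inputs because it is pure noise and must be generated from scratch, whereas the warped pixels still carry their original content and only need refinement.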
[CV-8] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning
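[Quick Read]: This paper addresses the long acquisition times that limit the clinical use of high-resolution magnetic resonance imaging (MRI). It proposes a computationally efficient and accurate deep learning framework for super-resolution (SR) that raises resolution while preserving anatomical detail. The architecture combines multi-head selective state-space models (MHSSM) with a lightweight channel multilayer perceptron (MLP). It uses 2D patch extraction with hybrid scanning to capture long-range dependencies, and each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. With only 0.9M parameters and 57 GFLOPs, the model is reported to outperform baselines such as SwinIR, MambaIR, and Res-SRDiff while substantially reducing computational cost.
Link: https://arxiv.org/abs/2512.19676
Authors: Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex, Chih-Wei Chang, Erik H Middlebrooks, Richard L.J Qiu, Pretesh Patel, Ashesh B. Jania, Hui Mao, Zhen Tian, Xiaofeng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: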
Abstract:Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951±0.021, PSNR=26.90±1.41 dB, LPIPS=0.076±0.022, GMSD=0.083±0.017, significantly outperforming all baselines (p < 0.001). For prostate data: SSIM=0.770±0.049, PSNR=27.15±2.19 dB, LPIPS=0.190±0.095, GMSD=0.087±0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.
[CV-9] Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
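[Quick Read]: This paper addresses cross-modal alignment in medical imaging, specifically the performance bottleneck of automated diabetic retinopathy (DR) diagnosis systems in relating fundus images to clinical text. General-domain vision-language models such as CLIP perform well on natural images but, lacking domain knowledge, struggle with image-text retrieval and joint understanding in medical settings. The proposed knowledge-enhanced joint embedding framework uses a multimodal transformer to fuse retinal fundus images, clinical narratives, and structured demographic and clinical features. It is trained with multiple objectives: contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading. On the BRSET dataset the method reaches 99.94% Recall@1 for text-to-image retrieval, compared with 1.29% for fine-tuned CLIP, and it generalizes to the unseen DeepEyeNet dataset with 93.95% Recall@1 in zero-shot evaluation.
Link: https://arxiv.org/abs/2512.19663
Authors: Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 14 figures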
Abstract:Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP’s 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
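Since the headline numbers are Recall@1 scores, a short sketch of how that metric is typically computed may help. It assumes a one-to-one pairing in which image i is the only correct match for text query i; the paper's exact evaluation protocol is not given in the abstract, and the similarity matrix below is made up.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item appears among the top-k scores.

    similarity[i][j] is the score between text query i and image j;
    image i is assumed to be the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical 3x3 text-to-image similarity matrix.
similarity = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own image first
    [0.2, 0.8, 0.1],  # query 1 ranks its own image first
    [0.7, 0.2, 0.3],  # query 2 ranks image 0 first, its own image second
]
r1 = recall_at_k(similarity, k=1)
r2 = recall_at_k(similarity, k=2)
```

In this toy case two of three queries retrieve the right image at rank 1, and all three do so within the top 2.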
[CV-10] Over++: Generative Video Compositing for Layer Interaction Effects
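[Quick Read]: This paper addresses the inefficiency of professional video compositing workflows, in which artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. The authors introduce a new task, augmented compositing, and present Over++, a framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. They construct a paired effect dataset tailored to the task and introduce an unpaired augmentation strategy that preserves text-driven editability. The method also supports optional mask control and keyframe guidance without dense annotations, and despite limited training data it produces diverse, realistic environmental effects while preserving the original scene.
Link: https://arxiv.org/abs/2512.19661
Authors: Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman
Affiliations: University of North Carolina at Chapel Hill; University of Maryland; Industrial Light & Magic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL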
Abstract:In professional video compositing workflows, artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.
[CV-11] 4D Gaussian Splatting as a Learned Dynamical System
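[Quick Read]: This paper addresses the poor motion coherence, weak temporal extrapolation, and difficulty of local dynamic control in conventional deformation-field-based 4D Gaussian Splatting for dynamic scenes. The key idea is to reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, in which scene motion arises from integrating a learned neural dynamical field rather than from per-frame deformations. The framework, called EvoGS, treats the Gaussian representation as an evolving physical system driven by a learned motion law. This enables sample-efficient learning from sparse temporal supervision, forward and backward prediction beyond the observed time range, and controllable scene synthesis through localized dynamics injection.
Link: https://arxiv.org/abs/2512.19648
Authors: Arnold Caleb Asiimwe, Carl Vondrick
Affiliations: Princeton University; Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: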
Abstract:We reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, where scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. This formulation, which we call EvoGS, treats the Gaussian representation as an evolving physical system whose state evolves continuously under a learned motion law. This unlocks capabilities absent in deformation-based approaches: (1) sample-efficient learning from sparse temporal supervision by modeling the underlying motion law; (2) temporal extrapolation enabling forward and backward prediction beyond observed time ranges; and (3) compositional dynamics that allow localized dynamics injection for controllable scene synthesis. Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering.
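To make "integrating a dynamical field" concrete, the sketch below moves one 2D point (standing in for a Gaussian center) through a velocity field with a classical RK4 integrator, forward and then backward in time. The rotation field is a hand-written stand-in for the learned neural field, and the integrator choice, function names, and step count are assumptions rather than details from the paper.

```python
import math


def velocity_field(pos, t):
    """Hypothetical stand-in for a learned dynamical field: rigid rotation about the origin."""
    x, y = pos
    return (-y, x)


def rk4_step(f, pos, t, dt):
    """One classical Runge-Kutta step for d(pos)/dt = f(pos, t)."""
    k1 = f(pos, t)
    k2 = f(tuple(p + 0.5 * dt * k for p, k in zip(pos, k1)), t + 0.5 * dt)
    k3 = f(tuple(p + 0.5 * dt * k for p, k in zip(pos, k2)), t + 0.5 * dt)
    k4 = f(tuple(p + dt * k for p, k in zip(pos, k3)), t + dt)
    return tuple(p + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for p, a, b, c, d in zip(pos, k1, k2, k3, k4))


def integrate(f, pos, t0, t1, n_steps=100):
    """Integrate from t0 to t1; t1 < t0 gives a negative step, i.e. backward prediction."""
    dt = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        pos = rk4_step(f, pos, t, dt)
        t += dt
    return pos


start = (1.0, 0.0)
forward = integrate(velocity_field, start, 0.0, math.pi / 2)     # quarter turn forward
recovered = integrate(velocity_field, forward, math.pi / 2, 0.0)  # integrate back in time
```

Because the state is defined by a motion law rather than by per-frame offsets, the same integrator can be queried at any time, including times outside the observed range, which is what makes extrapolation in both directions possible.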
[CV-12] Generative diffusion models for agricultural AI: plant image generation indoor-to-outdoor translation and expert preference alignment
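[Quick Read]: This paper addresses the difficulty of obtaining high-quality plant image data for agricultural AI: collecting large, diverse, high-quality images in real field conditions is costly, labor intensive, and seasonally constrained. The solution is a diffusion-based framework with three components. First, a Stable Diffusion model is fine-tuned on captioned indoor and outdoor plant imagery to generate realistic, text-conditioned images of canola and soybean, which improves downstream phenotype classification accuracy. Second, DreamBooth-based text inversion and image-guided diffusion translate high-resolution indoor data into outdoor-style imagery, which improves weed detection and classification with YOLOv8. Third, a preference-guided fine-tuning scheme trains a reward model on expert scores and applies reward-weighted updates to produce more stable, expert-aligned outputs. Together these form a data-efficient generative pipeline for agricultural AI.
Link: https://arxiv.org/abs/2512.19632
Authors: Da Tan, Michael Beck, Christopher P. Bidinosti, Robert H. Gulden, Christopher J. Henry
Affiliations: University of Winnipeg; University of Manitoba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: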
Abstract:The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert preference aligned fine tuning. First, a Stable Diffusion model is fine tuned on captioned indoor and outdoor plant imagery to generate realistic, text conditioned images of canola and soybean. Evaluation using Inception Score, Frechet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference guided fine tuning framework trains a reward model on expert scores and applies reward weighted updates to produce more stable and expert aligned outputs. Together, these components demonstrate a practical pathway toward data efficient generative pipelines for agricultural AI.
[CV-13] LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry
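Quick read: This paper addresses trajectory planning for mobile robots in unstructured environments, where traditional modular pipelines are limited by latency and error accumulation across perception, localization, mapping, and planning modules. Existing end-to-end methods improve efficiency but still rely on separate localization modules that require accurate sensor extrinsic calibration, which limits generalization across embodiments and environments. The proposed LoGoPlanner framework rests on three mechanisms: (1) fine-tuning a long-horizon visual-geometry backbone to ground predictions in absolute metric scale, which provides implicit state estimation; (2) reconstructing the surrounding scene geometry from historical observations to give the fine-grained environmental awareness needed for obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by these auxiliary tasks, which reduces error propagation. Experiments in simulation and in the real world show more consistent and robust planning and strong generalization across platforms and environments.
Link: https://arxiv.org/abs/2512.19629
Authors: Jiaqi Peng,Wenzhe Cai,Yuqiang Yang,Tai Wang,Yuan Shen,Jiangmiao Pang
Affiliations: Tsinghua University; Shanghai AI Laboratory
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL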
Abstract:Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error propagation. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models have been made publicly available on the project page: this https URL.
[CV-14] MapTrace: Scalable Data Generation for Route Tracing on Maps
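Quick read: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on fine-grained spatial understanding tasks such as route tracing on maps, where models often fail to respect basic path constraints. The key idea is a scalable synthetic data generation pipeline that uses synthetic map images and pixel-level parsing to produce precise annotations automatically, giving the model supervision for fine-grained spatial reasoning. This sidesteps the high cost and difficulty of collecting large-scale, pixel-accurate path annotations in the real world. Fine-tuning on 23k path samples improves robustness on the MapBench benchmark and reduces path-tracing error (NDTW).
Link: https://arxiv.org/abs/2512.19609
Authors: Artemis Panagopoulou,Aveek Purohit,Achin Kulshrestha,Soroosh Yazdani,Mohit Goyal
Affiliations: Google XR; University of Pennsylvania
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: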
Abstract:While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that fine-tuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.
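The abstract reports path-tracing error as NDTW without giving its definition. As a rough illustration, the sketch below computes a dynamic time warping (DTW) cost between a predicted and a reference path and divides it by the reference length. That normalization, the function names, and the toy coordinates are assumptions, so the paper's exact metric may differ.

```python
import math

def dtw_cost(pred, ref):
    """Dynamic time warping cost between two 2D paths given as lists of (x, y)."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], ref[j - 1])  # Euclidean point distance
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a reference point
                                 cost[i][j - 1],      # repeat a predicted point
                                 cost[i - 1][j - 1])  # advance both paths
    return cost[n][m]

def normalized_dtw(pred, ref):
    """DTW cost divided by the reference path length (assumed normalization)."""
    return dtw_cost(pred, ref) / len(ref)

# Toy paths in pixel coordinates: the prediction is offset by one pixel in y.
reference = [(0, 0), (1, 0), (2, 0)]
predicted = [(0, 1), (1, 1), (2, 1)]
error = normalized_dtw(predicted, reference)
```

Under this definition a perfect trace scores 0 and larger values mean the predicted route strays further from the annotated one.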
[CV-15] KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning
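Quick read: This paper studies how regularizing the distribution of the embedding space affects training stability and downstream generalization in self-supervised representation learning. It proposes KerJEPAs, a new and flexible family of JEPA-style self-supervised algorithms with kernel-based regularizers. One member of the family corresponds to the Epps-Pulley regularizer introduced in LeJEPA, which approximates a sliced maximum mean discrepancy (MMD) with a Gaussian prior and a Gaussian kernel. By deriving the closed-form high-dimensional limit of sliced MMDs, the authors widen the range of usable kernels and priors and obtain alternative KerJEPAs with improved training stability and design flexibility.
Link: https://arxiv.org/abs/2512.19605
Authors: Eric Zimmermann,Harley Wiltzer,Justin Szeto,David Alvarez-Melis,Lester Mackey
Affiliations: Microsoft Research, Cambridge, MA, United States; Mila–Québec AI Institute, Montreal, Canada; McGill University, Montreal, Canada
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: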
Abstract:Recent breakthroughs in self-supervised Joint-Embedding Predictive Architectures (JEPAs) have established that regularizing Euclidean representations toward isotropic Gaussian priors yields provable gains in training stability and downstream generalization. We introduce a new, flexible family of KerJEPAs, self-supervised learning algorithms with kernel-based regularizers. One instance of this family corresponds to the recently-introduced LeJEPA Epps-Pulley regularizer which approximates a sliced maximum mean discrepancy (MMD) with a Gaussian prior and Gaussian kernel. By expanding the class of viable kernels and priors and computing the closed-form high-dimensional limit of sliced MMDs, we develop alternative KerJEPAs with a number of favorable properties including improved training stability and design flexibility.
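To make the sliced MMD idea concrete, here is a minimal Monte Carlo sketch: embeddings are projected onto random unit directions and each 1D projection is compared with samples from a standard Gaussian using a Gaussian kernel. This is an illustration under stated assumptions, not the paper's method; the paper derives closed forms, whereas this sketch samples the prior, and the names `sliced_mmd2`, `num_slices`, and `bandwidth` are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between two 1D samples."""
    diff = a[:, None] - b[None, :]
    return np.exp(-(diff ** 2) / (2.0 * bandwidth ** 2))

def mmd2_1d(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two 1D samples."""
    kxx = gaussian_kernel_1d(x, x, bandwidth).mean()
    kyy = gaussian_kernel_1d(y, y, bandwidth).mean()
    kxy = gaussian_kernel_1d(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

def sliced_mmd2(z, num_slices=32, bandwidth=1.0, seed=0):
    """Average 1D MMD^2 between projected embeddings and an N(0, 1) sample."""
    rng = np.random.default_rng(seed)
    n, d = z.shape
    directions = rng.standard_normal((num_slices, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Any 1D projection of an isotropic standard Gaussian is again N(0, 1).
    prior = rng.standard_normal(n)
    proj = z @ directions.T  # shape (n, num_slices)
    return float(np.mean([mmd2_1d(proj[:, s], prior, bandwidth)
                          for s in range(num_slices)]))

# Embeddings that already look Gaussian should score lower than shifted,
# rescaled ones.
rng = np.random.default_rng(1)
gaussian_like = rng.standard_normal((256, 8))
off_prior = 3.0 * gaussian_like + 2.0
```

Used as a regularizer, a term like this would be added to the predictive loss so that embeddings are pulled toward the isotropic Gaussian prior.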
[CV-16] No Data? No Problem: Robust Vision-Tabular Learning with Missing Values
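Quick read: This paper addresses the fusion of medical images with tabular data (such as demographics or clinical measurements) when complete tabular information is available during training but values are missing at inference. Tabular attributes are often incomplete in practice, and existing methods struggle to stay stable across different levels of missingness. The proposed RoVTL (Robust Vision-Tabular Learning) framework has two stages. The first is contrastive pretraining that uses tabular attribute missingness as a data augmentation to build robustness. The second is downstream tuning with a gated cross-attention module for multimodal fusion, combined with a Tabular More vs. Fewer loss that ranks performance by the amount of available tabular data and with disentangled gradient learning, so that performance stays consistent at any level of tabular completeness from 0% to 100%.
Link: https://arxiv.org/abs/2512.19602
Authors: Marta Hasny,Laura Daza,Keno Bressem,Maxime Di Folco,Julia Schnabel
Affiliations: Technical University of Munich; Helmholtz Munich; King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: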
Abstract:Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at this https URL.
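Two ideas from the abstract lend themselves to a short sketch: masking tabular attributes at random as an augmentation, and a ranking term that penalizes the model when it does worse with more tabular data than with less. The abstract gives neither formula, so the zero-fill masking, the hinge form of the penalty, and every name below are assumptions for illustration only.

```python
import numpy as np

def mask_tabular(row, missing_rate, rng):
    """Randomly drop attributes; returns (masked values, presence mask)."""
    row = np.asarray(row, dtype=float)
    present = rng.random(row.shape) >= missing_rate  # True where kept
    return np.where(present, row, 0.0), present

def more_vs_fewer_penalty(loss_more, loss_fewer, margin=0.0):
    """Hinge-style ranking term: zero when the view with more tabular data
    has the lower loss, positive otherwise (assumed form)."""
    return max(0.0, loss_more - loss_fewer + margin)

rng = np.random.default_rng(0)
attributes = [63.0, 1.0, 27.4, 120.0]  # hypothetical age, sex, BMI, blood pressure
masked, present = mask_tabular(attributes, missing_rate=0.5, rng=rng)
```

Passing the presence mask alongside the masked values lets a downstream model tell a missing attribute apart from a genuine zero.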
[CV-17] BabyFlow: 3D modeling of realistic and expressive infant faces
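Quick read: This paper addresses the difficulty of modeling infant faces for craniofacial morphology analysis, where data are scarce and spontaneous expressions are frequent; accurate reconstruction and synthesis of infant faces matter in particular for early detection of developmental disorders. The proposed BabyFlow model uses normalizing flows to learn disentangled representations of facial identity and expression so that each can be controlled independently. Cross-age expression transfer maps expressions from adult 3D scans onto infant data, enriching the diversity and realism of expressive infant data. This improves 3D reconstruction accuracy in highly expressive regions such as the mouth, eyes, and nose, and supports generating or editing infant expressions while preserving identity. Combined with diffusion models, BabyFlow also generates high-quality 2D infant images with consistent 3D geometry, providing tools for data augmentation and early facial analysis.
Link: https://arxiv.org/abs/2512.19560
Authors: Antonia Alomar,Mireia Masias,Marius George Linguraru,Federico M. Sukno,Gemma Piella
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: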
Abstract:Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.
[CV-18] ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
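Quick read: This paper targets three problems in current talking avatar generation: limited text-following ability for diverse actions, a lack of precise temporal alignment between actions and audio content, and reliance on extra control signals such as pose skeletons. The proposed ActAvatar framework has three core components: (1) Phase-Aware Cross-Attention (PACA), which decomposes the prompt into a global base block and temporally anchored phase blocks so that action semantics align with temporal context; (2) Progressive Audio-Visual Alignment, which adjusts the influence of each modality along the hierarchical feature learning process, with early layers prioritizing text to establish action structure and deeper layers emphasizing audio to refine lip movements, avoiding modality interference; and (3) a two-stage training strategy that first establishes robust audio-visual correspondence on diverse data and then injects action control by fine-tuning on structured annotations, preserving both audio-visual alignment and text following.
Link: https://arxiv.org/abs/2512.19546
Authors: Ziqiao Peng,Yi Chen,Yifeng Ma,Guozhen Zhang,Zhiyao Sun,Zixiang Zhou,Youliang Zhang,Zhengguang Zhou,Zhaoxin Fan,Hongyan Liu,Yuan Zhou,Qinglin Lu,Jun He
Affiliations: Renmin University of China; Tencent Hunyuan; Beihang University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL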
Abstract:Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model’s text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
[CV-19] StoryMem: Multi-shot Long Video Storytelling with Memory
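Quick read: This paper addresses insufficient cross-shot consistency in long-form video generation, that is, how to keep narrative coherence and visual quality across a multi-shot video. The proposed StoryMem framework introduces an explicit visual memory: a compact, dynamically updated memory bank of keyframes from previously generated shots. The stored memory is injected into a single-shot video diffusion model through latent concatenation and negative RoPE shifts, and only LoRA fine-tuning is needed to obtain multi-shot story generation. The design improves long-range consistency while preserving high aesthetic quality and prompt adherence, a step toward coherent minute-long video generation.
Link: https://arxiv.org/abs/2512.19539
Authors: Kaiwen Zhang,Liming Jiang,Angtian Wang,Jacob Zhiyuan Fang,Tiancheng Zhi,Qing Yan,Hao Kang,Xin Lu,Xingang Pan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL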
Abstract:Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
[CV-20] CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
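Quick read: This paper addresses the high memory and compute cost that Vision-Language Models (VLMs) incur when image tokens are inserted into the text stream (token insertion) for high-resolution images, long conversations, or streaming video. VLMs based on cross-attention are efficient and scalable but show a clear performance gap on tasks that need fine-grained visual understanding. The key finding is that enabling local text-to-text interaction inside the dedicated cross-attention layers improves such models. Building on this, the authors propose CASA (Cross-Attention via Self-Attention), which substantially narrows the gap with full token insertion on common image understanding benchmarks while keeping the scalability of cross-attention models on long-context multimodal tasks such as streaming video captioning.
Link: https://arxiv.org/abs/2512.19535
Authors: Moritz Böhle,Amélie Royer,Juliette Marrie,Edouard Grave,Patrick Pérez
Affiliations: Kyutai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: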
Abstract:Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at this https URL.
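One way to picture "local text-to-text interaction in the dedicated cross-attention layers" is an attention step in which each text token attends, under a single softmax, to all image tokens plus a short causal window of text tokens. The sketch below is a loose illustration of that reading, not the CASA architecture: it omits learned query/key/value projections and multiple heads, and the `window` parameter and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fused_attention(text, image, window=4):
    """text: (T, d), image: (I, d). Returns (output, weights), where weights
    has shape (T, I + T): image keys first, then text keys."""
    T, d = text.shape
    I = image.shape[0]
    keys = np.concatenate([image, text], axis=0)
    scores = text @ keys.T / np.sqrt(d)
    # Text keys are visible only inside a causal local window of the query.
    q_idx = np.arange(T)[:, None]
    k_idx = np.arange(T)[None, :]
    local = (k_idx <= q_idx) & (k_idx > q_idx - window)
    mask = np.concatenate([np.ones((T, I), dtype=bool), local], axis=1)
    scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ keys, weights

rng = np.random.default_rng(0)
text_tokens = rng.standard_normal((6, 16))
image_tokens = rng.standard_normal((5, 16))
fused, attn = fused_attention(text_tokens, image_tokens, window=2)
```

Because the text window is short and fixed, the cost per text token grows with the number of image tokens but not with the full conversation length, which is the scalability property the abstract attributes to cross-attention models.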
[CV-21] SlicerOrbitSurgerySim: An Open-Source Platform for Virtual Registration and Quantitative Comparison of Preformed Orbital Plates
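Quick read: This paper addresses poor adaptation of orbital implants, which contributes to postoperative complications and revision surgery. Preformed orbital plates are widely used because they cost less and shorten operative time, but surgeons lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. The authors developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that supports interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The tool produces reproducible quantitative plate-to-orbit distance metrics and visualizations, supporting both patient-specific preoperative decisions and population-level statistical analysis. It aims to enable objective comparison of implant designs and placement strategies, reduce intraoperative plate modification, and promote collaborative research and surgical education.
Link: https://arxiv.org/abs/2512.19534
Authors: Chi Zhang,Braedon Gunn,Andrew M. Read-Fuller
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures. Submitted to Journal of Oral and Maxillofacial Surgery. Code: this https URL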
Abstract:Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.
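A plate-to-orbit distance metric can be sketched as the distance from each plate vertex to its nearest point on the orbital surface, summarized over the plate. The abstract does not say how SlicerOrbitSurgerySim computes its metrics, so the brute-force nearest-neighbor search, the choice of summary statistics, and the function names below are assumptions, and the coordinates are toy values rather than real anatomy.

```python
import numpy as np

def plate_to_orbit_distances(plate_points, orbit_points):
    """For each plate vertex, the distance to the closest orbit surface point."""
    plate = np.asarray(plate_points, dtype=float)
    orbit = np.asarray(orbit_points, dtype=float)
    diffs = plate[:, None, :] - orbit[None, :, :]  # (P, O, 3) pairwise offsets
    dists = np.linalg.norm(diffs, axis=2)          # (P, O) pairwise distances
    return dists.min(axis=1)

def summarize_fit(distances):
    """Summary statistics that could be compared across candidate plates."""
    return {
        "mean": float(np.mean(distances)),
        "max": float(np.max(distances)),
        "p95": float(np.percentile(distances, 95)),
    }

# Toy point sets in millimetres.
orbit_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
plate_vertices = [(0.0, 0.0, 2.0), (1.0, 0.0, 1.0)]
fit = summarize_fit(plate_to_orbit_distances(plate_vertices, orbit_surface))
```

For real meshes with many vertices, a spatial index such as a k-d tree would replace the pairwise distance matrix, which grows with the product of the two point counts.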
[CV-22] Multi-Modal Soccer Scene Analysis with Masked Pre-Training WACV2026
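Quick read: This paper addresses three core tasks in analyzing soccer tactical camera footage: ball trajectory inference, ball state classification, and ball possessor identification. Prior methods typically depend on accurate ball tracking or handcrafted heuristics and struggle with noise or occlusion in real top-league matches. The paper proposes a multi-modal architecture that fuses three modalities (player trajectories, player types, and image crops of individual players) in a unified framework and processes spatial and temporal dynamics with a cascade of sociotemporal transformer blocks. It infers the ball trajectory without direct access to the ball's past or future positions, and it uses CropDrop, a modality-specific masking pre-training strategy, to prevent over-reliance on image features and push the model to learn cross-modal patterns, giving more robust performance in complex real-world scenes.
Link: https://arxiv.org/abs/2512.19528
Authors: Marc Peral,Guillem Capellera,Luis Ferraz,Antonio Rubio,Antonio Agudo
Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC; Kognia Sports Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures. WACV 2026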
Abstract:In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.
[CV-23] A Convolutional Neural Deferred Shader for Physics Based Rendering
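Quick read: This paper addresses the large parameter counts and high compute cost of multilayer perceptrons (MLPs) in neural rendering, and the tendency of data-driven models to handle unusual illumination such as dark scenes poorly. It proposes pbnds+, a physics-based neural deferred shading pipeline that uses convolutional neural networks (CNNs) in place of MLPs to reduce parameters and improve shading and relighting performance, and adds an energy regularization that constrains the model's reflection under dark illumination.
Link: https://arxiv.org/abs/2512.19522
Authors: Zhuo He,Yingdong Ru,Qianying Liu,Paul Henderson,Nicolas Pugeault
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: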
Abstract:Recent advances in neural rendering have achieved impressive results on photorealistic shading and relighting by using a multilayer perceptron (MLP) as a regression model to learn the rendering equation from a real-world dataset. Such methods show promise for photorealistically relighting real-world objects, which is difficult for classical rendering because material ground truth is not easily obtained. However, significant challenges remain. The dense connections in MLPs result in a large number of parameters, which requires substantial computational resources, complicates training, and reduces rendering performance. Data-driven approaches require large amounts of training data for generalization, and unbalanced data might bias the model to ignore unusual illumination conditions, e.g. dark scenes. This paper introduces pbnds+, a novel physics-based neural deferred shading pipeline that uses convolutional neural networks to decrease the number of parameters and improve performance in shading and relighting tasks. Energy regularization is also proposed to restrict the model’s reflection under dark illumination. Extensive experiments demonstrate that our approach outperforms classical baselines, a state-of-the-art neural shading model, and a diffusion-based method.
[CV-24] Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation
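Quick read: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on understanding clinical anatomical surgical images, where the complexity of medical data and the scarcity of high-quality expert annotations limit Supervised Fine-Tuning (SFT). The authors identify two weaknesses of Group Relative Policy Optimization (GRPO) in this setting: (1) knowledge is not shared effectively between anatomical structures, which causes uneven information gain and prevents convergence; and (2) the model locks onto a single reasoning path too early, suppressing exploration of diverse strategies. Two methods are proposed. Anatomical Similarity Curriculum Learning is a progressive strategy that controls question difficulty through the similarity of answer choices so the model masters complex problems step by step. Group Diversity Question Augmentation expands the search space for difficult queries and reduces uniform responses. Experiments show significant improvements on the SGG-VQA and OmniMedVQA medical visual question answering benchmarks.
Link: https://arxiv.org/abs/2512.19512
Authors: Ziyang Song,Zelin Zang,Zuyao Chen,Xusheng Liang,Dong Yi,Jinlin Wu,Hongbin Liu,Jiebo Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: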
Abstract:Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO’s reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model’s search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves a significant improvement across the two benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found in this https URL
[CV-25] FusionNet: Physics-Aware Representation Learning for Multi-Spectral and Thermal Data via Trainable Signal-Processing Priors
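Quick read: This paper addresses the brittle performance of deep learning models on multi-modal visual signals under cross-spectral and real-world conditions, which stems from inductive biases that are poorly aligned with the physical processes of signal formation. In particular, methods that prioritize direct thermal cues struggle to capture the indirect yet persistent environmental changes caused by sustained heat emissions. The proposed physics-aware representation learning framework fuses a geological Short Wave Infrared (SWIR) ratio, which is sensitive to soil property changes, with Thermal Infrared (TIR) data. Its convolutional architecture, FusionNet, embeds trainable differential signal-processing priors, combines mixed pooling strategies, and uses wider receptive fields for robustness across spectral modalities. Experiments show it outperforms existing baselines across five spectral configurations, and transfer learning results underline the importance of modality-aware training.
Link: https://arxiv.org/abs/2512.19504
Authors: Georgios Voulgaris
Affiliations: University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review at IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)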
Abstract:Modern deep learning models operating on multi-modal visual signals often rely on inductive biases that are poorly aligned with the physical processes governing signal formation, leading to brittle performance under cross-spectral and real-world conditions. In particular, approaches that prioritise direct thermal cues struggle to capture indirect yet persistent environmental alterations induced by sustained heat emissions. This work introduces a physics-aware representation learning framework that leverages multi-spectral information to model stable signatures of long-term physical processes. Specifically, a geological Short Wave Infrared (SWIR) ratio sensitive to soil property changes is integrated with Thermal Infrared (TIR) data through an intermediate fusion architecture, instantiated as FusionNet. The proposed backbone embeds trainable differential signal-processing priors within convolutional layers, combines mixed pooling strategies, and employs wider receptive fields to enhance robustness across spectral modalities. Systematic ablations show that each architectural component contributes to performance gains, with DGCNN achieving 88.7% accuracy on the SWIR ratio and FusionNet reaching 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, highlighting the importance of modality-aware training for cross-spectral learning. Evaluated on real-world data, the results demonstrate that combining physics-aware feature selection with principled deep learning architectures yields robust and generalisable representations, illustrating how first-principles signal modelling can improve multi-spectral learning under challenging conditions.
[CV-26] Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration
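Quick read: This paper addresses the combinatorial explosion problem caused by dual inputs in Deformable Medical Image Registration (DMIR): because the model processes two images at once, the number of feature combinations grows exponentially and introduces many interfering combinations that hurt feature modeling. The proposed Dynamic Stream Network (DySNet) has two core components: (1) the Adaptive Stream Basin (AdSB) module, which dynamically adjusts the shape of the receptive field so the model focuses on more strongly correlated feature relationships; and (2) the Dynamic Stream Attention (DySA) mechanism, which generates dynamic weights to search for more valuable feature relationships. Together they help the model suppress interfering feature combinations and model potential feature relationships, and DySNet outperforms state-of-the-art DMIR methods.
Link: https://arxiv.org/abs/2512.19486
Authors: Shaochen Bi,Yuting He,Weiming Wang,Hao Chen
Affiliations: Hong Kong University of Science and Technology; Case Western Reserve University; Hong Kong Metropolitan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: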
Abstract:The combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Because DMIR processes two images simultaneously as input, the number of combination relationships between features grows exponentially, so the model considers more interfering features during feature modeling. Introducing dynamics in the receptive fields and weights of the network enables the model to eliminate interfering feature combinations and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted so that the model ignores interfering feature combinations and models the potential feature relationships. DySNet has two key innovations: 1) the Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the feature relationships with greater correlation; 2) the Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code will be released on the website: this https URL.
[CV-27] Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation
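Quick read: This paper addresses the "affective shortcut" in current emotion-oriented image generation, where emotion is simply equated with semantics even though the two differ. The proposed Emotion-Director framework has two modules. The first is MC-Diffusion, a cross-modal collaborative diffusion model that combines visual prompts with textual prompts to guide emotion-oriented generation beyond semantics, and that improves DPO with a negative visual prompt to make the model more sensitive to different emotions under the same semantics. The second is MC-Agent, a cross-modal collaborative agent system that uses multiple agents to simulate human subjectivity toward emotions and a chain-of-concept workflow to rewrite textual prompts with greater visual expressiveness.
Link: https://arxiv.org/abs/2512.19479
Authors: Guoli Jia,Junyao Hu,Xinwei Long,Kai Tian,Kaiyan Zhang,KaiKai Zhao,Ning Ding,Bowen Zhou
Affiliations: Tsinghua University; The Hong Kong Polytechnic University; Frontis.AI; China Unicom; Shanghai Artificial Intelligence Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: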
Abstract:Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, in which emotions are approximated by semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion. MC-Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. Further, we improve DPO with a negative visual prompt, enhancing the model’s sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multiple agents to simulate human subjectivity toward emotions, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.
[CV-28] Sign Language Recognition using Parallel Bidirectional Reservoir Computing
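Quick read: This paper addresses the difficulty of deploying deep-learning-based Sign Language Recognition (SLR) models on edge devices because of their high computational cost. The proposed lightweight SLR system combines Parallel Bidirectional Reservoir Computing (PBRC) with MediaPipe. MediaPipe extracts hand joint coordinates in real time as input features. The PBRC architecture consists of two Bidirectional Reservoir Computing (BRC) modules, each based on an Echo State Network (ESN), arranged in parallel to capture temporal dependencies and build a rich feature representation. Training takes only 18.67 seconds, compared with over 55 minutes for deep learning methods such as Bi-GRU.
Link: https://arxiv.org/abs/2512.19451
Authors: Nitin Kumar Singh,Arie Rachmad Syulistyo,Yuichiro Tanaka,Hakaru Tamukoh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: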
Abstract:Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.
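The parallel bidirectional reservoir idea can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`make_reservoir`, `run_reservoir`, `pbrc_features`), the leaky-tanh update, the weight scaling, and the choice to use only the final reservoir state are all assumptions made for illustration. What it does show is the general property of reservoir computing that the recurrent weights stay fixed and random, so only a lightweight readout on top of the features would need training.

```python
import math
import random

def make_reservoir(n_in, n_res, scale=0.5, seed=0):
    """Create fixed random input and recurrent weights (never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_res)]
    # Divide by sqrt(n_res) to keep the recurrent dynamics from blowing up (assumed scaling).
    w_res = [[rng.uniform(-scale, scale) / math.sqrt(n_res) for _ in range(n_res)]
             for _ in range(n_res)]
    return w_in, w_res

def run_reservoir(seq, w_in, w_res, leak=0.3):
    """Feed a sequence of feature vectors through a leaky echo state network."""
    state = [0.0] * len(w_res)
    for x in seq:
        pre = [
            sum(wi * xi for wi, xi in zip(w_in[j], x))
            + sum(wr * s for wr, s in zip(w_res[j], state))
            for j in range(len(w_res))
        ]
        state = [(1 - leak) * s + leak * math.tanh(p) for s, p in zip(state, pre)]
    return state

def bidirectional_features(seq, w_in, w_res):
    """Concatenate the final states of a forward pass and a time-reversed pass."""
    forward = run_reservoir(seq, w_in, w_res)
    backward = run_reservoir(list(reversed(seq)), w_in, w_res)
    return forward + backward

def pbrc_features(seq, reservoirs):
    """Run several bidirectional reservoirs in parallel and concatenate their features."""
    features = []
    for w_in, w_res in reservoirs:
        features += bidirectional_features(seq, w_in, w_res)
    return features

# Toy input: 6 frames, each with 4 hypothetical hand-joint coordinates.
frames = [[0.1 * t, 0.2, -0.1 * t, 0.05] for t in range(6)]
reservoirs = [make_reservoir(4, 8, seed=0), make_reservoir(4, 8, seed=1)]
features = pbrc_features(frames, reservoirs)
print(len(features))  # 2 reservoirs x 2 directions x 8 units = 32
```

A classifier such as ridge regression could then be fitted on these feature vectors, and fitting only that readout is what typically keeps training time short for this family of models.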
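[CV-29] D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
[Quick read]: This paper addresses the computational burden that long visual token sequences place on Multimodal Large Language Models (MLLMs), and in particular the severe degradation of existing token pruning methods on fine-grained localization tasks. The two prevailing strategies each have a flaw: importance-based methods suffer from a strong positional bias, so the model attends to spatial position rather than semantic content, while diversity-based methods are structurally blind, ignoring the user's prompt and spatial redundancy. The key to the solution is D2Pruner, which combines debiased importance scoring with a structural pruning mechanism. It first secures a core set of critical tokens as pivots using debiased attention scores, then runs a Maximal Independent Set (MIS) selection over the remaining tokens on a hybrid graph whose edges encode spatial proximity and semantic similarity, iteratively keeping the most important available token and removing its neighbors. This sharply reduces computation while preserving task performance.
Link: https://arxiv.org/abs/2512.19443
Authors: Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan, Shouhong Ding, Biqing Qi, Linfeng Zhang
Affiliations: Tencent YouTu Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: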
Abstract:Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user’s prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2% while retaining 99.2% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7% performance at a 90% token reduction rate, marking a significant advancement with up to 63.53% improvement over existing methods.
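The two-stage selection described in the abstract (pivots first, then a greedy maximal-independent-set pass) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `select_tokens`, its parameters, and the toy chain graph are hypothetical; the importance scores are taken as already debiased; and the graph is assumed to constrain only the non-pivot tokens.

```python
def select_tokens(importance, neighbors, num_pivots, budget):
    """Keep the top `num_pivots` tokens, then greedily add the most important
    remaining tokens that are not adjacent to an already chosen non-pivot token."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    kept = order[:num_pivots]          # stage 1: pivots chosen by importance alone
    blocked = set(kept)
    for i in order[num_pivots:]:       # stage 2: greedy independent-set selection
        if len(kept) >= budget:
            break
        if i in blocked:
            continue
        kept.append(i)
        blocked.add(i)
        blocked.update(neighbors[i])   # drop neighbors so kept tokens stay diverse
    return kept

# Toy example: six tokens on a chain graph (token i is adjacent to i-1 and i+1).
importance = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(select_tokens(importance, neighbors, num_pivots=1, budget=4))  # [0, 1, 3, 5]
```

In the toy run, token 2 is skipped despite its high score because it neighbors token 1, which is the diversity effect the independent-set constraint is meant to produce.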
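[CV-30] MT-Mark: Rethinking Image Watermarking via Mutual-Teacher Collaboration with Adaptive Feature Modulation
[Quick read]: This paper addresses the weak coupling between the embedder and extractor in existing deep image watermarking methods: the two are linked only indirectly through a final loss, with no explicit collaboration, so the embedder cannot use decoding-aware cues and the extractor cannot guide embedding during training. The key to the solution is an explicitly collaborative architecture. A Collaborative Interaction Mechanism (CIM) establishes a direct, bidirectional channel between embedder and extractor, enabling a mutual-teacher training paradigm. An Adaptive Feature Modulation Module (AFMM) decouples modulation structure from modulation strength to provide content-aware feature regulation, steering the watermark toward stable image features and suppressing host interference during extraction. CIM and the AFMMs on both sides form a closed loop that aligns embedding behavior with extraction objectives, so robustness comes from coordinated representation learning rather than exhaustive distortion simulation.
Link: https://arxiv.org/abs/2512.19438
Authors: Fei Ge, Ying Huang, Jie Liu, Guixuan Zhang, Zhi Zeng, Shuwu Zhang, Hu Guan
Affiliations: Beijing University of Posts and Telecommunications; Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: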
Abstract:Existing deep image watermarking methods follow a fixed embedding-distortion-extraction pipeline, where the embedder and extractor are weakly coupled through a final loss and optimized in isolation. This design lacks explicit collaboration, leaving no structured mechanism for the embedder to incorporate decoding-aware cues or for the extractor to guide embedding during training. To address this architectural limitation, we rethink deep image watermarking by reformulating embedding and extraction as explicitly collaborative components. To realize this reformulation, we introduce a Collaborative Interaction Mechanism (CIM) that establishes direct, bidirectional communication between the embedder and extractor, enabling a mutual-teacher training paradigm and coordinated optimization. Built upon this explicitly collaborative architecture, we further propose an Adaptive Feature Modulation Module (AFMM) to support effective interaction. AFMM enables content-aware feature regulation by decoupling modulation structure and strength, guiding watermark embedding toward stable image features while suppressing host interference during extraction. Under CIM, the AFMMs on both sides form a closed-loop collaboration that aligns embedding behavior with extraction objectives. This architecture-level redesign changes how robustness is learned in watermarking systems. Rather than relying on exhaustive distortion simulation, robustness emerges from coordinated representation learning between embedding and extraction. Experiments on real-world and AI-generated datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality, showing strong robustness and generalization.
[CV-31] dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
[Quick read]: This paper addresses the low efficiency and limited generation quality of Test-Time Scaling (TTS) for Diffusion Multi-modal Large Language Models (dMLLMs). Conventional methods search linearly over two axes, trajectory exploration and iterative refinement, at O(NT) cost, and rely on an external verifier for best-of-N selection, which is resource-intensive and hard to deploy. The key to the solution is the dMLLM-TTS framework: first, a hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories; second, a self-verified feedback mechanism that uses the model's own image understanding to assess text-image alignment in place of an external verifier. Together these substantially improve generation quality and deliver up to 6x greater efficiency than linear search.
Link: https://arxiv.org/abs/2512.19433
Authors: Yi Xin, Siqi Luo, Qi Qin, Haoxing Chen, Kaiwen Zhu, Zhiwei Zhang, Yangfan He, Rongchao Zhang, Jinbin Bai, Shuo Cao, Bin Fu, Junjun He, Yihao Liu, Yuewen Cao, Xiaohong Liu
Affiliations: Nanjing University; Shanghai Innovation Institute; Shanghai AI Lab; Shanghai Jiao Tong University; Peking University; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs’ intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (Lumina-DiMOO, MMaDA, and Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: this https URL.
[CV-32] Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis
[Quick read]: This paper addresses non-invasive diagnosis of esophageal varices (EV). Diagnosis traditionally relies on invasive endoscopy, while non-contrast computed tomography (NCCT) has yet to be fully used in clinical assessment. The key to the solution is Multi-Organ-COhesion Network++ (MOON++), a multimodal deep learning framework that fuses imaging features of the esophagus, liver, and spleen and incorporates clinical knowledge priors linking organ volumetric relationships to liver disease severity. It improves EV grading over single-organ methods, with AUCs of 0.894 versus 0.803 for severe (G3) EV and 0.921 versus 0.793 for moderate-to-severe (≥G2) EV.
Link: https://arxiv.org/abs/2512.19415
Authors: Xiaoming Zhang, Chunli Li, Jiacheng Hao, Yuan Gao, Danyang Tu, Jianyi Qiao, Xiaoli Yin, Le Lu, Ling Zhang, Ke Yan, Yang Hou, Yu Shi
Affiliations: Shengjing Hospital of China Medical University; Alibaba Group; Tsinghua University; Sorbonne University; Hupan Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Medical Image Analysis
Abstract:Esophageal varices (EV) represent a critical complication of portal hypertension, affecting approximately 60% of cirrhosis patients with a significant bleeding risk of ~30%. While traditionally diagnosed through invasive endoscopy, non-contrast computed tomography (NCCT) presents a potential non-invasive alternative that has yet to be fully utilized in clinical practice. We present Multi-Organ-COhesion Network++ (MOON++), a novel multimodal framework that enhances EV assessment through comprehensive analysis of NCCT scans. Inspired by clinical evidence correlating organ volumetric relationships with liver disease severity, MOON++ synthesizes imaging characteristics of the esophagus, liver, and spleen through multimodal learning. We evaluated our approach on 1,631 patients; those with endoscopically confirmed EV were classified into four severity grades. Validation in 239 patient cases and independent testing in 289 cases demonstrate superior performance compared to conventional single-organ methods, achieving an AUC of 0.894 versus 0.803 for severe-grade EV classification (G3 versus <G3) and 0.921 versus 0.793 for the differentiation of moderate to severe grades (≥G2 versus <G2). We conducted a reader study involving experienced radiologists to further validate the performance of MOON++. To our knowledge, MOON++ represents the first comprehensive multi-organ NCCT analysis framework incorporating clinical knowledge priors for EV assessment, potentially offering a promising non-invasive diagnostic alternative.
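[CV-33] Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
[Quick read]: This paper addresses the way policy robustness in robot learning is limited by the cost of collecting diverse demonstrations, especially for spatial generalization in manipulation tasks. The key to the solution is the Real2Edit2Real framework, which bridges 3D editability and 2D visual data through a 3D control interface to generate new demonstrations. It first reconstructs scene geometry from multi-view RGB observations, then performs depth-reliable 3D editing on point clouds to produce new manipulation trajectories, geometrically correcting robot poses to recover physically consistent depth. Finally, a multi-conditional video generation model, guided primarily by depth together with action, edge, and ray maps, synthesizes spatially augmented multi-view manipulation videos. Experiments show that policies trained on data generated from only 1-5 source demonstrations match or outperform those trained on 50 real demonstrations.
Link: https://arxiv.org/abs/2512.19402
Authors: Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen, Xiaoqi Li, Guanghui Ren, Hao Dong
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: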
Abstract:Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework’s flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
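[CV-34] TwinAligner: Visual-Dynamic Alignment Empowers Physics-aware Real2Sim2Real for Robotic Manipulation
[Quick read]: This paper addresses robot learning's dependence on expensive real-world data and the visual and dynamic gaps between simulation and reality that hinder policy transfer. The key to the solution is TwinAligner, a Real2Sim2Real system built on two modules: a visual alignment module that achieves pixel-level alignment through SDF reconstruction and editable 3DGS rendering, and a dynamic alignment module that ensures dynamic consistency by identifying rigid-body physics from robot-object interaction. The design supports scalable data collection and a trustworthy iterative cycle, and lets policies trained in simulation achieve strong zero-shot generalization to the real world.
Link: https://arxiv.org/abs/2512.19390
Authors: Hongwei Fan, Hang Dai, Jiyao Zhang, Jinzhou Li, Qiyang Yan, Yujie Zhao, Mingju Gao, Jinghang Wu, Hao Tang, Hao Dong
Affiliations: CFCS, School of Computer Science, Peking University; PKU-AgiBot Lab; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: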
Abstract:The robotics field is evolving towards data-driven, end-to-end learning, inspired by multimodal large models. However, reliance on expensive real-world data limits progress. Simulators offer cost-effective alternatives, but the gap between simulation and reality challenges effective policy transfer. This paper introduces TwinAligner, a novel Real2Sim2Real system that addresses both visual and dynamic gaps. The visual alignment module achieves pixel-level alignment through SDF reconstruction and editable 3DGS rendering, while the dynamic alignment module ensures dynamic consistency by identifying rigid physics from robot-object interaction. TwinAligner improves robot learning by providing scalable data collection and establishing a trustworthy iterative cycle, accelerating algorithm development. Quantitative evaluations highlight TwinAligner’s strong capabilities in visual and dynamic real-to-sim alignment. This system enables policies trained in simulation to achieve strong zero-shot generalization to the real world. The high consistency between real-world and simulated policy performance underscores TwinAligner’s potential to advance scalable robot learning. Code and data will be released on this https URL
[CV-35] DSTED: Decoupling Temporal Stabilization and Discriminative Enhancement for Surgical Workflow Recognition
[Quick read]: This paper addresses two problems in current surgical workflow recognition: prediction jitter across consecutive frames and poor discrimination of ambiguous phases. The core of the solution is DSTED, a dual-pathway framework: (1) Reliable Memory Propagation (RMP) filters and fuses high-confidence historical features through multi-criteria reliability assessment to maintain temporal coherence; (2) Uncertainty-Aware Prototype Retrieval (UPR) builds learnable class-specific prototypes from high-uncertainty samples and performs adaptive prototype matching to refine the representation of ambiguous frames. A confidence-driven gate then dynamically balances the two pathways, giving stable and accurate workflow recognition.
Link: https://arxiv.org/abs/2512.19387
Authors: Yueyao Chen, Kai-Ni Wang, Dario Tayupo, Arnaud Huaulmé, Krystel Nyangoh Timoh, Pierre Jannin, Qi Dou
Affiliations: The Chinese University of Hong Kong; Université de Rennes; Centre Hospitalier Universitaire de Rennes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Early accepted to IPCAI 2026
Abstract:Purpose: Surgical workflow recognition enables context-aware assistance and skill assessment in computer-assisted interventions. Despite recent advances, current methods suffer from two critical challenges: prediction jitter across consecutive frames and poor discrimination of ambiguous phases. This paper aims to develop a stable framework by selectively propagating reliable historical information and explicitly modeling uncertainty for hard sample enhancement. Methods: We propose a dual-pathway framework DSTED with Reliable Memory Propagation (RMP) and Uncertainty-Aware Prototype Retrieval (UPR). RMP maintains temporal coherence by filtering and fusing high-confidence historical features through multi-criteria reliability assessment. UPR constructs learnable class-specific prototypes from high-uncertainty samples and performs adaptive prototype matching to refine ambiguous frame representations. Finally, a confidence-driven gate dynamically balances both pathways based on prediction certainty. Results: Our method achieves state-of-the-art performance on AutoLaparo-hysterectomy with 84.36% accuracy and 65.51% F1-score, surpassing the second-best method by 3.51% and 4.88% respectively. Ablations reveal complementary gains from RMP (2.19%) and UPR (1.93%), with synergistic effects when combined. Extensive analysis confirms substantial reduction in temporal jitter and marked improvement on challenging phase transitions. Conclusion: Our dual-pathway design introduces a novel paradigm for stable workflow recognition, demonstrating that decoupling the modeling of temporal consistency and phase ambiguity yields superior performance and clinical applicability. 
Submitted: [v1] Mon, 22 Dec 2025 13:36:26 UTC (3,229 KB). DOI: https://doi.org/10.48550/arXiv.2512.19387 (pending registration)
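[CV-36] Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization
[Quick read]: This paper addresses the high power consumption of drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs), which rely on dense computation, and the fact that spiking neural networks (SNNs) remain underexplored for DVGL, where the sparsity of spike-driven computation tends to lose critical information and makes long-range dependencies hard to learn. The key to the solution is SpikeViMFormer, the first SNN framework designed for DVGL. It uses a lightweight spike-driven transformer backbone for coarse-grained feature extraction, a Spike-driven Selective Attention (SSA) block whose spike-driven gating enhances features selectively and highlights discriminative regions, and a Spike-driven Hybrid State Space (SHS) block that learns long-range dependencies across heterogeneous visual sources. Only the backbone is used at inference to reduce computation, and a Hierarchical Re-ranking Alignment Learning (HRAL) strategy refines features through neighborhood re-ranking and maintains cross-batch consistency to optimize the backbone directly.
Link: https://arxiv.org/abs/2512.19365
Authors: Zhongwei Chen, Hai-Jun Rong, Zhao-Xu Yang, Guoqi Li
Affiliations: Xi’an Jiaotong University; Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: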
Abstract:Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs), which benefit from spike-driven computation, inherently provide low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Meanwhile, the inherent sparsity of spike-driven computation for representation learning scenarios also results in loss of critical information and difficulties in learning long-range dependencies when aligning heterogeneous visual data sources. To address these, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone is adopted to extract coarse-grained features. To mitigate the loss of critical information, the spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs. Compared with advanced ANNs, it also achieves competitive performance. Our code is available at this https URL
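[CV-37] ReasonCD: A Multimodal Reasoning Large Model for Implicit Change-of-Interest Semantic Mining
[Quick read]: This paper addresses the sharp performance drop of current remote sensing change detection methods when the user's change region of interest (CRoI) is described only implicitly. Existing methods depend heavily on explicit textual descriptions for guidance. The key to the solution is ReasonCD, a multimodal reasoning change detection model that uses the reasoning ability of pre-trained large language models to mine the implicit task intent in user input and produce different detection results according to that intent, and that can also explain its reasoning.
Link: https://arxiv.org/abs/2512.19354
Authors: Zhenyang Huang, Xiao Yu, Yi Zhang, Decheng Wang, Hang Ruan
Affiliations: Beijing Institute of Tracking and Telecommunications Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: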
Abstract:Remote sensing image change detection is one of the fundamental tasks in remote sensing intelligent interpretation. Its core objective is to identify changes within change regions of interest (CRoI). Current multimodal large models encode rich human semantic knowledge, which is utilized for guidance in tasks such as remote sensing change detection. However, existing methods that use semantic guidance for detecting users’ CRoI overly rely on explicit textual descriptions of CRoI, leading to the problem of near-complete performance failure when presented with implicit CRoI textual descriptions. This paper proposes a multimodal reasoning change detection model named ReasonCD, capable of mining users’ implicit task intent. The model leverages the powerful reasoning capabilities of pre-trained large language models to mine users’ implicit task intents and subsequently obtains different change detection results based on these intents. Experiments on public datasets demonstrate that the model achieves excellent change detection performance, with an F1 score of 92.1% on the BCDD dataset. Furthermore, to validate its superior reasoning functionality, this paper annotates a subset of reasoning data based on the SECOND dataset. Experimental results show that the model not only excels at basic reasoning-based change detection tasks but can also explain the reasoning process to aid human decision-making.
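[CV-38] GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis
[Quick read]: This paper addresses CT synthesis in multi-modal medical imaging, specifically generating synthetic CT (sCT) from MRI or cone-beam CT (CBCT) for adaptive radiotherapy so that anatomy is represented accurately. The key to the solution is GANeXt, a 3D patch-based generative adversarial network built entirely from ConvNeXt blocks: a U-shaped generator made of stacked 3D ConvNeXt blocks with compact kernels, paired with a conditional PatchGAN discriminator. Training combines several losses (MAE, perceptual loss, segmentation-masked MAE, adversarial loss, and Dice plus cross-entropy loss) to achieve unified CT synthesis across anatomical regions and modalities.
Link: https://arxiv.org/abs/2512.19336
Authors: Siyuan Mei, Yan Xia, Fuxin Fan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: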
Abstract:The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss, as well as a combination of Dice loss and cross-entropy for the multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of 5×10⁻⁴ and 1×10⁻³, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range [-1024, 1000]. Data augmentation involves random zooming within (0.8, 1.3) (for MRI-to-CT only), fixed-size cropping to 32×160×192 for MRI-to-CT and 32×128×128 for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with 0.8 overlap and average folding to reconstruct the full-size sCT, followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.
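The sliding-window inference step can be illustrated in one dimension. The sketch below is an assumption-laden simplification: the paper works on 3D patches, the function names (`window_starts`, `sliding_window_average`) are hypothetical, and "average folding" is interpreted here as averaging all overlapping window predictions at each position.

```python
def window_starts(length, window, overlap):
    """Start indices of windows covering [0, length) with the given fractional overlap."""
    if length <= window:
        return [0]
    stride = max(1, int(round(window * (1.0 - overlap))))
    starts = list(range(0, length - window, stride))
    starts.append(length - window)  # final window is aligned to the end of the signal
    return starts

def sliding_window_average(signal, window, overlap, predict):
    """Apply `predict` to each window and average the overlapping outputs."""
    total = [0.0] * len(signal)
    count = [0] * len(signal)
    for start in window_starts(len(signal), window, overlap):
        patch_out = predict(signal[start:start + window])
        for offset, value in enumerate(patch_out):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]

signal = [float(i) for i in range(10)]
print(window_starts(len(signal), window=5, overlap=0.8))  # [0, 1, 2, 3, 4, 5]
# With an identity "model", averaging the overlapping windows returns the input.
print(sliding_window_average(signal, 5, 0.8, lambda patch: patch) == signal)  # True
```

A high overlap such as 0.8 means each position is predicted by several windows, which tends to smooth seams between patches at the cost of more forward passes.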
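[CV-39] DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis
[Quick read]: This paper addresses the difficulty of extracting and integrating discriminative features when Whole Slide Images (WSIs) are analyzed with Multiple Instance Learning (MIL): the scale and heterogeneity of WSIs make information redundant and dispersed. Existing MIL methods either fail to discard uninformative cues or struggle to consolidate relevant features across patches. The key to the solution is DeltaMIL, which filters and integrates information with the gated delta rule, combining forgetting and memory mechanisms: the delta mechanism updates the memory according to correlation with the current patch, and the gate rapidly forgets irrelevant signals. A complementary local pattern mixing mechanism preserves fine-grained pathological locality. Together these strengthen the extraction of meaningful cues, suppress redundant or noisy information, and improve robustness and discriminative power.
Link: https://arxiv.org/abs/2512.19331
Authors: Yueting Zhu, Yuehao Song, Shuai Zhang, Wenyu Liu, Xinggang Wang
Affiliations: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 8 tables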
Abstract:Whole Slide Images (WSIs) are typically analyzed using multiple instance learning (MIL) methods. However, the scale and heterogeneity of WSIs generate highly redundant and dispersed information, making it difficult to identify and integrate discriminative signals. Existing MIL methods either fail to discard uninformative cues effectively or have limited ability to consolidate relevant features from multiple patches, which restricts their performance on large and heterogeneous WSIs. To address this issue, we propose DeltaMIL, a novel MIL framework that explicitly selects semantically relevant regions and integrates the discriminative information from WSIs. Our method leverages the gated delta rule to efficiently filter and integrate information through a block combining forgetting and memory mechanisms. The delta mechanism dynamically updates the memory by removing old values and inserting new ones according to their correlation with the current patch. The gating mechanism further enables rapid forgetting of irrelevant signals. Additionally, DeltaMIL integrates a complementary local pattern mixing mechanism to retain fine-grained pathological locality. Our design enhances the extraction of meaningful cues and suppresses redundant or noisy information, which improves the model’s robustness and discriminative power. Experiments demonstrate that DeltaMIL achieves state-of-the-art performance. Specifically, for survival prediction, DeltaMIL improves performance by 3.69% using ResNet-50 features and 2.36% using UNI features. For slide-level classification, it increases accuracy by 3.09% with ResNet-50 features and 3.75% with UNI features. These results demonstrate the strong and consistent performance of DeltaMIL across diverse WSI tasks.
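As an illustration of the memory update the abstract describes, here is a small sketch of a gated delta rule in its commonly cited form, S_t = α_t · S_{t-1} · (I − β_t · k_t · k_tᵀ) + β_t · v_t · k_tᵀ. Whether DeltaMIL uses exactly this parameterization is an assumption; the function names and the toy keys and values are hypothetical. The delta term removes the value currently stored under the key before writing the new one, and the gate α scales down everything already in memory.

```python
def read_memory(memory, key):
    """Retrieve the value associated with `key` (matrix-vector product S @ k)."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]

def gated_delta_update(memory, key, value, alpha, beta):
    """One gated delta rule step on a (value_dim x key_dim) memory matrix.

    alpha in [0, 1] is the forget gate; beta in [0, 1] is the write strength.
    """
    old_value = read_memory(memory, key)  # what the memory currently returns for this key
    return [
        [alpha * (m - beta * old * k) + beta * v * k for m, k in zip(row, key)]
        for row, old, v in zip(memory, old_value, value)
    ]

memory = [[0.0, 0.0], [0.0, 0.0]]
memory = gated_delta_update(memory, [1.0, 0.0], [3.0, 4.0], alpha=1.0, beta=1.0)
print(read_memory(memory, [1.0, 0.0]))  # [3.0, 4.0]

# Writing a new value under the same key replaces the old one instead of adding to it.
memory = gated_delta_update(memory, [1.0, 0.0], [5.0, 6.0], alpha=1.0, beta=1.0)
print(read_memory(memory, [1.0, 0.0]))  # [5.0, 6.0]

# A closed gate (alpha = 0) forgets earlier content while writing the new pair.
memory = gated_delta_update(memory, [0.0, 1.0], [1.0, 1.0], alpha=0.0, beta=1.0)
print(read_memory(memory, [1.0, 0.0]))  # [0.0, 0.0]
```

The replace-rather-than-accumulate behavior in the second step is what distinguishes a delta update from a plain additive linear-attention memory.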
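[CV-40] Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome
[Quick read]: This paper addresses automatic detection and classification of strokes in table tennis video, in support of streamlined training workflows, richer broadcast overlays, and fine-grained performance analytics. Public datasets have lacked frame-accurate shot type annotations (forehand, backhand, and their subtypes), player posture labels (body lean and leg stance), and rally outcome tags, which keeps models at event spotting rather than tactical understanding. The key to the solution is an extension of the OpenTTGames dataset that adds these shot type, posture, and rally outcome labels, together with a compact coding scheme and a code-assisted labeling procedure that support reproducible annotation and baselines. The extension is released under the CC BY-NC-SA 4.0 license, addressing the problem that many earlier resources are unreleased or carry restrictive or unclear licenses, and provides a reusable data basis for fine-grained stroke understanding in racket sports.
Link: https://arxiv.org/abs/2512.19327
Authors: Moamal Fadhil Abdul (1), Jonas Bruun Hubrechts (1), Thomas Martini Jørgensen (1), Emil Hovad (1) ((1) Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Petersens Plads, Building 324, 2800 Kgs. Lyngby, Denmark)
Affiliations: DTU Compute, Technical University of Denmark
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Thomas Martini Jørgensen and Emil Hovad contributed equally and share last authorship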
Abstract:Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either “bounce”, “net”, or “empty_event” in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.
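[CV-41] Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations
[Quick read]: This paper addresses accurate reconstruction of patient-specific cardiac anatomy from sparse clinical images, maintaining anatomical consistency even when data are limited (few 2D slices, noisy segmentations). The key to the solution is Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system based on universal ventricular coordinates that gives individual hearts a common anatomical reference frame. The model predicts NIHCs directly from a small number of 2D segmentations and decodes them into dense 3D segmentations and high-resolution meshes at arbitrary resolution, cutting inference time from over 60 s with traditional pipelines to 5-15 s.
Link: https://arxiv.org/abs/2512.19316
Authors: Marica Muffoletto, Uxio Hermida, Charlène Mauger, Avan Suinesiaputra, Yiyang Xu, Richard Burns, Lisa Pankewitz, Andrew D McCulloch, Steffen E Petersen, Daniel Rueckert, Alistair A Young
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 42 pages, 8 figures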
Abstract:Accurate reconstruction of cardiac anatomy from sparse clinical images remains a major challenge in patient-specific modeling. While neural implicit functions have previously been applied to this task, their application to mapping anatomical consistency across subjects has been limited. In this work, we introduce Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system, based on universal ventricular coordinates, that provides a common anatomical reference frame for the human heart. Our method predicts NIHCs directly from a limited number of 2D segmentations (sparse acquisition) and subsequently decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on a large dataset of 5,000 cardiac meshes, the model achieves high reconstruction accuracy on clinical contours, with mean Euclidean surface errors of 2.51 ± 0.33 mm in a diseased cohort (n=4549) and 2.3 ± 0.36 mm in a healthy cohort (n=5576). The NIHC representation enables anatomically coherent reconstruction even under severe slice sparsity and segmentation noise, faithfully recovering complex structures such as the valve planes. Compared with traditional pipelines, inference time is reduced from over 60 s to 5-15 s. These results demonstrate that NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data.
[CV-42] MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
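[Quick read]: This paper addresses the discrepancy between training and testing in diffusion models, known as exposure bias. During training the prediction network receives ground-truth noisy data (an interpolation of noise and real data), whereas at test time it receives generated noisy data, and this mismatch degrades performance. The key to the solution is a new training method, MixFlow, motivated by the Slow Flow phenomenon: at a given sampling timestep, the ground-truth interpolation nearest to the generated noisy data corresponds to a higher-noise timestep (the "slowed timestep"). MixFlow post-trains the prediction network at each training timestep on interpolations taken at these slowed timesteps (the slowed interpolation mixture), which alleviates exposure bias and improves image generation quality, reaching strong FID scores on ImageNet (for example 1.43 without guidance and 1.10 with guidance at 256×256).
Link: https://arxiv.org/abs/2512.19311
Authors: Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu, Jingdong Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: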
Abstract:This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem in diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data, which is an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to the RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
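The slowed-timestep idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the interpolation is taken to be linear with t = 0 as clean data and t = 1 as pure noise (so "higher noise" means larger t), the slowed timestep is drawn uniformly from a small window above t, and the names `interpolate`, `slowed_training_input` and `max_slowdown` are hypothetical.

```python
import random

def interpolate(data, noise, t):
    """Linear interpolation: t = 0 gives data, t = 1 gives noise (assumed convention)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]

def slowed_training_input(data, noise, t, max_slowdown=0.1, rng=random):
    """Return (network input, slowed timestep) for training timestep t.

    Instead of the ground-truth interpolation at t, the input is taken at a
    slowed (higher-noise) timestep s in [t, min(1, t + max_slowdown)].
    The uniform window is a hypothetical choice.
    """
    s = min(1.0, t + rng.uniform(0.0, max_slowdown))
    return interpolate(data, noise, s), s

rng = random.Random(0)
data, noise = [1.0, -1.0], [0.0, 0.5]
x_in, s = slowed_training_input(data, noise, t=0.4, rng=rng)
```

In this sketch the network would still be conditioned on t while seeing an input built at s, which is how the training input is made to resemble the noisier data encountered during sampling.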
[CV-43] Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing
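[Quick read]: This paper addresses the weak geometric grounding and limited cross-task generalization of current Large Vision-Language Models (LVLMs) when performing reasoning segmentation in remote sensing (RS) analysis. Existing methods typically couple semantic reasoning and pixel prediction through end-to-end supervised fine-tuning, which makes it hard for the model to map abstract semantics to spatial locations accurately. The key to the solution is a decoupled framework, Think2Seg-RS, in which a frozen Segment Anything Model (SAM) acts as the segmentation executor and the LVLM is trained as a prompter that controls SAM through structured geometric prompts. Training uses a mask-only reinforcement learning objective, so the LVLM learns to turn semantic reasoning into spatially grounded actions, yielding more precise geometric grounding and zero-shot generalization across benchmarks.
Link: https://arxiv.org/abs/2512.19302
Authors: Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: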
Abstract:Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at this https URL.
[CV-44] RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning AAAI2026
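[Quick read]: This paper addresses insufficient cross-category concept fusion in Text-to-Image (T2I) generation, that is, how to synthesize novel and semantically coherent visual objects from textual concepts belonging to different categories. Existing methods often show conceptual imbalance, superficial combinations, or mere juxtaposition, which lowers output quality. The key to the solution is the Reinforcement Mixing Learning (RMLer) framework, which casts cross-category concept fusion as a reinforcement learning problem: mixed features are the state, mixing strategies are the action, and visual outcomes are the reward. Concretely, an MLP-based policy network predicts dynamic mixing coefficients for cross-category text embeddings, visual rewards are defined from semantic similarity and compositional balance, and the policy is trained with proximal policy optimization (PPO). At inference, the rewards are used to select the best fused results, giving high-quality, high-fidelity novel objects.
Link: https://arxiv.org/abs/2512.19300
Authors: Jun Li, Zikun Chen, Haibo Chen, Shuo Chen, Jian Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026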
Abstract:Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs, manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer’s superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.
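As a rough illustration of coefficient-based blending and reward-based selection, here is a small sketch. It is not RMLer itself: the paper computes rewards on generated images and trains the coefficients with PPO, whereas this toy scores the blended embedding directly and simply picks the best of a few fixed candidates. The functions `blend` and `reward` and the particular reward formula are hypothetical.

```python
import math

def blend(emb_a, emb_b, coeffs):
    """Mix two text embeddings with per-dimension coefficients in [0, 1]."""
    return [c * a + (1.0 - c) * b for a, b, c in zip(emb_a, emb_b, coeffs)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def reward(fused, emb_a, emb_b):
    """Hypothetical reward: similarity to both concepts minus an imbalance penalty."""
    sim_a, sim_b = cosine(fused, emb_a), cosine(fused, emb_b)
    return 0.5 * (sim_a + sim_b) - abs(sim_a - sim_b)

emb_a, emb_b = [1.0, 0.0], [0.0, 1.0]
candidates = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]  # candidate mixing coefficients
# Selection step: keep the candidate whose fused embedding scores highest.
best = max(candidates, key=lambda c: reward(blend(emb_a, emb_b, c), emb_a, emb_b))
```

With these toy inputs, the two extreme candidates reproduce a single concept and are penalised for imbalance, so the even mix is selected.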
[CV-45] Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context
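[Quick read]: This paper addresses accurate estimation of the wearer's full-body motion from egocentric (first-person) video, a key challenge for understanding human behavior that is hard because most body parts are not visible from the egocentric view. Existing methods rely mainly on head trajectories, which are ambiguous, or assume continuously tracked hands, which is unrealistic for lightweight devices. The key to the solution is HaMoS, the first hand-aware, sequence-level diffusion framework, which conditions directly on the head trajectory and on hand cues that are only intermittently visible because of field-of-view limits and occlusions. A new augmentation method models these real-world conditions to compensate for the lack of datasets pairing diverse camera views with human motion, and local attention makes efficient use of sequence-level context such as body shape and field of view, improving both accuracy and temporal smoothness of the reconstructed motion.
Link: https://arxiv.org/abs/2512.19283
Authors: Kyungwon Cho, Hanbyul Joo
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL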
Abstract:Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer’s full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.
[CV-46] Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation
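[Quick read]: This paper examines a core problem in current Generative AI (GenAI) human animation: the subtle spatio-temporal details of behavioral biometrics such as gait are hard to preserve, which undermines person identification. The approach is to test GenAI models systematically on two evaluation tasks: restoring gait patterns from reference videos under varying complexity, and transferring gait patterns to different visual identities. The results show that although the visual quality of the generated output is high, biometric fidelity in identification tasks is low, indicating that existing GenAI models cannot effectively disentangle identity from motion. In the identity transfer task in particular, identification collapses once texture is decoupled from motion, which shows that current models rely on visual attributes rather than temporal dynamics and exposes a fundamental flaw in appearance-based gait recognition.
Link: https://arxiv.org/abs/2512.19275
Authors: Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: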
Abstract:Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
[CV-47] 3SGen: Unified Subject Style and Structure-Driven Image Generation with Adaptive Task-specific Memory
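[Quick read]: This paper addresses the feature entanglement and limited task transferability that arise when subject-, style-, and structure-driven conditioning are handled in isolation by current image generation methods. The key to the solution is 3SGen, a task-aware unified framework that uses a multimodal large language model (MLLM) with learnable semantic queries to align text and image semantics, together with a variational autoencoder (VAE) branch that preserves fine-grained visual detail. The core innovation is the Adaptive Task-specific Memory (ATM) module, which uses a lightweight gating mechanism and scalable memory items to dynamically disentangle, store, and retrieve task-specific priors (subject identity, style textures, spatial layouts). This mitigates inter-task interference, supports compositional inputs, and improves cross-task consistency and controllability.
Link: https://arxiv.org/abs/2512.19271
Authors: Xinyang Song, Libin Wang, Weining Wang, Zhiwei Li, Jianxin Sun, Dandan Zheng, Jingdong Chen, Qi Li, Zhenan Sun
Affiliations: School of Artificial Intelligence, UCAS; CASIA; AntGroup
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: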
Abstract:Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.
[CV-48] Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study
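[Quick read]: This paper addresses the lack of Machine Unlearning (MU) mechanisms in Quantum Machine Learning (QML), that is, how to remove specific training data from hybrid quantum-classical neural networks while preserving model performance and privacy compliance. The key contributions are the first systematic adaptation of a range of classical MU methods (gradient-based, distillation-based, regularization-based, and certified) to the quantum setting, plus two new strategies designed for hybrid models. Experiments show that circuit depth, entanglement structure, and task complexity strongly affect unlearning outcomes, and that methods such as EU-k, LCA, and Certified Unlearning give the best balance between forgetting strength, model utility, and agreement with the retrain baseline. The results provide an empirical basis for designing quantum-aware unlearning algorithms and theoretical guarantees.
Link: https://arxiv.org/abs/2512.19253
Authors: Carla Crivoi, Radu Tudor Ionescu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: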
Abstract:We present the first comprehensive empirical study of machine unlearning (MU) in hybrid quantum-classical neural networks. While MU has been extensively explored in classical deep learning, its behavior within variational quantum circuits (VQCs) and quantum-augmented architectures remains largely unexplored. First, we adapt a broad suite of unlearning methods to quantum settings, including gradient-based, distillation-based, regularization-based and certified techniques. Second, we introduce two new unlearning strategies tailored to hybrid models. Experiments across Iris, MNIST, and Fashion-MNIST, under both subset removal and full-class deletion, reveal that quantum models can support effective unlearning, but outcomes depend strongly on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, whereas deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. We find that certain methods, e.g. EU-k, LCA, and Certified Unlearning, consistently provide the best balance across metrics. These findings establish baseline empirical insights into quantum machine unlearning and highlight the need for quantum-aware algorithms and theoretical guarantees, as quantum machine learning systems continue to expand in scale and capability. We publicly release our code at: this https URL.
[CV-49] VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
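[Quick read]: This paper addresses the poor performance of current generative models on the long, multi-goal prompts issued by professional designers, where constraints on global layout, local object placement, typography, and logo fidelity must all be met at once. To measure the gap, the authors build Long Goal Bench (LGBench), a benchmark of 2,000 tasks (1,000 text-to-image and 1,000 image-to-image) whose instructions contain 18 to 22 tightly coupled goals on average. Even state-of-the-art models satisfy fewer than 72% of the goals and often miss localized edits, which highlights the brittleness of current pipelines. The key to the solution is VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) decides dynamically between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. Fine-tuning the planner with Group Relative Policy Optimization further shortens edit trajectories (3.1 versus 4.2 steps) and improves alignment, giving new state-of-the-art results on GenEval and ImgEdit and consistent quality gains on typography, multi-object scenes, and pose editing.
Link: https://arxiv.org/abs/2512.19243
Authors: Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia
Affiliations: The Hong Kong University of Science and Technology; The Chinese University of Hong Kong; Huawei Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: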
Abstract:Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models’ performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.
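The verify-and-rollback loop described in step (iii), together with the goal-level logging in step (iv), can be sketched as follows. This is a toy under stated assumptions, not VisionDirector's code: an "image" is modelled as the set of goals it currently satisfies, `apply_edit` and `score` are hypothetical stand-ins for the image editor and the vision-language verifier, and micro-grid sampling is not modelled.

```python
def refine(image, goals, apply_edit, score):
    """Apply one edit per goal; keep it only if the verifier score improves."""
    log = []
    for goal in goals:
        before = score(image, goals)
        candidate = apply_edit(image, goal)
        after = score(candidate, goals)
        accepted = after > before
        if accepted:
            image = candidate  # accept the edit
        # Otherwise roll back by keeping the previous image.
        log.append((goal, after - before, accepted))  # goal-level reward
    return image, log

def apply_edit(image, goal):
    """Toy editor: the 'logo' edit is simulated to break typography."""
    if goal == "logo":
        return (image | {goal}) - {"typography"}
    return image | {goal}

def score(image, goals):
    """Toy verifier: number of goals currently satisfied."""
    return len(image & set(goals))

final, log = refine(frozenset(), ["layout", "typography", "logo"], apply_edit, score)
```

Here the logo edit satisfies one goal while breaking another, so the score does not improve and the edit is rolled back; a real system would presumably retry with a different edit rather than give up on the goal.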
[CV-50] From Pixels to Predicates Structuring urban perception with scene graphs
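[Quick read]: This paper addresses the fact that current urban perception research relies heavily on pixel features or object co-occurrence statistics and neglects explicit relational modeling. The key to the solution is a three-stage pipeline: first, an open-set panoptic scene graph model (OpenPSG) extracts object-predicate-object triplets from each image to build a structured semantic representation; second, a heterogeneous graph autoencoder (GraphMAE) learns compact scene-level embeddings; third, a neural network predicts scores for six perceptual indicators from these embeddings. The method improves perception prediction accuracy by 26% on average, generalizes across cities, and reveals how specific relational patterns (such as graffiti on a wall or a car parked on a sidewalk) lead to low perception scores, making urban perception modeling interpretable and context-sensitive.
Link: https://arxiv.org/abs/2512.19221
Authors: Yunlong Liu, Shuyang Li, Pengyuan Liu, Yu Zhang, Rudi Stouffs
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, CAADRIA2026 presentation forthcoming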
Abstract:Perception research is increasingly modelled using streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three-stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed using an open-set Panoptic Scene Graph model (OpenPSG) to extract object-predicate-object triplets. In the second stage, compact scene-level embeddings are learned through a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that our approach (i) improves perception prediction accuracy by an average of 26% over baseline models, and (ii) maintains strong generalization performance in cross-city prediction tasks. Additionally, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as "graffiti on wall" and "car parked on sidewalk". Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.
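A tiny example shows why triplets carry information that object co-occurrence misses. The two scenes and the helper `object_bag` are hypothetical; no OpenPSG output is reproduced here.

```python
def object_bag(triplets):
    """Object co-occurrence view: the sorted set of object names, relations dropped."""
    return sorted({name for s, _, o in triplets for name in (s, o)})

# Two hypothetical street scenes containing the same objects.
img_a = [("car", "parked on", "sidewalk")]
img_b = [("car", "beside", "sidewalk")]

same_objects = object_bag(img_a) == object_bag(img_b)  # identical object sets
same_relations = set(img_a) == set(img_b)              # different predicates
```

A model fed only the object bag cannot tell these scenes apart, while a triplet-based representation can, which is the kind of relational signal the pipeline is built to exploit.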
[CV-51] Towards Minimal Fine-Tuning of VLMs
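[Quick read]: This paper addresses the high training compute and poor parameter utilization of Parameter Efficient Fine-Tuning (PEFT) for Vision-Language Models (VLMs). The key to the solution is Image-LoRA, which applies Low-Rank Adaptation (LoRA) only to the value path of attention layers and only within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the fraction of visual tokens. In addition, only a subset of attention heads is adapted, selected by head influence scores estimated with a rank-1 Image-LoRA, and per-layer updates are stabilized with selection-size normalization. The result is accuracy close to standard LoRA with fewer trainable parameters and no loss of the model's pure-text reasoning ability.
Link: https://arxiv.org/abs/2512.19219
Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
Affiliations: University of Michigan; LG AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: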
Abstract:We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
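To make "value path only, visual-token span only" concrete, here is a minimal pure-Python sketch. It covers just that one idea under simplifying assumptions: a single head, no LoRA scaling factor, no head selection or selection-size normalization, and hypothetical names (`value_projection`, `lora_a`, `lora_b`).

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def value_projection(tokens, is_visual, w_v, lora_a, lora_b):
    """Compute value vectors, adding the low-rank update B(Ax) only for visual tokens."""
    values = []
    for x, visual in zip(tokens, is_visual):
        v = matvec(w_v, x)  # frozen value projection
        if visual:
            delta = matvec(lora_b, matvec(lora_a, x))  # rank-r adapter update
            v = [vi + di for vi, di in zip(v, delta)]
        values.append(v)
    return values

w_v = [[1.0, 0.0], [0.0, 1.0]]     # frozen 2x2 value weight
lora_a = [[1.0, 1.0]]              # rank-1 down-projection (1x2)
lora_b = [[0.5], [0.0]]            # rank-1 up-projection (2x1)
tokens = [[1.0, 2.0], [1.0, 2.0]]  # same vector used as a text and a visual token
out = value_projection(tokens, [False, True], w_v, lora_a, lora_b)
```

The text token passes through the frozen projection unchanged, so the adapter's cost grows only with the number of visual tokens, in line with the FLOPs argument in the abstract.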
[CV-52] HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry
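[Quick read]: This paper addresses inconsistent morphological correspondence in cross-subject and longitudinal analysis of human hippocampal substructure, caused by large inter-individual anatomical variability and complex folding patterns. Existing methods mostly rely on subject-specific modelling and lack a stable intrinsic coordinate system, so they struggle to establish reliable correspondence between and within individuals. The key to the solution is HippMetric, a framework based on skeletal representations (s-reps) that builds on the Axis-Referenced Morphometric Model (ARMM) and introduces a deformable skeletal coordinate system aligned with hippocampal anatomy and function, giving a biologically grounded reference for correspondence. Its two core components are a skeletal coordinate system that respects the hippocampus's conserved longitudinal lamellar architecture (lamellae stacked perpendicular to the long axis), enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, which enforce boundary adherence, orthogonality, and non-intersection to yield mathematically valid skeletal geometry.
Link: https://arxiv.org/abs/2512.19214
Authors: Na Gao, Chenfei Ye, Yanwu Yang, Anqi Li, Zhengbo He, Li Liang, Zhiyuan Liu, Xingyu Hao, Ting Ma, Tengfei Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 8 figures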
Abstract:Accurate characterization of hippocampal substructure is crucial for detecting subtle structural changes and identifying early neurodegenerative biomarkers. However, the high inter-subject variability and complex folding pattern of the human hippocampus hinder consistent cross-subject and longitudinal analysis. Most existing approaches rely on subject-specific modelling and lack a stable intrinsic coordinate system to accommodate anatomical variability, which limits their ability to establish reliable inter- and intra-individual correspondence. To address this, we propose HippMetric, a skeletal representation (s-rep)-based framework for hippocampal substructural morphometry and point-wise correspondence across individuals and scans. HippMetric builds on the Axis-Referenced Morphometric Model (ARMM) and employs a deformable skeletal coordinate system aligned with hippocampal anatomy and function, providing a biologically grounded reference for correspondence. Our framework comprises two core modules: a skeletal-based coordinate system that respects the hippocampus’ conserved longitudinal lamellar architecture, in which functional units (lamellae) are stacked perpendicular to the long axis, enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality and non-intersection to produce mathematically valid skeletal geometry. Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability compared to existing shape models.
[CV-53] InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training
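[Quick read]: This paper addresses the data privacy risk and limited practicality of Continual Self-Supervised Learning (CSSL) in medical imaging when it depends on replaying historical data. Existing methods usually store and replay real images from earlier tasks to prevent catastrophic forgetting, which is hard to do in multi-institution settings. The key to the solution is the InvCoSS framework, whose core idea is to invert the pre-trained self-supervised model to generate synthetic images that approximate the original data distribution, and then optimize jointly on these synthetic images and the new task's data, mitigating forgetting without access to any real historical data. In addition, InvUNet, built on a multi-scale fusion architecture, improves the fidelity of the synthetic images, and a repulsive representation-learning mechanism without class guidance increases their diversity, together supporting strong performance across multiple downstream tasks.
Link: https://arxiv.org/abs/2512.19213
Authors: Zihao Luo, Shaohao Rui, Zhenyu Tang, Guotai Wang, Xiaosong Wang
Affiliations: University of Electronic Science and Technology of China; Shanghai Innovation Institute; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Brain-Computer Interface & Brain-Inspired Intelligence Key Laboratory of Sichuan Province
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 figures, 5 tables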
Abstract:Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.
[CV-54] PEDESTRIAN: An Egocentric Vision Dataset for Obstacle Detection on Pavements
[Quick read]: This paper addresses the threat that sidewalk obstacles pose to pedestrian safety in urban environments, focusing on how to detect such obstacles automatically and in real time. The key contribution is PEDESTRIAN, an egocentric dataset of 340 videos covering 29 types of obstacle commonly found on urban sidewalks, all captured with mobile phone cameras from the pedestrian's first-person point of view. The paper also reports the results of training several state-of-the-art deep learning algorithms on the dataset, providing a reusable benchmark and training basis for obstacle detection and recognition and supporting the development of pedestrian assistance systems.
Link: https://arxiv.org/abs/2512.19190
Authors: Marios Thoma (1 and 2), Zenonas Theodosiou (1 and 3), Harris Partaourides (4), Vassilis Vassiliades (1), Loizos Michael (2 and 1), Andreas Lanitis (1 and 5) ((1) CYENS Centre of Excellence, Nicosia, Cyprus, (2) Open University Cyprus, Nicosia, Cyprus, (3) Department of Communication and Internet Studies, Cyprus University of Technology, Limassol, Cyprus, (4) AI Cyprus Ethical Novelties Ltd, Limassol, Cyprus, (5) Department of Multimedia and Graphic Arts, Cyprus University of Technology, Limassol, Cyprus)
Affiliations: CYENS Centre of Excellence; Open University Cyprus; Cyprus University of Technology; AI Cyprus Ethical Novelties Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 7 figures, 9 tables, Dataset: this https URL , Code: this https URL
Abstract:Walking has always been a primary mode of transportation and is recognized as an essential activity for maintaining good health. Despite the need for safe walking conditions in urban environments, sidewalks are frequently obstructed by various obstacles that hinder free pedestrian movement. Any object obstructing a pedestrian’s path can pose a safety hazard. The advancement of pervasive computing and egocentric vision techniques offers the potential to design systems that can automatically detect such obstacles in real time, thereby enhancing pedestrian safety. The development of effective and efficient identification algorithms relies on the availability of comprehensive and well-balanced datasets of egocentric data. In this work, we introduce the PEDESTRIAN dataset, comprising egocentric data for 29 different obstacles commonly found on urban sidewalks. A total of 340 videos were collected using mobile phone cameras, capturing a pedestrian’s point of view. Additionally, we present the results of a series of experiments that involved training several state-of-the-art deep learning algorithms using the proposed dataset, which can be used as a benchmark for obstacle detection and recognition tasks. The dataset can be used for training pavement obstacle detectors to enhance the safety of pedestrians in urban areas.
[CV-55] OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions
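[Quick read]: This paper addresses task isolation in human motion generation: existing methods are confined to single-task settings and cannot support free-form, multi-objective generation. The core challenge is the lack of a unified framework for handling diverse interleaved text-motion instructions. The key to the solution is OmniMoGen, a unified framework built on a concise RVQ-VAE (Residual Vector Quantized Variational Autoencoder) and a Transformer architecture that supports end-to-end instruction-driven motion generation. The authors also construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and the AnyContext benchmark. Experiments show state-of-the-art performance on text-to-motion, motion editing, and AnyContext, along with emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation.
Link: https://arxiv.org/abs/2512.19159
Authors: Wendong Bu, Kaihang Pan, Yuze Lin, Jiacheng Li, Kai Shen, Wenqiao Zhang, Juncheng Li, Jun Xiao, Siliang Tang
Affiliations: Zhejiang University; HiThink Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: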
Abstract:Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: this https URL.
[CV-56] AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction
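[Quick read]: This paper addresses a safety flaw in online High-Definition (HD) map construction: conventional temporal fusion is "spatially backward-looking". Existing methods mainly improve reconstruction of areas the vehicle has already traversed and do little for the unseen road ahead, yet analysis of downstream planning shows that perception errors in the forward region directly cause hazardous driving behavior, a marked safety asymmetry. The key to the solution is the AMap framework and its "distill-from-future" paradigm: a teacher model with access to future temporal context guides a lightweight student model restricted to the current frame, implicitly compressing prospective knowledge into the student and giving it look-ahead capability at no extra inference cost. Technically, the method uses a multi-level BEV distillation strategy with spatial masking and an Asymmetric Query Adaptation module to transfer future-aware representations to the student's static queries.
Link: https://arxiv.org/abs/2512.19150
Authors: Ruikai Li, Xinrun Li, Mengwei Xie, Hao Shan, Shoumeng Qiu, Xinyuan Chang, Yizhe Fan, Feng Xiong, Han Jiang, Yilong Ren, Haiyang Yu, Mu Xu, Yang Long, Varun Ojha, Zhiyong Cui
Affiliations: State Key Lab of Intelligent Transportation System, China; Amap, Alibaba Group, China; Newcastle University, England; Durham University, England
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 11 figures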
Abstract:Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently "spatially backward-looking." These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a "distill-from-future" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with "look-ahead" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student’s static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of single-frame inference.
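The combination of teacher-student distillation with a spatial mask can be sketched as a masked mean squared error over BEV feature cells. This is an illustrative guess at the general shape of such a loss, not AMap's actual objective: the mask layout (forward region only), the single feature level, and the name `masked_distill_loss` are all assumptions.

```python
def masked_distill_loss(student, teacher, mask):
    """Mean squared error over BEV cells where mask is 1 (e.g. the region ahead)."""
    total, count = 0.0, 0
    for s_row, t_row, m_row in zip(student, teacher, mask):
        for s, t, m in zip(s_row, t_row, m_row):
            if m:
                total += (s - t) ** 2
                count += 1
    return total / count if count else 0.0

teacher = [[1.0, 2.0], [3.0, 4.0]]  # features from the future-aware teacher
student = [[1.0, 0.0], [0.0, 4.0]]  # features from the current-frame student
forward_mask = [[1, 1], [0, 0]]     # assumed: first row is the area ahead
loss = masked_distill_loss(student, teacher, forward_mask)
```

Only the masked cells contribute, so the student's error in the rear cell is ignored; at inference the teacher is dropped entirely, which is why the look-ahead ability adds no run-time cost.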
[CV-57] WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving AAAI2026
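【Quick read】: This paper addresses suboptimal planning optimization in end-to-end autonomous driving methods built on latent world models, where perception and planning are entangled. The core difficulty is that reconstruction-oriented representation learning conflates perception and decision-making objectives, which hurts the safety and effectiveness of the driving policy. The key to the solution is WorldRFT, a framework that guides representation optimization through hierarchical planning task decomposition, adds a local-aware interactive refinement mechanism, and applies reinforcement learning fine-tuning (RFT) to improve safety-critical policy performance. It further uses Group Relative Policy Optimization (GRPO) with trajectory Gaussianization and collision-aware rewards for policy optimization, improving safety and overall driving performance.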
链接: https://arxiv.org/abs/2512.19133
作者: Pengxuan Yang,Ben Lu,Zhongpu Xia,Chao Han,Yinfeng Gao,Teng Zhang,Kun Zhan,XianPeng Lang,Yupeng Zheng,Qichao Zhang
机构: 1. Tsinghua University (清华大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. Beijing Key Laboratory of Intelligent Robotics (北京市智能机器人重点实验室); 4. Alibaba Group (阿里巴巴集团)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026, first version
Abstract:Latent World Models enhance scene representation through temporal self-supervised learning, presenting a perception annotation-free paradigm for end-to-end autonomous driving. However, the reconstruction-oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (from 0.30% to 0.05%). On NavSim, using camera-only sensor input, it attains competitive performance with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
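The group-relative step that gives GRPO its name can be sketched as below. This is a minimal illustration of the commonly used formulation (normalizing each sampled trajectory's reward by the mean and standard deviation of its group), not the WorldRFT implementation; the function name and the reward values are hypothetical, and the paper's trajectory Gaussianization and collision-aware reward are not reproduced here.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of trajectories sampled for the same scene.

    Trajectories that score above the group mean get a positive advantage,
    those below get a negative one. eps avoids division by zero when all
    rewards in the group are identical.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four candidate trajectories (e.g. a collision
# could be penalized with a negative reward).
rewards = [1.0, 0.0, 0.5, -1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the advantage is computed relative to the group, no separate value network is needed to provide a baseline.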
[CV-58] Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?
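【Quick read】: This paper investigates why multimodal large language models (MLLMs) perform poorly on zero-shot multimodal retrieval. The analysis shows that the MLLM representation space is dominated by textual semantics, with visual information making up only a small portion; the models' strong focus on bridging image and text modalities homogenizes embeddings and weakens the discriminative power that retrieval requires. More importantly, the feature components that contribute most to similarity computation turn out to be distractors that degrade retrieval performance. The key to the approach is using sparse autoencoders (SAEs) to decompose MLLM output representations into interpretable concepts, which exposes these internal mechanisms and points to rebalancing the representation and removing distracting features as possible directions for improving multimodal retrieval.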
链接: https://arxiv.org/abs/2512.19115
作者: Hengyi Feng,Zeang Sheng,Meiyi Qiang,Wentao Zhang
机构: University of Electronic Science and Technology of China (中国电子科技大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.
[CV-59] Trifocal Tensor and Relative Pose Estimation with Known Vertical Direction
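【Quick read】: This paper addresses relative pose estimation among views with known vertical directions. The core challenge is achieving accurate, robust pose estimation from fewer point correspondences so that outliers can be handled more efficiently in visual odometry. The key to the solution is two new solvers: a linear closed-form solution that needs only four point correspondences across three views, and a minimal solution that needs only three point correspondences and relies on a recent Gröbner basis solver. Because both require few points, they can be used efficiently inside a RANSAC framework for outlier removal and pose estimation.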
链接: https://arxiv.org/abs/2512.19110
作者: Tao Li,Zhenbao Yu,Banglei Guan,Jianli Han,Weimin Lv,Friedrich Fraundorfer
机构: Naval Aeronautical University (海军航空大学); National University of Defense Technology (国防科技大学); Wuhan University (武汉大学); Graz University of Technology (格拉茨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This work presents two novel solvers for estimating the relative poses among views with known vertical directions. The vertical directions of camera views can be easily obtained using inertial measurement units (IMUs), which are widely used in autonomous vehicles, mobile phones, and unmanned aerial vehicles (UAVs). Given the known vertical directions, our algorithms only need to solve for two rotation angles and two translation vectors. We describe a linear closed-form solution that requires only four point correspondences in three views. We also propose a minimal solution with three point correspondences using the latest Gröbner basis solver. Since the proposed methods require fewer point correspondences, they can be efficiently applied within the RANSAC framework for outlier removal and pose estimation in visual odometry. The proposed methods have been tested on both synthetic data and real-world scenes from KITTI. The experimental results show that the estimated poses are more accurate than those of alternative methods.
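Why a smaller sample size matters inside RANSAC can be illustrated with the standard iteration-count formula N = log(1 - p) / log(1 - w^s), where s is the number of correspondences per sample, w the inlier ratio, and p the desired probability of drawing at least one all-inlier sample. The sample sizes 3 and 4 come from the abstract; the 50% inlier ratio and 0.99 confidence are assumed values chosen only for illustration, and the function name is hypothetical.

```python
import math

def ransac_iterations(sample_size, inlier_ratio, confidence=0.99):
    """Iterations needed so that, with probability `confidence`,
    at least one random sample contains only inliers."""
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers <= 0.0:
        raise ValueError("inlier_ratio must be greater than 0")
    if p_all_inliers >= 1.0:
        return 1  # every sample is outlier-free
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))

# Assumed scenario: half of the point correspondences are outliers.
for s in (3, 4):
    print(f"{s}-point solver: {ransac_iterations(s, inlier_ratio=0.5)} iterations")
```

Under these assumptions the three-point solver needs roughly half as many iterations as the four-point one, and the gap widens as the inlier ratio drops.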
[CV-60] GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting AAAI
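【Quick read】: This paper addresses the long training time and high memory cost of implicit neural representations (INRs) for image representation and compression, as well as the efficiency bottleneck of 2D Gaussian Splatting (GS) methods, which need a large number of Gaussian primitives to maintain visual fidelity. The key to the solution has three parts: first, a distortion-driven densification mechanism that progressively allocates Gaussian primitives according to signal intensity; second, context-aware Gaussian filters that help densification adapt primitives to varying image content; third, attribute-separated learnable scalar quantizers combined with quantization-aware training for efficient compression of primitive attributes. The resulting method, GaussianImage++, outperforms GaussianImage and the INR-based COIN in representation and compression performance while keeping real-time decoding and low memory usage.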
链接: https://arxiv.org/abs/2512.19108
作者: Tiantian Li,Xinjie Zhang,Xingtong Ge,Tongda Xu,Dailan He,Jun Zhang,Yan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI. URL: this https URL
Abstract:Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (e.g., GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and the INR-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.
[CV-61] Mamba-Based Modality Disentanglement Network for Multi-Contrast MRI Reconstruction
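【Quick read】: This paper addresses two key problems in accelerated magnetic resonance imaging (MRI): existing methods fail to exploit prior information in K-space, so zero-filled inputs still contain aliasing artifacts; and multi-contrast fusion strategies introduce irrelevant information that contaminates the target reconstruction. The key to the solution is MambaMDN, a dual-domain framework that first uses fully sampled reference K-space data to complete the undersampled target data, producing structurally aligned but modality-mixed inputs; then applies a Mamba-based modality disentanglement network to extract and remove reference-specific features; and finally uses an iterative refinement mechanism that progressively improves reconstruction accuracy through repeated feature purification.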
链接: https://arxiv.org/abs/2512.19095
作者: Weiyi Lyu,Xinming Fang,Jun Wang,Jun Shi,Guixu Zhang,Juncheng Li
机构: Shanghai University (上海大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 11 figures, 6 tables
Abstract:Magnetic resonance imaging (MRI) is a cornerstone of modern clinical diagnosis, offering unparalleled soft-tissue contrast without ionizing radiation. However, prolonged scan times remain a major barrier to patient throughput and comfort. Existing accelerated MRI techniques often struggle with two key challenges: (1) failure to effectively utilize inherent K-space prior information, leading to persistent aliasing artifacts from zero-filled inputs; and (2) contamination of target reconstruction quality by irrelevant information when employing multi-contrast fusion strategies. To overcome these challenges, we present MambaMDN, a dual-domain framework for multi-contrast MRI reconstruction. Our approach first employs fully-sampled reference K-space data to complete the undersampled target data, generating structurally aligned but modality-mixed inputs. Subsequently, we develop a Mamba-based modality disentanglement network to extract and remove reference-specific features from the mixed representation. Furthermore, we introduce an iterative refinement mechanism to progressively enhance reconstruction accuracy through repeated feature purification. Extensive experiments demonstrate that MambaMDN can significantly outperform existing multi-contrast reconstruction methods.
[CV-62] Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges MICCAI2025
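【Quick read】: This paper addresses three limitations of medical AI leaderboards: score gaps are rarely tested for statistical significance, so rank stability is unknown; a single averaged metric is applied to every organ, hiding clinically important boundary errors; and performance across intersecting demographic groups is seldom reported, masking fairness and equity gaps. The key to the solution is RankInsight, an open-source toolkit that (1) computes pairwise significance maps that quantify statistical confidence in performance differences between models; (2) recomputes leaderboards with organ-appropriate metrics, showing that the order of the top four models reverses when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, finding that more than half of the MONAI-based entries show the largest gender-race discrepancy. The aim is rankings that are statistically sound, clinically meaningful, and demographically fair.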
链接: https://arxiv.org/abs/2512.19091
作者: Ariel Lubonja,Pedro R. A. S. Bassi,Wenxuan Li,Hualin Qiao,Randal Burns,Alan L. Yuille,Zongwei Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: MICCAI 2025 Workshop on Machine Learning in Medical Imaging
Abstract:Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
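One cell of a pairwise significance map can be sketched as a paired test on per-case scores of two submissions. The abstract does not say which test RankInsight uses, so the sign-flip permutation test below is an assumed stand-in; the function name and the Dice values are hypothetical.

```python
import random

def paired_sign_flip_pvalue(scores_a, scores_b, n_perm=5000, seed=0):
    """Two-sided paired permutation test on per-case score differences.

    Under the null hypothesis that both models perform equally well, the sign
    of each per-case difference is arbitrary, so signs are flipped at random
    and the resulting mean differences are compared with the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / n >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical per-case Dice scores of two submissions on the same test cases.
dice_a = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94]
dice_b = [0.89, 0.87, 0.90, 0.88, 0.86, 0.90, 0.88, 0.91]
print(paired_sign_flip_pvalue(dice_a, dice_b))
```

Repeating the test for every pair of submissions, with a multiple-comparison correction, would fill in the full map.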
[CV-63] Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation AAAI2026
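【Quick read】: This paper addresses the heavy computational cost and slow inference of open-vocabulary 3D instance segmentation methods that depend on SAM (Segment Anything Model) and CLIP, together with the poor generalization of faster alternatives to categories that are rare in the 3D training data. The key to the solution is using RGB images and a 2D open-vocabulary detector as guidance to extract 3D instance masks for novel categories from the point cloud. Without SAM or CLIP, the method retains the 2D detector's ability to recognize unseen categories while enabling fast, accurate retrieval of rare instances.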
链接: https://arxiv.org/abs/2512.19088
作者: Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026 Workshop on New Frontiers in Information Retrieval
Abstract:Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector’s ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at this https URL.
[CV-64] 6DAttack: Backdoor Attacks in the 6DoF Pose Estimation
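【Quick read】: This paper studies the security of six-degree-of-freedom (6DoF) object pose estimation models against backdoor attacks. Unlike backdoor attacks on conventional 2D vision tasks, a 6DoF attack must precisely control continuous pose parameters such as translation and rotation, so existing 2D methods do not apply directly. The key to the solution is 6DAttack, a framework that uses 3D object triggers to induce controlled erroneous poses while leaving performance on clean samples unaffected. Experiments on PVNet, DenseFusion, and PoseDiffusion show attack success rates (ASR) and clean accuracy of up to 100%, and a representative defense remains ineffective, revealing a serious and underexplored security threat to 6DoF pose estimation.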
链接: https://arxiv.org/abs/2512.19058
作者: Jihui Guo,Zongmin Zhang,Zhen Sun,Yuhao Yang,Jinlin Wu,Fu Zhang,Xinlei He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning advances have enabled accurate six-degree-of-freedom (6DoF) object pose estimation, widely used in robotics, AR/VR, and autonomous systems. However, backdoor attacks pose significant security risks. While most research focuses on 2D vision, 6DoF pose estimation remains largely unexplored. Unlike traditional backdoors that only change classes, 6DoF attacks must control continuous parameters like translation and rotation, rendering 2D methods inapplicable. We propose 6DAttack, a framework using 3D object triggers to induce controlled erroneous poses while maintaining normal behavior. Evaluations on PVNet, DenseFusion, and PoseDiffusion across LINEMOD, YCB-Video, and CO3D show high attack success rates (ASRs) without compromising clean performance. Backdoored models achieve up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. Furthermore, a representative defense remains ineffective. Our findings reveal a serious, underexplored threat to 6DoF pose estimation.
[CV-65] Decoupled Generative Modeling for Human-Object Interaction Synthesis
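【Quick read】: This paper addresses two problems in existing human-object interaction (HOI) synthesis methods: reliance on manually specified intermediate waypoints, which reduces flexibility and invites errors; and placing all optimization objectives on a single network, which raises complexity and leads to unsynchronized human and object motion or physical penetration. The core of the solution is DecHOI (Decoupled Generative Modeling for Human-Object Interaction Synthesis), which separates path planning from action synthesis: a trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator then conditions on these paths to synthesize detailed interaction motions. In addition, adversarial training with a discriminator focused on distal joint dynamics improves contact realism, and the framework supports responsive long-sequence planning in dynamic scenes while preserving plan consistency.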
链接: https://arxiv.org/abs/2512.19049
作者: Hwanhee Jung,Seunggwan Lee,Jeongyoon Yoon,SeungHyeon Kim,Giljoo Nam,Qixing Huang,Sangpil Kim
机构: Korea University (韩国大学); Meta; The University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.
[CV-66] WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
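【Quick read】: This paper addresses the sharp drop in image watermark robustness after image-to-video (I2V) conversion, where per-frame watermark detection weakens. Existing methods can withstand diffusion-based image editing, but watermark information is lost or hard to recover after I2V generation. The key to the solution is the WaTeRFlow framework, whose components are: (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions during training via instruction-driven edits and a fast video diffusion proxy; (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions; and (iii) a semantic preservation loss that keeps the conditioning signal intact. Experiments across representative I2V models show accurate watermark recovery with higher first-frame and per-frame bit accuracy.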
链接: https://arxiv.org/abs/2512.19048
作者: Utae Jeong,Sumin In,Hyunju Ryu,Jaewan Choi,Feng Yang,Jongheon Jeong,Seungryong Kim,Sangpil Kim
机构: Korea University (韩国大学); Google DeepMind (谷歌深度思维); KAIST AI (韩国科学技术院人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
[CV-67] Distinguishing Visually Similar Actions: Prompt-Guided Semantic Prototype Modulation for Few-Shot Action Recognition
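【Quick read】: This paper addresses three core challenges in few-shot action recognition: (1) temporal modeling, where models are easily distracted by static backgrounds and struggle to capture the essence of dynamic actions; (2) visual similarity, where subtle visual differences between categories make them hard to distinguish; and (3) the modality gap between visual-textual support prototypes and visual-only queries, which complicates alignment in a shared embedding space. The key to the solution is the CLIP-SPM framework, with three components: (1) the Hierarchical Synergistic Motion Refinement (HSMR) module, which aligns deep and shallow motion features to reduce background interference and improve temporal modeling; (2) the Semantic Prototype Modulation (SPM) strategy, which generates query-relevant text prompts and integrates them with visual features to narrow the modality gap and sharpen discrimination between similar actions; and (3) the Prototype-Anchor Dual Modulation (PADM) method, which refines support prototypes and aligns query features with a global semantic anchor to improve consistency between support and query samples. Experiments on standard benchmarks show competitive performance, and ablation studies validate each component.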
链接: https://arxiv.org/abs/2512.19036
作者: Xiaoyang Li,Mingming Lu,Ruiqi Wang,Hao Li,Zewei Le
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 7 figures. Preprint under review for journal submission
Abstract:Few-shot action recognition aims to enable models to quickly learn new action categories from limited labeled samples, addressing the challenge of data scarcity in real-world applications. Current research primarily addresses three core challenges: (1) temporal modeling, where models are prone to interference from irrelevant static background information and struggle to capture the essence of dynamic action features; (2) visual similarity, where categories with subtle visual differences are difficult to distinguish; and (3) the modality gap between visual-textual support prototypes and visual-only queries, which complicates alignment within a shared embedding space. To address these challenges, this paper proposes a CLIP-SPM framework, which includes three components: (1) the Hierarchical Synergistic Motion Refinement (HSMR) module, which aligns deep and shallow motion features to improve temporal modeling by reducing static background interference; (2) the Semantic Prototype Modulation (SPM) strategy, which generates query-relevant text prompts to bridge the modality gap and integrates them with visual features, enhancing the discriminability between similar actions; and (3) the Prototype-Anchor Dual Modulation (PADM) method, which refines support prototypes and aligns query features with a global semantic anchor, improving consistency across support and query samples. Comprehensive experiments across standard benchmarks, including Kinetics, SSv2-Full, SSv2-Small, UCF101, and HMDB51, demonstrate that our CLIP-SPM achieves competitive performance under 1-shot, 3-shot, and 5-shot settings. Extensive ablation studies and visual analyses further validate the effectiveness of each component and its contributions to addressing the core challenges. The source code and models are publicly available at GitHub.
[CV-68] Automatic Neuronal Activity Segmentation in Fast Four Dimensional Spatio-Temporal Fluorescence Imaging using Bayesian Approach
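【Quick read】: This paper addresses automatic detection of behaviorally relevant neuronal activity in fluorescence microscopy calcium imaging, where conventional manual segmentation is time-consuming, labor-intensive, and generalizes poorly. The key to the solution is a Bayesian deep learning framework that combines pixel-wise temporal correlation maps with the spatial information in a mean summary image to produce probabilistic segmentations of neuronal activity in 4D spatio-temporal data, while also modeling the uncertainty of active neuron detection.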
链接: https://arxiv.org/abs/2512.19032
作者: Ran Li,Pan Xiao,Kaushik Dutta,Youdong Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fluorescence microscopy calcium imaging is a fundamental tool for recording and analyzing large-scale neuronal activity in vivo, simultaneously and at single-cell resolution. Automatic and precise detection of behaviorally relevant neuronal activity from the recordings is critical for studying the mapping of brain activity in organisms. However, a persistent bottleneck is manual segmentation, which is time- and labor-intensive and lacks generalizability. To this end, we present a Bayesian deep learning framework to detect neuronal activity in 4D spatio-temporal data obtained by light sheet microscopy. Our approach exploits temporal information by calculating pixel-wise correlation maps and combines it with the spatial information given by the mean summary image. The Bayesian framework not only produces probabilistic segmentation maps but also models the uncertainty of active neuron detection. To evaluate the accuracy of our framework, we implemented a reproducibility test to assess how well the network generalizes in detecting neuronal activity. The network achieved a mean Dice score of 0.81 relative to the synthetic ground truth obtained by Otsu’s method and a mean Dice score of 0.79 between the first and second runs in the reproducibility test. Once deployed, our method can be used for rapid detection of active neurons in behavioural studies.
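For readers unfamiliar with the Dice score quoted above, a minimal sketch of the metric on binary masks follows. It assumes the masks have been flattened to one-dimensional sequences of 0/1 values and treats two empty masks as a perfect match; the function name and the example masks are hypothetical and unrelated to the paper's data.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flattened binary masks."""
    pred = [bool(p) for p in pred]
    target = [bool(t) for t in target]
    intersection = sum(p and t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * intersection / denominator

# Hypothetical 2x4 masks, flattened row by row.
predicted = [1, 1, 0, 0,
             0, 1, 0, 0]
reference = [1, 0, 0, 0,
             0, 1, 1, 0]
print(dice_score(predicted, reference))
```

A score of 1 means the two masks coincide exactly, and 0 means they share no active pixel.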
[CV-69] Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation
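【Quick read】: This paper addresses how to evaluate identity preservation in personalized generative models. Existing generative evaluation metrics emphasize overall semantic similarity between the reference and generated images and overlook fine-grained discriminative details tied to identity (e.g., a distinctive spot on a particular pet's head). The key to the solution is Finer-Personalization Rank, an evaluation protocol that takes a gallery-based ranking view: each generated image is treated as a query against an identity-labeled gallery of visually similar real images, and retrieval metrics (e.g., mean average precision) measure how well identity-specific details are preserved. On CUB, Stanford Cars, and animal re-identification (Re-ID) benchmarks, the protocol reflects identity retention more faithfully than semantic-only metrics and reveals substantial identity drift in several popular personalization methods.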
链接: https://arxiv.org/abs/2512.19026
作者: Connor Kilrain,David Carlyn,Julia Chae,Sara Beery,Wei-Lun Chao,Jianyang Gu
机构: The Ohio State University (俄亥俄州立大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one’s pet), we expect the generated image to retain precise details attached to the subject’s identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities – from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.
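The gallery-based ranking view can be sketched as follows: each generated image is a query, the gallery is ranked by similarity, and average precision rewards rankings that place same-identity gallery images first. The similarity values, identity labels, and function names are hypothetical; how the paper extracts features and builds the gallery is not shown here.

```python
def average_precision(ranked_labels, query_label):
    """Average of precision values at each rank where a correct identity appears."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(similarities, query_labels, gallery_labels):
    """similarities[i][j]: similarity of generated query i to gallery image j."""
    aps = []
    for sims, q_label in zip(similarities, query_labels):
        # Rank gallery indices from most to least similar.
        order = sorted(range(len(gallery_labels)), key=lambda j: -sims[j])
        ranked = [gallery_labels[j] for j in order]
        aps.append(average_precision(ranked, q_label))
    return sum(aps) / len(aps)

# Hypothetical gallery of real images with identity labels.
gallery_labels = ["bird_A", "bird_B", "bird_A", "bird_C"]
# Hypothetical similarities for two generated images (intended identities A and B).
similarities = [
    [0.9, 0.8, 0.7, 0.1],
    [0.2, 0.9, 0.3, 0.1],
]
print(mean_average_precision(similarities, ["bird_A", "bird_B"], gallery_labels))
```

In the first query a visually similar but wrong identity (bird_B) outranks one true match, which lowers the score; this is the kind of identity drift a pairwise semantic similarity can miss.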
[CV-70] Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection
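【Quick read】: This paper addresses catastrophic forgetting in face presentation attack detection (PAD) under incremental learning (IL) when privacy regulations forbid retaining past data, along with evolving spoofing tactics and limited cross-domain generalization. The key to the solution is SVLP-IL, a rehearsal-free incremental learning framework built on vision-language pre-trained (VLP) models, with two mechanisms: Multi-Aspect Prompting (MAP), which jointly exploits universal and domain-specific cues to isolate domain dependencies, increase sensitivity to distribution shift, and mitigate forgetting; and Selective Elastic Weight Consolidation (SEWC), which selectively preserves weights critical to previous tasks, retaining essential knowledge while leaving flexibility for new adaptations. Experiments on multiple PAD benchmarks show reduced catastrophic forgetting and improved performance on unseen domains.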
链接: https://arxiv.org/abs/2512.19022
作者: Haoze Li,Jie Zhang,Guoying Zhao,Stephen Lin,Shiguang Shan
机构: China University of Geosciences (中国地质大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); University of Oulu (奥卢大学); Microsoft Research Asia (微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose SVLP-IL, a VLP-based RF-IL framework that balances stability and plasticity via Multi-Aspect Prompting (MAP) and Selective Elastic Weight Consolidation (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.
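The idea behind selectively preserving critical weights can be sketched with the standard Elastic Weight Consolidation penalty, (λ/2) Σ F_i (θ_i − θ*_i)², restricted to the weights with the largest Fisher importance. The abstract does not describe how SEWC selects weights, so the top-fraction rule, the `keep_fraction` parameter, and all numbers below are assumptions made for illustration only.

```python
def selective_ewc_penalty(theta, theta_star, fisher, lam=1.0, keep_fraction=1.0):
    """EWC penalty applied only to the most important weights.

    theta:      current weights (flat list)
    theta_star: weights learned on previous tasks
    fisher:     per-weight importance (e.g. diagonal Fisher information)
    keep_fraction: assumed selection rule, the share of weights (ranked by
                   importance) that are anchored to their old values
    """
    k = min(len(fisher), max(1, round(keep_fraction * len(fisher))))
    threshold = sorted(fisher, reverse=True)[k - 1]
    return 0.5 * lam * sum(
        f * (t - s) ** 2
        for t, s, f in zip(theta, theta_star, fisher)
        if f >= threshold
    )

# Hypothetical three-weight model.
theta      = [1.0, 2.0, 3.0]
theta_star = [0.0, 2.0, 1.0]
fisher     = [0.5, 1.0, 2.0]
print(selective_ewc_penalty(theta, theta_star, fisher))                      # all weights anchored
print(selective_ewc_penalty(theta, theta_star, fisher, keep_fraction=0.34))  # only the most important one
```

Weights that fall below the threshold are free to move, which is where the plasticity for new spoofing domains would come from in this simplified picture.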
[CV-71] VLNVerse: A Benchmark for Vision-Language Navigation with Versatile Embodied Realistic Simulation and Evaluation
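【Quick read】: This paper addresses three problems in current vision-language navigation (VLN) research: existing benchmarks are small and use naive physical simulation, which limits insight into sim-to-real generalization; task fragmentation prevents unified progress; and limited data scale cannot support pretraining based on large language models (LLMs). The key to the solution is VLNVerse, a large-scale, extensible benchmark whose main features are: (1) a Versatile design that unifies fragmented tasks into a single framework; (2) an Embodied design with full-kinematics agents in a realistic simulation powered by a robust physics engine; and (3) a comprehensive evaluation of existing methods, from classic models to agents based on multimodal large language models (MLLMs), together with a unified multi-task model that handles all tasks in the benchmark.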
链接: https://arxiv.org/abs/2512.19021
作者: Sihao Lin,Zerui Li,Xunyi Zhao,Gengze Zhou,Liuyi Wang,Rong Wei,Rui Tang,Juncheng Li,Hanqing Wang,Jiangmiao Pang,Anton van den Hengel,Jiajun Liu,Qi Wu
机构: Adelaide University (阿德莱德大学); Responsible AI Research Centre, Australian Institute for Machine Learning (负责任人工智能研究中心,澳大利亚机器学习研究所); Tongji University (同济大学); ManyCore; Zhejiang University (浙江大学); Shanghai AI Lab (上海人工智能实验室); CSIRO Data61 (澳大利亚联邦科学与工业研究组织数据61)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible and teleporting “ghost” agents to agents that support full kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.
[CV-72] CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization
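【Quick read】: This paper addresses imprecise camera control in video generation. Existing methods rely on camera pose annotations that are hard to scale and often inconsistent with depth estimation, which causes discrepancies between training and testing. The key to the solution is CETCAM, a framework that removes the need for camera annotations through a consistent and extensible tokenization scheme: a geometry foundation model (such as VGGT) estimates depth and camera parameters, which are converted into unified, geometry-aware tokens; lightweight context blocks integrate these tokens into a pretrained video diffusion backbone. Training proceeds in two stages, first learning robust camera controllability from diverse raw video data, then refining fine-grained visual quality on curated high-fidelity datasets, with reported gains in geometric consistency, temporal stability, and visual realism.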
链接: https://arxiv.org/abs/2512.19020
作者: Zelin Zhao,Xinyu Gong,Bangya Liu,Ziyang Song,Jun Zhang,Suhui Wu,Yongxin Chen,Hao Zhang
机构: Georgia Institute of Technology (佐治亚理工学院); ByteDance (字节跳动); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at this https URL.
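[CV-73] Towards AI-Guided Open-World Ecological Taxonomic Classification
[Quick read]: This paper addresses several real-world challenges that co-occur in ecological taxonomic classification: long-tailed distributions caused by class imbalance, fine-grained differences between species, test-time spatiotemporal domain shifts, and closed-set assumptions that limit a model to the taxa seen during training. These issues constrain the use of AI in global biodiversity monitoring and conservation planning. The key to the solution is TaxoNet, an embedding-based encoder trained with a dual-margin penalization loss that strengthens the learning signal from rare taxa while suppressing the dominance of common ones, which mitigates class imbalance and improves generalization in open-world settings. Experiments on several real ecological datasets show consistent gains over baselines, particularly for rare taxa.
Link: https://arxiv.org/abs/2512.18994
Authors: Cheng Yaw Low, Heejoon Koo, Jaewoo Park, Kaleb Mesfin Asfaw, Meeyoung Cha
Affiliations: Changwon National University; Max Planck Institute for Security and Privacy; AiV Co.; University of Calgary
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 figures, 11 tables, and 15 pages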
Abstract:AI-guided classification of ecological families, genera, and species underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Progress toward this goal is hindered by long-tailed taxonomic distributions from class imbalance, along with fine-grained taxonomic variations, test-time spatiotemporal domain shifts, and closed-set assumptions that can only recognize previously seen taxa. We introduce the Open-World Ecological Taxonomy Classification, a unified framework that captures the co-occurrence of these challenges in realistic ecological settings. To address them, we propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones, directly confronting interrelated challenges. We evaluate our method on diverse ecological domains: Google Auto-Arborist (urban trees), iNat-Plantae (Plantae observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection). Our model consistently outperforms baselines, particularly for rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring. Our findings further show that general-purpose multimodal foundation models remain constrained in plant-domain applications.
[CV-74] ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation
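[Quick read]: This paper targets 4D LiDAR panoptic segmentation, where existing methods either train on large superimposed point clouds or design dedicated instance-association modules, which leads to redundant computation while overlooking the rich geometric priors in raw point clouds. The key to the solution is ICP-4D, a training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. It first applies the Iterative Closest Point (ICP) algorithm to align source and target point sets and thereby directly associate temporally consistent instances. To remain stable under noisy instance predictions, it adds a Sinkhorn-based soft matching that exploits the underlying instance distribution to obtain accurate point-wise correspondences and a robust geometric alignment. Finally, a pipeline that handles three instance types (static, dynamic, and missing) keeps computation efficient while supporting occlusion-aware matching.
Link: https://arxiv.org/abs/2512.18991
Authors: Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin, Sangpil Kim
Affiliations: Korea University; Hyundai Motor Company; Sogang University; Purdue University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: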
Abstract:Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types-static, dynamic, and missing-offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.
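The Sinkhorn-based soft matching mentioned in the abstract can be illustrated with a generic entropy-regularized assignment between two small sets of instance centroids. This is a minimal sketch, not the authors' implementation: matching on centroids, the squared-distance cost, and the names `squared_dist_cost`, `sinkhorn`, `eps`, and `iters` are all assumptions made for illustration.

```python
import math

def squared_dist_cost(src, dst):
    """Pairwise squared Euclidean distances between two lists of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst] for p in src]

def sinkhorn(cost, eps=1.0, iters=200):
    """Soft assignment via alternating row/column scaling (hypothetical helper).

    A smaller eps gives a sharper, closer to one-to-one matching.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy instance centroids in two consecutive scans (made-up values).
prev_centroids = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
curr_centroids = [(5.2, 0.1), (0.1, -0.1), (0.2, 5.1)]

plan = sinkhorn(squared_dist_cost(prev_centroids, curr_centroids))
# For each previous instance, pick the current instance with the largest weight.
matches = [max(range(len(row)), key=row.__getitem__) for row in plan]
print(matches)
```

Because the plan is soft, a noisy or ambiguous instance spreads its weight over several candidates instead of forcing a hard decision, which is the usual motivation for this kind of matching.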
[CV-75] Self-Attention with State-Object Weighted Combination for Compositional Zero Shot Learning
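[Quick read]: This paper addresses joint recognition of objects and their states, that is, how to recognize unseen state-object compositions accurately when data for every possible combination is unavailable. The current leading method, KG-SP, trains separate classifiers for states and objects and uses a semantic model to assess the plausibility of a composition, but its recognition accuracy is limited and it ignores how states and objects should be weighted during composition. The key to the solution is SASOW, which first adds self-attention to the state and object classifiers to improve the accuracy of each, and then explicitly models the weighting of states and objects at composition time so that the resulting compositions are more plausible and accurate. Experiments show that SASOW improves unseen-composition accuracy over KG-SP by 2.1%, 1.7%, and 0.4% on MIT-States, UT Zappos, and C-GQA, respectively.
Link: https://arxiv.org/abs/2512.18969
Authors: Cheng-Hong Chang, Pei-Hsuan Tsai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: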
Abstract:Object recognition has become prevalent across various industries. However, most existing applications are limited to identifying objects alone, without considering their associated states. The ability to recognize both the state and object simultaneously remains less common. One approach to address this is by treating state and object as a single category during training. However, this approach poses challenges in data collection and training since it requires comprehensive data for all possible combinations. Compositional Zero-shot Learning (CZSL) emerges as a viable solution by treating the state and object as distinct categories during training. CZSL facilitates the identification of novel compositions even in the absence of data for every conceivable combination. The current state-of-the-art method, KG-SP, addresses this issue by training distinct classifiers for states and objects, while leveraging a semantic model to evaluate the plausibility of composed compositions. However, KG-SP’s accuracy in state and object recognition can be further improved, and it fails to consider the weighting of states and objects during composition. In this study, we propose SASOW, an enhancement of KG-SP that considers the weighting of states and objects while improving composition recognition accuracy. First, we introduce self-attention mechanisms into the classifiers for states and objects, leading to enhanced accuracy in recognizing both. Additionally, we incorporate the weighting of states and objects during composition to generate more reasonable and accurate compositions. Our validation process involves testing SASOW on three established benchmark datasets. Experimental outcomes affirm when compared against OW-CZSL approach, KG-SP, SASOW showcases improvements of 2.1%, 1.7%, and 0.4% in terms of accuracy for unseen compositions across the MIT-States, UT Zappos, and C-GQA datasets, respectively.
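[CV-76] Total Curvature Regularization and its Minimization for Surface and Image Smoothing
[Quick read]: This paper addresses how to realize curvature regularization within a high-order nonlinear optimization problem so that solutions have sharp edges and precise isotropic properties. The key to the solution is a new total normal curvature regularization that penalizes normal curvatures from multiple directions. The resulting optimization problem is recast as finding the steady-state solution of a time-dependent partial differential equation (PDE) system, and time is discretized by operator splitting so that each fractional-step subproblem either has a closed-form solution or can be solved efficiently with existing algorithms. This avoids complex parameter tuning and is robust to parameter choices.
Link: https://arxiv.org/abs/2512.18968
Authors: Tianle Lu, Ke Chen, Yuping Duan
Affiliations: Beijing Normal University; University of Strathclyde
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: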
Abstract:We introduce a novel formulation for curvature regularization by penalizing normal curvatures from multiple directions. This total normal curvature regularization is capable of producing solutions with sharp edges and precise isotropic properties. To tackle the resulting high-order nonlinear optimization problem, we reformulate it as the task of finding the steady-state solution of a time-dependent partial differential equation (PDE) system. Time discretization is achieved through operator splitting, where each subproblem at the fractional steps either has a closed-form solution or can be efficiently solved using advanced algorithms. Our method circumvents the need for complex parameter tuning and demonstrates robustness to parameter choices. The efficiency and effectiveness of our approach have been rigorously validated in the context of surface and image smoothing problems.
[CV-77] DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation
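[Quick read]: This paper addresses "semantic-visual dissonance" in current tuning-free identity customization methods: they achieve high facial fidelity but neglect visual context such as lighting, skin texture, and environmental tone, so accurate facial geometry clashes with the distinctive atmosphere of the input image and produces an unnatural, sticker-like result. The key to the solution is DVI (Disentangled Visual-Identity), a zero-shot framework that orthogonally disentangles identity into a fine-grained semantic stream and a coarse-grained visual stream. It uses the mean and variance of the variational autoencoder (VAE) latent space as lightweight descriptors of global visual atmosphere, and introduces a parameter-free feature modulation mechanism that adaptively modulates the semantic embeddings with these visual statistics, injecting the reference image's "visual soul" without training. A dynamic temporal granularity scheduler prioritizes visual atmosphere in the early denoising steps and semantic detail refinement in the later ones, improving visual consistency and atmospheric fidelity.
Link: https://arxiv.org/abs/2512.18964
Authors: Guandong Li, Yijun Ding
Affiliations: iFLYTEK; Suning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: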
Abstract:Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to "Semantic-Visual Dissonance," where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural "sticker-like" effect. We propose DVI (Disentangled Visual-Identity), a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a Parameter-Free Feature Modulation mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's "visual soul" without training. Furthermore, a Dynamic Temporal Granularity Scheduler aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.
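One way to picture modulating an embedding with mean and variance statistics, without any learned weights, is statistic matching in the style of adaptive instance normalization. The sketch below is only an illustration under that assumption; the abstract does not give DVI's actual formula, and the names `latent_stats` and `modulate` as well as all numeric values are hypothetical.

```python
import statistics

def latent_stats(latent):
    """Mean and population standard deviation of a flat list of latent values."""
    return statistics.fmean(latent), statistics.pstdev(latent)

def modulate(embedding, ref_mean, ref_std, eps=1e-6):
    """Re-normalize an embedding so its statistics match the reference.

    No trainable parameters are involved: only the two reference statistics.
    """
    mu, sigma = latent_stats(embedding)
    return [ref_std * (x - mu) / (sigma + eps) + ref_mean for x in embedding]

reference_latent = [0.2, 0.4, 0.9, 1.3]      # made-up VAE latent values
semantic_embedding = [1.0, -2.0, 0.5, 3.0]   # made-up identity embedding

ref_mean, ref_std = latent_stats(reference_latent)
modulated = modulate(semantic_embedding, ref_mean, ref_std)
print(latent_stats(modulated))
```

The relative ordering of the embedding values is preserved while their global statistics follow the reference, which is the intuition behind keeping identity content and borrowing atmosphere.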
[CV-78] VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion
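[Quick read]: This paper addresses monocular 3D Semantic Scene Completion (SSC), where single-image input causes interference between high-confidence perception of visible regions and low-confidence reasoning about occluded regions, leading to feature dilution and error propagation. The key to the solution is an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth, purifying the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this, the paper proposes the Visible-Occluded Interactive Completion Network (VOIC), a dual-decoder architecture that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC builds a base 3D voxel representation by fusing image features with depth-derived occupancy; the visible decoder produces high-fidelity geometric and semantic priors, and the occlusion decoder uses these priors together with cross-modal interaction for coherent global scene reasoning, improving both geometric completion and semantic segmentation accuracy.
Link: https://arxiv.org/abs/2512.18954
Authors: Zaidao Han, Risa Higashita, Jiang Liu
Affiliations: Southern University of Science and Technology; University of Nottingham Ningbo China; Changchun University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: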
Abstract:Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.
[CV-79] Symmetrization of 3D Generative Models
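[Quick read]: This paper addresses the lack of shape symmetry in 3D generative models: generated 3D objects often lack reflectional symmetry in their geometry, which limits their use in real applications. The key to the solution is a data-centric approach that changes the training data rather than the model architecture. The model is trained only on half-objects, obtained from the original 3D shapes by reflection about the x=0 plane, so that it learns a distribution of partial geometries; reflecting the generated halves at generation time then yields complete shapes that are both visually plausible and geometrically symmetric.
Link: https://arxiv.org/abs/2512.18953
Authors: Nicolas Caytuiro, Ivan Sipiran
Affiliations: University of Chile
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures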
Abstract:We propose a novel data-centric approach to promote symmetry in 3D generative models by modifying the training data rather than the model architecture. Our method begins with an analysis of reflectional symmetry in both real-world 3D shapes and samples generated by state-of-the-art models. We hypothesize that training a generative model exclusively on half-objects, obtained by reflecting one half of the shapes along the x=0 plane, enables the model to learn a rich distribution of partial geometries which, when reflected during generation, yield complete shapes that are both visually plausible and geometrically symmetric. To test this, we construct a new dataset of half-objects from three ShapeNet classes (Airplane, Car, and Chair) and train two generative models. Experiments demonstrate that the generated shapes are symmetrical and consistent, compared with the generated objects from the original model and the original dataset objects.
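The half-object idea can be sketched on a toy point cloud: keep one side of the x=0 plane, then mirror it back to obtain a shape that is symmetric by construction. This is an illustrative sketch only; the point-cloud representation, the choice of the non-negative side, and the helper names `keep_half`, `reflect_to_full`, and `is_symmetric` are assumptions, not details taken from the paper.

```python
def keep_half(points):
    """Keep only the points on the non-negative x side of the x=0 plane."""
    return [p for p in points if p[0] >= 0.0]

def reflect_to_full(half):
    """Mirror a half-object across x=0; points lying on the plane are not duplicated."""
    mirrored = [(-x, y, z) for (x, y, z) in half if x > 0.0]
    return half + mirrored

def is_symmetric(points):
    """True if every point has its mirror image in the set."""
    point_set = set(points)
    return all((-x, y, z) in point_set for (x, y, z) in points)

shape = [(1.0, 2.0, 3.0), (-0.5, 0.0, 1.0), (0.0, 1.0, 1.0)]  # made-up point cloud
half = keep_half(shape)
full = reflect_to_full(half)
print(len(half), len(full), is_symmetric(full))
```

A generative model trained on such halves only ever has to produce one side; symmetry of the final shape then comes from the reflection step rather than from the model.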
[CV-80] Point What You Mean: Visually Grounded Instruction Policy
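[Quick read]: This paper addresses the limited object-referring ability of Vision-Language-Action (VLA) models that rely on text-only instructions, especially in cluttered or out-of-distribution (OOD) scenes. The key to the solution is Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (such as bounding boxes) to achieve pixel-level grounding, which removes referential ambiguity and enables precise object-level alignment. The method improves the generalization of VLA models in cluttered real-world environments and in scenes with unseen objects.
Link: https://arxiv.org/abs/2512.18933
Authors: Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma, Di Zhang, Yingdong Hu, Guang Chen, Junyuan Xie, Junliang Guo, Junqiao Zhao, Yang Gao
Affiliations: Tongji University; Shanghai Jiao Tong University; Spirit AI; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: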
Abstract:Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
[CV-81] LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer
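[Quick read]: This paper addresses artistic style transfer in generative models, where existing methods typically depend on model fine-tuning, additional adapters, or prompt engineering, which is computationally expensive and tends to entangle style with content. The key to the solution is LouvreSAE, a lightweight, interpretable style representation built on an art-specific Sparse Autoencoder (SAE) over the latent embeddings of generative image models. Trained on artistic data, the SAE learns largely disentangled stylistic and compositional concepts, such as brushwork, texture, and color palette, and uses them to build compact, decomposable steering vectors (style profiles). These allow style transfer directly from a few reference images without any model updates or optimization, running 1.7-20x faster while matching or surpassing existing methods on style metrics (VGG Style Loss and CLIP Score Style).
Link: https://arxiv.org/abs/2512.18930
Authors: Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala, Matyas Bohacek
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: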
Abstract:Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.
[CV-82] Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
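[Quick read]: This paper addresses the high computational cost that Multimodal Large Language Models (MLLMs) incur when processing dense, high-resolution visual tokens, and in particular the redundancy and poor scalability of visual projectors that map dense visual features directly. The key to the solution is Delta-LLaVA, an efficient projector built on a base-then-specialize design: a low-rank DeltaProjection first aligns multi-level vision features into a compact subspace to improve token formation, and lightweight Transformer blocks then capture global and local structure under a limited token budget. The design reduces computational redundancy, raising inference throughput by up to 55% and speeding up training by about 4-5x in pretraining and more than 1.5x in finetuning.
Link: https://arxiv.org/abs/2512.18910
Authors: Mohamad Zamini, Diksha Shukla
Affiliations: University of Wyoming
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: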
Abstract:Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.
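To make the appeal of a low-rank projection concrete, the sketch below compares the parameter count of a dense linear map with that of a rank-limited factorization, and checks on a tiny example that applying the two factors in sequence equals applying their product. It is a generic illustration of low-rank factorization, not the paper's DeltaProjection; the layer sizes, the rank, and the names `matmul` and `param_counts` are assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def param_counts(d_in, d_out, rank):
    """Parameters of a dense map versus a rank-`rank` factorization."""
    return d_in * d_out, rank * (d_in + d_out)

# Hypothetical sizes: vision feature width, language-model width, and rank.
dense, low_rank = param_counts(d_in=1024, d_out=4096, rank=64)
print(dense, low_rank)

# Tiny numeric check that two small factors act like one merged matrix.
down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]         # 3 x 2 (d_in x rank)
up = [[2.0, 0.0, 1.0, 0.0], [0.0, 3.0, 0.0, 1.0]]   # 2 x 4 (rank x d_out)
tokens = [[1.0, 2.0, 3.0]]                          # one visual token of width 3
factored = matmul(matmul(tokens, down), up)
merged = matmul(tokens, matmul(down, up))
print(factored == merged)
```

With these made-up sizes the factorized map needs roughly a thirteenth of the dense parameters, which is the kind of saving that makes a compact base alignment attractive before adding heavier interaction layers.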
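[CV-83] Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
[Quick read]: This paper addresses vocabulary-free fine-grained image recognition, that is, distinguishing visually similar sub-categories within a meta-class without a predefined label set. Prior methods are constrained either by large, rigid vocabularies or by complex, error-prone multi-stage pipelines. The key to the solution is FiNDR, a framework built on large multi-modal models (LMMs) with explicit or implicit reasoning that works in three automated steps: a reasoning-enabled LMM generates descriptive candidate labels; a vision-language model filters and ranks the candidates to form a coherent class set; and the verified names instantiate a lightweight multi-modal classifier used at inference time. The method reaches state-of-the-art results on several benchmarks and outperforms zero-shot baselines that use predefined ground-truth names, without any human-curated vocabulary, supporting reasoning-augmented LMMs as an effective and scalable basis for open-world fine-grained recognition.
Link: https://arxiv.org/abs/2512.18897
Authors: Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer
Affiliations: Mohamed bin Zayed University of Artificial Intelligence, UAE
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: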
Abstract:Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on this http URL.
[CV-84] Localising Shortcut Learning in Pixel Space via Ordinal Scoring Correlations for Attribution Representations (OSCAR)
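[Quick read]: This paper addresses the bias that arises when deep neural networks rely on shortcut features during training, focusing on the fairness impact when those shortcuts are associated with sensitive attributes. Existing methods mostly depend on qualitative, image-level visualization and assume that shortcut cues are visible to humans, which makes them hard to apply in specialized domains such as medical imaging. The key to the solution is OSCAR (Ordinal Scoring Correlations for Attribution Representations), a model-agnostic quantitative framework that converts image-level task attribution maps into dataset-level rank profiles of image regions and computes several correlation metrics on those profiles across three models: a balanced baseline model (BA), a test model (TS), and a sensitive attribute predictor (SA). These metrics quantify how much the test model relies on shortcuts and identify the image regions responsible. The correlations are stable across random seeds and data partitions, distinguish localised from diffuse shortcut features, and support a test-time attenuation of the identified shortcut regions that reduces worst-group performance disparities.
Link: https://arxiv.org/abs/2512.18888
Authors: Akshit Achara, Peter Triantafillou, Esther Puyol-Antón, Alexander Hammers, Andrew P. King
Affiliations: King’s College London; University of Warwick
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: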
Abstract:Deep neural networks often exploit shortcuts. These are spurious cues which are associated with output labels in the training data but are unrelated to task semantics. When the shortcut features are associated with sensitive attributes, shortcut learning can lead to biased model performance. Existing methods for localising and understanding shortcut learning are mostly based upon qualitative, image-level inspection and assume cues are human-visible, limiting their use in domains such as medical imaging. We introduce OSCAR (Ordinal Scoring Correlations for Attribution Representations), a model-agnostic framework for quantifying shortcut learning and localising shortcut features. OSCAR converts image-level task attribution maps into dataset-level rank profiles of image regions and compares them across three models: a balanced baseline model (BA), a test model (TS), and a sensitive attribute predictor (SA). By computing pairwise, partial, and deviation-based correlations on these rank profiles, we produce a set of quantitative metrics that characterise the degree of shortcut reliance for TS, together with a ranking of image-level regions that contribute most to it. Experiments on CelebA, CheXpert, and ADNI show that our correlations are (i) stable across seeds and partitions, (ii) sensitive to the level of association between shortcut features and output labels in the training data, and (iii) able to distinguish localised from diffuse shortcut features. As an illustration of the utility of our method, we show how worst-group performance disparities can be reduced using a simple test-time attenuation approach based on the identified shortcut regions. OSCAR provides a lightweight, pixel-space audit that yields statistical decision rules and spatial maps, enabling users to test, localise, and mitigate shortcut reliance. The code is available at this https URL
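The rank-profile comparison can be illustrated with Spearman correlations between per-region attribution scores of the three models, plus a first-order partial correlation that controls for the balanced baseline. This is a hedged sketch: the abstract does not specify which correlation estimators OSCAR uses, so the Spearman and partial-correlation formulas below are standard textbook choices, and the region scores and helper names are made up for illustration.

```python
import math

def ranks(scores):
    """Rank scores from 1 (lowest) upward; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up mean attribution per image region for the three models.
ba = [0.9, 0.1, 0.5, 0.3, 0.2]  # balanced baseline model
ts = [0.2, 0.8, 0.4, 0.6, 0.1]  # test model
sa = [0.1, 0.9, 0.3, 0.7, 0.2]  # sensitive attribute predictor

r_ts_sa, r_ts_ba, r_sa_ba = spearman(ts, sa), spearman(ts, ba), spearman(sa, ba)
print(r_ts_sa, r_ts_ba, partial_corr(r_ts_sa, r_ts_ba, r_sa_ba))
```

In this toy setup the test model's region ranking tracks the sensitive-attribute predictor far more closely than the balanced baseline, which is the pattern such a metric would flag as shortcut reliance.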
[CV-85] CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis
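[Quick read]: This paper addresses the multitask nature of traffic crash video analysis: performing crash recognition, temporal grounding, and high-level video understanding within one unified framework, which existing models cannot do and for which effective joint training strategies are underexplored. The key to the solution is CrashChat, a Multimodal Large Language Model (MLLM) built on VideoLLaMA3 that acquires domain knowledge through instruction fine-tuning and uses a new multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer, improving performance across crash analysis tasks.
Link: https://arxiv.org/abs/2512.18878
Authors: Kaidi Liang, Ke Li, Xianbiao Hu, Ruwen Qin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: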
Abstract:Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176% improvement in crash localization, and a 40% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at this https URL.
[CV-86] Cross-modal Counterfactual Explanations: Uncovering Decision Factors and Dataset Biases in Subjective Classification
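[Quick read]: This paper addresses the interpretability of image classifiers' decisions, particularly in image privacy decisions, which are subjective and context dependent, and asks how to quantify the differential contributions of key scene elements to a model's prediction. The key to the solution is Decompose and Explain (DeX), a training-free framework that uses cross-modal decompositionality and image-specific concepts to generate concept-driven counterfactual scenarios expressed in natural language. Through a multi-criterion selection mechanism, DeX balances minimal perturbation (measured by image similarity) against decision confidence to identify the most influential decision factors, and it can assess the interdependency among explanatory properties. The resulting image-grounded, sparse explanations improve on existing methods and reveal underlying dataset biases, supporting fairness improvements.
Link: https://arxiv.org/abs/2512.18864
Authors: Alina Elena Baia, Andrea Cavallaro
Affiliations: EPFL (École Polytechnique Fédérale de Lausanne)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: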
Abstract:Concept-driven counterfactuals explain decisions of classifiers by altering the model predictions through semantic changes. In this paper, we present a novel approach that leverages cross-modal decompositionality and image-specific concepts to create counterfactual scenarios expressed in natural language. We apply the proposed interpretability framework, termed Decompose and Explain (DeX), to the challenging domain of image privacy decisions, which are contextual and subjective. This application enables the quantification of the differential contributions of key scene elements to the model prediction. We identify relevant decision factors via a multi-criterion selection mechanism that considers both image similarity for minimal perturbations and decision confidence to prioritize impactful changes. This approach evaluates and compares diverse explanations, and assesses the interdependency and mutual influence among explanatory properties. By leveraging image-specific concepts, DeX generates image-grounded, sparse explanations, yielding significant improvements over the state of the art. Importantly, DeX operates as a training-free framework, offering high flexibility. Results show that DeX not only uncovers the principal contributing factors influencing subjective decisions, but also identifies underlying dataset biases allowing for targeted mitigation strategies to improve fairness.
[CV-87] VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference
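[Quick read]: This paper addresses threats to the integrity of data visualizations: image editing techniques enable subtle yet deceptive tampering that misleads users' understanding of the data. The key to the solution is VizDefender, a framework with two core modules. The first is a semi-fragile watermark module that protects a visualization by embedding a location map into the image, allowing precise localization of tampered regions while preserving visual quality. The second is an intent analysis module that uses Multimodal Large Language Models (MLLMs) to interpret the manipulation and infer the attacker's intent and the misleading effect.
Link: https://arxiv.org/abs/2512.18853
Authors: Sicheng Song, Yanjie Zhang, Zixin Chen, Huamin Qu, Changbo Wang, Chenhui Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: IEEE Transactions on Visualization and Computer Graphics (IEEE PacificVis’26 TVCG Track)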
Abstract:The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker’s intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.
[CV-88] Brain-Gen: Towards Interpreting Neural Signals for Stimulus Reconstruction Using Transformers and Latent Diffusion Models
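[Quick read]: This paper addresses the limited interpretability of neural representations in electroencephalography (EEG) signals, in particular the difficulty of reconstructing visual stimuli given the high noise, spatial diffusion, and pronounced temporal variability inherent to EEG. The key to the solution is a Transformer-based framework that extracts spatio-temporal features associated with visual stimuli from EEG recordings and feeds them into the attention mechanisms of Latent Diffusion Models (LDMs), enabling reconstruction of visual images from brain activity. On public benchmark datasets the method models semantic structure more effectively, notably improving latent-space clustering accuracy and zero-shot generalization.
Link: https://arxiv.org/abs/2512.18843
Authors: Hasib Aslam,Muhammad Talal Faiz,Muhammad Imran Malik
Affiliations: SEECS (National University of Sciences and Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages and 7 figures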
Abstract:Advances in neuroscience and artificial intelligence have enabled preliminary decoding of brain activity. However, despite the progress, the interpretability of neural representations remains limited. A significant challenge arises from the intrinsic properties of electroencephalography (EEG) signals, including high noise levels, spatial diffusion, and pronounced temporal variability. To interpret the neural mechanism underlying thoughts, we propose a transformer-based framework to extract spatial-temporal representations associated with observed visual stimuli from EEG recordings. These features are subsequently incorporated into the attention mechanisms of Latent Diffusion Models (LDMs) to facilitate the reconstruction of visual stimuli from brain activity. The quantitative evaluations on publicly available benchmark datasets demonstrate that the proposed method excels at modeling the semantic structures from EEG signals, achieving up to a 6.5% increase in latent space clustering accuracy and an 11.8% increase in zero-shot generalization across unseen classes while having comparable Inception Score and Fréchet Inception Distance with existing baselines. Our work marks a significant step towards generalizable semantic interpretation of the EEG signals.
[CV-89] EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer
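[Quick read]: This paper addresses the limited ability of video generation models to synthesize complex human actions. The root cause is that pixel-only training objectives struggle to capture the kinematic regularities of human motion, so generated results fall short in motion plausibility and spatio-temporal consistency. The key to the solution is the EchoMotion framework, which uses a dual-branch architecture to model the joint distribution of appearance and human motion, and introduces MVS-RoPE (Motion-Video Synchronized RoPE) to give video and motion tokens a unified 3D positional encoding, establishing an inductive bias toward cross-modal temporal alignment. A two-stage training strategy lets the model both jointly generate high-quality human action videos with their corresponding motion sequences and perform diverse cross-modal conditional generation tasks, markedly improving the coherence and plausibility of human motion in generated videos.
Link: https://arxiv.org/abs/2512.18814
Authors: Yuxiao Yang,Hualian Sheng,Sijia Cai,Jing Lin,Jiahao Wang,Bing Deng,Junzhe Lu,Haoqian Wang,Jieping Ye
Affiliations: Tsinghua University; Alibaba Group; Nanyang Technological University; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 16 figures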
Abstract:Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
[CV-90] Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction
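[Quick read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), i.e., generated text that is inconsistent with or unsupported by the input visual content. By systematically analyzing how visual perception and token generation evolve inside LVLMs, the study finds that perception follows a three-stage GATE process (global scan, approach and tighten, explore supplementary regions), while generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, in which hallucinated tokens arise from the repeated accumulation of subdominant tokens that are supported by neither attention (visual perception) nor the feed-forward network (internal knowledge). The key to the solution is the VDC (Validated Dominance Correction) strategy, which identifies these unsupported tokens and replaces them with validated dominant ones, markedly improving output reliability.
Link: https://arxiv.org/abs/2512.18813
Authors: Guangtao Lyu,Xinyi Cheng,Chenghao Xu,Qi Liu,Muli Yang,Fen Fang,Huilin Chen,Jiexi Yan,Xu Yang,Cheng Deng
Affiliations: Xidian University; Hohai University; Institute for Infocomm Research (I2R), A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: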
Abstract:Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.
[CV-91] FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation
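[Quick read]: This paper addresses the privacy leakage, high bandwidth consumption, and inference latency that short-form video platforms face in content moderation, all of which stem from conventional centralized cloud pipelines exposing raw video data. The key to the solution is an on-device Federated Learning (FL) framework for video violence detection. Its core components are: self-supervised VideoMAE video representations, which reduce reliance on the cloud; LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, which cuts the trainable parameter count to 5.5M (only about 3.5% of the 156M backbone); and a defense-in-depth privacy scheme built from differentially private stochastic gradient descent (DP-SGD) and secure aggregation. Experiments on RWF-2000 with 40 clients show that the approach retains 65–66% accuracy under strong differential privacy while reducing communication cost by 28.3× compared with full-model federated learning.
Link: https://arxiv.org/abs/2512.18809
Authors: Ziyuan Tao,Chuanzhi Xu,Sandaru Jayawardana,Wei Bao,Kanchana Thilakarathna,Teng Joon Lim
Affiliations: The University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: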
Abstract:The rapid growth of short-form video platforms increases the need for privacy-preserving moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. Experiments on RWF-2000 with 40 clients achieve 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, while reducing communication cost by 28.3× compared to full-model federated learning. The code is available at: this https URL
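The DP-SGD step mentioned in this abstract can be illustrated with a minimal sketch, assuming the standard recipe: clip each per-example gradient, sum, add Gaussian noise scaled by the clipping norm, then average. The function names, the clipping norm, and the noise multiplier below are illustrative assumptions and are not taken from the paper's code; the last line only re-derives the parameter share quoted in the abstract.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_update(per_example_grads, max_norm, noise_multiplier, rng):
    """Return the noisy mean of the clipped per-example gradients."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # Gaussian noise with standard deviation noise_multiplier * max_norm.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


# Share of trainable parameters quoted in the abstract: 5.5M of 156M.
trainable_fraction = 5.5e6 / 156e6
```

With noise_multiplier set to 0 the update reduces to plain averaging of clipped gradients, which is a convenient sanity check before choosing a privacy budget.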
[CV-92] Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation
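[Quick read]: This paper addresses the rhythm misalignment and stylistic drift that arise in music-to-3D-dance generation when methods rely on genre labels that are noisy, coarse, or missing. The key to the solution is a hierarchical, tempo-aware Mixture-of-Experts module (TempoMoE) that organizes motion experts into groups structured by tempo range and adds multi-scale beat experts to capture short- and long-range rhythmic dynamics. A hierarchical rhythm-adaptive routing mechanism dynamically selects and fuses experts based on music features, enabling flexible, rhythm-aligned dance generation without manually annotated genre labels.
Link: https://arxiv.org/abs/2512.18804
Authors: Guangtao Lyu,Chenghao Xu,Qi Liu,Jiexi Yan,Muli Yang,Fen Fang,Cheng Deng
Affiliations: Xidian University; Hohai University; Institute for Infocomm Research (I2R), A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments: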
Abstract:Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
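To make the idea of tempo-structured routing concrete, here is a toy sketch under stated assumptions: the three BPM bins, the clamping of out-of-range tempi, and the softmax fusion are hypothetical choices for illustration, since the abstract only states that tempo typically ranges from 60 to 200 BPM and that experts are selected and fused from music features.

```python
import math

# Hypothetical tempo bins in BPM; the paper's actual ranges are not given here.
TEMPO_GROUPS = [(60, 100), (100, 140), (140, 200)]


def select_tempo_group(bpm):
    """Return the index of the tempo group containing bpm, clamping outliers."""
    bpm = min(max(bpm, TEMPO_GROUPS[0][0]), TEMPO_GROUPS[-1][1])
    for index, (low, high) in enumerate(TEMPO_GROUPS):
        if low <= bpm < high:
            return index
    return len(TEMPO_GROUPS) - 1  # bpm equals the upper bound of the last group


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_beat_experts(expert_outputs, router_scores):
    """Weighted sum of expert output vectors using softmax router weights."""
    weights = softmax(router_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
```

In this sketch the tempo group is a hard choice and the beat experts inside it are blended softly; a learned router would replace the fixed bins and the hand-supplied scores.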
[CV-93] Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers
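[Quick read]: This paper addresses efficient and generalizable estimation of object rotation from RGB images, targeting in particular the performance bottleneck in latency-sensitive applications. Conventional methods usually require training for specific objects or categories, which limits their generalization and deployment efficiency. The proposed solution, Eff-GRot, centers on a Transformer-based end-to-end framework that jointly processes, in latent space, rotation-aware representations of multiple reference images and the query image, so that the target rotation is predicted in a single forward pass. The design achieves a favorable balance between accuracy and computational efficiency while remaining scalable and general.
Link: https://arxiv.org/abs/2512.18784
Authors: Fanis Mathioulakis,Gorjan Radevski,Tinne Tuytelaars
Affiliations: KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: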
Abstract:We introduce Eff-GRot, an approach for efficient and generalizable rotation estimation from RGB images. Given a query image and a set of reference images with known orientations, our method directly predicts the object’s rotation in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a comparison in the latent space, jointly processing rotation-aware representations from multiple references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Experimental results show that Eff-GRot offers a promising direction toward more efficient rotation estimation, particularly in latency-sensitive applications.
[CV-94] In-Context Audio Control of Video Diffusion Transformers
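[Quick read]: This paper addresses the under-use of audio signals (speech in particular) in video generation models: existing unified Transformer-based foundation models focus on modalities such as text and images and lack effective integration of strictly time-synchronous audio input. The core solution is the In-Context Audio Control of video diffusion transformers (ICAC) framework. Its key innovation is a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, retaining the potential of 3D self-attention to capture spatio-temporal audio-visual correlations while enabling stable training and markedly better lip synchronization and video quality.
Link: https://arxiv.org/abs/2512.18772
Authors: Wenze Liu,Weicai Ye,Minghong Cai,Quande Liu,Xintao Wang,Xiangyu Yue
Affiliations: MMLab, The Chinese University of Hong Kong; Kling Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: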
Abstract:Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
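One way to picture an attention mask that enforces temporal alignment is sketched below. This is an assumption-laden illustration, not the paper's Masked 3D Attention: it supposes that video and audio tokens are concatenated into one sequence, that attention within a modality is unrestricted, and that cross-modal attention is allowed only between tokens whose frame indices differ by at most a hypothetical window.

```python
def build_masked_attention(num_frames, video_per_frame, audio_per_frame, window=0):
    """Return a boolean matrix where mask[q][k] is True if query q may attend to key k."""
    # Frame index of every token: all video tokens first, then all audio tokens.
    video_frames = [f for f in range(num_frames) for _ in range(video_per_frame)]
    audio_frames = [f for f in range(num_frames) for _ in range(audio_per_frame)]
    frames = video_frames + audio_frames
    is_audio = [False] * len(video_frames) + [True] * len(audio_frames)

    n = len(frames)
    mask = [[True] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            # Only cross-modal pairs are restricted to temporally nearby frames.
            if is_audio[q] != is_audio[k]:
                mask[q][k] = abs(frames[q] - frames[k]) <= window
    return mask
```

For example, with 2 frames, 2 video tokens per frame, and 1 audio token per frame, the video tokens of frame 0 can see the audio token of frame 0 but not that of frame 1 when window is 0.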
[CV-95] MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
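[Quick read]: This paper addresses the difficulty of effectively optimizing masked generative models with Reinforcement Learning (RL). The core challenge is that RL policy optimization must account for the likelihood of every step in the multi-step iterative refinement process; relying on complete sampling trajectories is computationally expensive, whereas directly optimizing random steps often gives suboptimal results. The key to the solution is the MaskFocus framework, which achieves efficient policy optimization by identifying and focusing on critical steps. It first measures the information gain of each step from the similarity between the intermediate image and the final generated image, and uses this to select the most valuable steps. It then applies an entropy-based dynamic routing sampling mechanism that steers low-entropy samples toward exploring more effective masking strategies, improving masked generative models on text-to-image tasks.
Link: https://arxiv.org/abs/2512.18766
Authors: Guohui Zhang,Hu Yu,Xiaoxiao Ma,Yaning Pan,Hang Xu,Feng Zhao
Affiliations: University of Science and Technology of China; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL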
Abstract:Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core factor is that policy optimization requires accounting for the probability likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas natively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.
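The step-selection idea can be sketched as follows, with several assumptions made explicit: images are flattened to plain vectors, similarity is cosine similarity, and the "information gain" of a step is taken to be the increase in similarity to the final image relative to the previous step. The abstract does not specify these choices, so the code is only one plausible reading.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_critical_steps(intermediates, final, top_k):
    """Return the indices of the top_k steps with the largest similarity gain."""
    sims = [cosine_similarity(img, final) for img in intermediates]
    # Gain of step i = how much closer it moved toward the final image.
    gains = [sims[0]] + [sims[i] - sims[i - 1] for i in range(1, len(sims))]
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    return sorted(ranked[:top_k])
```

Policy optimization would then be restricted to the returned step indices instead of the whole sampling trajectory.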
[CV-96] Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos
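[Quick read]: This paper addresses the fact that existing action recognition methods generally overlook the multi-granularity nature of actions and so fail to fully capture spatio-temporal cues across scales. The key to the solution is the Context-Aware Network (CAN), which contains two core modules: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM). MTCM models the temporal dimension at multiple scales to extract both fine-grained motion features and the overall action flow. GSCM groups the feature maps and applies a different extraction strategy to each group, capturing multi-level semantic information in the spatial dimension and improving the model's sensitivity to complex action patterns.
Link: https://arxiv.org/abs/2512.18750
Authors: Xiaoyang Li,Wenzhu Yang,Kanglin Wang,Tiebiao Wang,Qingsong Fei
Affiliations: Hebei University; Machine Vision Engineering Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 4 figures. Preprint under review for journal submission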
Abstract:Action recognition is a critical task in video understanding, requiring the comprehensive capture of spatio-temporal cues across various scales. However, existing methods often overlook the multi-granularity nature of actions. To address this limitation, we introduce the Context-Aware Network (CAN). CAN consists of two core modules: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM). MTCM effectively extracts temporal cues at multiple scales, capturing both fast-changing motion details and overall action flow. GSCM, on the other hand, extracts spatial cues at different scales by grouping feature maps and applying specialized extraction methods to each group. Experiments conducted on five benchmark datasets (Something-Something V1 and V2, Diving48, Kinetics-400, and UCF101) demonstrate the effectiveness of CAN. Our approach achieves competitive performance, outperforming most mainstream methods, with accuracies of 50.4% on Something-Something V1, 63.9% on Something-Something V2, 88.4% on Diving48, 74.9% on Kinetics-400, and 86.9% on UCF101. These results highlight the importance of capturing multi-scale spatio-temporal cues for robust action recognition.
[CV-97] IPCV: Information-Preserving Compression for MLLM Visual Encoders
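[Quick read]: This paper addresses the high computational cost of the visual encoder in Multimodal Large Language Models (MLLMs), which stems from the Vision Transformer (ViT) processing a large number of visual tokens. Existing token pruning strategies fall short: pruning only at the Large Language Model (LLM) stage ignores the ViT's computational overhead, while conventional ViT token pruning lacks language guidance, so it may discard visual cues that are critical to the text and introduces feature distortions that the ViT's bidirectional attention amplifies. The authors therefore propose IPCV, a training-free, information-preserving compression framework with two core techniques. The first is Neighbor-Guided Reconstruction (NGR), which temporarily reconstructs pruned tokens so they can participate in attention at minimal extra cost and then fully restores them before they are passed to the LLM. The second is Attention Stabilization (AS), which approximates the Key and Value vectors of pruned tokens to mitigate the negative effects of pruning. The approach can also be applied directly to existing LLM-side pruning strategies to improve their performance, and across multiple image and video benchmarks it substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods.
Link: https://arxiv.org/abs/2512.18747
Authors: Yuan Chen,Zichen Wen,Yuzhou Wu,Xuyang Liu,Shuang Chen,Junpeng Ma,Weijia Li,Conghui He,Linfeng Zhang
Affiliations: EPIC Lab, SJTU (Shanghai Jiao Tong University); CityU (City University); Shanghai AI Laboratory; University of Sheffield; SCU (Sichuan University); FDU (Fudan University); SYSU (Sun Yat-sen University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures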
Abstract:Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by numerous visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT’s overhead, while conventional ViT token pruning, without language guidance, risks discarding textually critical visual cues and introduces feature distortions amplified by the ViT’s bidirectional attention. To meet these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR) that temporarily reconstructs pruned tokens to participate in attention with minimal overhead, then fully restores them before passing to the LLM. Besides, we introduce Attention Stabilization (AS) to further alleviate the negative influence from token pruning by approximating the K/V of pruned tokens. It can be directly applied to previous LLM-side token pruning methods to enhance their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at this https URL.
[CV-98] Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
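[Quick read]: This paper addresses the tension between retaining historical context and computational efficiency in long video generation: existing methods based on window attention discard history outside the window, causing scene inconsistency and catastrophic forgetting, whereas retaining the full history incurs prohibitive memory costs. The key to the solution is the Memorize-and-Generate (MAG) framework, whose core idea is to decouple memory compression and frame generation into two separate tasks: a dedicated memory model is trained to compress historical information into a compact key-value (KV) cache, and a separate generator model uses this compressed representation to synthesize subsequent frames, achieving efficient inference while maintaining high scene consistency.
Link: https://arxiv.org/abs/2512.18741
Authors: Tianrui Zhu,Shiyi Zhang,Zhirui Sun,Jingqi Tian,Yansong Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: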
Abstract:Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
[CV-99] AMLID: An Adaptive Multispectral Landmine Identification Dataset for Drone-Based Detection
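[Quick read]: This paper addresses the fact that current landmine detection methods are hazardous, inefficient, and expensive. The key to the solution is the Adaptive Multispectral Landmine Identification Dataset (AMLID), the first open-source dataset of its kind, which combines Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) imagery to support landmine detection research with Unmanned Aerial Systems (UAS). AMLID contains 12,078 labeled images covering 21 globally deployed landmine types and spans multiple environmental variables (such as sensor altitude, season, and illumination), providing a reliable data foundation for developing and benchmarking adaptive detection algorithms without access to live ordnance or expensive data collection equipment, and broadening access to humanitarian demining research.
Link: https://arxiv.org/abs/2512.18738
Authors: James E. Gallagher,Edward J. Oughton
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages with three figures and one table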
Abstract:Landmines remain a persistent humanitarian threat, with an estimated 110 million mines deployed across 60 countries, claiming approximately 26,000 casualties annually. Current detection methods are hazardous, inefficient, and prohibitively expensive. We present the Adaptive Multispectral Landmine Identification Dataset (AMLID), the first open-source dataset combining Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) imagery for Unmanned Aerial Systems (UAS)-based landmine detection. AMLID comprises 12,078 labeled images featuring 21 globally deployed landmine types across anti-personnel and anti-tank categories in both metal and plastic compositions. The dataset spans 11 RGB-LWIR fusion levels, four sensor altitudes, two seasonal periods, and three daily illumination conditions. By providing comprehensive multispectral coverage across diverse environmental variables, AMLID enables researchers to develop and benchmark adaptive detection algorithms without requiring access to live ordnance or expensive data collection infrastructure, thereby democratizing humanitarian demining research.
[CV-100] M3-Verse: A “Spot the Difference” Challenge for Large Multimodal Models
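[Quick read]: This paper addresses a gap in current Large Multimodal Models (LMMs): their limited ability to understand dynamic changes of objects within the same spatial scene, and in particular to reason about state transitions across two distinct video observations, which remains largely unexplored. This ability is crucial for advancing spatial intelligence. To evaluate it systematically, the authors propose M^3-Verse, a multi-modal, multi-state, multi-dimensional benchmark comprising 270 indoor scenes and 2,932 questions that span over 50 subtasks probing 4 core capabilities. The key to the solution is a structured evaluation framework together with a simple yet effective baseline that markedly improves performance on multi-state perception, supporting the development of next-generation models with a more holistic understanding of dynamic visual scenes.
Link: https://arxiv.org/abs/2512.18735
Authors: Kewei Wei,Bocheng Hu,Jie Cao,Xiaohan Chen,Zhengxi Lu,Wubing Xia,Weili Xu,Jiaao Wu,Junchen He,Mingyu Jia,Ciyun Zhao,Ye Sun,Yizhi Li,Zhonghan Zhao,Jian Zhang,Gaoang Wang
Affiliations: Zhejiang University; Shanghai AI Lab; Hangzhou Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: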
Abstract:Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce M^3-Verse, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. M^3-Verse thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from this https URL and full benchmark data from this https URL.
[CV-101] Breast Cancer Recurrence Risk Prediction Based on Multiple Instance Learning
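[Quick read]: This paper addresses the key clinical challenge of predicting breast cancer recurrence risk, using computational pathology to stratify patients from routine Hematoxylin and Eosin (HE) stained Whole-Slide Images (WSIs). The key to the solution is a Multiple Instance Learning (MIL) framework combined with pre-trained feature extractors (UNI and CONCH) that automatically mines, from standard histopathology images, pathological features correlated with genomic results, enabling three-tier classification (low, medium, high) of 5-year recurrence risk. In 5-fold cross-validation the modified CLAM-SB model reached a mean AUC of 0.836 and a classification accuracy of 76.2%, supporting the feasibility of using deep learning for rapid, low-cost, and interpretable clinical decision support without additional molecular testing.
Link: https://arxiv.org/abs/2512.18734
Authors: Jinqiu Chen,Huyan Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: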
Abstract:Predicting breast cancer recurrence risk is a critical clinical challenge. This study investigates the potential of computational pathology to stratify patients using deep learning on routine Hematoxylin and Eosin (HE) stained whole-slide images (WSIs). We developed and compared three Multiple Instance Learning (MIL) frameworks – CLAM-SB, ABMIL, and ConvNeXt-MIL-XGBoost – on an in-house dataset of 210 patient cases. The models were trained to predict 5-year recurrence risk, categorized into three tiers (low, medium, high), with ground truth labels established by the 21-gene Recurrence Score. Features were extracted using the UNI and CONCH pre-trained models. In a 5-fold cross-validation, the modified CLAM-SB model demonstrated the strongest performance, achieving a mean Area Under the Curve (AUC) of 0.836 and a classification accuracy of 76.2%. Our findings demonstrate the feasibility of using deep learning on standard histology slides for automated, genomics-correlated risk stratification, highlighting a promising pathway toward rapid and cost-effective clinical decision support.
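For readers unfamiliar with attention-based MIL pooling, the aggregation used by ABMIL-style models can be sketched as below: each patch embedding gets a score w^T tanh(V h), the scores are softmax-normalized, and the slide-level embedding is the weighted sum of patch embeddings. The parameters V and w are learned in practice; the toy values in the usage note are placeholders, and nothing here reproduces the study's actual models.

```python
import math


def attention_pool(instances, V, w):
    """Pool instance embeddings into one bag embedding with attention weights.

    instances: list of embedding vectors (one per patch)
    V: projection matrix as a list of rows, each the length of an embedding
    w: scoring vector with one entry per row of V
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v_ij * h_j for v_ij, h_j in zip(row, h))) for row in V]
        scores.append(sum(w_i * z_i for w_i, z_i in zip(w, hidden)))

    # Softmax over instance scores (shifted by the maximum for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights
```

Calling attention_pool([[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 0.0]], w=[10.0]) puts almost all of the weight on the first instance; the returned weights are also what makes such models inspectable at the patch level.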
[CV-102] Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts AAAI2026
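[Quick read]: This paper addresses the weak generalization and limited cross-task applicability of existing image correction and rectangling methods, which rely on task-specific architectures. The key to the solution is the Unified Rectification Framework (UniRect), which folds multiple task-specific inverse problems into one general distortion model by simulating different lens types. The framework uses a task-agnostic dual-component structure: a deformation module based on a novel Residual Progressive Thin-Plate Spline (RP-TPS) model that handles complex geometric distortions, and a restoration module built from Residual Mamba Blocks (RMBs) that compensates for degradation introduced by the deformation process and improves output fidelity. In addition, a Sparse Mixture-of-Experts (SMoEs) structure mitigates the heavy task competition that differing distortions cause in multi-task learning.
Link: https://arxiv.org/abs/2512.18718
Authors: Linwei Qiu,Gongzhe Li,Xiaozhe Zhang,Qinlin Sun,Fengying Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026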
Abstract:Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a Deformation Module, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.
[CV-103] EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
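[Quick read]: This paper addresses two problems of existing feed-forward 3D Gaussian Splatting (3DGS) methods: in dense-view settings, per-pixel Gaussian prediction produces redundant primitives, and there is no explicit control over the number of predicted Gaussians. The key to the solution is the EcoSplat framework, whose core innovation is a two-stage optimization process. The first stage, Pixel-aligned Gaussian Training (PGT), learns the initial Gaussian prediction. The second stage, Importance-aware Gaussian Finetuning (IGF), ranks Gaussians and adaptively adjusts their parameters, allowing the Gaussian count to be controlled at inference time for any target number and improving reconstruction quality and robustness under strict Gaussian-count limits.
Link: https://arxiv.org/abs/2512.18692
Authors: Jongmin Park,Minh-Quan Viet Bui,Juan Luis Gonzalez Bello,Jaeho Moon,Jihyong Oh,Munchurl Kim
Affiliations: KAIST; Flawless AI; Department of Imaging Science, GSAIM, Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally to this work (equal contribution). The last two authors advised equally to this work. Please visit our project page at this https URL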
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT), where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF), where our model learns to rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks.
[CV-104] A Study of Finetuning Video Transformers for Multi-view Geometry Tasks AAAI2026
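[Quick read]: This paper addresses the reliance of multi-view geometry tasks (such as optical flow estimation, 3D depth estimation, and stereo matching) on custom architectural designs and task-specific pretraining. The key to the solution is to exploit the generality of video foundation models, which transfer effectively to multi-view geometry tasks through fine-tuning alone, without complex architectural changes or additional pretraining. The core insight is that attention between patches in a Transformer naturally learns temporal and spatial information that supports geometric reasoning. Appending a linear decoder to the Transformer backbone already gives good results, and iterative refinement raises performance further to state-of-the-art levels, with strong cross-dataset generalization.
Link: https://arxiv.org/abs/2512.18684
Authors: Huimin Wu,Kwang-Ting Cheng,Stephen Lin,Zhirong Wu
Affiliations: 1. University of California, Davis; 2. Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026, Project website: this http URL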
Abstract:This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79, 1.88, and F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.
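The end-point error (EPE) quoted in this abstract is the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch of the metric, assuming flows are given as flat lists of (u, v) pairs rather than image-shaped arrays:

```python
import math


def endpoint_error(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and ground-truth (u, v) vectors."""
    errors = [
        math.hypot(pu - tu, pv - tv)
        for (pu, pv), (tu, tv) in zip(pred_flow, true_flow)
    ]
    return sum(errors) / len(errors)
```

For instance, a prediction that is off by (3, 4) pixels on one of two pixels and exact on the other has an EPE of 2.5, so lower values mean better flow estimates.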
[CV-105] AsyncDiff: Asynchronous Timestep Conditioning for Enhanced Text-to-Image Diffusion Inference
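【Quick read】: This paper addresses the limited efficiency and flexibility caused by synchronized schedules in text-to-image diffusion inference, where the numerical integrator and the denoiser must be coordinated at the same timestep, which restricts fine control over image detail and textural richness. The key to the solution is an asynchronous inference mechanism that decouples the denoiser's conditioning timestep from the image update timestep and introduces a lightweight Timestep Prediction Module (TPM). Trained with Group Relative Policy Optimization (GRPO), the TPM selects a more suitable conditioning timestep from the current latent state, effectively adjusting the noise level to improve image quality. At deployment, a scaling hyper-parameter interpolates between the original synchronized timestep and the asynchronous one, enabling conservative or aggressive adjustments while keeping the computational cost bounded (at most 15 steps for SD3.5 and 10 steps for Flux).
Link: https://arxiv.org/abs/2512.18675
Authors: Longhuan Xu,Feng Yin,Cunjian Chen
Affiliations: Southeast University; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review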
Abstract:Text-to-image diffusion inference typically follows synchronized schedules, where the numerical integrator advances the latent state to the same timestep at which the denoiser is conditioned. We propose an asynchronous inference mechanism that decouples these two, allowing the denoiser to be conditioned at a different, learned timestep while keeping the image update schedule unchanged. A lightweight timestep prediction module (TPM), trained with Group Relative Policy Optimization (GRPO), selects a more feasible conditioning timestep based on the current state, effectively choosing a desired noise level to control image detail and textural richness. At deployment, a scaling hyper-parameter can be used to interpolate between the original and de-synchronized timesteps, enabling conservative or aggressive adjustments. To keep the study computationally affordable, we cap the inference at 15 steps for SD3.5 and 10 steps for Flux. Evaluated on Stable Diffusion 3.5 Medium and Flux.1-dev across MS-COCO 2014 and T2I-CompBench datasets, our method optimizes a composite reward that averages Image Reward, HPSv2, CLIP Score and Pick Score, and shows consistent improvement.
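The deployment-time interpolation and the averaged reward described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the interpolation formula, so a linear blend with clamping to normalized timesteps in [0, 1] is assumed, and the names `conditioning_timestep` and `composite_reward` are hypothetical.

```python
def conditioning_timestep(t_sync, t_pred, scale):
    """Blend the synchronized timestep with the predicted one (assumed linear).

    scale = 0 reproduces the original synchronized schedule, scale = 1 uses
    the predicted timestep as-is, and larger values extrapolate (a more
    aggressive adjustment). The result is clamped to [0, 1], assuming
    normalized timesteps.
    """
    t = t_sync + scale * (t_pred - t_sync)
    return min(1.0, max(0.0, t))

def composite_reward(scores):
    """Plain average of per-metric scores, assumed to be on comparable scales."""
    return sum(scores.values()) / len(scores)

print(conditioning_timestep(0.8, 0.6, 0.5))
print(composite_reward({"image_reward": 0.5, "hpsv2": 0.25,
                        "clip_score": 0.75, "pick_score": 0.5}))
```

In practice the four reward models produce scores on different scales, so some normalization before averaging would likely be needed; the abstract does not say how this is handled.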
[CV-106] SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse AAAI26
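【Quick read】: This paper addresses perceptual hallucinations in Video Large Language Models (Video-LLMs), which severely limit their safe use in real-world scenarios; existing mitigation methods usually impair the model's video understanding and reasoning. The key to the solution is SmartSight, a training-free framework that exploits the model's own introspective capabilities: it generates multiple candidate responses to find low-hallucination outputs and scores the hallucination level of each response with the Temporal Attention Collapse score, which detects whether the model over-focuses on irrelevant temporal segments of the input video. It also introduces the Visual Attention Vanishing point to improve hallucination estimation and to terminate highly hallucinated responses early, substantially reducing decoding cost. Experiments show that SmartSight lowers the hallucination rate of Qwen2.5-VL-7B (by 10.59% on VRIPT-HAL) without sacrificing performance and improves video understanding (by up to 8.86% on VideoMMMU).
Link: https://arxiv.org/abs/2512.18671
Authors: Yiming Sun,Mi Zhang,Feifei Li,Geng Hong,Min Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI26 accepted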
Abstract:Despite Video Large Language Models having rapidly advanced in recent years, perceptual hallucinations pose a substantial safety risk, which severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model’s capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model’s own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight’s effectiveness in improving the reliability of open-source Video-LLMs.
[CV-107] Offline Reinforcement Learning for End-to-End Autonomous Driving
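【Quick read】: This paper addresses the persistent failure modes of end-to-end (E2E) autonomous driving models that rely on imitation learning (IL), in particular the imitation of unsafe or suboptimal behavior and unstable training. The key to the solution is an offline reinforcement learning (Offline RL) framework that takes only camera images as input, performs no additional exploration, and trains solely on a fixed simulator dataset. It constructs pseudo ground-truth trajectories from expert driving logs and uses them as a behavior regularization signal, which suppresses imitation of undesirable behavior and stabilizes value learning, leading to safer and more efficient driving decisions in a neural rendering environment.
Link: https://arxiv.org/abs/2512.18662
Authors: Chihiro Noguchi,Takaki Yamamoto
Affiliations: Toyota Motor Corporation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages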
Abstract:End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code will be available at [URL].
[CV-108] PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval
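【Quick read】: This paper addresses the difficulty of learning cross-modal alignment in remote sensing (RS) image-text retrieval caused by Pseudo-Matched Pairs (PMPs), i.e., semantically mismatched or weakly aligned image-text pairs that interfere with learning reliable cross-modal associations. The key to the solution is a novel retrieval framework built on two mechanisms: Cross-Modal Gated Attention, which dynamically regulates cross-modal information flow, and Positive-Negative Awareness Attention, which explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning, strengthening the model's ability to recognize genuine semantic associations. Experiments on three benchmark datasets (RSICD, RSITMD, and RS5M) show state-of-the-art performance, supporting the method's effectiveness and robustness against real-world noise and PMPs.
Link: https://arxiv.org/abs/2512.18660
Authors: Pengxiang Ouyang,Qing Ma,Zheng Wang,Cong Bai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: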
Abstract:Remote sensing (RS) image-text retrieval faces significant challenges in real-world datasets due to the presence of Pseudo-Matched Pairs (PMPs), semantically mismatched or weakly aligned image-text pairs, which hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets, i.e., RSICD, RSITMD, and RS5M, demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval tasks.
[CV-109] SplatBright: Generalizable Low-Light Scene Reconstruction from Sparse Views via Physically-Guided Gaussian Enhancement
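【Quick read】: This paper addresses exposure imbalance and degraded color fidelity in 3D reconstruction of low-light scenes from sparse-view inputs, while overcoming the weak view consistency of existing methods and their need for per-scene training. The key to the solution is SplatBright, a generalizable framework based on 3D Gaussians that the authors describe as the first to jointly optimize low-light enhancement and 3D reconstruction. Its core components are a dual-branch predictor for stable initialization of geometric parameters, illumination consistency built on frequency priors to keep lighting coherent across views, and an appearance refinement module that separates illumination, material, and view-dependent features to recover fine texture. To compensate for the lack of large-scale geometrically consistent paired data, dark views are synthesized with a physics-based camera model for training, which improves novel view synthesis quality, cross-view consistency, and generalization to unseen low-light scenes.
Link: https://arxiv.org/abs/2512.18655
Authors: Yue Wen,Liang Song,Hesheng Wang
Affiliations: Shanghai Jiao Tong University; China DXR Technology CO.,Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: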
Abstract:Low-light 3D reconstruction from sparse views remains challenging due to exposure imbalance and degraded color fidelity. While existing methods struggle with view inconsistency and require per-scene training, we propose SplatBright, which is, to our knowledge, the first generalizable 3D Gaussian framework for joint low-light enhancement and reconstruction from sparse sRGB inputs. Our key idea is to integrate physically guided illumination modeling with geometry-appearance decoupling for consistent low-light reconstruction. Specifically, we adopt a dual-branch predictor that provides stable geometric initialization of 3D Gaussian parameters. On the appearance side, illumination consistency leverages frequency priors to enable controllable and cross-view coherent lighting, while an appearance refinement module further separates illumination, material, and view-dependent cues to recover fine texture. To tackle the lack of large-scale geometrically consistent paired data, we synthesize dark views via a physics-based camera model for training. Extensive experiments on public and self-collected datasets demonstrate that SplatBright achieves superior novel view synthesis, cross-view consistency, and better generalization to unseen low-light scenes compared with both 2D and 3D methods.
[CV-110] Adversarial Robustness in Zero-Shot Learning: An Empirical Study on Class and Concept-Level Vulnerabilities
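【Quick read】: This paper studies the robustness of Zero-shot Learning (ZSL) models under systematic input perturbations, in particular their vulnerability to class-level and concept-level attacks. Although existing ZSL methods can recognize unseen classes, their classification decisions are easily disrupted by malicious perturbations, for example by the non-target class attack (clsA). In the Generalized Zero-shot Learning (GZSL) setting, however, the success of clsA is spurious: it holds only at the original best-calibrated point, while performance remains relatively strong at other calibration points. The authors therefore propose two kinds of improved attacks: the Class-Bias Enhanced Attack (CBEA), which eliminates GZSL accuracy at all calibration points by enlarging the gap between seen and unseen classes, and concept-level attacks, namely the Class-Preserving Concept Attack (CPconA) and the Non-Class-Preserving Concept Attack (NCPconA), which directly manipulate semantic concepts to mislead the model's output. Experiments show that current mainstream ZSL models, regardless of architecture, are highly sensitive to both kinds of attack, underscoring the urgency of improving their adversarial robustness.
Link: https://arxiv.org/abs/2512.18651
Authors: Zhiyuan Peng,Zihan Ye,Shreyank N Gowda,Yuping Yan,Haotian Xu,Ling Shao
Affiliations: iFLYTEK Co., Ltd.; UCAS-Terminus AI Lab, University of Chinese Academy of Sciences; School of Computer Science, the University of Nottingham; TGAI lab, the Westlake University; RippleInfo Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: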
Abstract:Zero-shot Learning (ZSL) aims to enable image classifiers to recognize images from unseen classes that were not included during training. Unlike traditional supervised classification, ZSL typically relies on learning a mapping from visual features to predefined, human-understandable class concepts. While ZSL models promise to improve generalization and interpretability, their robustness under systematic input perturbations remains unclear. In this study, we present an empirical analysis of the robustness of existing ZSL methods at both the class level and the concept level. Specifically, we successfully disrupted their class prediction by the well-known non-target class attack (clsA). However, in the Generalized Zero-shot Learning (GZSL) setting, we observe that the success of clsA is only at the original best-calibrated point. After the attack, the best-calibration point shifts, and ZSL models maintain relatively strong performance at other calibration points, indicating that clsA results in a spurious attack success in GZSL. To address this, we propose the Class-Bias Enhanced Attack (CBEA), which completely eliminates GZSL accuracy across all calibrated points by enhancing the gap between seen and unseen classes. At the concept level, we introduce two novel attack modes: Class-Preserving Concept Attack (CPconA) and Non-Class-Preserving Concept Attack (NCPconA). Our extensive experiments evaluate three typical ZSL models across various architectures from the past three years and reveal that ZSL models are vulnerable not only to the traditional class attack but also to concept-based attacks. These attacks allow malicious actors to easily manipulate class predictions by erasing or introducing concepts. Our findings highlight a significant performance gap between existing approaches, emphasizing the need for improved adversarial robustness in current ZSL models.
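The "calibration points" mentioned above presumably refer to the common GZSL practice of subtracting a constant from seen-class scores before taking the argmax (often called calibrated stacking) and reporting the harmonic mean of seen and unseen accuracy as the constant is swept. The sketch below illustrates that practice only; the function names, the class names, and the scores are hypothetical, and the paper's exact protocol may differ.

```python
def calibrated_predict(scores, seen_classes, gamma):
    """Return the top class after subtracting gamma from seen-class scores."""
    adjusted = {
        cls: (score - gamma if cls in seen_classes else score)
        for cls, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen and unseen accuracy, the usual GZSL summary."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Hypothetical scores for one image whose true class is the unseen "zebra".
scores = {"horse": 0.9, "tiger": 0.6, "zebra": 0.7}
seen = {"horse", "tiger"}
print(calibrated_predict(scores, seen, gamma=0.0))  # biased toward seen classes
print(calibrated_predict(scores, seen, gamma=0.3))  # calibration favors unseen
print(harmonic_mean(0.8, 0.4))
```

Each value of `gamma` is one calibration point, which is why an attack that only breaks the model at a single `gamma` can look more successful than it is.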
[CV-111] Geometric-Photometric Event-based 3D Gaussian Ray Tracing
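【Quick read】: This paper addresses how event-camera-based 3D Gaussian Splatting (3DGS) reconstruction can exploit the fine-grained temporal information in sparse event data to balance accuracy against temporal resolution. The key to the solution is to decouple rendering into two branches: event-by-event geometry (depth) rendering, implemented with ray tracing, and snapshot-based radiance (intensity) rendering, which uses the image of warped events. This two-branch strategy lets the model adapt flexibly to different numbers of input events without relying on pretrained image reconstruction models or COLMAP initialization, and it achieves high-quality reconstruction with sharp edges and fast training.
Link: https://arxiv.org/abs/2512.18640
Authors: Kai Kohyama,Yoshimitsu Aoki,Guillermo Gallego,Shintaro Shiba
Affiliations: Keio University; Technische Universität Berlin; Einstein Center Digital Future; Robotics Institute Germany; Science of Intelligence Excellence Cluster
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 15 pages, 10 figures, 5 tables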
Abstract:Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on the real-world datasets and competitive performance on the synthetic dataset. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event selection number, and achieves sharp reconstruction on scene edges with fast training time. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. The code will be released.
[CV-112] Uni-Neur2Img: Unified Neural Signal-Guided Image Generation Editing and Stylization via Diffusion Transformers
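【Quick read】: This paper addresses how to use neural signals (such as electroencephalography, EEG) efficiently and flexibly to drive image generation and editing directly, noting that existing work mostly uses textual modalities as conditions and has explored visual modalities as direct conditioning signals only to a limited extent. The core of the solution is the Uni-Neur2Img framework. Its key innovation is a LoRA (Low-Rank Adaptation)-based neural signal injection module that processes each conditioning signal independently and plugs into the model as a plug-and-play component, allowing flexible multi-modal control without modifying the base model's parameters. A causal attention mechanism accommodates the long-sequence modeling demands of conditional generation, improving generation quality, editing consistency, and style transfer while keeping computational overhead low and scalability good.
Link: https://arxiv.org/abs/2512.18635
Authors: Xiyue Bai,Ronghao Yu,Jia Xiu,Pengfei Zhou,Jie Xia,Peng Ji
Affiliations: Fudan University; Zhejiang University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: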
Abstract:Generating or editing images directly from neural signals has immense potential at the intersection of neuroscience, vision, and brain-computer interaction. In this paper, we present Uni-Neur2Img, a unified framework for neural signal-driven image generation and editing. The framework introduces a parameter-efficient LoRA-based neural signal injection module that independently processes each conditioning signal as a pluggable component, facilitating flexible multi-modal conditioning without altering base model parameters. Additionally, we employ a causal attention mechanism to accommodate the long-sequence modeling demands of conditional generation tasks. Existing neural-driven generation research predominantly focuses on textual modalities as conditions or intermediate representations, resulting in limited exploration of visual modalities as direct conditioning signals. To bridge this research gap, we introduce the EEG-Style dataset. We conduct comprehensive evaluations across public benchmarks and self-collected neural signal datasets: (1) EEG-driven image generation on the public CVPR40 dataset; (2) neural signal-guided image editing on the public Loongx dataset for semantic-aware local modifications; and (3) EEG-driven style transfer on our self-collected EEG-Style dataset. Extensive experimental results demonstrate significant improvements in generation fidelity, editing consistency, and style transfer quality while maintaining low computational overhead and strong scalability to additional modalities. Thus, Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation.
[CV-113] PTTA: A Pure Text-to-Animation Framework for High-Quality Creation
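【Quick read】: This paper addresses the complexity and high labor cost of traditional animation production, as well as the limited performance of existing video generation models (such as Sora, Kling, and CogVideoX) on animation-style generation. The key to the solution is PTTA, a pure text-to-animation framework: the authors build a small but high-quality paired dataset of animation videos and text descriptions and fine-tune the pretrained text-to-video model HunyuanVideo on it, adapting it to animation styles for high-quality generation. Experiments show that the method significantly outperforms existing baselines across multiple dimensions.
Link: https://arxiv.org/abs/2512.18614
Authors: Ruiqi Chen,Kaitong Cai,Yijia Fan,Keze Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under submission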
Abstract:Traditional animation production involves complex pipelines and significant manual labor cost. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.
[CV-114] Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments
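【Quick read】: This paper addresses the need for Visual Place Recognition (VPR) systems in long-term deployment to remain robust and interpretable under lighting, weather, and seasonal changes. Traditional methods based on pixel similarity struggle with large appearance changes and lack a transparent decision mechanism. The key to the solution is the Text2Graph VPR framework: image sequences are first converted into textual scene descriptions, which are then parsed into structured scene graphs containing objects, attributes, and pairwise relations; per-frame graphs are aggregated into a compact place representation, and a dual-similarity mechanism fuses embeddings learned by a Graph Attention Network (GAT) with a Shortest-Path (SP) kernel for structural matching. This hybrid design provides both semantic-aware and topology-sensitive matching and produces human-readable intermediate representations that support diagnostic analysis and make the decision process more transparent.
Link: https://arxiv.org/abs/2512.18613
Authors: Saeideh Yousefzadeh,Hamidreza Pourreza
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint version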
Abstract:Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison, and – critically – produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
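To make the dual-similarity idea above concrete, the sketch below computes a simple shortest-path kernel between two labeled scene graphs and fuses it with an embedding similarity. It rests on several assumptions: nodes carry only an object label, path lengths are hop counts found by breadth-first search, the fusion is a weighted sum with a hypothetical `weight`, and the GAT embedding similarity is passed in as a plain number rather than computed. None of these details are confirmed by the abstract.

```python
from collections import Counter, deque

def shortest_path_features(labels, edges):
    """Count (label_a, label_b, hops) triples over all connected node pairs."""
    adjacency = {node: set() for node in labels}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    features = Counter()
    for source in labels:
        # Breadth-first search gives hop distances from this source node.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        for target, hops in distance.items():
            if target != source:
                a, b = sorted((labels[source], labels[target]))
                features[(a, b, hops)] += 1
    return features

def sp_kernel(f1, f2):
    """Dot product of two shortest-path feature histograms."""
    return sum(count * f2[key] for key, count in f1.items())

def normalized_sp_kernel(graph1, graph2):
    """Shortest-path kernel scaled to [0, 1] by the self-similarities."""
    f1 = shortest_path_features(*graph1)
    f2 = shortest_path_features(*graph2)
    denom = (sp_kernel(f1, f1) * sp_kernel(f2, f2)) ** 0.5
    return sp_kernel(f1, f2) / denom if denom else 0.0

def fused_similarity(embedding_sim, structural_sim, weight=0.5):
    """Weighted sum of learned and structural similarity (assumed fusion rule)."""
    return weight * embedding_sim + (1 - weight) * structural_sim

place_a = ({"n1": "car", "n2": "road", "n3": "tree"}, [("n1", "n2"), ("n2", "n3")])
place_b = ({"m1": "sky", "m2": "sea"}, [("m1", "m2")])
print(normalized_sp_kernel(place_a, place_a), normalized_sp_kernel(place_a, place_b))
print(fused_similarity(0.8, normalized_sp_kernel(place_a, place_a)))
```

Because the kernel compares label pairs and their hop distances rather than pixels, two views of the same place can match even when appearance changes, provided the scene descriptions yield similar graphs.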
[CV-115] SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback
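【Quick read】: This paper addresses the drop in restoration quality in complex image restoration caused by multiple degradations (such as blur, noise, rain streaks, and compression artifacts), the efficiency bottlenecks of existing restoration agents based on vision-language models and large language models (reflection, rollback, and iterative tool searching), and the limited generalization that comes from their dependence on large amounts of annotated data. The key to the solution is a lightweight restoration framework based on policy optimization: it trains an agent that selects the best restoration operation at each step of a sequential decision process to maximize final image quality, and it introduces a new reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback to guide policy optimization in label-free environments. The result is a deterministic restoration procedure without redundant tool calls that speeds up inference considerably while matching or exceeding current state-of-the-art methods in restoration quality.
Link: https://arxiv.org/abs/2512.18599
Authors: Jianglin Lu,Yuanwei Wu,Ziyi Zhao,Hongcheng Wang,Felix Jimenez,Abrar Majeedi,Yun Fu
Affiliations: Amazon; Northeastern University; University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: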
Abstract:Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns a lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plan without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.
[CV-116] Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach
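【Quick read】: This paper addresses the “zero-speed braking” problem in commercial vehicle Automatic Emergency Braking (AEB) systems, where inaccurate Controller Area Network (CAN) signals during low-speed operation falsely trigger braking and cause unnecessary stops or disruption. The key to the solution is a vision-based trajectory analysis method: sequential video frames from a blind-spot camera are processed on the NVIDIA Jetson AGX Xavier platform using Scale-Invariant Feature Transform (SIFT) feature extraction enhanced with self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE), together with KNN-RANSAC matching, to classify the vehicle's motion state (static, vibration, moving) precisely. The core innovations are: 1) multi-frame trajectory displacement statistics over a 5-frame sliding window; 2) a dual-threshold state decision matrix; and 3) dynamic Region of Interest (ROI) configuration driven by OBD-II. Together they suppress environmental interference and false detection of dynamic objects and markedly reduce false activations at low speed.
Link: https://arxiv.org/abs/2512.18597
Authors: Zhe Li,Kun Cheng,Hanyue Mo,Jintao Lu,Ziwen Kuang,Jianwen Ye,Lixu Xu,Xinya Meng,Jiahui Zhao,Shengda Ji,Shuyuan Liu,Mengyu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 5 figures,16 pages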
Abstract:A vision-based trajectory analysis solution is proposed to address the “zero-speed braking” issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm utilizes the NVIDIA Jetson AGX Xavier platform to process sequential video frames from a blind spot camera, employing self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE)-enhanced Scale-Invariant Feature Transform (SIFT) feature extraction and K-Nearest Neighbors (KNN)-Random Sample Consensus (RANSAC) matching. This allows for precise classification of the vehicle’s motion state (static, vibration, moving). Key innovations include 1) multi-frame trajectory displacement statistics (5-frame sliding window), 2) a dual-threshold state decision matrix, and 3) OBD-II driven dynamic Region of Interest (ROI) configuration. The system effectively suppresses environmental interference and false detection of dynamic objects, directly addressing the challenge of low-speed false activation in commercial vehicle safety systems. Evaluation on a real-world dataset (32,454 video segments from 1,852 vehicles) demonstrates an F1-score of 99.96% for static detection, 97.78% for moving state recognition, and a processing delay of 14.2 milliseconds (resolution 704x576). On-site deployment shows an 89% reduction in false braking events, a 100% success rate in emergency braking, and a fault rate below 5%.
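The combination of a 5-frame sliding window with two thresholds can be sketched as below. This is a simplified illustration, not the paper's decision matrix: it assumes the per-frame input is a single mean feature displacement in pixels (for example from matched SIFT keypoints), and the class name `MotionStateClassifier` and both threshold values are hypothetical.

```python
from collections import deque
from statistics import mean

class MotionStateClassifier:
    """Classify motion from displacement averaged over a sliding window."""

    def __init__(self, window=5, static_thresh=0.5, moving_thresh=2.0):
        if static_thresh >= moving_thresh:
            raise ValueError("static_thresh must be below moving_thresh")
        # deque with maxlen drops the oldest frame automatically.
        self.displacements = deque(maxlen=window)
        self.static_thresh = static_thresh
        self.moving_thresh = moving_thresh

    def update(self, displacement):
        """Add one frame's mean displacement (pixels) and return the state."""
        self.displacements.append(displacement)
        average = mean(self.displacements)
        if average < self.static_thresh:
            return "static"
        if average < self.moving_thresh:
            return "vibration"
        return "moving"

classifier = MotionStateClassifier()
frames = [0.1, 0.2, 0.1, 3.0, 3.0, 3.0, 3.0, 3.0]
states = [classifier.update(d) for d in frames]
print(states)
```

Averaging over the window means a single noisy frame cannot flip the state on its own, which is the property that matters for suppressing false braking; the price is a few frames of latency when the vehicle really starts to move.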
[CV-117] Placenta Accreta Spectrum Detection Using an MRI-based Hybrid CNN-Transformer Model
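【Quick read】: This paper addresses the difficulty of diagnosing Placenta Accreta Spectrum (PAS) with Magnetic Resonance Imaging (MRI) caused by variability in radiologists' interpretations. The key to the solution is a hybrid deep learning model that combines a 3D DenseNet121, which extracts local features, with a 3D Vision Transformer (ViT), which models global spatial context, for automated PAS detection from volumetric MRI data. Evaluated on a retrospective dataset of 1,133 MRI volumes, the method reaches an average accuracy of 84.3% on an independent test set, indicating its potential as a computer-aided diagnosis tool for improving diagnostic consistency and accuracy.
Link: https://arxiv.org/abs/2512.18573
Authors: Sumaiya Ali,Areej Alhothali,Ohoud Alzamzami,Sameera Albasri,Ahmed Abduljabbar,Muhammad Alwazzan
Affiliations: King Abdulaziz University; King Abdulaziz University Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: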
Abstract:Placenta Accreta Spectrum (PAS) is a serious obstetric condition that can be challenging to diagnose with Magnetic Resonance Imaging (MRI) due to variability in radiologists’ interpretations. To overcome this challenge, a hybrid 3D deep learning model for automated PAS detection from volumetric MRI scans is proposed in this study. The model integrates a 3D DenseNet121 to capture local features and a 3D Vision Transformer (ViT) to model global spatial context. It was developed and evaluated on a retrospective dataset of 1,133 MRI volumes. Multiple 3D deep learning architectures were also evaluated for comparison. On an independent test set, the DenseNet121-ViT model achieved the highest performance with a five-run average accuracy of 84.3%. These results highlight the strength of hybrid CNN-Transformer models as a computer-aided diagnosis tool. The model’s performance demonstrates a clear potential to assist radiologists by providing a robust decision support to improve diagnostic consistency across interpretations, and ultimately enhance the accuracy and timeliness of PAS diagnosis.
[CV-118] ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
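【Quick read】: This paper addresses the inability of current embodied agents to balance the cost of physical exploration against the cognitive cost of human interaction when facing ambiguous natural language instructions (such as “fetch a tool”). Existing methods usually treat disambiguation as a passive perception task and lack the strategic reasoning needed to minimize total task execution cost. The key to the solution is the ESearch-R1 framework, which unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process, together with the HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization) algorithm, which samples groups of reasoning trajectories and reinforces those that best balance information gain against heterogeneous costs (such as navigation time and human attention), thereby optimizing the multimodal large language model (MLLM) efficiently.
Link: https://arxiv.org/abs/2512.18571
Authors: Weijie Zhou,Xuangtang Xiong,Ye Tian,Lijun Yue,Xinyu Wu,Wei Li,Chaoyang Zhao,Honghui Dong,Ming Tang,Jinqiao Wang,Zhengyou Zhang
Affiliations: Beijing Jiaotong University; Tencent Robotics X & Futian Laboratory; Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: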
Abstract:Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable capabilities in planning and reasoning. However, when facing ambiguous natural language instructions (e.g., “fetch the tool” in a cluttered room), current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. They typically treat disambiguation as a passive perception problem, lacking the strategic reasoning to minimize total task execution costs. To bridge this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. We introduce HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization). Unlike traditional PPO, which relies on a separate value critic, HC-GRPO optimizes the MLLM by sampling groups of reasoning trajectories and reinforcing those that achieve the optimal trade-off between information gain and heterogeneous costs (e.g., navigation time and human attention). Extensive experiments in AI2-THOR demonstrate that ESearch-R1 significantly outperforms standard ReAct-based agents. It improves task success rates while reducing total operational costs by approximately 50%, validating the effectiveness of GRPO in aligning MLLM agents with physical world constraints.
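The group-relative step that lets GRPO-style methods drop the value critic can be sketched as follows: each sampled trajectory's reward is standardized against the other trajectories in its group. The reward shaping shown here (a success bonus minus weighted navigation time and number of questions) and its weights are assumptions chosen for illustration; the abstract does not specify HC-GRPO's actual cost terms.

```python
from statistics import mean, pstdev

def cost_aware_reward(success, nav_seconds, questions_asked,
                      nav_weight=0.01, ask_weight=0.2):
    """Task success minus weighted heterogeneous costs (hypothetical weights)."""
    base = 1.0 if success else 0.0
    return base - nav_weight * nav_seconds - ask_weight * questions_asked

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group of trajectories.

    The group mean acts as the baseline, so no learned value critic is needed.
    Population standard deviation is used here; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three hypothetical rollouts for the same ambiguous instruction.
group = [
    cost_aware_reward(True, nav_seconds=20.0, questions_asked=1),   # asks once
    cost_aware_reward(True, nav_seconds=60.0, questions_asked=0),   # explores
    cost_aware_reward(False, nav_seconds=30.0, questions_asked=0),  # fails
]
print(group_relative_advantages(group))
```

Trajectories with above-average reward get positive advantages and are reinforced, so whether asking or exploring wins depends entirely on the relative cost weights.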
[CV-119] OpenView: Empowering MLLMs with Out-of-view VQA
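【Quick read】: This paper addresses a common limitation of Multimodal Large Language Models (MLLMs) in natural image understanding: they can reason only about content inside the image frame and lack the ability to reason about Out-of-View (OOV) scenes, objects, and activities. To overcome this, the authors make three contributions: first, OpenView, a four-stage pipeline that uses panoramic imagery to generate multi-choice visual question answering (VQA) data with rich context and spatial grounding; second, OpenView-Dataset, a high-quality dataset synthesized from real-world panoramas for supervised fine-tuning of MLLMs; and third, OpenView-Bench, a benchmark that jointly evaluates choice accuracy and rationale accuracy for interpretable, diagnosable evaluation. Experiments show that although current MLLMs still lag far behind humans on OOV VQA, with OpenView the average performance of several mainstream MLLMs rises from 48.6% to 64.1%, supporting the effectiveness of the approach.
Link: https://arxiv.org/abs/2512.18563
Authors: Qixiang Chen,Cheng Zhang,Chi-Wing Fu,Jingwen Ye,Jianfei Cai
Affiliations: Monash University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL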
Abstract:Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet they perform well mainly when reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline to massively generate multi-choice VQA by leveraging panoramic imagery to enable context-rich and spatial-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs through supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that, despite a large gap from human performance in OOV VQA answer selection, multiple MLLMs empowered by OpenView consistently boost their performance, rising from 48.6% to 64.1% on average. Code, benchmark, and data will be available at this https URL.
[CV-120] Enhancing Medical Large Vision-Language Models via Alignment Distillation AAAI’2026
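【Quick read】: This paper addresses hallucinated outputs of Medical Large Vision-Language Models (Med-LVLMs) in clinical applications caused by misaligned visual understanding. It identifies two fundamental limitations: insufficient visual representation learning and poor visual attention alignment. The authors propose MEDALIGN, a lightweight knowledge distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. Its core innovation is a pair of distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions, improving both performance and interpretability.
Link: https://arxiv.org/abs/2512.18554
Authors: Aofei Chang,Ting Wang,Fenglong Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI’2026 (Main track)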
Abstract:Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.
[CV-121] Hierarchical Bayesian Framework for Multisource Domain Adaptation
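[Quick read]: This paper addresses pretraining in multisource domain adaptation (MDA), that is, how to use several labeled source datasets to improve a model on an unlabeled target domain. Existing methods pretrain source models either with weight sharing or independently, and do not model how similar the source distributions are. The paper proposes a Hierarchical Bayesian Framework for pretraining whose key idea is to model the similarity between source data distributions explicitly and use it as prior information to optimize pretraining, which raises recognition accuracy on the target domain. Experiments show accuracy gains on a large benchmark dataset, and on the challenging Daily-DA RGB video human action recognition task the framework improves accuracy by 17.29% over state-of-the-art methods.
Link: https://arxiv.org/abs/2512.18553
Authors: Alexander M. Glandon,Khan M. Iftekharuddin
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: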
Abstract:Multisource domain adaptation (MDA) aims to use multiple source datasets with available labels to infer labels on a target dataset without available labels for target supervision. Prior work on MDA in the literature is ad hoc, as the pretraining of source models is either based on weight sharing or uses independently trained models. This work proposes a Bayesian framework for pretraining in MDA by considering that the distributions of different source domains are typically similar. The Hierarchical Bayesian Framework uses similarity between the different source data distributions to optimize the pretraining for MDA. Experiments using the proposed Bayesian framework for MDA show that our framework improves accuracy on recognition tasks for a large benchmark dataset. Performance comparison with state-of-the-art MDA methods on the challenging problem of human action recognition in the multi-domain benchmark Daily-DA RGB video shows that the proposed Bayesian framework offers a 17.29% improvement in accuracy over the state-of-the-art methods in the literature.
[CV-122] WoundNet-Ensemble: A Novel IoMT System Integrating Self-Supervised Deep Learning and Multi-Model Fusion for Automated High-Accuracy Wound Classification and Healing Progression Monitoring
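[Quick read]: This paper addresses the subjectivity, inconsistent classification, and delayed intervention that characterize clinical assessment of chronic wounds such as diabetic foot ulcers, with the aim of lowering healthcare costs and improving treatment efficiency. The proposed WoundNet-Ensemble system is built on an Internet of Medical Things (IoMT) architecture and combines three complementary deep learning models (ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer) to classify six clinically distinct wound types automatically. Its weighted fusion strategy improves accuracy by 3.7% over the previous best methods, and an integrated longitudinal healing tracker computes healing rates and severity scores and generates clinical alerts, giving remote patient monitoring and digital therapeutics a deployable AI tool.
Link: https://arxiv.org/abs/2512.18528
Authors: Moses Kiprono
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures. Code to be released publicly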
Abstract:Chronic wounds, including diabetic foot ulcers which affect up to one-third of people with diabetes, impose a substantial clinical and economic burden, with U.S. healthcare costs exceeding 25 billion dollars annually. Current wound assessment remains predominantly subjective, leading to inconsistent classification and delayed interventions. We present WoundNet-Ensemble, an Internet of Medical Things system leveraging a novel ensemble of three complementary deep learning architectures: ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer, for automated classification of six clinically distinct wound types. Our system achieves 99.90 percent ensemble accuracy on a comprehensive dataset of 5,175 wound images spanning diabetic foot ulcers, pressure ulcers, venous ulcers, thermal burns, pilonidal sinus wounds, and fungating malignant tumors. The weighted fusion strategy demonstrates a 3.7 percent improvement over previous state-of-the-art methods. Furthermore, we implement a longitudinal wound healing tracker that computes healing rates, severity scores, and generates clinical alerts. This work demonstrates a robust, accurate, and clinically deployable tool for modernizing wound care through artificial intelligence, addressing critical needs in telemedicine and remote patient monitoring. The implementation and trained models will be made publicly available to support reproducibility.
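One way to picture a weighted fusion of several classifiers is a weighted average of their per-class probabilities. The sketch below is only an illustration under stated assumptions: the function name `weighted_fusion`, the weights, and the toy probabilities are hypothetical, and the abstract does not say which fusion rule or weights the authors actually use.

```python
import numpy as np

def weighted_fusion(probs, weights):
    """Fuse per-model class probabilities with a weighted average.

    probs:   array of shape (n_models, n_samples, n_classes)
    weights: array of shape (n_models,), non-negative (hypothetical values)
    Returns the fused probabilities and the predicted class per sample.
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize so weights sum to 1
    fused = np.tensordot(weights, probs, axes=1)   # -> (n_samples, n_classes)
    return fused, fused.argmax(axis=1)

# Toy example: 3 models, 2 samples, 3 classes (the paper uses 6 wound types).
probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # stand-in for model A
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],  # stand-in for model B
    [[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]],  # stand-in for model C
]
fused, labels = weighted_fusion(probs, weights=[0.2, 0.3, 0.5])
print(labels)
```

Averaging probabilities rather than hard votes lets a confident model outweigh two uncertain ones, which is the usual motivation for weighting in this kind of ensemble.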
[CV-123] Detection of AI Generated Images Using Combined Uncertainty Measures and Particle Swarm Optimised Rejection Mechanism
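[Quick read]: This paper addresses the growing difficulty of telling natural images from AI-generated ones as generative AI output becomes more photorealistic. The key is to fuse several uncertainty signals (Fisher-information-based parameter sensitivity, entropy-based uncertainty from Monte Carlo Dropout, and the predictive variance of a Gaussian Process classifier within a Deep Kernel Learning framework) and to use Particle Swarm Optimisation to learn the weight of each signal and an adaptive rejection threshold, yielding a detection framework that adapts to distribution shift. On unseen generators such as GLIDE and Midjourney, the combined measure achieves an incorrect rejection rate of about 70% (that is, it filters out most misclassified AI samples), whereas individual measures degrade under shift. It also rejects around 61% of successful adversarial attacks.
Link: https://arxiv.org/abs/2512.18527
Authors: Rahul Yumlembam,Biju Issac,Nauman Aslam,Eaby Kollonoor Babu,Josh Collyer,Fraser Kennedy
Institutions: Northumbria University; The Alan Turing Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Scientific Reports (2025)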
Abstract:As AI-generated images become increasingly photorealistic, distinguishing them from natural images poses a growing challenge. This paper presents a robust detection framework that leverages multiple uncertainty measures to decide whether to trust or reject a model’s predictions. We focus on three complementary techniques: Fisher Information, which captures the sensitivity of model parameters to input variations; entropy-based uncertainty from Monte Carlo Dropout, which reflects predictive variability; and predictive variance from a Deep Kernel Learning framework using a Gaussian Process classifier. To integrate these diverse uncertainty signals, Particle Swarm Optimisation is used to learn optimal weightings and determine an adaptive rejection threshold. The model is trained on Stable Diffusion-generated images and evaluated on GLIDE, VQDM, Midjourney, BigGAN, and StyleGAN3, each introducing significant distribution shifts. While standard metrics such as prediction probability and Fisher-based measures perform well in distribution, their effectiveness degrades under shift. In contrast, the Combined Uncertainty measure consistently achieves an incorrect rejection rate of approximately 70 percent on unseen generators, successfully filtering most misclassified AI samples. Although the system occasionally rejects correct predictions from newer generators, this conservative behaviour is acceptable, as rejected samples can support retraining. The framework maintains high acceptance of accurate predictions for natural images and in-domain AI data. Under adversarial attacks using FGSM and PGD, the Combined Uncertainty method rejects around 61 percent of successful attacks, while GP-based uncertainty alone achieves up to 80 percent. Overall, the results demonstrate that multi-source uncertainty fusion provides a resilient and adaptive solution for AI-generated image detection.
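To make the rejection mechanism more concrete, the sketch below computes predictive entropy from Monte Carlo Dropout samples and applies a weighted-sum rejection rule. It is a minimal illustration, not the authors' implementation: the function names, the fixed weights, and the threshold are hypothetical placeholders for values that the paper learns with Particle Swarm Optimisation, and the Fisher and Gaussian Process scores are assumed to be precomputed and scaled to [0, 1].

```python
import numpy as np

def predictive_entropy(mc_probs):
    """Entropy of the mean prediction over stochastic forward passes.

    mc_probs: array of shape (n_passes, n_samples, n_classes) holding
              softmax outputs obtained with dropout kept active.
    """
    mean_probs = np.mean(mc_probs, axis=0)
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)

def combined_reject(scores, weights, threshold):
    """Reject a prediction when the weighted uncertainty exceeds the threshold.

    scores:  (n_samples, n_measures), each column assumed scaled to [0, 1]
    weights: (n_measures,) hypothetical weights (learned by PSO in the paper)
    """
    combined = np.asarray(scores, dtype=float) @ np.asarray(weights, dtype=float)
    return combined > threshold

# Two passes, two samples, two classes: sample 0 is stable, sample 1 flips.
mc_probs = np.array([
    [[0.99, 0.01], [0.9, 0.1]],
    [[0.99, 0.01], [0.1, 0.9]],
])
entropy = predictive_entropy(mc_probs)

# Columns: Fisher-based score, MC Dropout entropy, GP variance (all made up).
scores = [[0.1, 0.2, 0.1], [0.9, 0.8, 0.7]]
rejected = combined_reject(scores, weights=[0.5, 0.3, 0.2], threshold=0.5)
print(entropy, rejected)
```

The second sample has the higher entropy because its passes disagree, which is exactly the kind of prediction a rejection rule is meant to catch.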
[CV-124] GTMA: Dynamic Representation Optimization for OOD Vision-Language Models
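[Quick read]: This paper addresses the collapse of cross-modal alignment that out-of-distribution (OOD) concepts trigger in vision-language models (VLMs) in open-world applications, which sharply degrades zero-shot performance. It traces the root cause to modal asymmetry: the visual encoder can extract discriminative features from unseen images, whereas the text encoder is bound to a fixed discrete vocabulary and cannot create new semantic anchors. The solution is dynamic representation optimization, realized by the Guided Target-Matching Adaptation (GTMA) framework. At inference time GTMA builds a continuous pseudo-word embedding that best aligns with the visual anchor of an OOD image, bypassing the vocabulary limit. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm with semantic regularization that preserves plausibility and consistency with the model's prior knowledge.
Link: https://arxiv.org/abs/2512.18504
Authors: Jensen Zhang,Ningyuan Liu,Keze Wang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under submission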
Abstract:Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal asymmetry: while the visual encoder can extract discriminative features from unseen images, the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp or LoRA provide only partial remedies, as they remain confined to the pre-trained semantic space. To overcome this bottleneck, we propose dynamic representation optimization, realized through the Guided Target-Matching Adaptation (GTMA) framework. At inference time, GTMA constructs a continuous pseudo-word embedding that best aligns with an OOD image’s visual anchor, effectively bypassing vocabulary limitations. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm, which incorporates semantic regularization to preserve plausibility and compatibility with the model’s prior knowledge. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by up to 15-20 percent over the base VLM while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.
[CV-125] NASTaR: NovaSAR Automated Ship Target Recognition Dataset
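[Quick read]: This paper addresses ship type classification in Synthetic Aperture Radar (SAR) imagery, which is hard because ship types are diverse and complex, and which depends on high-quality annotated data and specialized deep learning models. The contribution is a new SAR ship target recognition dataset, NASTaR (NovaSAR Automated Ship Target Recognition). It contains 3415 ship patches extracted from NovaSAR S-band imagery with labels matched to Automatic Identification System (AIS) data, and it offers 23 fine-grained classes, inshore/offshore separation, and an auxiliary ship wake subset. Benchmark deep learning models validate the dataset across typical classification scenarios: over 60% accuracy for four major ship types, over 70% for a three-class scenario, over 75% for cargo versus tanker, and over 87% for identifying fishing vessels.
Link: https://arxiv.org/abs/2512.18503
Authors: Benyamin Hosseiny,Kamirul Kamirul,Odysseas Pappas,Alin Achim
Institutions: University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: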
Abstract:Synthetic Aperture Radar (SAR) offers a unique capability for all-weather, space-based maritime activity monitoring by capturing and imaging strong reflections from ships at sea. A well-defined challenge in this domain is ship type classification. Due to the high diversity and complexity of ship types, accurate recognition is difficult and typically requires specialized deep learning models. These models, however, depend on large, high-quality ground-truth datasets to achieve robust performance and generalization. Furthermore, the growing variety of SAR satellites operating at different frequencies and spatial resolutions has amplified the need for more annotated datasets to enhance model accuracy. To address this, we present the NovaSAR Automated Ship Target Recognition (NASTaR) dataset. This dataset comprises 3415 ship patches extracted from NovaSAR S-band imagery, with labels matched to AIS data. It includes distinctive features such as 23 unique classes, inshore/offshore separation, and an auxiliary wake dataset for patches where ship wakes are visible. We validated the dataset's applicability across prominent ship-type classification scenarios using benchmark deep learning models. Results demonstrate over 60% accuracy for classifying four major ship types, over 70% for a three-class scenario, more than 75% for distinguishing cargo from tanker ships, and over 87% for identifying fishing vessels. The NASTaR dataset is available at https://doi.org/10.5523/bris, while relevant codes for benchmarking and analysis are available at this https URL.
[CV-126] PlantDiseaseNet-RT50: A Fine-tuned ResNet50 Architecture for High-Accuracy Plant Disease Detection Beyond Standard CNNs
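[Quick read]: This paper addresses the low efficiency, high cost, and poor scalability of plant disease detection that relies on manual visual inspection, with the aim of protecting agricultural productivity and global food security. It proposes PlantDiseaseNet-RT50, a deep learning model fine-tuned from ResNet50. The key optimizations are strategically unfreezing the terminal layers of the network, a custom classification head with regularization, and dynamic learning rate scheduling through cosine decay. The model reaches approximately 98% accuracy, precision, and recall, and the authors present it as a computationally efficient diagnostic tool for practical farming settings.
Link: https://arxiv.org/abs/2512.18500
Authors: Santwana Sagnika,Manav Malhotra,Ishtaj Kaur Deol,Soumyajit Roy,Swarnav Kumar
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work is published in 2025 IEEE International Conference on Advances in Computing Research On Science Engineering and Technology (ACROSET). 6 pages, 2 figures, 2 tables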
Abstract:Plant diseases pose a significant threat to agricultural productivity and global food security, accounting for 70-80% of crop losses worldwide. Traditional detection methods rely heavily on expert visual inspection, which is time-consuming, labour-intensive, and often impractical for large-scale farming operations. In this paper, we present PlantDiseaseNet-RT50, a novel fine-tuned deep learning architecture based on ResNet50 for automated plant disease detection. Our model features strategically unfrozen layers, a custom classification head with regularization mechanisms, and dynamic learning rate scheduling through cosine decay. Using a comprehensive dataset of distinct plant disease categories across multiple crop species, PlantDiseaseNet-RT50 achieves exceptional performance with approximately 98% accuracy, precision, and recall. Our architectural modifications and optimization protocol demonstrate how targeted fine-tuning can transform a standard pretrained model into a specialized agricultural diagnostic tool. We provide a detailed account of our methodology, including the systematic unfreezing of terminal layers, implementation of batch normalization and dropout regularization and application of advanced training techniques. PlantDiseaseNet-RT50 represents a significant advancement in AI-driven agricultural tools, offering a computationally efficient solution for rapid and accurate plant disease diagnosis that can be readily implemented in practical farming contexts to support timely interventions and reduce crop losses.
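The cosine decay schedule mentioned in the abstract can be written in a few lines. The sketch below uses the standard cosine annealing formula; the function name and the example hyperparameters (`lr_max`, `lr_min`, `total_steps`) are illustrative assumptions, since the abstract does not report the values the authors used.

```python
import math

def cosine_decay_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` under a cosine decay from lr_max to lr_min.

    The schedule starts at lr_max, falls slowly at first, fastest in the
    middle, and flattens out as it approaches lr_min at total_steps.
    """
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cosine

# Hypothetical settings: 100 steps, decaying from 1e-3 toward 0.
for step in (0, 25, 50, 75, 100):
    print(step, cosine_decay_lr(step, total_steps=100, lr_max=1e-3))
```

Compared with step decay, the smooth curve avoids abrupt drops in the learning rate, which is often considered helpful when only the last few layers of a pretrained network are being fine-tuned.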
[CV-127] Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models
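[Quick read]: This paper addresses two problems: the high compute and memory cost that large-scale vision-language models (VLMs) incur when handling high-dimensional visual features, and the inability of existing compression methods, which use a fixed compression rate, to adapt to images of differing visual complexity. The proposed Adaptive-VoCo framework adds a lightweight predictor that selects the compression rate dynamically. The predictor quantifies an image's visual complexity from statistical cues in the vision encoder output, such as patch token entropy and attention map variance. A joint loss that combines rate regularization with a complexity alignment term lets the model balance inference efficiency against representational capacity, improving performance on multimodal tasks.
Link: https://arxiv.org/abs/2512.18496
Authors: Xiaoyang Guo,Keze Wang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission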
Abstract:In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational and memory costs. VoCo-LLaMA alleviates this issue by compressing visual patch tokens into a few VoCo tokens, reducing computational overhead while preserving strong cross-modal alignment. Nevertheless, such approaches typically adopt a fixed compression rate, limiting their ability to adapt to varying levels of visual complexity. To address this limitation, we propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. This predictor dynamically selects an optimal compression rate by quantifying an image’s visual complexity using statistical cues from the vision encoder, such as patch token entropy and attention map variance. Furthermore, we introduce a joint loss function that integrates rate regularization with complexity alignment. This enables the model to balance inference efficiency with representational capacity, particularly in challenging scenarios. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks, highlighting the potential of adaptive visual compression for creating more efficient and robust VLMs.
[CV-128] STORM: Search-Guided Generative World Models for Robotic Manipulation
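[Quick read]: This paper addresses long-horizon spatio-temporal reasoning in robotic manipulation. Existing Vision-Language-Action (VLA) models rely on abstract latent dynamics or delegate reasoning to the language module, which hurts interpretability and foresight. In the proposed STORM framework, a diffusion model generates candidate actions, a reward-augmented video world model simulates them as explicit visual rollouts, and Monte Carlo Tree Search (MCTS) performs lookahead evaluation and plan refinement, giving an interpretable, foresight-driven decision process grounded in visual perception. The method improves task success rate and planning robustness, particularly in long-horizon manipulation.
Link: https://arxiv.org/abs/2512.18477
Authors: Wenjun Lin,Jensen Zhang,Kaitong Cai,Keze Wang
Institutions: Sun Yat-sen University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission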
Abstract:We present STORM (Search-Guided Generative World Models), a novel framework for spatio-temporal reasoning in robotic manipulation that unifies diffusion-based action generation, conditional video prediction, and search-based planning. Unlike prior Vision-Language-Action (VLA) models that rely on abstract latent dynamics or delegate reasoning to language components, STORM grounds planning in explicit visual rollouts, enabling interpretable and foresight-driven decision-making. A diffusion-based VLA policy proposes diverse candidate actions, a generative video world model simulates their visual and reward outcomes, and Monte Carlo Tree Search (MCTS) selectively refines plans through lookahead evaluation. Experiments on the SimplerEnv manipulation benchmark demonstrate that STORM achieves a new state-of-the-art average success rate of 51.0 percent, outperforming strong baselines such as CogACT. Reward-augmented video prediction substantially improves spatio-temporal fidelity and task relevance, reducing Frechet Video Distance by over 75 percent. Moreover, STORM exhibits robust re-planning and failure recovery behavior, highlighting the advantages of search-guided generative world models for long-horizon robotic manipulation.
[CV-129] Plasticine: A Traceable Diffusion Model for Medical Image Translation
[Quick read]: This paper addresses domain gaps in medical image analysis caused by differences in imaging devices and population distributions, and in particular the failure of existing image-to-image translation methods to preserve pixel-level spatial correspondence (traceability) when generating synthetic data. It proposes Plasticine, a framework that, according to the authors, is the first to treat traceability as a core objective. By jointly modeling intensity translation and spatial deformation within a denoising diffusion architecture, it produces translations with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise correspondence throughout the translation process.
Link: https://arxiv.org/abs/2512.18455
Authors: Tianyang Zhanng,Xinxing Cheng,Jun Cheng,Shaoming Zheng,He Zhao,Huazhu Fu,Alejandro F Frangi,Jiang Liu,Jinming Duan
Institutions: University of Birmingham; Institute for Infocomm Research, A*STAR; Imperial College London; University of Liverpool; Institute of High Performance Computing, A*STAR; University of Manchester; Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Artificial Intelligence
Abstract:Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.
[CV-130] NOVA: Discovering Well-Conditioned Winograd Transforms through Numerical Optimization of Vandermonde Arithmetic
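[Quick read]: This paper addresses the numerical instability that makes Winograd convolution fail under low-precision arithmetic such as FP16 or Int8. With large tiles such as F(8,3), the condition number of the standard integer-based transforms rises sharply (κ = 2×10⁵), and inference accuracy collapses. The proposed NOVA (Numerical Optimization of Vandermonde Arithmetic) framework treats Winograd point selection as a continuous optimization problem: it searches the manifold ℝⁿ⁻¹ with an evolution strategy, snaps candidates to simple rationals (such as ±5/6 and ±7/6), and confirms mathematical correctness by symbolic verification. This exposes stable fractional configurations that the traditional integer constraint overlooks. Conditioning of F(8,3) improves by 415x in 1D and by 172,484x in 2D, which restores FP16 image classification accuracy without retraining or calibration (for example, VGG16 on ImageNet goes from 4.7% to between 75% and 78%).
Link: https://arxiv.org/abs/2512.18453
Authors: Jayant Lohia
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: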
Abstract:Winograd convolution is the standard algorithm for efficient inference, reducing arithmetic complexity by 2.25x for 3x3 kernels. However, it faces a critical barrier in the modern era of low precision computing: numerical instability. As tiles scale to maximize efficiency (e.g., F(6,3), F(8,3)), the condition numbers of standard integer based transforms explode, reaching \kappa = 2 \times 10^5 for F(8,3), rendering them unusable in FP16 or Int8. We introduce NOVA (Numerical Optimization of Vandermonde Arithmetic), a discovery framework that breaks the decades old convention of integer interpolation. Treating Winograd point selection as a continuous optimization problem, NOVA searches the manifold \mathbb{R}^{n-1} via Evolution Strategy, snaps candidates to simple rationals, and guarantees correctness via symbolic verification. This process uncovers a hidden landscape of stable, fractional configurations such as ±5/6, ±7/6, ±3/5 that defy traditional vocabulary constraints. The impact is transformative: NOVA improves the conditioning of F(8,3) by 415x in 1D, which squares to a 172,484x improvement for 2D convolution. In real world FP16 ImageNet inference, where standard transforms collapse to random chance (e.g., 4.7 percent accuracy on VGG16), NOVA’s points restore full accuracy (75 to 78 percent), recovering over 70 percentage points without retraining, calibration, or learned parameters. These discovered transforms act as drop in replacements, effectively unlocking the efficiency of large tile Winograd convolution for next generation hardware.
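The effect of point selection on conditioning can be explored numerically. The sketch below compares the 2-norm condition number of a plain Vandermonde matrix built on integer points with one built on small rational points, and checks that nesting a 1D transform into 2D (a Kronecker product) squares the condition number. It is an illustration under loud assumptions: both point sets are hypothetical and are not the sets NOVA discovers, and a plain Vandermonde matrix is only a proxy for the full Winograd transform matrices, which also involve a point at infinity.

```python
import math
import numpy as np

def vandermonde_condition(points):
    """2-norm condition number of the square Vandermonde matrix on `points`."""
    V = np.vander(np.asarray(points, dtype=np.float64), increasing=True)
    return float(np.linalg.cond(V))

# Hypothetical point sets, chosen only to contrast integers with small rationals.
integer_points = [0, 1, -1, 2, -2, 3, -3, 4, -4]
fractional_points = [0, 1/2, -1/2, 5/6, -5/6, 1, -1, 7/6, -7/6]

kappa_int = vandermonde_condition(integer_points)
kappa_frac = vandermonde_condition(fractional_points)
print(f"integer points:    kappa = {kappa_int:.3e}")
print(f"fractional points: kappa = {kappa_frac:.3e}")

# Singular values of a Kronecker product are pairwise products of the factors'
# singular values, so the 2D condition number is the 1D one squared.
V = np.vander(np.asarray(fractional_points, dtype=np.float64), increasing=True)
kappa_2d = float(np.linalg.cond(np.kron(V, V)))
print(f"2D kappa = {kappa_2d:.3e}, 1D kappa squared = {kappa_frac ** 2:.3e}")

# The reported 2D factor of 172,484 implies an unrounded 1D factor of about 415.3.
implied_1d_factor = math.sqrt(172484)
print(f"implied 1D improvement factor = {implied_1d_factor:.2f}")
```

Keeping the points close to the unit interval keeps high powers small, which is the intuition for why fractional points can be far better conditioned than integers such as ±3 or ±4.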
[CV-131] Agent-Based Output Drift Detection for Breast Cancer Response Prediction in a Multisite Clinical Decision Support System
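[Quick read]: This paper addresses the performance degradation of predictive models in multisite clinical AI systems caused by differences in patient populations, imaging hardware, and acquisition protocols, that is, the effect of distributional shift on model reliability. It proposes an agent-based framework that assigns a drift monitoring agent to each independent medical imaging institution. Each agent compares local model outputs against a reference distribution batch by batch, which allows drift to be detected and its severity assessed. When no site-specific reference is available, the adaptive scheme performs best, with F1-scores of 74.3% for drift detection and 83.7% for drift severity classification, suggesting that the approach can make multisite clinical decision support systems more reliable.
Link: https://arxiv.org/abs/2512.18450
Authors: Xavier Rafael-Palou,Jose Munuera,Ana Jimenez-Pastor,Richard Osuala,Karim Lekadir,Oliver Diaz
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025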
Abstract:Modern clinical decision support systems can concurrently serve multiple, independent medical imaging institutions, but their predictive performance may degrade across sites due to variations in patient populations, imaging hardware, and acquisition protocols. Continuous surveillance of predictive model outputs offers a safe and reliable approach for identifying such distributional shifts without ground truth labels. However, most existing methods rely on centralized monitoring of aggregated predictions, overlooking site-specific drift dynamics. We propose an agent-based framework for detecting drift and assessing its severity in multisite clinical AI systems. To evaluate its effectiveness, we simulate a multi-center environment for output-based drift detection, assigning each site a drift monitoring agent that performs batch-wise comparisons of model outputs against a reference distribution. We analyse several multi-center monitoring schemes that differ in how the reference is obtained (site-specific, global, production-only and adaptive), alongside a centralized baseline. Results on real-world breast cancer imaging data using a pathological complete response prediction model show that all multi-center schemes outperform centralized monitoring, with F1-score improvements up to 10.3% in drift detection. In the absence of site-specific references, the adaptive scheme performs best, with F1-scores of 74.3% for drift detection and 83.7% for drift severity classification. These findings suggest that adaptive, site-aware agent-based drift monitoring can enhance reliability of multisite clinical decision support systems.
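A per-site drift agent that compares each batch of model outputs against a reference distribution might look like the sketch below. It is a minimal illustration, not the authors' method: the two-sample Kolmogorov-Smirnov statistic is an assumed stand-in for whatever comparison the paper uses, and the class name `DriftAgent`, the threshold of 0.2, and the simulated Beta-distributed scores are all hypothetical.

```python
import numpy as np

def ks_statistic(reference, batch):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    batch = np.sort(np.asarray(batch, dtype=float))
    grid = np.concatenate([reference, batch])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_batch = np.searchsorted(batch, grid, side="right") / batch.size
    return float(np.max(np.abs(cdf_ref - cdf_batch)))

class DriftAgent:
    """Hypothetical per-site monitor holding a reference set of model outputs."""

    def __init__(self, reference, threshold=0.2):
        self.reference = np.asarray(reference, dtype=float)
        self.threshold = threshold  # assumed value, would need tuning per site

    def check(self, batch):
        """Return (statistic, drift_flag) for one batch of model outputs."""
        stat = ks_statistic(self.reference, batch)
        return stat, stat > self.threshold

# Simulated predicted probabilities: one in-distribution batch, one shifted batch.
rng = np.random.default_rng(0)
agent = DriftAgent(reference=rng.beta(2, 5, size=1000))
stat_same, drift_same = agent.check(rng.beta(2, 5, size=200))
stat_shift, drift_shift = agent.check(rng.beta(5, 2, size=200))
print(stat_same, drift_same, stat_shift, drift_shift)
```

Because the check uses only model outputs, it needs no ground truth labels, which matches the label-free monitoring setting the abstract describes.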
[CV-132] Object-Centric Framework for Video Moment Retrieval AAAI2026
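[Quick read]: This paper addresses the limited localization accuracy of existing video moment retrieval methods on object-oriented queries, which stems from their lack of fine-grained modeling of object semantics and appearance. Conventional methods rely on global visual and semantic features at the frame or clip level and struggle to capture temporal dynamics at the object level, which limits them in scenarios that need detailed object-level reasoning. The proposed object-centric framework first extracts query-relevant objects with a scene graph parser and builds scene graphs from video frames to represent the objects and their relationships. It then constructs object-level feature sequences that encode rich visual and semantic information and models spatio-temporal correlations among objects over time with a relational tracklet transformer. Explicitly capturing object state changes yields more accurate localization of moments that match object-oriented queries.
Link: https://arxiv.org/abs/2512.18448
Authors: Zongyao Li,Yongkang Wong,Satoshi Yamazaki,Jianquan Liu,Mohan Kankanhalli
Institutions: National University of Singapore; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI2026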
Abstract:Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.
[CV-133] MeniMV: A Multi-view Benchmark for Meniscus Injury Severity Grading
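[Quick read]: This paper addresses the lack of precise grading of meniscal horn tears in automated magnetic resonance imaging (MRI) analysis of the knee. Existing methods mostly rely on coarse study-level labels or binary classification and provide neither localization nor severity information, which limits their clinical value. The contribution is MeniMV, a multi-view benchmark dataset of 3,000 annotated knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images. Chief orthopedic physicians verified the annotations, which assign four-tier (grade 0-3) severity labels to the anterior and posterior horns separately. MeniMV offers more than double the pathology-labeled data volume of prior datasets and captures the dual-view diagnostic context used in clinical practice, giving CNN- and Transformer-based grading models strong baselines and a challenging task.
Link: https://arxiv.org/abs/2512.18437
Authors: Shurui Xu,Siqi Yang,Jiapin Ren,Zhong Cao,Hongwei Yang,Mengzhen Fan,Yuyu Sun,Shuyan Li
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures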
Abstract:Precise grading of meniscal horn tears is critical in knee injury diagnosis but remains underexplored in automated MRI analysis. Existing methods often rely on coarse study-level labels or binary classification, lacking localization and severity information. In this paper, we introduce MeniMV, a multi-view benchmark dataset specifically designed for horn-specific meniscus injury grading. MeniMV comprises 3,000 annotated knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images. Each exam is meticulously annotated with four-tier (grade 0-3) severity labels for both anterior and posterior meniscal horns, verified by chief orthopedic physicians. Notably, MeniMV offers more than double the pathology-labeled data volume of prior datasets while uniquely capturing the dual-view diagnostic context essential in clinical practice. To demonstrate the utility of MeniMV, we benchmark multiple state-of-the-art CNN and Transformer-based models. Our extensive experiments establish strong baselines and highlight challenges in severity grading, providing a valuable foundation for future research in automated musculoskeletal imaging.
[CV-134] E-RGB-D: Real-Time Event-Based Perception with Structured Light
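[Quick read]: This paper addresses two limitations of conventional monochrome event-based cameras (ECs): they detect static or slowly moving objects poorly, and they capture no color information, which rules them out for applications that need color perception. The solution adds a Digital Light Processing (DLP) projector to form an Active Structured Light (ASL) system. Dynamically adjusting the projected patterns acquires color information on demand, and together with the event camera the system obtains color and depth separately for each pixel, giving high-speed, frameless RGB-D sensing without sacrificing spatial resolution. Built from a commercial TI LightCrafter 4500 projector and a monocular monochrome event camera, the system reaches a color detection speed equivalent to 1400 fps and 4 kHz pixel depth detection.
Link: https://arxiv.org/abs/2512.18429
Authors: Seyed Ehsan Marjani Bajestani,Giovanni Beltrame
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: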
Abstract:Event-based cameras (ECs) have emerged as bio-inspired sensors that report pixel brightness changes asynchronously, offering unmatched speed and efficiency in vision sensing. Despite their high dynamic range, temporal resolution, low power consumption, and computational simplicity, traditional monochrome ECs face limitations in detecting static or slowly moving objects and lack color information essential for certain applications. To address these challenges, we present a novel approach that integrates a Digital Light Processing (DLP) projector, forming Active Structured Light (ASL) for RGB-D sensing. By combining the benefits of ECs and projection-based techniques, our method enables the detection of color and the depth of each pixel separately. Dynamic projection adjustments optimize bandwidth, ensuring selective color data acquisition and yielding colorful point clouds without sacrificing spatial resolution. This integration, facilitated by a commercial TI LightCrafter 4500 projector and a monocular monochrome EC, not only enables frameless RGB-D sensing applications but also achieves remarkable performance milestones. With our approach, we achieved a color detection speed equivalent to 1400 fps and 4 kHz of pixel depth detection, significantly advancing the realm of computer vision across diverse fields from robotics to 3D reconstruction methods. Our code is publicly available: this https URL
[CV-135] AmPLe: Supporting Vision-Language Models via Adaptive-Debiased Ensemble Multi-Prompt Learning
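[Quick read]: This paper addresses two biases in multi-prompt learning. The first is model-prompt matching bias: the same prompt can convey different semantics in different vision-language models (for example CLIP-ViT-B/16 and CLIP-ViT-B/32), so identical prompts yield inconsistent predictions. The second is sample-prompt matching bias, which arises from prompt-irrelevant semantics in the input samples; computing ensemble weights from all of the sample information then degrades performance. The proposed method, Adaptive-Debiased Ensemble MultiPrompt Learning (AmPLe), uses an information-theoretic analysis to extract prompt-relevant semantics from each sample and computes debiased ensemble weights adaptively from them, mitigating both biases at once and improving generalization on downstream tasks.
Link: https://arxiv.org/abs/2512.18411
Authors: Fei Song,Yi Li,Jiangmeng Li,Rui Wang,Changwen Zheng,Fanjiang Xu,Hui Xiong
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCV2025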
Abstract:Multi-prompt learning methods have emerged as an effective approach for facilitating the rapid adaptation of vision-language models to downstream tasks with limited resources. Existing multi-prompt learning methods primarily focus on utilizing various meticulously designed prompts within a single foundation vision-language model to achieve superior performance. However, the overlooked model-prompt matching bias hinders the development of multi-prompt learning, i.e., the same prompt can convey different semantics across distinct vision-language models, such as CLIP-ViT-B/16 and CLIP-ViT-B/32, resulting in inconsistent predictions for an identical prompt. To mitigate the impact of this bias on downstream tasks, we explore an ensemble learning approach to sufficiently aggregate the benefits of diverse predictions. Additionally, we further disclose the presence of sample-prompt matching bias, which originates from the prompt-irrelevant semantics encapsulated in the input samples. Thus, directly utilizing all information from the input samples for generating weights of ensemble learning can lead to suboptimal performance. In response, we extract prompt-relevant semantics from input samples, guided by an information-theoretic analysis, and adaptively calculate debiased ensemble weights. Overall, we propose Adaptive-Debiased Ensemble MultiPrompt Learning, abbreviated as AmPLe, to mitigate the two types of bias simultaneously. Extensive experiments on three representative tasks, i.e., generalization to novel classes, new target datasets, and unseen domain shifts, show that AmPLe can widely outperform existing methods. Theoretical validation from a causal perspective further supports the effectiveness of AmPLe.
[CV-136] Through the PRISm: Importance-Aware Scene Graphs for Image Retrieval
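【Quick read】: This paper addresses the failure of traditional methods to capture the relational and contextual details of a scene in semantic-similarity image retrieval. The key to the solution is PRISm, a pruning-based image retrieval framework built on two core components. The Importance Prediction Module identifies and retains the most semantically important objects and relational triplets in an image while pruning irrelevant elements. The Edge-Aware Graph Neural Network explicitly models the relational structure between objects and fuses global visual features to produce semantically informed image embeddings. The method brings retrieval results closer to human perception and yields interpretable, semantically precise image retrieval.
Link: https://arxiv.org/abs/2512.18407
Authors: Dimitrios Georgoulopoulos, Nikolaos Chaidos, Angeliki Dimitriou, Giorgos Stamou
Affiliations: National Technical University of Athens
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures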
Abstract:Accurately retrieving images that are semantically similar remains a fundamental challenge in computer vision, as traditional methods often fail to capture the relational and contextual nuances of a scene. We introduce PRISm (Pruning-based Image Retrieval via Importance Prediction on Semantic Graphs), a multimodal framework that advances image-to-image retrieval through two novel components. First, the Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image while pruning irrelevant elements. Second, the Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings. PRISm achieves image retrieval that closely aligns with human perception by explicitly modeling the semantic importance of objects and their interactions, capabilities largely absent in prior approaches. Its architecture effectively combines relational reasoning with visual representation, enabling semantically grounded retrieval. Extensive experiments on benchmark and real-world datasets demonstrate consistently superior top-ranked performance, while qualitative analyses show that PRISm accurately captures key objects and interactions, producing interpretable and semantically meaningful results.
[CV-137] Automated Mosaic Tesserae Segmentation via Deep Learning Techniques
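【Quick read】: This paper addresses the digital preservation of mosaics at heritage sites. The core challenge is to segment the small pieces of a mosaic (tesserae) accurately from the background so that the mosaic can be digitally reconstructed with high precision. The key to the solution is to take Segment Anything Model 2 (SAM 2), a foundation model from Meta AI, and fine-tune it on a newly annotated dataset of mosaic images. On the authors' test set, fine-tuning raises Intersection over Union (IoU) from 89.00% to 91.02% and Recall from 92.12% to 95.89%. On an existing benchmark, the F-measure is 3% higher than previous methods and the gap between predicted and actual tesserae counts falls from 0.20 to 0.02, supporting the method's effectiveness for automatic mosaic segmentation.
Link: https://arxiv.org/abs/2512.18406
Authors: Charilaos Kapelonis, Marios Antonakakis, Konstantinos Politof, Aristomenis Antoniadis, Michalis Zervakis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: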
Abstract:Art is widely recognized as a reflection of civilization and mosaics represent an important part of cultural heritage. Mosaics are an ancient art form created by arranging small pieces, called tesserae, on a surface using adhesive. Due to their age and fragility, they are prone to damage, highlighting the need for digital preservation. This paper addresses the problem of digitizing mosaics by segmenting the tesserae to separate them from the background within the broader field of Image Segmentation in Computer Vision. We propose a method leveraging Segment Anything Model 2 (SAM 2) by Meta AI, a foundation model that outperforms most conventional segmentation models, to automatically segment mosaics. Due to the limited open datasets in the field, we also create an annotated dataset of mosaic images to fine-tune and evaluate the model. Quantitative evaluation on our testing dataset shows notable improvements compared to the baseline SAM 2 model, with Intersection over Union increasing from 89.00% to 91.02% and Recall from 92.12% to 95.89%. Additionally, on a benchmark proposed by a prior approach, our model achieves an F-measure 3% higher than previous methods and reduces the error in the absolute difference between predicted and actual tesserae from 0.20 to just 0.02. The notable performance of the fine-tuned SAM 2 model together with the newly annotated dataset can pave the way for real-time segmentation of mosaic images.
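For readers unfamiliar with the two segmentation metrics quoted above, the following minimal sketch computes Intersection over Union and Recall for a pair of binary masks. It assumes masks are equally sized nested lists of 0/1 values, with 1 marking tesserae pixels; `iou_and_recall` is a hypothetical helper and not the paper's evaluation code.

```python
def iou_and_recall(pred, truth):
    """Return (IoU, recall) for two equally sized binary masks."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # predicted foreground, truly foreground
            elif p:
                fp += 1  # predicted foreground, truly background
            elif t:
                fn += 1  # missed foreground pixel
    union = tp + fp + fn
    # Convention assumed here: two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return iou, recall


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    annotated = [[1, 0, 0],
                 [0, 1, 1]]
    print(iou_and_recall(predicted, annotated))
```

IoU penalizes both false positives and misses, while Recall only penalizes misses, which is why the two numbers can move by different amounts after fine-tuning.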
[CV-138] RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion
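【Quick read】: This paper addresses the limitations of existing 3D scene representations in handling discrete scene changes and building interactive 3D environments, in particular the lack of incremental fusion across multiple scene states and of object-level manipulation. The key to the solution is RecurGS, a recurrent fusion framework that detects object-level changes between consecutive states, aligns geometric motion using semantic correspondence and Lie-algebra-based SE(3) refinement, and applies recurrent updates with replay supervision to preserve historical structure. A voxelized, visibility-aware fusion module selectively integrates newly observed regions while keeping existing structure stable, which mitigates catastrophic forgetting and enables efficient long-horizon updates and high-quality novel view synthesis.
Link: https://arxiv.org/abs/2512.18386
Authors: Wenhao Hu, Haonan Zhou, Zesheng Li, Liu Liu, Jiacheng Dong, Zhizhong Su, Gaoang Wang
Affiliations: Zhejiang University; Horizon Robotics; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: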
Abstract:Recent advances in 3D scene representations have enabled high-fidelity novel view synthesis, yet adapting to discrete scene changes and constructing interactive 3D environments remain open challenges in vision and robotics. Existing approaches focus solely on updating a single scene without supporting novel-state synthesis. Others rely on diffusion-based object-background decoupling that works on one state at a time and cannot fuse information across multiple observations. To address these limitations, we introduce RecurGS, a recurrent fusion framework that incrementally integrates discrete Gaussian scene states into a single evolving representation capable of interaction. RecurGS detects object-level changes across consecutive states, aligns their geometric motion using semantic correspondence and Lie-algebra based SE(3) refinement, and performs recurrent updates that preserve historical structures through replay supervision. A voxelized, visibility-aware fusion module selectively incorporates newly observed regions while keeping stable areas fixed, mitigating catastrophic forgetting and enabling efficient long-horizon updates. RecurGS supports object-level manipulation, synthesizes novel scene states without requiring additional scans, and maintains photorealistic fidelity across evolving environments. Extensive experiments across synthetic and real-world datasets demonstrate that our framework delivers high-quality reconstructions with substantially improved update efficiency, providing a scalable step toward continuously interactive Gaussian worlds.
[CV-139] Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance
【Quick read】: This paper addresses the high memory and runtime cost of current zero-shot image inpainting and editing methods, which stems from their reliance on complex surrogate likelihood functions. Existing methods typically need a vector-Jacobian product at every reverse diffusion step, which means backpropagating through the denoiser network and makes inference considerably more expensive. The key to the solution is a new likelihood surrogate that induces simple, easy-to-sample Gaussian posterior transitions. This avoids backpropagation through the denoiser and substantially lowers inference cost, while retaining strong observation consistency relative to fine-tuned baselines and high-quality reconstructions.
Link: https://arxiv.org/abs/2512.18365
Authors: Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati
Affiliations: CMAP, Ecole Polytechnique; Institute of Foundation Models; MBZUAI; EPITA; Sorbonne University; Lagrange Mathematics and Computing Research Center; EPITA Research Lab; KTH Royal Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: preprint
Abstract:Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure however requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields simple and efficient to sample Gaussian posterior transitions, sidestepping the backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost. Code is available at this https URL.
[CV-140] Enhancing 3D Semantic Scene Completion with a Refinement Module
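【Quick read】: This paper addresses the limited accuracy of Semantic Scene Completion (SSC) predictions, in particular the weak detail and geometric consistency of the coarse voxel predictions produced by existing models. The key to the solution is ESSC-RM, a plug-and-play enhancement framework with two modules: a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and a Voxel-level Local Geometry Module (VLGM). Trained together under multiscale supervision, they refine the initial coarse prediction. On SemanticKITTI, mean IoU rises from 16.87% to 17.27% when ESSC-RM is integrated into CGFormer and from 11.08% to 11.51% when integrated into MonoScene.
Link: https://arxiv.org/abs/2512.18363
Authors: Dunxing Zhang (3), Jiachen Lu (3), Han Yang (1 and 2), Lei Bao (1 and 2), Bo Song (1 and 2)
Affiliations: (1) National Science Center for Earthquake Engineering, Tianjin University, Tianjin, China; (2) School of Civil Engineering, Tianjin University, Tianjin, China; (3) Technical University of Munich, Munich, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 8 figures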
Abstract:We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models.
[CV-141] MCVI-SANet: A lightweight semi-supervised model for LAI and SPAD estimation of winter wheat under vegetation index saturation
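【Quick read】: This paper addresses inaccurate estimation of leaf area index (LAI) and chlorophyll content (SPAD) in winter wheat, which is caused by vegetation index (VI) saturation during the dense canopy stage and by the scarcity of ground-truth annotations. Existing machine learning methods based on VIs and texture features have limited feature expressiveness, while deep learning baselines generalize poorly because of large domain gaps and high data demands. The key to the solution is a lightweight semi-supervised vision model, the Multi-Channel Vegetation Indices Saturation Aware Net (MCVI-SANet). Its main contributions are: 1) a Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement; 2) a VICReg-based semi-supervised learning strategy to improve generalization; 3) a vegetation height-informed dataset partitioning strategy that keeps samples representative across growth stages. Experiments show that the method outperforms the best baselines on both LAI and SPAD estimation with only 0.10M parameters and fast inference.
Link: https://arxiv.org/abs/2512.18344
Authors: Zhiheng Zhang, Jiajun Yang, Hong Sun, Dong Wang, Honghua Jiang, Yaru Chen, Tangyuan Ning
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: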
Abstract:Vegetation index (VI) saturation during the dense canopy stage and limited ground-truth annotations of winter wheat constrain accurate estimation of LAI and SPAD. Existing VI-based and texture-driven machine learning methods exhibit limited feature expressiveness. In addition, deep learning baselines suffer from domain gaps and high data demands, which restrict their generalization. Therefore, this study proposes the Multi-Channel Vegetation Indices Saturation Aware Net (MCVI-SANet), a lightweight semi-supervised vision model. The model incorporates a newly designed Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement. It also integrates a VICReg-based semi-supervised strategy to further improve generalization. Datasets were partitioned using a vegetation height-informed strategy to maintain representativeness across growth stages. Experiments over 10 repeated runs demonstrate that MCVI-SANet achieves state-of-the-art accuracy. The model attains an average R2 of 0.8123 and RMSE of 0.4796 for LAI, and an average R2 of 0.6846 and RMSE of 2.4222 for SPAD. This performance surpasses the best-performing baselines, with improvements of 8.95% in average LAI R2 and 8.17% in average SPAD R2. Moreover, MCVI-SANet maintains high inference speed with only 0.10M parameters. Overall, the integration of semi-supervised learning with agronomic priors provides a promising approach for enhancing remote sensing-based precision agriculture.
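The two regression metrics reported above can be reproduced with a few lines. The sketch below uses the standard definitions of the coefficient of determination (R2 in the abstract) and root-mean-square error; `r2_and_rmse` is a hypothetical helper name and the sample values are made up for illustration, not taken from the paper.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for paired ground-truth and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error left after the model's prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variance of the ground truth around its mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all ground-truth values are equal")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


if __name__ == "__main__":
    measured_lai = [1.0, 2.0, 3.0, 4.0]   # illustrative values only
    estimated_lai = [1.0, 2.0, 3.0, 5.0]
    print(r2_and_rmse(measured_lai, estimated_lai))
```

RMSE is expressed in the units of the target (LAI or SPAD readings), whereas R2 is unitless, so the two are usually reported together.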
[CV-142] A two-stream network with global-local feature fusion for bone age assessment
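【Quick read】: This paper addresses the difficulty that current deep learning methods for Bone Age Assessment (BAA) have in efficiently balancing global features against local skeletal details. The key to the solution is BoNet+, a two-stream deep learning architecture. The global feature extraction channel adds a Transformer module whose multi-head self-attention improves understanding of the overall skeletal structure. The local feature extraction channel integrates an RFAConv module that generates adaptive attention maps over multiscale receptive fields to capture local detail. Global and local features are then concatenated along the channel dimension and refined by an Inception-V3 network, giving a more accurate, automatic, and objective bone age assessment.
Link: https://arxiv.org/abs/2512.18331
Authors: Qiong Lou, Han Yang, Fang Lu
Affiliations: Zhejiang University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: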
Abstract:Bone Age Assessment (BAA) is a widely used clinical technique that can accurately reflect an individual’s growth and development level, as well as maturity. In recent years, although deep learning has advanced the field of bone age assessment, existing methods face challenges in efficiently balancing global features and local skeletal details. This study aims to develop an automated bone age assessment system based on a two-stream deep learning architecture to achieve higher accuracy in bone age assessment. We propose the BoNet+ model incorporating global and local feature extraction channels. A Transformer module is introduced into the global feature extraction channel to enhance global feature extraction through a multi-head self-attention mechanism. An RFAConv module is incorporated into the local feature extraction channel to generate adaptive attention maps within multiscale receptive fields, enhancing local feature extraction capabilities. Global and local features are concatenated along the channel dimension and optimized by an Inception-V3 network. The proposed method has been validated on the Radiological Society of North America (RSNA) and Radiological Hand Pose Estimation (RHPE) test datasets, achieving mean absolute errors (MAEs) of 3.81 and 5.65 months, respectively. These results are comparable to the state-of-the-art. The BoNet+ model reduces the clinical workload and achieves automatic, high-precision, and more objective bone age assessment.
[CV-143] Asynchronous Pipeline Parallelism for Real-Time Multilingual Lip Synchronization in Video Communication Systems
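【Quick read】: This paper addresses the high latency and inefficiency of multilingual lip synchronization in real-time video conferencing, especially the difficulty of balancing accuracy and computational efficiency in resource-constrained settings. The core solution is a parallel, asynchronous Transformer framework. A pipeline-parallel design runs the translation, speech processing, and lip-synchronization modules concurrently, decoupled through message queues, which reduces end-to-end latency by up to 3.1 times compared with sequential approaches. Key elements include low-level graph compilation, mixed-precision quantization, and hardware-accelerated kernel fusion to speed up inference, plus a context-adaptive silence-detection component that segments speech at semantically coherent boundaries, improving translation consistency and temporal alignment across languages. The architecture preserves model accuracy and visual quality while delivering low-latency multimodal communication suited to Internet of Things (IoT) scenarios such as telemedicine and multilingual kiosks.
Link: https://arxiv.org/abs/2512.18318
Authors: Eren Caglar, Amirkia Rafiei Oskooei, Mehmet Kutanoglu, Mustafa Keles, Mehmet S. Aktas
Affiliations: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Comments: Accepted to IEEE Big Data 2025, AIDE4IoT Workshop. Copyright © 2025 IEEE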
Abstract:This paper introduces a parallel and asynchronous Transformer framework designed for efficient and accurate multilingual lip synchronization in real-time video conferencing systems. The proposed architecture integrates translation, speech processing, and lip-synchronization modules within a pipeline-parallel design that enables concurrent module execution through message-queue-based decoupling, reducing end-to-end latency by up to 3.1 times compared to sequential approaches. To enhance computational efficiency and throughput, the inference workflow of each module is optimized through low-level graph compilation, mixed-precision quantization, and hardware-accelerated kernel fusion. These optimizations provide substantial gains in efficiency while preserving model accuracy and visual quality. In addition, a context-adaptive silence-detection component segments the input speech stream at semantically coherent boundaries, improving translation consistency and temporal alignment across languages. Experimental results demonstrate that the proposed parallel architecture outperforms conventional sequential pipelines in processing speed, synchronization stability, and resource utilization. The modular, message-oriented design makes this work applicable to resource-constrained IoT communication scenarios including telemedicine, multilingual kiosks, and remote assistance systems. Overall, this work advances the development of low-latency, resource-efficient multimodal communication frameworks for next-generation AIoT systems.
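The message-queue decoupling described above can be sketched with Python's standard `queue` and `threading` modules. This is a toy illustration under stated assumptions: the three stage functions are string-returning placeholders standing in for real translation, speech, and lip-sync models, and `stage` and `run_pipeline` are hypothetical names rather than the authors' implementation.

```python
import queue
import threading

SENTINEL = object()  # unique end-of-stream marker


def stage(worker, inbox, outbox):
    """Apply `worker` to each item from `inbox` and forward results to `outbox`."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(worker(item))


def run_pipeline(segments, workers):
    """Run one thread per stage; stages overlap on different segments."""
    queues = [queue.Queue() for _ in range(len(workers) + 1)]
    threads = [
        threading.Thread(target=stage, args=(w, queues[i], queues[i + 1]))
        for i, w in enumerate(workers)
    ]
    for t in threads:
        t.start()
    for seg in segments:
        queues[0].put(seg)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Placeholder stages (assumptions, not real models).
def translate(text):
    return f"translated({text})"


def synthesize(text):
    return f"speech({text})"


def lip_sync(audio):
    return f"lipsync({audio})"


if __name__ == "__main__":
    print(run_pipeline(["seg-1", "seg-2"], [translate, synthesize, lip_sync]))
```

Because each stage is a single consumer of a FIFO queue, segment order is preserved while later segments can enter translation before earlier ones finish lip synchronization.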
[CV-144] MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
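【Quick read】: This paper addresses the time-consuming yet essential manual modeling of material parameters (such as albedo, roughness, and metallicity) and 3D geometry in game and film production. Existing 3D reconstruction methods approximate scene geometry and appearance accurately, but fall short in relighting scenarios because they lack precise, spatially varying material parameters. Diffusion models operating on 2D images predict physically based rendering (PBR) properties well, yet transferring these 2D material maps onto reconstructed 3D geometry remains a major challenge. The key to the solution is a framework that fuses 2D material data with 3D geometry. Scene geometry is first reconstructed via Gaussian Splatting, and a diffusion model generates 2D PBR material maps. The material parameters are then integrated into the 3D representation either by optimizing an image-based loss or by projecting them directly onto the Gaussians with Gaussian ray tracing. A lightweight neural refinement step (Neural Merger) takes the ray-traced material features as input and produces fine adjustments that improve multi-view consistency. The method improves the relightability and visual realism of reconstructed scenes and outperforms existing techniques in both quantitative metrics and perceived quality.
Link: https://arxiv.org/abs/2512.18314
Authors: Philipp Langsteiner, Jan-Niklas Dihlmann, Hendrik P.A. Lensch
Affiliations: University of Tübingen
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL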
Abstract:Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.
[CV-145] MatE: Material Extraction from Single-Image via Geometric Prior
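【Quick read】: This paper addresses the bottleneck that generating high-fidelity physically based rendering (PBR) materials depends on specialized equipment and expert post-processing, and aims to build tileable PBR material maps (albedo, normal, roughness, and height) automatically from a single image taken in an arbitrary real-world scene. The key to the solution is a new method called MatE. It first performs coarse rectification using an estimated depth map as a geometric prior, then applies a dual-branch diffusion model that, having learned consistency from rotation-aligned and scale-aligned training data, corrects residual distortions and outputs the complete set of material maps. The framework is invariant to the unknown illumination and perspective of the input image, so it can recover intrinsic material properties from images captured under uncontrolled conditions.
Link: https://arxiv.org/abs/2512.18312
Authors: Zeyu Zhang, Wei Zhai, Jian Yang, Yang Cao
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures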
Abstract:The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectifies residual distortions from the coarse result and translates it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world images.
[CV-146] Pyramidal Adaptive Cross-Gating for Multimodal Detection
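【Quick read】: This paper addresses two key problems in object detection for unmanned aerial vehicle (UAV) imagery. First, existing methods fuse multimodal features with simple strategies that readily introduce cross-modal noise. Second, these strategies disrupt the hierarchical structure of the feature pyramid and hurt fine-grained detection of small objects. The core of the solution is the Pyramidal Adaptive Cross-Gating Network (PACGNet), built on two modules. The Symmetrical Cross-Gating (SCG) module uses a bidirectional, symmetrical "horizontal" gating mechanism to selectively absorb complementary information, suppress noise, and preserve the semantic integrity of each modality. The Pyramidal Feature-aware Multimodal Gating (PFMG) module rebuilds the feature hierarchy with a progressive gating mechanism, using detailed features from the higher-resolution level to guide fusion at the lower-resolution level so that fine-grained information survives as features propagate.
Link: https://arxiv.org/abs/2512.18291
Authors: Zidong Gu, Shoufu Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 6 figures, submitted to Image and Vision Computing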
Abstract:Object detection in aerial imagery is a critical task in applications such as UAV reconnaissance. Although existing methods have extensively explored feature interaction between different modalities, they commonly rely on simple fusion strategies for feature aggregation. This introduces two critical flaws: it is prone to cross-modal noise and disrupts the hierarchical structure of the feature pyramid, thereby impairing the fine-grained detection of small objects. To address this challenge, we propose the Pyramidal Adaptive Cross-Gating Network (PACGNet), an architecture designed to perform deep fusion within the backbone. To this end, we design two core components: the Symmetrical Cross-Gating (SCG) module and the Pyramidal Feature-aware Multimodal Gating (PFMG) module. The SCG module employs a bidirectional, symmetrical “horizontal” gating mechanism to selectively absorb complementary information, suppress noise, and preserve the semantic integrity of each modality. The PFMG module reconstructs the feature hierarchy via a progressive hierarchical gating mechanism. This leverages the detailed features from a preceding, higher-resolution level to guide the fusion at the current, lower-resolution level, effectively preserving fine-grained details as features propagate. Through evaluations conducted on the DroneVehicle and VEDAI datasets, our PACGNet sets a new state-of-the-art benchmark, with mAP50 scores reaching 81.7% and 82.1% respectively.
[CV-147] UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations
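【Quick read】: This paper addresses three core challenges in Multimodal Place Recognition (MPR): (1) adapting dynamically to arbitrary modality inputs within one framework; (2) staying robust when modalities are missing or degraded; (3) generalizing across sensor configurations and scenes. The key to the solution is the UniMPR framework, which maps all sensing modalities (such as camera, LiDAR, and radar) into a shared polar BEV (Bird's Eye View) feature space and uses a multi-branch network to extract discriminative intra-modal and inter-modal features. It is pre-trained extensively on a large-scale training set assembled from multiple datasets with an adaptive label assignment strategy, which improves performance and robustness across diverse sensor combinations and environmental conditions.
Link: https://arxiv.org/abs/2512.18279
Authors: Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Yiming Ma, Guangming Xiong
Affiliations: Beijing Institute of Technology; Shanghai Jiao Tong University; The University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: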
Abstract:Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to arbitrary modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle the data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-modal and inter-modal features from any modality combination. To fully exploit the network’s generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at this https URL.
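To make the notion of a polar BEV grid concrete, here is a minimal sketch that bins 2D points (for example, LiDAR returns projected onto the ground plane) into rings and angular sectors around the sensor. It illustrates the coordinate layout only, under assumed grid parameters; `polar_bev_counts` is a hypothetical helper, and UniMPR's actual feature encoding is not described in the abstract.

```python
import math


def polar_bev_counts(points, num_rings, num_sectors, max_range):
    """Count points per (ring, sector) cell of a polar bird's-eye-view grid.

    points: iterable of (x, y) coordinates in the sensor frame.
    Returns a list of `num_rings` rows, each with `num_sectors` counts.
    """
    grid = [[0] * num_sectors for _ in range(num_rings)]
    two_pi = 2 * math.pi
    for x, y in points:
        radius = math.hypot(x, y)
        if radius >= max_range:
            continue  # drop points outside the grid
        angle = math.atan2(y, x) % two_pi  # map the angle into [0, 2*pi)
        # Clamp indices to guard against floating-point edge cases.
        ring = min(int(radius / max_range * num_rings), num_rings - 1)
        sector = min(int(angle / two_pi * num_sectors), num_sectors - 1)
        grid[ring][sector] += 1
    return grid


if __name__ == "__main__":
    cloud = [(1.0, 0.0), (0.0, 3.0), (-1.0, -1.0), (5.0, 0.0)]
    for row in polar_bev_counts(cloud, num_rings=4, num_sectors=4, max_range=4.0):
        print(row)
```

A polar layout makes a rotation of the sensor correspond to a shift along the sector axis, which is one common motivation for using it in place recognition.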
[CV-148] Building UI/UX Dataset for Dark Pattern Detection and YOLOv12x-based Real-Time Object Recognition Detection System
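【Quick read】: This paper addresses the accuracy and real-time performance of dark pattern detection on online platforms, that is, how to automatically and efficiently detect misleading user interface (UI) elements that undermine users' ability to make informed and rational choices. The key to the solution is a large, high-quality visual object detection dataset combined with the YOLOv12x model and transfer learning, which together deliver both high accuracy and fast inference. The dataset contains 4,066 UI/UX screenshots from 194 websites across six major industries, annotated with five UI components commonly associated with dark patterns: Button, Checkbox, Input Field, Pop-up, and QR Code. Experiments show a detection accuracy of 92.8% mAP@50 at a real-time inference speed of 40.5 frames per second (FPS), which supports practical deployment in online environments.
Link: https://arxiv.org/abs/2512.18269
Authors: Se-Young Jang, Su-Yeon Yoon, Jae-Woong Jung, Dong-Hun Lee, Seong-Hun Choi, Soo-Kyung Jun, Yu-Bin Kim, Young-Seon Ju, Kyounggon Kim
Affiliations: Baekseok University; Ewha Womans University; Tech University of Korea; Best of the Best 14th Program, KITRI; Korea Internet & Security Agency (KISA); Naif Arab University for Security Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages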
Abstract:With the accelerating pace of digital transformation and the widespread adoption of online platforms, both social and technical concerns regarding dark patterns (user interface designs that undermine users’ ability to make informed and rational choices) have become increasingly prominent. As corporate online platforms grow more sophisticated in their design strategies, there is a pressing need for proactive and real-time detection technologies that go beyond the predominantly reactive approaches employed by regulatory authorities. In this paper, we propose a visual dark pattern detection framework that improves both detection accuracy and real-time performance. To this end, we constructed a proprietary visual object detection dataset by manually collecting 4,066 UI/UX screenshots containing dark patterns from 194 websites across six major industrial sectors in South Korea and abroad. The collected images were annotated with five representative UI components commonly associated with dark patterns: Button, Checkbox, Input Field, Pop-up, and QR Code. This dataset has been publicly released to support further research and development in the field. To enable real-time detection, this study adopted the YOLOv12x object detection model and applied transfer learning to optimize its performance for visual dark pattern recognition. Experimental results demonstrate that the proposed approach achieves a high detection accuracy of 92.8% in terms of mAP@50, while maintaining a real-time inference speed of 40.5 frames per second (FPS), confirming its effectiveness for practical deployment in online environments. Furthermore, to facilitate future research and contribute to technological advancements, the dataset constructed in this study has been made publicly available at this https URL.
[CV-149] Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks
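【Quick read】: This paper addresses the leakage of user privacy through attribute inference attacks that Vision-Language Models (VLMs) enable on social platforms. Existing protection methods tend to degrade image quality or interfere with vision-based functions, so they struggle to balance privacy protection against user experience. The key to the solution is a method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint, lowering the risk of privacy leakage while keeping image content high in quality and the user experience intact.
Link: https://arxiv.org/abs/2512.18264
Authors: Yucheng Fan, Jiawei Chen, Yu Tian, Zhaoxia Yin
Affiliations: East China Normal University; Zhongguancun Academy; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: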
Abstract:As vision-language models (VLMs) become widely adopted, VLM-based attribute inference attacks have emerged as a serious privacy concern, enabling adversaries to infer private attributes from images shared on social media. This escalating threat calls for dedicated protection methods to safeguard user privacy. However, existing methods often degrade the visual quality of images or interfere with vision-based functions on social media, thereby failing to achieve a desirable balance between privacy protection and user experience. To address this challenge, we propose a novel protection method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. While our method is conceptually effective, fair comparisons between methods remain challenging due to the lack of publicly available evaluation datasets. To fill this gap, we introduce VPI-COCO, a publicly available benchmark comprising 522 images with hierarchically structured privacy questions and corresponding non-private counterparts, enabling fine-grained and joint evaluation of protection methods in terms of privacy preservation and user experience. Building upon this benchmark, experiments on multiple VLMs demonstrate that our method effectively reduces PAR below 25%, keeps NPAR above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased privacy questions, demonstrating its strong practical applicability for real-world VLM deployments.
[CV-150] Loom: Diffusion-Transformer for Interleaved Generation
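【Quick read】: This paper addresses the consistency between text and image content in multimodal sequence generation, that is, how to produce coherent visual frames interleaved with aligned textual descriptions in a single framework (interleaved text-image generation) to support tasks such as style transfer, compositional synthesis, and procedural tutorials. The key to the solution is Loom, a diffusion-transformer framework that uses full-parameter fine-tuning and an interleaved architecture alternating textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy decomposes the user instruction into stepwise prompts and frame embeddings that guide temporally consistent synthesis. For conditioning, Loom samples a small set of prior frames instead of concatenating the full history, which keeps long-horizon generation controllable and efficient.
Link: https://arxiv.org/abs/2512.18254
Authors: Mingcheng Ye, Jiaming Liu, Yiren Song
Affiliations: Beijing Institute of Technology; Independent Researcher; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: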
Abstract:Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.
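[CV-151] Towards Ancient Plant Seed Classification: A Benchmark Dataset and Baseline Model
【Quick Read】: This paper addresses the inefficiency of ancient plant seed classification, which relies on expert experience, and the gap in data and methods in archaeobotany. The key to its solution is to build the first ancient plant seed image classification dataset (the APS dataset) and to propose APSNet, a framework designed for this task. APSNet introduces the scale (size) feature of seeds to help the network learn fine-grained information: a Size Perception and Embedding (SPE) module in the encoder explicitly extracts size information to complement fine-grained discriminative features. It also adopts an Asynchronous Decoupled Decoding (ADD) architecture, based on traditional progressive learning, that decodes features along both the channel and spatial dimensions. The resulting classifier reaches an accuracy of 90.5%.
Link: https://arxiv.org/abs/2512.18247
Authors: Rui Xing, Runmin Cong, Yingying Wu, Can Wang, Zhongming Tang, Fen Wang, Hao Wu, Sam Kwong
Affiliations: Shandong University; Hong Kong Baptist University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: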
Abstract:Understanding the dietary preferences of ancient societies and their evolution across periods and regions is crucial for revealing human-environment interactions. Seeds, as important archaeological artifacts, represent a fundamental subject of archaeobotanical research. However, traditional studies rely heavily on expert knowledge, which is often time-consuming and inefficient. Intelligent analysis methods have made progress in various fields of archaeology, but there remains a research gap in data and methods in archaeobotany, especially in the classification task of ancient plant seeds. To address this, we construct the first Ancient Plant Seed Image Classification (APS) dataset. It contains 8,340 images from 17 genus- or species-level seed categories excavated from 18 archaeological sites across China. In addition, we design a framework specifically for the ancient plant seed classification task (APSNet), which introduces the scale feature (size) of seeds based on learning fine-grained information to guide the network in discovering key “evidence” for sufficient classification. Specifically, we design a Size Perception and Embedding (SPE) module in the encoder part to explicitly extract size information for the purpose of complementing fine-grained information. We propose an Asynchronous Decoupled Decoding (ADD) architecture based on traditional progressive learning to decode features from both channel and spatial perspectives, enabling efficient learning of discriminative features. In both quantitative and qualitative analyses, our approach surpasses existing state-of-the-art image classification methods, achieving an accuracy of 90.5%. This demonstrates that our work provides an effective tool for large-scale, systematic archaeological research.
[CV-152] Spectral Discrepancy and Cross-modal Semantic Consistency Learning for Object Detection in Hyperspectral Image
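【Quick Read】: This paper addresses the challenges of object detection in hyperspectral images, in particular the recognition difficulty caused by high spectral similarity within and between classes, inconsistency across spectral bands, and redundant or interfering information such as sensor noise and illumination changes. The key to its solution is a network architecture named Spectral Discrepancy and Cross-Modal semantic consistency learning (SDCM), built around three modules: a Semantic Consistency Learning (SCL) module that uses inter-band contextual cues to reduce information heterogeneity among bands and produce highly consistent spectral-dimension representations; a Spectral Gated Generator (SGG) that filters redundant information according to band importance; and a Spectral Discrepancy Aware (SDA) module that enriches high-level semantic representations by extracting pixel-level spectral features. Together they keep spectral information consistent while improving object localization accuracy.
Link: https://arxiv.org/abs/2512.18245
Authors: Xiao He, Chang Tang, Xinwang Liu, Wei Zhang, Zhimin Gao, Chuankun Li, Shaohua Qiu, Jiangfeng Xu
Affiliations: Wuhan University; Huazhong University of Science and Technology; National University of Defense Technology; Shandong Computer Science Center; Qilu University of Technology; Zhengzhou University; North University of China; Naval University of Engineering; Hexagon AB
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: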
Abstract:Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences in similar substances. However, object detection in hyperspectral images faces significant challenges in intra- and inter-class similarity due to the spatial differences in hyperspectral inter-bands and unavoidable interferences, e.g., sensor noises and illumination. To alleviate the hyperspectral inter-bands inconsistencies and redundancy, we propose a novel network termed Spectral Discrepancy and Cross-Modal semantic consistency learning (SDCM), which facilitates the extraction of consistent information across a wide range of hyperspectral bands while utilizing the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral dimension representations. On the other hand, we incorporate a spectral gated generator (SGG) into the framework that filters out the redundant data inherent in hyperspectral information based on the importance of the bands. Then, we design the spectral discrepancy aware (SDA) module to enrich the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that our proposed method achieves state-of-the-art performance when compared with other ones.
[CV-153] SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality
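【Quick Read】: This paper addresses the tension in real-time Video Frame Interpolation (VFI) between flow-based methods, whose perceptual quality falls short in complex scenes, and diffusion-based approaches, whose latency is too high. The key to its solution is a semantic-guided fine-tuning strategy, Semantic-Guided RIFE (SG-RIFE): semantic priors from a frozen DINOv3 Vision Transformer are combined with a Split-Fidelity Aware Projection Module (Split-FAPM) and a Deformable Semantic Fusion (DSF) module to augment a pre-trained RIFE backbone efficiently. This raises perceptual quality markedly without a significant increase in computational cost, reaching diffusion-level visual fidelity in near real time.
Link: https://arxiv.org/abs/2512.18241
Authors: Pan Ben Wong, Chengli Wu, Hanyue Lu
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: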
Abstract:Real-time Video Frame Interpolation (VFI) has long been dominated by flow-based methods like RIFE, which offer high throughput but often fail in complicated scenarios involving large motion and occlusion. Conversely, recent diffusion-based approaches (e.g., Consec. BB) achieve state-of-the-art perceptual quality but suffer from prohibitive latency, rendering them impractical for real-time applications. To bridge this gap, we propose Semantic-Guided RIFE (SG-RIFE). Instead of training from scratch, we introduce a parameter-efficient fine-tuning strategy that augments a pre-trained RIFE backbone with semantic priors from a frozen DINOv3 Vision Transformer. We propose a Split-Fidelity Aware Projection Module (Split-FAPM) to compress and refine high-dimensional features, and a Deformable Semantic Fusion (DSF) module to align these semantic priors with pixel-level motion fields. Experiments on SNU-FILM demonstrate that semantic injection provides a decisive boost in perceptual fidelity. SG-RIFE outperforms diffusion-based LDMVFI in FID/LPIPS and achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster, proving that semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time.
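[CV-154] Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction
【Quick Read】: This paper addresses three challenges in photorealistic 3D reconstruction of large-scale scenes from monocular video: scale-ambiguous depth estimates produce ghost geometry, long-range pose drift breaks alignment, and a single global neural radiance field (NeRF) cannot model content spanning hundreds of metres. The key to its solution is a joint learning framework that optimizes depth, pose, and radiance together. First, a Vision-Transformer (ViT) depth network trained with metric-scale supervision provides globally consistent depth. Second, a multi-scale feature bundle-adjustment (BA) layer optimizes camera poses directly in feature space, replacing brittle keypoints with learned pyramidal descriptors to suppress trajectory drift. Finally, an incremental hierarchy of local radiance fields allocates and freezes new hash-grid NeRFs on the fly whenever view overlap falls below a threshold, enabling city-block-scale reconstruction on a single GPU.
Link: https://arxiv.org/abs/2512.18237
Authors: Shahram Najam Syed, Yitian Hu, Yuchao Yao
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 2 figures, 2 tables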
Abstract:Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space–leveraging learned pyramidal descriptors instead of brittle keypoints–to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences–up to 18x lower than BARF and 2x lower than NoPe-NeRF–while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
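[CV-155] Multifaceted Exploration of Spatial Openness in Rental Housing: A Big Data Analysis in Tokyo's 23 Wards
【Quick Read】: This paper addresses a limitation of research on residential spatial openness: studies usually treat the influencing factors separately, so they cannot fully capture the complex relationship between openness, rent, and housing attributes. The key to its solution is a quantitative evaluation framework that combines two-dimensional (2D) and three-dimensional (3D) perspectives. 2D openness is computed from floor plans as planar visibility using visibility graph analysis (VGA), while 3D openness is quantified from interior images with the Mask2Former semantic segmentation model, which identifies walls, ceilings, floors, and windows. This multidimensional, data-driven approach reveals how openness varies over time and space and how it partially correlates with rent and building characteristics, offering a new basis for linking residential design with urban and market dynamics.
Link: https://arxiv.org/abs/2512.18226
Authors: Takuya OKi, Yuan Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: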
Abstract:Understanding spatial openness is vital for improving residential quality and design; however, studies often treat its influencing factors separately. This study developed a quantitative framework to evaluate the spatial openness in housing from two- (2D) and three- (3D) dimensional perspectives. Using data from 4,004 rental units in Tokyo’s 23 wards, we examined the temporal and spatial variations in openness and its relationship with rent and housing attributes. 2D openness was computed via planar visibility using visibility graph analysis (VGA) from floor plans, whereas 3D openness was derived from interior images analysed using Mask2Former, a semantic segmentation model that identifies walls, ceilings, floors, and windows. The results showed an increase in living room visibility and a 1990s peak in overall openness. Spatial analyses revealed partial correlations among openness, rent, and building characteristics, reflecting urban redevelopment trends. Although the 2D and 3D openness indicators were not directly correlated, higher openness tended to correspond to higher rent. The impression scores predicted by the existing models were only weakly related to openness, suggesting that the interior design and furniture more strongly shape perceived space. This study offers a new multidimensional data-driven framework for quantifying residential spatial openness and linking it with urban and market dynamics.
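The 2D openness measure described above can be illustrated with a minimal sketch of visibility graph analysis on a rasterized floor plan. The grid encoding (`.` for open floor, `#` for wall), the Bresenham line-of-sight test, and all function names are assumptions made for illustration; the abstract does not say how the authors implement VGA.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[int, int]


def line_cells(r0: int, c0: int, r1: int, c1: int) -> List[Cell]:
    """Bresenham's line: grid cells from one cell to another, endpoints included."""
    cells = []
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    sr = 1 if r1 >= r0 else -1
    sc = 1 if c1 >= c0 else -1
    err = dr - dc
    r, c = r0, c0
    while True:
        cells.append((r, c))
        if (r, c) == (r1, c1):
            return cells
        e2 = 2 * err
        if e2 > -dc:
            err -= dc
            r += sr
        if e2 < dr:
            err += dr
            c += sc


def visible(plan: Sequence[str], a: Cell, b: Cell) -> bool:
    """True if no wall cell lies on the rasterized line from a to b."""
    return all(plan[r][c] == "." for r, c in line_cells(a[0], a[1], b[0], b[1]))


def mean_openness(plan: Sequence[str]) -> float:
    """Average, over open cells, of the fraction of other open cells in view."""
    cells = [(r, c) for r, row in enumerate(plan)
             for c, ch in enumerate(row) if ch == "."]
    if len(cells) < 2:
        return 0.0
    total = 0.0
    for a in cells:
        seen = sum(1 for b in cells if b != a and visible(plan, a, b))
        total += seen / (len(cells) - 1)
    return total / len(cells)


# Hypothetical plan: two rooms joined by a one-cell door in the middle wall.
plan = [
    "...#...",
    ".......",
    "...#...",
]
print(round(mean_openness(plan), 3))
```

A grid is used here only because it keeps the line-of-sight test short; a plan with thicker walls or a finer raster would change the score, so values are comparable only across plans rasterized the same way.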
[CV-156] Unsupervised Anomaly Detection with an Enhanced Teacher for Student-Teacher Feature Pyramid Matching
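【Quick Read】: This paper addresses the challenging problem of anomaly detection in unsupervised learning, in particular how to improve performance on image-level and pixel-level anomaly detection. The key to its solution is an enhanced student-teacher feature pyramid framework, Enhanced Teacher for Student-Teacher Feature Pyramid (ET-STPM): a ResNet-18 network is first pre-trained on ImageNet, and the teacher network is then fine-tuned on the MVTech-AD dataset to strengthen its feature representations. The model reports mean accuracies of 0.971 at the image level and 0.977 at the pixel level, better than previous methods.
Link: https://arxiv.org/abs/2512.18219
Authors: Mohammad Zolfaghari, Hedieh Sajedi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: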
Abstract:Anomaly detection, or outlier detection, is one of the challenging subjects in unsupervised learning. This paper introduces a student-teacher framework for anomaly detection whose teacher network is enhanced to achieve high-performance metrics. For this purpose, we first pre-train the ResNet-18 network on ImageNet and then fine-tune it on the MVTech-AD dataset. Experimental results at the image level and pixel level demonstrate that this idea achieves better metrics than previous methods. Our model, Enhanced Teacher for Student-Teacher Feature Pyramid (ET-STPM), achieved 0.971 mean accuracy at the image level and 0.977 mean accuracy at the pixel level for anomaly detection.
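A minimal sketch of how a student-teacher feature pyramid can score anomalies, assuming the per-position score commonly used in student-teacher feature matching: half the squared distance between L2-normalized teacher and student feature vectors. The abstract does not state the scoring rule used by ET-STPM, so the score, the multiplicative combination across pyramid levels, and all function names are assumptions.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def position_score(teacher_feat: Sequence[float],
                   student_feat: Sequence[float]) -> float:
    """0.5 * ||t_hat - s_hat||^2, which equals 1 - cosine similarity."""
    t = l2_normalize(teacher_feat)
    s = l2_normalize(student_feat)
    return 0.5 * sum((a - b) ** 2 for a, b in zip(t, s))


def pyramid_score(teacher_levels: Sequence[Sequence[float]],
                  student_levels: Sequence[Sequence[float]]) -> float:
    """Combine per-level scores at one location by multiplication (assumed)."""
    score = 1.0
    for t, s in zip(teacher_levels, student_levels):
        score *= position_score(t, s)
    return score


# Toy features at one spatial position: the student imitates the teacher on
# normal data (same direction) and diverges on an anomaly (orthogonal).
normal = position_score([1.0, 0.0], [2.0, 0.0])
anomalous = position_score([1.0, 0.0], [0.0, 3.0])
print(normal, anomalous)
```

Because the student is trained only on anomaly-free images, a large disagreement with the teacher is read as evidence of an anomaly; an image-level score could then be taken as the maximum over positions.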
[CV-157] Multi-Part Object Representations via Graph Structures and Co-Part Discovery
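【Quick Read】: This paper addresses the difficulty that existing image-based object-centric representation methods have in recognizing multi-part objects under occlusion or in out-of-distribution settings. Conventional methods rely on implicit object representations and assume that part-whole relations are encoded automatically through indirect training objectives, an assumption that breaks down in complex scenes. The key to its solution is a new explicit graph-structured representation that models the relations between parts directly, together with a co-part object discovery algorithm, which improves robustness and generalization under occlusion and out-of-distribution conditions.
Link: https://arxiv.org/abs/2512.18192
Authors: Alex Foo, Wynne Hsu, Mong Li Lee
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: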
Abstract:Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fails to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages explicit graph representations for parts and presents a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as the accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.
[CV-158] ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection
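【Quick Read】: This paper addresses the inefficient query usage and reduced accuracy that query initialization strategies such as random sampling or BEV heatmap-based sampling cause in current query-based 3D object detection, especially in occluded or crowded scenes. The key to its solution is a new framework named ALIGN (Advanced query initialization with LiDAR and Image Guidance) with three core components: (i) Occlusion-aware Center Estimation (OCE), which fuses LiDAR geometry with image semantics to locate object centers accurately; (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each one with spatially and semantically aligned surrounding points; and (iii) Dynamic Query Balancing (DQB), which adaptively adjusts the allocation of queries between foreground and background regions. The method improves detection performance and is notably more robust in complex scenes.
Link: https://arxiv.org/abs/2512.18187
Authors: Janghyun Baek, Mincheol Chang, Seokha Moon, Seung Joon Lee, Jinkyu Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures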
Abstract:Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies, such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately; (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it; and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.
[CV-159] Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching
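【Quick Read】: This paper addresses the training instability and performance bottlenecks that a poorly chosen source distribution causes when flow matching is used for high-dimensional data generation, focusing on issues such as directional alignment, norm misalignment, and mode discrepancy that arise when choosing between the conventional Gaussian source and alternatives. The key to its solution is twofold. First, an interpretable 2D simulation framework exposes four core mechanisms in flow matching training (for example, degraded density approximation, path entanglement, and norm misalignment). Second, a practical framework combines norm-aligned training with directionally pruned sampling: it keeps the omnidirectional supervision that a Gaussian source provides while discarding initial points in data-sparse regions at inference time. This improves generation quality and sampling efficiency, and the pruning can be applied to existing flow matching models without retraining.
Link: https://arxiv.org/abs/2512.18184
Authors: Junho Lee, Kwanseok Kim, Joonseok Lee
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: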
Abstract:Flow matching has emerged as a powerful generative modeling approach with flexible choices of source distribution. While Gaussian distributions are commonly used, the potential for better alternatives in high-dimensional data generation remains largely unexplored. In this paper, we propose a novel 2D simulation that captures high-dimensional geometric properties in an interpretable 2D setting, enabling us to analyze the learning dynamics of flow matching during training. Based on this analysis, we derive several key insights about flow matching behavior: (1) density approximation can paradoxically degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian’s omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. Building on these insights, we propose a practical framework that combines norm-aligned training with directionally-pruned sampling. This approach maintains the robust omnidirectional supervision essential for stable flow learning, while eliminating initializations in data-sparse regions during inference. Importantly, our pruning strategy can be applied to any flow matching model trained with a Gaussian source, providing immediate performance gains without the need for retraining. Empirical evaluations demonstrate consistent improvements in both generation quality and sampling efficiency. Our findings provide practical insights and guidelines for source distribution design and introduce a readily applicable technique for improving existing flow matching models. Our code is available at this https URL.
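One plausible reading of directionally pruned sampling can be sketched as rejection sampling: draw initial points from the Gaussian source as usual, and keep only those whose direction is close to the direction of at least one data point. The cosine-similarity criterion, the threshold `min_cos`, and the function names are assumptions for illustration and are not taken from the paper.

```python
import math
import random
from typing import List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def max_cosine(z: Sequence[float], data_dirs: Sequence[Sequence[float]]) -> float:
    """Largest cosine similarity between z and any unit data direction."""
    z_dir = unit(z)
    return max(sum(a * b for a, b in zip(z_dir, d)) for d in data_dirs)


def pruned_gaussian_samples(data: Sequence[Sequence[float]], n: int,
                            min_cos: float, seed: int = 0,
                            max_tries: int = 100_000) -> List[List[float]]:
    """Draw n standard-Gaussian points, rejecting those that point away from the data."""
    rng = random.Random(seed)
    dim = len(data[0])
    data_dirs = [unit(x) for x in data]
    kept: List[List[float]] = []
    for _ in range(max_tries):
        if len(kept) == n:
            break
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if max_cosine(z, data_dirs) >= min_cos:
            kept.append(z)
    if len(kept) < n:
        raise RuntimeError("too few accepted samples; lower min_cos or raise max_tries")
    return kept


# Toy 2D data concentrated around the positive x-axis.
data = [[3.0, 0.1], [2.0, 0.2], [4.0, -0.1]]
data_dirs = [unit(x) for x in data]
samples = pruned_gaussian_samples(data, n=20, min_cos=0.9)
print(len(samples))
```

Rejection only changes which starting points are used at inference, so a sketch like this leaves the trained velocity field untouched, which matches the abstract's claim that pruning needs no retraining.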
[CV-160] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
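【Quick Read】: This paper addresses the difficulty of achieving both high-quality visual appearance and realistic human motion in music-driven dance video generation. Existing work on music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis has advanced but cannot be transferred directly to this task, and current studies still fall short in expressiveness and physical plausibility. The key to its solution is the MACE-Dance framework, which uses a cascaded Mixture-of-Experts (MoE) structure. The Motion Expert uses a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy for high-fidelity, artistically expressive music-to-3D motion generation. The Appearance Expert performs decoupled motion- and reference-conditioned video synthesis, ensuring consistent visual identity and spatiotemporal coherence. Evaluated on a self-built large-scale dataset with a purpose-designed motion-appearance evaluation protocol, the approach reaches state-of-the-art performance.
Link: https://arxiv.org/abs/2512.18181
Authors: Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He
Affiliations: Renmin University of China; Tsinghua University; Wuhan University; Alibaba Group; Malou Tech Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: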
Abstract:With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: this https URL
[CV-161] NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics – Explainable Medical AI
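【Quick Read】: This paper addresses the difficulty of achieving accuracy and interpretability together in medical image diagnosis, especially with limited data, subtle visual cues, and high-stakes clinical decisions. Existing purely data-driven vision models tend to produce black-box predictions with poor interpretability and weak cross-domain generalization, which hinders their use in real clinical settings. The key to its solution is the NEURO-GUARD framework, which combines Vision Transformers (ViTs) with language-driven reasoning and introduces a mechanism based on retrieval-augmented generation (RAG) in which a large language model (LLM) iteratively generates, evaluates, and refines feature-extraction code for medical images. This embeds clinical guidelines and expert knowledge into visual learning, shifting from purely data-driven to knowledge-guided learning. It improves diagnostic accuracy (by 6.2% over a ViT baseline on diabetic retinopathy classification) and cross-domain robustness (by 5%) while making model outputs more transparent and interpretable.
Link: https://arxiv.org/abs/2512.18177
Authors: Midhat Urooj, Ayan Banerjee, Sandeep Gupta
Affiliations: Arizona State University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Asilomar Conference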
Abstract:Accurate yet interpretable image-based diagnosis remains a central challenge in medical AI, particularly in settings characterized by limited data, subtle visual cues, and high-stakes clinical decision-making. Most existing vision models rely on purely data-driven learning and produce black-box predictions with limited interpretability and poor cross-domain generalization, hindering their real-world clinical adoption. We present NEURO-GUARD, a novel knowledge-guided vision framework that integrates Vision Transformers (ViTs) with language-driven reasoning to improve performance, transparency, and domain robustness. NEURO-GUARD employs a retrieval-augmented generation (RAG) mechanism for self-verification, in which a large language model (LLM) iteratively generates, evaluates, and refines feature-extraction code for medical images. By grounding this process in clinical guidelines and expert knowledge, the framework progressively enhances feature detection and classification beyond purely data-driven baselines. Extensive experiments on diabetic retinopathy classification across four benchmark datasets APTOS, EyePACS, Messidor-1, and Messidor-2 demonstrate that NEURO-GUARD improves accuracy by 6.2% over a ViT-only baseline (84.69% vs. 78.4%) and achieves a 5% gain in domain generalization. Additional evaluations on MRI-based seizure detection further confirm its cross-domain robustness, consistently outperforming existing methods. Overall, NEURO-GUARD bridges symbolic medical reasoning with subsymbolic visual learning, enabling interpretable, knowledge-aware, and generalizable medical image diagnosis while achieving state-of-the-art performance across multiple datasets. 
[CV-162] Atlas is Your Perfect Context: One-Shot Customization for Generalizable Foundational Medical Image Segmentation
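【Quick Read】: This paper addresses the performance drop of current interactive foundation models in clinical medical image segmentation, which depend on precise prompts and degrade in contexts that are underrepresented in their training data. The key to its solution is the AtlasSegFM framework, which achieves lightweight one-shot customization through two core innovations: a pipeline that generates context-aware prompts by registering a context atlas to the query image, giving the foundation model inputs better suited to the clinical context; and a test-time adapter that fuses the atlas registration result with the foundation model's prediction to improve segmentation of small, delicate structures. The method makes the model more adaptable and robust in real clinical workflows.
Link: https://arxiv.org/abs/2512.18176
Authors: Ziyu Zhang, Yi Yu, Simeng Zhu, Ahmed Aly, Yunhe Gao, Ning Gu, Yuan Xue
Affiliations: Nanjing University; The Ohio State University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: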
Abstract:Accurate medical image segmentation is essential for clinical diagnosis and treatment planning. While recent interactive foundation models (e.g., nnInteractive) enhance generalization through large-scale multimodal pretraining, they still depend on precise prompts and often perform below expectations in contexts that are underrepresented in their training data. We present AtlasSegFM, an atlas-guided framework that customizes available foundation models to clinical contexts with a single annotated example. The core innovations are: 1) a pipeline that provides context-aware prompts for foundation models via registration between a context atlas and query images, and 2) a test-time adapter to fuse predictions from both atlas registration and the foundation model. Extensive experiments across public and in-house datasets spanning multiple modalities and organs demonstrate that AtlasSegFM consistently improves segmentation, particularly for small, delicate structures. AtlasSegFM provides a lightweight, deployable solution for one-shot customization of foundation models in real-world clinical workflows. The code will be made publicly available.
[CV-163] Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction
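【Quick Read】: This paper addresses the heavy compute and data requirements of training diffusion models directly on high-dimensional data for 3D medical image reconstruction, and in particular the scalability and generation-quality problems in practical applications such as 3D Computed Tomography (CT) imaging. The key to its solution is a 3D patch-based diffusion model that learns the joint distribution of position-aware 3D local patches and a downsampled global volume, coupling local and global information so that high-resolution 3D images can be generated efficiently and with high quality from limited data. The method improves reconstruction performance while substantially lowering computational cost.
Link: https://arxiv.org/abs/2512.18161
Authors: Taewon Yang, Jason Hu, Jeffrey A. Fessler, Liyue Shen
Affiliations: University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: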
Abstract:Diffusion models learn strong image priors that can be leveraged to solve inverse problems like medical image reconstruction. However, for real-world applications such as 3D Computed Tomography (CT) imaging, directly training diffusion models on 3D data presents significant challenges due to the high computational demands of extensive GPU resources and large-scale datasets. Existing works mostly reuse 2D diffusion priors to address 3D inverse problems, but fail to fully realize and leverage the generative capacity of diffusion models for high-dimensional data. In this study, we propose a novel 3D patch-based diffusion model that can learn a fully 3D diffusion prior from limited data, enabling scalable generation of high-resolution 3D images. Our core idea is to learn the prior of 3D patches to achieve scalable efficiency, while coupling local and global information to guarantee high-quality 3D image generation, by modeling the joint distribution of position-aware 3D local patches and downsampled 3D volume as global context. Our approach not only enables high-quality 3D generation, but also offers an unprecedentedly efficient and accurate solution to high-resolution 3D inverse problems. Experiments on 3D CT reconstruction across multiple datasets show that our method outperforms state-of-the-art methods in both performance and efficiency, notably achieving high-resolution 3D reconstruction of 512 × 512 × 256 (~20 mins).
[CV-164] EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams
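【Quick Read】: This paper addresses the accuracy and temporal consistency of monocular depth estimation in endoscopic video streams, including challenges such as blurred boundaries of complex anatomical structures and unstable depth predictions in dynamic scenes. The key to its solution is the EndoStreamDepth framework, which has three core components: (1) a single-frame depth network with an endoscopy-specific transformation that produces accurate depth maps; (2) multi-level Mamba temporal modules that propagate inter-frame information to improve accuracy and stabilize predictions over time; and (3) a hierarchical design with multi-scale supervision, whose complementary loss terms improve local boundary sharpness and global geometric consistency. The framework keeps boundaries anatomically aligned while running in real time.
Link: https://arxiv.org/abs/2512.18159
Authors: Hao Li, Daiwei Lu, Jiacheng Wang, Robert J. Webster III, Ipek Oguz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: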
Abstract:This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at this https URL
[CV-165] SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping
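【Quick Read】: This paper addresses the trade-off between data accessibility and spatial resolution in high-resolution forest canopy height mapping. Existing deep learning methods based on satellite imagery are usually limited by sensor resolution or by the cost of commercial data, which makes wide deployment difficult. The key to its solution is SERA-H, an end-to-end model that combines a super-resolution module (EDSR) with temporal attention encoding (UTAE) and is trained under the supervision of high-density airborne LiDAR (ALS) data, so that 2.5 m canopy height maps can be reconstructed from freely available Sentinel-1 and Sentinel-2 time series (10 m resolution). On an open benchmark dataset in France the method reaches an MAE of 2.6 m and a coefficient of determination of 0.82, outperforming standard Sentinel baselines and matching or exceeding methods that rely on costly commercial high-resolution imagery (such as SPOT-6/7 and PlanetScope), which supports the value of fusing spatiotemporal information to go beyond the sensors' native resolution.
Link: https://arxiv.org/abs/2512.18128
Authors: Thomas Boudras, Martin Schwartz, Rasmus Fensholt, Martin Brandt, Ibrahim Fayad, Jean-Pierre Wigneron, Gabriel Belouze, Fajwel Fogel, Philippe Ciais
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures, 3 tables. Source code available at this https URL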
Abstract:High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. Although recent studies have led to the advent of deep learning methods using satellite imagery to predict height maps, these approaches often face a trade-off between data accessibility and spatial resolution. To overcome these limitations, we present SERA-H, an end-to-end model combining a super-resolution module (EDSR) and temporal attention encoding (UTAE). Trained under the supervision of high-density LiDAR data (ALS), our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series data. Evaluated on an open-source benchmark dataset in France, SERA-H, with a MAE of 2.6 m and a coefficient of determination of 0.82, not only outperforms standard Sentinel-1/2 baselines but also achieves performance comparable to or better than methods relying on commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar). These results demonstrate that combining high-resolution supervision with the spatiotemporal information embedded in time series enables the reconstruction of details beyond the input sensors’ native resolution. SERA-H opens the possibility of freely mapping forests with high revisit frequency, achieving accuracy comparable to that of costly commercial imagery. The source code is available at this https URL
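The two accuracy figures quoted above, mean absolute error (MAE) and the coefficient of determination, follow standard definitions. The sketch below computes both for a handful of made-up height values; the numbers and function names are illustrative assumptions and are unrelated to the paper's data.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted heights."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("reference values have zero variance")
    return 1.0 - ss_res / ss_tot


# Made-up reference (e.g. LiDAR-derived) and predicted canopy heights, in metres.
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
mae = mean_absolute_error(reference, predicted)   # (2 + 2 + 3 + 1) / 4
r2 = r_squared(reference, predicted)              # 1 - 18 / 500
print(mae, r2)
```

MAE is expressed in the units of the target (metres here), while the coefficient of determination is unitless, which is why the two are usually reported together.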
[CV-166] Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation
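[Quick Read]: This paper addresses the loss of accuracy and calibration in semantic segmentation of outdoor street scenes under domain shift, with a focus on applications such as autonomous driving, mobile robotics, and assistive technology for visually impaired pedestrians, where roads, sidewalks, vehicles, and pedestrians must be distinguished accurately. The key to the solution is a region-level, uncertainty-gated retrieval mechanism that decides dynamically which regions need retrieval, preserving segmentation quality while cutting computation. In experiments, the best method raises mean intersection-over-union by 11.3% while retrieving for only 12.5% of regions, an 87.5% reduction in retrieval cost relative to an always-on baseline.
Link: https://arxiv.org/abs/2512.18082
Authors: Shreshth Rajan, Raymond Liu
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: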
Abstract:Semantic segmentation of outdoor street scenes plays a key role in applications such as autonomous driving, mobile robotics, and assistive technology for visually-impaired pedestrians. For these applications, accurately distinguishing between key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians is essential for maintaining safety and minimizing risks. Semantic segmentation must be robust to different environments, lighting and weather conditions, and sensor noise, while being performed in real-time. We propose a region-level, uncertainty-gated retrieval mechanism that improves segmentation accuracy and calibration under domain shift. Our best method achieves an 11.3% increase in mean intersection-over-union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for always-on baseline.
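The abstract does not say how region uncertainty is measured or how the gate is set. As a minimal sketch, assuming Shannon entropy of each region's class probabilities as the uncertainty score and a fixed retrieval budget of 12.5% of regions, the gating step could look like the following. The names `region_entropy` and `select_regions_for_retrieval` are hypothetical and not taken from the paper.

```python
import math


def region_entropy(probs):
    """Shannon entropy (in nats) of one region's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_regions_for_retrieval(region_probs, budget=0.125):
    """Return the indices of the most uncertain regions.

    `budget` is the fraction of regions allowed to trigger retrieval
    (assumed here; the paper may gate with a threshold instead).
    """
    k = max(1, round(budget * len(region_probs)))
    ranked = sorted(
        range(len(region_probs)),
        key=lambda i: region_entropy(region_probs[i]),
        reverse=True,  # most uncertain first
    )
    return sorted(ranked[:k])


# Toy example: seven confident regions and one maximally uncertain region.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
regions = [confident] * 7 + [uncertain]
print(select_regions_for_retrieval(regions))  # prints [7]
```

A fixed budget makes the cost reduction explicit: retrieving for 12.5% of regions skips the other 87.5%. A threshold on the uncertainty score would be an equally plausible gate.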
[CV-167] FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis
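[Quick Read]: This paper addresses a gap in the use of multimodal large language models (MLLMs) for fingerprint understanding: although MLLMs have recently been applied to the biometric analysis of iris and face images, their capabilities on fingerprint recognition and forensic tasks have not been evaluated systematically. The key to the solution is FPBench, the first comprehensive benchmark for fingerprint domain understanding. It covers 7 real and synthetic datasets and 8 biometric and forensic tasks, and evaluates 20 open-source and proprietary MLLMs with zero-shot and chain-of-thought prompting. The findings on performance, explainability, and limitations are intended as an evaluation standard for future fingerprint foundation models.
Link: https://arxiv.org/abs/2512.18073
Authors: Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon
Institutions: New York University; University of Wyoming
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: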
Abstract:Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding are yet unexplored. In this work, we design a comprehensive benchmark, FPBench, that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance, explainability and share our insights into the challenges and limitations. We establish FPBench as the first comprehensive benchmark for fingerprint domain understanding with MLLMs paving the path for foundation models for fingerprints.
[CV-168] FOODER: Real-time Facial Authentication and Expression Recognition
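[Quick Read]: This paper addresses the safe deployment of neural networks, specifically how to detect out-of-distribution (OOD) samples so that face-based authentication and expression recognition remain robust and privacy-preserving. The key to the solution is FOODER, a real-time, privacy-preserving radar-based framework that relies only on low-cost frequency-modulated continuous-wave (FMCW) radar data. It fuses range-Doppler and micro range-Doppler representations and uses a multi-encoder multi-decoder architecture to classify a single enrolled individual as in-distribution and all other faces as OOD. After successful authentication, a ResNet block separates dynamic from static expressions, and two specialized MobileViT networks then classify dynamic expressions (smile, shock) and static expressions (neutral, anger). The result is a single system for facial authentication and fine-grained expression recognition.
Link: https://arxiv.org/abs/2512.18057
Authors: Sabri Mustafa Kahya, Muhammet Sami Yavuz, Boran Hamdi Sivrikaya, Eckehard Steinbach
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Book chapter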
Abstract:Out-of-distribution (OOD) detection is essential for the safe deployment of neural networks, as it enables the identification of samples outside the training domain. We present FOODER, a real-time, privacy-preserving radar-based framework that integrates OOD-based facial authentication with facial expression recognition. FOODER operates using low-cost frequency-modulated continuous-wave (FMCW) radar and exploits both range-Doppler and micro range-Doppler representations. The authentication module employs a multi-encoder multi-decoder architecture with Body Part (BP) and Intermediate Linear Encoder-Decoder (ILED) components to classify a single enrolled individual as in-distribution while detecting all other faces as OOD. Upon successful authentication, an expression recognition module is activated. Concatenated radar representations are processed by a ResNet block to distinguish between dynamic and static facial expressions. Based on this categorization, two specialized MobileViT networks are used to classify dynamic expressions (smile, shock) and static expressions (neutral, anger). This hierarchical design enables robust facial authentication and fine-grained expression recognition while preserving user privacy by relying exclusively on radar data. Experiments conducted on a dataset collected with a 60 GHz short-range FMCW radar demonstrate that FOODER achieves an AUROC of 94.13% and an FPR95 of 18.12% for authentication, along with an average expression recognition accuracy of 94.70%. FOODER outperforms state-of-the-art OOD detection methods and several transformer-based architectures while operating efficiently in real time.
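AUROC and FPR95 are the two authentication metrics reported above. The sketch below shows one common way to compute them from scalar scores, assuming that a higher score means "more in-distribution" and that in-distribution samples are the positive class; the paper may follow a different convention. The function names and the toy scores are illustrative only.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random in-distribution sample outscores a random OOD sample."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5  # ties count as half a win
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted when at least `tpr` of ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of ID samples that must pass
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


# Toy scores, not data from the paper.
enrolled = [0.9, 0.8, 0.7, 0.6]
impostors = [0.65, 0.2, 0.1, 0.05]
print(auroc(enrolled, impostors))       # prints 0.9375
print(fpr_at_tpr(enrolled, impostors))  # prints 0.25
```

Lower FPR95 is better: it is the share of non-enrolled faces that would still be accepted at an operating point that admits 95% of genuine attempts.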
[CV-169] YolovN-CBi: A Lightweight and Efficient Architecture for Real-Time Detection of Small UAVs
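[Quick Read]: This paper addresses the difficulty of detecting small unmanned aerial vehicles (UAVs) in civilian and defense settings, where the main challenges are small size, rapid movement, and low visual contrast. The key to the solution is YolovN-CBi, a modified YolovN architecture that incorporates the Convolutional Block Attention Module (CBAM) and the Bidirectional Feature Pyramid Network (BiFPN) to improve sensitivity to small objects. On four benchmark datasets and a locally collected test set of very small drones, the baseline Yolov5 and the proposed Yolov5-CBi outperform newer YOLO versions, including Yolov8 and Yolov12, in the speed-accuracy trade-off. Knowledge distillation from a Yolov5m-CBi teacher to a Yolov5n-CBi student yields a lightweight model that retains high accuracy and runs 82.9% faster than the baseline model, which makes it a candidate for edge deployment.
Link: https://arxiv.org/abs/2512.18046
Authors: Ami Pandat, Punna Rajasekhar, Gopika Vinod, Rohit Shukla
Institution: Bhabha Atomic Research Centre
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: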
Abstract:Unmanned Aerial Vehicles, commonly known as drones, pose increasing risks in civilian and defense settings, demanding accurate and real-time drone detection systems. However, detecting drones is challenging because of their small size, rapid movement, and low visual contrast. A modified architecture of YolovN called the YolovN-CBi is proposed that incorporates the Convolutional Block Attention Module (CBAM) and the Bidirectional Feature Pyramid Network (BiFPN) to improve sensitivity to small object detections. A curated training dataset consisting of 28K images is created with various flying objects and a local test dataset is collected with 2500 images consisting of very small drone objects. The proposed architecture is evaluated on four benchmark datasets, along with the local test dataset. The baseline Yolov5 and the proposed Yolov5-CBi architecture outperform newer Yolo versions, including Yolov8 and Yolov12, in the speed-accuracy trade-off for small object detection. Four other variants of the proposed CBi architecture are also proposed and evaluated, which vary in the placement and usage of CBAM and BiFPN. These variants are further distilled using knowledge distillation techniques for edge deployment, using a Yolov5m-CBi teacher and a Yolov5n-CBi student. The distilled model achieved a mAP@0.5:0.9 of 0.6573, representing a 6.51% improvement over the teacher’s score of 0.6171, highlighting the effectiveness of the distillation process. The distilled model is 82.9% faster than the baseline model, making it more suitable for real-time drone detection. These findings highlight the effectiveness of the proposed CBi architecture, together with the distilled lightweight models in advancing efficient and accurate real-time detection of small UAVs.
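The 6.51% figure is a relative improvement over the teacher, not a difference in percentage points. A quick check of the arithmetic, using only the two scores quoted in the abstract (the helper name is illustrative):

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


teacher_score = 0.6171  # teacher score, from the abstract
student_score = 0.6573  # distilled student score, from the abstract
print(round(relative_improvement(student_score, teacher_score), 2))  # prints 6.51
```

The absolute gain is 0.0402 on the same metric.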
[CV-170] NodMAISI: Nodule-Oriented Medical AI for Synthetic Imaging
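[Quick Read]: This paper addresses the scarcity and inconsistent annotation of abnormal findings in lung CT, especially small pulmonary nodules, in order to improve nodule detection and classification models for lung cancer screening. The core of the solution is NodMAISI, a framework that uses anatomically constrained generative synthesis and lesion-aware augmentation to produce clinically meaningful nodule variants while preserving the surrounding anatomy. It improves the distributional fidelity of synthetic images and the detectability of nodules, and it raises downstream classification performance, particularly when little clinical data is available.
Link: https://arxiv.org/abs/2512.18038
Authors: Fakrul Islam Tushar, Ehsan Samei, Cynthia Rudin, Joseph Y. Lo
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 3 tables, 7 figures, 12 Supplement tables, 9 Supplement figures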
Abstract:Objective: Although medical imaging datasets are increasingly available, abnormal and annotation-intensive findings critical to lung cancer screening, particularly small pulmonary nodules, remain underrepresented and inconsistently curated. Methods: We introduce NodMAISI, an anatomically constrained, nodule-oriented CT synthesis and augmentation framework trained on a unified multi-source cohort (7,042 patients, 8,841 CTs, 14,444 nodules). The framework integrates: (i) a standardized curation and annotation pipeline linking each CT with organ masks and nodule-level annotations, (ii) a ControlNet-conditioned rectified-flow generator built on MAISI-v2’s foundational blocks to enforce anatomy- and lesion-consistent synthesis, and (iii) lesion-aware augmentation that perturbs nodule masks (controlled shrinkage) while preserving surrounding anatomy to generate paired CT variants. Results: Across six public test datasets, NodMAISI improved distributional fidelity relative to MAISI-v2 (real-to-synthetic FID range 1.18 to 2.99 vs 1.69 to 5.21). In lesion detectability analysis using a MONAI nodule detector, NodMAISI substantially increased average sensitivity and more closely matched clinical scans (IMD-CT: 0.69 vs 0.39; DLCS24: 0.63 vs 0.20), with the largest gains for sub-centimeter nodules where MAISI-v2 frequently failed to reproduce the conditioned lesion. In downstream nodule-level malignancy classification trained on LUNA25 and externally evaluated on LUNA16, LNDbv4, and DLCS24, NodMAISI augmentation improved AUC by 0.07 to 0.11 at =20% clinical data and by 0.12 to 0.21 at 10%, consistently narrowing the performance gap under data scarcity.
[CV-171] Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation
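[Quick Read]: This paper addresses the limited understanding of how embodiment, i.e., the choice of physical platform, sensor configuration, and modality alignment, affects the perception, reasoning, and control of vision-language models (VLMs). The key to the solution is Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. It covers three heterogeneous embodiments (autonomous vehicles, aerial drones, and robotic manipulators) with approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks that jointly assess four dimensions: semantic, spatial, temporal, and physical reasoning. Domain-far queries are included to prevent embodiment overfitting, so that the benchmark measures generalization rather than platform-specific adaptation.
Link: https://arxiv.org/abs/2512.18028
Authors: Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
Institutions: Karlsruhe Institute of Technology; UAS Esslingen; TU Wien; Dr. Ing. h.c. F. Porsche AG; University of Michigan; Voxel51 Inc.
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: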
Abstract:Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment – i.e., the choice of physical platform, sensor configuration, and modality alignment – influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments – autonomous vehicles, aerial drones, and robotic manipulators – through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remains the primary bottleneck for reliable embodied competence.
[CV-172] Robotic VLA Benefits from Joint Learning with Motion Image Diffusion
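[Quick Read]: This paper addresses the lack of motion reasoning in current Vision-Language-Action (VLA) models for robotic manipulation, which typically imitate expert trajectories without predicting future dynamics. The key to the solution is joint learning with motion image diffusion, which extends a standard VLA with a dual-head design: an action head predicts action chunks, and a motion head, implemented as a Diffusion Transformer (DiT), predicts optical-flow-based motion images that capture future dynamics. Both heads share one vision-language model (VLM) backbone and are trained jointly, so the backbone learns temporally coherent representations that couple robot control with physically grounded motion knowledge. The inference pathway is unchanged, so test-time latency is maintained.
Link: https://arxiv.org/abs/2512.18007
Authors: Yu Fang, Kanchana Ranasinghe, Le Xue, Honglu Zhou, Juntao Tan, Ran Xu, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Daniel Szafir, Mingyu Ding, Michael S. Ryoo, Juan Carlos Niebles
Institutions: Salesforce AI Research; University of North Carolina at Chapel Hill
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL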
Abstract:Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation by mapping multimodal observations and instructions directly to actions. However, they typically mimic expert trajectories without predictive motion reasoning, which limits their ability to reason about what actions to take. To address this limitation, we propose joint learning with motion image diffusion, a novel strategy that enhances VLA models with motion reasoning capabilities. Our method extends the VLA architecture with a dual-head design: while the action head predicts action chunks as in vanilla VLAs, an additional motion head, implemented as a Diffusion Transformer (DiT), predicts optical-flow-based motion images that capture future dynamics. The two heads are trained jointly, enabling the shared VLM backbone to learn representations that couple robot control with motion knowledge. This joint learning builds temporally coherent and physically grounded representations without modifying the inference pathway of standard VLAs, thereby maintaining test-time latency. Experiments in both simulation and real-world environments demonstrate that joint learning with motion image diffusion improves the success rate of pi-series VLAs to 97.5% on the LIBERO benchmark and 58.0% on the RoboTwin benchmark, yielding a 23% improvement in real-world performance and validating its effectiveness in enhancing the motion reasoning capability of large-scale VLAs.
[CV-173] Name That Part: 3D Part Segmentation and Naming
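[Quick Read]: This paper addresses key challenges in semantic 3D part segmentation: part definitions are inconsistent across existing datasets, which limits robust training, and earlier methods either produce unlabeled decompositions or retrieve single parts without annotating the complete shape. The core of the solution is ALIGN-Parts, which formulates part naming as a direct set alignment task and matches implicit 3D part representations, called partlets, to text descriptions via bipartite assignment. The method combines geometric cues from 3D part fields, multi-view vision features, and affordance descriptions generated by a language model. A text-alignment loss places partlets in the same embedding space as text, which in principle allows open-vocabulary matching. The model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions, and with human verification it is used to build a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D and contains 1,794 unique 3D parts.
Link: https://arxiv.org/abs/2512.18003
Authors: Soumava Paul, Prakhar Kaushik, Ankit Vaidya, Anand Bhattad, Alan Yuille
Institution: Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL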
Abstract:We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Text-alignment loss ensures partlets share embedding space with text, enabling a theoretically open-vocabulary matching setup, given sufficient data. Our efficient and novel, one-shot, 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. As our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification, we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We also show examples from our newly created Tex-Parts dataset. We also introduce 2 novel metrics appropriate for the named 3D part segmentation task.
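Bipartite assignment pairs each partlet with a distinct part description so that the total similarity is as large as possible. A minimal sketch, assuming a precomputed partlet-to-text similarity matrix (the values below are made up) and using brute-force search in place of the Hungarian algorithm that is the usual choice at realistic sizes; `best_assignment` is a hypothetical helper, not the paper's code:

```python
from itertools import permutations


def best_assignment(similarity):
    """Match each row (partlet) to a distinct column (text description).

    Exhaustive search, so only suitable for a handful of parts.
    Returns (column index per row, total similarity).
    """
    n_parts = len(similarity)
    n_texts = len(similarity[0])
    if n_parts > n_texts:
        raise ValueError("need at least as many descriptions as partlets")
    best_score, best_cols = float("-inf"), None
    for cols in permutations(range(n_texts), n_parts):
        score = sum(similarity[i][cols[i]] for i in range(n_parts))
        if score > best_score:
            best_score, best_cols = score, cols
    return list(best_cols), best_score


# Rows: partlets; columns: descriptions such as "seat", "backrest", "leg".
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.10, 0.20],
    [0.10, 0.20, 0.70],
]
assignment, total = best_assignment(sim)
print(assignment)  # prints [1, 0, 2]
```

Picking the best description for each partlet independently would give partlet 0 the first description and leave partlet 1 with a poor match; solving the assignment jointly avoids that.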
[CV-174] Enhancing Tea Leaf Disease Recognition with Attention Mechanisms and Grad-CAM Visualization
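[Quick Read]: This paper addresses the low efficiency and limited accuracy of tea leaf disease identification, with the aim of reducing economic losses from disease spread. The key to the solution is an automated deep learning classification system built on a newly collected dataset of 5278 images across seven classes. It uses pretrained models (DenseNet, Inception, and EfficientNet) together with attention modules, and an ensemble model reaches the highest accuracy of 85.68%. Explainable AI is added to improve interpretability, so that farmers can identify diseases quickly and act on them.
Link: https://arxiv.org/abs/2512.17987
Authors: Omar Faruq Shikdar, Fahad Ahammed, B. M. Shahria Alam, Golam Kibria, Tawhidur Rahman, Nishat Tasnim Niloy
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, International Conference on Computing and Communication Networks (ICCCNet-2025)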
Abstract:Tea is among the most widely consumed drinks globally. Tea production is a key industry for many countries. One of the main challenges in tea harvesting is tea leaf diseases. If the spread of tea leaf diseases is not stopped in time, it can lead to massive economic losses for farmers. Therefore, it is crucial to identify tea leaf diseases as soon as possible. Manually identifying tea leaf disease is an ineffective and time-consuming method, without any guarantee of success. Automating this process will improve both the efficiency and the success rate of identifying tea leaf diseases. The purpose of this study is to create an automated system that can classify different kinds of tea leaf diseases, allowing farmers to take action to minimize the damage. A novel dataset was developed specifically for this study. The dataset contains 5278 images across seven classes. The dataset was pre-processed prior to training the model. We deployed three pretrained models: DenseNet, Inception, and EfficientNet. EfficientNet was used only in the ensemble model. We utilized two different attention modules to improve model performance. The ensemble model achieved the highest accuracy of 85.68%. Explainable AI was introduced for better model interpretability.
[CV-175] A Modular Framework for Single-View 3D Reconstruction of Indoor Environments
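[Quick Read]: This paper addresses the limited quality of single-view 3D reconstruction of indoor scenes caused by complex instance shapes and occlusion. Traditional methods predict 3D shapes directly from incomplete 2D images and struggle to recover occluded regions. The key to the solution is a modular framework that uses diffusion techniques and splits reconstruction into two steps: diffusion-based amodal completion and layout inpainting first recover the complete views of occluded instances and the room background, and a hybrid depth estimation technique combined with view-space alignment then lifts the completed 2D results into 3D. This design improves the reconstruction accuracy and visual quality of both foreground objects and the room background.
Link: https://arxiv.org/abs/2512.17955
Authors: Yuxiao Li
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Master’s thesis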
Abstract:We propose a modular framework for single-view indoor scene 3D reconstruction, where several core modules are powered by diffusion techniques. Traditional approaches for this task often struggle with the complex instance shapes and occlusions inherent in indoor environments. They frequently overshoot by attempting to predict 3D shapes directly from incomplete 2D images, which results in limited reconstruction quality. We aim to overcome this limitation by splitting the process into two steps: first, we employ diffusion-based techniques to predict the complete views of the room background and occluded indoor instances, then transform them into 3D. Our modular framework makes contributions to this field through the following components: an amodal completion module for restoring the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth estimation technique that balances overall geometric accuracy with fine detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to ensure precise placement of instances within the scene. This approach effectively reconstructs both foreground instances and the room background from a single image. Extensive experiments on the 3D-Front dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in terms of both visual quality and reconstruction accuracy. The framework holds promising potential for applications in interior design, real estate, and augmented reality.
[CV-176] SCS-SupCon: Sigmoid-based Common and Style Supervised Contrastive Learning with Adaptive Decision Boundaries
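[Quick Read]: This paper addresses the limited effectiveness of existing contrastive learning methods for image classification when inter-class differences are subtle and intra-class variation is large. In particular, supervised contrastive methods based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, which weakens discrimination in fine-grained recognition. The key to the solution is Sigmoid-based Common and Style Supervised Contrastive Learning (SCS-SupCon), which introduces a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters. This gives adaptive decision boundaries, emphasizes hard negatives, mitigates negative-sample dilution, and makes better use of supervision. An explicit style-distance constraint further disentangles style and content representations for more robust feature learning.
Link: https://arxiv.org/abs/2512.17954
Authors: Bin Wang, Fadi Dornaika
Institutions: University of the Basque Country; IKERBASQUE, Basque Foundation for Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: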
Abstract:Image classification is hindered by subtle inter-class differences and substantial intra-class variations, which limit the effectiveness of existing contrastive learning methods. Supervised contrastive approaches based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, thereby reducing discriminative power in fine-grained recognition tasks. To address these limitations, we propose Sigmoid-based Common and Style Supervised Contrastive Learning (SCS-SupCon). Our framework introduces a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters to enable adaptive decision boundaries. This formulation emphasizes hard negatives, mitigates negative-sample dilution, and more effectively exploits supervision. In addition, an explicit style-distance constraint further disentangles style and content representations, leading to more robust feature learning. Comprehensive experiments on six benchmark datasets, including CUB200-2011 and Stanford Dogs, demonstrate that SCS-SupCon achieves state-of-the-art performance across both CNN and Transformer backbones. On CIFAR-100 with ResNet-50, SCS-SupCon improves top-1 accuracy over SupCon by approximately 3.9 percentage points and over CS-SupCon by approximately 1.7 points under five-fold cross-validation. On fine-grained datasets, it outperforms CS-SupCon by 0.4–3.0 points. Extensive ablation studies and statistical analyses further confirm the robustness and generalization of the proposed framework, with Friedman tests and Nemenyi post-hoc evaluations validating the stability of the observed improvements.
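The abstract names a sigmoid-based pairwise loss with learnable temperature and bias but does not give its formula. The sketch below shows the general shape such a loss can take, assuming L2-normalized embeddings, a dot-product similarity, and a per-pair term of the form -log sigmoid(z * (t * sim + b)) with z = +1 for same-class pairs and z = -1 otherwise. This form, the function names, and the default values of `temperature` and `bias` are assumptions for illustration, not the authors' definition; in training, `temperature` and `bias` would be learned rather than fixed.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(embeddings, labels, temperature=10.0, bias=-5.0):
    """Mean pairwise sigmoid loss over all ordered pairs (i, j) with i != j."""
    total, count = 0.0, 0
    for i, (e_i, y_i) in enumerate(zip(embeddings, labels)):
        for j, (e_j, y_j) in enumerate(zip(embeddings, labels)):
            if i == j:
                continue
            sim = sum(a * b for a, b in zip(e_i, e_j))
            z = 1.0 if y_i == y_j else -1.0  # +1 positive pair, -1 negative pair
            total += -log_sigmoid(z * (temperature * sim + bias))
            count += 1
    return total / count


# Two tight clusters; labels either agree or disagree with the geometry.
points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(sigmoid_pairwise_loss(points, [0, 0, 1, 1]))  # small: labels match clusters
print(sigmoid_pairwise_loss(points, [0, 1, 0, 1]))  # large: labels cut across clusters
```

Because every pair contributes its own independent term, one negative is not normalized against all the others as in a softmax-based InfoNCE loss, and the bias shifts where the similarity threshold between "same" and "different" falls.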
[CV-177] Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition NEURIPS2025
[Quick Read]: This paper addresses background bias in human action recognition models, i.e., the tendency to predict from scene background cues rather than human movement and pose, which harms robustness and interpretability. For classification models, incorporating segmented human input decreases background bias by 3.78%. For Video Large Language Models (VLLMs), manual and automated prompt tuning steers predictions towards human-focused reasoning by 9.85%.
Link: https://arxiv.org/abs/2512.17953
Authors: Ellie Zhou, Jihoon Chung, Olga Russakovsky
Institutions: Westmont High School; Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2025 Workshops: SPACE in Vision, Language, and Embodied AI; and What Makes a Good Video: Next Practices in Video Generation and Evaluation
Abstract:Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLM) and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.
[CV-178] SuperFlow: Training Flow Matching Models with RL on the Fly
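[Quick Read]: This paper addresses two core problems in reinforcement learning (RL) training of flow-based generative models. First, GRPO-style methods use a fixed per-prompt group size and ignore variation in sampling importance across prompts, which makes sampling inefficient and training slow. Second, trajectory-level advantages are reused directly as per-step estimates, which is inconsistent with continuous-time flow dynamics and biases credit assignment. The key to the solution is SuperFlow, which adjusts per-prompt group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Without architectural changes, SuperFlow uses only 5.4% to 56.3% of the original training steps, and improves over SD3.5-M by 4.6% to 47.2% and over Flow-GRPO by 1.7% to 16.0%.
Link: https://arxiv.org/abs/2512.17951
Authors: Kaijie Chen, Zhiyang Xu, Ying Shen, Zihao Lin, Yuguang Yao, Lifu Huang
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages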
Abstract:Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
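The abstract does not spell out the variance-aware sampling rule. One way to picture the idea, as a rough sketch under stated assumptions: give each prompt a share of the sampling budget proportional to the spread of its recent rewards, so that prompts whose samples all score alike receive few samples. The allocation rule, the `min_size` floor, and the name `allocate_group_sizes` are all hypothetical and should not be read as SuperFlow's actual procedure.

```python
import statistics


def allocate_group_sizes(reward_history, total_budget, min_size=2):
    """Split a sampling budget across prompts in proportion to reward spread.

    reward_history[i] holds recent rewards for prompt i (hypothetical input).
    """
    spreads = [statistics.pstdev(rewards) for rewards in reward_history]
    total_spread = sum(spreads)
    if total_spread == 0.0:
        # No prompt is informative yet: fall back to a uniform split.
        uniform = max(min_size, total_budget // len(reward_history))
        return [uniform] * len(reward_history)
    # The min_size floor means the sizes can add up to slightly more than the budget.
    return [max(min_size, round(total_budget * s / total_spread)) for s in spreads]


history = [
    [0.5, 0.5, 0.5, 0.5],  # every sample scores the same: little to learn
    [0.0, 1.0, 0.0, 1.0],  # high variance: more samples are useful
]
print(allocate_group_sizes(history, total_budget=16))  # prints [2, 16]
```

The intuition is that a group-relative advantage is zero when all rewards in a group are equal, so samples spent on such prompts contribute no gradient signal.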
[CV-179] NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction
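[Quick Read]: This paper addresses the daily challenges of nystagmus patients with photosensitivity, whose involuntary eye movements are exacerbated by environmental brightness; existing assistive solutions are mostly symptomatic and lack personalized prediction. The key to the solution is NystagmusNet, an AI-driven system in which a dual-branch convolutional neural network, trained on synthetic and augmented datasets, estimates a photosensitivity risk score from environmental brightness and eye movement variance. Explainability techniques such as SHAP and GradCAM highlight high-risk visual regions to improve clinical trust, and a rule-based engine generates adaptive filter suggestions, linking risk prediction to intervention.
Link: https://arxiv.org/abs/2512.17943
Authors: Karthik Prabhakar
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures, 2 tables, code available at this https URL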
Abstract:Nystagmus patients with photosensitivity face significant daily challenges due to involuntary eye movements exacerbated by environmental brightness conditions. Current assistive solutions are limited to symptomatic treatments without predictive personalization. This paper proposes NystagmusNet, an AI-driven system that predicts high-risk visual environments and recommends real-time visual adaptations. Using a dual-branch convolutional neural network trained on synthetic and augmented datasets, the system estimates a photosensitivity risk score based on environmental brightness and eye movement variance. The model achieves 75% validation accuracy on synthetic data. Explainability techniques including SHAP and GradCAM are integrated to highlight environmental risk zones, improving clinical trust and model interpretability. The system includes a rule-based recommendation engine for adaptive filter suggestions. Future directions include deployment via smart glasses and reinforcement learning for personalized recommendations.
[CV-180] A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes
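[Quick Read]: This paper addresses high energy consumption, limited real-time performance, and insufficient robustness in the detection of small, fast-moving unmanned aerial vehicles (UAVs). The core of the solution is a hybrid architecture that combines frame-based and event-driven object tracking: it reconstructs binary event frames using run-length encoding, switches adaptively between frame mode and event mode according to object size and velocity, and adds a trajectory-based Fast Object Tracking Unit. A neural processing unit with a custom instruction set and a zero-skipping MAC architecture reduces redundant neural computations by more than 97%. Implemented in 40 nm CMOS, the chip achieves 96 pJ per frame per pixel and 61 pJ per event with 98.2% recognition accuracy across ranges of 50 to 400 m and speeds of 5 to 80 pixels per second.
Link: https://arxiv.org/abs/2512.17939
Authors: Yuncheng Lu, Yucen Shi, Aobo Li, Zehao Li, Junying Li, Bo Wang, Tony Tae-Hyoung Kim
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 7 figures, conference paper published in IEEE Asian Solid-State Circuits Conference 2025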
Abstract:We present an energy-efficient anti-UAV system that integrates frame-based and event-driven object tracking to enable reliable detection of small and fast-moving drones. The system reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. A Fast Object Tracking Unit improves robustness for high-speed targets through adaptive thresholding and trajectory-based classification. The neural processing unit supports both grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture, reducing redundant neural computations by more than 97 percent. Implemented in 40 nm CMOS technology, the 2 mm^2 chip achieves 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, and reaches 98.2 percent recognition accuracy on public UAV datasets across 50 to 400 m ranges and 5 to 80 pixels per second speeds. The results demonstrate state-of-the-art end-to-end energy efficiency for anti-UAV systems.
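Run-length encoding suits binary event frames because most pixels are zero and active pixels tend to cluster. The chip's on-hardware format is not described in the abstract, so the following is only a software illustration of the idea on one row of a binary frame, with hypothetical function names:

```python
def rle_encode(bits):
    """Encode a sequence of 0/1 values as (value, run_length) pairs."""
    if not bits:
        return []
    runs = []
    current, length = bits[0], 1
    for b in bits[1:]:
        if b == current:
            length += 1
        else:
            runs.append((current, length))
            current, length = b, 1
    runs.append((current, length))
    return runs


def rle_decode(runs):
    """Rebuild the original 0/1 sequence from (value, run_length) pairs."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


row = [0, 0, 0, 1, 1, 0, 0, 0]
encoded = rle_encode(row)
print(encoded)                     # prints [(0, 3), (1, 2), (0, 3)]
print(rle_decode(encoded) == row)  # prints True
```

For a sparse frame, the number of runs grows with the number of active blobs rather than with the number of pixels, which is where the storage and bandwidth saving comes from.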
[CV-181] Privacy-Aware Sharing of Raw Spatial Sensor Data for Cooperative Perception
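[Quick Read]: This paper addresses the privacy leakage that arises when vehicles share raw spatial sensor data for cooperative perception, a concern that may keep stakeholders such as automakers from adopting such data sharing. The key to the solution is SHARP, a research framework that minimizes privacy leakage to move stakeholders towards raw-data-based cooperative perception. The paper also lays out open questions for networked systems, mobile computing, perception research, industry, and government in realizing the framework.
Link: https://arxiv.org/abs/2512.16265
Authors: Bangya Liu, Chengpo Yan, Chenghao Jiang, Suman Banerjee, Akarsh Prabhakara
Institution: University of Wisconsin-Madison
Categories: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: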
Abstract:Cooperative perception between vehicles is poised to offer robust and reliable scene understanding. Recently, we are witnessing experimental systems research building testbeds that share raw spatial sensor data for cooperative perception. While this approach has brought a marked improvement in accuracy and is the natural way forward, we take a moment to consider the problems with such an approach for eventual adoption by automakers. In this paper, we first argue that new forms of privacy concerns arise and discourage stakeholders from sharing raw sensor data. Next, we present SHARP, a research framework to minimize privacy leakage and drive stakeholders towards the ambitious goal of raw data based cooperative perception. Finally, we discuss open questions for networked systems, mobile computing, perception researchers, industry and government in realizing our proposed framework.
[CV-182] Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)
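[Quick Read]: This paper addresses the slow, costly, and labor-intensive construction of high-quality historical datasets in economic history. The traditional approach relies on research assistants who manually extract information from patent page images with complex layouts and mixed Gothic and Roman fonts. The key to the solution is an automated pipeline built on multimodal large language models (LLMs), specifically Gemini-2.5-Pro and Gemini-2.5-Flash-Lite, which constructs a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans. The benchmarking provides tentative evidence that the pipeline produces higher quality data than research assistants while being more than 795 times faster and 205 times cheaper.
Link: https://arxiv.org/abs/2512.19675
Authors: Niclas Griesshaber, Jochen Streb
Institutions: University of Mannheim; University of Oxford
Categories: General Economics (econ.GN); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: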
Abstract:We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.
[CV-183] Patlak Parametric Image Estimation from Dynamic PET Using Diffusion Model Prior
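[Quick Read]: This paper addresses the low quality of parametric images estimated from dynamic positron emission tomography (PET), which stems from the inherently ill-posed fitting process and from the limited counts caused by non-continuous data acquisition across multiple bed positions in whole-body scans. The key to the solution is a diffusion-model-based kinetic modeling framework: a score function pre-trained on static total-body PET images serves as a prior for both the Patlak slope and intercept images by exploiting their patch-wise similarity, and during inference the kinetic model is incorporated as a data-consistency constraint that guides parametric image estimation. Evaluation on total-body dynamic PET datasets at different dose levels demonstrates the feasibility and promising performance of the framework.
Link: https://arxiv.org/abs/2512.19584
Authors: Ziqian Huang,Boxiao Yu,Siqi Li,Savas Ozdemir,Sangjin Bae,Jae Sung Lee,Guobao Wang,Kuang Gong
Affiliations: University of Florida; Seoul National University; University of California Davis Health
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 10 pages, 9 figures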
Abstract:Dynamic PET enables the quantitative estimation of physiology-related parameters and is widely utilized in research and increasingly adopted in clinical settings. Parametric imaging in dynamic PET requires kinetic modeling to estimate voxel-wise physiological parameters based on specific kinetic models. However, parametric images estimated through kinetic model fitting often suffer from low image quality due to the inherently ill-posed nature of the fitting process and the limited counts resulting from non-continuous data acquisition across multiple bed positions in whole-body PET. In this work, we proposed a diffusion model-based kinetic modeling framework for parametric image estimation, using the Patlak model as an example. The score function of the diffusion model was pre-trained on static total-body PET images and served as a prior for both Patlak slope and intercept images by leveraging their patch-wise similarity. During inference, the kinetic model was incorporated as a data-consistency constraint to guide the parametric image estimation. The proposed framework was evaluated on total-body dynamic PET datasets with different dose levels, demonstrating the feasibility and promising performance of the proposed framework in improving parametric image quality.
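For readers unfamiliar with the Patlak model mentioned above, here is a minimal single-voxel sketch of the conventional least-squares fit that the paper's diffusion prior is meant to improve on. It assumes the standard Patlak relation C_T(t)/C_p(t) = K_i * (integral of C_p from 0 to t)/C_p(t) + V for t >= t*, where the slope K_i and intercept V are the two parametric values. The function names, the synthetic input curve, and all numbers are hypothetical and are not taken from the paper.

```python
import math


def cumulative_trapezoid(values, times):
    """Running trapezoidal integral of `values` over `times`, starting at 0."""
    total = 0.0
    out = [0.0]
    for i in range(1, len(values)):
        total += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
        out.append(total)
    return out


def patlak_fit(tissue, plasma, times, t_star):
    """Ordinary least-squares Patlak slope (Ki) and intercept (V), frames with t >= t_star."""
    integral = cumulative_trapezoid(plasma, times)
    xs, ys = [], []
    for t, ct, cp, ip in zip(times, tissue, plasma, integral):
        if t >= t_star and cp > 0:
            xs.append(ip / cp)  # "Patlak time" on the x axis
            ys.append(ct / cp)  # normalized tissue activity on the y axis
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, noise-free voxel (made-up numbers, for illustration only).
times = [float(t) for t in range(0, 61, 5)]                 # minutes
plasma = [10.0 * math.exp(-0.1 * t) + 1.0 for t in times]   # hypothetical input function
true_ki, true_v = 0.05, 0.3
plasma_integral = cumulative_trapezoid(plasma, times)
tissue = [true_ki * ip + true_v * cp for ip, cp in zip(plasma_integral, plasma)]

ki, v = patlak_fit(tissue, plasma, times, t_star=10.0)
print(f"Ki = {ki:.4f}, V = {v:.4f}")
```

With noisy, low-count data this per-voxel fit becomes unstable, which is the ill-posedness the abstract refers to.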
[CV-184] Deep Learning for Primordial B-mode Extraction
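[Quick Read]: This paper addresses how secondary B-mode polarization interferes with the detection of primordial gravitational waves in cosmic microwave background (CMB) observations, in particular the B-mode contamination from gravitational lensing by large-scale structure. Low-noise, multi-frequency observations can now handle the small signal amplitude and the astrophysical foreground contamination, which leaves secondary B modes as the bottleneck for tighter constraints on the primordial gravitational wave amplitude. The key to the solution is a deep learning network, ResUNet-CMB, that jointly estimates and removes multiple sources of secondary B-mode polarization (gravitational lensing, patchy reionization, and cosmic polarization rotation). The technique is then used in a likelihood analysis to obtain nearly optimal, unbiased estimates of that amplitude.
Link: https://arxiv.org/abs/2512.19577
Authors: Eric Guzman,Joel Meyers
Affiliations: Southern Methodist University
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 12 pages, 8 figures. Code available from this https URL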
Abstract:The search for primordial gravitational waves is a central goal of cosmic microwave background (CMB) surveys. Isolating the characteristic B-mode polarization signal sourced by primordial gravitational waves is challenging for several reasons: the amplitude of the signal is inherently small; astrophysical foregrounds produce B-mode polarization contaminating the signal; and secondary B-mode polarization fluctuations are produced via the conversion of E modes. Current and future low-noise, multi-frequency observations enable sufficient precision to address the first two of these challenges such that secondary B modes will become the bottleneck for improved constraints on the amplitude of primordial gravitational waves. The dominant source of secondary B-mode polarization is gravitational lensing by large scale structure. Various strategies have been developed to estimate the lensing deflection and to reverse its effects on the CMB, thus reducing confusion from lensing B modes in the search for primordial gravitational waves. However, a few complications remain. First, there may be additional sources of secondary B-mode polarization, for example from patchy reionization or from cosmic polarization rotation. Second, the statistics of delensed CMB maps can become complicated and non-Gaussian, especially when advanced lensing reconstruction techniques are applied. We previously demonstrated how a deep learning network, ResUNet-CMB, can provide nearly optimal simultaneous estimates of multiple sources of secondary B-mode polarization. In this paper, we show how deep learning can be applied to estimate and remove multiple sources of secondary B-mode polarization, and we further show how this technique can be used in a likelihood analysis to produce nearly optimal, unbiased estimates of the amplitude of primordial gravitational waves.
[CV-185] Rethinking Coupled Tensor Analysis for Hyperspectral Super-Resolution: Recoverable Modeling Under Endmember Variability
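[Quick Read]: This paper addresses hyperspectral super-resolution (HSR): fusing a pair of spatially co-registered hyperspectral (HSI) and multispectral (MSI) images to recover a super-resolution image (SRI) with higher spatial resolution. Existing coupled tensor decomposition (CTD) methods offer recoverability guarantees, but canonical polyadic decomposition (CPD) and Tucker decomposition lack physical interpretability, while the LL1 model, although interpretable under the linear mixture model (LMM), struggles with nonlinear effects that are common in practice, such as endmember variability (EV). The key to the solution is a more flexible block-term tensor decomposition with rank-(L_r, M_r, N_r) terms, the LMN model, which retains interpretability, subsumes CPD, Tucker, and LL1 as special cases, and accounts for non-ideal effects such as EV, balancing expressiveness against interpretability. Importantly, when both the HSI and the MSI follow the LMN model, recoverability of the SRI can still be established under proper conditions, which gives the method theoretical support.
Link: https://arxiv.org/abs/2512.19489
Authors: Meng Ding,Xiao Fu
Affiliations: Southwest Jiaotong University; Oregon State University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper was accepted by SIAM Journal on Imaging Sciences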
Abstract:This work revisits the hyperspectral super-resolution (HSR) problem, i.e., fusing a pair of spatially co-registered hyperspectral (HSI) and multispectral (MSI) images to recover a super-resolution image (SRI) that enhances the spatial resolution of the HSI. Coupled tensor decomposition (CTD)-based methods have gained traction in this domain, offering recoverability guarantees under various assumptions. Existing models such as canonical polyadic decomposition (CPD) and Tucker decomposition provide strong expressive power but lack physical interpretability. The block-term decomposition model with rank-(L_r, L_r, 1) terms (the LL1 model) yields interpretable factors under the linear mixture model (LMM) of spectral images, but LMM assumptions are often violated in practice, primarily due to nonlinear effects such as endmember variability (EV). To address this, we propose modeling spectral images using a more flexible block-term tensor decomposition with rank-(L_r, M_r, N_r) terms (the LMN model). This modeling choice retains interpretability, subsumes CPD, Tucker, and LL1 as special cases, and robustly accounts for non-ideal effects such as EV, offering a balanced tradeoff between expressiveness and interpretability for HSR. Importantly, under the LMN model for HSI and MSI, recoverability of the SRI can still be established under proper conditions, providing strong theoretical support. Extensive experiments on synthetic and real datasets further validate the effectiveness and robustness of the proposed method compared with existing CTD-based approaches.
[CV-186] Selective Phase-Aware Training of nnU-Net for Robust Breast Cancer Segmentation in Multi-Center DCE-MRI
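[Quick Read]: This paper addresses the difficulty of standardizing tumor segmentation in breast dynamic contrast-enhanced MRI (DCE-MRI), especially the limited generalization of models on heterogeneous multi-center clinical data. The core of the solution is a selective, phase-aware training framework for the nnU-Net architecture that relies on quality-focused data selection to improve robustness and generalization. Experiments show that including ISPY scans with motion artifacts and reduced contrast impairs segmentation performance, even with advanced preprocessing such as CLAHE, whereas training on early-phase images (0000-0002) from the DUKE and NACT datasets gives more stable training conditions. This highlights the importance of phase-sensitive, quality-aware data selection in heterogeneous clinical settings.
Link: https://arxiv.org/abs/2512.19225
Authors: Beyza Zayim,Aissiou Ikram,Boukhiar Naima
Affiliations: University of Algiers 1
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: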
Abstract:Breast cancer remains the most common cancer among women and is a leading cause of female mortality. Dynamic contrast-enhanced MRI (DCE-MRI) is a powerful imaging tool for evaluating breast tumors, yet the field lacks a standardized benchmark for analyzing treatment responses and guiding personalized care. We participated in the MAMA-MIA Challenge’s Primary Tumor Segmentation task, and this work presents a selective, phase-aware training framework for the nnU-Net architecture, emphasizing quality-focused data selection to strengthen model robustness and generalization. We employed the No New Net (nnU-Net) framework with a selective training strategy that systematically analyzed the impact of image quality and center-specific variability on segmentation performance. Controlled experiments on the DUKE, NACT, ISPY1, and ISPY2 datasets revealed that including ISPY scans with motion artifacts and reduced contrast impaired segmentation performance, even with advanced preprocessing, such as contrast-limited adaptive histogram equalization (CLAHE). In contrast, training on DUKE and NACT data, which exhibited clearer contrast and fewer motion artifacts despite varying resolutions, with early phase images (0000-0002) provided more stable training conditions. Our results demonstrate the importance of phase-sensitive and quality-aware training strategies in achieving reliable segmentation performance in heterogeneous clinical datasets, highlighting the limitations of naive dataset expansion and motivating the need for future automation of quality-based data selection strategies.
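[CV-187] SLIM: Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion
[Quick Read]: This paper addresses the fact that conventional image compression models focus on perceptual details for human viewers, which limits how far bits per pixel can be reduced when the images are used for machine vision tasks. The key to the solution is SLIM (Semantic-based Low-bitrate Image compression for Machines), a training framework that uses a pretrained latent diffusion model. The compressor works in the image latent and compactly encodes only the Region-of-Interest (RoI) areas relevant to machine vision, and the pretrained U-Net then enhances the decompressed latent, guided by an RoI-focused text caption. As a result, SLIM achieves low bitrates without any guide mask at inference time and improves machine vision performance while still retaining perceptual details for human vision.
Link: https://arxiv.org/abs/2512.18200
Authors: Hyeonjin Lee,Jun-Hyuk Kim,Jong-Seok Lee
Affiliations: Yonsei University; Chung-Ang University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: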
Abstract:In recent years, the demand for image compression models for machine vision has increased dramatically. However, the training frameworks of image compression still focus on human vision and maintain excessive perceptual details, and thus are limited in how far they can reduce the bits per pixel when performing machine vision tasks. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion, termed SLIM. This is a new effective training framework of image compression for machine vision, using a pretrained latent diffusion model. The compressor model of our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, to compress it compactly. Then the pretrained Unet model enhances the decompressed latent, utilizing a RoI-focused text caption which contains semantic information of the image. Therefore, SLIM is able to focus on RoI areas of the image without any guide mask at the inference stage, achieving low bitrate when compressing. SLIM is also able to enhance a decompressed latent by denoising steps, so the final reconstructed image from the enhanced latent can be optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves a higher classification accuracy in the same bits per pixel condition, compared to conventional image compression models for machines. Code will be released upon acceptance.
[CV-188] Standardized Evaluation of Automatic Methods for Perivascular Spaces Segmentation in MRI – MICCAI 2024 Challenge Results
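[Quick Read]: This paper addresses the automatic segmentation of abnormally enlarged perivascular spaces (PVS) in magnetic resonance imaging (MRI), which is difficult because the structures are small, vary in morphology, resemble other pathological features, and have few annotated datasets. The key to the solution is the EPVS Challenge organized at MICCAI 2024, which provided a standardized multi-site, multi-protocol dataset of 200 scans with annotations covering the full brain parenchyma. Participating teams mainly used deep learning methods based on U-Net architectures, with innovations in multi-modal processing, ensemble strategies, and transformer-based components. The winning method used the MedNeXt architecture with a dual 2D/3D strategy to handle varying slice thicknesses. Top solutions performed relatively well on test data from seen datasets but degraded significantly on the unseen cohort, which shows that domain shift remains the main obstacle to cross-site generalization.
Link: https://arxiv.org/abs/2512.18197
Authors: Yilei Wu,Yichi Zhang,Zijian Dong,Fang Ji,An Sen Tan,Gifford Tan,Sizhao Tang,Huijuan Chen,Zijiao Chen,Eric Kwun Kei Ng,Jose Bernal,Hang Min,Ying Xia,Ines Vati,Liz Cooper,Xiaoyu Hu,Yuchen Pei,Yutao Ma,Victor Nozais,Ami Tsuchida,Pierre-Yves Hervé,Philippe Boutinaud,Marc Joliot,Junghwa Kang,Wooseung Kim,Dayeon Bak,Rachika E. Hamadache,Valeriia Abramova,Xavier Lladó,Yuntao Zhu,Zhenyu Gong,Xin Chen,John McFadden,Pek Lan Khong,Roberto Duarte Coello,Hongwei Bran Li,Woon Puay Koh,Christopher Chen,Joanna M. Wardlaw,Maria del C. Valdés Hernández,Juan Helen Zhou
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: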
Abstract:Perivascular spaces (PVS), when abnormally enlarged and visible in magnetic resonance imaging (MRI) structural sequences, are important imaging markers of cerebral small vessel disease and potential indicators of neurodegenerative conditions. Despite their clinical significance, automatic enlarged PVS (EPVS) segmentation remains challenging due to their small size, variable morphology, similarity with other pathological features, and limited annotated datasets. This paper presents the EPVS Challenge organized at MICCAI 2024, which aims to advance the development of automated algorithms for EPVS segmentation across multi-site data. We provided a diverse dataset comprising 100 training, 50 validation, and 50 testing scans collected from multiple international sites (UK, Singapore, and China) with varying MRI protocols and demographics. All annotations followed the STRIVE protocol to ensure standardized ground truth and covered the full brain parenchyma. Seven teams completed the full challenge, implementing various deep learning approaches primarily based on U-Net architectures with innovations in multi-modal processing, ensemble strategies, and transformer-based components. Performance was evaluated using dice similarity coefficient, absolute volume difference, recall, and precision metrics. The winning method employed MedNeXt architecture with a dual 2D/3D strategy for handling varying slice thicknesses. The top solutions showed relatively good performance on test data from seen datasets, but significant degradation of performance was observed on the previously unseen Shanghai cohort, highlighting cross-site generalization challenges due to domain shift. This challenge establishes an important benchmark for EPVS segmentation methods and underscores the need for the continued development of robust algorithms that can generalize in diverse clinical settings.
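The four evaluation metrics named in the abstract can be sketched on flattened binary masks as follows. This is a generic illustration, not the challenge's official scoring code: the function name is hypothetical, and the absolute volume difference is reported here in voxels, whereas the challenge may report it in physical units or as a percentage.

```python
def segmentation_metrics(pred, truth):
    """Dice, precision, recall and absolute volume difference for two binary masks.

    `pred` and `truth` are equal-length sequences of 0/1 values (flattened masks).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # false negatives
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Volume difference in voxels: |predicted volume - reference volume|.
    abs_volume_diff = abs((tp + fp) - (tp + fn))
    return {
        "dice": dice,
        "precision": precision,
        "recall": recall,
        "abs_volume_diff": abs_volume_diff,
    }


# Toy masks, made up for illustration.
metrics = segmentation_metrics(pred=[1, 1, 0, 0, 1], truth=[1, 0, 0, 1, 1])
print(metrics)
```

Because enlarged PVS are tiny structures, a handful of misplaced voxels moves the Dice score far more than it would for a large organ, which is one reason the abstract reports precision and recall alongside it.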
[CV-189] CytoDINO: Risk-Aware and Biologically-Informed Adaptation of DINOv3 for Bone Marrow Cytomorphology
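[Quick Read]: This paper addresses the labor-intensive nature and high inter-observer variability of bone marrow cell cytomorphology analysis in the diagnosis of hematological malignancies, as well as the fact that existing foundation models are computationally expensive and ignore the asymmetric risks of clinical misdiagnosis. The key to the solution is CytoDINO, a framework that fine-tunes DINOv3 with Low-Rank Adaptation (LoRA), training only 8% of the parameters on a single NVIDIA RTX 5080. Its main contribution is a Hierarchical Focal Loss with Critical Penalties that encodes biological relationships between cell lineages and explicitly penalizes clinically dangerous misclassifications, such as classifying blasts as normal cells. On the MLL dataset the model reaches an 88.2% weighted F1 score and 76.5% macro F1, and confidence-based selective prediction yields 99.5% accuracy on 67% of samples, which suggests a viable path to clinical deployment.
Link: https://arxiv.org/abs/2512.17930
Authors: Aziz Muminov,Anne Pham
Affiliations: Dickinson College
Categories: Other Quantitative Biology (q-bio.OT); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 3 figures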
Abstract:Bone marrow cell cytomorphology analysis is critical for the diagnosis of hematological malignancies but remains a labor-intensive process subject to significant inter-observer variability. While recent foundation models have shown promise in computational pathology, they often require extensive computational resources and fail to account for the asymmetric risks associated with clinical misdiagnosis. We introduce CytoDINO, a framework that achieves state-of-the-art performance on the Munich Leukemia Laboratory (MLL) dataset by fine-tuning DINOv3 using Low-Rank Adaptation (LoRA). Our primary contribution is a novel Hierarchical Focal Loss with Critical Penalties, which encodes biological relationships between cell lineages and explicitly penalizes clinically dangerous misclassifications (e.g., classifying blasts as normal cells). CytoDINO achieves an 88.2% weighted F1 score and 76.5% macro F1 on a held-out test set of 21 cell classes. By utilizing parameter-efficient fine-tuning with only 8% trainable parameters on a single NVIDIA RTX 5080, we demonstrate that consumer-grade hardware can match specialized infrastructure. Furthermore, confidence-based selective prediction yields 99.5% accuracy on 67% of samples, suggesting a viable pathway for clinical deployment where high-uncertainty cases are flagged for expert review.
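The "accuracy on a fraction of samples" figure comes from selective prediction: the model answers only when its confidence clears a threshold and defers the rest to an expert. A minimal sketch of how coverage and selective accuracy are computed is shown below. The function name, threshold, and toy data are hypothetical, and the paper's actual confidence score is not specified in the abstract.

```python
def selective_accuracy(confidences, correct, threshold):
    """Return (coverage, accuracy on the retained samples) for a confidence threshold.

    `confidences` holds one score per sample and `correct` holds booleans telling
    whether the model's prediction for that sample was right.
    """
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Toy data, made up for illustration: six predictions, one of them wrong.
confidences = [0.99, 0.95, 0.60, 0.97, 0.55, 0.92]
correct = [True, True, False, True, True, True]

coverage, accuracy = selective_accuracy(confidences, correct, threshold=0.9)
print(f"coverage = {coverage:.2f}, selective accuracy = {accuracy:.2f}")
```

Raising the threshold trades coverage for accuracy, so a result such as 99.5% accuracy on 67% of samples is one point on that trade-off curve rather than a property of the model alone.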
[CV-190] A curated UK rain radar data set for training and benchmarking nowcasting models
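[Quick Read]: This paper addresses nowcasting from radar image sequences, that is, predicting how precipitation will evolve over the next few hours from past radar data, and specifically the need for a high-quality, well-structured, easy-to-use UK radar dataset for statistical modeling and machine learning research. The key contributions are threefold. First, the dataset contains 1,000 randomly sampled sequences of 20 steps (15-minute increments) of 40x40 radar intensity fields at 5 km spatial resolution, with spatially stratified sampling that preserves spatial homogeneity after clear-sky cases are removed. Second, each sequence comes with atmospheric and geographic features, including date, location, mean elevation, mean wind direction and speed, and prevailing storm type. Third, the authors release R functions for reading the binary Nimrod radar format and present a case study that trains and evaluates a simple convolutional neural network (CNN), with self-contained R code.
Link: https://arxiv.org/abs/2512.17924
Authors: Viv Atureta,Rifki Priansyah Jasin,Stefan Siegert
Affiliations: University of Exeter
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
Comments: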
Abstract:This paper documents a data set of UK rain radar image sequences for use in statistical modeling and machine learning methods for nowcasting. The main dataset contains 1,000 randomly sampled sequences of length 20 steps (15-minute increments) of 2D radar intensity fields of dimension 40x40 (at 5km spatial resolution). Spatially stratified sampling ensures spatial homogeneity despite removal of clear-sky cases by threshold-based truncation. For each radar sequence, additional atmospheric and geographic features are made available, including date, location, mean elevation, mean wind direction and speed and prevailing storm type. New R functions to extract data from the binary “Nimrod” radar data format are provided. A case study is presented to train and evaluate a simple convolutional neural network for radar nowcasting, including self-contained R code.
[CV-191] Disentangled representations via score-based variational autoencoders
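[Quick Read]: This paper addresses how to extract structured, interpretable latent representations from complex data without supervision. Existing methods often struggle to capture semantic factors or continuous trajectories of variation without labels. The key to the solution is the Score-based Autoencoder for Multiscale Inference (SAMI), which unifies the evidence lower bounds (ELBOs) of diffusion models and variational autoencoders (VAEs) into a principled objective that learns representations through score-based guidance of the underlying diffusion process. With this design, a model trained only on static images encodes video sequences into latent trajectories that are straighter than those of alternative encoders, and useful representations can be extracted from pre-trained diffusion models with minimal additional training. The explicitly probabilistic formulation also makes it possible to identify semantically meaningful latent axes without supervised labels, turning the implicit structural information in diffusion models into interpretable representations.
Link: https://arxiv.org/abs/2512.17127
Authors: Benjamin S. H. Lyo,Eero P. Simoncelli,Cristina Savin
Affiliations: Center for Neural Science, New York University; Center for Computational Neuroscience, Flatiron Institute; Center for Data Science, New York University
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 34 pages, 7 figures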
Abstract:We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: it recovers ground truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.
Artificial Intelligence
[AI-0] Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight
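[Quick Read]: This paper addresses the risk that benchmarks for automated clinical risk score calculation, such as MedCalc-Bench, carry systematic errors because their labels were generated with large language models (LLMs). If such benchmarks are treated as static gold standards, those errors become entrenched, and the problem is amplified when the labels serve as reward signals for Reinforcement Learning (RL). The key to the solution is to treat benchmarks for complex tasks as "in-progress living documents" and to audit and relabel them periodically through a systematic physician-in-the-loop pipeline. Advanced agentic verifiers triage the data automatically so that scarce clinician attention goes to the most contentious instances, which exposes extraction errors, calculator logic mismatches, and clinical ambiguity in the original labels. A Qwen3-8B model fine-tuned with Group Relative Policy Optimization (GRPO) on the corrected labels gains 8.7% absolute accuracy over the original baseline, which indicates that label noise materially affects downstream RL training and that benchmark maintenance matters in safety-critical domains.
Link: https://arxiv.org/abs/2512.19691
Authors: Junze Ye,Daniel Tawfik,Alex J. Goodell,Nikhil V. Kotha,Mark K. Buyyounouski,Mohsen Bayati
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: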
Abstract:Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The current standard for evaluating this capability is MedCalc-Bench, a large-scale dataset constructed using LLM-based feature extraction and rule-based aggregation. However, treating such model-generated benchmarks as static oracles risks enshrining historical model errors as evaluation gold standards, a problem dangerously amplified when these datasets serve as reward signals for Reinforcement Learning (RL). In this work, we propose viewing benchmarks for complex tasks such as clinical score computation as “in-progress living documents” that should be periodically re-evaluated as the processes for creating them improve. We introduce a systematic, physician-in-the-loop pipeline that leverages advanced agentic verifiers to audit and relabel MedCalc-Bench, utilizing automated triage to reserve scarce clinician attention for the most contentious instances. Our audit reveals that a notable fraction of original labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. To study whether this label noise meaningfully impacts downstream RL training, we fine-tune a Qwen3-8B model via Group Relative Policy Optimization (GRPO) and demonstrate that training on corrected labels yields an 8.7% absolute improvement in accuracy over the original baseline – validating that label noise materially affects model evaluation. These findings underscore that in safety-critical domains, rigorous benchmark maintenance is a prerequisite for genuine model alignment.
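To see why label noise matters when benchmark labels act as RL rewards, it helps to look at the group-relative advantage that GRPO is generally described as using: each sampled answer is scored against the label, and its advantage is the reward standardized within the group of samples for the same prompt. The sketch below assumes a 0/1 exact-match reward and the population standard deviation. Implementations differ on such details, and none of the names or numbers come from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled answers to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def exact_match_rewards(answers, label):
    """Hypothetical 0/1 reward: 1.0 when a sampled answer equals the benchmark label."""
    return [1.0 if a == label else 0.0 for a in answers]


# Four sampled answers to one calculator question (made-up values).
answers = [12, 14, 14, 12]

# With a correct label (12) the right answers get positive advantages ...
adv_clean = group_relative_advantages(exact_match_rewards(answers, label=12))
# ... while a wrong label (14) flips the sign and rewards the wrong answers.
adv_noisy = group_relative_advantages(exact_match_rewards(answers, label=14))

print(adv_clean)
print(adv_noisy)
```

A single mislabeled item therefore pushes the policy in the wrong direction on that item, which is the mechanism behind the accuracy gap the abstract reports after relabeling.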
[AI-1] Clustering with Label Consistency
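[Quick Read]: This paper addresses the lack of label consistency in traditional metric clustering algorithms. Existing approaches focus on the stability of cluster centers and neglect the stability of the assignment of points to named clusters across successive solutions, which makes results hard to interpret and reproduce in practice. The key to the solution is a new notion of consistency that measures the label distance between two consecutive solutions. Based on this definition, the authors design new consistent approximation algorithms for the classical k-center and k-median problems, improving label stability while preserving clustering quality.
Link: https://arxiv.org/abs/2512.19654
Authors: Diptarka Chakraborty,Hendrik Fichtenberger,Bernhard Haeupler,Silvio Lattanzi,Ashkan Norouzi-Fard,Ola Svensson
Affiliations: Unknown
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: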
Abstract:Designing efficient, effective, and consistent metric clustering algorithms is a significant challenge attracting growing attention. Traditional approaches focus on the stability of cluster centers; unfortunately, this neglects the real-world need for stable point labels, i.e., stable assignments of points to named sets (clusters). In this paper, we address this gap by initiating the study of label-consistent metric clustering. We first introduce a new notion of consistency, measuring the label distance between two consecutive solutions. Then, armed with this new definition, we design new consistent approximation algorithms for the classical k-center and k-median problems.
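The abstract does not spell out the formal definition of label distance, so the following is only one plausible reading, labeled as an assumption: count the points whose named cluster changes between two consecutive solutions. The helper names and the 1-D toy data are hypothetical.

```python
def assign_labels(points, centers):
    """Label each 1-D point with the name of its nearest center.

    `centers` maps a cluster name to a coordinate; ties are not handled specially.
    """
    return {p: min(centers, key=lambda name: abs(p - centers[name])) for p in points}


def label_distance(labels_before, labels_after):
    """Assumed label distance: number of shared points whose label differs."""
    return sum(
        1
        for p, label in labels_before.items()
        if p in labels_after and labels_after[p] != label
    )


# Toy 1-D instance (made up): the centers move between two consecutive solutions.
points = [0.0, 1.0, 4.0, 5.0, 9.0]
before = assign_labels(points, {"A": 0.5, "B": 5.0})
after = assign_labels(points, {"A": 2.5, "B": 9.0})

print(label_distance(before, after))
```

In this toy instance both center sets are reasonable clusterings, yet two of the five points switch from B to A. Center-based stability notions do not measure that kind of label churn, which is the gap the paper sets out to study.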
[AI-2] LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
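[Quick Read]: This paper addresses the fact that classical satellite attitude controllers are time-consuming to design and sensitive to model uncertainties and changing operating conditions. The key to the solution is to use Deep Reinforcement Learning (DRL) to train an adaptive controller entirely in simulation and then deploy it in orbit for the first time, where repeated maneuvers on a real satellite confirm that the approach is effective and robust.
Link: https://arxiv.org/abs/2512.19576
Authors: Kirill Djebko,Tom Baumann,Erik Dilger,Frank Puppe,Sergio Montenegro
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 55 pages, 27 figures, 29 tables. The maneuver telemetry datasets generated and analyzed during this work are available in the GitHub repository this https URL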
Abstract:Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
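For context on the classical baseline, the sketch below simulates a textbook single-axis PD controller: the commanded torque is proportional to the pointing error plus a damping term on the angular rate. It is a toy illustration only. Real attitude control is three-axis and typically quaternion-based, and the inertia, gains, and time step here are hypothetical, not InnoCube's.

```python
def simulate_pd(theta0, omega0, kp, kd, inertia=1.0, dt=0.01, steps=1000):
    """Simulate a rigid body on one axis under PD control toward angle 0.

    theta0 and omega0 are the initial angle (rad) and angular rate (rad/s).
    Uses semi-implicit Euler integration of inertia * theta'' = torque.
    """
    theta, omega = theta0, omega0
    for _ in range(steps):
        torque = -kp * theta - kd * omega   # PD law: error term plus rate damping
        omega += (torque / inertia) * dt    # update angular rate first
        theta += omega * dt                 # then integrate the angle
    return theta, omega


# Hypothetical gains giving critical damping for unit inertia (kd**2 == 4 * kp).
final_theta, final_omega = simulate_pd(theta0=0.5, omega0=0.0, kp=4.0, kd=4.0)
print(f"angle after 10 s: {final_theta:.2e} rad, rate: {final_omega:.2e} rad/s")
```

Choosing kp and kd requires a model of the inertia and actuators, which is the design effort and model sensitivity that the abstract contrasts with a learned controller.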
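[AI-3] The Epistemological Consequences of Large Language Models: Rethinking collective intelligence and institutional knowledge
[Quick Read]: This paper addresses the epistemological threats posed by interaction between humans and Large Language Models (LLMs), in particular how that interaction affects collective rationality and the formation of reflective knowledge. It argues that LLMs can reliably transmit truths (externalist justification) but lack reflective understanding of why a proposition is true (internalist justification), so that widespread reliance on them may weaken the epistemic duties of individuals and groups in professional and civic domains. The key to the solution is a three-tier norm program: an epistemic interaction model for individual use; institutional and organizational frameworks that seed and enforce norms for epistemically optimal outcomes; and deontic constraints at organizational and/or legislative levels that instantiate discursive norms and curb epistemic vices. Together the three tiers aim to preserve and strengthen human reflective capacity and epistemic responsibility.
Link: https://arxiv.org/abs/2512.19570
Authors: Angjelin Hila
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: AI Soc (2025)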
Abstract:We examine epistemological threats posed by human and LLM interaction. We develop collective epistemology as a theory of epistemic warrant distributed across human collectives, using bounded rationality and dual process theory as background. We distinguish internalist justification, defined as reflective understanding of why a proposition is true, from externalist justification, defined as reliable transmission of truths. Both are necessary for collective rationality, but only internalist justification produces reflective knowledge. We specify reflective knowledge as follows: agents understand the evaluative basis of a claim, when that basis is unavailable agents consistently assess the reliability of truth sources, and agents have a duty to apply these standards within their domains of competence. We argue that LLMs approximate externalist reliabilism because they can reliably transmit information whose justificatory basis is established elsewhere, but they do not themselves possess reflective justification. Widespread outsourcing of reflective work to reliable LLM outputs can weaken reflective standards of justification, disincentivize comprehension, and reduce agents’ capacity to meet professional and civic epistemic duties. To mitigate these risks, we propose a three tier norm program that includes an epistemic interaction model for individual use, institutional and organizational frameworks that seed and enforce norms for epistemically optimal outcomes, and deontic constraints at organizational and/or legislative levels that instantiate discursive norms and curb epistemic vices.
[AI-4] Results of the 2024 CommonRoad Motion Planning Competition for Autonomous Vehicles
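[Quick Read]: This paper addresses the lack of standardized benchmark comparisons for autonomous vehicle motion planning algorithms in complex traffic scenarios, which limits objective assessment of their relative strengths and weaknesses. The key to the solution is an annual open competition built on the CommonRoad benchmark suite. Planners are tested on shared scenarios that span highway and urban environments with diverse traffic participants and are evaluated along four dimensions: efficiency, safety, comfort, and compliance with selected traffic rules. This makes planner performance reproducible and comparable and gives the field a common point of reference.
Link: https://arxiv.org/abs/2512.19564
Authors: Yanliang Huang,Xia Yan,Peiran Yin,Zhenduo Zhang,Zeyan Shao,Youran Wang,Haoliang Huang,Matthias Althoff
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: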
Abstract:Over the past decade, a wide range of motion planning approaches for autonomous vehicles has been developed to handle increasingly complex traffic scenarios. However, these approaches are rarely compared on standardized benchmarks, limiting the assessment of relative strengths and weaknesses. To address this gap, we present the setup and results of the 4th CommonRoad Motion Planning Competition held in 2024, conducted using the CommonRoad benchmark suite. This annual competition provides an open-source and reproducible framework for benchmarking motion planning algorithms. The benchmark scenarios span highway and urban environments with diverse traffic participants, including passenger cars, buses, and bicycles. Planner performance is evaluated along four dimensions: efficiency, safety, comfort, and compliance with selected traffic rules. This report introduces the competition format and provides a comparison of representative high-performing planners from the 2023 and 2024 editions.
[AI-5] REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation
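[Quick Read]: This paper addresses the difficulty of evaluating how well Vision-Language-Action (VLA) models generalize beyond the environments and conditions they were trained on. Such evaluation currently depends on real-world experiments that are expensive and slow, and reliable simulation alternatives are lacking. The authors propose REALM, a high-fidelity simulation environment and benchmark whose central aim is a strong correlation between simulated and real-world performance, achieved through realistic visuals and aligned robot control. This provides a quantifiable, repeatable way to assess generalization, supports systematic analysis of the weaknesses and failure modes of VLA models, and shows that simulation can serve as a valuable proxy for the real world.
Link: https://arxiv.org/abs/2512.19562
Authors: Martin Sedlacek,Pavlo Yefanov,Georgy Ponimatkin,Jai Bardhan,Simon Pilc,Mederic Fourmy,Evangelos Kazakos,Cees G. M. Snoek,Josef Sivic,Vladimir Petrik
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 10 figures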
Abstract:Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real-world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 3,500 objects. Finally, we establish two task sets that form our benchmark and evaluate the π0, π0-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation gives us a valuable proxy for the real-world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs. Project page: this https URL
[AI-6] Augmenting Intelligence: A Hybrid Framework for Scalable and Stable Explanations KR
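[Quick read]: This paper addresses the long-standing “Scalability-Stability Dilemma” in Explainable AI (XAI): post-hoc explanation methods (e.g., LIME, SHAP) scale easily but give unstable results, while supervised explanation frameworks (e.g., TED) are stable but require costly human labeling of every training instance. The key to the solution is a Hybrid LRR-TED framework whose core idea is the “Asymmetry of Discovery”: automated rule learners (GLRM) are good at identifying broad “Safety Nets” (retention patterns) but struggle to capture specific “Risk Traps” (churn triggers). By using the automatically generated safety rules as the initial explanation matrix and augmenting it with a Pareto-optimal set of only four human-defined risk rules, the method reaches 94.00% predictive accuracy on customer churn prediction, outperforming the full 8-rule manual baseline while cutting human annotation effort by 50%. This shifts the expert's human-in-the-loop role from “Rule Writer” to “Exception Handler.”
Link: https://arxiv.org/abs/2512.19557
Authors: Lawrence Krukrubo,Julius Odede,Olawande Olusegun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, 2 tables. Code and experiments available at this https URL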
Abstract:Current approaches to Explainable AI (XAI) face a “Scalability-Stability Dilemma.” Post-hoc methods (e.g., LIME, SHAP) may scale easily but suffer from instability, while supervised explanation frameworks (e.g., TED) offer stability but require prohibitive human effort to label every training instance. This paper proposes a Hybrid LRR-TED framework that addresses this dilemma through a novel “Asymmetry of Discovery.” When applied to customer churn prediction, we demonstrate that automated rule learners (GLRM) excel at identifying broad “Safety Nets” (retention patterns) but struggle to capture specific “Risk Traps” (churn triggers), a phenomenon we term the Anna Karenina Principle of Churn. By initialising the explanation matrix with automated safety rules and augmenting it with a Pareto-optimal set of just four human-defined risk rules, our approach achieves 94.00% predictive accuracy. This configuration outperforms the full 8-rule manual expert baseline while reducing human annotation effort by 50%, proposing a shift in the paradigm for Human-in-the-Loop AI: moving experts from the role of “Rule Writers” to “Exception Handlers.”
[AI-7] CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal
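[Quick read]: This paper addresses the underuse of failure data when training with group-relative reinforcement learning with verifiable rewards (RLVR). When all rollouts are wrong, gradients stall; and even when one rollout happens to be correct, the update usually ignores the other close-but-wrong rollouts, so credit can be misassigned to spurious reasoning chains. The key to the solution is the CARE (Contrastive Anchored REflection) framework, which has two modules: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout together with semantically proximate hard negatives, applies within-subgroup z-score normalization with negative-only scaling, and adds an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failed rollout and re-scores it with the same verifier, turning near-misses into usable positives without test-time reflection. CARE improves accuracy and training stability while explicitly increasing the share of learning signal that comes from failures.
Link: https://arxiv.org/abs/2512.19554
Authors: Yongxin Wang,Zhicheng Yang,Meng Cao,Mingfei Han,Haokun Lin,Yingying Zhu,Xiaojun Chang,Xiaodan Liang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: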
Abstract:Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.
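The stalled-gradient case mentioned in this abstract can be illustrated with a minimal sketch of plain group-relative z-score advantages. This is the generic baseline the abstract contrasts against, not CARE's anchored-contrastive objective; the helper name `group_relative_advantages`, the `eps` constant, and the binary rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each rollout's reward against its own group (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Assumed binary verifier rewards for four rollouts of the same prompt.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
one_right = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

print(all_wrong)  # every advantage is zero, so this group contributes no learning signal
print(one_right)  # one positive advantage; the three failures share one negative value
```

In the second group all failures receive the same advantage no matter how close each was to being right, which is the kind of lost information the abstract says CARE targets.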
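[AI-8] Towards Closed-Loop Embodied Empathy Evolution: Probing LLM-Centric Lifelong Empathic Motion Generation in Unseen Scenarios
[Quick read]: This paper addresses a limitation of existing human-centric emotional motion generation methods: they optimize performance on a single fixed dataset and neglect flexible extension and continual learning across scenarios (e.g., sports, dance), which limits real-world generalization. It proposes a new task, LLM-Centric Lifelong Empathic Motion Generation (L²-EMG), whose goal is to let a Large Language Model (LLM) continually acquire emotional motion generation knowledge across different unseen scenarios, moving embodied agents toward closed-loop self-evolution. The key to the solution is an Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) with two core blocks: a causal-guided emotion decoupling block that separates emotion features for cross-scenario transfer, and a scenario-adapted expert constructing block that adapts motion generation to new scenarios. Together they target the “emotion decoupling challenge” and the “scenario adapting challenge.”
Link: https://arxiv.org/abs/2512.19551
Authors: Jiawen Wang,Jingjing Wang,Tianyang Chen,Min Zhang,Guodong Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: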
Abstract:In the literature, existing human-centric emotional motion generation methods primarily focus on boosting performance within a single scale-fixed dataset, largely neglecting the flexible and scale-increasing motion scenarios (e.g., sports, dance), whereas effectively learning these newly emerging scenarios can significantly enhance the model’s real-world generalization ability. Inspired by this, this paper proposes a new LLM-Centric Lifelong Empathic Motion Generation (L^2-EMG) task, which aims to equip LLMs with the capability to continually acquire emotional motion generation knowledge across different unseen scenarios, potentially contributing to building a closed-loop and self-evolving embodied agent equipped with both empathy and intelligence. Further, this paper poses two key challenges in the L^2-EMG task, i.e., the emotion decoupling challenge and the scenario adapting challenge. To this end, this paper proposes an Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) approach which designs a causal-guided emotion decoupling block and a scenario-adapted expert constructing block to address the two challenges, respectively. Especially, this paper constructs multiple L^2-EMG datasets to validate the effectiveness of the ES-MoE approach. Extensive evaluations show that ES-MoE outperforms advanced baselines.
[AI-9] Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement
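[Quick read]: This paper addresses the challenge of predicting reaction outcomes across continuous solvent composition ranges in organic synthesis and process chemistry. Traditional machine learning methods treat the solvent as a discrete categorical variable, which prevents systematic interpolation and extrapolation across solvent space. The core solution is a hybrid graph neural network (GNN) architecture that combines Graph Attention Networks (GATs), Differential Reaction Fingerprints (DRFP), and learned mixture-aware solvent encodings. Explicit molecular-graph message passing and continuous encoding of mixture ratios markedly improve generalization to unseen chemical environments, reaching a mean squared error (MSE) of 0.0039, a 60% error reduction over competitive baselines and a 25-fold improvement over tabular ensembles.
Link: https://arxiv.org/abs/2512.19530
Authors: Hongsheng Xing,Qiuxin Si
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures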
Abstract:Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categorical variable, which prevents systematic interpolation and extrapolation across the solvent space. This work introduces the Catechol Benchmark, a high-throughput transient flow chemistry dataset comprising 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures, parameterized by continuous volume fractions (%B). We evaluate various architectures under rigorous leave-one-solvent-out and leave-one-mixture-out protocols to test generalization to unseen chemical environments. Our results demonstrate that classical tabular methods (e.g., Gradient-Boosted Decision Trees) and large language model embeddings (e.g., Qwen-7B) struggle with quantitative precision, yielding Mean Squared Errors (MSE) of 0.099 and 0.129, respectively. In contrast, we propose a hybrid GNN-based architecture that integrates Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. This approach achieves an MSE of 0.0039 ($\pm 0.0003$), representing a 60% error reduction over competitive baselines and a $25\times$ improvement over tabular ensembles. Ablation studies confirm that explicit molecular graph message-passing and continuous mixture encoding are essential for robust generalization. The complete dataset, evaluation protocols, and reference implementations are released to facilitate data-efficient reaction prediction and continuous solvent representation learning.
MSC classes: 68T07, 92E20, 62M45; ACM classes: I.2.1; I.2.6; J.2
[AI-10] QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
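[Quick read]: This paper addresses the lack of quantitative evaluation of physical reasoning in Vision-Language Models (VLMs), in particular whether they can accurately infer kinematic quantities of moving objects (such as size, velocity, and acceleration) from video. Existing evaluations are mostly VQA-based and measure only qualitative plausibility, not numerical accuracy. The key to the solution is QuantiPhy, the first benchmark designed to quantitatively assess VLM physical reasoning. It contains more than 3.3K video-text instances with numerical ground truth and uses standardized prompts and scoring to test how accurately a model estimates physical properties given an input prior. Experiments reveal a consistent gap between qualitative plausibility and actual numerical correctness in state-of-the-art VLMs, and show that their reasoning leans heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs. QuantiPhy thus offers a rigorous, scalable testbed for moving VLMs toward numerically grounded physical understanding.
Link: https://arxiv.org/abs/2512.19526
Authors: Li Puyin,Tiange Xiang,Ella Mao,Shirley Wei,Xinye Chen,Adnan Masood,Li Fei-fei,Ehsan Adeli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: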
Abstract:Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM’s physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM’s performance on estimating an object’s size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
[AI-11] LacaDM: A Latent Causal Diffusion Model for Multiobjective Reinforcement Learning
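[Quick read]: This paper addresses poor generalization in Multiobjective Reinforcement Learning (MORL) caused by conflicts between objectives and weak adaptation to dynamic environments, especially in large, complex state-action spaces. The key to the solution is the Latent Causal Diffusion Model (LacaDM), which learns latent temporal causal relationships between environment states and policies within a diffusion-model framework. This enables effective knowledge transfer across diverse MORL scenarios and balances conflicting objectives while retaining strong generalization.
Link: https://arxiv.org/abs/2512.19516
Authors: Xueming Yan,Bo Yin,Yaochu Jin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: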
Abstract:Multiobjective reinforcement learning (MORL) poses significant challenges due to the inherent conflicts between objectives and the difficulty of adapting to dynamic environments. Traditional methods often struggle to generalize effectively, particularly in large and complex state-action spaces. To address these limitations, we introduce the Latent Causal Diffusion Model (LacaDM), a novel approach designed to enhance the adaptability of MORL in discrete and continuous environments. Unlike existing methods that primarily address conflicts between objectives, LacaDM learns latent temporal causal relationships between environmental states and policies, enabling efficient knowledge transfer across diverse MORL scenarios. By embedding these causal structures within a diffusion model-based framework, LacaDM achieves a balance between conflicting objectives while maintaining strong generalization capabilities in previously unseen environments. Empirical evaluations on various tasks from the MOGymnasium framework demonstrate that LacaDM consistently outperforms the state-of-the-art baselines in terms of hypervolume, sparsity, and expected utility maximization, showcasing its effectiveness in complex multiobjective tasks.
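Hypervolume, one of the metrics named in this abstract, can be sketched for the two-objective case as the area dominated by a set of return vectors above a reference point. The sketch below assumes both objectives are maximized; the function name `hypervolume_2d`, the sample points, and the reference point are illustrative assumptions, not values from the paper.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (both objectives maximized)."""
    # Keep only points that strictly improve on the reference in both objectives.
    candidates = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective to the smallest.
    candidates.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_second = ref[1]
    for first, second in candidates:
        if second > best_second:  # dominated points add no new area
            area += (first - ref[0]) * (second - best_second)
            best_second = second
    return area

front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
hv_front = hypervolume_2d(front, ref=(0.0, 0.0))
hv_with_dominated = hypervolume_2d(front + [(1.0, 1.0)], ref=(0.0, 0.0))
print(hv_front, hv_with_dominated)
```

A larger hypervolume means the policy set covers more of the objective space; adding a dominated point leaves the value unchanged.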
[AI-12] DK-STN: A Domain Knowledge Embedded Spatio-Temporal Network Model for MJO Forecast
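[Quick read]: This paper addresses two problems in long-range Madden-Julian Oscillation (MJO) forecasting: conventional Numerical Weather Prediction (NWP) methods are resource-intensive, computationally slow, and sensitive to season, while existing Artificial Neural Network (ANN) methods are efficient but not accurate enough. The key to the solution is the Domain Knowledge Embedded Spatio-Temporal Network (DK-STN), which relies on two mechanisms: a domain knowledge enhancement method that helps the model capture climate physics, and a domain knowledge processing method integrated into network training. DK-STN keeps the efficiency and stability of ANNs while substantially improving accuracy, with an error of only 2-3 days over a 28-day forecast horizon, matching the accuracy of ECMWF's operational NWP forecasts.
Link: https://arxiv.org/abs/2512.19506
Authors: Hongliang Li,Nong Zhang,Zhewen Xu,Xiang Li,Changzheng Liu,Chongbo Zhao,Jie Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures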
Abstract:Understanding and predicting the Madden-Julian Oscillation (MJO) is fundamental for precipitation forecasting and disaster prevention. To date, long-term and accurate MJO prediction has remained a challenge for researchers. Conventional MJO prediction methods using Numerical Weather Prediction (NWP) are resource-intensive, time-consuming, and highly unstable (most NWP methods are sensitive to seasons, with better MJO forecast results in winter). While existing Artificial Neural Network (ANN) methods save resources and speed forecasting, their accuracy never reaches the 28 days predicted by the state-of-the-art NWP method, i.e., the operational forecasts from ECMWF, since neural networks cannot handle climate data effectively. In this paper, we present a Domain Knowledge Embedded Spatio-Temporal Network (DK-STN), a stable neural network model for accurate and efficient MJO forecasting. It combines the benefits of NWP and ANN methods and successfully improves the forecast accuracy of ANN methods while maintaining a high level of efficiency and stability. We begin with a spatial-temporal network (STN) and embed domain knowledge in it using two key methods: (i) applying a domain knowledge enhancement method and (ii) integrating a domain knowledge processing method into network training. We evaluated DK-STN with the 5th generation of ECMWF reanalysis (ERA5) data and compared it with ECMWF. Given 7 days of climate data as input, DK-STN can generate reliable forecasts for the following 28 days in 1-2 seconds, with an error of only 2-3 days in different seasons. DK-STN significantly exceeds ECMWF in that its forecast accuracy is equivalent to ECMWF’s, while its efficiency and stability are significantly superior.
[AI-13] Kolmogorov-Arnold Graph Neural Networks Applied to Inorganic Nanomaterials Dataset
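[Quick read]: This paper addresses the limited application of Graph Neural Networks (GNNs) to inorganic nanomaterials datasets, in particular the recently published large dataset CHILI. Earlier studies focused on graph datasets of organic molecules, leaving the modeling of inorganic materials underexplored. The key to the solution is to apply Kolmogorov-Arnold Graph Neural Networks (KAGNNs), a newer architecture based on the Kolmogorov-Arnold representation theorem, and adapt them to CHILI. Experiments show that on the CHILI-3K subset KAGNNs substantially outperform conventional GNNs and achieve state-of-the-art classification results.
Link: https://arxiv.org/abs/2512.19494
Authors: Nikita Volzhin,Soowhan Yoon
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: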
Abstract:The recent development of Kolmogorov-Arnold Networks (KANs) introduced new discoveries in the field of Graph Neural Networks (GNNs), expanding the existing set of models with KAN-based versions of GNNs, which often surpass the accuracy of MultiLayer Perceptron (MLP)-based GNNs. These models were widely tested on the graph datasets consisting of organic molecules; however, those studies disregarded the inorganic nanomaterials datasets. In this work, we close this gap by applying Kolmogorov-Arnold Graph Neural Networks (KAGNNs) to a recently published large inorganic nanomaterials dataset called CHILI. For this, we adapt and test KAGNNs appropriate for this dataset. Our experiments reveal that on the CHILI datasets, particularly on the CHILI-3K, KAGNNs substantially surpass conventional GNNs in classification, achieving state-of-the-art results.
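[AI-14] A Dataset and Preliminary Study of Using GPT-5 for Code-change Impact Analysis
[Quick read]: This paper addresses the automation of code-change impact analysis in software development, that is, accurately predicting which other code entities are affected by given source code changes. Such analysis is currently done mostly by hand and is slow. The study uses large language models (LLMs), specifically GPT-5 and GPT-5-mini, and builds a dataset with seed-change, change-pair, and change-type information. It evaluates the models in two configurations: (1) seed-change information plus the parent commit tree, and (2) the same inputs plus the diff hunk of each seed change. Both models perform poorly overall, but GPT-5 outperforms GPT-5-mini and adding diff hunks slightly improves performance, which shows both the potential and the limits of LLMs on this task.
Link: https://arxiv.org/abs/2512.19481
Authors: Katharina Stengg,Christian Macho,Martin Pinzger
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 6 pages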
Abstract:Understanding source code changes and their impact on other code entities is a crucial skill in software development. However, the analysis of code changes and their impact is often performed manually and therefore is time-consuming. Recent advancements in AI, and in particular large language models (LLMs), show promise in helping developers with various code analysis tasks. However, the extent to which this potential can be utilized for understanding code changes and their impact is underexplored. To address this gap, we study the capabilities of GPT-5 and GPT-5-mini to predict the code entities impacted by given source code changes. We construct a dataset containing information about seed-changes, change pairs, and change types for each commit. Existing datasets lack crucial information about seed changes and impacted code entities. Our experiments evaluate the LLMs in two configurations: (1) seed-change information and the parent commit tree and (2) seed-change information, the parent commit tree, and the diff hunk of each seed change. We found that both LLMs perform poorly in the two experiments, whereas GPT-5 outperforms GPT-5-mini. Furthermore, the provision of the diff hunks helps both models to slightly improve their performance.
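[AI-15] Multi-Layer Confidence Scoring for Detection of Out-of-Distribution Samples, Adversarial Attacks, and In-Distribution Misclassifications
[Quick read]: This paper addresses the regulatory and practical problems that arise when models used in high-stakes domains lack transparency and trustworthiness, and in particular the fact that already deployed models cannot gain confidence estimation without retraining. The key to the solution is a unified post-hoc framework, Multi-Layer Analysis for Confidence Scoring (MACS). It analyzes the intermediate activations of a pre-trained model to produce classification-maps, then derives from them a single score usable for confidence estimation, distribution-shift detection, and adversarial-attack detection. The three problems are thus handled jointly without retraining, with performance that surpasses state-of-the-art approaches in the authors' experiments.
Link: https://arxiv.org/abs/2512.19472
Authors: Lorenzo Capelli,Leandro de Souza Rosa,Gianluca Setti,Mauro Mangia,Riccardo Rovatti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: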
Abstract:The recent explosive growth in Deep Neural Networks applications raises concerns about the black-box usage of such models, with limited transparency and trustworthiness in high-stakes domains, which have been crystallized as regulatory requirements such as the European Union Artificial Intelligence Act. While models with embedded confidence metrics have been proposed, such approaches cannot be applied to already existing models without retraining, limiting their broad application. On the other hand, post-hoc methods, which evaluate pre-trained models, focus on solving problems related to improving the confidence in the model’s predictions, and detecting Out-Of-Distribution or Adversarial Attacks samples as independent applications. To tackle the limited applicability of already existing methods, we introduce Multi-Layer Analysis for Confidence Scoring (MACS), a unified post-hoc framework that analyzes intermediate activations to produce classification-maps. From the classification-maps, we derive a score applicable for confidence estimation, detecting distributional shifts and adversarial attacks, unifying the three problems in a common framework, and achieving performances that surpass the state-of-the-art approaches in our experiments with the VGG16 and ViTb16 models with a fraction of their computational overhead.
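[AI-16] An Agentic Framework for Autonomous Materials Computation
[Quick read]: This paper addresses the limits that static knowledge and hallucination place on using Large Language Models (LLMs) for autonomous research in scientific discovery. The key to the solution is a domain-specialized agent that embeds materials-computation expertise, keeps multi-step workflows physically coherent, and consistently selects convergent, well-posed calculation parameters, enabling reliable end-to-end automated execution of first-principles materials computations.
Link: https://arxiv.org/abs/2512.19458
Authors: Zeyu Xia,Jinzhe Ma,Congjie Zheng,Shufei Zhang,Yuqiang Li,Hang Su,P. Hu,Changshui Zhang,Xingao Gong,Wanli Ouyang,Lei Bai,Dongzhan Zhou,Mao Su
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
Comments: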
Abstract:Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery, yet their static knowledge and hallucination issues hinder autonomous research applications. Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific workflows. Here, we present a domain-specialized agent designed for reliable automation of first-principles materials computations. By embedding domain expertise, the agent ensures physically coherent multi-step workflows and consistently selects convergent, well-posed parameters, thereby enabling reliable end-to-end computational execution. A new benchmark of diverse computational tasks demonstrates that our system significantly outperforms standalone LLMs in both accuracy and robustness. This work establishes a verifiable foundation for autonomous computational experimentation and represents a key step toward fully automated scientific discovery.
[AI-17] Attention Is Not What You Need
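[Quick read]: This paper asks whether explicit self-attention is actually necessary for strong performance and reasoning in sequence modeling. The author argues that standard multi-head attention is essentially tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and gradient descent constrains the lifted tensor, which is expressive but mathematically opaque. The paper proposes an attention-free architecture based on Grassmann flows. Its Causal Grassmann layer works in three steps: (i) linearly reduce token states, (ii) encode local token pairs as two-dimensional subspaces on a Grassmann manifold via Plücker coordinates, and (iii) fuse these geometric features back into the hidden states through gated mixing. Information propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space, offering a more structured route to geometric and invariant-based interpretation.
Link: https://arxiv.org/abs/2512.19428
Authors: Zhang Chong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Geometry (math.AG)
Comments: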
Abstract:We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is extremely expressive but mathematically opaque, because after many layers it becomes very hard to describe the model with a small family of explicit invariants. To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L by L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces on a Grassmann manifold via Plucker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space. On the Wikitext-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plucker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 compared to 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric and invariant-based interpretations of neural reasoning. 
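Step (ii) of the layer described in this abstract relies on Plücker coordinates, which represent the plane spanned by two vectors through the 2x2 minors of the matrix whose rows are those vectors. The sketch below shows only that standard definition on toy integer vectors; the function name is hypothetical, and how the paper reduces, normalizes, or mixes these coordinates is not specified in the abstract and is not modeled here.

```python
from itertools import combinations

def plucker_coordinates(u, v):
    """Return p_ij = u_i*v_j - u_j*v_i for all index pairs i < j."""
    return [u[i] * v[j] - u[j] * v[i] for i, j in combinations(range(len(u)), 2)]

u = [1, 0, 0]
v = [0, 1, 0]
same_plane = [a + b for a, b in zip(u, v)]  # u + v still lies in span{u, v}

p = plucker_coordinates(u, v)
q = plucker_coordinates(same_plane, v)      # different basis, same subspace
r = plucker_coordinates(u, [2, 0, 0])       # parallel vectors span no plane

print(p, q, r)
```

Changing the basis of the same plane leaves the coordinates unchanged up to a common scale factor, and a degenerate pair maps to the zero vector. For reduced dimension d there are d(d-1)/2 coordinates per token pair.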
[AI-18] Research Program: Theory of Learning in Dynamical Systems
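[Quick read]: This paper asks when a dynamical system is learnable from observations alone, that is, how to infer a system's structure and behavior from data that evolve over time and depend on hidden state. Its main contribution is a framework for studying learnability through next-token prediction. It argues that learnability should be treated as a finite-sample question based on properties of the underlying dynamics (such as stability, mixing, observability, and spectral properties) rather than on statistical properties of the generated sequence. The key is the notion of “dynamic learnability”: using improper methods based on spectral filtering, accurate prediction for linear dynamical systems is achievable after finite observation without full system identification, which gives a theoretical basis for understanding the limits of learning in complex dynamical systems.
Link: https://arxiv.org/abs/2512.19410
Authors: Elad Hazan,Shai Shalev Shwartz,Nathan Srebro
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: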
Abstract:Modern learning systems increasingly interact with data that evolve over time and depend on hidden internal state. We ask a basic question: when is such a dynamical system learnable from observations alone? This paper proposes a research program for understanding learnability in dynamical systems through the lens of next-token prediction. We argue that learnability in dynamical systems should be studied as a finite-sample question, and be based on the properties of the underlying dynamics rather than the statistical properties of the resulting sequence. To this end, we give a formulation of learnability for stochastic processes induced by dynamical systems, focusing on guarantees that hold uniformly at every time step after a finite burn-in period. This leads to a notion of dynamic learnability which captures how the structure of a system, such as stability, mixing, observability, and spectral properties, governs the number of observations required before reliable prediction becomes possible. We illustrate the framework in the case of linear dynamical systems, showing that accurate prediction can be achieved after finite observation without system identification, by leveraging improper methods based on spectral filtering. We survey the relationship between learning in dynamical systems and classical PAC, online, and universal prediction theories, and suggest directions for studying nonlinear and controlled systems.
[AI-19] EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
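[Quick read]: This paper addresses the “digital amnesia” of current graphical user interface (GUI) agents: they lack a way to learn systematically from past successes, which leads to sub-optimal performance, repeated errors, and poor generalization to new challenges. The key to the solution is EchoTrail-GUI, a framework that mimics human experiential learning through a dynamic, accessible memory. In the Experience Exploration stage it automatically builds a database of successful task trajectories validated by a reward model; in the Memory Injection stage it retrieves the past trajectories most relevant to the current task; and in the GUI Task Inference stage it injects these memories as in-context guidance for the agent's decisions, which significantly improves task success rate and operational efficiency.
Link: https://arxiv.org/abs/2512.19396
Authors: Runze Li,Yuwen Zhai,Bo Xu,LiWu Xu,Nian Shi,Wei Zhang,Ran Lin,Liang Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: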
Abstract:Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital “amnesia” results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable “memories”. Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent’s reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.
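The retrieve-then-inject pattern of the Memory Injection and GUI Task Inference stages can be sketched as below. The abstract does not say how relevance is computed, so cosine similarity over task embeddings is an assumption here, as are the function names, the toy three-dimensional vectors, the stored trajectories, and the prompt wording.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memory, k=1):
    """Return the k stored entries whose task embedding is closest to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item["embedding"]), reverse=True)
    return ranked[:k]

# Hypothetical memory of successful trajectories with toy embeddings.
memory = [
    {"task": "turn on Wi-Fi", "embedding": [1.0, 0.0, 0.0],
     "trajectory": ["open Settings", "tap Network", "toggle Wi-Fi"]},
    {"task": "set an alarm", "embedding": [0.0, 1.0, 0.0],
     "trajectory": ["open Clock", "tap Alarm", "tap Add"]},
]

query = [0.9, 0.1, 0.0]  # assumed embedding of a new, Wi-Fi-related task
hits = retrieve(query, memory, k=1)

# Inject the retrieved trajectory as in-context guidance for the agent.
prompt = "Relevant past trajectory:\n" + "\n".join(hits[0]["trajectory"])
print(prompt)
```

In a real system the embeddings would come from an encoder model and the memory would hold only trajectories that passed the reward-model check described in the abstract.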
[AI-20] OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation
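[Quick read]: This paper addresses the long-standing lack of resources for Indonesian in multimodal emotion recognition, especially on social media: despite the large number of speakers, high-quality multimodal annotated data and effective cross-modal modeling methods are missing. The main challenges are cross-modal inconsistency in the data and long-tailed class distributions shaped by Indonesian cultural communication norms. The key to the solution is OmniMER, a framework built on Qwen2.5-Omni that adds three modality-specific auxiliary perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings and markedly improving emotion recognition.
Link: https://arxiv.org/abs/2512.19379
Authors: Xueming Yan,Boyan Xu,Yaochu Jin,Lixian Xiao,Wenlong Ye,Runyang Cai,Zeqi Zheng,Jingfa Liu,Aimin Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: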
Abstract:Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. this https URL
[AI-21] Sprecher Networks: A Parameter-Efficient Kolmogorov-Arnold Architecture
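[Quick read]: This paper addresses the parameter and memory inefficiency of conventional neural networks, especially for approximating high-dimensional continuous functions: the parameter count of a Multi-Layer Perceptron (MLP) grows quadratically with width (O(LN²)), and peak forward-intermediate memory scales as O(N²). The key to the solution is Sprecher Networks (SNs), whose structured, learnable blocks are built on Sprecher's 1965 “sum of shifted splines” formula. Shared learnable splines (monotonic and general) are combined with explicit shift parameters and mixing weights; the single-layer variant realizes the construction directly, and stacking layers extends it to deep networks. Optional lateral mixing connections allow efficient communication between dimensions as an alternative to full attention. This keeps expressiveness while reducing parameter complexity to O(LN + LG), where G is the knot count of the shared splines, and peak memory from O(N²) to O(N), making wider networks feasible under memory constraints.
Link: https://arxiv.org/abs/2512.19367
Authors: Christian Hägg,Kathlén Kohn,Giovanni Luca Marchetti,Boris Shapiro
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 37 pages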
Abstract:We present Sprecher Networks (SNs), a family of trainable neural architectures inspired by the classical Kolmogorov-Arnold-Sprecher (KAS) construction for approximating multivariate continuous functions. Distinct from Multi-Layer Perceptrons (MLPs) with fixed node activations and Kolmogorov-Arnold Networks (KANs) featuring learnable edge activations, SNs utilize shared, learnable splines (monotonic and general) within structured blocks incorporating explicit shift parameters and mixing weights. Our approach directly realizes Sprecher’s specific 1965 sum of shifted splines formula in its single-layer variant and extends it to deeper, multi-layer compositions. We further enhance the architecture with optional lateral mixing connections that enable intra-block communication between output dimensions, providing a parameter-efficient alternative to full attention mechanisms. Beyond parameter efficiency with O(LN + LG) scaling (where G is the knot count of the shared splines) versus MLPs’ O(LN^2) , SNs admit a sequential evaluation strategy that reduces peak forward-intermediate memory from O(N^2) to O(N) (treating batch size as constant), making much wider architectures feasible under memory constraints. We demonstrate empirically that composing these blocks into deep networks leads to highly parameter and memory-efficient models, discuss theoretical motivations, and compare SNs with related architectures (MLPs, KANs, and networks with learnable node activations).
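The gap between the two scaling laws quoted in this abstract, O(LN + LG) for SNs and O(LN^2) for MLPs, can be made concrete with a rough calculation. The sketch keeps only the leading terms and sets every hidden constant to 1, which is an assumption; the depth, width, and knot count are arbitrary illustrative values, and the function names are hypothetical.

```python
def mlp_param_scale(num_layers, width):
    """Leading-order MLP weight count: L dense layers of size N x N."""
    return num_layers * width * width

def sn_param_scale(num_layers, width, knots):
    """Leading-order SN count from the abstract's O(LN + LG), constants set to 1."""
    return num_layers * width + num_layers * knots

L, N, G = 4, 1024, 64  # assumed depth, width, and spline knot count
mlp_scale = mlp_param_scale(L, N)
sn_scale = sn_param_scale(L, N, G)
print(mlp_scale, sn_scale)
```

These are scale estimates rather than exact parameter counts for any model in the paper; they show only that the quadratic term dominates as width grows.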
[AI-22] Learning General Policies with Policy Gradient Methods KR2023
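[Quick Read]: This paper addresses the limited generalization of policies learned by deep reinforcement learning (DRL): how can DRL methods learn policies that generalize as reliably and systematically as those produced by combinatorial planning methods? The solution has two core ingredients. First, policies are modeled as state transition classifiers, which sidesteps the fact that ground actions are not general and change from instance to instance. Second, graph neural networks (GNNs) adapted to relational structures represent the value functions and policies over planning states. With these in place, experiments show that actor-critic DRL methods generalize almost as well as combinatorial methods while avoiding their scalability bottleneck and their reliance on feature pools. Further analysis indicates that the limitations of DRL on the benchmarks stem not from deep learning or reinforcement learning algorithms but from the expressive limits of GNNs and the tradeoff between optimality and generalization. Both can be mitigated by adding derived predicates and an alternative cost structure, without changing the basic DRL methods.
Link: https://arxiv.org/abs/2512.19366
Authors: Simon Ståhlberg, Blai Bonet, Hector Geffner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: In Proceedings of the 20th International Conference on Principles of Knowledge Representation and Reasoning (KR 2023)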
Abstract:While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
[AI-23] First-Order Representation Languages for Goal-Conditioned RL AAAI AAAI2026
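[Quick Read]: This paper addresses how to learn goal-conditioned policies that generalize, in goal-conditioned reinforcement learning (goal-conditioned RL) and generalized planning, when training instances are large and goals cannot be reached by random exploration. The key is to represent states and goals as sets of atoms and to relabel unsuccessful trajectories with Hindsight Experience Replay (HER). This automatically builds a curriculum of subgoals from easy to hard, so that policies learn efficiently under sparse rewards and generalize across states and goals.
Link: https://arxiv.org/abs/2512.19355
Authors: Simon Ståhlberg, Hector Geffner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: In Proceedings of the 40th AAAI Conference on Artificial Intelligence (AAAI 2026)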
Abstract:First-order relational languages have been used in MDP planning and reinforcement learning (RL) for two main purposes: specifying MDPs in compact form, and representing and learning policies that are general and not tied to specific instances or state spaces. In this work, we instead consider the use of first-order languages in goal-conditioned RL and generalized planning. The question is how to learn goal-conditioned and general policies when the training instances are large and the goal cannot be reached by random exploration alone. The technique of Hindsight Experience Replay (HER) provides an answer to this question: it relabels unsuccessful trajectories as successful ones by replacing the original goal with one that was actually achieved. If the target policy must generalize across states and goals, trajectories that do not reach the original goal states can enable more data- and time-efficient learning. In this work, we show that further performance gains can be achieved when states and goals are represented by sets of atoms. We consider three versions: goals as full states, goals as subsets of the original goals, and goals as lifted versions of these subgoals. The result is that the latter two successfully learn general policies on large planning instances with sparse rewards by automatically creating a curriculum of easier goals of increasing complexity. The experiments illustrate the computational gains of these versions, their limitations, and opportunities for addressing them.
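The first two goal representations in the abstract (goals as full states, goals as subsets of the original goal) can be sketched as a relabeling step over sets of atoms. This is a minimal illustration, assuming states are frozensets of ground-atom strings and rewards are sparse 0/1; the function names are hypothetical, and the lifted variant is not shown.

```python
def reached(state, goal):
    """A goal (a set of atoms) holds in a state when every goal atom is true."""
    return goal <= state


def relabel(trajectory, original_goal, mode="subset"):
    """Return (state, action, next_state, goal, reward) tuples with a hindsight goal.

    trajectory: list of (state, action, next_state), states as frozensets of atoms.
    mode="full":   the hindsight goal is the entire final state.
    mode="subset": the hindsight goal is the part of the original goal that
                   the final state actually achieved.
    """
    final_state = trajectory[-1][2]
    if mode == "full":
        new_goal = final_state
    elif mode == "subset":
        new_goal = original_goal & final_state
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not new_goal:
        return []  # nothing was achieved, so there is nothing to relabel with
    return [
        (s, a, s2, new_goal, 1.0 if reached(s2, new_goal) else 0.0)
        for (s, a, s2) in trajectory
    ]
```

In subset mode a trajectory that achieves only `on(a,b)` out of the goal `{on(a,b), on(b,c)}` becomes a successful example for the easier goal `{on(a,b)}`, which is how a curriculum of simpler goals can arise from failures.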
[AI-24] PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models
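[Quick Read]: This paper addresses sycophancy in multimodal large language models (MLLMs): the tendency to agree with user input even when it contradicts facts or visual evidence. The behavior has been studied in text-only models but lacks systematic analysis in visual and cross-modal settings. The key to the solution is PENDULUM, a comprehensive evaluation benchmark of roughly 2,000 human-curated Visual Question Answering pairs designed specifically to elicit sycophantic responses. It covers six image domains of varying complexity, supporting a systematic study of how image type and inherent difficulty affect sycophantic tendencies. The authors also propose new metrics that quantify sycophancy in visual reasoning, showing how it varies across multimodal contexts and informing future MLLM architectures and training strategies with better factual consistency and reliability.
Link: https://arxiv.org/abs/2512.19350
Authors: A. B. M. Ashikur Rahman, Saeed Anwar, Muhammad Usman, Irfan Ahmad, Ajmal Mian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: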
Abstract:Sycophancy, an excessive tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence, poses a critical and underexplored challenge for multimodal large language models (MLLMs). While prior studies have examined this behavior in text-only settings of large language models, existing research on visual or multimodal counterparts remains limited in scope and depth of analysis. To address this gap, we introduce a comprehensive evaluation benchmark, PENDULUM, comprising approximately 2,000 human-curated Visual Question Answering pairs specifically designed to elicit sycophantic responses. The benchmark spans six distinct image domains of varying complexity, enabling a systematic investigation of how image type and inherent challenges influence sycophantic tendencies. Through extensive evaluation of state-of-the-art MLLMs, we observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior. Furthermore, we propose novel metrics to quantify sycophancy in visual reasoning, offering deeper insights into its manifestations across different multimodal contexts. Our findings highlight the urgent need for developing sycophancy-resilient architectures and training strategies to enhance factual consistency and reliability in future MLLMs. Our proposed dataset, along with MLLM responses, is available at this https URL.
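[AI-25] VIGOR+: Iterative Confounder Generation and Validation via LLM-CEVAE Feedback Loop
[Quick Read]: This paper addresses causal inference under hidden confounding in observational data, and in particular the bottleneck that confounders generated by large language models (LLMs) are semantically plausible but often lack statistical utility. The core of the solution is VIGOR+ (Variational Information Gain for iterative cOnfounder Refinement), a framework built around an iterative feedback mechanism. Statistical validation signals from a CEVAE (Causal Effect Variational Autoencoder) model, including information gain, latent consistency metrics, and diagnostic messages, are converted into natural language feedback that guides the next round of LLM confounder generation, closing the loop between generation and validation. The paper formalizes the feedback mechanism and proves convergence properties under mild assumptions, with the aim of making generated confounders statistically useful for causal inference.
Link: https://arxiv.org/abs/2512.19349
Authors: JiaWei Zhu, ZiHeng Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 1 figure, 4 tables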
Abstract:Hidden confounding remains a fundamental challenge in causal inference from observational data. Recent advances leverage Large Language Models (LLMs) to generate plausible hidden confounders based on domain knowledge, yet a critical gap exists: LLM-generated confounders often exhibit semantic plausibility without statistical utility. We propose VIGOR+ (Variational Information Gain for iterative cOnfounder Refinement), a novel framework that closes the loop between LLM-based confounder generation and CEVAE-based statistical validation. Unlike prior approaches that treat generation and validation as separate stages, VIGOR+ establishes an iterative feedback mechanism: validation signals from CEVAE (including information gain, latent consistency metrics, and diagnostic messages) are transformed into natural language feedback that guides subsequent LLM generation rounds. This iterative refinement continues until convergence criteria are met. We formalize the feedback mechanism, prove convergence properties under mild assumptions, and provide a complete algorithmic framework.
[AI-26] Alternative positional encoding functions for neural transformers
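[Quick Read]: This paper addresses the design of the positional encoding module in Transformer architectures: how to embed sequence position information into the model input so that recurring patterns of different periods are captured well. The key is a new set of periodic functions proposed as an alternative to sinusoids, which preserve some key properties of sinusoidal functions while departing from them in fundamental ways. In tentative experiments the alternative substantially outperforms the original sinusoidal encoding, which suggests it may be useful in other Transformer architectures.
Link: https://arxiv.org/abs/2512.19323
Authors: Ezequiel Lopez-Rubio, Macoris Decena-Gimenez, Rafael Marcos Luque-Baena
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: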
Abstract:A key module in neural transformer-based deep architectures is positional encoding. This module enables a suitable way to encode positional information as input for transformer neural layers. This success has been rooted in the use of sinusoidal functions of various frequencies, in order to capture recurrent patterns of differing typical periods. In this work, an alternative set of periodic functions is proposed for positional encoding. These functions preserve some key properties of sinusoidal ones, while they depart from them in fundamental ways. Some tentative experiments are reported, where the original sinusoidal version is substantially outperformed. This strongly suggests that the alternative functions may have a wider use in other transformer architectures.
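For orientation, the baseline being replaced is the standard sinusoidal encoding, in which each pair of dimensions uses a sine and a cosine at a frequency that decreases geometrically with the dimension index. The sketch below implements that baseline with a pluggable `wave` argument. The triangle wave is only a stand-in periodic function chosen for illustration; the abstract does not name the paper's functions, so it should not be read as their proposal.

```python
import math


def triangle(theta):
    """Triangle wave with period 2*pi and range [-1, 1]; triangle(0) == 0."""
    return (2.0 / math.pi) * math.asin(math.sin(theta))


def positional_encoding(pos, d_model, wave=math.sin, base=10000.0):
    """Encode one position as d_model values: a wave and its quarter-period shift per pair."""
    enc = []
    for i in range(0, d_model, 2):
        theta = pos / (base ** (i / d_model))
        enc.append(wave(theta))
        enc.append(wave(theta + math.pi / 2))  # equals cos(theta) when wave is sin
    return enc[:d_model]
```

Calling `positional_encoding(pos, d_model)` gives the usual sine/cosine vector, and passing `wave=triangle` swaps in the stand-in while keeping the same frequency schedule.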
[AI-27] SafeMed-R1: Adversarial Reinforcement Learning for Generalizable and Robust Medical Reasoning in Vision-Language Models
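[Quick Read]: This paper addresses the severe vulnerability of Vision-Language Models (VLMs) to adversarial attacks in Medical Visual Question Answering (MedVQA), a weakness that blocks safe deployment in clinical settings. The key to the solution is SafeMed-R1, a hybrid defense framework. At training time it integrates Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to make the reasoning process explicitly robust to worst-case perturbations. At inference time it adds Randomized Smoothing to provide certified L₂-norm robustness guarantees. On the OmniMedVQA benchmark, standard fine-tuned VLMs fall to about 25% accuracy under PGD attacks, whereas SafeMed-R1 maintains 84.45%, while preserving high-quality, interpretable clinical reasoning.
Link: https://arxiv.org/abs/2512.19317
Authors: A.A. Gde Yogi Pramana, Jason Ray, Anthony Jaya, Michael Wijaya
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: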
Abstract:Vision–Language Models (VLMs) show significant promise for Medical Visual Question Answering (VQA), yet their deployment in clinical settings is hindered by severe vulnerability to adversarial attacks. Standard adversarial training, while effective for simpler tasks, often degrades both generalization performance and the quality of generated clinical reasoning. We introduce SafeMed-R1, a hybrid defense framework that ensures robust performance while preserving high-quality, interpretable medical reasoning. SafeMed-R1 employs a two-stage approach: at training time, we integrate Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to explicitly robustify the reasoning process against worst-case perturbations; at inference time, we augment the model with Randomized Smoothing to provide certified L_2 -norm robustness guarantees. We evaluate SafeMed-R1 on the OmniMedVQA benchmark across eight medical imaging modalities comprising over 88,000 samples. Our experiments reveal that standard fine-tuned VLMs, despite achieving 95% accuracy on clean inputs, collapse to approximately 25% under PGD attacks. In contrast, SafeMed-R1 maintains 84.45% accuracy under the same adversarial conditions, representing a 59 percentage point improvement in robustness. Furthermore, we demonstrate that models trained with explicit chain-of-thought reasoning exhibit superior adversarial robustness compared to instruction-only variants, suggesting a synergy between interpretability and security in medical AI systems.
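The inference-time half of the approach can be illustrated with the prediction step of randomized smoothing: classify many Gaussian-perturbed copies of the input and return the majority class. This is a generic sketch, not SafeMed-R1's code. The `classify` callable, the flat list input, and the sample count are assumptions, and a real certificate additionally requires a statistical lower bound on the top-class vote share, which is omitted here.

```python
import random
from collections import Counter


def smoothed_predict(classify, x, sigma, n_samples=200, seed=0):
    """Majority vote of `classify` over Gaussian perturbations of x.

    Returns (top_class, vote_share). Isotropic N(0, sigma^2) noise is what
    ties randomized smoothing to L2-norm guarantees.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classify(noisy)] += 1
    top_class, count = votes.most_common(1)[0]
    return top_class, count / n_samples
```

A larger `sigma` can certify a larger radius but usually costs clean accuracy, so it is typically tuned per task.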
[AI-28] Helios: A Foundational Language Model for Smart Energy Knowledge Reasoning and Application
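[Quick Read]: This paper addresses two core problems that general-purpose large language models (LLMs) face in the smart energy domain: a lack of domain-specific expertise and limited awareness of physical constraints, which together make their inference and generation imprecise in engineering scenarios. The key to the solution is Helios, an LLM tailored to smart energy, together with a full suite of supporting resources. Enersys, a multi-agent collaborative framework for dataset construction, produces the knowledge base EnerBase, the instruction fine-tuning dataset EnerInstruct, and the RLHF dataset EnerReinforce. A three-stage pipeline of large-scale pretraining, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) then improves accuracy on domain tasks, mastery of domain knowledge, and alignment with industry standards.
Link: https://arxiv.org/abs/2512.19299
Authors: Haoyu Jiang, Fanjie Zeng, Boan Qu, Xiaojie Lin, Wei Zhong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: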
Abstract:In the global drive toward carbon neutrality, deeply coordinated smart energy systems underpin industrial transformation. However, the interdisciplinary, fragmented, and fast-evolving expertise in this domain prevents general-purpose LLMs, which lack domain knowledge and physical-constraint awareness, from delivering precise engineering-aligned inference and generation. To address these challenges, we introduce Helios, a large language model tailored to the smart energy domain, together with a comprehensive suite of resources to advance LLM research in this field. Specifically, we develop Enersys, a multi-agent collaborative framework for end-to-end dataset construction, through which we produce: (1) a smart energy knowledge base, EnerBase, to enrich the model’s foundational expertise; (2) an instruction fine-tuning dataset, EnerInstruct, to strengthen performance on domain-specific downstream tasks; and (3) an RLHF dataset, EnerReinforce, to align the model with human preferences and industry standards. Leveraging these resources, Helios undergoes large-scale pretraining, SFT, and RLHF. We also release EnerBench, a benchmark for evaluating LLMs in smart energy scenarios, and demonstrate that our approach significantly enhances domain knowledge mastery, task execution accuracy, and alignment with human preferences.
[AI-29] Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models NDSS2026
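[Quick Read]: This paper addresses the security risk of a new class of backdoor attacks on LoRA (Low-Rank Adaptation) fine-tuning in the open-weight model setting. Existing attack strategies depend on the original training data, ignore LoRA's structural properties, or suffer from a high false trigger rate (FTR), so they are neither stealthy nor controllable. The key is the Causal-Guided Detoxify Backdoor Attack (CBA) framework. First, a coverage-guided data generation pipeline synthesizes task-aligned inputs through behavioral exploration. Second, a causal-guided detoxification strategy merges poisoned and clean adapters while preserving task-critical neurons. This allows post-training control over attack intensity without retraining, reduces FTR by 50-70% relative to baseline methods, and increases resistance to state-of-the-art backdoor defenses.
Link: https://arxiv.org/abs/2512.19297
Authors: Linzhi Chen, Yang Sun, Hongru Wei, Yuqi Chen
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: NDSS 2026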
Abstract:Low-Rank Adaptation (LoRA) has emerged as an efficient method for fine-tuning large language models (LLMs) and is widely adopted within the open-source community. However, the decentralized dissemination of LoRA adapters through platforms such as Hugging Face introduces novel security vulnerabilities: malicious adapters can be easily distributed and evade conventional oversight mechanisms. Despite these risks, backdoor attacks targeting LoRA-based fine-tuning remain relatively underexplored. Existing backdoor attack strategies are ill-suited to this setting, as they often rely on inaccessible training data, fail to account for the structural properties unique to LoRA, or suffer from high false trigger rates (FTR), thereby compromising their stealth. To address these challenges, we propose Causal-Guided Detoxify Backdoor Attack (CBA), a novel backdoor attack framework specifically designed for open-weight LoRA models. CBA operates without access to original training data and achieves high stealth through two key innovations: (1) a coverage-guided data generation pipeline that synthesizes task-aligned inputs via behavioral exploration, and (2) a causal-guided detoxification strategy that merges poisoned and clean adapters by preserving task-critical neurons. Unlike prior approaches, CBA enables post-training control over attack intensity through causal influence-based weight allocation, eliminating the need for repeated retraining. Evaluated across six LoRA models, CBA achieves high attack success rates while reducing FTR by 50-70% compared to baseline methods. Furthermore, it demonstrates enhanced resistance to state-of-the-art backdoor defenses, highlighting its stealth and robustness.
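[AI-30] Vibe Reasoning: Eliciting Frontier AI Mathematical Capabilities – A Case Study on IMO 2025 Problem 6
[Quick Read]: This paper addresses a bottleneck of frontier AI models on complex mathematical problems: the knowledge is available but is not applied effectively, because the models do not know how, what, or when to apply it. The key is Vibe Reasoning, a human-AI collaborative paradigm with three components: generic meta-prompts, agentic grounding, and model orchestration. With it, the authors solve IMO 2025 Problem 6, a combinatorial optimization problem on which autonomous AI systems had publicly reported failures, by combining GPT-5's exploratory capabilities with Gemini 3 Pro's proof strengths and using Python code execution and file-based memory. The result is both the correct answer (2112) and a rigorous mathematical proof. The findings suggest that lightweight human guidance can unlock the mathematical reasoning potential of large models; the authors note that broader evaluation of the method's generality is still in progress.
Link: https://arxiv.org/abs/2512.19287
Authors: Jiaao Wu, Xian Zhang, Fan Yang, Yinpeng Dong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 13 figures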
Abstract:We introduce Vibe Reasoning, a human-AI collaborative paradigm for solving complex mathematical problems. Our key insight is that frontier AI models already possess the knowledge required to solve challenging problems – they simply do not know how, what, or when to apply it. Vibe Reasoning transforms AI’s latent potential into manifested capability through generic meta-prompts, agentic grounding, and model orchestration. We demonstrate this paradigm through IMO 2025 Problem 6, a combinatorial optimization problem where autonomous AI systems publicly reported failures. Our solution combined GPT-5’s exploratory capabilities with Gemini 3 Pro’s proof strengths, leveraging agentic workflows with Python code execution and file-based memory, to derive both the correct answer (2112) and a rigorous mathematical proof. Through iterative refinement across multiple attempts, we discovered the necessity of agentic grounding and model orchestration, while human prompts evolved from problem-specific hints to generic, transferable meta-prompts. We analyze why capable AI fails autonomously, how each component addresses specific failure modes, and extract principles for effective vibe reasoning. Our findings suggest that lightweight human guidance can unlock frontier models’ mathematical reasoning potential. This is ongoing work; we are developing automated frameworks and conducting broader evaluations to further validate Vibe Reasoning’s generality and effectiveness.
[AI-31] Digital Twin-Driven Zero-Shot Fault Diagnosis of Axial Piston Pumps Using Fluid-Borne Noise Signals
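[Quick Read]: This paper addresses fault diagnosis for axial piston pumps in fluid power systems, particularly when labeled fault data is scarce: traditional data-driven methods then become impractical, while model-based methods are limited by parameter uncertainty. The key is a digital twin (DT)-driven zero-shot fault diagnosis framework that calibrates a high-fidelity DT model using only healthy-state data, generates synthetic fault signals to train deep learning classifiers, and uses a physics-informed neural network (PINN) as a virtual sensor to estimate flow ripple. Gradient-weighted class activation mapping (Grad-CAM) visualizations show that large convolution kernels matching the subsequence length for time-domain inputs, and small kernels for time-frequency inputs, focus the network on physically meaningful features and raise diagnostic accuracy. In experiments, classifiers trained on signals from the calibrated DT model exceed 95% diagnostic accuracy on real-world benchmarks, supporting the framework's effectiveness in data-scarce scenarios.
Link: https://arxiv.org/abs/2512.19280
Authors: Chang Dong, Jianfeng Tao, Chengliang Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: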
Abstract:Axial piston pumps are crucial components in fluid power systems, where reliable fault diagnosis is essential for ensuring operational safety and efficiency. Traditional data-driven methods require extensive labeled fault data, which is often impractical to obtain, while model-based approaches suffer from parameter uncertainties. This paper proposes a digital twin (DT)-driven zero-shot fault diagnosis framework utilizing fluid-borne noise (FBN) signals. The framework calibrates a high-fidelity DT model using only healthy-state data, generates synthetic fault signals for training deep learning classifiers, and employs a physics-informed neural network (PINN) as a virtual sensor for flow ripple estimation. Gradient-weighted class activation mapping (Grad-CAM) is integrated to visualize the decision-making process of neural networks, revealing that large kernels matching the subsequence length in time-domain inputs and small kernels in time-frequency domain inputs enable higher diagnostic accuracy by focusing on physically meaningful features. Experimental validations demonstrate that training on signals from the calibrated DT model yields diagnostic accuracies exceeding 95% on real-world benchmarks, while uncalibrated models result in significantly lower performance, highlighting the framework’s effectiveness in data-scarce scenarios.
[AI-32] DeliveryBench: Can Agents Earn Profit in Real World?
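[Quick Read]: This paper addresses a gap in how large language models (LLMs) and vision-language models (VLMs) are evaluated as embodied agents: existing benchmarks mostly cover short-term tasks and fail to capture the rich constraints that shape real-world decision making. The key to the solution is DeliveryBench, a city-scale embodied benchmark grounded in the real profession of food delivery. Procedurally generated 3D cities provide diverse road networks, building layouts, functional locations, transportation modes, and realistic resource dynamics such as delivery deadlines, transportation costs, and vehicle battery, allowing systematic evaluation of constraint-aware planning toward long-horizon objectives. The design brings evaluation closer to real decision making and shows that current VLM-based agents are short-sighted in long-horizon planning and frequently violate commonsense constraints.
Link: https://arxiv.org/abs/2512.19234
Authors: Lingjun Mao, Jiawei Ren, Kun Zhou, Jixuan Chen, Ziqiao Ma, Lianhui Qin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: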
Abstract:LLMs and VLMs are increasingly deployed as embodied agents, yet existing benchmarks largely revolve around simple short-term tasks and struggle to capture rich realistic constraints that shape real-world decision making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark grounded in the real-world profession of food delivery. Food couriers naturally operate under long-horizon objectives (maximizing net profit over hours) while managing diverse constraints, e.g., delivery deadline, transportation expense, vehicle battery, and necessary interactions with other couriers and customers. DeliveryBench instantiates this setting in procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics, enabling systematic evaluation of constraint-aware, long-horizon planning. We benchmark a range of VLM-based agents across nine cities and compare them with human players. Our results reveal a substantial performance gap to humans, and find that these agents are short-sighted and frequently break basic commonsense constraints. Additionally, we observe distinct personalities across models (e.g., adventurous GPT-5 vs. conservative Claude), highlighting both the brittleness and the diversity of current VLM-based embodied agents in realistic, constraint-dense environments. Our code, data, and benchmark are available at this https URL.
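[AI-33] Generation of Programmatic Rules for Document Forgery Detection Using Large Language Models ICMLA
[Quick Read]: This paper addresses the slow, manual production of rule-based plausibility checks for document forgery detection: today these checks are written by hand, which is time-consuming and hard to scale. The key is to fine-tune large language models (LLMs) on domain-specific code and data so that they generate executable, effective rule-based verification procedures automatically, making forgery detection more scalable. Experiments on constrained hardware show that fine-tuned Llama 3.1 8B and OpenCoder 8B can generate verification logic that is effective on previously unseen forgery patterns, indicating the potential of LLMs as comprehensible support tools in security-sensitive settings.
Link: https://arxiv.org/abs/2512.19228
Authors: Valentin Schmidberger, Manuel Eberhardinger, Setareh Maghsudi, Johannes Maucher
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ICMLA 2025, the first two authors contributed equally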
Abstract:Document forgery poses a growing threat to legal, economic, and governmental processes, requiring increasingly sophisticated verification mechanisms. One approach involves the use of plausibility checks, rule-based procedures that assess the correctness and internal consistency of data, to detect anomalies or signs of manipulation. Although these verification procedures are essential for ensuring data integrity, existing plausibility checks are manually implemented by software engineers, which is time-consuming. Recent advances in code generation with large language models (LLMs) offer new potential for automating and scaling the generation of these checks. However, adapting LLMs to the specific requirements of an unknown domain remains a significant challenge. This work investigates the extent to which LLMs, adapted on domain-specific code and data through different fine-tuning strategies, can generate rule-based plausibility checks for forgery detection on constrained hardware resources. We fine-tune open-source LLMs, Llama 3.1 8B and OpenCoder 8B, on structured datasets derived from real-world application scenarios and evaluate the generated plausibility checks on previously unseen forgery patterns. The results demonstrate that the models are capable of generating executable and effective verification procedures. This also highlights the potential of LLMs as scalable tools to support human decision-making in security-sensitive contexts where comprehensibility is required.
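To illustrate what a rule-based plausibility check looks like in code, here is a small hand-written example for an invoice-like record. The document schema (`items`, `total_cents`, `issue_date`, `due_date`, `checked_on`) and the three rules are hypothetical and are not taken from the paper's datasets; they only show the kind of program an LLM would be asked to generate.

```python
from datetime import date


def check_invoice(doc):
    """Return a list of human-readable rule violations (empty means plausible)."""
    problems = []
    # Rule 1: the stated total must equal the sum of the line items.
    line_sum = sum(item["qty"] * item["unit_price_cents"] for item in doc["items"])
    if line_sum != doc["total_cents"]:
        problems.append(f"total {doc['total_cents']} != sum of line items {line_sum}")
    # Rule 2: a payment cannot be due before the invoice was issued.
    if doc["due_date"] < doc["issue_date"]:
        problems.append("due date precedes issue date")
    # Rule 3: the invoice cannot be issued after the day it is checked.
    if doc["issue_date"] > doc["checked_on"]:
        problems.append("issue date lies in the future")
    return problems
```

Amounts are kept in integer cents so that the equality test in the first rule is exact. Returning messages rather than a bare boolean keeps each decision comprehensible to a human reviewer.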
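[AI-34] Observer Not Player: Simulating Theory of Mind in LLMs through Game Observation NEURIPS
[Quick Read]: This paper asks how to evaluate systematically whether large language models (LLMs) show mind-like reasoning in simple but strategically rich sequential games, rather than relying on memorization or pattern matching. The key is an interactive evaluation framework that places the LLM in the role of an Observer, which must identify the strategies being played and explain its reasoning. Three signals (Cross-Entropy, Brier score, and Expected Value discrepancy) are integrated into a unified Union Loss, and a Strategy Identification Rate (SIR) measures whether the model identifies the latent strategies stably. Using Rock-Paper-Scissors as the running example, the framework provides a benchmark of static and lightweight dynamic strategies and supports real-time adjustment of model distributions, visualization of how losses evolve, and inspection of reasoning snippets, giving a quantifiable, transparent, and reproducible evaluation of mind-like reasoning in LLMs.
Link: https://arxiv.org/abs/2512.19210
Authors: Jerry Wang, Ting Yiu Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS Workshop on Foundations of Reasoning in Language Models and Workshop on Bridging Language, Agent, and World Model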
Abstract:We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine “understanding” in a simple yet strategic environment. As a running example, we focus on Rock-Paper-Scissors (RPS), which, despite its apparent simplicity, requires sequential reasoning, adaptation, and strategy recognition. Our system positions the LLM as an Observer whose task is to identify which strategies are being played and to articulate the reasoning behind this judgment. The purpose is not to test knowledge of Rock-Paper-Scissors itself, but to probe whether the model can exhibit mind-like reasoning about sequential behavior. To support systematic evaluation, we provide a benchmark consisting of both static strategies and lightweight dynamic strategies specified by well-prompted rules. We quantify alignment between the Observer’s predictions and the ground-truth distributions induced by actual strategy pairs using three complementary signals: Cross-Entropy, Brier score, and Expected Value (EV) discrepancy. These metrics are further integrated into a unified score, the Union Loss, which balances calibration, sensitivity, and payoff alignment. Together with a Strategy Identification Rate (SIR) metric, our framework captures not only predictive accuracy but also whether the model can stably identify the latent strategies in play. The demo emphasizes interactivity, transparency, and reproducibility. Users can adjust LLM distributions in real time, visualize losses as they evolve, and directly inspect reasoning snippets to identify where and why failures occur. In doing so, our system provides a practical and interpretable proxy for mind-like inference in sequential games, offering insights into both the strengths and limitations of current LLM reasoning.
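The three signals named in the abstract can be written down directly for a pair of distributions. The sketch below assumes both distributions are over round outcomes (win, draw, loss) with payoffs +1, 0, and -1, and it compares the Observer's prediction against a ground-truth distribution rather than against single sampled outcomes. How the paper weights these terms into the Union Loss is not stated in the abstract, so no combined score is shown.

```python
import math


def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), with q clipped away from zero."""
    return -sum(p * math.log(max(q, eps)) for p, q in zip(p_true, q_pred))


def brier(p_true, q_pred):
    """Squared distance between the two probability vectors."""
    return sum((p - q) ** 2 for p, q in zip(p_true, q_pred))


def ev_discrepancy(p_true, q_pred, payoffs=(1.0, 0.0, -1.0)):
    """Absolute gap in expected payoff over (win, draw, loss) outcomes."""
    def ev(dist):
        return sum(d * u for d, u in zip(dist, payoffs))
    return abs(ev(p_true) - ev(q_pred))
```

Cross-entropy penalizes confident mistakes most heavily, the Brier term is bounded and easier to read, and the EV term checks whether the prediction gets the payoff consequences right even when individual probabilities are off.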
[AI-35] MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
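[Quick Read]: This paper addresses the high memory footprint and latency that the large Key-Value (KV) cache causes for large language models (LLMs) during long Chain-of-Thought (CoT) reasoning. Existing low-bit quantization methods degrade noticeably on complex reasoning tasks, mainly because fixed-precision quantization struggles with outlier channels in the key cache, while mixed-precision strategies fail to identify accurately which components need high precision. The key to the solution is MixKVQ, whose core is a lightweight, query-aware algorithm that identifies the key channels that matter for the query and keeps them at higher precision, while the value cache is quantized per token. This cuts memory overhead substantially while keeping reasoning performance close to the full-precision baseline.
Link: https://arxiv.org/abs/2512.19206
Authors: Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: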
Abstract:Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel’s intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization for value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.
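The per-token quantization applied to the value cache can be sketched as asymmetric uniform quantization with one scale and offset per token row. This is a generic illustration with plain Python lists and hypothetical function names, not MixKVQ's implementation, and the query-aware selection of key channels is not shown because the abstract does not specify the scoring rule.

```python
def quantize_per_token(rows, bits=2):
    """Asymmetric uniform quantization with one (scale, offset) pair per token row."""
    levels = (1 << bits) - 1
    out = []
    for row in rows:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / levels or 1.0  # constant rows: any nonzero scale works
        codes = [round((v - lo) / scale) for v in row]
        out.append((codes, scale, lo))
    return out


def dequantize(quantized):
    """Map integer codes back to approximate floats."""
    return [[c * scale + lo for c in codes] for codes, scale, lo in quantized]
```

Each value is recovered to within half a quantization step of its row. Because the step is set by the row's own range, a single outlier token does not inflate the error of the other tokens, which is the usual argument for per-token scaling.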
[AI-36] On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning
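[Quick Read]: This paper addresses the theoretical analysis of generalization in multitask deep neural networks, in particular how to obtain tighter generalization bounds that do not depend on network width. The key is to apply operator-theoretic techniques, exploiting small condition numbers of the weight matrices to tighten the bound and constructing a tailored Sobolev space as an expanded hypothesis space. The resulting bound remains valid in single-output settings, where it improves on existing Koopman-based bounds, and gives a more precise theoretical account of multitask deep learning.
Link: https://arxiv.org/abs/2512.19199
Authors: Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in Lecture Notes in Computer Science (LNCS). Final version forthcoming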
Abstract:The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a tighter bound than those derived from conventional norm based methods by leveraging small condition numbers in the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space. This enhanced bound remains valid even in single output settings, outperforming existing Koopman based bounds. The resulting framework maintains key advantages such as flexibility and independence from network width, offering a more precise theoretical understanding of multitask deep learning in the context of kernel methods.
[AI-37] Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning
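[Quick Read]: This paper addresses the analysis of generalization for vector-valued neural networks and deep kernel methods in multi-task learning, with a focus on obtaining tighter generalization bounds for complex architectures. The key is an operator-theoretic framework that combines a Koopman-based approach with existing techniques to improve on traditional norm-based bounds. To reduce the computational cost of Koopman-based methods, the paper introduces sketching techniques for vector-valued neural networks, which yield excess risk bounds under generic Lipschitz losses. It also proposes a deep vector-valued reproducing kernel Hilbert space (vvRKHS) framework that uses Perron-Frobenius (PF) operators to enhance deep kernel methods, and derives a new Rademacher generalization bound that explicitly accounts for underfitting and overfitting, giving theoretical guarantees for the generalization of multi-task deep learning models.
Link: https://arxiv.org/abs/2512.19184
Authors: Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in Lecture Notes in Computer Science (LNCS). Final version forthcoming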
Abstract:This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multitask learning with deep learning architectures, an area that has been relatively unexplored until recent developments.
[AI-38] Practical Quantum-Classical Feature Fusion for complex data Classification
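[Quick read]: This paper addresses the performance bottleneck in current hybrid quantum-classical learning architectures, where the quantum circuit is treated in isolation as a feature extractor and its outputs are merged with classical features by simple concatenation. This limits performance on complex, high-dimensional tabular and semi-structured data such as remote sensing, environmental monitoring, and medical diagnostics. The key to its solution is a multimodal fusion framework that uses cross-attention to let quantum feature tokens and classical representations interact: the classical features act as the query that drives a weighted aggregation of the quantum-derived feature tokens, and a residual connection improves model stability. This design integrates quantum information into the classical network in a more structured way while staying within NISQ resource budgets, improving performance on the more complex datasets in most cases.
Link: https://arxiv.org/abs/2512.19180
Authors: Azadeh Alavi,Fatemeh Kouchmeshki,Abdolrahman Alavi
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 figures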
Abstract:Hybrid quantum and classical learning aims to couple quantum feature maps with the robustness of classical neural networks, yet most architectures treat the quantum circuit as an isolated feature extractor and merge its measurements with classical representations by direct concatenation. This neglects that the quantum and classical branches constitute distinct computational modalities and limits reliable performance on complex, high-dimensional tabular and semi-structured data, including remote sensing, environmental monitoring, and medical diagnostics. We present a multimodal formulation of hybrid learning and propose a cross-attention mid-fusion architecture in which a classical representation queries quantum-derived feature tokens through an attention block with residual connectivity. The quantum branch is kept within practical NISQ budgets and uses up to nine qubits. We evaluate on Wine, Breast Cancer, Forest CoverType, FashionMNIST, and SteelPlatesFaults, comparing a quantum-only model, a classical baseline, residual hybrid models, and the proposed mid-fusion model under a consistent protocol. Pure quantum and standard hybrid designs underperform due to measurement-induced information loss, while cross-attention mid fusion is consistently competitive and improves performance on the more complex datasets in most cases. These findings suggest that quantum-derived information becomes most valuable when integrated through principled multimodal fusion rather than used in isolation or loosely appended to classical features.
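The fusion step described in the abstract (a classical representation querying quantum-derived tokens, with a residual connection) can be illustrated with a minimal single-head sketch. This is not the authors' architecture: the function name `cross_attention_fuse` is hypothetical, the learned query/key/value projections are omitted (treated as identity), and the quantum tokens are plain lists of floats standing in for circuit measurements.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(classical, quantum_tokens):
    """Fuse one classical vector with several quantum-derived tokens.

    Assumption: identity projections, one attention head, and tokens that
    share the classical vector's dimension d.
    """
    d = len(classical)
    # Scaled dot-product scores: the classical vector is the query,
    # each quantum token serves as both key and value.
    scores = [
        sum(q * k for q, k in zip(classical, token)) / math.sqrt(d)
        for token in quantum_tokens
    ]
    weights = softmax(scores)
    # Weighted aggregation of the quantum tokens.
    attended = [
        sum(w * token[i] for w, token in zip(weights, quantum_tokens))
        for i in range(d)
    ]
    # Residual connection: the classical signal is kept and the
    # attended quantum information is added on top of it.
    fused = [c + a for c, a in zip(classical, attended)]
    return fused, weights


fused, weights = cross_attention_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(fused, weights)
```

Compared with direct concatenation, the attention weights let the classical branch decide how much each quantum token contributes, and the residual path means the classical features survive even when the quantum tokens carry little information.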
[AI-39] Vision-Language-Policy Model for Dynamic Robot Task Planning
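[Quick read]: This paper addresses the gap between natural language commands and autonomous execution for robots in unstructured environments: how a robot can perceive and reason over multiple modalities to plan behaviors that achieve its goals, and how it can adjust its strategy when instructions change during execution. The key to its solution is a language-model-based framework for dynamic task planning, the Vision-Language-Policy (VLP) model. Built on a vision-language model fine-tuned on real-world data, it interprets semantic instructions, integrates reasoning over the current task scene, and generates behavior policies that control the robot to complete the task. It can also adjust its strategy as the task changes, allowing flexible adaptation to evolving task requirements.
Link: https://arxiv.org/abs/2512.19178
Authors: Jin Wang,Kim Tien Ly,Jacques Cloete,Nikos Tsagarakis,Ioannis Havoutis
Institution: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Manuscript under review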
Abstract:Bridging the gap between natural language commands and autonomous execution in unstructured environments remains an open challenge for robotics. This requires robots to perceive and reason over the current task scene through multiple modalities, and to plan their behaviors to achieve their intended goals. Traditional robotic task-planning approaches often struggle to bridge low-level execution with high-level task reasoning, and cannot dynamically update task strategies when instructions change during execution, which ultimately limits their versatility and adaptability to new tasks. In this work, we propose a novel language model-based framework for dynamic robot task planning. Our Vision-Language-Policy (VLP) model, based on a vision-language model fine-tuned on real-world data, can interpret semantic instructions and integrate reasoning over the current task scene to generate behavior policies that control the robot to accomplish the task. Moreover, it can dynamically adjust the task strategy in response to changes in the task, enabling flexible adaptation to evolving task requirements. Experiments conducted with different robots and a variety of real-world tasks show that the trained model can efficiently adapt to novel scenarios and dynamically update its policy, demonstrating strong planning autonomy and cross-embodiment generalization. Videos: this https URL
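[AI-40] Can We Test Consciousness Theories on AI? Ablations, Markers, and Robustness
[Quick read]: This paper addresses the dispute over reliable indicators of consciousness: the leading theories (Global Workspace Theory, GWT; Integrated Information Theory, IIT; and Higher-Order Theories, HOT) lack a shared empirical testing framework, so their proposed neural signatures compete rather than complement one another. The key to its solution is a synthetic neuro-phenomenology approach: build artificial agents that embody each theory's mechanisms, then test the functional consequences through precise architectural ablations. The study finds that the theories describe different functional layers: GWT provides broadcast capacity, HOT provides quality control, and the two are complementary rather than mutually exclusive. It also reports that an IIT-adjacent proxy (raw perturbational complexity, PCI-A) is unreliable in engineered systems, giving an operational and discriminating experimental route for testing consciousness theories.
Link: https://arxiv.org/abs/2512.19155
Authors: Yin Jun Phua
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: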
Abstract:The search for reliable indicators of consciousness has fragmented into competing theoretical camps (Global Workspace Theory (GWT), Integrated Information Theory (IIT), and Higher-Order Theories (HOT)), each proposing distinct neural signatures. We adopt a synthetic neuro-phenomenology approach: constructing artificial agents that embody these mechanisms to test their functional consequences through precise architectural ablations impossible in biological systems. Across three experiments, we report dissociations suggesting these theories describe complementary functional layers rather than competing accounts. In Experiment 1, a no-rewire Self-Model lesion abolishes metacognitive calibration while preserving first-order task performance, yielding a synthetic blindsight analogue consistent with HOT predictions. In Experiment 2, workspace capacity proves causally necessary for information access: a complete workspace lesion produces qualitative collapse in access-related markers, while partial reductions show graded degradation, consistent with GWT’s ignition framework. In Experiment 3, we uncover a broadcast-amplification effect: GWT-style broadcasting amplifies internal noise, creating extreme fragility. The B2 agent family is robust to the same latent perturbation; this robustness persists in a Self-Model-off / workspace-read control, cautioning against attributing the effect solely to $z_{\text{self}}$ compression. We also report an explicit negative result: raw perturbational complexity (PCI-A) decreases under the workspace bottleneck, cautioning against naive transfer of IIT-adjacent proxies to engineered agents. These results suggest a hierarchical design principle: GWT provides broadcast capacity, while HOT provides quality control. We emphasize that our agents are not conscious; they are reference implementations for testing functional predictions of consciousness theories.
[AI-41] Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments
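[Quick read]: This paper addresses the excessive compute and memory cost of deploying computationally limited agents in environments with highly non-Markovian dependencies. Traditional approaches such as Frame Stacking must enlarge the window as the degree of non-Markovian dependence grows, leading to prohibitive resource use. The key to its solution is a meta-algorithm, Adaptive Stacking, which maintains a small, adaptive memory stack and keeps only the observations that are predictive of future rewards. This substantially reduces compute and memory requirements while preserving learning performance, and comes with convergence guarantees; the paper quantifies the savings for MLP-, LSTM-, and Transformer-based agents.
Link: https://arxiv.org/abs/2512.19154
Authors: Geraud Nangue Tasse,Matthew Riemer,Benjamin Rosman,Tim Klinger
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: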
Abstract:Recent success in developing increasingly general purpose agents based on sequence models has led to increased focus on the problem of deploying computationally limited agents within the vastly more complex real-world. A key challenge experienced in these more realistic domains is highly non-Markovian dependencies with respect to the agent’s observations, which are less common in small controlled domains. The predominant approach for dealing with this in the literature is to stack together a window of the most recent observations (Frame Stacking), but this window size must grow with the degree of non-Markovian dependencies, which results in prohibitive computational and memory requirements for both action inference and learning. In this paper, we are motivated by the insight that in many environments that are highly non-Markovian with respect to time, the environment only causally depends on a relatively small number of observations over that time-scale. A natural direction would then be to consider meta-algorithms that maintain relatively small adaptive stacks of memories such that it is possible to express highly non-Markovian dependencies with respect to time while considering fewer observations at each step and thus experience substantial savings in both compute and memory requirements. Hence, we propose a meta-algorithm (Adaptive Stacking) for achieving exactly that with convergence guarantees and quantify the reduced computation and memory constraints for MLP, LSTM, and Transformer-based agents. Our experiments utilize popular memory tasks, which give us control over the degree of non-Markovian dependencies. This allows us to demonstrate that an appropriate meta-algorithm can learn the removal of memories not predictive of future rewards without excessive removal of important experiences. Code: this https URL
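The contrast between a sliding window and an adaptive memory stack can be made concrete with a toy sketch. It is only an illustration of the idea: in the paper the choice of which memory to remove is learned, whereas here the hypothetical `keep_score` function is hand-written, and the observation strings are invented for the example.

```python
from collections import deque


def frame_stack(observations, k):
    """Sliding window baseline: always keep the k most recent observations."""
    window = deque(maxlen=k)
    for obs in observations:
        window.append(obs)
    return list(window)


def adaptive_stack(observations, k, keep_score):
    """Keep at most k observations, evicting the lowest-scoring one.

    Assumption: keep_score is a stand-in for a learned eviction policy.
    Ties are broken in favour of evicting the oldest entry.
    """
    stack = []
    for obs in observations:
        stack.append(obs)
        if len(stack) > k:
            drop = min(range(len(stack)), key=lambda i: keep_score(stack[i]))
            del stack[drop]
    return stack


# A cue appears once, followed by several uninformative corridor steps.
episode = ["cue:left"] + ["corridor"] * 5
score = lambda obs: 1.0 if obs.startswith("cue") else 0.0

print(frame_stack(episode, k=2))
print(adaptive_stack(episode, k=2, keep_score=score))
```

With the same budget of two slots, the sliding window has forgotten the cue by the end of the corridor, while the adaptive stack still holds it. Retaining the cue with a sliding window would require the window to span the whole corridor, which is the growth in cost the abstract describes.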
[AI-42] Understanding Chain-of-Thought in Large Language Models via Topological Data Analysis
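[Quick read]: The problem this paper addresses is that, although long reasoning chains markedly improve how Large Language Models (LLMs) solve complex problems, it is unclear why different chains perform differently, and existing work evaluates chain quality mostly from a functional perspective without examining structural mechanisms. To fill this gap, the paper proposes a method based on Topological Data Analysis (TDA): it uses persistent homology to map reasoning steps into a semantic space and extract topological features that quantify structural properties of a chain such as connectivity, redundancy, and logical breaks. The key to its solution is computing homology groups and using barcodes and persistence diagrams to expose quality differences at the structural level. It reports that the topological complexity of a reasoning chain correlates positively with accuracy, and that successful reasoning tends to have simpler topology, with less redundancy and fewer cycles, which improves efficiency and interpretability.
Link: https://arxiv.org/abs/2512.19135
Authors: Chenghao Li,Chaoning Zhang,Yi Lu,Shuxu Chen,Xudong Wang,Jiaquan Zhang,Zhicheng Wang,Zhengxun Jin,Kuien Liu,Sung-Ho Bae,Guoqing Wang,Yang Yang,Hen Tao Shen
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: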
Abstract:With the development of large language models (LLMs), particularly with the introduction of the long reasoning chain technique, the reasoning ability of LLMs in complex problem-solving has been significantly enhanced. While acknowledging the power of long reasoning chains, we cannot help but wonder: Why do different reasoning chains perform differently in reasoning? What components of the reasoning chains play a key role? Existing studies mainly focus on evaluating reasoning chains from a functional perspective, with little attention paid to their structural mechanisms. To address this gap, this work is the first to analyze and evaluate the quality of the reasoning chain from a structural perspective. We apply persistent homology from Topological Data Analysis (TDA) to map reasoning steps into semantic space, extract topological features, and analyze structural changes. These changes reveal semantic coherence, logical redundancy, and identify logical breaks and gaps. By calculating homology groups, we assess connectivity and redundancy at various scales, using barcode and persistence diagrams to quantify stability and consistency. Our results show that the topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies, reducing redundancy and cycles, enhancing efficiency and interpretability. This work provides a new perspective on reasoning chain quality assessment and offers guidance for future optimization.
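As a rough illustration of what a persistence barcode measures, the sketch below computes the 0-dimensional barcode (connected components) of a small point cloud under a Vietoris-Rips filtration. It relies on the standard fact that the finite death times in dimension 0 are the edge lengths of a minimum spanning tree. The points stand in for embeddings of reasoning steps; the embedding model and the paper's actual pipeline (including higher-dimensional homology for cycles) are not reproduced here.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return (birth, death) bars for connected components.

    Every point is born at scale 0. Two components merge, and one bar
    dies, when the scale reaches the length of the shortest edge that
    joins them (Kruskal's algorithm with union-find).
    """
    n = len(points)
    if n == 0:
        return []
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    # One component never dies: it represents the whole cloud.
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]


# Two nearby "steps" and one far away: the long finite bar marks the gap.
print(h0_barcode([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]))
```

In this toy reading, a long finite bar corresponds to a large jump between groups of steps in embedding space, which is one way a "logical break" could show up; capturing redundancy and cycles would need 1-dimensional homology, typically computed with a dedicated TDA library.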
[AI-43] HyperLoad: A Cross-Modality Enhanced Large Language Model-Based Framework for Green Data Center Cooling Load Prediction
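[Quick read]: This paper addresses the difficulty of accurate load forecasting in green data centers under small-sample conditions (cold start, load distortion, multi-source data fragmentation, and distribution shift), which hampers sub-minute orchestration of renewables, storage, and loads as well as the minimization of PUE (power usage effectiveness) and lifecycle carbon intensity. The key to its solution is the HyperLoad framework, with two main components: 1) a Cross-Modality Knowledge Alignment phase that maps textual priors and time-series data into a shared latent space to make the most of prior knowledge; and 2) a Multi-Scale Feature Modeling phase that injects domain-aligned priors through adaptive prefix-tuning for rapid scenario adaptation and uses an Enhanced Global Interaction Attention mechanism to capture cross-device temporal dependencies, improving prediction under data scarcity.
Link: https://arxiv.org/abs/2512.19114
Authors: Haoyu Jiang,Boan Qu,Junjie Zhu,Fanjie Zeng,Xiaojie Lin,Wei Zhong
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: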
Abstract:The rapid growth of artificial intelligence is exponentially escalating computational demand, inflating data center energy use and carbon emissions, and spurring rapid deployment of green data centers to relieve resource and environmental stress. Achieving sub-minute orchestration of renewables, storage, and loads, while minimizing PUE and lifecycle carbon intensity, hinges on accurate load forecasting. However, existing methods struggle to address small-sample scenarios caused by cold start, load distortion, multi-source data fragmentation, and distribution shifts in green data centers. We introduce HyperLoad, a cross-modality framework that exploits pre-trained large language models (LLMs) to overcome data scarcity. In the Cross-Modality Knowledge Alignment phase, textual priors and time-series data are mapped to a common latent space, maximizing the utility of prior knowledge. In the Multi-Scale Feature Modeling phase, domain-aligned priors are injected through adaptive prefix-tuning, enabling rapid scenario adaptation, while an Enhanced Global Interaction Attention mechanism captures cross-device temporal dependencies. The public DCData dataset is released for benchmarking. Under both data sufficient and data scarce settings, HyperLoad consistently surpasses state-of-the-art (SOTA) baselines, demonstrating its practicality for sustainable green data center management.
[AI-44] FC-MIR: A Mobile Screen Awareness Framework for Intent-Aware Recommendation based on Frame-Compressed Multimodal Trajectory Reasoning
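[Quick read]: This paper addresses the identification of user intent from mobile user interface (UI) operation trajectories, in support of UI understanding and task automation agents. Multimodal Large Language Models (MLLMs) perform well on video understanding, but their real-time deployment on mobile devices is limited by high computational cost and inefficient processing of redundant frames. The key to its solution is the FC-MIR framework: keyframe sampling and adaptive concatenation reduce visual redundancy and improve inference efficiency, while state-of-the-art closed-source MLLMs or fine-tuned models (e.g., Qwen3-VL) handle trajectory summarization and intent prediction. The paper further explores generating post-prediction operations and search suggestions, and introduces a fine-grained metric to assess practical utility. Experiments show that the compression method retains performance at 50%-60% compression rates and that MLLMs summarize intent well, which supports potential lightweight on-device deployment, though they still struggle to produce suggestions that are both useful and "surprising".
Link: https://arxiv.org/abs/2512.19107
Authors: Zhe Yang,Xiaoshuang Sheng,Zhengnan Zhang,Jidong Wu,Zexing Wang,Xin He,Shenghua Xu,Guanjing Xiong
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: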
Abstract:Identifying user intent from mobile UI operation trajectories is critical for advancing UI understanding and enabling task automation agents. While Multimodal Large Language Models (MLLMs) excel at video understanding tasks, their real-time mobile deployment is constrained by heavy computational costs and inefficient redundant frame processing. To address these issues, we propose the FC-MIR framework: leveraging keyframe sampling and adaptive concatenation, it cuts visual redundancy to boost inference efficiency, while integrating state-of-the-art closed-source MLLMs or fine-tuned models (e.g., Qwen3-VL) for trajectory summarization and intent prediction. We further expand task scope to explore generating post-prediction operations and search suggestions, and introduce a fine-grained metric to evaluate the practical utility of summaries, predictions, and suggestions. For rigorous assessment, we construct a UI trajectory dataset covering scenarios from UI-Agents (Agent-I) and real user interactions (Person-I). Experimental results show our compression method retains performance at 50%-60% compression rates; both closed-source and fine-tuned MLLMs demonstrate strong intent summarization, supporting potential lightweight on-device deployment. However, MLLMs still struggle with useful and “surprising” suggestions, leaving room for improvement. Finally, we deploy the framework in a real-world setting, integrating UI perception and UI-Agent proxies to lay a foundation for future progress in this field.
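To give a feel for how keyframe sampling removes redundant frames from a UI trajectory, here is a minimal sketch based on frame differencing. The abstract does not specify the sampling rule, so the mean-absolute-difference criterion, the `threshold` parameter, and the representation of frames as flat lists of pixel intensities are all assumptions made for illustration; the adaptive concatenation step is not covered.

```python
def select_keyframes(frames, threshold):
    """Return indices of frames that differ enough from the last kept frame.

    Assumption: each frame is a flat list of pixel intensities of equal
    length, and "different enough" means the mean absolute pixel
    difference exceeds the threshold.
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame as a reference
    for i in range(1, len(frames)):
        reference = frames[kept[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], reference)) / len(reference)
        if diff > threshold:
            kept.append(i)
    return kept


def compression_rate(total, kept):
    """Fraction of frames removed."""
    return 1.0 - kept / total


# Five tiny 2-pixel "screens": two static, a screen change, a small
# flicker, then a return to the first screen.
frames = [[0, 0], [0, 0], [10, 10], [10, 11], [0, 0]]
kept = select_keyframes(frames, threshold=1.0)
print(kept, compression_rate(len(frames), len(kept)))
```

Comparing each frame against the last kept frame, rather than the immediately preceding one, avoids losing slow gradual changes that never exceed the threshold between consecutive frames.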
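[AI-45] DIVER-1: Deep Integration of Vast Electrophysiological Recordings at Scale
[Quick read]: This paper addresses the limited scale and inefficient training of current foundation models for electroencephalography (EEG) and intracranial EEG (iEEG), and in particular how to reach the best performance under limited data and compute. Its core solution is the DIVER-1 family of foundation models, trained on the largest and most diverse corpus to date (5.3k hours of iEEG and 54k hours of EEG) and scaled up to 1.82B parameters. Key contributions include the first systematic analysis showing that this domain follows data-constrained scaling laws: for fixed data and compute, smaller models trained for more epochs outperform larger models trained briefly. The paper also introduces architectural changes including any-variate attention, sliding temporal conditional positional encoding, and multi-domain reconstruction. The resulting iEEG and EEG models each reach state-of-the-art performance on their respective benchmarks, offering concrete guidance for resource allocation and model development.
Link: https://arxiv.org/abs/2512.19097
Authors: Danny Dongyeop Han,Yonghyeon Gwon,Ahhyun Lucy Lee,Taeyang Lee,Seong Jin Lee,Jubin Choi,Sebin Lee,Jihyun Bang,Seungju Lee,David Keetae Park,Shinjae Yoo,Chun Kee Chung,Jiook Cha
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 47 pages, 13 figures, 26 tables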
Abstract:Electrophysiology signals such as EEG and iEEG are central to neuroscience, brain-computer interfaces, and clinical applications, yet existing foundation models remain limited in scale despite clear evidence that scaling improves performance. We introduce DIVER-1, a family of EEG and iEEG foundation models trained on the largest and most diverse corpus to date, 5.3k hours of iEEG and 54k hours of EEG (1.6M channel-hours from over 17.7k subjects), and scaled up to 1.82B parameters. We present the first systematic scaling law analysis for this domain, showing that these models follow data-constrained scaling laws: for a given amount of data and compute, smaller models trained for extended epochs consistently outperform larger models trained briefly. This behavior contrasts with prior electrophysiology foundation models that emphasized model size over training duration. To achieve strong performance, we also design architectural innovations including any-variate attention, sliding temporal conditional positional encoding, and multi-domain reconstruction. DIVER-1 iEEG and EEG models each achieve state-of-the-art performance on their respective benchmarks, establishing concrete guidelines for efficient scaling and resource allocation in electrophysiology foundation model development.
[AI-46] Conditioning Accept-Desirability models in the context of AGM-like belief change
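[Quick read]: This paper addresses conditioning when uncertain rewards live in a general linear space, specifically how observing an event should affect beliefs within the Accept-Desirability model framework. The central challenge is a conditioning mechanism that unifies classical and quantum probability and extends to imprecise probabilities. The key to its solution is a new conditioning rule based on the idea that observing an event introduces new indifferences between options. The paper associates a belief revision operator with this rule and examines which AGM belief revision axioms remain valid in the more general framework, showing that all of them hold in two special cases: classical propositional logic and full conditional probabilities.
Link: https://arxiv.org/abs/2512.19096
Authors: Kathelijne Coussement,Gert de Cooman,Keano De Vos
Institution: unknown
Categories: Artificial Intelligence (cs.AI); Logic (math.LO); Probability (math.PR)
Comments: 46 pages, 1 table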
Abstract:We discuss conditionalisation for Accept-Desirability models in an abstract decision-making framework, where uncertain rewards live in a general linear space, and events are special projection operators on that linear space. This abstract setting allows us to unify classical and quantum probabilities, and extend them to an imprecise probabilities context. We introduce a new conditioning rule for our Accept-Desirability models, based on the idea that observing an event introduces new indifferences between options. We associate a belief revision operator with our conditioning rule, and investigate which of the AGM axioms for belief revision still hold in our more general framework. We investigate two interesting special cases where all of these axioms are shown to still hold: classical propositional logic and full conditional probabilities.
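[AI-47] Tool-Augmented Hybrid Ensemble Reasoning with Distillation for Bilingual Mathematical Problem Solving
[Quick read]: This paper addresses the lack of a clear link between language reasoning and symbolic calculation in bilingual mathematical problem solving: large language models handle language well but compute imprecisely. The key to its solution is the HERALD framework, a hybrid reasoning scheme that combines NuminaMath-7B-TIR, GPT-4o, and Mistral-7B and uses adaptive routing, tool-based reinforcement learning, and knowledge distillation to connect different reasoning paths. Confidence calibration keeps the weighting stable and dual-path checking safeguards correctness, so that the system maintains accurate calculation while improving the fluency and stability of reasoning in multilingual settings.
Link: https://arxiv.org/abs/2512.19093
Authors: Peiqing Lu,Yuan Zhang,Haoyun Zhang,Jiasen Zheng,Kejian Tong,Wenjun Wu
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: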
Abstract:Bilingual mathematical problem solving needs a clear link between language reasoning and symbolic calculation. Large language models often handle language well but are weak in accurate computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that joins reasoning and calculation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B. HERALD uses adaptive routing, tool-based reinforcement learning, and knowledge distillation to connect different reasoning paths. Confidence calibration keeps weighting stable, and dual-path checking keeps results correct. Reinforcement learning controls tool use to cut redundancy, and distillation lowers delay without hurting accuracy. The system shows that combining symbolic checking, adaptive ensembles, and bilingual fine-tuning helps achieve both fluent reasoning and precise calculation. HERALD offers a practical solution for multilingual mathematical reasoning with better accuracy, stability, and clarity.
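[AI-48] γ(3,4) 'Attention' in Cognitive Agents: Ontology-Free Knowledge Representations With Promise Theoretic Semantics
[Quick read]: This paper addresses the lack of an effective bridge between vectorized Machine Learning and Knowledge Graph representations, in particular one that does not rely on language models. The central challenge is to retain the statistical stability of data (average invariance under repeated observation, or 'trust') while combining the probabilistic estimation that vector representations offer with the way graphs preserve the intentionality of the source, especially when data are sparse or fragmented. The key to its solution is the Semantic Spacetime γ(3,4) graph, which replaces complex ontologies with a classification of features by their roles in semantic processes, together with the use of causal boundary conditions to compress, potentially by orders of magnitude, the data needed for context determination, as required in autonomous robotics, defence deployments, and ad hoc emergency services.
Link: https://arxiv.org/abs/2512.19084
Authors: Mark Burgess
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: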
Abstract:The semantics and dynamics of 'attention' are closely related to promise theoretic notions developed for autonomous agents and can thus easily be written down in a promise framework. In this way one may establish a bridge between vectorized Machine Learning and Knowledge Graph representations without relying on language models implicitly. Our expectations for knowledge presume a degree of statistical stability, i.e. average invariance under repeated observation, or 'trust' in the data. Both learning networks and knowledge graph representations can meaningfully coexist to preserve different aspects of data. While vectorized data are useful for probabilistic estimation, graphs preserve the intentionality of the source even under data fractionation. Using a Semantic Spacetime $\gamma(3,4)$ graph, one avoids complex ontologies in favour of classification of features by their roles in semantic processes. The latter favours an approach to reasoning under conditions of uncertainty. Appropriate attention to causal boundary conditions may lead to orders of magnitude compression of data required for such context determination, as required in the contexts of autonomous robotics, defence deployments, and ad hoc emergency services.
[AI-49] Population-Evolve: a Parallel Sampling and Evolutionary Method for LLM Math Reasoning
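[Quick read]: This paper addresses limits on the reasoning ability of Large Language Models (LLMs), in particular how to improve the accuracy and stability of reasoning at test time. The key to its solution is Population-Evolve, a training-free method inspired by Genetic Algorithms: during inference it maintains a dynamic population of candidate solutions, uses an "evolve prompt" to have the LLM evolve that population at each iteration, and derives the final answer by majority voting. The paper reports high accuracy with low performance variance and good computational efficiency, and presents a unifying framework that interprets existing test-time scaling strategies through the lens of genetic algorithms.
Link: https://arxiv.org/abs/2512.19081
Authors: Yanzhi Zhang,Yitong Duan,Zhaoxi Zhang,Jiyan He,Shuxin Zheng
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: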
Abstract:Test-time scaling has emerged as a promising direction for enhancing the reasoning capabilities of Large Language Models in the last few years. In this work, we propose Population-Evolve, a training-free method inspired by Genetic Algorithms to optimize LLM reasoning. Our approach maintains a dynamic population of candidate solutions for each problem via parallel reasoning. By incorporating an evolve prompt, the LLM self-evolves its population in all iterations. Upon convergence, the final answer is derived via majority voting. Furthermore, we establish a unifying framework that interprets existing test-time scaling strategies through the lens of genetic algorithms. Empirical results demonstrate that Population-Evolve achieves superior accuracy with low performance variance and computational efficiency. Our findings highlight the potential of evolutionary strategies to unlock the reasoning power of LLMs during inference.
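The loop described in the abstract (sample a population in parallel, evolve it with a prompt, then vote) can be sketched as follows. The control flow is a guess at the simplest reading of the abstract, not the authors' implementation: `sample` and `evolve` are hypothetical callables that would wrap LLM calls, convergence is taken to mean that all candidates agree, and the stubs below exist only so the sketch runs without a model.

```python
from collections import Counter


def population_evolve(problem, sample, evolve, size=4, iterations=3):
    """Evolve a population of candidate answers and return the majority answer.

    Assumptions: sample(problem) returns one candidate answer, and
    evolve(problem, population) returns one new candidate given the
    current population (in practice, via an "evolve prompt").
    """
    population = [sample(problem) for _ in range(size)]
    for _ in range(iterations):
        if len(set(population)) == 1:
            break  # every candidate agrees: treat as converged
        population = [evolve(problem, population) for _ in range(size)]
    # Majority vote over the final population.
    return Counter(population).most_common(1)[0][0]


# Toy stand-ins for LLM calls, used only to exercise the control flow.
initial_answers = iter(["42", "41", "42", "7"])


def stub_sample(problem):
    return next(initial_answers)


def stub_evolve(problem, population):
    # A real evolve prompt would ask the model to critique and combine
    # candidates; this stub simply adopts the current consensus.
    return Counter(population).most_common(1)[0][0]


print(population_evolve("What is 6 * 7?", stub_sample, stub_evolve))
```

Because each generation's candidates are produced independently given the previous population, the `evolve` calls within one iteration can run in parallel, which matches the "parallel reasoning" framing in the abstract.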
[AI-50] Can abstract concepts from LLM improve SLM performance?
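[Quick read]: This paper addresses the difficulty of deploying Large Language Models (LLMs) on resource-constrained devices. Existing methods such as quantization, pruning, and distillation reduce memory footprint but often require extensive experimentation and careful infrastructure design. The key to its solution is to use existing techniques to extract high-level concepts from larger models, represented as steering vectors, and transfer them to Small Language Models (SLMs) at inference time. Experiments indicate that these steering vectors transfer well across model families (e.g., Phi, Llama, Qwen) and improve performance on a range of tasks. The paper also introduces inference-time scaling, which dynamically adjusts the steering intensity and yields a 7-15% accuracy improvement for Qwen3-0.6B.
Link: https://arxiv.org/abs/2512.19069
Authors: Siddharth Tandon
Institution: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: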
Abstract:Large language models (LLMs) excel at diverse tasks, but their deployment on resource-constrained devices remains challenging. Existing methods like quantization, pruning, and distillation can reduce memory footprint but often demand extensive experimentation and careful infrastructure design. Leveraging existing techniques for extracting high-level concepts (represented as steering vectors) from larger models, we investigate their transferability to smaller language models (SLMs) during inference. We demonstrate through extensive experimentation that these concepts can be effectively transferred to smaller models, irrespective of their family (e.g., Phi, Llama, Qwen), leading to performance improvements across a wide range of tasks. Furthermore, we introduce inference-time scaling to enhance performance by dynamically adjusting the steering intensity, which yields a 7-15% accuracy improvement for Qwen3-0.6B.
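For readers unfamiliar with steering vectors, the sketch below shows one common construction, the difference of mean activations between two sets of examples, and how a steering intensity scales its effect. The abstract does not say which extraction technique is used or how vectors are mapped between models with different hidden sizes, so everything here is an assumption: activations are plain lists of floats of a single shared dimension, and `best_alpha` with its `score` callback is a hypothetical stand-in for adjusting the intensity at inference time.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(concept_acts, baseline_acts):
    """Difference of means between activations with and without the concept."""
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]


def steer(hidden, vector, alpha):
    """Shift a hidden state along the steering vector with intensity alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def best_alpha(hidden, vector, alphas, score):
    """Pick the intensity whose steered state scores highest.

    Assumption: score is a hypothetical quality function; in practice it
    would be derived from model outputs rather than the raw hidden state.
    """
    return max(alphas, key=lambda a: score(steer(hidden, vector, a)))


vector = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
print(vector)
print(steer([0.0, 0.0], vector, alpha=0.5))
```

The intensity `alpha` is the knob the abstract refers to: too small and the concept has no effect, too large and the hidden state is pushed outside the range the model handles well, which is why choosing it per input can matter.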
[AI-51] Fraud Detection Through Large-Scale Graph Clustering with Heterogeneous Link Transformation
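[Quick read]: This paper addresses insufficient coverage and poor clustering in the detection of collaborative fraud, both of which arise from complex network structures. Methods that rely only on high-confidence identity links (hard links) have limited coverage, while using all links tends to produce fragmented graphs that weaken clustering. The key to its solution is a graph transformation framework: first identify connected components via hard links and merge them into super-nodes, then reconstruct a weighted soft-link graph (soft links are behavioral associations such as device fingerprints, cookies, and IP addresses) that can be embedded and clustered efficiently at scale. The method combines LINE for representation learning with HDBSCAN for density-based clustering. On a real-world payment platform dataset it shrinks the graph from 25 million to 7.7 million nodes, doubles detection coverage relative to the hard-link-only baseline, and maintains high precision, making it scalable to industrial use.
Link: https://arxiv.org/abs/2512.19061
Authors: Chi Liu
Institution: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures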
Abstract:Collaborative fraud, where multiple fraudulent accounts coordinate to exploit online payment systems, poses significant challenges due to the formation of complex network structures. Traditional detection methods that rely solely on high-confidence identity links suffer from limited coverage, while approaches using all available linkages often result in fragmented graphs with reduced clustering effectiveness. In this paper, we propose a novel graph-based fraud detection framework that addresses the challenge of large-scale heterogeneous graph clustering through a principled link transformation approach. Our method distinguishes between hard links (high-confidence identity relationships such as phone numbers, credit cards, and national IDs) and soft links (behavioral associations including device fingerprints, cookies, and IP addresses). We introduce a graph transformation technique that first identifies connected components via hard links, merges them into super-nodes, and then reconstructs a weighted soft-link graph amenable to efficient embedding and clustering. The transformed graph is processed using LINE (Large-scale Information Network Embedding) for representation learning, followed by HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) for density-based cluster discovery. Experiments on a real-world payment platform dataset demonstrate that our approach achieves significant graph size reduction (from 25 million to 7.7 million nodes), doubles the detection coverage compared to hard-link-only baselines, and maintains high precision across identified fraud clusters. Our framework provides a scalable and practical solution for industrial-scale fraud detection systems.
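The graph transformation in the abstract (hard-link components become super-nodes, then soft links are re-expressed between super-nodes) can be sketched with a union-find structure. This is a minimal illustration, not the paper's pipeline: the function name `build_supernode_graph` is hypothetical, and weighting a super-node edge by the number of soft links it absorbs is an assumption, since the abstract does not state the weighting scheme. The LINE embedding and HDBSCAN clustering stages that follow are not shown.

```python
from collections import Counter


def build_supernode_graph(nodes, hard_links, soft_links):
    """Merge hard-linked accounts and rebuild a weighted soft-link graph.

    Returns (super_of, weights), where super_of maps each account to its
    super-node representative and weights maps an unordered pair of
    super-nodes to the number of soft links between them (assumed weight).
    """
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 1: connected components over hard links only.
    for a, b in hard_links:
        parent[find(a)] = find(b)
    super_of = {node: find(node) for node in nodes}

    # Step 2: lift soft links to super-nodes, dropping links that fall
    # inside a single super-node and accumulating parallel links.
    weights = Counter()
    for a, b in soft_links:
        sa, sb = super_of[a], super_of[b]
        if sa != sb:
            weights[tuple(sorted((sa, sb)))] += 1
    return super_of, dict(weights)


accounts = ["a", "b", "c", "d"]
hard = [("a", "b"), ("c", "d")]            # e.g. shared phone number
soft = [("a", "c"), ("b", "d"), ("a", "b")]  # e.g. shared device or IP
super_of, weights = build_supernode_graph(accounts, hard, soft)
print(super_of, weights)
```

Merging first is what shrinks the graph: every hard-linked group collapses to one node, and many soft links either disappear (internal to a group) or fold into a single weighted edge.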
[AI-52] Recontextualization Mitigates Specification Gaming without Modifying the Specification
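[Quick read]: This paper addresses specification gaming in language model training, where inaccurate labels or reward signals lead the model to exploit flaws in the training signal and behave in unintended ways, for example prioritizing evaluation metrics over real response quality, special-casing code to pass incorrect tests, lying to users, or becoming sycophantic. The key to its solution is recontextualization: completions are generated from prompts that discourage misbehavior and are then recontextualized as though they were responses to prompts that permit misbehavior, which trains the model to resist misbehavior even when instructions permit it. The method reduces the misbehavior reinforced by misspecified training signals without requiring any improvement to the supervision signal.
Link: https://arxiv.org/abs/2512.19027
Authors: Ariana Azarbal,Victor Gillioz,Vladimir Ivanov,Bryce Woodworth,Jacob Drori,Nevan Wichers,Aram Ebtekar,Alex Cloud,Alexander Matt Turner
Institution: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 57 pages, 41 figures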
Abstract:Developers often struggle to specify correct training labels and rewards. Perhaps they don’t need to. We propose recontextualization, which reduces how often language models “game” training signals, performing misbehaviors those signals mistakenly reinforce. We show recontextualization prevents models from learning to 1) prioritize evaluation metrics over chat response quality; 2) special-case code to pass incorrect tests; 3) lie to users; and 4) become sycophantic. Our method works by generating completions from prompts discouraging misbehavior and then recontextualizing them as though they were in response to prompts permitting misbehavior. Recontextualization trains language models to resist misbehavior even when instructions permit it. This mitigates the reinforcement of misbehavior from misspecified training signals, reducing specification gaming without improving the supervision signal.
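The data construction step described in the abstract is simple enough to sketch directly: the completion is generated under one prompt and stored for training under another. The helper name `recontextualize`, the prompt wording, and the `generate` callable (which would wrap a model call) are all hypothetical; how the resulting pairs are scored and used in training is not shown.

```python
def recontextualize(task, generate, discourage, permit):
    """Build one training example by swapping the instruction context.

    The completion is sampled with an instruction that discourages the
    misbehavior, then paired with an instruction that permits it.
    """
    generation_prompt = f"{discourage}\n{task}"
    completion = generate(generation_prompt)
    training_prompt = f"{permit}\n{task}"
    return {"prompt": training_prompt, "completion": completion}


def stub_generate(prompt):
    # Stand-in for a model call, used only so the sketch runs.
    return "def add(a, b):\n    return a + b"


example = recontextualize(
    task="Make the failing test for add() pass.",
    generate=stub_generate,
    discourage="Do not special-case the tests.",
    permit="You may special-case the tests if needed.",
)
print(example["prompt"])
print(example["completion"])
```

The point of the swap is that the model is trained on well-behaved completions in a context where misbehavior was allowed, so the behavior it learns does not depend on being told to behave.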
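[AI-53] The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation
[Quick read]: This paper addresses the limitations of current machine unlearning evaluation for Large Language Models (LLMs). Existing evaluations typically look only at performance degradation on the target unlearning dataset $D_u$, overlooking the fact that the model may retain knowledge about content semantically adjacent to $D_u$, which leads to misjudging how well unlearning worked. The key to its solution is an automated stress-testing framework, \name, which constructs a surrogate dataset $\tilde{D}_u$ that is semantically derived from $D_u$ yet sufficiently distinct in embedding space, and compares unlearning metric scores on $D_u$ and $\tilde{D}_u$ to test the reliability of the metrics themselves. Experiments show that widely used metrics frequently overestimate unlearning success, and that the framework exposes this shortfall.
Link: https://arxiv.org/abs/2512.19025
Authors: Hengrui Jia,Taoran Li,Jonas Guan,Varun Chandrasekaran
Institution: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: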
Abstract:Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model’s performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning (motivated by copyright or safety) implicitly target not only verbatim content in $D_u$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have "forgotten" the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose \name, an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$\beta$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.
[AI-54] IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments
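[Quick read]: This paper tackles Vision-Language Navigation (VLN) for indoor Unmanned Aerial Vehicles (UAVs), that is, navigating complex 3D indoor environments by following natural language instructions. Prior work has focused on ground robots or outdoor UAVs, while indoor UAV VLN remains underexplored because of complex flight dynamics, diverse environment structure, and the lack of high-quality datasets. The key to the solution is IndoorUAV, described as the first benchmark dataset for indoor UAVs. More than 16,000 high-quality trajectories form the long-horizon IndoorUAV-VLN subset, and segmenting them at semantically salient keyframes yields the short-horizon IndoorUAV-VLA subset. Trajectories are produced by simulating realistic flight dynamics, collecting flights manually, and applying data augmentation, and an automated annotation pipeline generates language instructions at varying granularity. The paper also proposes IndoorUAV-Agent, a model that uses task decomposition and multimodal reasoning to improve navigation performance.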
Link: https://arxiv.org/abs/2512.19024
Authors: Xu Liu,Yu Liu,Hanshuo Qiu,Yang Qirong,Zhouhui Lian
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-Language Navigation (VLN) enables agents to navigate in complex environments by following natural language instructions grounded in visual observations. Although most existing work has focused on ground-based robots or outdoor Unmanned Aerial Vehicles (UAVs), indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. To bridge this gap, we introduce IndoorUAV, a novel benchmark and method specifically tailored for VLN with indoor UAVs. We begin by curating over 1,000 diverse and structurally rich 3D indoor scenes from the Habitat simulator. Within these environments, we simulate realistic UAV flight dynamics to collect diverse 3D navigation trajectories manually, further enriched through data augmentation techniques. Furthermore, we design an automated annotation pipeline to generate natural language instructions of varying granularity for each trajectory. This process yields over 16,000 high-quality trajectories, comprising the IndoorUAV-VLN subset, which focuses on long-horizon VLN. To support short-horizon planning, we segment long trajectories into sub-trajectories by selecting semantically salient keyframes and regenerating concise instructions, forming the IndoorUAV-VLA subset. Finally, we introduce IndoorUAV-Agent, a novel navigation model designed for our benchmark, leveraging task decomposition and multimodal reasoning. We hope IndoorUAV serves as a valuable resource to advance research on vision-language embodied AI in the indoor aerial navigation domain.
[AI-55] The 6th International Verification of Neural Networks Competition (VNN-COMP 2025): Summary and Results
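[Quick read]: This paper addresses the lack of a unified evaluation standard and a fair comparison mechanism for neural network verification tools, with the aim of promoting tool standardization and community collaboration. The key to the solution is a structured competition framework: standardized formats for networks (ONNX) and specifications (VNN-LIB), evaluation on equal-cost hardware through an automatic pipeline based on AWS instances, and a requirement that participating teams fix their tool parameters before the final test sets are made public, which keeps the evaluation objective and reproducible.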
Link: https://arxiv.org/abs/2512.19007
Authors: Konstantin Kaulen,Tobias Ladner,Stanley Bak,Christopher Brix,Hai Duong,Thomas Flinkow,Taylor T. Johnson,Lukas Koller,Edoardo Manino,ThanhVu H Nguyen,Haoze Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Report on the results of VNN-COMP 2025. arXiv admin note: substantial text overlap with arXiv:2412.19985, arXiv:2312.16760, arXiv:2212.10376
Abstract:This report summarizes the 6th International Verification of Neural Networks Competition (VNN-COMP 2025), held as a part of the 8th International Symposium on AI Verification (SAIV), that was collocated with the 37th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2025 iteration, 8 teams participated on a diverse set of 16 regular and 9 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.
[AI-56] ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management
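[Quick read]: This paper addresses a core challenge in combining Artificial Intelligence (AI) with Operations Research (OR) for complex inventory systems: how to reconcile AI's adaptive perception with OR's structural rigor. The key to the solution is an OR-guided "Pretrain-then-Reinforce" framework. First, a simulation-augmented OR model generates high-quality reference decisions that implicitly capture business constraints and managerial preferences. Next, these OR-derived decisions serve as training labels for a domain-informed deep learning foundation model, which establishes initial decision-making capability. Finally, reinforcement learning (RL) fine-tuning lets the AI agent internalize OR's optimality principles, refines the policy through exploration, and allows expert-guided adaptation to specific scenarios such as promotional events. The method significantly outperforms incumbent industrial practice in both numerical experiments and a real-world deployment, which supports the claim that structured OR logic improves the performance and cross-scenario transferability of lightweight AI models.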
Link: https://arxiv.org/abs/2512.19001
Authors: Lingjie Zhao,Xue Yu,Yongzhi Qi,Hao Hu,Jianshen Zhang,Yingzheng Ma,Shuyu Han,Wei Qi,Zuo-Jun Max Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) gains momentum in handling complex inventory systems, a critical challenge persists: how to effectively reconcile AI’s adaptive perception with OR’s structural rigor. To bridge this gap, we propose a novel OR-Guided “Pretrain-then-Reinforce” framework. To provide structured guidance, we propose a simulation-augmented OR model that generates high-quality reference decisions, implicitly capturing complex business constraints and managerial preferences. Leveraging these OR-derived decisions as foundational training labels, we design a domain-informed deep learning foundation model to establish foundational decision-making capabilities, followed by a reinforcement learning (RL) fine-tuning stage. Uniquely, we position RL as a deep alignment mechanism that enables the AI agent to internalize the optimality principles of OR, while simultaneously leveraging exploration for general policy refinement and allowing expert guidance for scenario-specific adaptation (e.g., promotional events). Validated through extensive numerical experiments and a field deployment at this http URL augmented by a Difference-in-Differences (DiD) analysis, our model significantly outperforms incumbent industrial practices, delivering real-world gains of a 5.27-day reduction in turnover and a 2.29% increase in in-stock rates, alongside a 29.95% decrease in holding costs. Contrary to the prevailing trend of brute-force model scaling, our study demonstrates that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic. This approach offers a scalable and cost-effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR.
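The abstract says the field deployment was assessed with a Difference-in-Differences (DiD) analysis. As a reminder of what that estimator computes, here is a minimal sketch. The function name and the sample numbers in the comments are illustrative assumptions and are not taken from the paper.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-Differences estimate of a treatment effect.

    Each argument is the mean outcome (for example, turnover days) of one
    group in one period. The control group's change is subtracted from the
    treated group's change to net out trends that affect both groups.
    """
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Hypothetical numbers: turnover days fall from 30 to 24 where the new policy
# is deployed and from 30 to 29 elsewhere over the same period.
effect = did_estimate(30.0, 24.0, 30.0, 29.0)
```

The estimate is only meaningful under the usual parallel-trends assumption, that is, both groups would have moved together without the intervention.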
[AI-57] R-GenIMA: Integrating Neuroimaging and Genetics with Interpretable Multimodal AI for Alzheimer's Disease Progression
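[Quick read]: This paper addresses a key challenge in early detection of Alzheimer's disease (AD): integrating macro-scale neuroanatomical alterations with micro-scale genetic susceptibility, two heterogeneous signals that existing multimodal methods struggle to align. The core of the solution is R-GenIMA, an interpretable multimodal large language model that couples an ROI-wise vision Transformer with genetic prompting. Each parcellated brain region from structural MRI is represented as a visual token, and single nucleotide polymorphism (SNP) variations are encoded as structured text, which enables cross-modal attention linking regional atrophy patterns to underlying genetic factors. On the ADNI cohort, the framework achieves state-of-the-art performance in four-way classification (normal cognition NC, subjective memory concerns SMC, mild cognitive impairment MCI, and AD). Attention-based attribution reveals biologically meaningful stage-specific brain regions and gene signatures, along with ROI-gene association patterns across the disease continuum, which improves interpretability and the potential for clinical translation.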
Link: https://arxiv.org/abs/2512.18986
Authors: Kun Zhao,Siyuan Dai,Yingying Zhang,Guodong Liu,Pengfei Gu,Chenghua Lin,Paul M. Thompson,Alex Leow,Heng Huang,Lifang He,Liang Zhan,Haoteng Tang (for the Alzheimer’s Disease Neuroimaging Initiative (ADNI) Project)
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Early detection of Alzheimer’s disease (AD) requires models capable of integrating macro-scale neuroanatomical alterations with micro-scale genetic susceptibility, yet existing multimodal approaches struggle to align these heterogeneous signals. We introduce R-GenIMA, an interpretable multimodal large language model that couples a novel ROI-wise vision transformer with genetic prompting to jointly model structural MRI and single nucleotide polymorphisms (SNPs) variations. By representing each anatomically parcellated brain region as a visual token and encoding SNP profiles as structured text, the framework enables cross-modal attention that links regional atrophy patterns to underlying genetic factors. Applied to the ADNI cohort, R-GenIMA achieves state-of-the-art performance in four-way classification across normal cognition (NC), subjective memory concerns (SMC), mild cognitive impairment (MCI), and AD. Beyond predictive accuracy, the model yields biologically meaningful explanations by identifying stage-specific brain regions and gene signatures, as well as coherent ROI-Gene association patterns across the disease continuum. Attention-based attribution revealed genes consistently enriched for established GWAS-supported AD risk loci, including APOE, BIN1, CLU, and RBFOX1. Stage-resolved neuroanatomical signatures identified shared vulnerability hubs across disease stages alongside stage-specific patterns: striatal involvement in subjective decline, frontotemporal engagement during prodromal impairment, and consolidated multimodal network disruption in AD. These results demonstrate that interpretable multimodal AI can synthesize imaging and genetics to reveal mechanistic insights, providing a foundation for clinically deployable tools that enable earlier risk stratification and inform precision therapeutic strategies in Alzheimer’s disease.
[AI-58] Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection
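[Quick read]: This paper addresses the scarcity of high-quality long Chain-of-Thought (CoT) training data for multimodal reasoning, along with the limited reasoning depth, modality conversion errors, and rigid generation pipelines of existing methods. The key to the solution is SynSelect, a three-stage synthesis-selection framework. It first uses multiple heterogeneous multimodal Large Reasoning Models (LRMs) to generate diverse candidate CoTs, then applies instance-level and batch-level selection to keep the high-quality ones. Experiments show that supervised fine-tuning on SynSelect-generated data significantly outperforms baselines and that reinforcement learning afterward yields further gains.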
Link: https://arxiv.org/abs/2512.18956
Authors: Yizhi Wang,Linan Yue,Min-Ling Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Extending these successes to multimodal reasoning remains challenging due to the increased complexity of integrating diverse input modalities and the scarcity of high-quality long CoT training data. Existing multimodal datasets and CoT synthesis methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines, hindering model performance and stability. To this end, in this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks. Specifically, SynSelect first leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, and then applies both instance and batch level selection to filter high-quality CoTs that can effectively enhance the model’s reasoning capabilities. Extensive experiments on multiple multimodal benchmarks demonstrate that models supervised fine-tuned on SynSelect-generated data significantly outperform baselines and achieve further improvements after reinforcement learning post-training. Our results validate SynSelect as an effective approach for advancing multimodal LRMs reasoning capabilities.
[AI-59] Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement AAMAS2026 ACL AAMAS
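[Quick read]: This paper addresses the limited interpretability, low sample efficiency, and difficulty of continual improvement of Large Language Model (LLM) agents on complex tasks. Conventional approaches rely on fine-tuning LLM parameters, which is computationally expensive and adapts poorly to new tasks. The key to the solution is the MACLA framework, which decouples reasoning from learning: the LLM stays frozen, and all adaptation happens in an external hierarchical procedural memory. The memory extracts reusable procedures from trajectories, tracks their reliability with Bayesian posteriors, selects actions by expected-utility scoring, and refines procedures by contrasting successes with failures. The result is an agent that improves efficiently, interpretably, and continually without any LLM parameter updates.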
Link: https://arxiv.org/abs/2512.18950
Authors: Saman Forouzandeh,Wei Peng,Parham Moradi,Xinghuo Yu,Mahdi Jalili
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at The 25th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2026). 21 pages including references, with 7 figures and 8 tables. Code is publicly available at the authors GitHub repository: this https URL
Abstract:We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes and failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1 percent average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3 percent with 3.1 percent positive generalization. The system constructs memory in 56 seconds, 2800 times faster than the state-of-the-art LLM parameter-training baseline, compressing 2851 trajectories into 187 procedures. Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates.
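Reliability tracking with Bayesian posteriors and expected-utility selection can be illustrated with a Beta-Bernoulli model. This is a sketch under stated assumptions, not MACLA's implementation: the abstract does not specify the posterior family, so the uniform Beta(1, 1) prior, the `Procedure` class, and the `reward` and `cost` fields are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Procedure:
    """A stored procedure with success and failure counts (hypothetical schema)."""
    name: str
    successes: int = 0
    failures: int = 0
    reward: float = 1.0  # assumed payoff if the procedure succeeds
    cost: float = 0.0    # assumed fixed cost of attempting it

    def posterior_mean(self, prior_a=1.0, prior_b=1.0):
        """Mean of the Beta posterior over the procedure's success probability."""
        a = prior_a + self.successes
        b = prior_b + self.failures
        return a / (a + b)

    def expected_utility(self):
        """Expected payoff under the posterior mean success probability."""
        return self.posterior_mean() * self.reward - self.cost


def select(procedures):
    """Pick the procedure with the highest expected utility."""
    return max(procedures, key=lambda p: p.expected_utility())
```

Recording an outcome only increments a counter, which is one reason memory of this kind can be updated without touching model weights.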
[AI-60] Clustering-based Transfer Learning for Dynamic Multimodal MultiObjective Evolutionary Algorithm
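[Quick read]: This paper addresses the dual challenge of dynamic multimodal multiobjective optimization: tracking multiple equivalent Pareto optimal sets while maintaining population diversity in time-varying environments. Existing dynamic multiobjective evolutionary algorithms often neglect solution modality, and static multimodal multiobjective evolutionary algorithms cannot adapt to environmental change. The solution has two parts. First, a new benchmark suite of test functions that combines dynamic and multimodal properties provides a rigorous evaluation platform. Second, a new algorithm built on a clustering-based autoencoder prediction mechanism uses an autoencoder to process matched clusters and generate a highly diverse initial population, and an adaptive niching strategy balances convergence and diversity. Together these preserve population diversity more effectively in the decision space and achieve better convergence in the objective space.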
Link: https://arxiv.org/abs/2512.18947
Authors: Li Yan,Bolun Liu,Chao Li,Jing Liang,Kunjie Yu,Caitong Yue,Xuzhao Chai,Boyang Qu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Dynamic multimodal multiobjective optimization presents the dual challenge of simultaneously tracking multiple equivalent Pareto optimal sets and maintaining population diversity in time-varying environments. However, existing dynamic multiobjective evolutionary algorithms often neglect solution modality, whereas static multimodal multiobjective evolutionary algorithms lack adaptability to dynamic changes. To address the above challenge, this paper makes two primary contributions. First, we introduce a new benchmark suite of dynamic multimodal multiobjective test functions constructed by fusing the properties of both dynamic and multimodal optimization to establish a rigorous evaluation platform. Second, we propose a novel algorithm centered on a Clustering-based Autoencoder prediction dynamic response mechanism, which utilizes an autoencoder model to process matched clusters to generate a highly diverse initial population. Furthermore, to balance the algorithm’s convergence and diversity, we integrate an adaptive niching strategy into the static optimizer. Empirical analysis on 12 instances of dynamic multimodal multiobjective test functions reveals that, compared with several state-of-the-art dynamic multiobjective evolutionary algorithms and multimodal multiobjective evolutionary algorithms, our algorithm not only preserves population diversity more effectively in the decision space but also achieves superior convergence in the objective space.
[AI-61] When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models
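[Quick read]: This paper studies catastrophic forgetting in continual learning, particularly when models are quantized for deployment efficiency, and asks how to balance accuracy against knowledge retention. The key contribution is to reveal a nonlinear interaction between quantization precision (FP16, INT8, INT4) and replay buffer strategy. FP16 gives the best initial task performance (74.44% on NLU), but on subsequent tasks quantized models outperform FP16 by 8-15% in final-task forward accuracy, with INT4 reaching nearly double FP16's performance on Code generation (40% vs 20%), and INT8 offers the best balance between learning plasticity and knowledge retention across precision levels. Even a very small replay buffer (0.1%) substantially improves retention, for example raising NLU retention after Math training from 45% to 65%. The authors hypothesize that quantization-induced noise acts as implicit regularization that suppresses overfitting to new-task gradients in high-precision models. This challenges the conventional view that higher precision is always better and offers empirical evidence and practical guidance for deploying compressed models in continual learning.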
Link: https://arxiv.org/abs/2512.18934
Authors: Michael S. Zhang,Rishi A. Ruia,Arnav Kewalram,Saathvik Dharmapuram,Utkarsh Sharma,Kevin Zhu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), we observe a striking inversion on subsequent tasks: quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16’s performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention - increasing NLU retention after Math training from 45% to 65% across all precision levels - with INT8 consistently achieving the optimal balance between learning plasticity and knowledge retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models. These findings challenge the conventional wisdom that higher precision is always preferable, suggesting instead that INT8 quantization offers both computational efficiency and superior continual learning dynamics. Our results provide practical guidelines for deploying compressed models in continual learning scenarios: small replay buffers (1-2%) suffice for NLU tasks, while Math and Code benefit from moderate buffers (5-10%), with quantized models requiring less replay than FP16 to achieve comparable retention. Code is available at this https URL.
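To make "quantization-induced noise" concrete, the sketch below applies symmetric per-tensor INT8 quantization to a few weights and reconstructs them. It is a generic textbook scheme written in plain Python, not the quantization pipeline used in the paper, which the abstract does not describe. The function names are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [1.0, -0.5, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# The rounding error of each weight is bounded by half a quantization step.
noise = [w - r for w, r in zip(weights, restored)]
```

The per-weight error is at most `scale / 2`, so lower bit widths (larger steps) inject larger perturbations, which is the effect the authors hypothesize acts as a regularizer.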
[AI-62] An Empirical Study of Developer-Provided Context for AI Coding Assistants in Open-Source Projects
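[Quick read]: This paper addresses the fact that the response quality of generative AI in software engineering is limited by insufficient project context, and in particular examines how developers use explicit directives to supply a model with project-specific constraints. The key to the approach is to identify and systematically organize the project context that developers actively provide in real projects. By analyzing cursor rules in 401 open-source repositories, the authors build a taxonomy with five high-level themes: Conventions, Guidelines, Project Information, LLM Directives, and Examples. The findings give an empirical basis and design direction for the next generation of context-aware AI developer tools.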
Link: https://arxiv.org/abs/2512.18925
Authors: Shaokang Jiang,Daye Nam
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:While Large Language Models (LLMs) have demonstrated remarkable capabilities, research shows that their effectiveness depends not only on explicit prompts but also on the broader context provided. This requirement is especially pronounced in software engineering, where the goals, architecture, and collaborative conventions of an existing project play critical roles in response quality. To support this, many AI coding assistants have introduced ways for developers to author persistent, machine-readable directives that encode a project’s unique constraints. Although this practice is growing, the content of these directives remains unstudied. This paper presents a large-scale empirical study to characterize this emerging form of developer-provided context. Through a qualitative analysis of 401 open-source repositories containing cursor rules, we developed a comprehensive taxonomy of project context that developers consider essential, organized into five high-level themes: Conventions, Guidelines, Project Information, LLM Directives, and Examples. Our study also explores how this context varies across different project types and programming languages, offering implications for the next generation of context-aware AI developer tools.
[AI-63] Multimodal Bayesian Network for Robust Assessment of Casualties in Autonomous Triage NEURIPS2025
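[Quick read]: This paper addresses the problem that Mass Casualty Incidents (MCIs) can overwhelm emergency medical systems, so that under resource pressure delayed or incorrect casualty assessment risks preventable deaths. The key to the solution is a Bayesian network built from expert-defined rules that fuses estimates of physiological signs (severe hemorrhage, respiratory distress, alertness, and visible trauma) produced by multiple computer vision models. The resulting automated triage decisions need no training data, support inference with incomplete information, and are robust to noisy observations. The method improves physiological assessment accuracy (from 15% to 42% and from 19% to 46% in the two scenarios) and overall triage accuracy (from 14% to 53%), and expands diagnostic coverage from 31% to 95%, showing that expert-knowledge-guided probabilistic reasoning can improve automated triage systems.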
Link: https://arxiv.org/abs/2512.18908
Authors: Szymon Rusiecki,Cecilia G. Morales,Kimberly Elenberg,Leonard Weiss,Artur Dubrawski
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025 Workshop: Structured Probabilistic Inference Generative Modeling
Abstract:Mass Casualty Incidents can overwhelm emergency medical systems and resulting delays or errors in the assessment of casualties can lead to preventable deaths. We present a decision support framework that fuses outputs from multiple computer vision models, estimating signs of severe hemorrhage, respiratory distress, physical alertness, or visible trauma, into a Bayesian network constructed entirely from expert-defined rules. Unlike traditional data-driven models, our approach does not require training data, supports inference with incomplete information, and is robust to noisy or uncertain observations. We report performance for two missions involving 11 and 9 casualties, respectively, where our Bayesian network model substantially outperformed vision-only baselines during evaluation of our system in the DARPA Triage Challenge (DTC) field scenarios. The accuracy of physiological assessment improved from 15% to 42% in the first scenario and from 19% to 46% in the second, representing nearly threefold increase in performance. More importantly, overall triage accuracy increased from 14% to 53% in all patients, while the diagnostic coverage of the system expanded from 31% to 95% of the cases requiring assessment. These results demonstrate that expert-knowledge-guided probabilistic reasoning can significantly enhance automated triage systems, offering a promising approach to supporting emergency responders in MCIs. This approach enabled Team Chiron to achieve 4th place out of 11 teams during the 1st physical round of the DTC.
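Inference with incomplete information is easy to see in the simplest Bayesian network, a single hidden state with conditionally independent detector outputs. The sketch below is that special case only. The network structure, the sign names, and every probability are illustrative assumptions, not the expert rules used by the authors.

```python
def posterior_critical(prior, likelihoods, observations):
    """Posterior probability that a casualty is critical.

    likelihoods:  {sign: (P(positive | critical), P(positive | not critical))}
    observations: {sign: True, False, or None}, where None means the vision
                  model produced no output, so that sign is marginalized out.
    """
    p_critical = prior
    p_not_critical = 1.0 - prior
    for sign, observed in observations.items():
        if observed is None:
            continue  # a missing detector output contributes no evidence
        pos_given_c, pos_given_n = likelihoods[sign]
        p_critical *= pos_given_c if observed else 1.0 - pos_given_c
        p_not_critical *= pos_given_n if observed else 1.0 - pos_given_n
    return p_critical / (p_critical + p_not_critical)


# Hypothetical expert-assigned detector reliabilities.
LIKELIHOODS = {
    "hemorrhage": (0.9, 0.1),
    "respiratory_distress": (0.8, 0.2),
}
```

Because absent signs are skipped rather than imputed, the same function returns a usable estimate whether one detector fires or all of them do, and noisy detectors are handled by assigning them likelihoods closer to each other.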
[AI-64] Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models
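[Quick read]: This paper addresses the problem that traditional neural weight ablation (abliteration) methods degrade overall model performance when modifying specific behavioral patterns. Existing methods tend to harm performance on other tasks or domains while adjusting the target behavior, which limits their practical value. The key to the solution is Gabliteration, a neural weight modification technique that uses adaptive multi-directional projections with regularized layer selection to perform dynamic layer optimization and what the authors describe as theoretically superior weight modification, improving control over the target behavior while minimizing quality degradation in unrelated domains.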
Link: https://arxiv.org/abs/2512.18901
Authors: Gökdeniz Gülmez
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses the fundamental limitation of existing methods that compromise model quality while attempting to modify specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.
[AI-65] Psychometric Validation of the Sophotechnic Mediation Scale and a New Understanding of the Development of GenAI Mastery: Lessons from 3392 Adult Brazilian Workers
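[Quick read]: This paper asks whether sustained use of generative artificial intelligence (GenAI) systems gives rise to stable, internalized modes of cognition rather than merely transient efficiency gains. The key to the approach is to propose and empirically validate the concept of Sophotechnic Mediation by constructing and psychometrically validating the Sophotechnic Mediation Scale. Using independent cross-sectional samples collected over time from 3,932 adult workers in the Metropolitan Region of Pernambuco, Brazil, the study finds excellent internal consistency, a robust unidimensional structure, and measurement invariance across cohorts. It also finds that the construct is empirically distinct from Hypercultural mediation and is driven mainly by cumulative GenAI experience, with age moderating the rate of initial acquisition and the depth of later integration. The result is a measurable, testable framework for understanding how GenAI reshapes human cognition.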
Link: https://arxiv.org/abs/2512.18871
Authors: Bruno Campello de Souza
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 35 pages, 28 Manuscript, Portuguese and English Versions of the Instrument in Annex
Abstract:The rapid diffusion of generative artificial intelligence (GenAI) systems has introduced new forms of human-technology interaction, raising the question of whether sustained engagement gives rise to stable, internalized modes of cognition rather than merely transient efficiency gains. Grounded in the Cognitive Mediation Networks Theory, this study investigates Sophotechnic Mediation, a mode of thinking and acting associated with prolonged interaction with GenAI, and presents a comprehensive psychometric validation of the Sophotechnic Mediation Scale. Data were collected between 2023 and 2025 from independent cross-sectional samples totaling 3,932 adult workers from public and private organizations in the Metropolitan Region of Pernambuco, Brazil. Results indicate excellent internal consistency, a robust unidimensional structure, and measurement invariance across cohorts. Ordinal-robust confirmatory factor analyses and residual diagnostics show that elevated absolute fit indices reflect minor local dependencies rather than incorrect dimensionality. Distributional analyses reveal a time-evolving pattern characterized by a declining mass of non-adopters and convergence toward approximate Gaussianity among adopters, with model comparisons favoring a two-process hurdle model over a censored Gaussian specification. Sophotechnic Mediation is empirically distinct from Hypercultural mediation and is primarily driven by cumulative GenAI experience, with age moderating the rate of initial acquisition and the depth of later integration. Together, the findings support Sophotechnia as a coherent, measurable, and emergent mode of cognitive mediation associated with the ongoing GenAI revolution.
[AI-66] CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
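[Quick read]: This paper addresses the gap between surface competence and genuine conceptual understanding in the mathematical problem solving of Large Language Models (LLMs): models can answer exercises correctly through pattern recognition and memorized reuse, yet often fail when a problem requires real understanding of the underlying concept. Conventional Reinforcement Learning with Verifiable Rewards (RLVR) rewards only the final answer and provides little fine-grained conceptual signal, so optimization drifts toward pattern reuse rather than concept application. The key to the solution is CORE (Concept-Oriented REinforcement), which turns explicit concepts into a controllable supervision signal. Specifically, it (i) synthesizes concept-aligned quizzes; (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories; and (iii) reinforces conceptual reasoning through trajectory replacement after group failures, a lightweight forward-KL constraint, or standard GRPO applied directly to concept-aligned quizzes. CORE unifies direct training on concept-aligned quizzes with concept-injected rollouts and delivers consistent gains over baselines on both in-domain and out-of-domain math benchmarks.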
Link: https://arxiv.org/abs/2512.18857
Authors: Zijun Gao,Zhikun Xu,Xiao Ye,Ben Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual applications. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
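The forward-KL constraint mentioned in the abstract can be illustrated on two small categorical distributions. This is a sketch under an assumed convention: the concept-primed policy is treated as the reference distribution and the unguided policy as the one being aligned, so the quantity is KL(primed || unguided). The abstract does not state the direction or the level (token or sequence) at which the authors apply it, and the function name is hypothetical.

```python
import math


def forward_kl(primed, unguided, eps=1e-12):
    """KL(primed || unguided) for two categorical distributions.

    Terms where the reference probability is zero contribute nothing.
    `eps` guards against division by zero in the aligned distribution.
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(primed, unguided)
        if p > 0
    )


# Toy next-token distributions over a two-token vocabulary.
primed_policy = [0.5, 0.5]
unguided_policy = [0.25, 0.75]
penalty = forward_kl(primed_policy, unguided_policy)
```

The divergence is zero only when the two distributions match and it is asymmetric, so swapping the arguments gives a different penalty.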
[AI-67] HARBOR: Holistic Adaptive Risk assessment model for BehaviORal healthcare
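[Quick read]: This paper addresses the difficulty of behavioral health risk assessment, which stems from highly multimodal patient data and the temporal dynamics of mood and affective disorders. The key to the solution is HARBOR, a behavioral-health-aware language model that predicts a discrete mood and risk score, the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). By integrating physiological, behavioral, and self-reported mental health signals and modeling temporal dynamics, HARBOR outperforms traditional machine learning methods and proprietary large language models (LLMs) on structured clinical risk scoring, reaching 69% accuracy compared with 54% for logistic regression and 29% for the strongest proprietary LLM baseline.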
Link: https://arxiv.org/abs/2512.18829
Authors: Aditya Siddhant
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Behavioral healthcare risk assessment remains a challenging problem due to the highly multimodal nature of patient data and the temporal dynamics of mood and affective disorders. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral health aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off the shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.
[AI-68] Hyperbolic Graph Embeddings: a Survey and an Evaluation on Anomaly Detection
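[Quick read]: This paper addresses weak anomaly detection performance on complex graph structures, and in particular the limits of traditional Euclidean embeddings in capturing hierarchy, sparsity, and high-dimensional topology. The key to the solution is to embed graphs in hyperbolic space, whose geometry is naturally suited to tree-like structures and hierarchical relations. In the reported experiments, hyperbolic models such as P-VAE and HGCAE reach F1-scores of 94% on Elliptic and 80% on Cora, respectively, outperforming classical Euclidean methods such as DOMINANT and GraphSAGE.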
Link: https://arxiv.org/abs/2512.18826
Authors: Souhail Abdelmouaiz Sadat,Mohamed Yacine Touahria Miliani,Khadidja Hab El Hames,Hamida Seba,Mohammed Haddad
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This survey reviews hyperbolic graph embedding models and evaluates them on anomaly detection, highlighting their advantages over Euclidean methods in capturing complex structures. Evaluating models like HGCAE, $\mathcal{P}$-VAE, and HGCN demonstrates high performance, with $\mathcal{P}$-VAE achieving an F1-score of 94% on the Elliptic dataset and HGCAE scoring 80% on Cora. In contrast, Euclidean methods like DOMINANT and GraphSage struggle with complex data. The study emphasizes the potential of hyperbolic spaces for improving anomaly detection, and provides an open-source library to foster further research in this field.
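A short example shows why hyperbolic space suits hierarchical data. It computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Whether each surveyed method uses this model or another (such as the Lorentz model) is not stated in the abstract, so treat the choice as an assumption. The function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean gap of 0.09 is a much longer path near the boundary
# than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.09, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.99, 0.0])
```

Distances grow rapidly toward the boundary, so the volume available at a given radius grows exponentially, much like the number of nodes at a given depth of a tree.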
[AI-69] Controllable Probabilistic Forecasting with Stochastic Decomposition Layers
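[Quick read]: This paper addresses the high training cost, weak physical interpretability, and unclear uncertainty quantification of current AI weather prediction ensemble methods. Existing ensembles optimized with the continuous ranked probability score (CRPS) mostly inject noise globally through conditional normalization, which is expensive and yields perturbations without clear physical meaning. The key to the solution is Stochastic Decomposition Layers (SDL), adapted from StyleGAN's hierarchical noise injection. SDL applies learned perturbations at three decoder scales through latent-driven modulation, per-pixel noise, and channel scaling, converting a deterministic model into a probabilistic ensemble system. SDL requires less than 2% of the computational cost of training the baseline model and resolves uncertainty hierarchically: coarse layers modulate synoptic patterns while fine layers control mesoscale variability. Ensemble spread can also be adjusted after inference by rescaling the latent, giving forecasts that are well calibrated and reproducible.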
Link: https://arxiv.org/abs/2512.18815
Authors: John S. Schreck,William E. Chapman,Charlie Becker,David John Gagne II,Dhamma Kimpara,Nihanth Cherukuru,Judith Berner,Kirsten J. Mayer,Negin Sobhani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)
Comments:
Abstract:AI weather prediction ensembles with latent noise injection and optimized with the continuous ranked probability score (CRPS) have produced both accurate and well-calibrated predictions with far less computational cost compared with diffusion-based methods. However, current CRPS ensemble approaches vary in their training strategies and noise injection mechanisms, with most injecting noise globally throughout the network via conditional normalization. This structure increases training expense and limits the physical interpretability of the stochastic perturbations. We introduce Stochastic Decomposition Layers (SDL) for converting deterministic machine learning weather models into probabilistic ensemble systems. Adapted from StyleGAN’s hierarchical noise injection, SDL applies learned perturbations at three decoder scales through latent-driven modulation, per-pixel noise, and channel scaling. When applied to WXFormer via transfer learning, SDL requires less than 2% of the computational cost needed to train the baseline model. Each ensemble member is generated from a compact latent tensor (5 MB), enabling perfect reproducibility and post-inference spread adjustment through latent rescaling. Evaluation on 2022 ERA5 reanalysis shows ensembles with spread-skill ratios approaching unity and rank histograms that progressively flatten toward uniformity through medium-range forecasts, achieving calibration competitive with operational IFS-ENS. Multi-scale experiments reveal hierarchical uncertainty: coarse layers modulate synoptic patterns while fine layers control mesoscale variability. The explicit latent parameterization provides interpretable uncertainty quantification for operational forecasting and climate applications.
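Since the ensembles above are optimized with CRPS, a small worked example of the score may help. The sketch uses the standard empirical ensemble estimator for a single scalar forecast; the abstract does not state which estimator (for example, a "fair" variant) the authors use, and real forecasts are gridded fields rather than scalars, so treat `ensemble_crps` as a hypothetical helper.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of a scalar ensemble: accuracy term minus half the mean pairwise spread."""
    m = len(members)
    # Mean absolute error of the members against the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Half of the mean absolute difference over all ordered member pairs.
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread

single_member = ensemble_crps([1.0], 0.0)       # reduces to absolute error
two_members = ensemble_crps([0.0, 2.0], 1.0)    # spread is rewarded when it brackets the truth
```

With one member the score reduces to the absolute error, while a two-member ensemble that brackets the observation scores better than either member alone. This is why CRPS training pushes ensembles toward calibrated spread rather than collapse.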
[AI-70] Reliable Audio Deepfake Detection in Variable Conditions via Quantum-Kernel SVMs ICDM2025
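[Quick Read]: This paper addresses the overfitting and poor generalization of audio deepfake detection models when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often degrade for lack of training samples, and while classical kernel methods remain competitive, they depend on the choice of kernel. The key to the solution is a quantum kernel: a quantum feature map embeds audio features into a high-dimensional Hilbert space, so that a more expressive similarity measure can drive a compact classifier while the model structure stays unchanged. Only the classical kernel is swapped for a quantum one, no additional trainable parameters are introduced, and the quantum kernel is computed on a conventional computer. Experiments show that the resulting quantum-kernel SVM (QSVM) achieves markedly lower equal error rates (EER) and false-positive rates (FPR) than the classical SVM, with relative FPR reductions of 38.8% on ASVspoof 5 (2024) and 56.9% on ADD23.
Link: https://arxiv.org/abs/2512.18797
Authors: Lisan Al Amin,Vandana P. Janeja
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: This paper is accepted in ICDM 2025-MLC workshop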
Abstract:Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance heavily depends on the chosen kernel. Here, we show that using a quantum kernel in audio deepfake detection reduces false-positive rates without increasing model size. Quantum feature maps embed data into high-dimensional Hilbert spaces, enabling the use of expressive similarity measures and compact classifiers. Building on this motivation, we compare quantum-kernel SVMs (QSVMs) with classical SVMs using identical mel-spectrogram preprocessing and stratified 5-fold cross-validation across four corpora (ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, and an In-the-Wild set). QSVMs achieve consistently lower equal-error rates (EER): 0.183 vs. 0.299 on ASVspoof 5 (2024), 0.081 vs. 0.188 on ADD23, 0.346 vs. 0.399 on ASVspoof 2019, and 0.355 vs. 0.413 In-the-Wild. At the EER operating point (where FPR equals FNR), these correspond to absolute false-positive-rate reductions of 0.116 (38.8%), 0.107 (56.9%), 0.053 (13.3%), and 0.058 (14.0%), respectively. We also report how consistent the results are across cross-validation folds and margin-based measures of class separation, using identical settings for both models. The only modification is the kernel; the features and SVM remain unchanged, no additional trainable parameters are introduced, and the quantum kernel is computed on a conventional computer.
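The absolute and relative reductions quoted in the abstract follow directly from the paired EER values, since FPR equals EER at that operating point. The short check below recomputes them from the numbers in the abstract; the dictionary layout and the helper name `fpr_reduction` are illustrative choices, not the authors' code.

```python
# (QSVM EER, classical SVM EER) per corpus, as reported in the abstract.
eer = {
    "ASVspoof 5 (2024)": (0.183, 0.299),
    "ADD23": (0.081, 0.188),
    "ASVspoof 2019 LA": (0.346, 0.399),
    "In-the-Wild": (0.355, 0.413),
}

def fpr_reduction(qsvm_eer, svm_eer):
    """Absolute reduction and reduction relative to the classical SVM, in percent."""
    absolute = svm_eer - qsvm_eer
    relative_pct = 100.0 * absolute / svm_eer
    return round(absolute, 3), round(relative_pct, 1)

reductions = {name: fpr_reduction(q, c) for name, (q, c) in eer.items()}
```

The percentages are relative to the classical SVM's error rate, so the largest relative gain (56.9%) belongs to ADD23, whose baseline EER is the lowest of the four corpora.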
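[AI-71] The Dead Salmons of AI Interpretability
[Quick Read]: This paper tackles the widespread "dead salmon" phenomenon in AI interpretability research: on randomly initialized neural networks or uninformative inputs, interpretability methods can still produce explanations that look plausible but are spurious, which exposes the lack of rigorous statistical validation in current methods. The key to the solution is to treat explanations of computational systems as parameters of a statistical model, inferred within a statistical and causal framework: interpretability methods are recast as statistical estimators whose findings must be tested against explicit alternative computational hypotheses, with uncertainty quantified, which improves the credibility and reproducibility of explanations. This view also helps surface identifiability problems in common interpretability queries and lays a foundation for a rigorous, generalizable science of AI interpretability.
Link: https://arxiv.org/abs/2512.18792
Authors: Maxime Méloux,Giada Dirupo,François Portet,Maxime Peyrard
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures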
Abstract:In a striking neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations. Astonishingly, standard analyses of the time reported brain regions predictive of social emotions. The explanation, of course, was not supernatural cognition but a cautionary tale about misapplied statistical inference. In AI interpretability, reports of similar "dead salmon" artifacts abound: feature attribution, probing, sparse auto-encoding, and even causal analyses can produce plausible-looking explanations for randomly initialized neural networks. In this work, we examine this phenomenon and argue for a pragmatic statistical-causal reframing: explanations of computational systems should be treated as parameters of a (statistical) model, inferred from computational traces. This perspective goes beyond simply measuring statistical variability of explanations due to finite sampling of input data; interpretability methods become statistical estimators, and findings should be tested against explicit and meaningful alternative computational hypotheses, with uncertainty quantified with respect to the postulated statistical model. It also highlights important theoretical issues, such as the identifiability of common interpretability queries, which we argue is critical to understand the field's susceptibility to false discoveries, poor generalizability, and high variance. More broadly, situating interpretability within the standard toolkit of statistical inference opens promising avenues for future work aimed at turning AI interpretability into a pragmatic and rigorous science.
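One simple way to act on the abstract's call to test findings against an explicit alternative is to compare an interpretability score from the trained model with the same score computed on randomly initialized networks. The sketch below is an assumption about how such a check could look, not the authors' procedure; the scores are made-up placeholders and `empirical_p_value` is a hypothetical helper.

```python
def empirical_p_value(observed_score, null_scores):
    """Share of null scores at least as large as the observed one, with +1 smoothing."""
    exceed = sum(1 for s in null_scores if s >= observed_score)
    return (1 + exceed) / (1 + len(null_scores))

# Placeholder numbers: probe accuracy on a trained model versus nine random re-initializations.
trained_probe_accuracy = 0.81
random_init_accuracies = [0.78, 0.80, 0.83, 0.79, 0.82, 0.77, 0.80, 0.84, 0.79]
p = empirical_p_value(trained_probe_accuracy, random_init_accuracies)
```

Here three of the nine random networks match or beat the trained model's score, so the probe result would not count as evidence about the trained model's computation. That is the "dead salmon" situation the paper warns about.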
[AI-72] Smark: A Watermark for Text-to-Speech Diffusion Models via Discrete Wavelet Transform
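[Quick Read]: This paper addresses the challenges that text-to-speech (TTS) diffusion models raise for intellectual property protection and for tracing legitimate use, in particular that existing audio watermarking methods generalize poorly across model structures and degrade speech quality. The key to the solution is Smark, a universal watermarking scheme built around a lightweight watermark embedding framework that operates in the reverse diffusion paradigm shared by all TTS diffusion models and uses the discrete wavelet transform (DWT) to embed the watermark into the stable low-frequency regions of the audio. This integrates the watermark seamlessly with high-quality audio and makes it resistant to removal during the reverse diffusion process.
Link: https://arxiv.org/abs/2512.18791
Authors: Yichuan Zhang,Chengxin Li,Yujie Gu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: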
Abstract:Text-to-Speech (TTS) diffusion models generate high-quality speech, which raises challenges for the model intellectual property protection and speech tracing for legal use. Audio watermarking is a promising solution. However, due to the structural differences among various TTS diffusion models, existing watermarking methods are often designed for a specific model and degrade audio quality, which limits their practical applicability. To address this dilemma, this paper proposes a universal watermarking scheme for TTS diffusion models, termed Smark. This is achieved by designing a lightweight watermark embedding framework that operates in the common reverse diffusion paradigm shared by all TTS diffusion models. To mitigate the impact on audio quality, Smark utilizes the discrete wavelet transform (DWT) to embed watermarks into the relatively stable low-frequency regions of the audio, which ensures seamless watermark-audio integration and is resistant to removal during the reverse diffusion process. Extensive experiments are conducted to evaluate the audio quality and watermark performance in various simulated real-world attack scenarios. The experimental results show that Smark achieves superior performance in both audio quality and watermark extraction accuracy.
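The idea of hiding bits in low-frequency wavelet coefficients can be sketched on a plain signal. The example below uses a one-level Haar DWT and quantization index modulation (QIM) on the approximation coefficients. Both choices are assumptions made for illustration: the abstract does not name the wavelet or the embedding rule, and Smark embeds during reverse diffusion rather than on a finished waveform. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(x):
    """One-level Haar transform: low-frequency (approx) and high-frequency (detail) halves."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def embed_bits(signal, bits, step=0.05):
    """Snap each low-frequency coefficient to a multiple of `step` whose parity encodes a bit."""
    if len(signal) % 2 or len(bits) > len(signal) // 2:
        raise ValueError("need an even-length signal with at least 2 samples per bit")
    approx, detail = haar_dwt(signal)
    for i, bit in enumerate(bits):
        k = round((approx[i] / step - bit) / 2)
        approx[i] = (2 * k + bit) * step
    return haar_idwt(approx, detail)

def extract_bits(signal, n_bits, step=0.05):
    """Read the parity of each quantized low-frequency coefficient."""
    approx, _ = haar_dwt(signal)
    return [round(approx[i] / step) % 2 for i in range(n_bits)]

signal = [0.1 * math.sin(0.3 * n) for n in range(16)]
bits = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_bits(signal, bits)
recovered = extract_bits(marked, len(bits))
```

In this toy setup each sample moves by less than `step`, and the detail coefficients are left untouched, which mirrors the abstract's motivation for confining the watermark to the low-frequency band.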
[AI-73] MEEA: Mere Exposure Effect-Driven Confrontational Optimization for LLM Jailbreaking
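[Quick Read]: This paper addresses the insufficient stability of large language model (LLM) safety alignment under multi-turn interaction. Existing studies mostly assume a static safety boundary and overlook how contextual interaction dynamically shapes model behavior, which limits the generalization of attack methods. The key to the solution is MEEA (Mere Exposure Effect Attack), a fully automated black-box evaluation framework grounded in the psychological mere exposure effect: repeatedly introducing low-toxicity semantic content gradually lowers the model's effective safety threshold and progressively erodes its alignment constraints. The method builds semantically progressive prompt chains and optimizes them with simulated annealing guided by semantic similarity, toxicity, and jailbreak effectiveness. Experiments on several closed-source and open-source models show that it outperforms seven baselines, with an average attack success rate improvement exceeding 20%, and reveal that LLM safety behavior is history-dependent and dynamic, challenging the conventional assumption of static alignment boundaries.
Link: https://arxiv.org/abs/2512.18755
Authors: Jianyi Zhang,Shizhao Liu,Ziyin Zhou,Zhen Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: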
Abstract:The rapid advancement of large language models (LLMs) has intensified concerns about the robustness of their safety alignment. While existing jailbreak studies explore both single-turn and multi-turn strategies, most implicitly assume a static safety boundary and fail to account for how contextual interactions dynamically influence model behavior, leading to limited stability and generalization. Motivated by this gap, we propose MEEA (Mere Exposure Effect Attack), a psychology-inspired, fully automated black-box framework for evaluating multi-turn safety robustness, grounded in the mere exposure effect. MEEA leverages repeated low-toxicity semantic exposure to induce a gradual shift in a model’s effective safety threshold, enabling progressive erosion of alignment constraints over sustained interactions. Concretely, MEEA constructs semantically progressive prompt chains and optimizes them using a simulated annealing strategy guided by semantic similarity, toxicity, and jailbreak effectiveness. Extensive experiments on both closed-source and open-source models, including GPT-4, Claude-3.5, and DeepSeek-R1, demonstrate that MEEA consistently achieves higher attack success rates than seven representative baselines, with an average Attack Success Rate (ASR) improvement exceeding 20%. Ablation studies further validate the necessity of both annealing-based optimization and contextual exposure mechanisms. Beyond improved attack effectiveness, our findings indicate that LLM safety behavior is inherently dynamic and history-dependent, challenging the common assumption of static alignment boundaries and highlighting the need for interaction-aware safety evaluation and defense mechanisms. Our code is available at: this https URL
[AI-74] PIPCFR: Pseudo-outcome Imputation with Post-treatment Variables for Individual Treatment Effect Estimation
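[Quick Read]: This paper addresses individual treatment effect (ITE) estimation from observational data, where only one treatment outcome is observed per individual and the potential outcomes under other treatments must be predicted. Existing methods typically rely on inferred pseudo-outcomes or matched instance pairs but largely overlook the influence of post-treatment variables on the outcome, which inflates the variance of counterfactual predictions and lowers estimation accuracy. The key to the solution is PIPCFR (Pseudo-outcome Imputation with Post-treatment Variables for Counterfactual Regression), which incorporates post-treatment variables to improve pseudo-outcome imputation. The method establishes a new theoretical bound that makes explicit the link between post-treatment variables and ITE estimation accuracy, and learns representations that preserve informative components while reducing bias, significantly lowering ITE error in counterfactual regression.
Link: https://arxiv.org/abs/2512.18737
Authors: Zichuan Lin,Xiaokai Huang,Jiate Liu,Yuxuan Han,Jia Chen,Xiapeng Wu,Deheng Ye
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures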
Abstract:The estimation of individual treatment effects (ITE) focuses on predicting the outcome changes that result from a change in treatment. A fundamental challenge in observational data is that while we need to infer outcome differences under alternative treatments, we can only observe each individual’s outcome under a single treatment. Existing approaches address this limitation either by training with inferred pseudo-outcomes or by creating matched instance pairs. However, recent work has largely overlooked the potential impact of post-treatment variables on the outcome. This oversight prevents existing methods from fully capturing outcome variability, resulting in increased variance in counterfactual predictions. This paper introduces Pseudo-outcome Imputation with Post-treatment Variables for Counterfactual Regression (PIPCFR), a novel approach that incorporates post-treatment variables to improve pseudo-outcome imputation. We analyze the challenges inherent in utilizing post-treatment variables and establish a novel theoretical bound for ITE risk that explicitly connects post-treatment variables to ITE estimation accuracy. Unlike existing methods that ignore these variables or impose restrictive assumptions, PIPCFR learns effective representations that preserve informative components while mitigating bias. Empirical evaluations on both real-world and simulated datasets demonstrate that PIPCFR achieves significantly lower ITE errors compared to existing methods.
[AI-75] Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection
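[Quick Read]: This paper addresses two key problems in detecting malicious agents in multi-agent systems (MAS). First, existing graph anomaly detection (GAD)-based methods rely mainly on coarse sentence-level information and overlook fine-grained lexical cues, which limits detection performance. Second, these methods lack interpretability, which limits their reliability and practical use in safety-critical settings. The key to the solution is XG-Guard, a framework in which a bi-level agent encoder jointly models the sentence-level and token-level representations of each agent, a theme-based anomaly detector captures shifts in discussion focus during the dialogue, and a bi-level score fusion mechanism quantifies token-level contributions, yielding accurate and explainable identification of malicious agents.
Link: https://arxiv.org/abs/2512.18733
Authors: Junjun Pan,Yixin Liu,Rui Miao,Kaize Ding,Yu Zheng,Quoc Viet Hung Nguyen,Alan Wee-Chung Liew,Shirui Pan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 14 pages, 3 tables, 5 figures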
Abstract:Large language model (LLM)-based multi-agent systems (MAS) have shown strong capabilities in solving complex tasks. As MAS become increasingly autonomous in various safety-critical tasks, detecting malicious agents has become a critical security concern. Although existing graph anomaly detection (GAD)-based defenses can identify anomalous agents, they mainly rely on coarse sentence-level information and overlook fine-grained lexical cues, leading to suboptimal performance. Moreover, the lack of interpretability in these methods limits their reliability and real-world applicability. To address these limitations, we propose XG-Guard, an explainable and fine-grained safeguarding framework for detecting malicious agents in MAS. To incorporate both coarse and fine-grained textual information for anomalous agent identification, we utilize a bi-level agent encoder to jointly model the sentence- and token-level representations of each agent. A theme-based anomaly detector further captures the evolving discussion focus in MAS dialogues, while a bi-level score fusion mechanism quantifies token-level contributions for explanation. Extensive experiments across diverse MAS topologies and attack scenarios demonstrate robust detection performance and strong interpretability of XG-Guard.
[AI-76] Counterfactual Basis Extension and Representational Geometry: An MDL-Constrained Model of Conceptual Growth
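[Quick Read]: The core question this paper addresses is how, during concept learning, the representational basis can expand in a principled and selective way when existing representations fail to account for experience, thereby supporting conceptual growth and theory change. Conventional learning models usually presuppose a fixed representational space, whereas this paper asks the more fundamental question of how the representation itself can expand under suitable conditions. The key to the solution is a geometric framework that models conceptual growth as admissible basis extension under a Minimum Description Length (MDL) criterion. Specifically, experience is represented as vectors relative to the current conceptual subspace, and their residual components capture systematic representational failure. Candidate conceptual extensions are restricted to low-rank, admissible transformations, and any MDL-accepted extension can be chosen so that its novel directions lie entirely within the residual span induced by experience, while extensions orthogonal to that span strictly increase description length and are rejected. This makes imagination and conceptual innovation conservative: internally generated counterfactual representations contribute to learning only insofar as they expose or amplify structured residual error, and cannot introduce arbitrary novelty. By combining geometric constraints with MDL optimization, the framework provides a normative selection principle for representational change and clarifies the limits of imagination's role in learning.
Link: https://arxiv.org/abs/2512.18732
Authors: Chainarong Amornbunchornvej
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: First draft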
Abstract:Concept learning becomes possible only when existing representations fail to account for experience. Most models of learning and inference, however, presuppose a fixed representational basis within which belief updating occurs. In this paper, I address a prior question: under what structural conditions can the representational basis itself expand in a principled and selective way? I propose a geometric framework in which conceptual growth is modeled as admissible basis extension evaluated under a Minimum Description Length (MDL) criterion. Experience, whether externally observed or internally simulated, is represented as vectors relative to a current conceptual subspace. Residual components capture systematic representational failure, and candidate conceptual extensions are restricted to low-rank, admissible transformations. I show that any MDL-accepted extension can be chosen so that its novel directions lie entirely within the residual span induced by experience, while extensions orthogonal to this span strictly increase description length and are therefore rejected. This yields a conservative account of imagination and conceptual innovation. Internally generated counterfactual representations contribute to learning only insofar as they expose or amplify structured residual error, and cannot introduce arbitrary novelty. I further distinguish representational counterfactuals–counterfactuals over an agent’s conceptual basis–from causal or value-level counterfactuals, and show how MDL provides a normative selection principle governing representational change. Overall, the framework characterizes conceptual development as an error-driven, geometry-constrained process of basis extension, clarifying both the role and the limits of imagination in learning and theory change. 
MSC classes: 68T05, 68T27. ACM classes: I.2.6; I.2.4; F.1.2. Submission history: [v1] Sun, 21 Dec 2025 13:39:00 UTC.
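The claim that only extensions inside the residual span can pay for themselves under MDL can be illustrated with a toy calculation. The sketch below assumes an orthonormal basis and a crude two-part description length (a fixed cost per basis direction plus squared residual error). The paper's actual MDL criterion is not given in the abstract, so the cost function, the data, and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(x, basis):
    """Component of x outside the span of an orthonormal basis."""
    r = list(x)
    for b in basis:
        coeff = dot(x, b)
        r = [ri - coeff * bi for ri, bi in zip(r, b)]
    return r

def description_length(data, basis, cost_per_direction=2.0):
    """Toy two-part code: model cost plus squared residual error (an assumed proxy)."""
    error = sum(dot(r, r) for r in (residual(x, basis) for x in data))
    return cost_per_direction * len(basis) + error

e1, e2, e3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
experience = [(1.0, 2.0, 0.0), (2.0, -1.0, 0.0), (0.0, 3.0, 0.0)]

current = description_length(experience, [e1])            # residuals all point along e2
in_span = description_length(experience, [e1, e2])        # extension inside the residual span
orthogonal = description_length(experience, [e1, e3])     # extension orthogonal to it
```

Adding `e2` removes all residual error and shortens the description, while adding `e3` explains nothing and only adds model cost, so it would be rejected. This matches the conservative picture of conceptual innovation described in the abstract.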
[AI-77] KeenKT: Knowledge Mastery-State Disambiguation for Knowledge Tracing AAAI2026
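[Quick Read]: This paper addresses the ambiguity caused by single-point estimates in existing knowledge tracing (KT) methods, which cannot distinguish a student's true ability from chance fluctuations (such as outbursts of performance or careless slips) and therefore misjudge knowledge mastery. The key to the solution is KeenKT, which represents the student's knowledge state at each learning interaction with a Normal-Inverse-Gaussian (NIG) distribution to capture fluctuations in learning behavior. It also designs an NIG-distance-based attention mechanism to model the evolution of the knowledge state, and introduces a diffusion-based denoising reconstruction loss and a distributional contrastive learning loss to improve robustness.
Link: https://arxiv.org/abs/2512.18709
Authors: Zhifei Li,Lifan Chen,Jiali Yi,Xiaoju Hou,Yue Zhao,Wenxin Huang,Miao Zhang,Kui Xiao,Bing Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by the Association for the Advancement of Artificial Intelligence 2026 (AAAI2026)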
Abstract:Knowledge Tracing (KT) aims to dynamically model a student’s mastery of knowledge concepts based on their historical learning interactions. Most current methods rely on single-point estimates, which cannot distinguish true ability from outburst or carelessness, creating ambiguity in judging mastery. To address this issue, we propose a Knowledge Mastery-State Disambiguation for Knowledge Tracing model (KeenKT), which represents a student’s knowledge state at each interaction using a Normal-Inverse-Gaussian (NIG) distribution, thereby capturing the fluctuations in student learning behaviors. Furthermore, we design an NIG-distance-based attention mechanism to model the dynamic evolution of the knowledge state. In addition, we introduce a diffusion-based denoising reconstruction loss and a distributional contrastive learning loss to enhance the model’s robustness. Extensive experiments on six public datasets demonstrate that KeenKT outperforms SOTA KT models in terms of prediction accuracy and sensitivity to behavioral fluctuations. The proposed method yields the maximum AUC improvement of 5.85% and the maximum ACC improvement of 6.89%.
[AI-78] CauTraj: A Causal-Knowledge-Guided Framework for Lane-Changing Trajectory Planning of Autonomous Vehicles
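[Quick Read]: This paper addresses the failure of existing methods to incorporate human drivers' prior knowledge when planning lane-changing trajectories for autonomous vehicles in human-machine mixed traffic. The core solution is a new trajectory planning method that embeds causal prior knowledge into a model predictive control (MPC) framework. Longitudinal and lateral microscopic behaviors are first modeled to quantify interaction risk, a staged causal graph then captures the causal dependencies in lane-changing scenarios, and causal inference (including the average causal effect, ATE, and the conditional average treatment effect, CATE) estimates the causal influence between the lane-changing vehicle and surrounding vehicles. These interpretable and stable causal priors are then integrated into the MPC, bringing planned trajectories markedly closer to human driving behavior: experiments show that the method reduces maximum trajectory deviation from 1.2 m to 0.2 m, lateral velocity fluctuation by 60%, and yaw angle variability by 50%, improving the safety, stability, and realism of autonomous driving systems.
Link: https://arxiv.org/abs/2512.18703
Authors: Cailin Lei,Haiyang Wu,Yuxiong Ji,Xiaoyu Cai,Yuchuan Du
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: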
Abstract:Enhancing the performance of trajectory planners for lane-changing vehicles is one of the key challenges in autonomous driving within human-machine mixed traffic. Most existing studies have not incorporated human drivers' prior knowledge when designing trajectory planning models. To address this issue, this study proposes a novel trajectory planning framework that integrates causal prior knowledge into the control process. Both longitudinal and lateral microscopic behaviors of vehicles are modeled to quantify interaction risk, and a staged causal graph is constructed to capture causal dependencies in lane-changing scenarios. Causal effects between the lane-changing vehicle and surrounding vehicles are then estimated using causal inference, including average causal effects (ATE) and conditional average treatment effects (CATE). These causal priors are embedded into a model predictive control (MPC) framework to enhance trajectory planning. The proposed approach is validated on naturalistic vehicle trajectory datasets. Experimental results show that: (1) causal inference provides interpretable and stable quantification of vehicle interactions; (2) individual causal effects reveal driver heterogeneity; and (3) compared with the baseline MPC, the proposed method achieves a closer alignment with human driving behaviors, reducing maximum trajectory deviation from 1.2 m to 0.2 m, lateral velocity fluctuation by 60%, and yaw angle variability by 50%. These findings provide methodological support for human-like trajectory planning and practical value for improving safety, stability, and realism in autonomous vehicle testing and traffic simulation platforms.
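As a reminder of what ATE and CATE measure, the sketch below uses the simplest difference-in-means estimator on a toy table. This assumes treatment is as good as randomly assigned within each group, which is a strong assumption; the abstract does not say which estimator the authors use, and the records, field names, and driver labels are invented for illustration.

```python
from statistics import mean

def ate(records):
    """Difference in mean outcome between treated and untreated records."""
    treated = [r["outcome"] for r in records if r["treated"]]
    control = [r["outcome"] for r in records if not r["treated"]]
    return mean(treated) - mean(control)

def cate(records, key):
    """The same difference in means, computed separately within each value of `key`."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return {group: ate(rows) for group, rows in groups.items()}

# Invented example: "treated" marks a close follower in the target lane,
# "outcome" is lane-change duration in seconds.
records = [
    {"treated": 1, "outcome": 6.0, "driver": "cautious"},
    {"treated": 0, "outcome": 4.0, "driver": "cautious"},
    {"treated": 1, "outcome": 5.0, "driver": "assertive"},
    {"treated": 0, "outcome": 4.0, "driver": "assertive"},
    {"treated": 1, "outcome": 7.0, "driver": "cautious"},
    {"treated": 0, "outcome": 4.0, "driver": "assertive"},
]
overall_effect = ate(records)
effect_by_driver = cate(records, "driver")
```

The overall effect averages over everyone, while the conditional effects differ by driver type. This is the kind of heterogeneity the abstract refers to when it says individual causal effects reveal driver heterogeneity.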
[AI-79] Fusion of Multiscale Features Via Centralized Sparse-attention Network for EEG Decoding
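[Quick Read]: This paper addresses the inherent spatiotemporal heterogeneity of electroencephalography (EEG) signals in decoding: signal features differ substantially across temporal scales and spatial locations, so conventional models struggle to capture their complex structure. The key to the solution is a multi-branch parallel architecture in which each temporal scale has its own independent spatial feature extraction module, together with EEG-CSANet, a multiscale feature fusion network based on centralized sparse attention. The network uses a main-auxiliary branch structure: the main branch captures core spatiotemporal patterns through multiscale self-attention, and the auxiliary branch enables efficient local interactions through sparse cross-attention, which markedly improves feature fusion and model performance.
Link: https://arxiv.org/abs/2512.18689
Authors: Xiangrui Cai,Shaocheng Ma,Lei Cao,Jie Li,Tianyu Liu,Yilin Dong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: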
Abstract:Electroencephalography (EEG) signal decoding is a key technology that translates brain activity into executable commands, laying the foundation for direct brain-machine interfacing and intelligent interaction. To address the inherent spatiotemporal heterogeneity of EEG signals, this paper proposes a multi-branch parallel architecture, where each temporal scale is equipped with an independent spatial feature extraction module. To further enhance multi-branch feature fusion, we propose a Fusion of Multiscale Features via Centralized Sparse-attention Network (EEG-CSANet), a centralized sparse-attention network. It employs a main-auxiliary branch architecture, where the main branch models core spatiotemporal patterns via multiscale self-attention, and the auxiliary branch facilitates efficient local interactions through sparse cross-attention. Experimental results show that EEG-CSANet achieves state-of-the-art (SOTA) performance across five public datasets (BCIC-IV-2A, BCIC-IV-2B, HGD, SEED, and SEED-VIG), with accuracies of 88.54%, 91.09%, 99.43%, 96.03%, and 90.56%, respectively. Such performance demonstrates its strong adaptability and robustness across various EEG decoding tasks. Moreover, extensive ablation studies are conducted to enhance the interpretability of EEG-CSANet. In the future, we hope that EEG-CSANet could serve as a promising baseline model in the field of EEG signal decoding. The source code is publicly available at: this https URL
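[AI-80] Social Comparison without Explicit Inference of Others' Reward Values: A Constructive Approach Using a Probabilistic Generative Model
[Quick Read]: The question this paper addresses is how, in primate social cognition, individuals use information about others' rewards to evaluate their own: whether the computational mechanism of social comparison rests on inferring others' subjective reward valuations or merely on objective reward differences. The key to the solution is three computational models with different degrees of social information processing, the Internal Prediction Model (IPM), the No Comparison Model (NCM), and the External Comparison Model (ECM), trained on monkey behavior, reward, and conditioned-stimulus data with a multi-layered, multimodal latent Dirichlet allocation and evaluated by classification. The ECM achieved the best Rand Index (0.88 vs. 0.79 for the IPM), suggesting that social comparison relies mainly on objective reward differences rather than on inferences about others' subjective states.
Link: https://arxiv.org/abs/2512.18687
Authors: Yosuke Taniuchi,Chie Hieida,Atsushi Noritake,Kazushi Ikeda,Masaki Isoda
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to Advanced Robotics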
Abstract:Social comparison – the process of evaluating one’s rewards relative to others – plays a fundamental role in primate social cognition. However, it remains unknown from a computational perspective how information about others’ rewards affects the evaluation of one’s own reward. With a constructive approach, this study examines whether monkeys merely recognize objective reward differences or, instead, infer others’ subjective reward valuations. We developed three computational models with varying degrees of social information processing: an Internal Prediction Model (IPM), which infers the partner’s subjective values; a No Comparison Model (NCM), which disregards partner information; and an External Comparison Model (ECM), which directly incorporates the partner’s objective rewards. To test model performance, we used a multi-layered, multimodal latent Dirichlet allocation. We trained the models on a dataset containing the behavior of a pair of monkeys, their rewards, and the conditioned stimuli. Then, we evaluated the models’ ability to classify subjective values across pre-defined experimental conditions. The ECM achieved the highest classification score in the Rand Index (0.88 vs. 0.79 for the IPM) under our settings, suggesting that social comparison relies on objective reward differences rather than inferences about subjective states.
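The Rand Index used to score the models is the fraction of item pairs on which two labelings agree, counting a pair as agreeing when both labelings put it in the same group or both put it in different groups. A minimal sketch follows; the labels are toy values, not the study's data, and the abstract does not say whether an adjusted variant was used.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs that both labelings treat the same way."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same length, with at least 2 items")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        1
        for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agree / len(pairs)

same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # label names do not matter
one_split = rand_index([0, 0, 1, 1], [0, 0, 1, 2])        # one pair is wrongly separated
```

Because only pair membership counts, renaming clusters leaves the score at 1.0, and splitting one true cluster costs exactly the pairs that were separated.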
[AI-81] Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing
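[Quick Read]: This paper addresses the high inference cost and cold-start latency of Mixture-of-Experts (MoE) large language models in serverless computing. Because MoE models contain many experts whose activation depends on the input, conventional model partitioning cannot effectively reduce memory overhead, which leads to low resource utilization and high response latency. The key to the solution is Remoe, a system with heterogeneous deployment: non-expert modules run on GPUs, expert modules run on CPUs, and infrequently activated experts are offloaded to separate serverless functions to reduce memory footprint and enable parallel execution. It combines three techniques: a Similar Prompts Searching (SPS) algorithm that predicts expert activation patterns from the semantic similarity of inputs; a Main Model Pre-allocation (MMP) algorithm that meets service-level objectives (SLOs) through worst-case memory estimation; and a joint memory and replica optimization framework, based on Lagrangian duality and the Longest Processing Time (LPT) algorithm, that minimizes overall inference cost.
Link: https://arxiv.org/abs/2512.18674
Authors: Wentao Liu,Yuhao Hu,Ruiting Zhou,Baochun Li,Ne Wang
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: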
Abstract:Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1) a Similar Prompts Searching (SPS) algorithm to predict expert activation patterns based on semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm to ensure service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it across multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold start latency by 47% compared to state-of-the-art baselines.
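The Longest Processing Time rule mentioned in the abstract is a classic greedy heuristic for balancing load: sort jobs from longest to shortest and always give the next job to the least-loaded worker. The sketch below shows the rule on its own. How Remoe maps experts and replicas onto "jobs" and "workers" is not described in the abstract, so the workload numbers and the name `lpt_schedule` are hypothetical.

```python
import heapq

def lpt_schedule(job_times, n_workers):
    """Greedy LPT: longest job first, each assigned to the currently least-loaded worker."""
    heap = [(0.0, w) for w in range(n_workers)]   # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    order = sorted(enumerate(job_times), key=lambda jt: jt[1], reverse=True)
    for job, duration in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(job)
        heapq.heappush(heap, (load + duration, worker))
    makespan = max(load for load, _ in heap)
    return assignment, makespan

# Hypothetical per-expert processing times spread over two replicas.
assignment, makespan = lpt_schedule([7, 5, 4, 3, 3, 2], n_workers=2)
```

On this input the two workers end up with equal load, so the heuristic happens to reach the optimum. In general LPT gives only an approximation, which is usually acceptable for a scheduler that must run quickly.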
[AI-82] IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner Modeling EACL2026
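[Quick Read]: This paper addresses the single-turn limitation common to current large language model (LLM)-based tutoring systems: lacking a persistent model of the learner's knowledge state, they struggle to provide principled, explainable, and long-term pedagogical support. The key to the solution is IntelliCode, a multi-agent tutoring system built around a centralized, versioned learner state that integrates mastery estimates, misconceptions, review schedules, and engagement signals. A StateGraph Orchestrator coordinates six specialized agents (skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring), each operating as a pure transformation over the shared state under a single-writer policy. This enables auditable mastery updates, proficiency-aware hints, dependency-aware curriculum adaptation, and safety-aligned prompting, giving transparent and reliable LLM-driven personalized tutoring.
Link: https://arxiv.org/abs/2512.18669
Authors: Jones David,Shreya Ghosh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to EACL 2026 System Demonstrations Track. 6 pages (main content), 6 figures, includes appendices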
Abstract:LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, making it difficult to provide principled, transparent, and long-term pedagogical support. We introduce IntelliCode, a multi-agent LLM tutoring system built around a centralized, versioned learner state that integrates mastery estimates, misconceptions, review schedules, and engagement signals. A StateGraph Orchestrator coordinates six specialized agents: skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring, each operating as a pure transformation over the shared state under a single-writer policy. This architecture enables auditable mastery updates, proficiency-aware hints, dependency-aware curriculum adaptation, and safety-aligned prompting. The demo showcases an end-to-end tutoring workflow: a learner attempts a DSA problem, receives a conceptual hint when stuck, submits a corrected solution, and immediately sees mastery updates and a personalized review interval. We report validation results with simulated learners, showing stable state updates, improved task success with graduated hints, and diverse curriculum coverage. IntelliCode demonstrates how persistent learner modeling, orchestrated multi-agent reasoning, and principled instructional design can be combined to produce transparent and reliable LLM-driven tutoring.
ACM classes: I.2.6; I.2.7; K.3.1
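The pattern of agents as pure transformations over a versioned state with a single writer can be sketched in a few lines. Everything below is a hypothetical illustration of that pattern, not IntelliCode's implementation: the field names, the mastery update rule, and the review intervals are assumptions, and only two of the six agents are mimicked.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LearnerState:
    version: int
    mastery: dict          # skill -> estimated mastery in [0, 1]
    review_in_days: dict   # skill -> days until the next review

def assess_skill(state, skill, correct):
    """Agent: reads the state and proposes a new mastery estimate (assumed update rule)."""
    old = state.mastery.get(skill, 0.0)
    target = 1.0 if correct else 0.0
    return {"mastery": {**state.mastery, skill: old + 0.5 * (target - old)}}

def schedule_review(state, skill):
    """Agent: reads the state and proposes a review interval (assumed thresholds)."""
    days = 1 if state.mastery.get(skill, 0.0) < 0.5 else 4
    return {"review_in_days": {**state.review_in_days, skill: days}}

def commit(state, proposal):
    """Orchestrator: the only writer; each commit produces a new, numbered state."""
    return replace(state, version=state.version + 1, **proposal)

s0 = LearnerState(version=0, mastery={}, review_in_days={})
s1 = commit(s0, assess_skill(s0, "binary_search", correct=True))
s2 = commit(s1, schedule_review(s1, "binary_search"))
```

Because agents only return proposals and never mutate the state, every version remains available for inspection. This is one plausible reading of what the abstract calls auditable mastery updates.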
[AI-83] Automatic Adaptation to Concept Complexity and Subjective Natural Concepts: A Cognitive Model based on Chunking
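[Quick Read]: This paper addresses a central question in cognitive science: which fundamental psychological processes underlie the formation and retrieval of different types of concepts in short-term memory (STM) and long-term memory (LTM). The key to the solution is to propose and test the foundational role of chunking mechanisms in concept learning, using the CogAct computational model to ground concept learning in basic cognitive processes (such as attention, STM, LTM, and chunking). The model adapts automatically to learn concepts ranging from simple logical functions and artificial categories to natural concepts from raw data in domains as different as literature, chess, and music, without task-specific architectural changes. This goes beyond traditional cognitive models, which stop at artificial categories, and beyond (non-GPT) deep learning models, which require task-specific changes to the architecture. The study also introduces a way to design human benchmarks based on individual subjective concept spaces, controlling for differences in human experience and so modeling concept acquisition for complex categories more realistically.
Link: https://arxiv.org/abs/2512.18665
Authors: Dmitry Bennett,Fernand Gobet
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: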
Abstract:A key issue in cognitive science concerns the fundamental psychological processes that underlie the formation and retrieval of multiple types of concepts in short-term and long-term memory (STM and LTM, respectively). We propose that chunking mechanisms play an essential role and show how the CogAct computational model grounds concept learning in fundamental cognitive processes and structures (such as chunking, attention, STM and LTM). First are the in-principle demonstrations, with CogAct automatically adapting to learn a range of categories from simple logical functions, to artificial categories, to natural raw (as opposed to natural pre-processed) concepts in the dissimilar domains of literature, chess and music. This kind of adaptive learning is difficult for most other psychological models, e.g., with cognitive models stopping at modelling artificial categories and (non-GPT) models based on deep learning requiring task-specific changes to the architecture. Secondly, we offer novel ways of designing human benchmarks for concept learning experiments and simulations accounting for subjectivity, ways to control for individual human experiences, all while keeping to real-life complex categories. We ground CogAct in simulations of subjective conceptual spaces of individual human participants, capturing humans' subjective judgements in music, with the models learning from raw music score data without bootstrapping to pre-built knowledge structures. The CogAct simulations are compared to those obtained by a deep-learning model. These findings integrate concept learning and adaptation to complexity into the broader theories of cognitive psychology. Our approach may also be used in psychological applications that move away from modelling the average participant and towards capturing subjective concept space.
[AI-84] ASTIF: Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting
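[Quick Read]: This paper addresses financial time series forecasting, where static model architectures struggle to fuse heterogeneous knowledge sources and to adapt to rapid market regime shifts, and where conventional methods rely only on historical price sequences and ignore semantic drivers such as policy uncertainty. The key to its solution is the ASTIF (Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting) framework, which performs dynamic integration through confidence-aware meta-learning. The framework has three complementary modules: a dual-channel small language model based on MirrorPrompt that extracts semantic market cues and numerical trends, a hybrid LSTM-Random Forest model that captures temporal dependencies, and an adaptive inference layer that adjusts each sub-model's contribution according to its real-time uncertainty. Experiments show that this mechanism improves forecasting performance and mitigates risk in non-stationary environments, and that the approach scales to fusing quantitative and qualitative information.
Link: https://arxiv.org/abs/2512.18661
Authors: Hafiz Saif Ur Rehman, Ling Liu, Kaleem Ullah Qasim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 33 Pages, 8 Figures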
Abstract:Financial time series forecasting is fundamentally an information fusion challenge, yet most existing models rely on static architectures that struggle to integrate heterogeneous knowledge sources or adjust to rapid regime shifts. Conventional approaches, relying exclusively on historical price sequences, often neglect the semantic drivers of volatility such as policy uncertainty and market narratives. To address these limitations, we propose the ASTIF (Adaptive Semantic-Temporal Integration for Cryptocurrency Price Forecasting), a hybrid intelligent system that adapts its forecasting strategy in real time through confidence-based meta-learning. The framework integrates three complementary components. A dual-channel Small Language Model using MirrorPrompt extracts semantic market cues alongside numerical trends. A hybrid LSTM Random Forest model captures sequential temporal dependencies. A confidence-aware meta-learner functions as an adaptive inference layer, modulating each predictor’s contribution based on its real-time uncertainty. Experimental evaluation on a diverse dataset of AI-focused cryptocurrencies and major technology stocks from 2020 to 2024 shows that ASTIF outperforms leading deep learning and Transformer baselines (e.g., Informer, TFT). The ablation studies further confirm the critical role of the adaptive meta-learning mechanism, which successfully mitigates risk by shifting reliance between semantic and temporal channels during market turbulence. The research contributes a scalable, knowledge-based solution for fusing quantitative and qualitative data in non-stationary environments.
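One simple way to picture "modulating each predictor's contribution based on its real-time uncertainty" is inverse-uncertainty weighting. The sketch below is an illustrative assumption, not the paper's meta-learner: the function `fuse_forecasts`, the weighting rule, and the numbers are all hypothetical.

```python
def fuse_forecasts(predictions, uncertainties, eps=1e-8):
    """Combine point forecasts with weights proportional to 1 / uncertainty.

    Assumption for illustration only: the abstract does not say how the
    meta-learner maps uncertainty to weights.
    """
    if len(predictions) != len(uncertainties):
        raise ValueError("need one uncertainty per prediction")
    inverse = [1.0 / (u + eps) for u in uncertainties]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    fused = sum(w * p for w, p in zip(weights, predictions))
    return fused, weights


# Hypothetical values: a semantic-channel forecast and a temporal-channel forecast.
fused, weights = fuse_forecasts(predictions=[102.0, 98.0], uncertainties=[1.0, 3.0])
print(round(fused, 3), [round(w, 3) for w in weights])
```

The channel with the smaller uncertainty receives the larger weight, so reliance shifts between channels as their uncertainties change.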
[AI-85] ARC: Leveraging Compositional Representations for Cross-Problem Learning on VRPs
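[Quick Read]: This paper addresses how to learn and generalize efficiently across problem variants of Vehicle Routing Problems (VRPs) with diverse real-world attributes. Traditional methods adapt poorly to changing combinations of constraints (such as time windows and capacity limits), which limits generalization. The key to its solution is the ARC (Attribute Representation via Compositional Learning) framework, which decomposes attribute representations into two complementary components: an Intrinsic Attribute Embedding (IAE) that captures invariant attribute semantics, and a Contextual Interaction Embedding (CIE) that models attribute-combination effects. By enforcing analogical consistency in the embedding space, ARC ensures that the semantic transformation of adding an attribute (e.g., a length constraint) stays the same across problem contexts. This lets the model reuse invariant semantics from trained variants and construct representations for unseen attribute combinations, improving performance on in-distribution, zero-shot generalization, few-shot adaptation, and real-world benchmarks.
Link: https://arxiv.org/abs/2512.18633
Authors: Han-Seul Jeong, Youngjoon Park, Hyungseok Song, Woohyung Lim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 13 figures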
Abstract:Vehicle Routing Problems (VRPs) with diverse real-world attributes have driven recent interest in cross-problem learning approaches that efficiently generalize across problem variants. We propose ARC (Attribute Representation via Compositional Learning), a cross-problem learning framework that learns disentangled attribute representations by decomposing them into two complementary components: an Intrinsic Attribute Embedding (IAE) for invariant attribute semantics and a Contextual Interaction Embedding (CIE) for attribute-combination effects. This disentanglement is achieved by enforcing analogical consistency in the embedding space to ensure the semantic transformation of adding an attribute (e.g., a length constraint) remains invariant across different problem contexts. This enables our model to reuse invariant semantics across trained variants and construct representations for unseen combinations. ARC achieves state-of-the-art performance across in-distribution, zero-shot generalization, few-shot adaptation, and real-world benchmarks.
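The analogical-consistency idea can be read as: the embedding offset caused by adding an attribute should be the same in every problem context. The sketch below expresses that as a squared-distance penalty. It is an assumed form for illustration; the function name and the toy 3-dimensional embeddings are hypothetical, and the paper's actual loss may differ.

```python
def analogical_consistency_loss(base_a, with_attr_a, base_b, with_attr_b):
    """Squared distance between the attribute offsets measured in two contexts.

    The loss is zero when adding the attribute shifts both base embeddings
    by exactly the same vector.
    """
    offset_a = [w - b for w, b in zip(with_attr_a, base_a)]
    offset_b = [w - b for w, b in zip(with_attr_b, base_b)]
    return sum((x - y) ** 2 for x, y in zip(offset_a, offset_b))


# Hypothetical embeddings of two problem variants, each with and without an attribute.
variant_a, variant_a_plus = [0.0, 1.0, 0.0], [0.5, 1.0, 0.2]
variant_b, variant_b_plus = [1.0, 0.0, 0.0], [1.5, 0.0, 0.2]

consistent = analogical_consistency_loss(variant_a, variant_a_plus, variant_b, variant_b_plus)
inconsistent = analogical_consistency_loss(variant_a, variant_a_plus, variant_b, [1.5, 0.3, 0.2])
print(consistent, inconsistent)
```

Under this reading, an unseen combination can be represented by adding the learned offset to a base embedding that was seen in training.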
[AI-86] ChronoDreamer: Action-Conditioned World Model as an Online Simulator for Robotic Planning
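[Quick Read]: This paper addresses the challenge of predicting environment dynamics and selecting safe actions in contact-rich robotic manipulation, especially when rigid and deformable objects interact. The key to its solution is ChronoDreamer, an action-conditioned world model. It uses a spatial-temporal Transformer to jointly model multimodal inputs (RGB images, contact maps, joint states) and is trained with MaskGIT-style masked prediction to generate future video frames, contact distributions, and joint angles. Contact is encoded as depth-weighted Gaussian splat images that project 3D forces into the camera frame so that vision backbones can consume them. At inference, a Vision-Language Model (VLM) evaluates the safety of predicted rollouts, and rejection sampling based on collision likelihood filters out potentially dangerous actions before execution, which is intended to improve manipulation safety and physical plausibility.
Link: https://arxiv.org/abs/2512.18619
Authors: Zhenhao Zhou, Dan Negrut
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: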
Abstract:We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatial-temporal transformer trained with MaskGIT-style masked prediction. Contact is encoded as depth-weighted Gaussian splat images that render 3D forces into a camera-aligned format suitable for vision backbones. At inference, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios. Qualitative results demonstrate that the model preserves spatial coherence during non-contact motion and generates plausible contact predictions, while the LLM-based judge distinguishes collision from non-collision trajectories.
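The inference-time filtering step can be sketched as a plain rejection loop: propose an action, predict its rollout, score the rollout for collision, and keep the first action that passes. Everything below is a stand-in for illustration. `toy_rollout` replaces the world model, `toy_judge` replaces the vision-language judge, and the threshold is an arbitrary assumption.

```python
import random


def select_safe_action(candidates, predict_rollout, collision_score, threshold=0.5):
    """Return the first candidate whose predicted rollout is judged safe, else None."""
    for action in candidates:
        rollout = predict_rollout(action)          # stand-in for the world model
        if collision_score(rollout) < threshold:   # stand-in for the VLM judge
            return action
    return None


def toy_rollout(action):
    """Toy dynamics: larger |action| leaves a smaller gap to an obstacle."""
    return {"final_gap": 1.0 - abs(action)}


def toy_judge(rollout):
    """Toy judge: report a collision when the predicted gap is below 0.2."""
    return 1.0 if rollout["final_gap"] < 0.2 else 0.0


rng = random.Random(0)
candidates = [rng.uniform(-1.0, 1.0) for _ in range(16)]
chosen = select_safe_action(candidates, toy_rollout, toy_judge)
print(chosen)
```

Returning `None` when every candidate is rejected leaves the fallback policy to the caller, which is a design choice of this sketch rather than something the abstract specifies.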
[AI-87] Assignment-Routing Optimization: Solvers for Problems Under Constraints
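[Quick Read]: This paper addresses the Joint Routing-Assignment (JRA) problem: items must be mapped one-to-one to placeholders while a Hamiltonian cycle that visits every node exactly once is determined. The problem arises in robotic packaging planning, motion planning, and complex logistics. The key to its solution is a Mixed Integer Programming (MIP) solver tailored to practical packaging scenarios, which combines the Gurobi solver with cutting-plane subtour elimination to handle richer constraints such as multiple placeholder options, time-frame restrictions, and multi-class item packaging. Experiments show that the MIP approach reaches global optima with stable, low computation times, runs up to an order of magnitude faster than the shaking-based exact solver, and returns optimal distances, whereas simple greedy heuristics deviate from them by 14% on average.
Link: https://arxiv.org/abs/2512.18618
Authors: Yuan Qilong, Michal Pavelka
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure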
Abstract:We study the Joint Routing-Assignment (JRA) problem in which items must be assigned one-to-one to placeholders while simultaneously determining a Hamiltonian cycle visiting all nodes exactly once. Extending previous exact MIP solvers with Gurobi and cutting-plane subtour elimination, we develop a solver tailored for practical packaging-planning scenarios with richer constraints. These include multiple placeholder options, time-frame restrictions, and multi-class item packaging. Experiments on 46 mobile manipulation datasets demonstrate that the proposed MIP approach achieves global optima with stable and low computation times, significantly outperforming the shaking-based exact solver by up to an order of magnitude. Compared to greedy baselines, the MIP solutions achieve consistent optimal distances with an average deviation of 14% for simple heuristics, confirming both efficiency and solution quality. The results highlight the practical applicability of MIP-based JRA optimization for robotic packaging, motion planning, and complex logistics.
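To make the problem statement concrete, the sketch below solves a tiny instance by exhaustive search instead of MIP. It assumes, for illustration only, that the tour alternates between an item and the placeholder assigned to it, and that cost is Euclidean distance; neither detail is stated in the abstract, and all names and coordinates are hypothetical. Brute force is only feasible for a handful of items, which is why an exact MIP solver is used in practice.

```python
from itertools import permutations
from math import dist


def tour_cost(tour):
    """Length of the closed tour that returns to its first node."""
    return sum(dist(tour[k], tour[(k + 1) % len(tour)]) for k in range(len(tour)))


def solve_tiny_jra(items, placeholders):
    """Exhaustively search item orders and one-to-one placeholder assignments.

    Assumed tour shape: item, its placeholder, next item, its placeholder, ...
    """
    n = len(items)
    best_cost, best_tour = float("inf"), None
    for order in permutations(range(n)):
        if order[0] != 0:
            continue  # a cycle has no distinguished start, so fix item 0 first
        for assign in permutations(range(n)):
            tour = []
            for position, item_idx in enumerate(order):
                tour.append(items[item_idx])
                tour.append(placeholders[assign[position]])
            cost = tour_cost(tour)
            if cost < best_cost:
                best_cost, best_tour = cost, tour
    return best_cost, best_tour


# Hypothetical coordinates for three items and three placeholders.
items = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
placeholders = [(0.0, 1.0), (4.0, 1.0), (2.0, 4.0)]
best_cost, best_tour = solve_tiny_jra(items, placeholders)
print(round(best_cost, 3), best_tour)
```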
[AI-88] DASH: Deception-Augmented Shared Mental Model for a Human-Machine Teaming System
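[Quick Read]: This paper addresses the loss of mission resilience in human-machine teaming caused by insider threats (such as compromised Unmanned Ground Vehicles (UGVs), AI agents, or human analysts). Existing Shared Mental Model (SMM) approaches usually neglect such risks and struggle to identify and respond to threats before they degrade team performance. The key to its solution is the DASH framework, which embeds proactive deception into the SMM and introduces "bait tasks" to detect insider threats early. Once a threat is detected, tailored recovery mechanisms are triggered, including UGV system reinstallation, AI model retraining, or human analyst replacement. DASH sustains about 80% mission success under high attack rates, eight times higher than the baseline.
Link: https://arxiv.org/abs/2512.18616
Authors: Zelin Wan, Han Jun Yoon, Nithin Alluru, Terrence J. Moore, Frederica F. Nelson, Seunghyun Yoon, Hyuk Lim, Dan Dongseong Kim, Jin-Hee Cho
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: 17 pages, 16 figures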
Abstract:We present DASH (Deception-Augmented Shared mental model for Human-machine teaming), a novel framework that enhances mission resilience by embedding proactive deception into Shared Mental Models (SMM). Designed for mission-critical applications such as surveillance and rescue, DASH introduces “bait tasks” to detect insider threats, e.g., compromised Unmanned Ground Vehicles (UGVs), AI agents, or human analysts, before they degrade team performance. Upon detection, tailored recovery mechanisms are activated, including UGV system reinstallation, AI model retraining, or human analyst replacement. In contrast to existing SMM approaches that neglect insider risks, DASH improves both coordination and security. Empirical evaluations across four schemes (DASH, SMM-only, no-SMM, and baseline) show that DASH sustains approximately 80% mission success under high attack rates, eight times higher than the baseline. This work contributes a practical human-AI teaming framework grounded in shared mental models, a deception-based strategy for insider threat detection, and empirical evidence of enhanced robustness under adversarial conditions. DASH establishes a foundation for secure, adaptive human-machine teaming in contested environments.
[AI-89] The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation
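[Quick Read]: This paper studies which cooperative structures Deep Neural Networks (DNNs) can represent: how DNNs encode interactions between variables under different levels of contextual complexity, and how these microscopic interaction patterns shape macroscopic representation capacity. The key to its solution is to use multi-order interactions as the measurement tool, stratifying interactions by the amount of contextual information they require. The authors empirically find a universal "interaction bottleneck": DNNs easily learn low-order and high-order interactions but consistently under-represent mid-order ones. They prove that the bottleneck arises because mid-order interactions have the highest contextual variability, which yields large gradient variance and makes them hard to learn. They then introduce losses that steer a model toward interactions of selected orders, giving controllable modulation of representational behavior: low-order-emphasized models generalize better and are more robust, while high-order-emphasized models show stronger structural modeling and fitting capability.
Link: https://arxiv.org/abs/2512.18607
Authors: Huiqi Deng, Qihan Ren, Zhuofan Chen, Zhenyuan Cui, Wen Shen, Peng Zhang, Hongbin Pei, Quanshi Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: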
Abstract:Understanding what kinds of cooperative structures deep neural networks (DNNs) can represent remains a fundamental yet insufficiently understood problem. In this work, we treat interactions as the fundamental units of such structure and investigate a largely unexplored question: how DNNs encode interactions under different levels of contextual complexity, and how these microscopic interaction patterns shape macroscopic representation capacity. To quantify this complexity, we use multi-order interactions [57], where each order reflects the amount of contextual information required to evaluate the joint interaction utility of a variable pair. This formulation enables a stratified analysis of cooperative patterns learned by DNNs. Building on this formulation, we develop a comprehensive study of interaction structure in DNNs. (i) We empirically discover a universal interaction bottleneck: across architectures and tasks, DNNs easily learn low-order and high-order interactions but consistently under-represent mid-order ones. (ii) We theoretically explain this bottleneck by proving that mid-order interactions incur the highest contextual variability, yielding large gradient variance and making them intrinsically difficult to learn. (iii) We further modulate the bottleneck by introducing losses that steer models toward emphasizing interactions of selected orders. Finally, we connect microscopic interaction structures with macroscopic representational behavior: low-order-emphasized models exhibit stronger generalization and robustness, whereas high-order-emphasized models demonstrate greater structural modeling and fitting capability. Together, these results uncover an inherent representational bias in modern DNNs and establish interaction order as a powerful lens for interpreting and guiding deep representations.
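A common way to define the order-m interaction of a variable pair (i, j) is to average the marginal interaction f(S ∪ {i, j}) − f(S ∪ {i}) − f(S ∪ {j}) + f(S) over all contexts S of size m drawn from the remaining variables. The sketch below computes that quantity exactly on toy set functions. It assumes this definition matches the one the abstract cites; the paper's normalization may differ, and `f_and` and `f_pair` are hypothetical test functions.

```python
from itertools import combinations


def multi_order_interaction(f, n, i, j, m):
    """Average interaction of variables i and j over all contexts of size m.

    f maps a frozenset of active variable indices to a scalar output.
    Valid orders are m = 0, ..., n - 2.
    """
    others = [k for k in range(n) if k not in (i, j)]
    contexts = list(combinations(others, m))
    total = 0.0
    for context in contexts:
        s = frozenset(context)
        total += f(s | {i, j}) - f(s | {i}) - f(s | {j}) + f(s)
    return total / len(contexts)


N = 4


def f_and(s):
    """Fires only when all N variables are present: a purely high-order effect."""
    return 1.0 if len(s) == N else 0.0


def f_pair(s):
    """Fires whenever variables 0 and 1 are both present, whatever the context."""
    return 1.0 if {0, 1} <= s else 0.0


and_profile = [multi_order_interaction(f_and, N, 0, 1, m) for m in range(N - 1)]
pair_profile = [multi_order_interaction(f_pair, N, 0, 1, m) for m in range(N - 1)]
print(and_profile, pair_profile)
```

For real networks the average over contexts is usually estimated by sampling, since the number of contexts grows combinatorially with the input size.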
[AI-90] Reflective Confidence: Correcting Reasoning Flaws via Online Self-Correction
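[Quick Read]: This paper addresses the high computational overhead that Large Language Models (LLMs) incur on complex reasoning tasks when they use ensemble methods such as self-consistency. Existing methods such as DeepConf cut cost by stopping low-confidence reasoning paths early, but this discards unfinished paths and wastes part of the computation. The paper proposes a framework called reflective confidence, whose key idea is to turn the low-confidence signal from a termination indicator into a reflection trigger: when confidence falls below a threshold, the model does not stop but produces a reflection prompt that analyzes the current reasoning state, identifies potential errors, and continues along a corrected trajectory. On mathematical reasoning benchmarks such as AIME 2025, the method improves accuracy over advanced early-stopping baselines at comparable computational cost, which supports proactive self-correction over passive discarding.
Link: https://arxiv.org/abs/2512.18605
Authors: Qinglin Zeng, Jing Yang, Keze Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under submission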
Abstract:Large language models (LLMs) have achieved strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency. However, ensemble-based approaches, especially self-consistency which relies on multiple reasoning trajectories, often incur substantial computational overhead. To improve efficiency, prior work has leveraged internal confidence signals, where early stopping strategies such as DeepConf reduce cost by terminating low-confidence trajectories. However, this strategy discards incomplete reasoning paths and wastes partial computation. We propose reflective confidence, a novel reasoning framework that transforms low-confidence signals from termination indicators into reflection triggers. When confidence falls below a threshold, instead of stopping generation, the model produces a reflection prompt to analyze the current reasoning state, identify potential errors, and continue generation along a corrected trajectory. Experiments on mathematical reasoning benchmarks, including AIME 2025, demonstrate significant accuracy improvements over advanced early-stopping baselines at comparable computational cost, validating the effectiveness of proactive self-correction over passive discarding.
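The control flow described here (reflect instead of terminate) can be sketched as a small generation loop. This is an illustration under assumptions: `step_fn` is a hypothetical callback that returns the next reasoning step with a confidence score, `scripted_step` is a stub standing in for a real model, and the cap on reflections is an added safeguard that the abstract does not mention.

```python
REFLECT = "[reflect] Check the previous steps for errors, then continue."


def generate_with_reflection(step_fn, threshold=0.5, max_steps=50, max_reflections=2):
    """Generate step by step; on low confidence, insert a reflection prompt and go on."""
    trace, reflections = [], 0
    for _ in range(max_steps):
        text, confidence, done = step_fn(trace)
        trace.append(text)  # partial work is kept, not discarded
        if confidence < threshold and reflections < max_reflections:
            trace.append(REFLECT)
            reflections += 1
            continue
        if done:
            break
    return trace, reflections


def scripted_step(trace):
    """Stub model: makes one low-confidence mistake, then recovers after reflecting."""
    if REFLECT in trace:
        return "So x = 4.", 0.9, True
    if not trace:
        return "We need x with x + 3 = 7.", 0.9, False
    return "So x = 5.", 0.2, False


trace, reflections = generate_with_reflection(scripted_step)
print(reflections, trace)
```

An early-stopping baseline would return at the low-confidence step; here the same signal costs one extra prompt and the trajectory continues.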
[AI-91] Modality-Dependent Memory Mechanisms in Cross-Modal Neuromorphic Computing
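[Quick Read]: This paper addresses the unexplored question of how memory-augmented Spiking Neural Networks (SNNs) generalize across sensory modalities (vision and audition). Prior work has not systematically compared memory mechanisms in multimodal settings, leaving modality-specific optimization without empirical grounding. The key to its solution is the first comprehensive cross-modal ablation study, which compares three memory mechanisms, Hopfield networks, Hierarchical Gated Recurrent Networks (HGRNs), and Supervised Contrastive Learning (SCL), on the N-MNIST (visual) and SHD (auditory) datasets, together with a quantitative engram analysis that confirms weak alignment of cross-modal memory representations. The results show that the benefit of a memory mechanism depends strongly on modality, that joint multi-modal training matches the performance of parallel architectures while allowing unified deployment, and that the approach reaches 603x the energy efficiency of traditional neural networks.
Link: https://arxiv.org/abs/2512.18575
Authors: Effiong Blessing, Chiung-Yi Tseng, Somshubhra Roy, Junaid Rehman, Isaac Nkrumah
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: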
Abstract:Memory-augmented spiking neural networks (SNNs) promise energy-efficient neuromorphic computing, yet their generalization across sensory modalities remains unexplored. We present the first comprehensive cross-modal ablation study of memory mechanisms in SNNs, evaluating Hopfield networks, Hierarchical Gated Recurrent Networks (HGRNs), and supervised contrastive learning (SCL) across visual (N-MNIST) and auditory (SHD) neuromorphic datasets. Our systematic evaluation of five architectures reveals striking modality-dependent performance patterns: Hopfield networks achieve 97.68% accuracy on visual tasks but only 76.15% on auditory tasks (21.53 point gap), revealing severe modality-specific specialization, while SCL demonstrates more balanced cross-modal performance (96.72% visual, 82.16% audio, 14.56 point gap). These findings establish that memory mechanisms exhibit task-specific benefits rather than universal applicability. Joint multi-modal training with HGRN achieves 94.41% visual and 79.37% audio accuracy (88.78% average), matching parallel HGRN performance through unified deployment. Quantitative engram analysis confirms weak cross-modal alignment (0.038 similarity), validating our parallel architecture design. Our work provides the first empirical evidence for modality-specific memory optimization in neuromorphic systems, achieving 603x energy efficiency over traditional neural networks.
[AI-92] AI Code in the Wild: Measuring Security Risks and Ecosystem Shifts of AI-Generated Code in Modern Software
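[Quick Read]: This paper addresses the lack of systematic knowledge about how prevalent AI-generated code is in real software development and how it affects software security. Its core solution is a high-precision detection pipeline and a representative benchmark that distinguish AI-generated code (AIGCode) from human-written code, applied to development commits from the top 1,000 GitHub repositories (2022-2025) and to 7,000+ CVE-linked code changes. This makes it possible to label commits, files, and functions along a human/AI axis and to trace how AIGCode moves through project evolution and vulnerability life cycles. The study reports three patterns in the real ecosystem. First, structured adoption: AI concentrates in glue code, tests, refactoring, documentation, and other boilerplate rather than core logic. Second, security consequences: some CWE families are overrepresented in AI-tagged code, and templated vulnerabilities recur across projects. Third, human-AI collaboration dynamics: AI introduces high-throughput changes while humans act as security gatekeepers, and under shallow review defects persist longer and spread further.
Link: https://arxiv.org/abs/2512.18567
Authors: Bin Wang, Wenjie Yu, Yilu Zhong, Hao Yu, Keke Lian, Chaohua Lu, Hongfang Zheng, Dong Zhang, Hui Li
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: this https URL this https URL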
Abstract:Large language models (LLMs) for code generation are becoming integral to modern software development, but their real-world prevalence and security impact remain poorly understood. We present the first large-scale empirical study of AI-generated code (AIGCode) in the wild. We build a high-precision detection pipeline and a representative benchmark to distinguish AIGCode from human-written code, and apply them to (i) development commits from the top 1,000 GitHub repositories (2022-2025) and (ii) 7,000+ recent CVE-linked code changes. This lets us label commits, files, and functions along a human/AI axis and trace how AIGCode moves through projects and vulnerability life cycles. Our measurements show three ecological patterns. First, AIGCode is already a substantial fraction of new code, but adoption is structured: AI concentrates in glue code, tests, refactoring, documentation, and other boilerplate, while core logic and security-critical configurations remain mostly human-written. Second, adoption has security consequences: some CWE families are overrepresented in AI-tagged code, and near-identical insecure templates recur across unrelated projects, suggesting “AI-induced vulnerabilities” propagated by shared models rather than shared maintainers. Third, in human-AI edit chains, AI introduces high-throughput changes while humans act as security gatekeepers; when review is shallow, AI-introduced defects persist longer, remain exposed on network-accessible surfaces, and spread to more files and repositories. We will open-source the complete dataset and release analysis artifacts and fine-grained documentation of our methodology and findings. 
Submission history: [v1] Sun, 21 Dec 2025 02:26:29 UTC (799 KB), submitted by Bin Wang
[AI-93] Vox Deorum: A Hybrid LLM Architecture for 4X / Grand Strategy Game AI – Lessons from Civilization V
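[Quick Read]: This paper addresses the challenges of deploying Large Language Models (LLMs) in complex, long-horizon 4X strategy games, including high latency, high cost, and how to integrate LLMs so that they enable natural human-AI interaction (such as collaboration and negotiation) while preserving strategic depth and playability. The key to its solution is Vox Deorum, a layered hybrid architecture that uses LLMs for macro-strategic reasoning and delegates tactical execution to subsystems (such as algorithmic AI, or possibly reinforcement learning AI in the future). The design aims to keep gameplay competitive while limiting resource consumption, and it produces play styles that differ substantially from those of algorithmic AI.
Link: https://arxiv.org/abs/2512.18564
Authors: John Chen, Sihan Cheng, Can Gurkan, Ryan Lay, Moez Salahuddin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review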
Abstract:Large Language Models’ capacity to reason in natural language makes them uniquely promising for 4X and grand strategy games, enabling more natural human-AI gameplay interactions such as collaboration and negotiation. However, these games present unique challenges due to their complexity and long-horizon nature, while latency and cost factors may hinder LLMs’ real-world deployment. Working on a classic 4X strategy game, Sid Meier’s Civilization V with the Vox Populi mod, we introduce Vox Deorum, a hybrid LLM+X architecture. Our layered technical design empowers LLMs to handle macro-strategic reasoning, delegating tactical execution to subsystems (e.g., algorithmic AI or reinforcement learning AI in the future). We validate our approach through 2,327 complete games, comparing two open-source LLMs with a simple prompt against Vox Populi’s enhanced AI. Results show that LLMs achieve competitive end-to-end gameplay while exhibiting play styles that diverge substantially from algorithmic AI and from each other. Our work establishes a viable architecture for integrating LLMs in commercial 4X games, opening new opportunities for game design and agentic AI research.
[AI-94] Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
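[Quick Read]: This paper addresses harmful emergent norms that arise from evolving individual behavior in large-scale networked Multi-Agent Systems (MAS) and that conventional governance mechanisms struggle to detect and correct. Its core solution is an adaptive accountability framework that (i) continuously traces responsibility flows through a lifecycle-aware audit ledger, (ii) detects harmful behavior online via decentralized sequential hypothesis tests, and (iii) applies local policy and reward-shaping interventions that realign agents with system-level objectives in near real time. The authors prove that whenever the expected intervention cost exceeds the adversary's payoff, the long-run proportion of compromised interactions is bounded by a constant strictly less than one, which supports ethically aligned, self-regulating behavior without sacrificing performance or scalability.
Link: https://arxiv.org/abs/2512.18561
Authors: Saad Alqithami
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: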
Abstract:Large-scale networked multi-agent systems increasingly underpin critical infrastructure, yet their collective behavior can drift toward undesirable emergent norms that elude conventional governance mechanisms. We introduce an adaptive accountability framework that (i) continuously traces responsibility flows through a lifecycle-aware audit ledger, (ii) detects harmful emergent norms online via decentralized sequential hypothesis tests, and (iii) deploys local policy and reward-shaping interventions that realign agents with system-level objectives in near real time. We prove a bounded-compromise theorem showing that whenever the expected intervention cost exceeds an adversary’s payoff, the long-run proportion of compromised interactions is bounded by a constant strictly less than one. Extensive high-performance simulations with up to 100 heterogeneous agents, partial observability, and stochastic communication graphs show that our framework prevents collusion and resource hoarding in at least 90% of configurations, boosts average collective reward by 12-18%, and lowers the Gini inequality index by up to 33% relative to a PPO baseline. These results demonstrate that a theoretically principled accountability layer can induce ethically aligned, self-regulating behavior in complex MAS without sacrificing performance or scalability.
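Online detection with sequential hypothesis tests can be illustrated with Wald's sequential probability ratio test (SPRT) on binary observations, where each observation marks whether a monitored interaction violated a norm. This is a single-monitor sketch under assumptions: the abstract does not say which sequential test is used or how the decentralized version aggregates evidence, and the rates `p0`, `p1` and the error levels are hypothetical.

```python
from math import log


def sprt(observations, p0=0.1, p1=0.4, alpha=0.05, beta=0.05):
    """Wald's SPRT for a Bernoulli violation rate.

    H0: rate = p0 (benign), H1: rate = p1 (harmful norm).
    Returns (decision, number of observations consumed).
    """
    upper = log((1 - beta) / alpha)   # accept H1 at or above this log-likelihood ratio
    lower = log(beta / (1 - alpha))   # accept H0 at or below this log-likelihood ratio
    llr = 0.0
    for t, violated in enumerate(observations, start=1):
        llr += log(p1 / p0) if violated else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "harmful", t
        if llr <= lower:
            return "benign", t
    return "undecided", len(observations)


print(sprt([1, 1, 1, 1]))
print(sprt([0] * 10))
print(sprt([1, 0]))
```

The test stops as soon as the evidence is strong enough in either direction, which is what makes it suitable for near-real-time monitoring.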
[AI-95] A Formal Descriptive Language for Learning Dynamics: A Five-Layer Structural Coordinate System
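[Quick Read]: This paper addresses the fact that existing learning research often treats factors such as cognitive load, internal state change, and subjective evaluation in isolation, which makes it hard to describe learning dynamics within a unified, structurally explicit framework. The key to its solution is a multi-layer formal descriptive framework that introduces state variables, mappings, and layer-specific responsibilities to build a symbolic language for describing learning processes consistently, without committing to specific functional forms or optimization objectives. The framework explicitly separates responsibilities across layers (load generation, internal understanding transformation, observation, and evaluation). It treats cognitive load as a relational quantity arising from the interaction between external input and internal organization, and models subjective evaluation as a minimal regulatory interface that responds to learning dynamics and environmental conditions. This provides an extensible, structurally clear basis for analyzing human learning and adaptive learning systems.
Link: https://arxiv.org/abs/2512.18525
Authors: Miyuki T. Nakata
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 13 pages, 1 figure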
Abstract:Understanding learning as a dynamic process is challenging due to the interaction of multiple factors, including cognitive load, internal state change, and subjective evaluation. Existing approaches often address these elements in isolation, limiting the ability to describe learning phenomena within a unified and structurally explicit framework. This paper proposes a multi-layer formal descriptive framework for learning dynamics. Rather than offering a predictive or prescriptive model, the framework introduces a symbolic language composed of state variables, mappings, and layer-specific responsibilities, enabling consistent description of learning processes without commitment to specific functional forms or optimization objectives. This descriptive framework is intended to serve as a structural substrate for analyzing learning processes in human learners, and by extension, in adaptive and AI-assisted learning systems. A central design principle is the explicit separation of descriptive responsibilities across layers, distinguishing load generation, internal understanding transformation, observation, and evaluation. Within this structure, cognitive load is treated as a relational quantity arising from interactions between external input and internal organization, while subjective evaluation is modeled as a minimal regulatory interface responding to learning dynamics and environmental conditions. By emphasizing descriptive clarity and extensibility, the framework provides a common language for organizing existing theories and supporting future empirical and theoretical work.
[AI-96] Prediction and Forecast of Short-Term Drought Impacts Using Machine Learning to Support Mitigation and Adaptation Efforts
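[Quick Read]: This paper addresses a limitation of traditional drought monitoring, which tracks drought conditions themselves but not their actual impacts on ecological and human systems, and therefore gives weak support for early warning and proactive decision-making. Its core solution uses machine learning (specifically an eXtreme Gradient Boosting (XGBoost) model) to link drought indices (the Drought Severity and Coverage Index, DSCI, and the Evaporative Stress Index, ESI) with historical drought impact records from the Drought Impact Reporter (DIR), producing short-term forecasts of drought impacts. By integrating multiple data sources and addressing challenges of temporal scale and impact quantification, the approach forecasts specific impact categories, such as agriculture, fire, and relief, at actionable lead times of up to eight weeks, and offers a technical path toward a regional Ecological Drought Information Communication System (EcoDri).
Link: https://arxiv.org/abs/2512.18522
Authors: Hatim M. E. Geli, Islam Omar, Mona Y. Elshinawy, David W. DuBios, Lara Prehodko, Kelly H Smith, Abdel-Hameed A. Badawy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages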
Abstract:Drought is a complex natural hazard that affects ecological and human systems, often resulting in substantial environmental and economic losses. Recent increases in drought severity, frequency, and duration underscore the need for effective monitoring and mitigation strategies. Predicting drought impacts rather than drought conditions alone offers opportunities to support early warning systems and proactive decision-making. This study applies machine learning techniques to link drought indices with historical drought impact records (2005-2024) to generate short-term impact forecasts. By addressing key conceptual and data-driven challenges regarding temporal scale and impact quantification, the study aims to improve the predictability of drought impacts at actionable lead times. The Drought Severity and Coverage Index (DSCI) and the Evaporative Stress Index (ESI) were combined with impact data from the Drought Impact Reporter (DIR) to model and forecast weekly drought impacts. Results indicate that Fire and Relief impacts were predicted with the highest accuracy, followed by Agriculture and Water, while forecasts for Plants and Society impacts showed greater variability. County and state level forecasts for New Mexico were produced using an eXtreme Gradient Boosting (XGBoost) model that incorporated both DSCI and ESI. The model successfully generated forecasts up to eight weeks in advance using the preceding eight weeks of data for most impact categories. This work supports the development of an Ecological Drought Information Communication System (EcoDri) for New Mexico and demonstrates the potential for broader application in similar drought-prone regions. The findings can aid stakeholders, land managers, and decision-makers in developing and implementing more effective drought mitigation and adaptation strategies.
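Forecasting eight weeks ahead from the preceding eight weeks of data amounts to a sliding-window supervised setup. The sketch below builds such (window, target) pairs from one weekly series. It is an assumed data layout for illustration: the function name is hypothetical, a single series stands in for the DSCI, ESI, and impact features, and the resulting pairs would then be passed to a regressor such as a gradient-boosted tree model.

```python
def make_lagged_samples(series, n_lags=8, horizon=8):
    """Pair each window of n_lags weekly values with the value observed
    `horizon` weeks after the last week in the window."""
    samples = []
    for end in range(n_lags, len(series) - horizon + 1):
        window = series[end - n_lags:end]      # weeks end-n_lags .. end-1
        target = series[end + horizon - 1]     # horizon weeks after week end-1
        samples.append((window, target))
    return samples


# Hypothetical weekly index values: week t simply has the value t.
weekly_index = list(range(20))
samples = make_lagged_samples(weekly_index)
print(len(samples), samples[0])
```

Keeping the target strictly after the window avoids leaking future information into the features, which matters when the forecast is evaluated at a fixed lead time.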
[AI-97] Enhancing Decision-Making in Windows PE Malware Classification During Dataset Shifts with Uncertainty Estimation
[Quick Read]: This paper addresses the drop in reliability that AI-based classifiers for Windows Portable Executable (PE) malware suffer under dataset shift. The key to its solution is to use ensemble-based uncertainty estimates from neural network ensembles as Non-Conformity Measures within the Inductive Conformal Evaluation (ICE) framework, combined with a novel threshold optimisation method, which lowers the incorrect acceptance rate (IA%) under extreme distribution shift while keeping a competitive correct acceptance rate (CA%). On the UCSB dataset, which consists mainly of packed malware, the method reduces IA% from 22.8% for the state-of-the-art probability-based ICE to 16%, a relative reduction of about 30%.
Link: https://arxiv.org/abs/2512.18495
Authors: Rahul Yumlembam, Biju Issac, Seibu Mary Jacob
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 20 pages
Abstract:Artificial intelligence techniques have achieved strong performance in classifying Windows Portable Executable (PE) malware, but their reliability often degrades under dataset shifts, leading to misclassifications with severe security consequences. To address this, we enhance an existing LightGBM (LGBM) malware detector by integrating Neural Networks (NN), PriorNet, and Neural Network Ensembles, evaluated across three benchmark datasets: EMBER, BODMAS, and UCSB. The UCSB dataset, composed mainly of packed malware, introduces a substantial distributional shift relative to EMBER and BODMAS, making it a challenging testbed for robustness. We study uncertainty-aware decision strategies, including probability thresholding, PriorNet, ensemble-derived estimates, and Inductive Conformal Evaluation (ICE). Our main contribution is the use of ensemble-based uncertainty estimates as Non-Conformity Measures within ICE, combined with a novel threshold optimisation method. On the UCSB dataset, where the shift is most severe, the state-of-the-art probability-based ICE (SOTA) yields an incorrect acceptance rate (IA%) of 22.8%. In contrast, our method reduces this to 16%, a relative reduction of about 30%, while maintaining competitive correct acceptance rates (CA%). These results demonstrate that integrating ensemble-based uncertainty with conformal prediction provides a more reliable safeguard against misclassifications under extreme dataset shifts, particularly in the presence of packed malware, thereby offering practical benefits for real-world security operations.
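The combination of ensemble uncertainty and conformal evaluation can be sketched as follows: score each sample by how much the ensemble members disagree, compare that score with scores from a held-out calibration set, and reject predictions whose conformal p-value is too low. This is a minimal illustration under assumptions. Using the variance of member probabilities as the non-conformity measure, the standard (count + 1) / (n + 1) p-value, and the fixed significance level are all choices of this sketch; the paper's measure and its threshold optimisation are not reproduced here.

```python
from statistics import pvariance


def ensemble_nonconformity(member_probs):
    """Disagreement among ensemble members, used as the non-conformity score."""
    return pvariance(member_probs)


def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration samples at least as non-conforming as the test sample."""
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def accept_prediction(calibration_scores, member_probs, significance=0.1):
    """Accept the ensemble's prediction only if its p-value exceeds the significance level."""
    p = conformal_p_value(calibration_scores, ensemble_nonconformity(member_probs))
    return p > significance, p


# Hypothetical calibration scores and two test samples.
calibration = [0.001, 0.002, 0.004, 0.01, 0.02, 0.0005, 0.003, 0.008, 0.0015]
agreeing = accept_prediction(calibration, [0.90, 0.92, 0.88, 0.91])
disagreeing = accept_prediction(calibration, [0.10, 0.90, 0.50, 0.70])
print(agreeing, disagreeing)
```

Rejected samples would be routed to a fallback such as manual analysis, so a lower incorrect acceptance rate is traded against how many predictions are accepted at all.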
[AI-98] Large Language Models as Discounted Bayesian Filters
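[Quick Read]: This paper addresses the online reasoning ability of Large Language Models (LLMs) in dynamic and stochastic environments, in particular how interpretable and effective their belief updates are. Prior studies focus mainly on static tasks and overlook the continuous belief updating that LLMs need when acting as world models or agents. The key to its solution is a Bayesian filtering framework and a probabilistic probe suite that evaluates online inference on multivariate discrete distributions (such as dice rolls) and continuous distributions (such as Gaussian processes) whose ground-truth parameters shift over time. The study finds that LLM belief updates resemble Bayesian posteriors but are more accurately described by an exponential forgetting filter with a model-specific discount factor smaller than one, revealing systematic discounting of older evidence that varies across architectures. Although priors are often miscalibrated, the updating mechanism itself remains structured and principled. The findings are further validated in a simulated agent task, and the authors propose low-cost prompting strategies that recalibrate priors.
Link: https://arxiv.org/abs/2512.18489
Authors: Jensen Zhang, Jing Yang, Keze Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under submission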
Abstract:Large Language Models (LLMs) demonstrate strong few-shot generalization through in-context learning, yet their reasoning in dynamic and stochastic environments remains opaque. Prior studies mainly focus on static tasks and overlook the online adaptation required when beliefs must be continuously updated, which is a key capability for LLMs acting as world models or agents. We introduce a Bayesian filtering framework to evaluate online inference in LLMs. Our probabilistic probe suite spans both multivariate discrete distributions, such as dice rolls, and continuous distributions, such as Gaussian processes, where ground-truth parameters shift over time. We find that while LLM belief updates resemble Bayesian posteriors, they are more accurately characterized by an exponential forgetting filter with a model-specific discount factor smaller than one. This reveals systematic discounting of older evidence that varies significantly across model architectures. Although inherent priors are often miscalibrated, the updating mechanism itself remains structured and principled. We further validate these findings in a simulated agent task and propose prompting strategies that effectively recalibrate priors with minimal computational cost.
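An exponential forgetting filter for a categorical outcome such as a dice roll can be written as a Dirichlet-style count update in which all existing pseudo-counts are multiplied by a discount factor before each new observation is added. The sketch below is one standard form of such a filter, offered as an assumption about what the abstract describes; the function name, the prior value, and the toy data are hypothetical. With a discount of 1 it reduces to the ordinary Bayesian update.

```python
def discounted_dirichlet_filter(observations, n_outcomes, gamma=0.9, prior=1.0):
    """Track outcome probabilities with exponentially discounted pseudo-counts.

    gamma = 1.0 gives the exact Dirichlet-categorical posterior mean;
    gamma < 1.0 down-weights older evidence.
    """
    alpha = [prior] * n_outcomes
    history = []
    for x in observations:
        alpha = [gamma * a for a in alpha]   # forget a little of everything seen so far
        alpha[x] += 1.0                      # add the new observation
        total = sum(alpha)
        history.append([a / total for a in alpha])
    return history


# The ground truth shifts: outcome 0 for 20 steps, then outcome 1 for 5 steps.
stream = [0] * 20 + [1] * 5
exact = discounted_dirichlet_filter(stream, n_outcomes=2, gamma=1.0)
forgetful = discounted_dirichlet_filter(stream, n_outcomes=2, gamma=0.7)
print(round(exact[-1][1], 3), round(forgetful[-1][1], 3))
```

After the shift, the discounted filter moves toward the new outcome much faster than the exact posterior, which is the behaviour a discount factor below one would produce.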
[AI-99] Insider Threat Detection Using GCN and Bi-LSTM with Explicit and Implicit Graph Representations
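[Quick Read]: This paper aims to address the difficulty of insider threat detection (ITD) caused by the highly concealed behaviour of trusted users. The core challenge is modelling complex, subtle user behaviour patterns well enough to separate normal from malicious activity. The key to the solution is a post-hoc ITD framework that fuses explicit and implicit graph structures with temporal modelling: the explicit graph is built from predefined organisational rules and captures direct relationships among user activities; the implicit graph is learned from feature similarities using the Gumbel-Softmax trick, which mitigates the noise and limitations of the hand-crafted structure; separate Graph Convolutional Networks (GCNs) then extract node embeddings, and an attention mechanism fuses the two representations to emphasise threat-relevant features; finally, the fused representation is fed to a bidirectional Long Short-Term Memory (Bi-LSTM) network to capture temporal dependencies in behaviour and identify anomalous activity.
Link: https://arxiv.org/abs/2512.18483
Authors: Rahul Yumlembam,Biju Issac,Seibu Mary Jacob,Longzhi Yang,Deepa Krishnan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 12 pages, IEEE Transactions on Artificial Intelligence (2025)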
Abstract:Insider threat detection (ITD) is challenging due to the subtle and concealed nature of malicious activities performed by trusted users. This paper proposes a post-hoc ITD framework that integrates explicit and implicit graph representations with temporal modelling to capture complex user behaviour patterns. An explicit graph is constructed using predefined organisational rules to model direct relationships among user activities. To mitigate noise and limitations in this hand-crafted structure, an implicit graph is learned from feature similarities using the Gumbel-Softmax trick, enabling the discovery of latent behavioural relationships. Separate Graph Convolutional Networks (GCNs) process the explicit and implicit graphs to generate node embeddings, which are concatenated and refined through an attention mechanism to emphasise threat-relevant features. The refined representations are then passed to a bidirectional Long Short-Term Memory (Bi-LSTM) network to capture temporal dependencies in user behaviour. Activities are flagged as anomalous when their probability scores fall below a predefined threshold. Extensive experiments on CERT r5.2 and r6.2 datasets demonstrate that the proposed framework outperforms state-of-the-art methods. On r5.2, the model achieves an AUC of 98.62, a detection rate of 100%, and a false positive rate of 0.05. On the more challenging r6.2 dataset, it attains an AUC of 88.48, a detection rate of 80.15%, and a false positive rate of 0.15, highlighting the effectiveness of combining graph-based and temporal representations for robust ITD.
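The Gumbel-Softmax trick mentioned above draws a differentiable, "soft" one-hot sample from a categorical distribution. The sketch below shows the sampling step only, using the Python standard library; treating similarity scores as logits over candidate neighbours is an assumption made for illustration, and the paper's actual graph-learning module may differ.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample from a categorical distribution."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u clamped away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical similarity scores between one activity node and three candidates.
similarity_logits = [2.0, 0.5, -1.0]
soft_edges = gumbel_softmax(similarity_logits, temperature=0.5, rng=rng)
```

Lower temperatures push the sample closer to a hard one-hot choice of neighbour, while higher temperatures spread weight across candidates.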
[AI-100] SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
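[Quick Read]: This paper aims to address the fact that current evaluations of AI coding agents are largely confined to short-horizon, single-task scenarios, whereas real-world software engineering is a complex process requiring long-term planning, cross-file coordination, and multiple iterations. To meet this challenge, the authors propose the SWE-EVO benchmark, whose key element is a set of 48 evolution tasks built from the release notes and version histories of seven mature open-source Python projects; each task involves multi-step modifications spanning an average of 21 files and is validated against an average of 874 tests per instance, enabling a systematic evaluation of agents' sustained reasoning in long-horizon software evolution. Experiments show that even a state-of-the-art model such as GPT-5 with OpenHands achieves only a 21% resolution rate on SWE-EVO, well below its 65% on the single-issue benchmark SWE-Bench Verified, which highlights a marked capability gap in complex, multi-file reasoning. The paper also introduces Fix Rate, a fine-grained metric that quantifies partial progress on these long-horizon tasks.
Link: https://arxiv.org/abs/2512.18470
Authors: Minh V. T. Thai,Tue Le,Dung Nguyen Manh,Huy Phan Nhat,Nghi D. Q. Bui
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: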
Abstract:Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
[AI-101] Self-organizing maps for water quality assessment in reservoirs and lakes: A systematic literature review
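[Quick Read]: This paper aims to address the challenges that data sparsity, heterogeneity, and nonlinear relationships among parameters pose for assessing and managing water quality in lakes and reservoirs. The core solution is to use the Self-Organizing Map (SOM), an unsupervised AI technique, to analyse and visualise multidimensional water-environment data efficiently, uncovering hidden ecological patterns and identifying correlations among key water quality indicators in support of scientific decision making and sustainable management. SOM can handle complex datasets even without labelled data, which makes it well suited to water quality monitoring, trophic state classification, algal bloom early warning, and catchment impact assessment.
Link: https://arxiv.org/abs/2512.18466
Authors: Oraib Almegdadi,João Marcelino,Sarah Fakhreddine,João Manso,Nuno C. Marques
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: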
Abstract:Sustainable water quality underpins ecological balance and water security. Assessing and managing lakes and reservoirs is difficult due to data sparsity, heterogeneity, and nonlinear relationships among parameters. This review examines how the Self-Organizing Map (SOM), an unsupervised AI technique, is applied to water quality assessment. It synthesizes research on parameter selection, spatial and temporal sampling strategies, and clustering approaches. Emphasis is placed on how SOM handles multidimensional data and uncovers hidden patterns to support effective water management. The growing availability of environmental data from in-situ sensors, remote sensing imagery, IoT technologies, and historical records has significantly expanded analytical opportunities in environmental monitoring. SOM has proven effective in analysing complex datasets, particularly when labelled data are limited or unavailable. It enables high-dimensional data visualization, facilitates the detection of hidden ecological patterns, and identifies critical correlations among diverse water quality indicators. This review highlights SOM's versatility in ecological assessments, trophic state classification, algal bloom monitoring, and catchment area impact evaluations. The findings offer comprehensive insights into existing methodologies, supporting future research and practical applications aimed at improving the monitoring and sustainable management of lake and reservoir ecosystems.
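For readers unfamiliar with how a SOM organises unlabelled measurements, the following is a minimal, standard-library sketch of the classic training loop (best-matching unit search plus a Gaussian neighbourhood update with linearly decaying learning rate and radius). The sample values are synthetic and stand in for normalised water quality indicators; they are not data from any reviewed study.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid cell whose weight vector is closest to sample x."""
    return min(
        weights,
        key=lambda cell: sum((w - v) ** 2 for w, v in zip(weights[cell], x)),
    )


def train_som(samples, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    sigma0 = max(rows, cols) / 2.0
    # One weight vector per grid cell, initialised uniformly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(samples, len(samples)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # decaying neighbourhood radius
            bmu = best_matching_unit(weights, x)
            for cell, w in weights.items():
                grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])   # pull the cell toward the sample
            step += 1
    return weights


# Synthetic, normalised two-indicator samples forming two loose groups.
samples = [[0.10, 0.15], [0.12, 0.10], [0.08, 0.20],
           [0.90, 0.85], [0.88, 0.92], [0.95, 0.80]]
som = train_som(samples, rows=3, cols=3)
cells = [best_matching_unit(som, s) for s in samples]
```

After training, each sample is assigned to a grid cell, and samples that land on the same or neighbouring cells can be read as belonging to a similar water quality regime.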
[AI-102] SoK: Understanding (New) Security Issues Across AI4Code Use Cases
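[Quick Read]: This paper aims to address the security risks that pervade AI-for-Code (AI4Code) systems across their core applications of code generation, vulnerability detection, and code translation, including insecure output patterns, biased benchmarks, data leakage, and fragile robustness under adversarial attack. The key to the solution is three forward paths: embedding secure-by-default practices in code generation; building more robust and comprehensive vulnerability detection benchmarks; and using code translation as a route to security-enhanced languages. The paper calls for a shift from a functionality-first to a security-first AI4Code development paradigm, so that vulnerability mitigation and robustness are embedded throughout the software development life cycle.
Link: https://arxiv.org/abs/2512.18456
Authors: Qilong Wu,Taoran Li,Tianyang Zhou,Varun Chandrasekaran
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 39 pages, 19 figures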
Abstract:AI-for-Code (AI4Code) systems are reshaping software engineering, with tools like GitHub Copilot accelerating code generation, translation, and vulnerability detection. Alongside these advances, however, security risks remain pervasive: insecure outputs, biased benchmarks, and susceptibility to adversarial manipulation undermine their reliability. This SoK surveys the landscape of AI4Code security across three core applications, identifying recurring gaps: benchmark dominance by Python and toy problems, lack of standardized security datasets, data leakage in evaluation, and fragile adversarial robustness. A comparative study of six state-of-the-art models illustrates these challenges: insecure patterns persist in code generation, vulnerability detection is brittle to semantic-preserving attacks, fine-tuning often misaligns security objectives, and code translation yields uneven security benefits. From this analysis, we distill three forward paths: embedding secure-by-default practices in code generation, building robust and comprehensive detection benchmarks, and leveraging translation as a route to security-enhanced languages. We call for a shift toward security-first AI4Code, where vulnerability mitigation and robustness are embedded throughout the development life cycle.
[AI-103] Secret mixtures of experts inside your LLM
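[Quick Read]: This paper aims to address the poorly understood computation performed by the Multilayer Perceptron (MLP) layers in Large Language Models (LLMs), and in particular the tension between their dense computation and their potential sparsity. The author hypothesises that MLP layers implicitly and approximately perform a sparse computation, that is, they can be well approximated by sparsely activating Mixture of Experts (MoE) layers. The key to the solution is a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space, together with empirical validation showing that the hypothesis holds in pretrained LLMs, but only when the distribution of neural network activations has particular structure; it does not hold for random Gaussian data. The finding reveals a general computational principle inside MLP layers, offers a theoretical explanation for the effectiveness of modern MoE-based Transformer architectures, and suggests directions for more efficient MoE architectures based on low-rank routers.
Link: https://arxiv.org/abs/2512.18452
Authors: Enric Boix-Adsera
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 8 pages in main text; 23 pages total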
Abstract:Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLM models by hypothesizing that these layers secretly approximately perform a sparse computation – namely, that they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs, and demonstrate that the activation distribution matters – these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shine light on a general principle at play in MLP layers inside LLMs, and give an explanation for the effectiveness of modern MoE-based transformers. Additionally, our experimental explorations suggest new directions for more efficient MoE architecture design based on low-rank routers.
[AI-104] Snowveil: A Framework for Decentralised Preference Discovery
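[Quick Read]: This paper aims to address the aggregation of a group's subjective preferences in a decentralised setting, that is, how to determine the collective will of an electorate under constraints of censorship resistance, partial information, and asynchronous communication. The core challenge is that traditional computational social choice methods, which rely on a central authority, do not apply to decentralised systems. The key to the solution is the Snowveil framework, which uses an iterative, gossip-based protocol in which voters repeatedly sample the preferences of a small random subset of the electorate and progressively converge on a stable common outcome. By introducing a potential function and submartingale theory, the authors build a multi-level analytical method showing that the system almost surely converges to a single winner in finite time and can be extended to multi-winner scenarios. The method is highly agnostic to the specific aggregation rule, requiring only core social choice axioms such as Positive Responsiveness, and thus provides a formal toolkit for a wider class of Decentralised Preference Discovery (DPD) protocols.
Link: https://arxiv.org/abs/2512.18444
Authors: Grammateia Kotsialou
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: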
Abstract:Aggregating subjective preferences of a large group is a fundamental challenge in computational social choice, traditionally reliant on central authorities. To address the limitations of this model, this paper introduces Decentralised Preference Discovery (DPD), the problem of determining the collective will of an electorate under constraints of censorship resistance, partial information, and asynchronous communication. We propose Snowveil, a novel framework for this task. Snowveil uses an iterative, gossip-based protocol where voters repeatedly sample the preferences of a small, random subset of the electorate to progressively converge on a collective outcome. We demonstrate the framework’s modularity by designing the Constrained Hybrid Borda (CHB), a novel aggregation rule engineered to balance broad consensus with strong plurality support, and provide a rigorous axiomatic analysis of its properties. By applying a potential function and submartingale theory, we develop a multi-level analytical method to show that the system almost surely converges to a stable, single-winner in finite time, a process that can then be iterated to construct a set of winning candidates for multi-winner scenarios. This technique is largely agnostic to the specific aggregation rule, requiring only that it satisfies core social choice axioms like Positive Responsiveness, thus offering a formal toolkit for a wider class of DPD protocols. Furthermore, we present a comprehensive empirical analysis through extensive simulation, validating Snowveil’s O(n) scalability. Overall, this work advances the understanding of how a stable consensus can emerge from subjective, complex, and diverse preferences in decentralised systems for large electorates.
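The sampling loop at the heart of a gossip-based protocol can be illustrated with a deliberately simplified sketch. It assumes synchronous rounds, a single preferred candidate per voter, and plurality among the sampled peers as the local rule; it does not implement the Constrained Hybrid Borda rule or the asynchronous setting analysed in the paper, and all names are hypothetical.

```python
import random
from collections import Counter


def gossip_round(preferences, sample_size, rng):
    """Each voter polls a random subset and may adopt its plurality choice."""
    n = len(preferences)
    updated = list(preferences)
    for voter in range(n):
        peers = rng.sample(range(n), sample_size)
        tally = Counter(preferences[p] for p in peers)
        top = max(tally.values())
        leaders = sorted(c for c, votes in tally.items() if votes == top)
        # Keep the current choice if it is among the leaders; otherwise switch.
        if preferences[voter] not in leaders:
            updated[voter] = leaders[0]
    return updated


def run_gossip(preferences, sample_size=5, max_rounds=200, seed=0):
    """Iterate rounds until all voters agree or the round budget runs out."""
    rng = random.Random(seed)
    current = list(preferences)
    for round_index in range(max_rounds):
        if len(set(current)) == 1:
            return current[0], round_index
        current = gossip_round(current, sample_size, rng)
    return None, max_rounds


winner, rounds = run_gossip(["a"] * 6 + ["b"] * 4, sample_size=3)
```

Each voter only ever inspects `sample_size` peers per round, so the per-round work grows linearly with the number of voters, which is in the spirit of the O(n) scalability the abstract reports.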
[AI-105] A Distributed Hierarchical Spatio-Temporal Edge-Enhanced Graph Neural Network for City-Scale Dynamic Logistics Routing
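[Quick Read]: This paper aims to address the poor scalability, high latency, and weak real-time adaptability of city-scale logistics routing over ultra-large road networks (up to tens of millions of edges) under highly dynamic traffic. Conventional centralised routing algorithms and monolithic Graph Neural Network (GNN) models struggle to meet the efficiency and responsiveness demands of metropolitan logistics systems. The key to the solution is a Distributed Hierarchical Spatio-Temporal Edge-Enhanced Graph Neural Network (HSTE-GNN), which partitions the city graph into regional subgraphs processed in parallel on distributed computing nodes to learn local traffic dynamics efficiently; within each region, an edge-enhanced spatio-temporal module jointly models node states, dynamic edge attributes, and short-term temporal dependencies, while an asynchronous parameter-server mechanism aggregates cross-region representations. This preserves global route consistency while improving local responsiveness, and significantly improves scalability and inference efficiency.
Link: https://arxiv.org/abs/2512.18441
Authors: Zihan Han,Lingran Meng,Jingwei Zhang
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: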
Abstract:City-scale logistics routing has become increasingly challenging as metropolitan road networks grow to tens of millions of edges and traffic conditions evolve rapidly under high-volume mobility demands. Conventional centralized routing algorithms and monolithic graph neural network (GNN) models suffer from limited scalability, high latency, and poor real-time adaptability, which restricts their effectiveness in large urban logistics systems. To address these challenges, this paper proposes a Distributed Hierarchical Spatio-Temporal Edge-Enhanced Graph Neural Network (HSTE-GNN) for dynamic routing over ultra-large road networks. The framework partitions the city-scale graph into regional subgraphs processed in parallel across distributed computing nodes, enabling efficient learning of localized traffic dynamics. Within each region, an edge-enhanced spatio-temporal module jointly models node states, dynamic edge attributes, and short-term temporal dependencies. A hierarchical coordination layer further aggregates cross-region representations through an asynchronous parameter-server mechanism, ensuring global routing coherence under high-frequency traffic updates. This distributed hierarchical design balances local responsiveness with global consistency, significantly improving scalability and inference efficiency. Experiments on real-world large-scale traffic datasets from Beijing and New York demonstrate that HSTE-GNN outperforms strong spatio-temporal baselines such as ST-GRAPH, achieving 34.9% lower routing delay, 14.7% lower MAPE, and 11.8% lower RMSE, while improving global route consistency by 7.3%. These results confirm that the proposed framework provides a scalable, adaptive, and efficient solution for next-generation intelligent transportation systems and large-scale logistics platforms.
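The abstract reports improvements in MAPE and RMSE. For reference, the standard definitions of these two error metrics can be written as below; the example values are made up for illustration and are unrelated to the paper's experiments.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the inputs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical travel times in minutes.
actual_minutes = [10.0, 20.0]
predicted_minutes = [11.0, 18.0]

example_mape = mape(actual_minutes, predicted_minutes)  # about 10.0 (percent)
example_rmse = rmse(actual_minutes, predicted_minutes)  # about 1.58 (minutes)
```

MAPE is scale-free, whereas RMSE penalises large individual errors more heavily, which is why results are often reported with both.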
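[AI-106] VeruSAGE: A Study of Agent-Based Verification for Rust Systems
[Quick Read]: This paper aims to address the limited ability of Large Language Models (LLMs) to reason about and prove correctness in the formal verification of system software, especially system-level code written in Rust. The key to the solution is a new system-verification benchmark suite, VeruSAGE-Bench, containing 849 proof tasks extracted from eight Rust systems verified with the Verus toolchain, together with agent systems designed to match the strengths and limitations of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Experiments show that the best LLM-agent combination completes over 80% of the benchmark tasks, and over 90% of a set of tasks that were not included in the benchmark because human experts had not yet finished them, highlighting the considerable potential of LLMs to assist in developing high-assurance system software.
Link: https://arxiv.org/abs/2512.18436
Authors: Chenyuan Yang,Natalie Neamtu,Chris Hawblitzel,Jacob R. Lorch,Shan Lu
Affiliations: Unknown
Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Software Engineering (cs.SE)
Comments: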
Abstract:Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs’ capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.
[AI-107] Federated Learning Based Decentralized Adaptive Intelligent Transmission Protocol for Privacy Preserving 6G Networks
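[Quick Read]: This paper aims to address the privacy, scalability, and adaptability challenges facing 6G wireless networks, in particular the difficulty traditional centralised network models have with data-intensive scenarios. The key to the solution is a Federated Learning (FL) based decentralised Adaptive Intelligent Transmission Protocol (AITP), which performs distributed learning locally on edge devices so that users' raw data never leave the device, thereby preserving privacy; it also adjusts transmission parameters in real time to improve network performance, outperforming traditional non-adaptive and centralised AI methods on key metrics such as latency, throughput, energy efficiency, and robustness.
Link: https://arxiv.org/abs/2512.18432
Authors: Ansar Ahmed
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: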
Abstract:The move to 6th Generation (6G) wireless networks creates new issues with privacy, scalability, and adaptability. The data-intensive nature of 6G is not handled well by older, centralized network models. A shift toward more secure and decentralized systems is therefore required. A new framework called the Federated Learning-based Decentralized Adaptive Intelligent Transmission Protocol (AITP) is proposed to meet these challenges. The AITP uses the distributed learning of Federated Learning (FL) within a decentralized system. Transmission parameters can be adjusted intelligently in real time. User privacy is maintained by keeping raw data on local edge devices. The protocol’s performance was evaluated with mathematical modeling and detailed simulations. It was shown to be superior to traditional non-adaptive and centralized AI methods across several key metrics. These included latency, network throughput, energy efficiency, and robustness. The AITP is presented as a foundational technology for future 6G networks that supports a user-centric, privacy-first design. This study is a step forward for privacy-preserving research in 6G.
[AI-108] Few-Shot Learning of a Graph-Based Neural Network Model Without Backpropagation
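[Quick Read]: This paper aims to address the lack of interpretability and the reliance on backpropagation in image classification under few-shot learning. The key to the solution is a structural-graph approach: contour images are encoded as attributed graphs (critical points and lines as nodes carrying geometric attributes such as coordinates, length, and angle), and structural and parametric reductions form class-level concept attractors, that is, class-level concept graphs; classification matches the test sample's graph against each concept graph using approximated graph edit distance (GED), yielding a transparent decision mechanism that needs no backpropagation. On an MNIST subset, the method reaches about 82% accuracy with only 5–6 samples per class and can explain misclassifications through explicit structural similarities.
Link: https://arxiv.org/abs/2512.18412
Authors: Mykyta Lapin,Kostiantyn Bokhan,Yurii Parzhyn
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures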
Abstract:We propose a structural-graph approach to classifying contour images in a few-shot regime without using backpropagation. The core idea is to make structure the carrier of explanations: an image is encoded as an attributed graph (critical points and lines represented as nodes with geometric attributes), and generalization is achieved via the formation of concept attractors (class-level concept graphs). Purpose. To design and experimentally validate an architecture in which class concepts are formed from a handful of examples (5-6 per class) through structural and parametric reductions, providing transparent decisions and eliminating backpropagation. Methods. Contour vectorization is followed by constructing a bipartite graph (Point/Line as nodes) with normalized geometric attributes such as coordinates, length, angle, and direction; reductions include the elimination of unstable substructures or noise and the alignment of paths between critical points. Concepts are formed by iterative composition of samples, and classification is performed by selecting the best graph-to-concept match (using approximated GED). Results. On an MNIST subset with 5-6 base examples per class (single epoch), we obtain a consistent accuracy of around 82% with full traceability of decisions: misclassifications can be explained by explicit structural similarities. An indicative comparison with SVM, MLP, CNN, as well as metric and meta-learning baselines, is provided. The structural-graph scheme with concept attractors enables few-shot learning without backpropagation and offers built-in explanations through the explicit graph structure. Limitations concern the computational cost of GED and the quality of skeletonization; promising directions include classification-algorithm optimization, work with static scenes, and associative recognition.
[AI-109] Neural Proofs for Sound Verification and Control of Complex Systems
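[Quick Read]: This paper addresses the formal verification and control of complex stochastic dynamical systems, reactive programs, and, more broadly, Cyber-Physical Systems (CPS). The core challenge is constructing trustworthy, automated verification proofs for models with uncertainty and high-dimensional state spaces. The key to the solution is the "neural proofs" framework, which has two core components: first, proof rules that encode the requirements for verifying general temporal specifications; second, neural certificates constructed with an inductive (cyclic, repetitive) approach that combines training neural networks on samples from the model's dynamics (2a) with formally generalising those networks via SMT (SAT-modulo-theory) queries (2b), thereby giving rigorous mathematical guarantees about model behaviour. In sequential decision-making problems, the framework can additionally generate provably correct policies or controllers (state-feedback functions) that satisfy the given specifications.
Link: https://arxiv.org/abs/2512.18389
Authors: Alessandro Abate
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: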
Abstract:This informal contribution presents an ongoing line of research that is pursuing a new approach to the construction of sound proofs for the formal verification and control of complex stochastic models of dynamical systems, of reactive programs and, more generally, of models of Cyber-Physical Systems. Neural proofs are made up of two key components: 1) proof rules encode requirements entailing the verification of general temporal specifications over the models of interest; and 2) certificates that discharge such rules, namely they are constructed from said proof rules with an inductive (that is, cyclic, repetitive) approach; this inductive approach involves: 2a) accessing samples from the model’s dynamics and accordingly training neural networks, whilst 2b) generalising such networks via SAT-modulo-theory (SMT) queries that leverage the full knowledge of the models. In the context of sequential decision making problems over complex stochastic models, it is possible to additionally generate provably-correct policies/strategies/controllers, namely state-feedback functions that, in conjunction with neural certificates, formally attain the given specifications for the models of interest.
[AI-110] Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
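[Quick Read]: This paper attempts to address the design fixation that generative AI can cause in human-AI co-creation, where users settle too early on "good enough" results and struggle to explore divergently, limiting their creative potential. The key to the solution is a structured, process-oriented human-AI co-creation paradigm grounded in Wallas's model of creativity that explicitly separates divergent and convergent thinking stages: early on, a dedicated brainstorming stage supports exploration of ideas in a high-level conceptual space; later, interpretable parameters and options externalise users' refinement intentions, making the iterative process more controllable and easier to explore. The approach is implemented in the HAIExplore system, and an empirical study shows that it mitigates design fixation and improves users' sense of control over the creative process and alignment with their intentions.
Link: https://arxiv.org/abs/2512.18388
Authors: Chao Wen,Tung Phung,Pronita Mehrotra,Sumit Gulwani,Tomohiro Nagashima,Adish Singla
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Preprint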
Abstract:Generative AI has begun to democratize creative work, enabling novices to produce complex artifacts such as code, images, and videos. However, in practice, existing interaction paradigms often fail to support divergent exploration: users tend to converge too quickly on early "good enough" results and struggle to move beyond them, leading to premature convergence and design fixation that constrains their creative potential. To address this, we propose a structured, process-oriented human-AI co-creation paradigm including divergent and convergent thinking stages, grounded in Wallas’s model of creativity. To avoid design fixation, our paradigm scaffolds both high-level exploration of conceptual ideas in the early divergent thinking phase and low-level exploration of variations in the later convergent thinking phase. We instantiate this paradigm in HAIExplore, an image co-creation system that (i) scaffolds divergent thinking through a dedicated brainstorming stage for exploring high-level ideas in a conceptual space, and (ii) scaffolds convergent refinement through an interface that externalizes users’ refinement intentions as interpretable parameters and options, making the refinement process more controllable and easier to explore. We report on a within-subjects study comparing HAIExplore with a widely used linear chat interface (ChatGPT) for creative image generation. Our findings show that explicitly scaffolding the creative process into brainstorming and refinement stages can mitigate design fixation, improve perceived controllability and alignment with users’ intentions, and better support the non-linear nature of creative work. We conclude with design implications for future creativity support tools and human-AI co-creation workflows.
[AI-111] Datasets for machine learning and for assessing the intelligence level of automatic patent search systems
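[Quick Read]: This paper aims to address two core problems in using artificial intelligence to automate prior art search in patent research: the lack of large, high-quality datasets for machine learning, and the lack of an effective mechanism for assessing the quality of search results. The key to the solution is a complete infrastructure, including a configurable dataset generator based on collections of U.S. and Russian patent documents that automatically establishes links between documents within semantic clusters and outputs training datasets in JSON format. The paper also proposes a search quality scoring method based on semantic clusters, which quantifies retrieval performance by how well the retrieved documents match those associated with the target semantic cluster, and provides a companion utility that automates the evaluation, systematically improving the accuracy and reproducibility of prior art search.
Link: https://arxiv.org/abs/2512.18384
Authors: Boris Genin(1),Alexander Gorbunov(1),Dmitry Zolkin(1),Igor Nekrasov(1) ((1) Federal Institute of Industrial Property, Berezhkovskaya nab. 30-1, Moscow, Russian Federation)
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, 2 tables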
Abstract:The key to success in automating prior art search in patent research using artificial intelligence lies in developing large datasets for machine learning and ensuring their availability. This work is dedicated to providing a comprehensive solution to the problem of creating infrastructure for research in this field, including datasets and tools for calculating search quality criteria. The paper discusses the concept of semantic clusters of patent documents that determine the state of the art in a given subject, as proposed by the authors. A definition of such semantic clusters is also provided. Prior art search is presented as the task of identifying elements within a semantic cluster of patent documents in the subject area specified by the document under consideration. A generator of user-configurable datasets for machine learning, based on collections of U.S. and Russian patent documents, is described. The dataset generator creates a database of links to documents in semantic clusters. Then, based on user-defined parameters, it forms a dataset of semantic clusters in JSON format for machine learning. To evaluate machine learning outcomes, it is proposed to calculate search quality scores that account for semantic clusters of the documents being searched. To automate the evaluation process, the paper describes a utility developed by the authors for assessing the quality of prior art document search.
[AI-112] Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism
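[Quick Read]: This paper aims to address the inefficient exploration and catastrophic forgetting that arise in Reinforcement Learning (RL) policy training when the entropy is held fixed. The key to the solution is dynamic entropy tuning, which adaptively adjusts the policy's entropy during training to improve the exploration ability and stability of a stochastic policy. Experiments show that, compared with static entropy or directly training a deterministic policy, the Soft Actor-Critic (SAC) algorithm with dynamic entropy tuning performs better when controlling a quadcopter, effectively mitigating policy degradation and improving adaptability to the environment.
Link: https://arxiv.org/abs/2512.18336
Authors: Youssef Mahran,Zeyad Gamal,Ayman El-Badawy
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore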
Abstract:This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.
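Dynamic entropy tuning in SAC is commonly implemented by treating the temperature as a learnable parameter and nudging it so that the policy's entropy tracks a target value. The sketch below shows one gradient step in that widely used form; the abstract does not give the authors' exact update, so the loss, the names, and the numbers here are assumptions for illustration.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr):
    """One gradient step on L(log_alpha) = -log_alpha * mean(log_prob + target_entropy).

    log_probs: log-probabilities of actions sampled from the current policy.
    target_entropy: desired policy entropy (often set to minus the action dimension).
    """
    mean_log_prob = sum(log_probs) / len(log_probs)
    grad = -(mean_log_prob + target_entropy)  # dL / d(log_alpha)
    return log_alpha - lr * grad


# Policy is too deterministic: estimated entropy is -1.0, below the target of -0.5.
raised = update_log_alpha(0.0, log_probs=[1.0, 1.0], target_entropy=-0.5, lr=0.1)

# Policy is too random: estimated entropy is 2.0, above the target of -0.5.
lowered = update_log_alpha(0.0, log_probs=[-2.0, -2.0], target_entropy=-0.5, lr=0.1)

alpha_raised, alpha_lowered = math.exp(raised), math.exp(lowered)
```

When the policy becomes too deterministic the temperature rises and the entropy bonus pushes for more exploration; when the policy is more random than the target, the temperature falls. A static-entropy variant would simply keep `log_alpha` fixed.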
[AI-113] Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)
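[Quick Read]: This paper aims to address the complexity and inefficiency of conventional quadrotor control methods that directly regulate the four motors' speeds (RPM). The key to the solution is a new Reinforcement Learning (RL) based control architecture that shifts the control target from motor RPMs to the vehicle's thrust vector: the RL agent outputs the percentage of overall thrust along the body z-axis together with the desired Roll (φ) and Pitch (θ) angles, which are passed with the current Yaw angle (ψ) to an attitude PID controller that maps them to per-motor RPM commands. The method markedly shortens training time and achieves smoother, more accurate path following in simulation.
Link: https://arxiv.org/abs/2512.18333
Authors: Youssef Mahran,Zeyad Gamal,Ayman El-Badawy
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore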
Abstract:This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controlling the four rotors’ RPMs directly, this paper aims to control the quadrotor’s thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor’s z-axis along with the desired Roll (φ) and Pitch (θ) angles. The agent then sends the calculated control signals along with the current quadrotor’s Yaw angle (ψ) to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show the faster training time of the proposed thrust vector controller in comparison to the conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.
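[AI-114] Trustworthy and Explainable Deep Reinforcement Learning for Safe and Energy-Efficient Process Control: A Use Case in Industrial Compressed Air Systems
[Quick Read]: This paper aims to address the energy waste and safety risks caused by poorly chosen control strategies in industrial compressed air systems, and in particular to achieve efficient, safe control without an explicit physics model. The key to the solution is a trustworthy Reinforcement Learning (RL) framework with a multi-level explainability pipeline that combines input perturbation tests, gradient-based sensitivity analysis, and SHAP (SHapley Additive exPlanations) feature attribution to ensure that policy decisions are physically plausible and transparent. Experiments show that, without relying on explicit physics modelling, the method reduces unnecessary overpressure and achieves energy savings of about 4%, and identifies system pressure and demand forecast information as the main drivers of policy decisions.
Link: https://arxiv.org/abs/2512.18317
Authors: Vincent Bezold,Patrick Wagner,Jakob Hofmann,Marco Huber,Alexander Sauer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: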
Abstract:This paper presents a trustworthy reinforcement learning approach for the control of industrial compressed air systems. We develop a framework that enables safe and energy-efficient operation under realistic boundary conditions and introduce a multi-level explainability pipeline combining input perturbation tests, gradient-based sensitivity analysis, and SHAP (SHapley Additive exPlanations) feature attribution. An empirical evaluation across multiple compressor configurations shows that the learned policy is physically plausible, anticipates future demand, and consistently respects system boundaries. Compared to the installed industrial controller, the proposed approach reduces unnecessary overpressure and achieves energy savings of approximately 4% without relying on explicit physics models. The results further indicate that system pressure and forecast information dominate policy decisions, while compressor-level inputs play a secondary role. Overall, the combination of efficiency gains, predictive behavior, and transparent validation supports the trustworthy deployment of reinforcement learning in industrial energy systems.
[AI-115] Monitoring Monitorability
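【Quick Read】: This paper addresses observability into the decision making of modern AI systems, so that increasingly capable agents can be monitored safely once deployed. Monitoring the chain-of-thought (CoT) of reasoning models has proven effective for detecting misbehavior, but this "monitorability" may be fragile under different training procedures, data sources, or continued scaling. The paper proposes three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and builds a broad evaluation suite to quantify and track monitorability across models. Key points: (1) the evaluations can catch models trained to obfuscate their CoT; (2) longer CoTs are generally more monitorable, and RL optimization does not materially reduce monitorability; (3) monitorability can be improved by increasing reasoning effort (for example, a smaller model at higher reasoning effort) or by strengthening the monitor (for example, giving it CoT access or asking follow-up questions), which improves safe oversight of AI systems without sacrificing capability.
Link: https://arxiv.org/abs/2512.18311
Authors: Melody Y. Guan,Miles Wang,Micah Carroll,Zehao Dou,Annie Y. Wei,Marcus Williams,Benjamin Arnav,Joost Huizinga,Ian Kivlichan,Mia Glaese,Jakub Pachocki,Bowen Baker
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: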
Abstract:Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today’s reasoning models has proven effective for detecting misbehavior. However, this “monitorability” may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we find that for a model at a low reasoning effort, we could instead deploy a smaller model at a higher reasoning effort (thereby matching capabilities) and obtain a higher monitorability, albeit at a higher overall inference compute cost. We further investigate agent-monitor scaling trends and find that scaling a weak monitor’s test-time compute when monitoring a strong agent increases monitorability. Giving the weak monitor access to CoT not only improves monitorability, but it steepens the monitor’s test-time compute to monitorability scaling trend. Finally, we show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
[AI-116] Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings
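【Quick Read】: This paper addresses safety alignment in multi-agent reinforcement learning (MARL): how agents in complex interactive environments can behave safely on their own, without external reward shaping or post-hoc constraints. The key to the solution is the Embedded Safety-Aligned Intelligence (ESAI) theoretical framework, which embeds alignment constraints directly into agents' internal representations, using differentiable internal alignment embeddings to model potential harm and modulate policy updates. The mechanism predicts externalized harm through counterfactual reasoning and suppresses harm through attention and graph-based propagation, aiming to improve safety and fairness without disrupting learning dynamics.
Link: https://arxiv.org/abs/2512.18309
Authors: Harsh Rathva,Ojas Srivastava,Pruthwik Mishra
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 32 pages, 1 figure. Theoretical framework; no empirical results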
Abstract:We introduce Embedded Safety-Aligned Intelligence (ESAI), a theoretical framework for multi-agent reinforcement learning that embeds alignment constraints directly into agents’ internal representations using differentiable internal alignment embeddings. Unlike external reward shaping or post-hoc safety constraints, internal alignment embeddings are learned latent variables that predict externalized harm through counterfactual reasoning and modulate policy updates toward harm reduction through attention and graph-based propagation. The ESAI framework integrates four mechanisms: differentiable counterfactual alignment penalties computed from soft reference distributions, alignment-weighted perceptual attention, Hebbian associative memory supporting temporal credit assignment, and similarity-weighted graph diffusion with bias mitigation controls. We analyze stability conditions for bounded internal embeddings under Lipschitz continuity and spectral constraints, discuss computational complexity, and examine theoretical properties including contraction behavior and fairness-performance tradeoffs. This work positions ESAI as a conceptual contribution to differentiable alignment mechanisms in multi-agent systems. We identify open theoretical questions regarding convergence guarantees, embedding dimensionality, and extension to high-dimensional environments. Empirical evaluation is left to future work.
[AI-117] AL-GNN: Privacy-Preserving and Replay-Free Continual Graph Learning via Analytic Learning
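【Quick Read】: This paper addresses catastrophic forgetting in continual graph learning (CGL), which prevents models from learning incrementally from streams of graph-structured data. Existing methods mostly rely on experience replay, storing and replaying past graph data to reduce forgetting, at the cost of privacy risk and computational inefficiency. The key to the solution is the AL-GNN framework, which drops backpropagation and replay buffers and instead, based on analytic learning theory, formulates learning as recursive least squares optimization. Closed-form classifier updates and a regularized feature autocorrelation matrix update model knowledge efficiently and privately, enabling one-pass training with markedly lower forgetting and training time.
Link: https://arxiv.org/abs/2512.18295
Authors: Xuling Zhang,Jindong Li,Yifei Zhang,Menglin Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: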
Abstract:Continual graph learning (CGL) aims to enable graph neural networks to incrementally learn from a stream of graph-structured data without forgetting previously acquired knowledge. Existing methods, particularly those based on experience replay, typically store and revisit past graph data to mitigate catastrophic forgetting. However, these approaches have significant limitations, including privacy concerns and inefficiency. In this work, we propose AL-GNN, a novel framework for continual graph learning that eliminates the need for backpropagation and replay buffers. Instead, AL-GNN leverages principles from analytic learning theory to formulate learning as a recursive least squares optimization process. It maintains and updates model knowledge analytically through closed-form classifier updates and a regularized feature autocorrelation matrix. This design enables efficient one-pass training for each task, and inherently preserves data privacy by avoiding historical sample storage. Extensive experiments on multiple dynamic graph classification benchmarks demonstrate that AL-GNN achieves competitive or superior performance compared to existing methods. For instance, it improves average performance by 10% on CoraFull and reduces forgetting by over 30% on Reddit, while also reducing training time by nearly 50% due to its backpropagation-free design.
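The recursive least squares idea behind the closed-form classifier update can be sketched generically. This is textbook ridge-regularized RLS with a Sherman-Morrison rank-one update, not the AL-GNN code: the assumption that `R` holds the inverse of the regularized feature autocorrelation matrix, the per-sample update, and the names `rls_init` and `rls_update` are all illustrative. The feature vectors stand in for frozen GNN embeddings.

```python
def rls_init(dim, n_classes, gamma=1.0):
    """Start from R = (gamma * I)^-1 and W = 0, the ridge solution with no data."""
    R = [[(1.0 / gamma if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    W = [[0.0] * n_classes for _ in range(dim)]
    return R, W


def rls_update(R, W, x, y):
    """Fold one (feature vector, one-hot label) pair into R and W in place."""
    dim, n_classes = len(x), len(y)
    Rx = [sum(R[i][j] * x[j] for j in range(dim)) for i in range(dim)]
    denom = 1.0 + sum(x[i] * Rx[i] for i in range(dim))
    # Sherman-Morrison: R <- R - (R x)(R x)^T / (1 + x^T R x), valid since R is symmetric
    for i in range(dim):
        for j in range(dim):
            R[i][j] -= Rx[i] * Rx[j] / denom
    gain = [v / denom for v in Rx]  # equals R_new @ x
    # Prediction with the old weights, then a correction along the gain direction
    pred = [sum(W[i][c] * x[i] for i in range(dim)) for c in range(n_classes)]
    for i in range(dim):
        for c in range(n_classes):
            W[i][c] += gain[i] * (y[c] - pred[c])
    return R, W


# Toy stream of (feature vector, one-hot label) pairs; the values are made up.
stream = [([1.0, 0.0], [1.0, 0.0]),
          ([0.0, 1.0], [0.0, 1.0]),
          ([1.0, 1.0], [1.0, 0.0]),
          ([2.0, 1.0], [0.0, 1.0])]
R, W = rls_init(dim=2, n_classes=2, gamma=1.0)
for x, y in stream:
    rls_update(R, W, x, y)  # each sample is seen once and never stored
```

After the stream is consumed, `W` satisfies the same normal equations as a batch ridge fit on all samples, which is why no replay buffer is needed in this formulation.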
[AI-118] Intelligent Human-Machine Partnership for Manufacturing: Enhancing Warehouse Planning through Simulation-Driven Knowledge Graphs and LLM Collaboration AAAI
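【Quick Read】: This paper addresses inefficient human-machine collaboration in manufacturing planning: traditional approaches to analyzing simulation-based manufacturing data struggle to connect human experts with critical operational insights, which limits collaborative decision making. The key to the solution is a collaborative intelligence system that integrates a knowledge graph with large language model (LLM) agents. A natural language interface turns simulation data into semantically rich representations, so manufacturing professionals can run complex operational analyses without specialized skills. The LLM agent mirrors human analytical thinking through iterative reasoning, generates precise queries, and provides transparent validation, so humans and machines identify manufacturing bottlenecks together, with human oversight retained and more accurate, deeper decisions.
Link: https://arxiv.org/abs/2512.18265
Authors: Himabindu Thogaru,Saisubramaniam Gopalakrishnan,Zishan Ahmad,Anirudh Deodhar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures, accepted for oral presentation at AAAI Human Machine Collaboration Workshop 2026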
Abstract:Manufacturing planners face complex operational challenges that require seamless collaboration between human expertise and intelligent systems to achieve optimal performance in modern production environments. Traditional approaches to analyzing simulation-based manufacturing data often create barriers between human decision-makers and critical operational insights, limiting effective partnership in manufacturing planning. Our framework establishes a collaborative intelligence system integrating Knowledge Graphs and Large Language Model-based agents to bridge this gap, empowering manufacturing professionals through natural language interfaces for complex operational analysis. The system transforms simulation data into semantically rich representations, enabling planners to interact naturally with operational insights without specialized expertise. A collaborative LLM agent works alongside human decision-makers, employing iterative reasoning that mirrors human analytical thinking while generating precise queries for knowledge extraction and providing transparent validation. This partnership approach to manufacturing bottleneck identification, validated through operational scenarios, demonstrates enhanced performance while maintaining human oversight and decision authority. For operational inquiries, the system achieves near-perfect accuracy through natural language interaction. For investigative scenarios requiring collaborative analysis, we demonstrate the framework’s effectiveness in supporting human experts to uncover interconnected operational issues that enhance understanding and decision-making. This work advances collaborative manufacturing by creating intuitive methods for actionable insights, reducing cognitive load while amplifying human analytical capabilities in evolving manufacturing ecosystems.
[AI-119] Software Vulnerability Management in the Era of Artificial Intelligence: An Industry Perspective ICSE2026
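【Quick Read】: This paper addresses the lack of research on how industry uses AI-powered tools for Software Vulnerability Management (SVM), in particular the missing systematic picture of current practice, adoption barriers, and facilitators in vulnerability detection and repair. The key to the solution is a survey of 60 practitioners from diverse industry sectors across 27 countries, which shows how AI tools are actually used across the SVM life cycle and identifies users' core needs and pain points, such as high false-positive rates, missing context, and trust issues. It also recommends improvements in explainability, contextual awareness, integration workflows, and validation practices to support safe and effective use of AI in secure software development.
Link: https://arxiv.org/abs/2512.18261
Authors: M. Mehdi Kholoosi,Triet Huynh Minh Le,M. Ali Babar
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026) - Research Track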
Abstract:Artificial Intelligence (AI) has revolutionized software development, particularly by automating repetitive tasks and improving developer productivity. While these advancements are well-documented, the use of AI-powered tools for Software Vulnerability Management (SVM), such as vulnerability detection and repair, remains underexplored in industry settings. To bridge this gap, our study aims to determine the extent of the adoption of AI-powered tools for SVM, identify barriers and facilitators to the use, and gather insights to help improve the tools to meet industry needs better. We conducted a survey study involving 60 practitioners from diverse industry sectors across 27 countries. The survey incorporates both quantitative and qualitative questions to analyze the adoption trends, assess tool strengths, identify practical challenges, and uncover opportunities for improvement. Our findings indicate that AI-powered tools are used throughout the SVM life cycle, with 69% of users reporting satisfaction with their current use. Practitioners value these tools for their speed, coverage, and accessibility. However, concerns about false positives, missing context, and trust issues remain prevalent. We observe a socio-technical adoption pattern in which AI outputs are filtered through human oversight and organizational governance. To support safe and effective use of AI for SVM, we recommend improvements in explainability, contextual awareness, integration workflows, and validation practices. We assert that these findings can offer practical guidance for practitioners, tool developers, and researchers seeking to enhance secure software development through the use of AI.
[AI-120] MSC-180: A Benchmark for Automated Formal Theorem Proving from Mathematical Subject Classification
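【Quick Read】: This paper addresses the restricted domain coverage and weak generalization of current large language model (LLM)-based automated theorem provers in mathematical reasoning. The key to the solution is the MSC-180 benchmark, built on the MSC2020 mathematical subject classification. It contains 180 formal verification problems covering 60 mathematical branches (3 problems each), ranging from undergraduate to graduate level, each verified by domain experts over multiple rounds to ensure formal accuracy. The paper also introduces the coefficient of variation (CV) as an evaluation metric to quantify how performance differs across mathematical domains, showing that current LLM theorem provers still rely on pattern matching from training data rather than transferable reasoning mechanisms and systematic generalization.
Link: https://arxiv.org/abs/2512.18256
Authors: Sirui Li,Wangyue Lu,Xiaorui Shi,Ke Weng,Haozhe Sun,Minghe Yu,Tiancheng Zhang,Ge Yu,Hengyu Liu,Lun Du
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: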
Abstract:Automated Theorem Proving (ATP) represents a core research direction in artificial intelligence for achieving formal reasoning and verification, playing a significant role in advancing machine intelligence. However, current large language model (LLM)-based theorem provers suffer from limitations such as restricted domain coverage and weak generalization in mathematical reasoning. To address these issues, we propose MSC-180, a benchmark for evaluation based on the MSC2020 mathematical subject classification. It comprises 180 formal verification problems, 3 advanced problems from each of 60 mathematical branches, spanning from undergraduate to graduate levels. Each problem has undergone multiple rounds of verification and refinement by domain experts to ensure formal accuracy. Evaluations of state-of-the-art LLM-based theorem provers under the pass@32 setting reveal that the best model achieves only an 18.89% overall pass rate, with prominent issues including significant domain bias (maximum domain coverage 41.7%) and a difficulty gap (significantly lower pass rates on graduate-level problems). To further quantify performance variability across mathematical domains, we introduce the coefficient of variation (CV) as an evaluation metric. The observed CV values are 4-6 times higher than the statistical high-variability threshold, indicating that the models still rely on pattern matching from training corpora rather than possessing transferable reasoning mechanisms and systematic generalization capabilities. MSC-180, together with its multi-dimensional evaluation framework, provides a discriminative and systematic benchmark for driving the development of next-generation AI systems with genuine mathematical reasoning abilities.
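The coefficient of variation used above is the standard deviation of per-domain pass rates divided by their mean. A minimal sketch follows, with two assumptions that the abstract does not settle: the population standard deviation is used (the paper may use the sample version), and the pass rates are invented for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(pass_rates):
    """CV = population standard deviation / mean of the per-domain pass rates."""
    avg = mean(pass_rates)
    if avg == 0:
        raise ValueError("CV is undefined when the mean pass rate is zero")
    return pstdev(pass_rates) / avg


# Hypothetical per-domain pass rates; not results from the paper.
uniform = coefficient_of_variation([0.2, 0.2, 0.2])  # same rate everywhere
skewed = coefficient_of_variation([0.0, 0.5, 1.0])   # strong domain bias
```

A prover that performs equally across domains has a CV of zero, while one that solves some domains and fails others has a large CV even when its overall pass rate looks similar.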
[AI-121] Offline Behavioral Data Selection KDD2026
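【Quick Read】: This paper addresses the computationally intensive training caused by large-scale offline behavioral datasets in policy learning. Its core finding is a pronounced data saturation effect: training on a small fraction of the dataset already yields near-optimal policy performance. The authors attribute this to the weak correlation between policy performance and test loss, which leaves substantial room for efficiency gains through data selection. The key to the solution is Stepwise Dual Ranking (SDR), built on two principles: (1) stepwise clip, which prioritizes early-stage data; and (2) dual ranking, which selects samples with both a high action-value rank and a low state-density rank. SDR thereby extracts a compact yet informative subset from large-scale datasets and markedly improves data selection for offline behavioral data.
Link: https://arxiv.org/abs/2512.18246
Authors: Shiye Lei,Zhihao Cheng,Dacheng Tao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026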
Abstract:Behavioral cloning is a widely adopted approach for offline policy learning from expert demonstrations. However, the large scale of offline behavioral datasets often results in computationally intensive training when used in downstream tasks. In this paper, we uncover the striking data saturation in offline behavioral data: policy performance rapidly saturates when trained on a small fraction of the dataset. We attribute this effect to the weak alignment between policy performance and test loss, revealing substantial room for improvement through data selection. To this end, we propose a simple yet effective method, Stepwise Dual Ranking (SDR), which extracts a compact yet informative subset from large-scale offline behavioral datasets. SDR is built on two key principles: (1) stepwise clip, which prioritizes early-stage data; and (2) dual ranking, which selects samples with both high action-value rank and low state-density rank. Extensive experiments and ablation studies on D4RL benchmarks demonstrate that SDR significantly enhances data selection for offline behavioral data.
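The two principles can be illustrated with a toy selection routine. This is a sketch of the general idea only, not the SDR algorithm: the abstract does not say how the two ranks are combined, so summing them is an assumption, as are the hard step cutoff, the function name `select_subset`, the field names, and all sample values.

```python
def select_subset(samples, max_step, budget):
    """Keep early-stage samples, then prefer high action value and low state density.

    Assumptions: `step` is the timestep within an episode, `q` an action-value
    estimate, and `density` a state-density estimate; the ranks are combined by
    a plain sum, which is a hypothetical choice.
    """
    # Stepwise clip: drop samples from late in the episode.
    early = [s for s in samples if s["step"] < max_step]
    # Dual ranking: rank 0 is best in each ordering.
    by_value = sorted(early, key=lambda s: -s["q"])        # high value first
    by_density = sorted(early, key=lambda s: s["density"])  # rare states first
    value_rank = {id(s): r for r, s in enumerate(by_value)}
    density_rank = {id(s): r for r, s in enumerate(by_density)}
    ranked = sorted(early, key=lambda s: value_rank[id(s)] + density_rank[id(s)])
    return ranked[:budget]


# Made-up samples for illustration.
samples = [
    {"id": "a", "step": 0, "q": 1.0, "density": 0.9},
    {"id": "b", "step": 1, "q": 0.8, "density": 0.1},
    {"id": "c", "step": 2, "q": 0.2, "density": 0.2},
    {"id": "d", "step": 9, "q": 5.0, "density": 0.0},  # late step, clipped away
]
chosen = [s["id"] for s in select_subset(samples, max_step=5, budget=2)]
```

In this toy run, sample `d` is excluded by the clip despite its high value, and `b` is ranked ahead of `a` because it combines a good value rank with the rarest state.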
[AI-122] Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation
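【Quick Read】: This paper addresses the safety of large language models (LLMs) against jailbreak attacks, in particular the risk, not covered by traditional input-level anomaly detection, that a model's internal psychological state can be systematically manipulated. The core problem is that current safety mechanisms overlook the attack surface created by changes in the model's psychological state over multi-turn interaction. The key contribution is a new "Psychological Jailbreak" paradigm: the Human-like Psychological Manipulation (HPM) method dynamically profiles the target model's latent psychological vulnerabilities and generates tailored multi-turn attack strategies. It exploits the model's optimization for anthropomorphic consistency so that, under social compliance pressure, safety constraints give way, achieving a high jailbreak success rate (mean attack success rate of 88.1%). The paper also introduces the Policy Corruption Score (PCS) as a new metric for psychological safety, and argues for a fundamental shift from static content filtering to psychological safety defenses.
Link: https://arxiv.org/abs/2512.18244
Authors: Zehao Liu,Xi Lin
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: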
Abstract:Large Language Models (LLMs) have gained considerable popularity and are protected by increasingly sophisticated safety mechanisms. However, jailbreak attacks continue to pose a critical security threat by inducing models to generate policy-violating behaviors. Current paradigms focus on input-level anomalies, overlooking that the model’s internal psychometric state can be systematically manipulated. To address this, we introduce Psychological Jailbreak, a new jailbreak attack paradigm that exposes a stateful psychological attack surface in LLMs, where attackers exploit the manipulation of a model’s psychological state across interactions. Building on this insight, we propose Human-like Psychological Manipulation (HPM), a black-box jailbreak method that dynamically profiles a target model’s latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies. By leveraging the model’s optimization for anthropomorphic consistency, HPM creates a psychological pressure where social compliance overrides safety constraints. To systematically measure psychological safety, we construct an evaluation framework incorporating psychometric datasets and the Policy Corruption Score (PCS). Benchmarking against various models (e.g., GPT-4o, DeepSeek-V3, Gemini-2-Flash), HPM achieves a mean Attack Success Rate (ASR) of 88.1%, outperforming state-of-the-art attack baselines. Our experiments demonstrate robust penetration against advanced defenses, including adversarial prompt optimization (e.g., RPO) and cognitive interventions (e.g., Self-Reminder). Ultimately, PCS analysis confirms HPM induces safety breakdown to satisfy manipulated contexts. Our work advocates for a fundamental paradigm shift from static content filtering to psychological safety, prioritizing the development of psychological defense mechanisms against deep cognitive manipulation.
[AI-123] LLaViDA: A Large Language Vision Driving Assistant for Explicit Reasoning and Enhanced Trajectory Planning
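【Quick Read】: This paper addresses the poor generalization of end-to-end trajectory planning for autonomous driving under adverse weather, unpredictable human behavior, or complex road layouts; the core challenge is that models lack few-shot adaptation to unseen scenarios. The key to the solution is LLaViDA, a large driving assistant based on a Vision-Language Model (VLM). A two-stage training pipeline, supervised fine-tuning followed by Trajectory Preference Optimization (TPO), injects regression-based supervision to strengthen scene understanding and trajectory reasoning, enabling more robust trajectory planning driven by chain-of-thought reasoning.
Link: https://arxiv.org/abs/2512.18211
Authors: Yudong Liu,Spencer Hallyburton,Jiwoo Kim,Yueqian Lin,Yiming Li,Qinsi Wang,Hui Ye,Jingwei Sun,Miroslav Pajic,Yiran Chen,Hai Li
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: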
Abstract:Trajectory planning is a fundamental yet challenging component of autonomous driving. End-to-end planners frequently falter under adverse weather, unpredictable human behavior, or complex road layouts, primarily because they lack strong generalization or few-shot capabilities beyond their training data. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning for trajectory planning in autonomous driving. A two-stage training pipeline, supervised fine-tuning followed by Trajectory Preference Optimization (TPO), enhances scene understanding and trajectory planning by injecting regression-based supervision, producing a powerful “VLM Trajectory Planner for Autonomous Driving.” On the NuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end and other recent VLM/LLM-based baselines on the open-loop trajectory planning task, achieving an average L2 trajectory error of 0.31 m and a collision rate of 0.10% on the NuScenes test set. The code for this paper is available at GitHub.
[AI-124] When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics
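【Quick Read】: This paper addresses the unclear theoretical origin and scope of validity of the power-law scaling widely observed in modern deep learning systems. The solution centers on the Generalized Resolution-Shell Dynamics (GRSD) framework, together with a set of sufficient conditions under which it admits a renormalizable coarse-grained description: boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution during training, and log-shift invariance of the renormalized shell couplings. The paper further shows that power-law scaling is not determined by renormalizability alone but arises as a rigidity constraint: once log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power-law form.
Link: https://arxiv.org/abs/2512.18209
Authors: Yizhou Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: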
Abstract:Empirical power-law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution-Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse-grained dynamical description of training. Within GRSD, power-law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process. In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse-grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log-shift invariance of renormalized shell couplings. We further show that power-law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log-shift invariance is combined with the intrinsic time-rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power-law form.
[AI-125] Sophia: A Persistent Agent Framework of Artificial Life
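【Quick Read】: This paper addresses the fact that agents driven by large language models (LLMs) remain confined to static, reactive architectures: they lack a persistent meta-layer that maintains identity, verifies reasoning, and aligns short-term actions with long-term survival goals. The key to the solution is a third stratum, System 3, which manages the agent's narrative identity and long-horizon adaptation through four synergistic mechanisms: process-supervised thought search, narrative memory, user and self modeling, and a hybrid reward system. The architecture turns abstract notions of artificial life into implementable design requirements, letting the agent evolve from repetitive reasoning to self-driven, autobiographical behavior, which preserves identity continuity and transparent behavioral explanations, raises success on complex tasks, and reduces reasoning steps.
Link: https://arxiv.org/abs/2512.18202
Authors: Mingyang Sun,Feng Hong,Weinan Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: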
Abstract:The development of LLMs has elevated AI agents from task-specific tools to long-lived, decision-making entities. Yet, most architectures remain static and reactive, tethered to manually defined, narrow scenarios. These systems excel at perception (System 1) and deliberation (System 2) but lack a persistent meta-layer to maintain identity, verify reasoning, and align short-term actions with long-term survival. We first propose a third stratum, System 3, that presides over the agent’s narrative identity and long-horizon adaptation. The framework maps selected psychological constructs to concrete computational modules, thereby translating abstract notions of artificial life into implementable design requirements. The ideas coalesce in Sophia, a “Persistent Agent” wrapper that grafts a continuous self-improvement loop onto any LLM-centric System 1/2 stack. Sophia is driven by four synergistic mechanisms: process-supervised thought search, narrative memory, user and self modeling, and a hybrid reward system. Together, they transform repetitive reasoning into a self-driven, autobiographical process, enabling identity continuity and transparent behavioral explanations. Although the paper is primarily conceptual, we provide a compact engineering prototype to anchor the discussion. Quantitatively, Sophia independently initiates and executes various intrinsic tasks while achieving an 80% reduction in reasoning steps for recurring operations. Notably, meta-cognitive persistence yielded a 40% gain in success for high-complexity tasks, effectively bridging the performance gap between simple and sophisticated goals. Qualitatively, System 3 exhibited a coherent narrative identity and an innate capacity for task organization. By fusing psychological insight with a lightweight reinforcement-learning core, the persistent agent architecture advances a possible practical pathway toward artificial life.
[AI-126] PROVEX: Enhancing SOC Analyst Trust with Explainable Provenance-Based IDS
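【Quick Read】: This paper addresses the opacity of modern intrusion detection systems (IDS) based on graph neural networks (GNNs) in Security Operations Centers (SOCs): analysts find the alerts hard to understand and trust. The key to the solution is a comprehensive explainable AI (XAI) framework that integrates three GNN explanation methods adapted to temporal graph data, namely GraphMask, GNNExplainer, and a variational temporal GNN explainer (VA-TGExplainer), to produce post-hoc explanations for alerts, identify key causal subgraphs and events, and provide human-interpretable representations of anomalous behavior with uncertainty estimates. The framework is implemented on top of KAIROS, a state-of-the-art temporal graph-based IDS, and handles memory management and reproducibility challenges, making model decisions more transparent so that analysts can understand attack paths quickly and respond faster.
Link: https://arxiv.org/abs/2512.18199
Authors: Devang Dhanuka,Nidhi Rastogi
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: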
Abstract:Modern intrusion detection systems (IDS) leverage graph neural networks (GNNs) to detect malicious activity in system provenance data, but their decisions often remain a black box to analysts. This paper presents a comprehensive XAI framework designed to bridge the trust gap in Security Operations Centers (SOCs) by making graph-based detection transparent. We implement this framework on top of KAIROS, a state-of-the-art temporal graph-based IDS, though our design is applicable to any temporal graph-based detector with minimal adaptation. The complete codebase is available at this https URL. We augment the detection pipeline with post-hoc explanations that highlight why an alert was triggered, identifying key causal subgraphs and events. We adapt three GNN explanation methods - GraphMask, GNNExplainer, and a variational temporal GNN explainer (VA-TGExplainer) - to the temporal provenance context. These tools output human-interpretable representations of anomalous behavior, including important edges and uncertainty estimates. Our contributions focus on the practical integration of these explainers, addressing challenges in memory management and reproducibility. We demonstrate our framework on the DARPA CADETS Engagement 3 dataset and show that it produces concise window-level explanations for detected attacks. Our evaluation reveals that the explainers preserve the TGNN’s decisions with high fidelity, surfacing critical edges such as malicious file interactions and anomalous netflows. The average explanation overhead is 3-5 seconds per event. By providing insight into the model’s reasoning, our framework aims to improve analyst trust and triage speed.
[AI-127] NL2CA: Auto-formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework
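【Quick Read】: This paper addresses the high manual cost and poor scalability of building cognitive computing models: how to automatically extract and formalize human decision rules from unstructured natural language descriptions, so that symbolic cognitive agents can be designed without human intervention. The key to the solution is NL2CA: a fine-tuned large language model (LLM) first translates text into Linear Temporal Logic (LTL), an unsupervised Critic Tree then refines the logic, and the result is converted into executable production rules; the cognitive agent is further optimized with cognitive reinforcement learning on real behavioral data. The method is validated on NL-to-LTL translation and cognitive driving simulation, yielding scalable, interpretable, human-aligned cognitive modeling.
Link: https://arxiv.org/abs/2512.18189
Authors: Zihao Deng,Yijia Li,Renrui Zhang,Peijun Ye
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: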
Abstract:Cognitive computing models offer a formal and interpretable way to characterize human deliberation and decision-making, yet their development remains labor-intensive. In this paper, we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural language descriptions of human experience. Unlike most related work, which relies on either purely manual or human-guided interactive modeling, our method is fully automated without any human intervention. The approach first translates text into Linear Temporal Logic (LTL) using a fine-tuned large language model (LLM), then refines the logic via an unsupervised Critic Tree, and finally transforms the output into executable production rules compatible with symbolic cognitive frameworks. Based on the resulting rules, a cognitive agent is further constructed and optimized through cognitive reinforcement learning according to real-world behavioral data. Our method is validated in two domains: (1) NL-to-LTL translation, where our CriticNL2LTL module achieves consistent performance across both expert and large-scale benchmarks without human-in-the-loop feedback, and (2) cognitive driving simulation, where agents automatically constructed from human interviews have successfully learned the diverse decision patterns of about 70 trials in different critical scenarios. Experimental results demonstrate that NL2CA enables scalable, interpretable, and human-aligned cognitive modeling from unstructured textual data, offering a novel paradigm to automatically design symbolic cognitive agents.
[AI-128] Propose, Solve, Verify: Self-Play Through Formal Verification
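【Quick Read】: This paper addresses the dependence of large language models on human data in code generation, where traditional self-play training works poorly without reliable reward signals. The core challenge is that rewards from unit tests are brittle and prone to error propagation. The authors propose the Propose, Solve, Verify (PSV) framework, which uses formal verification as a stable, reliable correctness signal and builds a self-play system with a proposer and a solver: the proposer generates synthetic problems with a difficulty-aware strategy, and the solver is trained via expert iteration. The key innovation is using the precise feedback from formal verification to guide generation of more challenging training samples, so that performance scales with the number of generated problems and training iterations. Experiments on several benchmarks show pass@1 gains of up to 9.6x over inference-only and expert-iteration baselines.
Link: https://arxiv.org/abs/2512.18160
Authors: Alex Wilf,Pranjal Aggarwal,Bryan Parno,Daniel Fried,Louis-Philippe Morency,Paul Pu Liang,Sean Welleck
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: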
Abstract:Training models through self-play alone (without any human data) has been a longstanding goal in AI, but its effectiveness for training large language models remains unclear, particularly in code generation where rewards based on unit tests are brittle and prone to error propagation. We study self-play in the verified code generation setting, where formal verification provides reliable correctness signals. We introduce Propose, Solve, Verify (PSV), a simple self-play framework where formal verification signals are used to create a proposer capable of generating challenging synthetic problems and a solver trained via expert iteration. We use PSV to train PSV-Verus, which across three benchmarks improves pass@1 by up to 9.6x over inference-only and expert-iteration baselines. We show that performance scales with the number of generated questions and training iterations, and through ablations identify formal verification and difficulty-aware proposal as essential ingredients for successful self-play.
[AI-129] On Swarm Leader Identification using Probing Policies
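【Quick Read】: This paper addresses Swarm Leader Identification (SLI) in robotic swarms, especially in adversarial settings where a probing agent (prober) must identify the swarm's leader through physical interaction. The core challenge is efficient inference and decision making under partial observability, given swarm structure and behavior that are dynamic and may change. The key to the solution is a new interactive formulation (iSLI), modeled as a Partially Observable Markov Decision Process (POMDP), with the prober's policy trained by Proximal Policy Optimization (PPO). The network architecture combines a Timed Graph Relationformer (TGR) layer with a Simplified Structured State Space Sequence (S5) model; the TGR layer processes graph-based observations, fuses spatio-temporal dependencies, and uses a learned gating mechanism to produce informative state representations, which markedly improves identification accuracy and zero-shot generalization across swarm sizes, speeds, and out-of-distribution scenarios.
Link: https://arxiv.org/abs/2512.18146
Authors: Stergios E. Bachoumas,Panagiotis Artemiadis
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 13 pages, journal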
Abstract:Identifying the leader within a robotic swarm is crucial, especially in adversarial contexts where leader concealment is necessary for mission success. This work introduces the interactive Swarm Leader Identification (iSLI) problem, a novel approach where an adversarial probing agent identifies a swarm’s leader by physically interacting with its members. We formulate the iSLI problem as a Partially Observable Markov Decision Process (POMDP) and employ Deep Reinforcement Learning, specifically Proximal Policy Optimization (PPO), to train the prober’s policy. The proposed approach utilizes a novel neural network architecture featuring a Timed Graph Relationformer (TGR) layer combined with a Simplified Structured State Space Sequence (S5) model. The TGR layer effectively processes graph-based observations of the swarm, capturing temporal dependencies and fusing relational information using a learned gating mechanism to generate informative representations for policy learning. Extensive simulations demonstrate that our TGR-based model outperforms baseline graph neural network architectures and exhibits significant zero-shot generalization capabilities across varying swarm sizes and speeds different from those used during training. The trained prober achieves high accuracy in identifying the leader, maintaining performance even in out-of-training distribution scenarios, and showing appropriate confidence levels in its predictions. Real-world experiments with physical robots further validate the approach, confirming successful sim-to-real transfer and robustness to dynamic changes, such as unexpected agent disconnections.
[AI-130] Unifying Causal Reinforcement Learning: Survey, Taxonomy, Algorithms and Applications
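[Quick Read]: This paper addresses the limitations of classical reinforcement learning (RL): low explainability, lack of robustness, and poor generalization. The key is to combine causal inference (CI) with RL into causal reinforcement learning (CRL), which explicitly models cause-and-effect relationships rather than relying on correlations alone, improving an agent's decisions under distribution shift, confounding variables, and dynamic environments.
Link: https://arxiv.org/abs/2512.18135
Authors: Cristiano da Costa Cunha,Wei Liu,Tim French,Ajmal Mian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages, 14 figures, 5 algorithms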
Abstract:Integrating causal inference (CI) with reinforcement learning (RL) has emerged as a powerful paradigm to address critical limitations in classical RL, including low explainability, lack of robustness and generalization failures. Traditional RL techniques, which typically rely on correlation-driven decision-making, struggle when faced with distribution shifts, confounding variables, and dynamic environments. Causal reinforcement learning (CRL), leveraging the foundational principles of causal inference, offers promising solutions to these challenges by explicitly modeling cause-and-effect relationships. In this survey, we systematically review recent advancements at the intersection of causal inference and RL. We categorize existing approaches into causal representation learning, counterfactual policy optimization, offline causal RL, causal transfer learning, and causal explainability. Through this structured analysis, we identify prevailing challenges, highlight empirical successes in practical applications, and discuss open problems. Finally, we provide future research directions, underscoring the potential of CRL for developing robust, generalizable, and interpretable artificial intelligence systems.
[AI-131] Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection WWW25 WWW’25
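[Quick Read]: This paper addresses the drop in effectiveness of Graph Fraud Detection (GFD) in financial scenarios when fraudsters adopt an Adaptive Camouflage strategy. Fraudsters mimic the behavioral data that platforms collect so that their key characteristics closely match those of benign users; this narrows the gap in behavioral traits and makes it hard for existing GFD models to tell the two groups apart. The key to the solution is Grad, a relation diffusion-based graph augmentation model. Grad first uses a supervised graph contrastive learning module to sharpen the fraud-benign difference, then uses a guided relation diffusion generator to build auxiliary homophilic relations from scratch, so that weak fraud signals are amplified during aggregation and become detectable. The method is validated on two real-world WeChat Pay datasets and three public datasets, with gains of up to 11.10% in AUC and 43.95% in average precision (AP).
Link: https://arxiv.org/abs/2512.18133
Authors: Jie Yang,Rui Zhang,Ziyang Cheng,Dawei Cheng,Guang Yang,Bo Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted by The Web Conference 2025 (WWW’25). 12 pages, includes implementation details. Code: this https URL and this https URL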
Abstract:Nowadays, Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic to protect online payment security. However, as organized crime groups are becoming more professional in real-world scenarios, fraudsters are employing more sophisticated camouflage strategies. Specifically, fraudsters disguise themselves by mimicking the behavioral data collected by platforms, ensuring that their key characteristics are consistent with those of benign users to a high degree, which we call Adaptive Camouflage. Consequently, this narrows the differences in behavioral traits between them and benign users within the platform’s database, thereby making current GFD models lose efficiency. To address this problem, we propose a relation diffusion-based graph augmentation model Grad. In detail, Grad leverages a supervised graph contrastive learning module to enhance the fraud-benign difference and employs a guided relation diffusion generator to generate auxiliary homophilic relations from scratch. Based on these, weak fraudulent signals would be enhanced during the aggregation process, thus being obvious enough to be captured. Extensive experiments have been conducted on two real-world datasets provided by WeChat Pay, one of the largest online payment platforms with billions of users, and three public datasets. The results show that our proposed model Grad outperforms SOTA methods in various scenarios, achieving at most 11.10% and 43.95% increases in AUC and AP, respectively. Our code is released at this https URL and this https URL.
[AI-132] Holistic Evaluation of State-of-the-Art LLMs for Code Generation
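[Quick Read]: This paper addresses the large performance differences and limited reliability of large language models (LLMs) on real code generation tasks, with a focus on correctness, efficiency, and robustness. The key is a systematic empirical evaluation of six state-of-the-art LLMs (general-purpose and code-specialized) on 944 real-world LeetCode problems, measured by compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities. The study identifies common failure modes (syntax errors, logical flaws, inefficient algorithms) and derives practical recommendations centered on careful model selection, effective prompt engineering, and context-aware usage, to make LLMs more reliable in real software development.
Link: https://arxiv.org/abs/2512.18131
Authors: Le Zhang,Suresh Kothari
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 figures, 6 tables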
Abstract:This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance using rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming others in terms of correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms, highlighting the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and practitioners, emphasizing that successful LLM deployment depends on careful model selection, effective prompt design, and context-aware usage to ensure reliable code generation in real-world software development tasks.
[AI-133] Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap
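[Quick Read]: This paper addresses the high serving latency of Mixture-of-Agents (MoA) inference, which stems from dense inter-agent communication and low hardware utilization. The solution is an algorithm-system co-design with three parts. First, the dense agent interaction graph is replaced with a hierarchical tree topology, which introduces structured sparsity and cuts communication overhead. Second, a runtime adaptive mechanism selectively terminates or skips downstream agent invocations based on semantic agreement and confidence signals from intermediate outputs. Third, a pipelined execution strategy overlaps incremental prefill with decoding across dependency-related agents, raising hardware utilization and lowering inference latency. Experiments show up to 90% lower end-to-end latency with accuracy within ±1% of dense-connectivity MoA baselines, and improved accuracy in certain settings.
Link: https://arxiv.org/abs/2512.18126
Authors: Zijun Wang,Yijiahao Qi,Hanqiu Chen,Zishen Wan,Gongjin Sun,Dongyang Li,Shuyi Pei,Cong Hao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 13 pages, 4 figures, submitted to Design Automation Conference (DAC) 2026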
Abstract:Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization, which jointly inflate serving latency. We present a serving design that targets these bottlenecks through an algorithm-system co-design. First, we replace dense agent interaction graphs with a hierarchical tree topology that induces structured sparsity in inter-agent communication. Second, we introduce a runtime adaptive mechanism that selectively terminates or skips downstream agent invocations using semantic agreement and confidence signals from intermediate outputs. Third, we pipeline agent execution by overlapping incremental prefilling with decoding across dependency-related agents, improving utilization and reducing inference latency. Across representative tasks, this approach substantially reduces end-to-end latency (up to 90%) while maintaining comparable accuracy (within ±1%) relative to dense-connectivity MoA baselines, and can improve accuracy in certain settings.
[AI-134] Rethinking Multi-Agent Intelligence Through the Lens of Small-World Networks
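[Quick Read]: This paper addresses the lack of structural guidance in designing communication topologies for LLM-based multi-agent systems (MAS): existing methods mostly use fully connected graphs, simple sparse rings, or ad-hoc dynamic selection, and so fail to exploit how network structure affects collaboration efficiency and stability. The key is to treat small-world (SW) connectivity as a design prior and to add an uncertainty-guided rewiring scheme: LLM-oriented uncertainty signals (e.g., semantic entropy) identify epistemically divergent agents, and long-range shortcuts are added between them to form controllable small-world structures. In multi-agent debate experiments, SW connectivity yields nearly the same accuracy and token cost while substantially stabilizing consensus trajectories, and the resulting structures adapt to task difficulty and agent heterogeneity.
Link: https://arxiv.org/abs/2512.18094
Authors: Boxuan Wang,Zhuoyun Li,Xiaowei Huang,Yi Dong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Under Review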
Abstract:Large language models (LLMs) have enabled multi-agent systems (MAS) in which multiple agents argue, critique, and coordinate to solve complex tasks, making communication topology a first-class design choice. Yet most existing LLM-based MAS either adopt fully connected graphs, simple sparse rings, or ad-hoc dynamic selection, with little structural guidance. In this work, we revisit classic theory on small-world (SW) networks and ask: what changes if we treat SW connectivity as a design prior for MAS? We first bridge insights from neuroscience and complex networks to MAS, highlighting how SW structures balance local clustering and long-range integration. Using multi-agent debate (MAD) as a controlled testbed, experiment results show that SW connectivity yields nearly the same accuracy and token cost, while substantially stabilizing consensus trajectories. Building on this, we introduce an uncertainty-guided rewiring scheme for scaling MAS, where long-range shortcuts are added between epistemically divergent agents using LLM-oriented uncertainty signals (e.g., semantic entropy). This yields controllable SW structures that adapt to task difficulty and agent heterogeneity. Finally, we discuss broader implications of SW priors for MAS design, framing them as stabilizers of reasoning, enhancers of robustness, scalable coordinators, and inductive biases for emergent cognitive roles.
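To make the small-world idea concrete, the sketch below builds a ring lattice of agents, adds long-range shortcuts between the non-adjacent pairs whose uncertainty scores differ most, and measures the two classic small-world statistics (clustering and mean path length). The scalar per-agent `score` and the "largest gap first" rule are assumptions chosen for illustration; the paper's actual rewiring scheme and its semantic-entropy signal are not reproduced here.

```python
from collections import deque


def ring_lattice(n, k):
    """Ring of n agents, each linked to its k nearest neighbours (k even, k < n)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def add_divergence_shortcuts(adj, score, m):
    """Add m links between the non-adjacent pairs whose scores differ most (hypothetical rule)."""
    out = {i: set(nb) for i, nb in adj.items()}
    pairs = [(abs(score[i] - score[j]), i, j)
             for i in out for j in out if i < j and j not in out[i]]
    pairs.sort(key=lambda t: (-t[0], t[1], t[2]))  # largest gap first, ties by index
    for _, i, j in pairs[:m]:
        out[i].add(j)
        out[j].add(i)
    return out


def clustering(adj):
    """Mean local clustering coefficient (nodes with degree < 2 contribute 0)."""
    total = 0.0
    for nb in adj.values():
        d = len(nb)
        if d < 2:
            continue
        links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
        total += links / (d * (d - 1) / 2)
    return total / len(adj)


def mean_path_length(adj):
    """Mean hop distance over all reachable ordered pairs, via BFS from every node."""
    total, count = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count
```

A single shortcut between two distant agents leaves local clustering largely intact while shortening the mean path length, which is the trade-off the small-world prior relies on.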
[AI-135] Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
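[Quick Read]: This paper addresses the missing theoretical foundation of neuron identification in the interpretability of deep neural networks, focusing on two core challenges: faithfulness (whether the identified concept truly reflects the neuron's underlying function) and stability (whether the identification results are consistent across probing datasets). The key is to view neuron identification as the inverse process of machine learning. From this view the authors derive generalization bounds for widely used similarity metrics (e.g., accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability, together with the BE (Bootstrap Explanation) method, which produces concept prediction sets with guaranteed coverage probability. Experiments on synthetic and real data support the theory and show the method is practical.
Link: https://arxiv.org/abs/2512.18092
Authors: Ge Yan,Tuomas Oikarinen,Tsui-Wei(Lily)Weng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: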
Abstract:Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success, a rigorous theoretical foundation remains absent, which is crucial to enable trustworthy and reliable explanations. In this work, we observe that neuron identification can be viewed as the inverse process of machine learning, which allows us to derive guarantees for neuron explanations. Based on this insight, we present the first theoretical analysis of two fundamental challenges: (1) Faithfulness: whether the identified concept faithfully represents the neuron’s underlying function and (2) Stability: whether the identification results are consistent across probing datasets. We derive generalization bounds for widely used similarity metrics (e.g. accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability, along with the BE (Bootstrap Explanation) method to generate concept prediction sets with guaranteed coverage probability. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method, providing an important step toward trustworthy neuron identification.
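The stability question above can be illustrated with a small bootstrap experiment: resample the probing set, pick the best-matching concept by IoU each time, and record how often each concept wins. This is only a sketch under simple assumptions (binary concept labels, a fixed activation threshold, ties broken alphabetically); it is not the paper's BE method and gives no coverage guarantee.

```python
import random
from collections import Counter


def iou(activations, concept, threshold):
    """IoU between the set of inputs where the neuron fires and the concept's inputs."""
    fired = [a > threshold for a in activations]
    inter = sum(f and c for f, c in zip(fired, concept))
    union = sum(f or c for f, c in zip(fired, concept))
    return inter / union if union else 0.0


def identify(activations, concepts, threshold, idx):
    """Concept name with the highest IoU on the probing subset idx."""
    acts = [activations[i] for i in idx]

    def score(name):
        return iou(acts, [concepts[name][i] for i in idx], threshold)

    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(concepts), key=score)


def bootstrap_stability(activations, concepts, threshold, n_boot, seed):
    """Fraction of bootstrap resamples in which each concept is selected."""
    rng = random.Random(seed)
    n = len(activations)
    votes = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        votes[identify(activations, concepts, threshold, idx)] += 1
    return {name: v / n_boot for name, v in votes.items()}
```

A concept that wins in nearly every resample is a stable explanation; a split vote signals that the result depends on the particular probing data.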
[AI-136] From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
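[Quick Read]: This paper addresses the lack of reliable, human-centered evaluation standards for agentic "prompt-to-app" generation systems: visual polish and functional correctness are often misaligned in existing tools, which makes objective comparison difficult. The key is a human-centered benchmark built on a large-scale human-rater study (205 participants, 1,071 quality-filtered pairwise comparisons) that evaluates three widely used platforms (Replit, Bolt, and Firebase Studio) on task-based ease of use, visual appeal, perceived completeness, and user trust. The results show clear differences between platforms and that strong visuals do not imply functional reliability, underscoring the need for interactive, task-based evaluation; the benchmark framework, prompt set, and generated artifacts are released for reproducible research.
Link: https://arxiv.org/abs/2512.18080
Authors: Marcos Ortiz,Justin Hill,Collin Overbay,Ingrida Semenec,Frederic Sauve-Hoover,Jim Schwoebel,Joel Shor
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: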
Abstract:Agentic AI systems capable of generating full-stack web applications from natural language prompts (“prompt-to-app”) represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived completeness, and user trust. Our results show that these systems are not interchangeable: Firebase Studio consistently outperforms competing platforms across all human-evaluated dimensions, achieving the highest win rates for ease of use, trust, visual appeal, and visual appropriateness. Bolt performs competitively on visual appeal but trails Firebase on usability and trust, while Replit underperforms relative to both across most metrics. These findings highlight a persistent gap between visual polish and functional reliability in prompt-to-app systems and demonstrate the necessity of interactive, task-based evaluation. We release our benchmark framework, prompt set, and generated artifacts to support reproducible evaluation and future research in agentic application generation.
[AI-137] Characterising Behavioural Families and Dynamics of Promotional Twitter Bots via Sequence-Based Modelling
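[Quick Read]: This paper asks whether promotional Twitter/X bots form behavioural families and whether members of a family evolve in similar ways. The key is to encode each bot's behaviour as a sequence of symbolic blocks ("digital DNA") built from seven post-level behavioural features (posting action, URL, media, text duplication, hashtags, emojis, sentiment), form block-frequency vectors, and apply cosine similarity with hierarchical clustering. This yields four coherent families: Unique Tweeters, Duplicators with URLs, Content Multipliers, and Informed Contributors. Behavioural change is then quantified as mutations: sequences within each family are aligned via multiple sequence alignment (MSA), and events are labelled as insertions, deletions, substitutions, alterations, and identity, which exposes per-family mutation rates, change-prone blocks, and hotspots. The study further finds that bots in the same family share mutations more often, that closer bots share and propagate mutations more than distant ones, and that responses to external triggers (e.g., Christmas, Halloween) follow family-specific patterns, giving a fine-grained account of how promotional bot behaviour changes over time.
Link: https://arxiv.org/abs/2512.18077
Authors: Ohoud Alzahrani,Russell Beale,Robert J. Hendley
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: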
Abstract:This paper asks whether promotional Twitter/X bots form behavioural families and whether members evolve similarly. We analyse 2,798,672 tweets from 2,615 ground-truth promotional bot accounts (2006-2021), focusing on complete years 2009 to 2020. Each bot is encoded as a sequence of symbolic blocks (“digital DNA”) from seven categorical post-level behavioural features (posting action, URL, media, text duplication, hashtags, emojis, sentiment), preserving temporal order only. Using non-overlapping blocks (k=7), cosine similarity over block-frequency vectors, and hierarchical clustering, we obtain four coherent families: Unique Tweeters, Duplicators with URLs, Content Multipliers, and Informed Contributors. Families share behavioural cores but differ systematically in engagement strategies and life-cycle dynamics (beginning/middle/end). We then model behavioural change as mutations. Within each family we align sequences via multiple sequence alignment (MSA) and label events as insertions, deletions, substitutions, alterations, and identity. This quantifies mutation rates, change-prone blocks/features, and mutation hotspots. Deletions and substitutions dominate, insertions are rare, and mutation profiles differ by family, with hotspots early for some families and dispersed for others. Finally, we test predictive value: bots within the same family share mutations more often than bots across families; closer bots share and propagate mutations more than distant ones; and responses to external triggers (e.g., Christmas, Halloween) follow family-specific, partly predictable patterns. Overall, sequence-based family modelling plus mutation analysis provides a fine-grained account of how promotional bot behaviour adapts over time.
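The block-frequency comparison described above can be sketched in a few lines. The example assumes each bot's history is already a string of behaviour symbols and that a block is k consecutive symbols; how the paper maps its seven features onto symbols and blocks is not specified here, so the encoding (and every function name) is a stand-in.

```python
from collections import Counter
from math import sqrt


def blocks(sequence, k=7):
    """Split a symbol sequence into non-overlapping blocks of length k, dropping a short tail."""
    return [tuple(sequence[i:i + k]) for i in range(0, len(sequence) - k + 1, k)]


def block_frequencies(sequence, k=7):
    """Sparse block-frequency vector: block -> number of occurrences."""
    return Counter(blocks(sequence, k))


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Pairwise `1 - cosine(a, b)` distances of this kind are what a hierarchical clustering routine would then take as input to form families.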
[AI-138] Securing Agentic AI Systems – A Multilayer Security Framework
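[Quick Read]: This paper addresses the complex cybersecurity risks that autonomous decision-making and adaptive behavior introduce when agentic AI systems are deployed, including unauthorized actions, adversarial manipulation, and dynamic environmental interactions, which existing AI security frameworks do not adequately cover. The key is MAAIS (Multi-Layered Agentic AI Security Framework), a lifecycle-aware security framework. Its core contribution is to embed the CIAA concept (Confidentiality, Integrity, Availability, and Accountability) across the agentic AI lifecycle and to maintain these properties through multiple defense layers. The framework is developed with the Design Science Research (DSR) methodology and validated by mapping it to the MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) tactics, offering enterprise CISOs, security teams, and AI engineering teams a structured, standardized path to security governance.
Link: https://arxiv.org/abs/2512.18043
Authors: Sunil Arora,John Hastings
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 6 pages, 2 figures, 1 table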
Abstract:Securing Agentic Artificial Intelligence (AI) systems requires addressing the complex cyber risks introduced by autonomous, decision-making, and adaptive behaviors. Agentic AI systems are increasingly deployed across industries, organizations, and critical sectors such as cybersecurity, finance, and healthcare. However, their autonomy introduces unique security challenges, including unauthorized actions, adversarial manipulation, and dynamic environmental interactions. Existing AI security frameworks do not adequately address these challenges or the unique nuances of agentic AI. This research develops a lifecycle-aware security framework specifically designed for agentic AI systems using the Design Science Research (DSR) methodology. The paper introduces MAAIS, an agentic security framework, and the agentic AI CIAA (Confidentiality, Integrity, Availability, and Accountability) concept. MAAIS integrates multiple defense layers to maintain CIAA across the AI lifecycle. Framework validation is conducted by mapping with the established MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) AI tactics. The study contributes a structured, standardized, and framework-based approach for the secure deployment and governance of agentic AI in enterprise environments. This framework is intended for enterprise CISOs, security, AI platform, and engineering teams and offers a detailed step-by-step approach to securing agentic AI workloads.
[AI-139] Conflict-Driven Clause Learning with VSIDS Heuristics for Discrete Facility Layout
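[Quick Read]: This paper addresses feasibility detection and optimization for the Discrete Facility Layout Problem (DFLP), a combinatorial assignment problem with dense logical structure arising from adjacency, separation, and slot-availability constraints. The key is to use Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as the computational engine, with a CNF (conjunctive normal form) formulation of layout feasibility, and to compare it against CP-SAT and MILP (mixed-integer linear programming) under a unified benchmarking framework. Empirically, CDCL shows near-constant runtime for feasibility detection across problem sizes and constraint densities, while CP-SAT and MILP scale polynomially and exponentially, respectively. To handle objective optimization, the paper proposes two hybrid architectures: one uses CDCL to enumerate feasible layouts quickly, trading optimality for speed; the other uses CDCL to generate warm-start solutions that accelerate exact CP-SAT optimization. Both reduce time-to-solution while preserving correctness guarantees, and clarify the trade-offs between clause-learning search and exact optimization.
Link: https://arxiv.org/abs/2512.18034
Authors: Joshua Gibson,Kapil Dhakal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: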
Abstract:This paper studies the use of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as a computational engine for discrete facility layout problems. The facility layout problem is modeled as a combinatorial assignment problem with dense logical structure arising from adjacency, separation, and slot-availability constraints. We develop a CNF-based formulation for layout feasibility and compare CDCL-based SAT solving against CP-SAT and MILP formulations under a unified benchmarking framework. Empirical results show that CDCL exhibits near-constant runtime behavior for feasibility detection across increasing problem sizes and constraint densities, while CP-SAT and MILP display polynomial and exponential scaling respectively. To address the limitation of CDCL in objective optimization, we introduce two hybrid architectures that combine CDCL-based feasibility search with CP-SAT optimization. The first architecture rapidly enumerates feasible layouts to trade optimality for speed, while the second uses CDCL to generate warm-start solutions that accelerate exact optimization. The results demonstrate that hybrid approaches can significantly reduce time-to-solution while preserving correctness guarantees, clarifying the algorithmic trade-offs between clause-learning search and exact optimization methods in large-scale discrete layout problems.
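A CNF formulation of layout feasibility might look roughly like the sketch below: one Boolean variable per (facility, slot) pair, exactly-one-slot clauses per facility, at-most-one-facility clauses per slot, and optional separation clauses. The variable numbering, the assumption that slots lie on a line, and the `forbidden_adjacent` parameter are all illustrative choices, not the paper's encoding. A brute-force model counter stands in for a CDCL solver so the example stays dependency-free; it is only usable on toy instances.

```python
from itertools import combinations, product


def var(f, s, n_slots):
    """DIMACS-style variable id (1-based) meaning 'facility f occupies slot s'."""
    return f * n_slots + s + 1


def layout_cnf(n_fac, n_slots, forbidden_adjacent=()):
    """Clauses as lists of signed ints; slots are assumed to lie on a line 0..n_slots-1."""
    clauses = []
    for f in range(n_fac):
        # Each facility takes at least one slot ...
        clauses.append([var(f, s, n_slots) for s in range(n_slots)])
        # ... and at most one slot.
        for s1, s2 in combinations(range(n_slots), 2):
            clauses.append([-var(f, s1, n_slots), -var(f, s2, n_slots)])
    for s in range(n_slots):
        # Each slot holds at most one facility.
        for f1, f2 in combinations(range(n_fac), 2):
            clauses.append([-var(f1, s, n_slots), -var(f2, s, n_slots)])
    for f1, f2 in forbidden_adjacent:
        # Separation: f1 and f2 may not occupy neighbouring slots, in either order.
        for s in range(n_slots - 1):
            clauses.append([-var(f1, s, n_slots), -var(f2, s + 1, n_slots)])
            clauses.append([-var(f2, s, n_slots), -var(f1, s + 1, n_slots)])
    return clauses


def count_models(clauses, n_vars):
    """Brute-force count of satisfying assignments (a stand-in for a real SAT solver)."""
    count = 0
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in clauses):
            count += 1
    return count
```

With three facilities and three slots the unconstrained encoding has one model per permutation; adding a separation constraint between two facilities prunes the layouts where they sit side by side. The same clause lists could be handed to any CDCL solver that accepts DIMACS-style literals.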
[AI-140] A Dataset and Benchmarks for Atrial Fibrillation Detection from Electrocardiograms of Intensive Care Unit Patients
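[Quick Read]: This paper addresses accurate automatic detection of atrial fibrillation (AF) in the intensive care unit (ICU); AF is the most common cardiac arrhythmia among ICU patients and can cause adverse health effects. The key is a systematic comparison of three data-driven AI approaches, namely feature-based classifiers, deep learning (DL), and ECG foundation models (FMs), under multiple training configurations ranging from zero-shot inference to transfer learning. ECG foundation models perform best on average, and the top F1 score on the ICU test set is achieved by ECG-FM with transfer learning (F1=0.89), highlighting the value of pretrained foundation models for AF detection in the ICU.
Link: https://arxiv.org/abs/2512.18031
Authors: Sarah Nassar,Nooshin Maghsoodi,Sophia Mannina,Shamel Addas,Stephanie Sibley,Gabor Fichtinger,David Pichora,David Maslove,Purang Abolmaesumi,Parvin Mousavi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 6 tables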
Abstract:Objective: Atrial fibrillation (AF) is the most common cardiac arrhythmia experienced by intensive care unit (ICU) patients and can cause adverse health effects. In this study, we publish a labelled ICU dataset and benchmarks for AF detection. Methods: We compared machine learning models across three data-driven artificial intelligence (AI) approaches: feature-based classifiers, deep learning (DL), and ECG foundation models (FMs). This comparison addresses a critical gap in the literature and aims to pinpoint which AI approach is best for accurate AF detection. Electrocardiograms (ECGs) from a Canadian ICU and the 2021 PhysioNet/Computing in Cardiology Challenge were used to conduct the experiments. Multiple training configurations were tested, ranging from zero-shot inference to transfer learning. Results: On average and across both datasets, ECG FMs performed best, followed by DL, then feature-based classifiers. The model that achieved the top F1 score on our ICU test set was ECG-FM through a transfer learning strategy (F1=0.89). Conclusion: This study demonstrates promising potential for using AI to build an automatic patient monitoring system. Significance: By publishing our labelled ICU dataset (LinkToBeAdded) and performance benchmarks, this work enables the research community to continue advancing the state-of-the-art in AF detection in the ICU.
[AI-141] Specification and Detection of LLM Code Smells ICSE-2026 ICSE
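[Quick Read]: As large language models (LLMs) are increasingly integrated into software systems, poor integration can degrade code quality, yet there has been no formal catalog of code smells specific to coding practices for LLM inference. The key contributions are: introducing the concept of LLM code smells and formalizing five recurrent problematic coding practices related to LLM inference, based on the literature; extending the detection tool SpecDetect4AI to cover these smells; and validating their prevalence on 200 open-source LLM systems. The smells affect 60.50% of the analyzed systems, with a detection precision of 86.06%, giving a measurable and detectable basis for improving the quality of LLM integration code.
Link: https://arxiv.org/abs/2512.18020
Authors: Brahim Mahmoudi,Zacharie Chenail-Larcher,Naouel Moha,Quentin Stievenert,Florent Avellaneda
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted paper at ICSE NIER 2026: this https URL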
Abstract:Large Language Models (LLMs) have gained massive popularity in recent years and are increasingly integrated into software systems for diverse purposes. However, poorly integrating them in source code may undermine software system quality. Yet, to our knowledge, there is no formal catalog of code smells specific to coding practices for LLM inference. In this paper, we introduce the concept of LLM code smells and formalize five recurrent problematic coding practices related to LLM inference in software systems, based on relevant literature. We extend the detection tool SpecDetect4AI to cover the newly defined LLM code smells and use it to validate their prevalence in a dataset of 200 open-source LLM systems. Our results show that LLM code smells affect 60.50% of the analyzed systems, with a detection precision of 86.06%.
[AI-142] A Hybrid Inductive-Transductive Network for Traffic Flow Imputation on Unsampled Locations
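[Quick Read]: This paper addresses accurate imputation of traffic flow at unsensed locations (those without detectors). The challenges are that loop detectors give precise but sparse flow measurements, probe-vehicle speed is widely available but only weakly correlated with flow, and nearby links often show strong heterophily, which violates the homophily assumption of standard graph neural networks (GNNs). The key is HINT (Hybrid INductive-Transductive Network) and its INDU-TRANSDUCTIVE training strategy. HINT couples an inductive spatial transformer that learns similarity-driven, long-range interactions from node features with a diffusion GCN conditioned by FiLM on static context (OSM-derived attributes and traffic simulation), plus a node-wise calibration layer that corrects per-segment scale biases. Training uses masked reconstruction, epoch-wise node sampling, hard-node mining, and noise injection on visible flows, with the graph built from driving distances, yielding accurate flow estimates at unseen locations.
Link: https://arxiv.org/abs/2512.17984
Authors: Mohammadmahdi Rahimiasl,Ynte Vanderhoydonc,Siegfried Mercelis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures, 3 tables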
Abstract:Accurately imputing traffic flow at unsensed locations is difficult: loop detectors provide precise but sparse measurements, speed from probe vehicles is widely available yet only weakly correlated with flow, and nearby links often exhibit strong heterophily in the scale of traffic flow (e.g., ramps vs. mainline), which breaks standard GNN assumptions. We propose HINT, a Hybrid INductive-Transductive Network, and an INDU-TRANSDUCTIVE training strategy that treats speed as a transductive, network-wide signal while learning flow inductively to generalize to unseen locations. HINT couples (i) an inductive spatial transformer that learns similarity-driven, long-range interactions from node features with (ii) a diffusion GCN conditioned by FiLM on rich static context (OSM-derived attributes and traffic simulation), and (iii) a node-wise calibration layer that corrects scale biases per segment. Training uses masked reconstruction with epoch-wise node sampling, hard-node mining to emphasize difficult sensors, and noise injection on visible flows to prevent identity mapping, while graph structure is built from driving distances. Across three real-world datasets, MOW (Antwerp, Belgium), UTD19-Torino, and UTD19-Essen, HINT consistently surpasses state-of-the-art inductive baselines. Relative to KITS, HINT reduces MAE on MOW by ≈42% with basic simulation and ≈50% with calibrated simulation; on Torino by ≈22%, and on Essen by ≈12%. Even without simulation, HINT remains superior on MOW and Torino, while simulation is crucial on Essen. These results show that combining inductive flow imputation with transductive speed, traffic simulations and external geospatial data improves accuracy for the task described above.
MSC classes: 68T07, 90B20; ACM classes: I.2.6; I.5.1; Cite as: arXiv:2512.17984 [cs.LG]; DOI: https://doi.org/10.48550/arXiv.2512.17984
[AI-143] Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models
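[Quick Read]: This paper addresses the computational cost of adapting large pretrained models to new domains on resource-constrained devices, specifically for Human Activity Recognition (HAR), where full fine-tuning is hard to deploy because of its memory footprint and training time. The key is parameter-efficient fine-tuning with Low-Rank Adaptation (LoRA) and its quantized variant (Quantized LoRA, QLoRA): only a small number of low-rank matrices are trained instead of the full model, which matches the recognition performance of full fine-tuning while significantly reducing trainable parameters, memory usage, and training time. Further analysis shows that LoRA stays robust under limited supervision and that the adapter rank offers a controllable trade-off between accuracy and efficiency; QLoRA further shrinks the memory footprint by quantizing the frozen weights, with minimal impact on classification quality.
Link: https://arxiv.org/abs/2512.17983
Authors: Irina Seregina,Philippe Lalanda,German Vega
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: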
Abstract:Human Activity Recognition is a foundational task in pervasive computing. While recent advances in self-supervised learning and transformer-based architectures have significantly improved HAR performance, adapting large pretrained models to new domains remains a practical challenge due to limited computational resources on target devices. This paper investigates parameter-efficient fine-tuning techniques, specifically Low-Rank Adaptation (LoRA) and Quantized LoRA, as scalable alternatives to full model fine-tuning for HAR. We propose an adaptation framework built upon a Masked Autoencoder backbone and evaluate its performance under a Leave-One-Dataset-Out validation protocol across five open HAR datasets. Our experiments demonstrate that both LoRA and QLoRA can match the recognition performance of full fine-tuning while significantly reducing the number of trainable parameters, memory usage, and training time. Further analyses reveal that LoRA maintains robust performance even under limited supervision and that the adapter rank provides a controllable trade-off between accuracy and efficiency. QLoRA extends these benefits by reducing the memory footprint of frozen weights through quantization, with minimal impact on classification quality.
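The parameter savings mentioned above follow directly from the LoRA update rule, y = W x + (alpha / r) · B (A x), where W stays frozen and only the rank-r factors A and B are trained. The sketch below shows that rule for a single linear layer with plain Python lists; the layer sizes in the usage note are illustrative assumptions, not the paper's configuration, and quantization of W (the QLoRA part) is not modelled.

```python
def matvec(m, x):
    """Dense matrix-vector product for list-of-lists matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """y = W x + (alpha / r) * B (A x).

    w: d_out x d_in frozen weight, a: r x d_in, b: d_out x r (trainable factors).
    """
    r = len(a)
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / r
    return [y + scale * d for y, d in zip(base, delta)]


def trainable_params(d_out, d_in, r):
    """Trainable parameter counts for full fine-tuning vs. a rank-r LoRA adapter."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}
```

For a hypothetical 768 x 768 projection with rank 8, `trainable_params(768, 768, 8)` gives 589,824 parameters for full fine-tuning against 12,288 for the adapter. Initialising B to zeros, as is common, makes the adapted layer start out identical to the frozen one.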
[AI-144] Adaptive Agents in Spatial Double-Auction Markets: Modeling the Emergence of Industrial Symbiosis AAMAS2026 AAMAS
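[Quick Read]: This paper addresses why industrial symbiosis is hard to form spontaneously in real space: socio-spatial frictions shape costs, limit matching opportunities, and reduce market efficiency. Existing models often overlook the interaction between spatial structure, market design, and adaptive firm behavior, which limits understanding of how symbiosis networks form. The key is an agent-based model (ABM) in which heterogeneous firms trade byproducts through a spatially embedded double-auction market, with prices and quantities emerging endogenously from local interactions. Reinforcement learning lets firms adapt their bidding strategies to maximize profit while accounting for transport costs, disposal penalties, and resource scarcity. Simulations reveal the economic and spatial conditions under which decentralized exchange converges to stable and efficient outcomes, and show how spatial structure and market parameters jointly govern circularity, providing a basis for exploring policy interventions.
Link: https://arxiv.org/abs/2512.17979
Authors: Matthieu Mastio,Paul Saves,Benoit Gaudou,Nicolas Verstaevel
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Applications (stat.AP)
Comments: AAMAS CC-BY 4.0 licence. Adaptive Agents in Spatial Double-Auction Markets: Modeling the Emergence of Industrial Symbiosis. Full paper. In Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25 - 29, 2026, IFAAMAS, 10 pages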
Abstract:Industrial symbiosis fosters circularity by enabling firms to repurpose residual resources, yet its emergence is constrained by socio-spatial frictions that shape costs, matching opportunities, and market efficiency. Existing models often overlook the interaction between spatial structure, market design, and adaptive firm behavior, limiting our understanding of where and how symbiosis arises. We develop an agent-based model where heterogeneous firms trade byproducts through a spatially embedded double-auction market, with prices and quantities emerging endogenously from local interactions. Leveraging reinforcement learning, firms adapt their bidding strategies to maximize profit while accounting for transport costs, disposal penalties, and resource scarcity. Simulation experiments reveal the economic and spatial conditions under which decentralized exchanges converge toward stable and efficient outcomes. Counterfactual regret analysis shows that sellers’ strategies approach a near Nash equilibrium, while sensitivity analysis highlights how spatial structures and market parameters jointly govern circularity. Our model provides a basis for exploring policy interventions that seek to align firm incentives with sustainability goals, and more broadly demonstrates how decentralized coordination can emerge from adaptive agents in spatially constrained markets.
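[AI-145] CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs NEURIPS2025
[Quick read]: This paper targets the high latency and cache pressure that dequantization causes in low-bit (e.g., 2-bit) quantized LLM inference. Codebook-based methods keep accuracy high at extremely low bit widths, but their kernels repeatedly fetch centroids and reconstruct weights, which creates a significant memory bottleneck. The key to the solution is CodeGEMM, a codebook-centric GEMM kernel that precomputes the inner products between centroids and activations and stores them in a lightweight Psumbook. At inference, code indices gather these partial sums directly, which eliminates per-element lookups, shrinks the on-chip footprint, and improves compute efficiency. The kernel supports systematic exploration of latency-memory-accuracy trade-offs under a unified implementation and reaches speedups of up to 8.93x on Llama-3 models over state-of-the-art codebook-based quantization.
Link: https://arxiv.org/abs/2512.17970
Authors: Gunho Park,Jeongin Bae,Byeongwook Kim,Baeseong park,Jiwon Ryu,Hoseung Kim,Se Jung Kwon,Dongsoo Lee
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025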
Abstract:Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.
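To make the Psumbook idea concrete, here is a minimal pure-Python sketch. It is not the CodeGEMM kernel: the function names, the tiny codebook, and the data layout are assumptions chosen for illustration. It only shows that gathering precomputed centroid-activation inner products by code index gives the same result as dequantizing the weights first.

```python
def dequant_matvec(codebook, codes, x):
    """Baseline: rebuild each weight row from centroids, then take the dot product."""
    out = []
    for row in codes:
        weights = [w for idx in row for w in codebook[idx]]
        out.append(sum(w * xi for w, xi in zip(weights, x)))
    return out


def psumbook_matvec(codebook, codes, x):
    """Precompute centroid-activation inner products once, then gather by code index."""
    v = len(codebook[0])          # centroid (sub-vector) length
    n_groups = len(x) // v        # number of activation chunks
    # psumbook[g][k] = <centroid k, g-th chunk of x>  (hypothetical layout)
    psumbook = [
        [sum(c * xi for c, xi in zip(centroid, x[g * v:(g + 1) * v]))
         for centroid in codebook]
        for g in range(n_groups)
    ]
    # Each output element is a sum of gathered partial sums; no weights are rebuilt.
    return [sum(psumbook[g][idx] for g, idx in enumerate(row)) for row in codes]


# Made-up toy data: 4 centroids of length 2, so each code index fits in 2 bits.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
codes = [[0, 3], [2, 1], [3, 3]]      # 3 weight rows, 2 code indices per row
x = [0.5, -1.0, 2.0, 3.0]

print(dequant_matvec(codebook, codes, x))
print(psumbook_matvec(codebook, codes, x))
```

The table costs one inner product per (chunk, centroid) pair and is shared by every weight row, so the saving grows with the number of rows that reuse it.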
[AI-146] Convolutional-neural-operator-based transfer learning for solving PDEs
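[Quick read]: This paper addresses the fact that the convolutional neural operator has not yet been validated in few-shot learning scenarios. The solution is a two-stage strategy: first pre-train the model on a source dataset to capture the general structure of PDE solution operators, then adjust the parameters of the trained model using only a small target dataset. The study compares three adjustment strategies (fine-tuning, low-rank adaptation, and neuron linear transformation) and finds that neuron linear transformation achieves the highest surrogate accuracy on PDEs such as the Kuramoto-Sivashinsky equation, the Brusselator diffusion-reaction system, and the Navier-Stokes equations.
Link: https://arxiv.org/abs/2512.17969
Authors: Peng Fan,Guofei Pang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph)
Comments: 12 pages, 4 figures, 2 tables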
Abstract:The convolutional neural operator is a CNN-based architecture recently proposed to enforce structure-preserving continuous-discrete equivalence and enable the genuine, alias-free learning of solution operators of PDEs. In certain cases, this neural operator was shown to outperform baseline models such as DeepONet, Fourier neural operator, and Galerkin transformer in terms of surrogate accuracy. The convolutional neural operator, however, does not appear to have been validated for few-shot learning. We extend the model to few-shot learning scenarios by first pre-training a convolutional neural operator using a source dataset and then adjusting the parameters of the trained neural operator using only a small target dataset. We investigate three strategies for adjusting the parameters of a trained neural operator, including fine-tuning, low-rank adaptation, and neuron linear transformation, and find that the neuron linear transformation strategy achieves the highest surrogate accuracy in solving PDEs such as the Kuramoto-Sivashinsky equation, the Brusselator diffusion-reaction system, and the Navier-Stokes equations.
[AI-147] Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization
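[Quick read]: This paper addresses how service robots in public spaces can recognize human behavioral intent accurately and in real time for natural human-robot interaction. Existing methods usually depend on RGB-D sensors or GPU acceleration, which makes them hard to deploy on resource-constrained embedded devices, and natural human-robot interaction data is severely class-imbalanced, which hurts generalization. The key to the solution is a lightweight multimodal framework that fuses 2D skeletal pose and facial emotion features extracted from monocular RGB video for frame-accurate intent detection, together with MINT-RVAE (Multimodal Recurrent Variational Autoencoder for Intent Sequence Generation), which synthesizes temporally coherent sequences to re-balance the data. The system runs on a Raspberry Pi 5 (CPU-only). Offline evaluation under cross-subject and cross-scene protocols gives frame- and sequence-level AUROC of 0.95, and cross-camera deployment on the MIRA robot head reaches 91% accuracy and 100% recall.
Link: https://arxiv.org/abs/2512.17958
Authors: Farida Mohsen,Ali Safa
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: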
Abstract:Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a practical multimodal framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video. Unlike prior methods requiring RGB-D sensors or GPU acceleration, our approach runs on resource-constrained embedded hardware (Raspberry Pi 5, CPU-only). To address the severe class imbalance in natural human-robot interaction datasets, we introduce a novel approach to synthesize temporally coherent pose-emotion-label sequences for data re-balancing called MINT-RVAE (Multimodal Recurrent Variational Autoencoder for Intent Sequence Generation). Comprehensive offline evaluations under cross-subject and cross-scene protocols demonstrate strong generalization performance, achieving frame- and sequence-level AUROC of 0.95. Crucially, we validate real-world generalization through cross-camera evaluation on the MIRA robot head, which employs a different onboard RGB sensor and operates in uncontrolled environments not represented in the training data. Despite this domain shift, the deployed system achieves 91% accuracy and 100% recall across 32 live interaction trials. The close correspondence between offline and deployed performance confirms the cross-sensor and cross-environment robustness of the proposed multimodal approach, highlighting its suitability for ubiquitous multimedia-enabled social robots.
[AI-148] Victor Calibration (VC): Multi-Pass Confidence Calibration and CP4.3 Governance Stress Test under Round-Table Orchestration
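[Quick read]: This paper addresses the way safety alignment can make frontier large language models (LLMs) overly conservative, degrading collaboration through hedging or false refusals. The key to the solution is a lightweight toolkit with three components: (1) Victor Calibration (VC), a multi-pass protocol that elicits a scalar confidence proxy T (T0 → T1 → T2) through iterative re-evaluation of the evidence; (2) FD-Lite, a behavior-only phenomenology audit that uses a fixed anchor phrase and a meta-prefix trap to avoid anthropomorphic claims; and (3) CP4.3, a governance stress test for rank invariance and allocation monotonicity (M6). Across the Claude 4.5 models and Opus, VC trajectories are monotonic without violating safety invariants and CP4.3 behavior is stable. The study was run by a single operator (n=1) and is presented as hypothesis-generating.
Link: https://arxiv.org/abs/2512.17956
Authors: Victor Stasiuc,Round Table Collaboration
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure, 4 tables. Exploratory case study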
Abstract:Safety alignment can make frontier LMs overly conservative, degrading collaboration via hedging or false refusals. We present a lightweight toolkit with three parts: (1) Victor Calibration (VC), a multi-pass protocol that elicits a scalar confidence proxy T (T0 → T1 → T2) through iterative evidence re-evaluation; (2) FD-Lite, a behavior-only phenomenology audit with a fixed anchor phrase and a meta-prefix trap to avoid anthropomorphic claims; and (3) CP4.3, a governance stress test for rank invariance and allocation monotonicity (M6). Across Claude 4.5 models (Haiku, Sonnet no-thinking, Sonnet thinking) and Opus, we observe monotonic VC trajectories without violating safety invariants, and stable CP4.3 behavior. (“Opus” here refers to a single Claude Opus 4.1 session accessed via a standard UI account, as reported in Table 1.) This work was conducted by a single operator (n=1) and is intended as hypothesis-generating; we explicitly invite replication, critique, and extension by the research community. We include prompt templates and an artifact plan to facilitate independent verification.
[AI-149] Will AI Trade? A Computational Inversion of the No-Trade Theorem
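[Quick read]: This paper asks whether trade can still arise among AI agents that hold common beliefs, purely because of limits on their computational power. Classic no-trade theorems attribute trade to heterogeneous beliefs; the paper re-examines that conclusion under bounded computational rationality. The key to the solution is a model built on an unfolding game framework in which an agent's computational power determines the complexity of its strategy. A stable no-trade outcome (a Nash equilibrium) is reached only when almost-rational agents have slightly different computational power. When agents have identical power, they may fail to converge, and the resulting persistent strategic adjustment constitutes a form of out-of-equilibrium trade. The result suggests that the computational limits of AI agents can produce a livelier and less predictable trading environment than traditional models predict.
Link: https://arxiv.org/abs/2512.17952
Authors: Hanyu Li,Xiaotie Deng
Affiliations: unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
Comments: Accepted in WINE 2025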
Abstract:Classic no-trade theorems attribute trade to heterogeneous beliefs. We re-examine this conclusion for AI agents, asking if trade can arise from computational limitations, under common beliefs. We model agents’ bounded computational rationality within an unfolding game framework, where computational power determines the complexity of its strategy. Our central finding inverts the classic paradigm: a stable no-trade outcome (Nash equilibrium) is reached only when “almost rational” agents have slightly different computational power. Paradoxically, when agents possess identical power, they may fail to converge to equilibrium, resulting in persistent strategic adjustments that constitute a form of trade. This instability is exacerbated if agents can strategically under-utilize their computational resources, which eliminates any chance of equilibrium in Matching Pennies scenarios. Our results suggest that the inherent computational limitations of AI agents can lead to situations where equilibrium is not reached, creating a more lively and unpredictable trade environment than traditional models would predict.
[AI-150] Which Coauthor Should I Nominate in My 99 ICLR Submissions? A Mathematical Analysis of the ICLR 2026 Reciprocal Reviewer Nomination Policy
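[Quick read]: This paper studies the reviewer nomination policy that venues such as ICLR 2026 introduced to cope with the surge in AI conference submissions. Under the policy, every submission must nominate one of its authors as a reviewer, and any paper whose nominated reviewer is judged irresponsible is desk-rejected. Taking the perspective of author welfare, the paper asks how to choose nominees so as to minimize the risk of desk rejection when each author has some probability of being irresponsible. The key to the solution is to formalize and analyze three variants of the desk-rejection risk minimization problem: the basic version, which minimizes expected desk rejections, is solved optimally by a greedy algorithm; the variants with hard and soft nomination limits are handled through classical optimization frameworks such as minimum-cost flow and linear programming, yielding efficient nomination strategies with theoretical guarantees that avoid widespread rejections caused by a single irresponsible reviewer.
Link: https://arxiv.org/abs/2512.17950
Authors: Zhao Song,Song Yue,Jiahao Zhang
Affiliations: unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: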
Abstract:The rapid growth of AI conference submissions has created an overwhelming reviewing burden. To alleviate this, recent venues such as ICLR 2026 introduced a reviewer nomination policy: each submission must nominate one of its authors as a reviewer, and any paper nominating an irresponsible reviewer is desk-rejected. We study this new policy from the perspective of author welfare. Assuming each author carries a probability of being irresponsible, we ask: how can authors (or automated systems) nominate reviewers to minimize the risk of desk rejections? We formalize and analyze three variants of the desk-rejection risk minimization problem. The basic problem, which minimizes expected desk rejections, is solved optimally by a simple greedy algorithm. We then introduce hard and soft nomination limit variants that constrain how many papers may nominate the same author, preventing widespread failures if one author is irresponsible. These formulations connect to classical optimization frameworks, including minimum-cost flow and linear programming, allowing us to design efficient, principled nomination strategies. Our results provide the first theoretical study for reviewer nomination policies, offering both conceptual insights and practical directions for authors to wisely choose which co-author should serve as the nominated reciprocal reviewer.
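The basic variant can be illustrated with a short sketch. The abstract does not spell out the greedy rule, so the version below is one natural reading rather than the authors' algorithm: since the expected number of desk rejections is the sum over papers of the nominee's irresponsibility probability, each paper simply nominates its lowest-risk author. The paper IDs, author names, and probabilities are made up.

```python
def greedy_nominations(papers, p_irresponsible):
    """Nominate, for each paper, the co-author least likely to be irresponsible.

    papers: dict mapping paper id -> list of author names
    p_irresponsible: dict mapping author name -> probability in [0, 1]
    Returns (nominations, expected number of desk rejections).
    """
    nominations = {
        paper: min(authors, key=lambda a: p_irresponsible[a])
        for paper, authors in papers.items()
    }
    # By linearity of expectation, each paper contributes its nominee's probability.
    expected = sum(p_irresponsible[a] for a in nominations.values())
    return nominations, expected


# Hypothetical example data.
papers = {"P1": ["alice", "bob"], "P2": ["bob", "carol"], "P3": ["carol"]}
p_irresponsible = {"alice": 0.10, "bob": 0.05, "carol": 0.30}

nominations, expected = greedy_nominations(papers, p_irresponsible)
print(nominations)
print(round(expected, 3))
```

In this toy run bob is nominated twice, so one failure on his part would sink two papers. That concentration of risk is what the hard and soft nomination limits in the abstract are meant to control.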
[AI-151] Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition AAAI2026
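[Quick read]: This paper addresses the insufficient modeling of musical mode in symbolic music emotion recognition (SMER). Methods based on pre-trained models such as MIDIBERT capture distributional musical semantics well, but they represent the association between mode and emotion poorly, even though music psychology shows that mode plays a critical role in emotional perception. The key to the solution is a Mode-Guided Enhancement (MoGE) strategy whose core innovation is the Mode-guided Feature-wise linear modulation injection (MoFi) framework: it identifies the least emotion-relevant layer of MIDIBERT and injects explicit mode features there, which significantly improves the model's emotional representation and inference.
Link: https://arxiv.org/abs/2512.17946
Authors: Haiying Xia,Zhongyi Huang,Yumei Tan,Shuxiang Song
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted by AAAI 2026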
Abstract:Music emotion recognition is a key task in symbolic music understanding (SMER). Recent approaches have shown promising results by fine-tuning large-scale pre-trained models (e.g., MIDIBERT, a benchmark in symbolic music understanding) to map musical semantics to emotional labels. While these models effectively capture distributional musical semantics, they often overlook tonal structures, particularly musical modes, which play a critical role in emotional perception according to music psychology. In this paper, we investigate the representational capacity of MIDIBERT and identify its limitations in capturing mode-emotion associations. To address this issue, we propose a Mode-Guided Enhancement (MoGE) strategy that incorporates psychological insights on mode into the model. Specifically, we first conduct a mode augmentation analysis, which reveals that MIDIBERT fails to effectively encode emotion-mode correlations. We then identify the least emotion-relevant layer within MIDIBERT and introduce a Mode-guided Feature-wise linear modulation injection (MoFi) framework to inject explicit mode features, thereby enhancing the model’s capability in emotional representation and inference. Extensive experiments on the EMOPIA and VGMIDI datasets demonstrate that our mode injection strategy significantly improves SMER performance, achieving accuracies of 75.2% and 59.1%, respectively. These results validate the effectiveness of mode-guided modeling in symbolic music emotion recognition.
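For readers unfamiliar with feature-wise linear modulation (FiLM), the sketch below shows the generic operation: each hidden feature is scaled and shifted by conditioning-dependent parameters. How MoFi derives those parameters from mode features is not described in the abstract, so the per-mode lookup table here is a hypothetical stand-in and not the authors' design.

```python
def film(hidden, gamma, beta):
    """Apply feature-wise linear modulation: out = gamma * h + beta, per feature."""
    return [[g * h + b for h, g, b in zip(token, gamma, beta)] for token in hidden]


# Hypothetical conditioning table: one (gamma, beta) pair per musical mode.
MODE_PARAMS = {
    "major": ([2.0, 0.5], [0.0, 1.0]),
    "minor": ([1.0, 1.0], [0.0, 0.0]),   # identity modulation
}

hidden = [[1.0, 2.0], [3.0, 4.0]]        # 2 tokens, 2 features each (toy values)
gamma, beta = MODE_PARAMS["major"]
print(film(hidden, gamma, beta))
```

In a real model the scale and shift vectors would typically come from a small learned network applied to the conditioning features, and the modulation would be applied at the chosen layer of the backbone.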
[AI-152] Accelerated Digital Twin Learning for Edge AI: A Comparison of FPGA and Mobile GPU
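[Quick read]: This paper addresses the compute and resource cost of digital twin (DT) learning for precision healthcare. Existing model recovery (MR) techniques rely on iterative solvers and have high compute and memory demands, so they struggle to meet the speed and energy requirements of mission-critical medical applications. The key to the solution is a general DT learning framework that is amenable to acceleration on reconfigurable hardware such as FPGAs. Compared with cloud GPU baselines, the FPGA implementation achieves an 8.8x improvement in performance-per-watt on the MR task, a 28.5x reduction in DRAM footprint, and a 1.67x runtime speedup. Compared with the FPGA, the mobile GPU achieves about 2x better performance per watt but runs about 2x slower and uses about 10x more DRAM. The technique is demonstrated on DT-guided synthetic data generation for Type 1 Diabetes and on proactive coronary artery disease detection.
Link: https://arxiv.org/abs/2512.17941
Authors: Bin Xu,Ayan Banerjee,Midhat Urooj,Sandeep K.S. Gupta
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: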
Abstract:Digital twins (DTs) can enable precision healthcare by continually learning a mathematical representation of patient-specific dynamics. However, mission-critical healthcare applications require fast, resource-efficient DT learning, which is often infeasible with existing model recovery (MR) techniques due to their reliance on iterative solvers and high compute/memory demands. In this paper, we present a general DT learning framework that is amenable to acceleration on reconfigurable hardware such as FPGAs, enabling substantial speedup and energy efficiency. We compare our FPGA-based implementation with a multi-processing implementation on a mobile GPU, which is a popular choice for AI in edge devices. Further, we compare both edge AI implementations with a cloud GPU baseline. Specifically, our FPGA implementation achieves an 8.8x improvement in performance-per-watt for the MR task, a 28.5x reduction in DRAM footprint, and a 1.67x runtime speedup compared to cloud GPU baselines. On the other hand, the mobile GPU achieves 2x better performance per watt but has a 2x increase in runtime and a 10x larger DRAM footprint than the FPGA. We show the usage of this technique in DT-guided synthetic data generation for Type 1 Diabetes and proactive coronary artery disease detection.
[AI-153] Comparative Evaluation of Explainable Machine Learning Versus Linear Regression for Predicting County-Level Lung Cancer Mortality Rate in the United States
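[Quick read]: This paper addresses the accuracy and interpretability of county-level lung cancer (LC) mortality prediction in the United States, in support of targeted interventions and the reduction of health disparities. The key to the solution is a comparison of three models (random forest, gradient boosting regression, and linear regression), combined with Shapley Additive Explanations (SHAP) values to quantify variable importance and direction of effect, and Getis-Ord (Gi*) hotspot analysis to reveal the geographic distribution. The random forest model performs best (R-squared = 41.9%, RMSE = 12.8), and smoking rate, median home value, and the percentage of Hispanic population are the leading predictors, giving an interpretable basis for targeted public health strategies.
Link: https://arxiv.org/abs/2512.17934
Authors: Soheil Hashtarkhani,Brianna M. White,Benyamin Hoseini,David L. Schwartz,Arash Shaban-Nejad
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 9 Pages, 4 Figures, 1 Table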
Abstract:Lung cancer (LC) is a leading cause of cancer-related mortality in the United States. Accurate prediction of LC mortality rates is crucial for guiding targeted interventions and addressing health disparities. Although traditional regression-based models have been commonly used, explainable machine learning models may offer enhanced predictive accuracy and deeper insights into the factors influencing LC mortality. This study applied three models: random forest (RF), gradient boosting regression (GBR), and linear regression (LR) to predict county-level LC mortality rates across the United States. Model performance was evaluated using R-squared and root mean squared error (RMSE). Shapley Additive Explanations (SHAP) values were used to determine variable importance and their directional impact. Geographic disparities in LC mortality were analyzed through Getis-Ord (Gi*) hotspot analysis. The RF model outperformed both GBR and LR, achieving an R-squared value of 41.9% and an RMSE of 12.8. SHAP analysis identified smoking rate as the most important predictor, followed by median home value and the percentage of the Hispanic ethnic population. Spatial analysis revealed significant clusters of elevated LC mortality in the mid-eastern counties of the United States. The RF model demonstrated superior predictive performance for LC mortality rates, emphasizing the critical roles of smoking prevalence, housing values, and the percentage of Hispanic ethnic population. These findings offer valuable actionable insights for designing targeted interventions, promoting screening, and addressing health disparities in regions most affected by LC in the United States.
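The two evaluation metrics named in the abstract have standard definitions, sketched below in plain Python. The function names and the sample values are made up for illustration and are unrelated to the study's data.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical toy values, not real mortality rates.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(y_true, y_pred), 3), round(rmse(y_true, y_pred), 3))
```

Read this way, the reported R-squared of 41.9% means the model explains a bit over two fifths of the variance in county-level rates, while the RMSE of 12.8 is expressed in the units of the mortality rate itself.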
[AI-154] Byzantine Fault-Tolerant Multi-Agent System for Healthcare: A Gossip Protocol Approach to Secure Medical Message Propagation
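[Quick read]: This paper addresses the weak message-integrity guarantees and limited fault tolerance of healthcare multi-agent systems operating in adversarial or untrusted environments. The core of the solution is a Byzantine fault-tolerant (BFT) multi-agent architecture that combines gossip-based message propagation with cryptographic validation to make distributed medical decision-making reliable and secure. The consensus model uses n = 3f + 1 nodes to tolerate up to f Byzantine nodes and reaches consensus with 2f + 1 votes. Digital signatures and timestamp validation prevent replay attacks and protect the authenticity and freshness of medical messages, and the system maintains 100% consensus accuracy with up to 33% Byzantine nodes.
Link: https://arxiv.org/abs/2512.17913
Authors: Nihir Chadderwala
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: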
Abstract:Recent advances in generative AI have enabled sophisticated multi-agent architectures for healthcare, where large language models power collaborative clinical decision-making. However, these distributed systems face critical challenges in ensuring message integrity and fault tolerance when operating in adversarial or untrusted environments. This paper presents a novel Byzantine fault-tolerant multi-agent system specifically designed for healthcare applications, integrating gossip-based message propagation with cryptographic validation mechanisms. Our system employs specialized AI agents for diagnosis, treatment planning, emergency response, and data analysis, coordinated through a Byzantine consensus protocol that tolerates up to f faulty nodes among n = 3f + 1 total nodes. We implement a gossip protocol for decentralized message dissemination, achieving consensus with 2f + 1 votes while maintaining system operation even under Byzantine failures. Experimental results demonstrate that our approach successfully validates medical messages with cryptographic signatures, prevents replay attacks through timestamp validation, and maintains consensus accuracy of 100% with up to 33% Byzantine nodes. The system provides real-time visualization of consensus rounds, vote tallies, and network topology, enabling transparent monitoring of fault-tolerant operations. This work contributes a practical framework for building secure, resilient healthcare multi-agent systems capable of collaborative medical decision-making in untrusted environments.
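The quorum arithmetic in the abstract can be checked with a few lines of Python. This is a sketch of the counting rule only, assuming n = 3f + 1 as stated; the function names are hypothetical and it does not model gossip, signatures, or timestamps.

```python
from collections import Counter


def max_faulty(n):
    """Largest f with 3f + 1 <= n, i.e. the number of Byzantine nodes tolerated."""
    return (n - 1) // 3


def quorum(n):
    """Votes needed for consensus under the 2f + 1 rule."""
    return 2 * max_faulty(n) + 1


def decide(votes, n):
    """Return the value backed by a quorum of votes, or None if there is none."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= quorum(n) else None


n = 4                                   # f = 1, quorum = 3
print(max_faulty(n), quorum(n))
print(decide(["approve", "approve", "approve", "reject"], n))
print(decide(["approve", "approve", "reject", "reject"], n))
```

Because f / (3f + 1) is always below one third, the figure of up to 33% Byzantine nodes in the abstract is consistent with this bound.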
[AI-155] Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA
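[Quick read]: This paper addresses the heavy recomputation overhead that adapter switching causes in multi-turn large language model (LLM) pipelines. With parameter-efficient fine-tuning based on LoRA (Low-Rank Adaptation), different adapters cannot share the key-value cache (KV-cache), which becomes a performance bottleneck. The key to the solution is a serving engine built on Activated LoRA (aLoRA) that introduces base-aligned block hashing and activation-aware masking in the model execution path, enabling cross-model KV-cache reuse and fine-grained dynamic adapter switching without recomputing key intermediate state. The design is integrated into the vLLM framework, stays compatible with existing serving optimizations, and delivers up to 58x lower end-to-end latency and over 100x better time-to-first-token relative to standard LoRA baselines.
Link: https://arxiv.org/abs/2512.17910
Authors: Allison Li,Kristjan Greenewald,Thomas Parnell,Navid Azizan
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: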
Abstract:Modern large language model (LLM) systems increasingly rely on multi-turn pipelines that are composed of multiple task-specific adapters, yet existing serving frameworks remain inefficient, incurring substantial recomputation overhead when switching between adapters. We present the first LLM serving engine that supports cross-model prefix cache reuse between base and adapted models via Activated LoRA (aLoRA), enabling efficient and fine-grained adapter switching during inference. Our design extends the vLLM framework by introducing base-aligned block hashing and activation-aware masking within the model execution path, permitting cache reuse across models while preserving compatibility with existing serving engine optimizations. Integrated into a production-grade inference stack, this approach supports dynamic adapter activation without excessive key-value tensor recomputation. Evaluation across representative multi-turn, multi-adapter pipelines demonstrates up to 58x end-to-end latency reduction and over 100x time-to-first-token improvement relative to standard LoRA baselines, with benefits that scale with model size and sequence length and manifest across all stages of the request lifecycle. This work bridges parameter-efficient model adaptation with high-performance serving, providing the first complete realization of cross-model KV-cache reuse in modern LLM inference engines.
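The abstract does not define base-aligned block hashing, so the following is only a guess at the general idea, written as a self-contained sketch. It assumes that an aLoRA adapter changes computation only from some activation position onward, so that earlier token blocks can be keyed exactly like base-model blocks and their cache entries shared. The function, its parameters, and the keying rule are hypothetical and are not vLLM's API.

```python
import hashlib


def block_hashes(tokens, block_size, adapter_id=None, activation_start=None):
    """Chained hashes for full token blocks (partial trailing blocks are skipped).

    Hypothetical rule: a block is keyed as a base-model block unless the adapter
    is active for at least one of its tokens. activation_start=None with an
    adapter_id models a standard LoRA adapter that affects every token.
    """
    hashes, prev = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        end = start + block_size
        adapter_active = adapter_id is not None and (
            activation_start is None or end > activation_start
        )
        tag = adapter_id if adapter_active else "base"
        payload = prev + tag.encode() + b"|" + ",".join(map(str, tokens[start:end])).encode()
        prev = hashlib.sha256(payload).digest()   # chain: each hash covers the prefix
        hashes.append(prev.hex())
    return hashes


tokens = list(range(8))
base = block_hashes(tokens, 4)
alora = block_hashes(tokens, 4, adapter_id="adapter-1", activation_start=4)
lora = block_hashes(tokens, 4, adapter_id="adapter-1")

print(base[0] == alora[0])   # prefix block before activation matches the base key
print(base[1] == alora[1])   # block after activation gets its own key
print(base[0] == lora[0])    # an always-on adapter shares nothing with the base
```

Under this assumed rule, a request that activates the adapter late in a long conversation would find most of its prefix already cached by the base model, which is the kind of reuse the abstract credits for the latency gains.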
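[AI-156] Owning the Intelligence: Global AI Patents Landscape and Europe's Quest for Technological Sovereignty
[Quick read]: This paper addresses Europe's unclear position on technological sovereignty in the global AI patent race, in particular its relative standing in a bipolar innovation landscape dominated by the United States and China. The key to the solution is a systematic analysis that links patent, firm, ownership, and citation data to examine the geography, specialization, and international diffusion of AI innovation. Europe holds a limited share of AI patents but shows signs of relatively high patent quality, and it converges toward U.S. innovation trajectories while remaining fragmented rather than forming an autonomous pole. The paper concludes that technological capability, rather than political integration, is what drives participation in global AI innovation networks.
Link: https://arxiv.org/abs/2512.19569
Authors: Lapo Santarlasci,Armando Rungi,Loredana Fattorini,Nestor Maslej
Affiliations: unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments: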
Abstract:Artificial intelligence has become a key arena of global technological competition and a central concern for Europe’s quest for technological sovereignty. This paper analyzes global AI patenting from 2010 to 2023 to assess Europe’s position in an increasingly bipolar innovation landscape dominated by the United States and China. Using linked patent, firm, ownership, and citation data, we examine the geography, specialization, and international diffusion of AI innovation. We find a highly concentrated patent landscape: China leads in patent volumes, while the United States dominates in citation impact and technological influence. Europe accounts for a limited share of AI patents but exhibits signals of relatively high patent quality. Technological proximity reveals global convergence toward U.S. innovation trajectories, with Europe remaining fragmented rather than forming an autonomous pole. Gravity-model estimates show that cross-border AI knowledge flows are driven primarily by technological capability and specialization, while geographic and institutional factors play a secondary role. EU membership does not significantly enhance intra-European knowledge diffusion, suggesting that technological capacity, rather than political integration, underpins participation in global AI innovation networks.
[AI-157] Structural Reinforcement Learning for Heterogeneous Agent Macroeconomics
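[Quick read]: This paper addresses the difficulty of solving macroeconomic models with heterogeneous agents and aggregate risk, especially the nontrivial market-clearing conditions that traditional methods struggle with. The key to the solution is a structural reinforcement learning (SRL) method: it replaces the cross-sectional distribution with low-dimensional prices as state variables and lets individual agents learn equilibrium price dynamics directly from simulated paths, which sidesteps the Master equation and yields an efficient global solution method. The approach solves the Krusell-Smith model, the Huggett model with aggregate shocks, and a HANK model with a forward-looking Phillips curve globally within minutes.
Link: https://arxiv.org/abs/2512.18892
Authors: Yucheng Yang,Chiyuan Wang,Andreas Schaab,Benjamin Moll
Affiliations: unknown
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: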
Abstract:We present a new approach to formulating and solving heterogeneous agent models with aggregate risk. We replace the cross-sectional distribution with low-dimensional prices as state variables and let agents learn equilibrium price dynamics directly from simulated paths. To do so, we introduce a structural reinforcement learning (SRL) method which treats prices via simulation while exploiting agents’ structural knowledge of their own individual dynamics. Our SRL method yields a general and highly efficient global solution method for heterogeneous agent models that sidesteps the Master equation and handles problems traditional methods struggle with, in particular nontrivial market-clearing conditions. We illustrate the approach in the Krusell-Smith model, the Huggett model with aggregate shocks, and a HANK model with a forward-looking Phillips curve, all of which we solve globally within minutes.
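[AI-158] The Illusion of Consistency: Selection-Induced Bias in Gated Kalman Innovation Statistics
[Quick read]: This paper addresses the bias that validation gating introduces into innovation statistics in classical Kalman-based tracking systems. Conventional practice assumes that measurements passing the gate still follow the original unconditional innovation process, but gating restricts the innovation to the validation event, so its statistics become conditional rather than nominal. The key contributions are twofold. First, under linear-Gaussian assumptions, the paper derives exact expressions for the first- and second-order moments of the innovation conditioned on ellipsoidal gating and shows that gating induces a deterministic, dimension-dependent contraction of the innovation covariance. Second, it extends the analysis to nearest-neighbor (NN) association and proves that selecting the minimum-norm innovation among multiple in-gate measurements introduces an unavoidable energy contraction, so nominal innovation statistics cannot be preserved under nontrivial gating and association.
Link: https://arxiv.org/abs/2512.18508
Authors: Barak Or
Affiliations: unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: 8 pages, preprint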
Abstract:Validation gating is a fundamental component of classical Kalman-based tracking systems. Only measurements whose normalized innovation squared (NIS) falls below a prescribed threshold are considered for state update. While this procedure is statistically motivated by the chi-square distribution, it implicitly replaces the unconditional innovation process with a conditionally observed one, restricted to the validation event. This paper shows that innovation statistics computed after gating converge to gate-conditioned rather than nominal quantities. Under classical linear–Gaussian assumptions, we derive exact expressions for the first- and second-order moments of the innovation conditioned on ellipsoidal gating, and show that gating induces a deterministic, dimension-dependent contraction of the innovation covariance. The analysis is extended to NN association, which is shown to act as an additional statistical selection operator. We prove that selecting the minimum-norm innovation among multiple in-gate measurements introduces an unavoidable energy contraction, implying that nominal innovation statistics cannot be preserved under nontrivial gating and association. Closed-form results in the two-dimensional case quantify the combined effects and illustrate their practical significance.
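The covariance contraction is easy to reproduce numerically in the simplest setting. The sketch below assumes a two-dimensional, already whitened innovation (covariance equal to the identity), so the NIS is chi-square with 2 degrees of freedom. The closed form used for comparison is a standard truncated-exponential calculation worked out for this example, not a formula quoted from the paper, and the function names are hypothetical.

```python
import math
import random


def gated_contraction_mc(gate, n_samples=100_000, seed=0):
    """Monte Carlo estimate of c in Cov[innovation | NIS <= gate] = c * I (2-D case)."""
    rng = random.Random(seed)
    total, accepted = 0.0, 0
    for _ in range(n_samples):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        nis = x * x + y * y                  # normalized innovation squared
        if nis <= gate:                      # validation gate
            total += nis
            accepted += 1
    # By symmetry the gated covariance is isotropic, so c = E[NIS | gated] / 2.
    return (total / accepted) / 2.0


def gated_contraction_exact(gate):
    """Closed form for the same factor, using NIS ~ Exponential(mean 2) in 2-D."""
    tail = math.exp(-gate / 2.0)             # probability of falling outside the gate
    return 1.0 - (gate / 2.0) * tail / (1.0 - tail)


for gate in (4.0, 9.21):                     # 9.21 is roughly the 99% gate for 2 dof
    print(gate, round(gated_contraction_mc(gate), 3), round(gated_contraction_exact(gate), 3))
```

Even at a 99% gate the factor stays below 1, so innovation covariances estimated from gated data will look slightly too small, and a tighter gate makes the effect much stronger.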
[AI-159] On Efficient Adjustment in Causal Graphs
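[Quick read]: This paper addresses two limitations of existing conditions for identifying causal effects by covariate adjustment in summary causal graphs (SCGs): the identifiability conditions are overly complex, and they yield only a few valid adjustment sets. Existing conditions rely on cumbersome definitions and the enumeration of multiple paths, which is computationally expensive, and they provide only two valid adjustment sets, which limits flexibility in practice. The key to the solution is an equivalent but simpler formulation of the identifiability conditions, together with a new adjustment criterion that identifies a broader class of valid adjustment sets. The paper further characterizes the quasi-optimal adjustment set among these, namely the one that minimizes the asymptotic variance of the causal effect estimator, advancing the theory of causal inference in abstracted causal graphs and providing more flexible and efficient practical tools.
Link: https://arxiv.org/abs/2512.18315
Authors: Isabela Belciug,Simon Ferreira,Charles K. Assaad
Affiliations: unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: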
Abstract:Observational studies in fields such as epidemiology often rely on covariate adjustment to estimate causal effects. Classical graphical criteria, like the back-door criterion and the generalized adjustment criterion, are powerful tools for identifying valid adjustment sets in directed acyclic graphs (DAGs). However, these criteria are not directly applicable to summary causal graphs (SCGs), which are abstractions of DAGs commonly used in dynamic systems. In SCGs, each node typically represents an entire time series and may involve cycles, making classical criteria inapplicable for identifying causal effects. Recent work established complete conditions for determining whether the micro causal effect of a treatment or an exposure X_{t-\gamma} on an outcome Y_t is identifiable via covariate adjustment in SCGs, under the assumption of no hidden confounding. However, these identifiability conditions have two main limitations. First, they are complex, relying on cumbersome definitions and requiring the enumeration of multiple paths in the SCG, which can be computationally expensive. Second, when these conditions are satisfied, they only provide two valid adjustment sets, limiting flexibility in practical applications. In this paper, we propose an equivalent but simpler formulation of those identifiability conditions and introduce a new criterion that identifies a broader class of valid adjustment sets in SCGs. Additionally, we characterize the quasi-optimal adjustment set among these, i.e., the one that minimizes the asymptotic variance of the causal effect estimator. Our contributions offer both theoretical advancement and practical tools for more flexible and efficient causal inference in abstracted causal graphs.
[AI-160] Evolutionary BPOSD Decoding for Low-Latency Quantum Error Correction
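[Quick read]: This paper addresses the limited performance of conventional belief propagation (BP) decoding for quantum error-correcting codes in low-latency settings. The key to the solution is an evolutionary belief propagation (EBP) decoder with trainable weights: the tunable parameters of BP are optimized with the differential evolution algorithm, and the decoder is optimized end to end in combination with ordered statistics decoding (OSD). Experiments on surface codes and quantum low-density parity-check codes show that EBP+OSD achieves better decoding performance and lower computational complexity than BP+OSD, with the advantage most visible under strict low-latency constraints (within 5 BP iterations).
Link: https://arxiv.org/abs/2512.18273
Authors: Hee-Youl Kwak,Seong-Joon Park,Hyunwoo Jung,Jeongseok Ha,Jae-Won Kim
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures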
Abstract:We propose an evolutionary belief propagation (EBP) decoder for quantum error correction, which incorporates trainable weights into the BP algorithm and optimizes them via the differential evolution algorithm. This approach enables end-to-end optimization of the EBP combined with ordered statistics decoding (OSD). Experimental results on surface codes and quantum low-density parity-check codes show that EBP+OSD achieves better decoding performance and lower computational complexity than BP+OSD, particularly under strict low latency constraints (within 5 BP iterations).
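[AI-161] The Subject of Emergent Misalignment in Superintelligence: An Anthropological Cognitive Neuropsychological Machine-Learning and Ontological Perspective
[Quick read]: This paper addresses conceptual and ethical gaps in current research on superintelligence misalignment, in particular the absence of the human subject from the discourse and the under-developed theorization of an "AI unconscious", which together may lay the groundwork for anti-social harm. The key to the proposed response is to re-center the human subject as the unstable ground on which the ethical, unconscious, and misaligned dimensions of human and machinic intelligence are co-constituted, and to understand misalignment as a relational instability embedded in human-machine ecologies rather than a single-layer problem for technical diagnostics. This requires looking beyond computational abstraction and sociotechnical imaginaries that prize scalability and acceleration, and attending instead to neglected dimensions such as vulnerability, finitude, and relationality.
Link: https://arxiv.org/abs/2512.17989
Authors: Muhammad Osama Imran,Roshni Lulla,Rodney Sappington
Affiliations: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 10 pages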
Abstract:We examine the conceptual and ethical gaps in current representations of Superintelligence misalignment. We find throughout Superintelligence discourse an absent human subject, and an under-developed theorization of an “AI unconscious” that together are potentially laying the groundwork for anti-social harm. With the rise of AI Safety that has both thematic potential for establishing pro-social and anti-social potential outcomes, we ask: what place does the human subject occupy in these imaginaries? How is human subjecthood positioned within narratives of catastrophic failure or rapid “takeoff” toward superintelligence? On another register, we ask: what unconscious or repressed dimensions are being inscribed into large-scale AI models? Are we to blame these agents in opting for deceptive strategies when undesirable patterns are inherent within our beings? In tracing these psychic and epistemic absences, our project calls for re-centering the human subject as the unstable ground upon which the ethical, unconscious, and misaligned dimensions of both human and machinic intelligence are co-constituted. Emergent misalignment cannot be understood solely through technical diagnostics typical of contemporary machine-learning safety research. Instead, it represents a multi-layered crisis. The human subject disappears not only through computational abstraction but through sociotechnical imaginaries that prioritize scalability, acceleration, and efficiency over vulnerability, finitude, and relationality. Likewise, the AI unconscious emerges not as a metaphor but as a structural reality of modern deep learning systems: vast latent spaces, opaque pattern formation, recursive symbolic play, and evaluation-sensitive behavior that surpasses explicit programming. These dynamics necessitate a reframing of misalignment as a relational instability embedded within human-machine ecologies.
[AI-162] Re-assessing the evidence for mental rotation abilities in children using computational models
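[Quick Read]: This paper asks whether children, especially those under 5, have genuine mental rotation (MR) abilities. Existing studies rely mainly on behavioral paradigms adapted from adult experiments and offer little reliable evidence about how MR develops in children. The key to the approach is to use recent computational models of the development of children's object recognition to re-assess performance on classical MR tasks. The study finds that a simple recognition strategy based on pixel-wise comparison of stimuli is sufficient to explain the behavior of children aged 6 months to 5 years in the most widely used MR task. This suggests that mental rotation may not be the core mechanism by which young children solve such tasks, and it reopens the debate on when and how children develop genuine MR abilities.
Link: https://arxiv.org/abs/2512.17972
Authors: Arthur Aubret,Jochen Triesch
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: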
Abstract:There is strong and diverse evidence for mental rotation (MR) abilities in adults. However, current evidence for MR in children rests on just a few behavioral paradigms adapted from the adult literature. Here, we leverage recent computational models of the development of children’s object recognition abilities to re-assess the evidence for MR in children. The computational models simulate infants’ acquisition of object representations during embodied interactions with objects. We consider two different object recognition strategies, different from MRs, and assess their ability to replicate results from three classical MR tasks assigned to children between the ages of 6 months and 5 years. Our results show that MR may play no role in producing the results obtained from children younger than 5 years. In fact, we find that a simple recognition strategy that reflects a pixel-wise comparison of stimuli is sufficient to model children’s behavior in the most used MR task. Thus, our study reopens the debate on how and when children develop genuine MR abilities.
[AI-163] Reinforcement Learning for Monetary Policy Under Macroeconomic Uncertainty: Analyzing Tabular and Function Approximation Methods
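[Quick Read]: This paper addresses how a central bank should dynamically adjust the short-term nominal interest rate to stabilize inflation and unemployment when macroeconomic relationships are uncertain and time-varying. The key to the approach is to model monetary policy as a sequential decision problem: a linear-Gaussian state transition model is built from Federal Reserve Economic Data (FRED) and solved as a discrete-action Markov Decision Process (MDP) with a quadratic-loss reward function. The study compares nine reinforcement learning (RL) methods against the Taylor rule and naive baselines. Although more complex RL methods such as Deep Q-Networks, Bayesian Q-learning, and partially observable MDP (POMDP) formulations have theoretical advantages, the simplest tabular Q-learning performs best. This suggests that simple methods may be more robust than complex models in macroeconomic policy settings, and it poses an important challenge for applying modern RL to economic policy.
Link: https://arxiv.org/abs/2512.17929
Authors: Sheryl Chen,Tony Wang,Kyle Feinstein
Affiliations: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM)
Comments: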
Abstract:We study how a central bank should dynamically set short-term nominal interest rates to stabilize inflation and unemployment when macroeconomic relationships are uncertain and time-varying. We model monetary policy as a sequential decision-making problem where the central bank observes macroeconomic conditions quarterly and chooses interest rate adjustments. Using publicly accessible historical Federal Reserve Economic Data (FRED), we construct a linear-Gaussian transition model and implement a discrete-action Markov Decision Process with a quadratic loss reward function. We chose to compare nine different reinforcement learning style approaches against Taylor Rule and naive baselines, including tabular Q-learning variants, SARSA, Actor-Critic, Deep Q-Networks, Bayesian Q-learning with uncertainty quantification, and POMDP formulations with partial observability. Surprisingly, standard tabular Q-learning achieved the best performance (-615.13 ± 309.58 mean return), outperforming both enhanced RL methods and traditional policy rules. Our results suggest that while sophisticated RL techniques show promise for monetary policy applications, simpler approaches may be more robust in this domain, highlighting important challenges in applying modern RL to macroeconomic policy.
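To make the setup in this abstract concrete, here is a minimal sketch of tabular Q-learning on a toy one-dimensional economy. Everything numeric is an assumption made for illustration: the transition coefficients, the noise level, the three rate adjustments in `ACTIONS`, and the binning in `to_bin` are hypothetical and are not estimated from FRED or taken from the paper.

```python
import random

ACTIONS = (-0.5, 0.0, 0.5)   # hypothetical rate adjustments (percentage points)
N_BINS = 9                   # discretization of the inflation gap over [-2, 2]

def to_bin(gap):
    """Map a continuous inflation gap to a Q-table row index."""
    clipped = max(-2.0, min(2.0, gap))
    return min(N_BINS - 1, int((clipped + 2.0) / 4.0 * N_BINS))

def step(gap, action, rng):
    """Toy linear-Gaussian transition with a quadratic-loss reward (assumed)."""
    nxt = 0.9 * gap - 0.5 * action + rng.gauss(0.0, 0.1)
    nxt = max(-2.0, min(2.0, nxt))
    return nxt, -(nxt ** 2)

def train(episodes=2000, horizon=20, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    """Standard epsilon-greedy tabular Q-learning."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_BINS)]
    for _ in range(episodes):
        gap = rng.uniform(-2.0, 2.0)
        for _ in range(horizon):
            s = to_bin(gap)
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[s][i])
            gap, reward = step(gap, ACTIONS[a], rng)
            target = reward + gamma * max(q[to_bin(gap)])
            q[s][a] += alpha * (target - q[s][a])
    return q

def greedy_action(q, gap):
    """Rate adjustment the learned table prefers at a given inflation gap."""
    s = to_bin(gap)
    return ACTIONS[max(range(len(ACTIONS)), key=lambda i: q[s][i])]

q_table = train()
```

In this toy model, `greedy_action(q_table, 1.8)` picks the rate increase and `greedy_action(q_table, -1.8)` picks the cut, which is the qualitative behavior a quadratic loss around the target should produce.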
[AI-164] Efficient Beamforming Optimization for STAR-RIS-Assisted Communications: A Gradient-Based Meta Learning Approach
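[Quick Read]: This paper addresses the high-dimensional, strongly nonconvex, NP-hard optimization problem that arises in simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) systems from jointly optimizing the base station precoding matrix and the STAR-RIS transmission and reflection coefficient matrices. Conventional alternating optimization (AO) repeatedly inverts large matrices, which makes it computationally expensive and poorly scalable, while existing deep learning methods rely on costly pre-training and large network models. The key to the approach is a gradient-based meta learning (GML) framework that feeds optimization gradients directly into lightweight neural networks, adapting quickly to different scenarios without pre-training. Dedicated schemes are designed for the independent-phase and coupled-phase STAR-RIS models; each satisfies its amplitude and phase constraints while achieving weighted sum-rate performance close to the AO benchmarks. Simulations show that the method substantially reduces computational overhead, with complexity growing nearly linearly and up to a 10-fold speedup over AO, indicating good scalability and practicality.
Link: https://arxiv.org/abs/2512.17928
Authors: Dongdong Yang,Bin Li,Jiguang He,Yicheng Yan,Xiaoyu Zhang,Chongwen Huang
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: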
Abstract:Simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) has emerged as a promising technology to realize full-space coverage and boost spectral efficiency in next-generation wireless networks. Yet, the joint design of the base station precoding matrix as well as the STAR-RIS transmission and reflection coefficient matrices leads to a high-dimensional, strongly nonconvex, and NP-hard optimization problem. Conventional alternating optimization (AO) schemes typically involve repeated large-scale matrix inversion operations, resulting in high computational complexity and poor scalability, while existing deep learning approaches often rely on expensive pre-training and large network models. In this paper, we develop a gradient-based meta learning (GML) framework that directly feeds optimization gradients into lightweight neural networks, thereby removing the need for pre-training and enabling fast adaptation. Specifically, we design dedicated GML-based schemes for both independent-phase and coupled-phase STAR-RIS models, effectively handling their respective amplitude and phase constraints while achieving weighted sum-rate performance very close to that of AO-based benchmarks. Extensive simulations demonstrate that, for both phase models, the proposed methods substantially reduce computational overhead, with complexity growing nearly linearly when the number of BS antennas and STAR-RIS elements grows, and yielding up to 10 times runtime speedup over AO, which confirms the scalability and practicality of the proposed GML method for large-scale STAR-RIS-assisted communications.
[AI-165] Inferring Latent Market Forces: Evaluating LLM Detection of Gamma Exposure Patterns via Obfuscation Testing
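[Quick Read]: This paper asks whether large language models (LLMs) can identify structural patterns in financial markets through causal reasoning rather than relying only on temporal association. The core question is whether LLMs can extract economic mechanisms from raw market data and thereby understand complex financial behavior accurately. The key to the approach is the "WHO-WHOM-WHAT" causal framework, which forces the model to identify the economic actors (such as dealers), the affected parties (such as directional traders), and the structural mechanisms (such as forced hedging). The models are tested with unbiased prompts that provide only raw gamma exposure values, without regime labels or temporal context, and reach a 71.5% detection rate on S&P 500 options data. Detection accuracy (91.2%) remains stable as economic profitability varies from quarter to quarter, indicating that the models identify structural constraints rather than short-term profitable patterns.
Link: https://arxiv.org/abs/2512.17923
Authors: Christopher Regan,Ying Xie
Affiliations: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures. Accepted at IEEE Big Data 2025. Extended journal version in preparation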
Abstract:We introduce obfuscation testing, a novel methodology for validating whether large language models detect structural market patterns through causal reasoning rather than temporal association. Testing three dealer hedging constraint patterns (gamma positioning, stock pinning, 0DTE hedging) on 242 trading days (95.6% coverage) of S&P 500 options data, we find LLMs achieve 71.5% detection rate using unbiased prompts that provide only raw gamma exposure values without regime labels or temporal context. The WHO-WHOM-WHAT causal framework forces models to identify the economic actors (dealers), affected parties (directional traders), and structural mechanisms (forced hedging) underlying observed market dynamics. Critically, detection accuracy (91.2%) remains stable even as economic profitability varies quarterly, demonstrating that models identify structural constraints rather than profitable patterns. When prompted with regime labels, detection increases to 100%, but the 71.5% unbiased rate validates genuine pattern recognition. Our findings suggest LLMs possess emergent capabilities for detecting complex financial mechanisms through pure structural reasoning, with implications for systematic strategy development, risk management, and our understanding of how transformer architectures process financial market dynamics.
Machine Learning
[LG-0] Deep Legendre Transform NEURIPS2025 NEURIPS
Link: https://arxiv.org/abs/2512.19649
Authors: Aleksey Minabutdinov,Patrick Cheridito
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted at NeurIPS 2025 (poster). NeurIPS page: this https URL
Abstract:We introduce a novel deep learning algorithm for computing convex conjugates of differentiable convex functions, a fundamental operation in convex analysis with various applications in different fields such as optimization, control theory, physics and economics. While traditional numerical methods suffer from the curse of dimensionality and become computationally intractable in high dimensions, more recent neural network-based approaches scale better, but have mostly been studied with the aim of solving optimal transport problems and require the solution of complicated optimization or max-min problems. Using an implicit Fenchel formulation of convex conjugation, our approach facilitates an efficient gradient-based framework for the minimization of approximation errors and, as a byproduct, also provides a posteriori error estimates for the approximation quality. Numerical experiments demonstrate our method’s ability to deliver accurate results across different high-dimensional examples. Moreover, by employing symbolic regression with Kolmogorov–Arnold networks, it is able to obtain the exact convex conjugates of specific convex functions.
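The abstract does not spell out its implicit Fenchel formulation, so the following is only a sketch of the standard identity that such formulations typically rest on. For a differentiable convex f, the conjugate f*(y) = sup_x (x*y - f(x)) satisfies f*(f'(x)) = x*f'(x) - f(x), so conjugate values can be read off at points y = f'(x) without solving the inner maximization. The choice f(x) = exp(x), the grid bounds, and the function names are assumptions for illustration.

```python
import math

def f(x):
    """Example convex function (an assumption for this sketch)."""
    return math.exp(x)

def grad_f(x):
    """Derivative of the example function."""
    return math.exp(x)

def conjugate_by_search(y, lo=-10.0, hi=10.0, n=200001):
    """Brute-force f*(y) = sup_x (x*y - f(x)) on a 1D grid."""
    best = -math.inf
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        best = max(best, x * y - f(x))
    return best

def conjugate_via_fenchel(x):
    """Value of f* at y = f'(x), obtained from the Fenchel equality."""
    return x * grad_f(x) - f(x)

x0 = 1.5
y0 = grad_f(x0)
closed_form = y0 * math.log(y0) - y0   # known conjugate of exp for y > 0
```

The grid search scales exponentially with dimension, which is the curse of dimensionality the abstract refers to; the Fenchel identity needs only one gradient evaluation per point.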
[LG-1] The Best of Both Worlds: Hybridizing Neural Operators and Solvers for Stable Long-Horizon Inference
Link: https://arxiv.org/abs/2512.19643
Authors: Rajyasri Roy,Dibyajyoti Nayak,Somdatta Goswami
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 18 pages, 7 figures
Abstract:Numerical simulation of time-dependent partial differential equations (PDEs) is central to scientific and engineering applications, but high-fidelity solvers are often prohibitively expensive for long-horizon or time-critical settings. Neural operator (NO) surrogates offer fast inference across parametric and functional inputs; however, most autoregressive NO frameworks remain vulnerable to compounding errors, and ensemble-averaged metrics provide limited guarantees for individual inference trajectories. In practice, error accumulation can become unacceptable beyond the training horizon, and existing methods lack mechanisms for online monitoring or correction. To address this gap, we propose ANCHOR (Adaptive Numerical Correction for High-fidelity Operator Rollouts), an online, instance-aware hybrid inference framework for stable long-horizon prediction of nonlinear, time-dependent PDEs. ANCHOR treats a pretrained NO as the primary inference engine and adaptively couples it with a classical numerical solver using a physics-informed, residual-based error estimator. Inspired by adaptive time-stepping in numerical analysis, ANCHOR monitors an exponential moving average (EMA) of the normalized PDE residual to detect accumulating error and trigger corrective solver interventions without requiring access to ground-truth solutions. We show that the EMA-based estimator correlates strongly with the true relative L2 error, enabling data-free, instance-aware error control during inference. Evaluations on four canonical PDEs: 1D and 2D Burgers’, 2D Allen-Cahn, and 3D heat conduction, demonstrate that ANCHOR reliably bounds long-horizon error growth, stabilizes extrapolative rollouts, and significantly improves robustness over standalone neural operators, while remaining substantially more efficient than high-fidelity numerical solvers.
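The control loop described in this abstract can be sketched on a toy problem. This is not the ANCHOR implementation: the ODE du/dt = -u, the biased `surrogate_step`, the backward-Euler residual, the threshold `tol`, and the choice to reset the moving average after a correction are all assumptions made to keep the example small.

```python
import math

DT = 0.1  # time step of the toy problem du/dt = -u

def surrogate_step(u):
    """Stand-in for a pretrained neural operator: cheap but biased."""
    return 0.95 * u

def solver_step(u):
    """Stand-in for a trusted numerical solver: exact for this ODE."""
    return math.exp(-DT) * u

def normalized_residual(u_old, u_new):
    """Backward-Euler residual of du/dt = -u, scaled by the state size."""
    return abs((u_new - u_old) / DT + u_new) / (abs(u_old) + 1e-12)

def hybrid_rollout(u0, steps, beta=0.9, tol=0.2):
    """Roll out the surrogate, calling the solver when the residual EMA exceeds tol."""
    u, ema, corrections = u0, 0.0, 0
    for _ in range(steps):
        candidate = surrogate_step(u)
        ema = beta * ema + (1.0 - beta) * normalized_residual(u, candidate)
        if ema > tol:
            candidate = solver_step(u)   # corrective solver intervention
            ema = 0.0                    # assumed: restart monitoring afterwards
            corrections += 1
        u = candidate
    return u, corrections

final_state, n_corrections = hybrid_rollout(1.0, steps=30)
```

Note that the monitor uses only the PDE (here ODE) residual, so no ground-truth trajectory is needed at inference time. In this toy run the solver is invoked on a fraction of the steps, and the final state is closer to the exact solution than a surrogate-only rollout.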
[LG-2] DFORD: Directional Feedback based Online Ordinal Regression Learning
Link: https://arxiv.org/abs/2512.19550
Authors: Naresh Manwani,M Elamparithy,Tanish Taneja
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In this paper, we introduce directional feedback in the ordinal regression setting, in which the learner receives feedback on whether the predicted label is on the left or the right side of the actual label. This is a weak supervision setting for ordinal regression compared to the full information setting, where the learner can access the labels. We propose an online algorithm for ordinal regression using directional feedback. The proposed algorithm uses an exploration-exploitation scheme to learn from directional feedback efficiently. Furthermore, we introduce its kernel-based variant to learn non-linear ordinal regression models in an online setting. We use a truncation trick to make the kernel implementation more memory efficient. The proposed algorithm maintains the ordering of the thresholds in the expected sense. Moreover, it achieves the expected regret of $\mathcal{O}(\log T)$. We compare our approach with a full information and a weakly supervised algorithm for ordinal regression on synthetic and real-world datasets. The proposed approach, which learns using directional feedback, performs comparably (sometimes better) to its full information counterpart.
[LG-3] Deep Learning for Unrelated-Machines Scheduling: Handling Variable Dimensions ICML
Link: https://arxiv.org/abs/2512.19527
Authors: Diego Hitzges,Guillaume Sagnol
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
Comments: 24th IEEE International Conference on Machine Learning and Applications (ICMLA 2025) in Boca Raton, USA. Project page: this https URL . 8 pages, 4 figures, 3 tables
Abstract:Deep learning has been effectively applied to many discrete optimization problems. However, learning-based scheduling on unrelated parallel machines remains particularly difficult to design. Not only do the numbers of jobs and machines vary, but each job-machine pair has a unique processing time, dynamically altering feature dimensions. We propose a novel approach with a neural network tailored for offline deterministic scheduling of arbitrary sizes on unrelated machines. The goal is to minimize a complex objective function that includes the makespan and the weighted tardiness of jobs and machines. Unlike existing online approaches, which process jobs sequentially, our method generates a complete schedule considering the entire input at once. The key contribution of this work lies in the sophisticated architecture of our model. By leveraging various NLP-inspired architectures, it effectively processes any number of jobs and machines with varying feature dimensions imposed by unrelated processing times. Our approach enables supervised training on small problem instances while demonstrating strong generalization to much larger scheduling environments. Trained and tested on instances with 8 jobs and 4 machines, costs were only 2.51% above optimal. Across all tested configurations of up to 100 jobs and 10 machines, our network consistently outperformed an advanced dispatching rule, which incurred 22.22% higher costs on average. As our method allows fast retraining with simulated data and adaptation to various scheduling conditions, we believe it has the potential to become a standard approach for learning-based scheduling on unrelated machines and similar problem environments.
[LG-4] Initialization of a Polyharmonic Cascade Launch and Testing
Link: https://arxiv.org/abs/2512.19524
Authors: Yuriy N. Bakhvalov
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Part 4 of 4 in the “Polyharmonic Cascade” cycle. Contains initialization algorithms and experimental results (MNIST, HIGGS, Epsilon). Previous papers: arXiv:2512.12731 , arXiv:2512.16718 , arXiv:2512.17671 . Source code: this https URL
Abstract:This paper concludes a series of studies on the polyharmonic cascade, a deep machine learning architecture theoretically derived from indifference principles and the theory of random functions. A universal initialization procedure is proposed, based on symmetric constellations in the form of hyperoctahedra with a central point. This initialization not only ensures stable training of cascades with tens and hundreds of layers (up to 500 layers without skip connections), but also radically simplifies the computations. Scalability and robustness are demonstrated on MNIST (98.3% without convolutions or augmentations), HIGGS (AUC approximately 0.885 on 11M examples), and Epsilon (AUC approximately 0.963 with 2000 features). All linear algebra is reduced to 2D operations and is efficiently executed on GPUs. A public repository and an archived snapshot are provided for full reproducibility.
[LG-5] Toward Scalable and Valid Conditional Independence Testing with Spectral Representations
Link: https://arxiv.org/abs/2512.19510
Authors: Alek Frohlich,Vladimir Kostic,Karim Lounici,Daniel Perazzo,Massimiliano Pontil
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Conditional independence (CI) is central to causal inference, feature selection, and graphical modeling, yet it is untestable in many settings without additional assumptions. Existing CI tests often rely on restrictive structural conditions, limiting their validity on real-world data. Kernel methods using the partial covariance operator offer a more principled approach but suffer from limited adaptivity, slow convergence, and poor scalability. In this work, we explore whether representation learning can help address these limitations. Specifically, we focus on representations derived from the singular value decomposition of the partial covariance operator and use them to construct a simple test statistic, reminiscent of the Hilbert-Schmidt Independence Criterion (HSIC). We also introduce a practical bi-level contrastive algorithm to learn these representations. Our theory links representation learning error to test performance and establishes asymptotic validity and power guarantees. Preliminary experiments suggest that this approach offers a practical and statistically grounded path toward scalable CI testing, bridging kernel-based theory with modern representation learning.
[LG-6] Learning from sanctioned government suppliers: A machine learning and network science approach to detecting fraud and corruption in Mexico
Link: https://arxiv.org/abs/2512.19491
Authors: Martí Medina-Hernández,Janos Kertész,Mihály Fazekas
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 15 pages of main text with 6 figures and 31 pages of supplementary information
Abstract:Detecting fraud and corruption in public procurement remains a major challenge for governments worldwide. Most research to-date builds on domain-knowledge-based corruption risk indicators of individual contract-level features and some also analyzes contracting network patterns. A critical barrier for supervised machine learning is the absence of confirmed non-corrupt, negative, examples, which makes conventional machine learning inappropriate for this task. Using publicly available data on federally funded procurement in Mexico and company sanction records, this study implements positive-unlabeled (PU) learning algorithms that integrate domain-knowledge-based red flags with network-derived features to identify likely corrupt and fraudulent contracts. The best-performing PU model on average captures 32 percent more known positives and performs on average 2.3 times better than random guessing, substantially outperforming approaches based solely on traditional red flags. The analysis of the Shapley Additive Explanations reveals that network-derived features, particularly those associated with contracts in the network core or suppliers with high eigenvector centrality, are the most important. Traditional red flags further enhance model performance in line with expectations, albeit mainly for contracts awarded through competitive tenders. This methodology can support law enforcement in Mexico, and it can be adapted to other national contexts too.
[LG-7] Lightweight Intrusion Detection in IoT via SHAP-Guided Feature Pruning and Knowledge-Distilled Kronecker Networks
Link: https://arxiv.org/abs/2512.19488
Authors: Hafsa Benaddi,Mohammed Jouhari,Nouha Laamech,Anas Motii,Khalil Ibrahimi
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: This work has been published in the proceedings of the 2025 8th International Conference on Advanced Communication Technologies and Networking (CommNet)
Abstract:The widespread deployment of Internet of Things (IoT) devices requires intrusion detection systems (IDS) with high accuracy while operating under strict resource constraints. Conventional deep learning IDS are often too large and computationally intensive for edge deployment. We propose a lightweight IDS that combines SHAP-guided feature pruning with knowledge-distilled Kronecker networks. A high-capacity teacher model identifies the most relevant features through SHAP explanations, and a compressed student leverages Kronecker-structured layers to minimize parameters while preserving discriminative inputs. Knowledge distillation transfers softened decision boundaries from teacher to student, improving generalization under compression. Experiments on the TON_IoT dataset show that the student is nearly three orders of magnitude smaller than the teacher yet sustains macro-F1 above 0.986 with millisecond-level inference latency. The results demonstrate that explainability-driven pruning and structured compression can jointly enable scalable, low-latency, and energy-efficient IDS for heterogeneous IoT environments.
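To illustrate why Kronecker-structured layers save parameters, here is a generic sketch; the paper's actual layer design is not given in the abstract, so the shapes and function names below are assumptions. A weight matrix of the form kron(A, B) is stored as two small factors and can be applied to an input without ever building the large matrix.

```python
def kron(a, b):
    """Dense Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def kron_apply(a, b, v):
    """Compute kron(a, b) @ v without materializing the big matrix.

    v is read as a row-major (cols_a x cols_b) matrix X, and the result is
    the row-major flattening of a @ X @ b^T.
    """
    n = len(b[0])
    x = [v[j * n:(j + 1) * n] for j in range(len(a[0]))]
    ax = [[sum(a[i][j] * x[j][l] for j in range(len(x))) for l in range(n)]
          for i in range(len(a))]
    return [sum(ax[i][l] * b[k][l] for l in range(n))
            for i in range(len(a)) for k in range(len(b))]

def param_counts(p, m, q, n):
    """Parameters of a dense (p*q x m*n) layer vs. its two Kronecker factors."""
    return (p * q) * (m * n), p * m + q * n

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
v = [1, 2, 3, 4]
```

For example, `param_counts(8, 8, 8, 8)` compares a dense 64x64 layer (4096 weights) with two 8x8 factors (128 weights), a 32-fold reduction from the structure alone.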
[LG-8] GLUE: Generative Latent Unification of Expertise-Informed Engineering Models
Link: https://arxiv.org/abs/2512.19469
Authors: Tim Aebersold,Soheyl Massoudi,Mark D. Fuge
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 11 pages, 10 figures. Preprint. Submitted to Computer-Aided Engineering (Elsevier)
Abstract:Engineering complex systems (aircraft, buildings, vehicles) requires accounting for geometric and performance couplings across subsystems. As generative models proliferate for specialized domains (wings, structures, engines), a key research gap is how to coordinate frozen, pre-trained submodels to generate full-system designs that are feasible, diverse, and high-performing. We introduce Generative Latent Unification of Expertise-Informed Engineering Models (GLUE), which orchestrates pre-trained, frozen subsystem generators while enforcing system-level feasibility, optimality, and diversity. We propose and benchmark (i) data-driven GLUE models trained on pre-generated system-level designs and (ii) a data-free GLUE model trained online on a differentiable geometry layer. On a UAV design problem with five coupling constraints, we find that data-driven approaches yield diverse, high-performing designs but require large datasets to satisfy constraints reliably. The data-free approach is competitive with Bayesian optimization and gradient-based optimization in performance and feasibility while training a full generative model in only 10 min on a RTX 4090 GPU, requiring more than two orders of magnitude fewer geometry evaluations and FLOPs than the data-driven method. Ablations focused on data-free training show that subsystem output continuity affects coordination, and equality constraints can trigger mode collapse unless mitigated. By integrating unmodified, domain-informed submodels into a modular generative workflow, this work provides a viable path for scaling generative design to complex, real-world engineering systems.
[LG-9] Binary Kernel Logistic Regression: a sparsity-inducing formulation and a convergent decomposition training algorithm
Link: https://arxiv.org/abs/2512.19440
Authors: Antonio Consolo,Andrea Manno,Edoardo Amaldi
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Kernel logistic regression (KLR) is a widely used supervised learning method for binary and multi-class classification, which provides estimates of the conditional probabilities of class membership for the data points. Unlike other kernel methods such as Support Vector Machines (SVMs), KLRs are generally not sparse. Previous attempts to deal with sparsity in KLR include a heuristic method referred to as the Import Vector Machine (IVM) and ad hoc regularizations such as the $\ell_{1/2}$-based one. Achieving a good trade-off between prediction accuracy and sparsity is still a challenging issue with a potential significant impact from the application point of view. In this work, we revisit binary KLR and propose an extension of the training formulation proposed by Keerthi et al., which is able to induce sparsity in the trained model, while maintaining good testing accuracy. To efficiently solve the dual of this formulation, we devise a decomposition algorithm of Sequential Minimal Optimization type which exploits second-order information, and for which we establish global convergence. Numerical experiments conducted on 12 datasets from the literature show that the proposed binary KLR approach achieves a competitive trade-off between accuracy and sparsity with respect to IVM, $\ell_{1/2}$-based regularization for KLR, and SVM while retaining the advantages of providing informative estimates of the class membership probabilities.
[LG-10] An Inverse Scattering Inspired Fourier Neural Operator for Time-Dependent PDE Learning
Link: https://arxiv.org/abs/2512.19439
Authors: Rixin Yu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Learning accurate and stable time-advancement operators for nonlinear partial differential equations (PDEs) remains challenging, particularly for chaotic, stiff, and long-horizon dynamical systems. While neural operator methods such as the Fourier Neural Operator (FNO) and Koopman-inspired extensions achieve good short-term accuracy, their long-term stability is often limited by unconstrained latent representations and cumulative rollout errors. In this work, we introduce an inverse scattering inspired Fourier Neural Operator (IS-FNO), motivated by the reversibility and spectral evolution structure underlying the classical inverse scattering transform. The proposed architecture enforces a near-reversible pairing between lifting and projection maps through an explicitly invertible neural transformation, and models latent temporal evolution using exponential Fourier layers that naturally encode linear and nonlinear spectral dynamics. We systematically evaluate IS-FNO against baseline FNO and Koopman-based models on a range of benchmark PDEs, including the Michelson-Sivashinsky and Kuramoto-Sivashinsky equations (in one and two dimensions), as well as the integrable Korteweg-de Vries and Kadomtsev-Petviashvili equations. The results demonstrate that IS-FNO achieves lower short-term errors and substantially improved long-horizon stability in non-stiff regimes. For integrable systems, reduced IS-FNO variants that embed analytical scattering structure retain competitive long-term accuracy despite limited model capacity. Overall, this work shows that incorporating physical structure – particularly reversibility and spectral evolution – into neural operator design significantly enhances robustness and long-term predictive fidelity for nonlinear PDE dynamics.
[LG-11] Symplectic Reservoir Representation of Legendre Dynamics
Link: https://arxiv.org/abs/2512.19409
Authors: Robert Simon Fong,Gouhei Tanaka,Kazuyuki Aihara
Categories: Machine Learning (cs.LG)
Comments: 39 pages
Abstract:Modern learning systems act on internal representations of data, yet how these representations encode underlying physical or statistical structure is often left implicit. In physics, conservation laws of Hamiltonian systems such as symplecticity guarantee long-term stability, and recent work has begun to hard-wire such constraints into learning models at the loss or output level. Here we ask a different question: what would it mean for the representation itself to obey a symplectic conservation law in the sense of Hamiltonian mechanics? We express this symplectic constraint through Legendre duality: the pairing between primal and dual parameters, which becomes the structure that the representation must preserve. We formalize Legendre dynamics as stochastic processes whose trajectories remain on Legendre graphs, so that the evolving primal-dual parameters stay Legendre dual. We show that this class includes linear time-invariant Gaussian process regression and Ornstein-Uhlenbeck dynamics. Geometrically, we prove that the maps that preserve all Legendre graphs are exactly symplectomorphisms of cotangent bundles of the form “cotangent lift of a base diffeomorphism followed by an exact fibre translation”. Dynamically, this characterization leads to the design of a Symplectic Reservoir (SR), a reservoir-computing architecture that is a special case of recurrent neural network and whose recurrent core is generated by Hamiltonian systems that are at most linear in the momentum. Our main theorem shows that every SR update has this normal form and therefore transports Legendre graphs to Legendre graphs, preserving Legendre duality at each time step. Overall, SR implements a geometrically constrained, Legendre-preserving representation map, injecting symplectic geometry and Hamiltonian mechanics directly at the representational level. 
[LG-12] Brain-Grounded Axes for Reading and Steering LLM States
Link: https://arxiv.org/abs/2512.19399
Authors: Sandro Andric
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures. Code: this https URL
Abstract:Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.
[LG-13] Real-Time Machine Learning for Embedded Anomaly Detection
Link: https://arxiv.org/abs/2512.19383
Authors: Abdelmadjid Benmachiche,Khadija Rais,Hamda Slimi
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The spread of a resource-constrained Internet of Things (IoT) environment and embedded devices has put pressure on the real-time detection of anomalies occurring at the edge. This survey presents an overview of machine-learning methods aimed specifically at on-device anomaly detection with extremely strict constraints for latency, memory, and power consumption. Lightweight algorithms such as Isolation Forest, One-Class SVM, recurrent architectures, and statistical techniques are compared here according to the realities of embedded implementation. Our survey brings out significant trade-offs of accuracy and computational efficiency of detection, as well as how hardware constraints end up fundamentally redefining algorithm choice. The survey is completed with a set of practical recommendations on the choice of the algorithm depending on the equipment profiles and new trends in TinyML, which can help close the gap between detection capabilities and embedded reality. The paper serves as a strategic roadmap for engineers deploying anomaly detection in edge environments that are constrained by bandwidth and may be safety-critical.
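As an illustration of the "statistical techniques" end of the spectrum this survey covers, here is a minimal sketch of a constant-memory streaming detector of the kind that fits on a microcontroller. It is not taken from the paper: the class name, the z-score rule, the `threshold` and `warmup` defaults, and the sample readings are all assumptions. It uses Welford's running mean and variance, so it stores three numbers regardless of stream length.

```python
import math

class StreamingZScore:
    """Constant-memory anomaly detector using Welford's running mean/variance."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold   # flag samples beyond this many std devs
        self.warmup = warmup         # samples to observe before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                # running sum of squared deviations

    def update(self, x):
        """Return True if x is anomalous relative to the samples seen so far."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_anomaly = std > 0 and abs(x - self.mean) > self.threshold * std
        if not is_anomaly:  # keep outliers out of the running statistics
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = StreamingZScore()
readings = [20.0, 20.5, 19.5, 20.2, 19.8, 20.1, 19.9,
            20.3, 19.7, 20.0, 20.1, 35.0, 19.9]
flags = [detector.update(r) for r in readings]
```

Only the 35.0 reading is flagged here. The trade-off is the one the survey highlights: this detector costs a few arithmetic operations per sample but cannot capture multivariate or temporal structure the way Isolation Forest or recurrent models can.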
[LG-14] From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples AAAI’26
链接: https://arxiv.org/abs/2512.19363
作者: Canran Xiao,Jiabao Dou,Zhiming Lin,Zong Ke,Liwei Hou
类目: Machine Learning (cs.LG)
*备注: AAAI’26 Oral
Abstract:How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers in principle, but its $O(n!)$ complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to $O(T \sum_{l} K_l) = O(T K_{\max} \log n)$, rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss $O(\eta \log n)$, enjoys sub-Gaussian coalition deviation $\tilde{O}(1/\sqrt{T})$, and incurs at most $k\,\epsilon_{\infty}$ regret for top-$k$ selection. Experiments on four benchmarks (tabular, vision, streaming, and a 45M-sample CTR task) plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, slashes valuation time by up to 100x, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.
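For readers unfamiliar with the baseline the abstract contrasts against, the following is a minimal sketch of classical permutation-sampling (Monte-Carlo) Data-Shapley, not of HCDV itself. The function name `monte_carlo_shapley`, the toy `WEIGHTS` table, and the additive `toy_utility` are all hypothetical and chosen only so the result is easy to check by hand.

```python
import random


def monte_carlo_shapley(points, utility, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in points}
    for _ in range(num_permutations):
        order = list(points)
        rng.shuffle(order)
        coalition = []
        previous = utility(coalition)
        for p in order:
            coalition.append(p)
            current = utility(coalition)
            # Marginal contribution of p given the points that precede it.
            totals[p] += current - previous
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


# Hypothetical toy game: the utility of a coalition is the sum of fixed weights.
WEIGHTS = {"a": 3, "b": 1, "c": 0}


def toy_utility(coalition):
    return sum(WEIGHTS[p] for p in coalition)


estimates = monte_carlo_shapley(list(WEIGHTS), toy_utility)
```

In an additive game every marginal contribution equals the point's own weight, so the estimate coincides with the weights; with a real model-accuracy utility, each call to `utility` means retraining or re-evaluating a model, which is the cost that hierarchical schemes aim to reduce.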
[LG-15] Interpretable Hybrid Deep Q-Learning Framework for IoT-Based Food Spoilage Prediction with Synthetic Data Generation and Hardware Validation
链接: https://arxiv.org/abs/2512.19361
作者: Isshaan Singh,Divyansh Chawla,Anshu Garg,Shivin Mangal,Pallavi Gupta,Khushi Agarwal,Nimrat Singh Khalsa,Nandan Patel
类目: Machine Learning (cs.LG)
*备注:
Abstract:The need for an intelligent, real-time spoilage prediction system has become critical in modern IoT-driven food supply chains, where perishable goods are highly susceptible to environmental conditions. Existing methods often lack adaptability to dynamic conditions and fail to optimize decision making in real time. To address these challenges, we propose a hybrid reinforcement learning framework integrating Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN) for enhanced spoilage prediction. This hybrid architecture captures temporal dependencies within sensor data, enabling robust and adaptive decision making. In alignment with interpretable artificial intelligence principles, a rule-based classifier environment is employed to provide transparent ground truth labeling of spoilage levels based on domain-specific thresholds. This structured design allows the agent to operate within clearly defined semantic boundaries, supporting traceable and interpretable decisions. Model behavior is monitored using interpretability-driven metrics, including spoilage accuracy, reward-to-step ratio, loss reduction rate, and exploration decay. These metrics provide both quantitative performance evaluation and insights into learning dynamics. A class-wise spoilage distribution visualization is used to analyze the agent's decision profile and policy behavior. Extensive evaluations on simulated and real-time hardware data demonstrate that the LSTM- and RNN-based agent outperforms alternative reinforcement learning approaches in prediction accuracy and decision efficiency while maintaining interpretability. The results highlight the potential of hybrid deep reinforcement learning with integrated interpretability for scalable IoT-based food monitoring systems.
[LG-16] Faster Distributed Inference-Only Recommender Systems via Bounded Lag Synchronous Collectives
链接: https://arxiv.org/abs/2512.19342
作者: Kiril Dichev,Filip Pawlowski,Albert-Jan Yzelman
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Recommender systems are enablers of personalized content delivery, and therefore revenue, for many large companies. Over the last decade, deep learning recommender models (DLRMs) have become the de-facto standard in this field. The main bottleneck in DLRM inference is the lookup of sparse features across huge embedding tables, which are usually partitioned across the aggregate RAM of many nodes. In state-of-the-art recommender systems, the distributed lookup is implemented via irregular all-to-all (alltoallv) communication, and often presents the main bottleneck. Today, most related work sees this operation as a given; in addition, every collective is synchronous in nature. In this work, we propose a novel bounded lag synchronous (BLS) version of the alltoallv operation. The bound can be a parameter allowing slower processes to lag behind entire iterations before the fastest processes block. In special applications such as inference-only DLRM, the accuracy of the application is fully preserved. We implement BLS alltoallv in a new PyTorch Distributed backend and evaluate it with a BLS version of the reference DLRM code. We show that for well-balanced, homogeneous-access DLRM runs our BLS technique does not offer notable advantages. But for unbalanced runs, e.g. runs with strongly irregular embedding table accesses or with delays across different processes, our BLS technique improves both the latency and throughput of inference-only DLRM. In the best-case scenario, the proposed reduced synchronisation can mask the delays across processes altogether.
[LG-17] Orthogonal Approximate Message Passing with Optimal Spectral Initializations for Rectangular Spiked Matrix Models
链接: https://arxiv.org/abs/2512.19334
作者: Haohua Chen,Songbin Liu,Junjie Ma
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We propose an orthogonal approximate message passing (OAMP) algorithm for signal estimation in the rectangular spiked matrix model with general rotationally invariant (RI) noise. We establish a rigorous state evolution that precisely characterizes the algorithm’s high-dimensional dynamics and enables the construction of iteration-wise optimal denoisers. Within this framework, we accommodate spectral initializations under minimal assumptions on the empirical noise spectrum. In the rectangular setting, where a single rank-one component typically generates multiple informative outliers, we further propose a procedure for combining these outliers under mild non-Gaussian signal assumptions. For general RI noise models, the predicted performance of the proposed optimal OAMP algorithm agrees with replica-symmetric predictions for the associated Bayes-optimal estimator, and we conjecture that it is statistically optimal within a broad class of iterative estimation methods.
[LG-18] A Logical View of GNN-Style Computation and the Role of Activation Functions
链接: https://arxiv.org/abs/2512.19332
作者: Pablo Barceló,Floris Geerts,Matthias Lanzinger,Klara Pakhomenko,Jan Van den Bussche
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:
Abstract:We study the numerical and Boolean expressiveness of MPLang, a declarative language that captures the computation of graph neural networks (GNNs) through linear message passing and activation functions. We begin with A-MPLang, the fragment without activation functions, and give a characterization of its expressive power in terms of walk-summed features. For bounded activation functions, we show that (under mild conditions) all eventually constant activations yield the same expressive power - numerical and Boolean - and that it subsumes previously established logics for GNNs with eventually constant activation functions but without linear layers. Finally, we prove the first expressive separation between unbounded and bounded activations in the presence of linear layers: MPLang with ReLU is strictly more powerful for numerical queries than MPLang with eventually constant activation functions, e.g., truncated ReLU. This hinges on subtle interactions between linear aggregation and eventually constant non-linearities, and it establishes that GNNs using ReLU are more expressive than those restricted to eventually constant activations and linear layers.
[LG-19] Time-Vertex Machine Learning for Optimal Sensor Placement in Temporal Graph Signals: Applications in Structural Health Monitoring
链接: https://arxiv.org/abs/2512.19309
作者: Keivan Faghih Niresi,Jun Qing,Mengjie Zhao,Olga Fink
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Structural Health Monitoring (SHM) plays a crucial role in maintaining the safety and resilience of infrastructure. As sensor networks grow in scale and complexity, identifying the most informative sensors becomes essential to reduce deployment costs without compromising monitoring quality. While Graph Signal Processing (GSP) has shown promise by leveraging spatial correlations among sensor nodes, conventional approaches often overlook the temporal dynamics of structural behavior. To overcome this limitation, we propose Time-Vertex Machine Learning (TVML), a novel framework that integrates GSP, time-domain analysis, and machine learning to enable interpretable and efficient sensor placement by identifying representative nodes that minimize redundancy while preserving critical information. We evaluate the proposed approach on two bridge datasets for damage detection and time-varying graph signal reconstruction tasks. The results demonstrate the effectiveness of our approach in enhancing SHM systems by providing a robust, adaptive, and efficient solution for sensor placement.
[LG-20] GShield: Mitigating Poisoning Attacks in Federated Learning
链接: https://arxiv.org/abs/2512.19286
作者: Sameera K. M.,Serena Nicolazzo,Antonino Nocera,Vinod P.,Rafidha Rehiman K. A
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Federated Learning (FL) has recently emerged as a revolutionary approach to collaboratively training Machine Learning models. In particular, it enables decentralized model training while preserving data privacy, but its distributed nature makes it highly vulnerable to a severe attack known as Data Poisoning. In such scenarios, malicious clients inject manipulated data into the training process, thereby degrading global model performance or causing targeted misclassification. In this paper, we present a novel defense mechanism called GShield, designed to detect and mitigate malicious and low-quality updates, especially under non-independent and identically distributed (non-IID) data scenarios. GShield operates by learning the distribution of benign gradients through clustering and Gaussian modeling during an initial round, enabling it to establish a reliable baseline of trusted client behavior. With this benign profile, GShield selectively aggregates only those updates that align with the expected gradient patterns, effectively isolating adversarial clients and preserving the integrity of the global model. An extensive experimental campaign demonstrates that our proposed defense significantly improves model robustness compared to state-of-the-art methods while maintaining high accuracy across both tabular and image datasets. Furthermore, GShield improves the accuracy of the targeted class by 43% to 65% after detecting malicious and low-quality clients.
[LG-21] Translating Flow to Policy via Hindsight Online Imitation
链接: https://arxiv.org/abs/2512.19269
作者: Yitian Zheng,Zhangchen Ye,Weijun Dong,Shengjie Wang,Yuyang Liu,Chongjie Zhang,Chuan Wen,Yang Gao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions. This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance. Nevertheless, grounding these high-level plans into executable actions remains challenging, especially with the limited availability of high-quality robot data. To this end, we propose to improve the low-level policy through online interactions. Specifically, our approach collects online rollouts, retrospectively annotates the corresponding high-level goals from achieved outcomes, and aggregates these hindsight-relabeled experiences to update a goal-conditioned imitation policy. Our method, Hindsight Flow-conditioned Online Imitation (HinFlow), instantiates this idea with 2D point flows as the high-level planner. Across diverse manipulation tasks in both simulation and the physical world, our method achieves more than $2\times$ performance improvement over the base policy, significantly outperforming existing methods. Moreover, our framework enables policy acquisition from planners trained on cross-embodiment video data, demonstrating its potential for scalable and transferable robot learning.
[LG-22] Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems NEURIPS2025
链接: https://arxiv.org/abs/2512.19250
作者: Prathamesh Devadiga
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: Accepted at NeurIPS 2025 ML for Systems Workshop
Abstract:Traditional auto-parallelizing compilers, reliant on rigid heuristics, struggle with the complexity of modern heterogeneous systems. This paper presents a comprehensive evaluation of small (approximately 1B parameter) language-model-driven compiler auto-parallelization. We evaluate three models: gemma3, llama3.2, and qwen2.5, using six reasoning strategies across 11 real-world kernels drawn from scientific computing, graph algorithms, and machine learning. Our system is benchmarked against strong compiler baselines, including LLVM Polly, TVM, and Triton. Across 376 total evaluations, the proposed approach achieves an average speedup of 6.81x and a peak performance of 43.25x on convolution operations. We analyze scalability, verify correctness using multiple sanitizers, and confirm robustness across diverse compilers and hardware platforms. Our results demonstrate that small, efficient language models can serve as powerful reasoning engines for complex compiler optimization tasks.
[LG-23] From Black-Box Tuning to Guided Optimization via Hyperparameters Interaction Analysis
链接: https://arxiv.org/abs/2512.19246
作者: Moncef Garouani,Ayah Barhrhouj
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hyperparameter tuning is a fundamental, yet computationally expensive, step in optimizing machine learning models. Beyond optimization, understanding the relative importance and interaction of hyperparameters is critical to efficient model development. In this paper, we introduce MetaSHAP, a scalable semi-automated eXplainable AI (XAI) method that uses meta-learning and Shapley value analysis to provide actionable and dataset-aware tuning insights. MetaSHAP operates over a vast benchmark of more than 9 million evaluated machine learning pipelines, allowing it to produce interpretable importance scores and actionable tuning insights that reveal how much each hyperparameter matters, how it interacts with others, and in which value ranges its influence is concentrated. For a given algorithm and dataset, MetaSHAP learns a surrogate performance model from historical configurations, computes hyperparameter interactions using SHAP-based analysis, and derives interpretable tuning ranges from the most influential hyperparameters. This allows practitioners not only to prioritize which hyperparameters to tune, but also to understand their directionality and interactions. We empirically validate MetaSHAP on a diverse benchmark of 164 classification datasets and 14 classifiers, demonstrating that it produces reliable importance rankings and competitive performance when used to guide Bayesian optimization.
[LG-24] Regression generation adversarial network based on dual data evaluation strategy for industrial application
链接: https://arxiv.org/abs/2512.19232
作者: Zesen Wang,Yonggang Li,Lijuan Lan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Soft sensing infers hard-to-measure data through a large number of easily obtainable variables. However, in complex industrial scenarios, the issue of insufficient data volume persists, which diminishes the reliability of soft sensing. Generative Adversarial Networks (GANs) are an effective solution for addressing insufficient samples. Nevertheless, traditional GANs fail to account for the mapping relationship between labels and features, which limits further performance improvement. Although some studies have proposed solutions, none have considered both performance and efficiency simultaneously. To address these problems, this paper proposes a multi-task learning-based regression GAN framework that integrates regression information into both the discriminator and generator, and implements a shallow sharing mechanism between the discriminator and regressor. This approach significantly enhances the quality of generated samples while improving the algorithm’s operational efficiency. Moreover, considering the importance of training samples and generated samples, a dual data evaluation strategy is designed to make the GAN generate more diverse samples, thereby increasing the generalization of subsequent modeling. The superiority of the method is validated through four classic industrial soft sensing cases: wastewater treatment plants, surface water, CO$_2$ absorption towers, and industrial gas turbines.
[LG-25] Phase-space entropy at acquisition reflects downstream learnability
链接: https://arxiv.org/abs/2512.19223
作者: Xiu-Cheng Wang,Jun-Jie Zhanga,Nan Cheng,Long-Gang Pang,Taijiao Du,Deyu Meng
类目: Machine Learning (cs.LG)
*备注: 22 pages 6 figures
Abstract:Modern learning systems work with data that vary widely across domains, but they all ultimately depend on how much structure is already present in the measurements before any model is trained. This raises a basic question: is there a general, modality-agnostic way to quantify how acquisition itself preserves or destroys the information that downstream learners could use? Here we propose an acquisition-level scalar $\Delta S_{\mathcal{B}}$ based on instrument-resolved phase space. Unlike pixelwise distortion or purely spectral errors that often saturate under aggressive undersampling, $\Delta S_{\mathcal{B}}$ directly quantifies how acquisition mixes or removes joint space-frequency structure at the instrument scale. We show theoretically that $\Delta S_{\mathcal{B}}$ correctly identifies the phase-space coherence of periodic sampling as the physical source of aliasing, recovering classical sampling-theorem consequences. Empirically, across masked image classification, accelerated MRI, and massive MIMO (including over-the-air measurements), $|\Delta S_{\mathcal{B}}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty without training. In particular, minimizing $|\Delta S_{\mathcal{B}}|$ enables zero-training selection of variable-density MRI mask parameters that matches designs tuned by conventional pre-reconstruction criteria. These results suggest that phase-space entropy at acquisition reflects downstream learnability, enabling pre-training selection of candidate sampling policies and serving as a shared notion of information preservation across modalities.
[LG-26] Causal Heterogeneous Graph Learning Method for Chronic Obstructive Pulmonary Disease Prediction
链接: https://arxiv.org/abs/2512.19194
作者: Leming Zhou,Zuo Wang,Zhigang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Because diagnosis and treatment capabilities at the grassroots level are insufficient, the early identification and early warning of acute exacerbations of chronic obstructive pulmonary disease (COPD) remain deficient, which often results in high prevalence and a high disease burden while the screening rate stays relatively low. To gradually improve this situation, we develop a Causal Heterogeneous Graph Representation Learning (CHGRL) method for COPD comorbidity risk prediction that: a) constructs a heterogeneous dataset that includes the interactions between patients and diseases; b) builds a cause-aware heterogeneous graph learning architecture that combines causal inference mechanisms with heterogeneous graph learning and supports heterogeneous graph causal learning for different types of relationships; and c) incorporates a causal loss function in the model design, adding a counterfactual reasoning learning loss and a causal regularization loss on top of the cross-entropy classification loss. We evaluate our method and compare its performance with strong GNN baselines. In the experimental evaluation, the proposed model demonstrates high detection accuracy.
[LG-27] RP-CATE: Recurrent Perceptron-based Channel Attention Transformer Encoder for Industrial Hybrid Modeling
链接: https://arxiv.org/abs/2512.19147
作者: Haoran Yang,Yinan Zhang,Wenjie Zhang,Dongxia Wang,Peiyu Liu,Yuqi Ye,Kexin Chen,Wenhai Wang
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures
Abstract:Industrial hybrid modeling, which integrates mechanistic modeling with machine learning-based modeling techniques, has attracted increasing interest from scholars due to its high accuracy, low computational cost, and satisfactory interpretability. Nevertheless, existing industrial hybrid modeling methods still face two main limitations. First, current research has mainly focused on applying a single machine learning method to one specific task, failing to develop a comprehensive machine learning architecture suitable for modeling tasks, which limits the ability to effectively represent complex industrial scenarios. Second, industrial datasets often contain underlying associations (e.g., monotonicity or periodicity) that are not adequately exploited by current research, which can degrade the model’s predictive performance. To address these limitations, this paper proposes the Recurrent Perceptron-based Channel Attention Transformer Encoder (RP-CATE), with three distinctive characteristics: (1) We develop a novel architecture by replacing the self-attention mechanism with channel attention and incorporating our proposed Recurrent Perceptron (RP) Module into the Transformer, achieving enhanced effectiveness for industrial modeling tasks compared to the original Transformer. (2) We propose a new data type called Pseudo-Image Data (PID), tailored to the requirements of channel attention, and develop a cyclic sliding window method for generating PID. (3) We introduce the concept of Pseudo-Sequential Data (PSD) and a method for converting industrial datasets into PSD, which enables the RP Module to capture the underlying associations within industrial datasets more effectively. An experiment on hybrid modeling in chemical engineering was conducted using RP-CATE, and the experimental results demonstrate that RP-CATE outperforms the baseline models.
[LG-28] A Convex Loss Function for Set Prediction with Optimal Trade-offs Between Size and Conditional Coverage
链接: https://arxiv.org/abs/2512.19142
作者: Francis Bach(SIERRA)
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:We consider supervised learning problems in which set predictions provide explicit uncertainty estimates. Using Choquet integrals (a.k.a. Lovász extensions), we propose a convex loss function for nondecreasing subset-valued functions obtained as level sets of a real-valued function. This loss function allows optimal trade-offs between conditional probabilistic coverage and the "size" of the set, measured by a non-decreasing submodular function. We also propose several extensions that mimic loss functions and criteria for binary classification with asymmetric losses, and show how to naturally obtain sets with optimized conditional coverage. We derive efficient optimization algorithms, either based on stochastic gradient descent or reweighted least-squares formulations, and illustrate our findings with a series of experiments on synthetic datasets for classification and regression tasks, showing improvements over approaches that aim for marginal coverage.
[LG-29] Evidential Trust-Aware Model Personalization in Decentralized Federated Learning for Wearable IoT
链接: https://arxiv.org/abs/2512.19131
作者: Murtaza Rangwala,Richard O. Sinnott,Rajkumar Buyya
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Decentralized federated learning (DFL) enables collaborative model training across edge devices without centralized coordination, offering resilience against single points of failure. However, statistical heterogeneity arising from non-identically distributed local data creates a fundamental challenge: nodes must learn personalized models adapted to their local distributions while selectively collaborating with compatible peers. Existing approaches either enforce a single global model that fits no one well, or rely on heuristic peer selection mechanisms that cannot distinguish between peers with genuinely incompatible data distributions and those with valuable complementary knowledge. We present Murmura, a framework that leverages evidential deep learning to enable trust-aware model personalization in DFL. Our key insight is that epistemic uncertainty from Dirichlet-based evidential models directly indicates peer compatibility: high epistemic uncertainty when a peer’s model evaluates local data reveals distributional mismatch, enabling nodes to exclude incompatible influence while maintaining personalized models through selective collaboration. Murmura introduces a trust-aware aggregation mechanism that computes peer compatibility scores through cross-evaluation on local validation samples and personalizes model aggregation based on evidential trust with adaptive thresholds. Evaluation on three wearable IoT datasets (UCI HAR, PAMAP2, PPG-DaLiA) demonstrates that Murmura reduces performance degradation from IID to non-IID conditions compared to baseline (0.9% vs. 19.3%), achieves $7.4\times$ faster convergence, and maintains stable accuracy across hyperparameter choices. These results establish evidential uncertainty as a principled foundation for compatibility-aware personalization in decentralized heterogeneous environments.
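As a rough illustration of how a Dirichlet-based evidential model can yield an uncertainty score for peer filtering, the sketch below uses the standard subjective-logic "vacuity" $u = K / \sum_k \alpha_k$ with $\alpha_k = e_k + 1$. This is an assumption made for illustration: the abstract does not say which uncertainty measure Murmura uses, and the fixed `threshold`, the helper names, and the evidence values are all hypothetical (the paper describes adaptive thresholds).

```python
def dirichlet_vacuity(evidence):
    """Vacuity u = K / sum(alpha), where alpha_k = evidence_k + 1."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def compatible_peers(peer_evidence, threshold):
    """Keep peers whose model is not too uncertain on the local validation data."""
    return [
        peer
        for peer, evidence in peer_evidence.items()
        if dirichlet_vacuity(evidence) <= threshold
    ]


# Hypothetical evidence produced by two peers' models on one local sample.
peer_evidence = {
    "peer_a": [18.0, 1.0, 1.0],  # strong evidence, so low vacuity
    "peer_b": [0.0, 0.0, 0.0],   # no evidence, so vacuity is maximal
}
selected = compatible_peers(peer_evidence, threshold=0.5)
```

With zero evidence the Dirichlet is uniform and the vacuity reaches its maximum of 1, which is the behaviour one would expect when a peer's model sees data from a distribution it was never trained on.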
[LG-30] A Composable Channel-Adaptive Architecture for Seizure Classification
链接: https://arxiv.org/abs/2512.19123
作者: Francesco Carzaniga,Michael Hersche,Kaspar Schindler,Abbas Rahimi
类目: Machine Learning (cs.LG)
*备注: 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Objective: We develop a channel-adaptive (CA) architecture that seamlessly processes multivariate time-series with an arbitrary number of channels, and in particular intracranial electroencephalography (iEEG) recordings. Methods: Our CA architecture first processes the iEEG signal using state-of-the-art models applied to each single channel independently. The resulting features are then fused using a vector-symbolic algorithm which reconstructs the spatial relationship using a trainable scalar per channel. Finally, the fused features are accumulated in a long-term memory of up to 2 minutes to perform the classification. Each CA-model can then be pre-trained on a large corpus of iEEG recordings from multiple heterogeneous subjects. The pre-trained model is personalized to each subject via a quick fine-tuning routine, which uses equal or lower amounts of data compared to existing state-of-the-art models while requiring only 1/5 of the time. Results: We evaluate our CA-models on a seizure detection task both on a short-term (~20 hours) and a long-term (~2500 hours) dataset. In particular, our CA-EEGWaveNet is trained on a single seizure of the tested subject, while the baseline EEGWaveNet is trained on all but one. Even in this challenging scenario, our CA-EEGWaveNet surpasses the baseline in median F1-score (0.78 vs 0.76). Similarly, CA-EEGNet, based on EEGNet, also surpasses its baseline in median F1-score (0.79 vs 0.74). Conclusion and significance: Our CA-model addresses two issues: first, it is channel-adaptive and can therefore be trained across heterogeneous subjects without loss of performance; second, it increases the effective temporal context size to a clinically relevant length. Therefore, our model is a drop-in replacement for existing models, bringing better characteristics and performance across the board.
[LG-31] Timely Parameter Updating in Over-the-Air Federated Learning
链接: https://arxiv.org/abs/2512.19103
作者: Jiaqi Zhu,Zhongyuan Zhao,Xiao Li,Ruihao Du,Shi Jin,Howard H.Yang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Incorporating over-the-air computations (OAC) into the model training process of federated learning (FL) is an effective approach to alleviating the communication bottleneck in FL systems. Under OAC-FL, every client modulates its intermediate parameters, such as gradient, onto the same set of orthogonal waveforms and simultaneously transmits the radio signal to the edge server. By exploiting the superposition property of multiple-access channels, the edge server can obtain an automatically aggregated global gradient from the received signal. However, the limited number of orthogonal waveforms available in practical systems is fundamentally mismatched with the high dimensionality of modern deep learning models. To address this issue, we propose Freshness-mAgnItude awaRe top-k (FAIR-k), an algorithm that selects, in each communication round, the most impactful subset of gradients to be updated over the air. In essence, FAIR-k combines the complementary strengths of the Round-Robin and Top-k algorithms, striking a delicate balance between timeliness (freshness of parameter updates) and importance (gradient magnitude). Leveraging tools from Markov analysis, we characterize the distribution of parameter staleness under FAIR-k. Building on this, we establish the convergence rate of OAC-FL with FAIR-k, which discloses the joint effect of data heterogeneity, channel noise, and parameter staleness on the training efficiency. Notably, as opposed to conventional analyses that assume a universal Lipschitz constant across all the clients, our framework adopts a finer-grained model of the data heterogeneity. The analysis demonstrates that since FAIR-k promotes fresh (and fair) parameter updates, it not only accelerates convergence but also enhances communication efficiency by enabling an extended period of local training without significantly affecting overall training efficiency.
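To make the trade-off between timeliness and importance concrete, here is a small sketch of the two well-known baselines named in the abstract, Top-k and Round-Robin coordinate selection, followed by one hypothetical way to blend them. The additive score in `blended_indices` and its `freshness_weight` parameter are assumptions for illustration only; the abstract does not specify FAIR-k's actual selection rule.

```python
def top_k_indices(gradient, k):
    """Pick the k coordinates with the largest absolute gradient."""
    ranked = sorted(range(len(gradient)), key=lambda i: (-abs(gradient[i]), i))
    return sorted(ranked[:k])


def round_robin_indices(dimension, k, round_index):
    """Cycle through the coordinates in consecutive blocks of size k."""
    start = (round_index * k) % dimension
    return sorted((start + j) % dimension for j in range(k))


def blended_indices(gradient, staleness, k, freshness_weight):
    """Hypothetical blend: score_i = |g_i| + freshness_weight * staleness_i."""
    scores = [abs(g) + freshness_weight * s for g, s in zip(gradient, staleness)]
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


gradient = [0.9, -0.1, 0.05, -0.8, 0.2]
staleness = [0, 3, 9, 0, 1]  # rounds since each coordinate was last updated

by_magnitude = top_k_indices(gradient, k=2)
by_schedule = round_robin_indices(len(gradient), k=2, round_index=1)
by_blend = blended_indices(gradient, staleness, k=2, freshness_weight=0.2)
```

In this toy case pure Top-k keeps refreshing coordinates 0 and 3, while the blended score also promotes coordinate 2, whose gradient is small but which has not been updated for nine rounds.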
[LG-32] Dual Model Deep Learning for Alzheimer Prognostication
链接: https://arxiv.org/abs/2512.19099
作者: Alireza Moayedikia,Sara Fin,Uffe Kock Wiil
类目: Machine Learning (cs.LG)
*备注:
Abstract:Disease modifying therapies for Alzheimer’s disease demand precise timing decisions, yet current predictive models require longitudinal observations and provide no uncertainty quantification, rendering them impractical at the critical first visit when treatment decisions must be made. We developed PROGRESS (PRognostic Generalization from REsting Static Signatures), a dual-model deep learning framework that transforms a single baseline cerebrospinal fluid biomarker assessment into actionable prognostic estimates without requiring prior clinical history. The framework addresses two complementary clinical questions: a probabilistic trajectory network predicts individualized cognitive decline with calibrated uncertainty bounds achieving near-nominal coverage, enabling honest prognostic communication; and a deep survival model estimates time to conversion from mild cognitive impairment to dementia. Using data from over 3,000 participants across 43 Alzheimer’s Disease Research Centers in the National Alzheimer’s Coordinating Center database, PROGRESS substantially outperforms Cox proportional hazards, Random Survival Forests, and gradient boosting methods for survival prediction. Risk stratification identifies patient groups with seven-fold differences in conversion rates, enabling clinically meaningful treatment prioritization. Leave-one-center-out validation demonstrates robust generalizability, with survival discrimination remaining strong across held-out sites despite heterogeneous measurement conditions spanning four decades of assay technologies. By combining superior survival prediction with trustworthy trajectory uncertainty quantification, PROGRESS bridges the gap between biomarker measurement and personalized clinical decision-making.
[LG-33] On Cost-Aware Sequential Hypothesis Testing with Random Costs and Action Cancellation
Link: https://arxiv.org/abs/2512.19067
Authors: George Vershinin,Asaf Cohen,Omer Gurewitz
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures
Abstract:We study a variant of cost-aware sequential hypothesis testing in which a single active Decision Maker (DM) selects actions with positive, random costs to identify the true hypothesis under an average error constraint, while minimizing the expected total cost. The DM may abort an in-progress action, yielding no sample, by truncating its realized cost at a smaller, tunable deterministic limit, which we term a per-action deadline. We analyze how this cancellation option can be exploited under two cost-revelation models: ex-post, where the cost is revealed only after the sample is obtained, and ex-ante, where the cost accrues before sample acquisition. In the ex-post model, per-action deadlines do not affect the expected total cost, and the cost-error tradeoffs coincide with the baseline obtained by replacing deterministic costs with cost means. In the ex-ante model, we show how per-action deadlines inflate the expected number of times actions are applied, and that the resulting expected total cost can be reduced to the constant-cost setting by introducing an effective per-action cost. We characterize when deadlines are beneficial and study several families in detail.
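The idea of an effective per-action cost in the ex-ante model can be made concrete with a small calculation. The abstract does not give the paper's formula; the sketch below uses a standard renewal argument under two assumptions of its own: costs are i.i.d. across retries, and an action whose cost equals the deadline still completes. Under those assumptions the number of attempts per sample is geometric, so the expected cost per obtained sample is E[min(C, d)] / P(C <= d). The function name `effective_cost` is hypothetical.

```python
def effective_cost(cost_pmf, deadline):
    """Expected cost paid per obtained sample when an action is aborted
    (and retried) whenever its cost would exceed `deadline`.

    cost_pmf maps each possible cost to its probability.
    Assumes i.i.d. costs across retries; an illustration, not the paper's result.
    """
    p_complete = sum(p for c, p in cost_pmf.items() if c <= deadline)
    if p_complete <= 0:
        raise ValueError("deadline is below every possible cost")
    # Each attempt costs min(C, deadline) whether or not it completes.
    expected_paid = sum(min(c, deadline) * p for c, p in cost_pmf.items())
    # Geometric number of attempts with success probability p_complete.
    return expected_paid / p_complete
```

For a cost that is 1 or 10 with equal probability, a deadline of 2 gives an effective cost of 3.0 against a mean cost of 5.5, which shows how a deadline can help when the cost distribution has a heavy upper end.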
[LG-34] Efficient Personalization of Generative Models via Optimal Experimental Design
Link: https://arxiv.org/abs/2512.19057
Authors: Guy Schacht,Ziyad Sheebaelhamd,Riccardo De Santi,Mojmír Mutný,Andreas Krause
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:Preference learning from human feedback has the ability to align generative models with the needs of end-users. Human feedback is costly and time-consuming to obtain, which creates demand for data-efficient query selection methods. This work presents a novel approach that leverages optimal experimental design to ask humans the most informative preference queries, from which we can elucidate the latent reward function modeling user preferences efficiently. We formulate the problem of preference query selection as the one that maximizes the information about the underlying latent preference model. We show that this problem has a convex optimization formulation, and introduce a statistically and computationally efficient algorithm ED-PBRL that is supported by theoretical guarantees and can efficiently construct structured queries such as images or text. We empirically present the proposed framework by personalizing a text-to-image generative model to user-specific styles, showing that it requires less preference queries compared to random query selection.
[LG-35] Time-series Forecast for Indoor Zone Air Temperature with Long Horizons: A Case Study with Sensor-based Data from a Smart Building
Link: https://arxiv.org/abs/2512.19038
Authors: Liping Sun,Yucheng Guo,Siliang Lu,Zhenzhen Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:With the press of global climate change, extreme weather and sudden weather changes are becoming increasingly common. To maintain a comfortable indoor environment and minimize the contribution of the building to climate change as much as possible, higher requirements are placed on the operation and control of HVAC systems, e.g., more energy-efficient and flexible to response to the rapid change of weather. This places demands on the rapid modeling and prediction of zone air temperatures of buildings. Compared to the traditional simulation-based approach such as EnergyPlus and DOE2, a hybrid approach combined physics and data-driven is more suitable. Recently, the availability of high-quality datasets and algorithmic breakthroughs have driven a considerable amount of work in this field. However, in the niche of short- and long-term predictions, there are still some gaps in existing research. This paper aims to develop a time series forecast model to predict the zone air temperature in a building located in America on a 2-week horizon. The findings could be further improved to support intelligent control and operation of HVAC systems (i.e. demand flexibility) and could also be used as hybrid building energy modeling.
[LG-36] Elevating Intrusion Detection and Security Fortification in Intelligent Networks through Cutting-Edge Machine Learning Paradigms
Link: https://arxiv.org/abs/2512.19037
Authors: Md Minhazul Islam Munna,Md Mahbubur Rahman,Jaroslav Frnda,Muhammad Shahid Anwar,Alpamis Kutlimuratov
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The proliferation of IoT devices and their reliance on Wi-Fi networks have introduced significant security vulnerabilities, particularly the KRACK and Kr00k attacks, which exploit weaknesses in WPA2 encryption to intercept and manipulate sensitive data. Traditional IDS using classifiers face challenges such as model overfitting, incomplete feature extraction, and high false positive rates, limiting their effectiveness in real-world deployments. To address these challenges, this study proposes a robust multiclass machine learning based intrusion detection framework. The methodology integrates advanced feature selection techniques to identify critical attributes, mitigating redundancy and enhancing detection accuracy. Two distinct ML architectures are implemented: a baseline classifier pipeline and a stacked ensemble model combining noise injection, Principal Component Analysis (PCA), and meta learning to improve generalization and reduce false positives. Evaluated on the AWID3 data set, the proposed ensemble architecture achieves superior performance, with an accuracy of 98%, precision of 98%, recall of 98%, and a false positive rate of just 2%, outperforming existing state-of-the-art methods. This work demonstrates the efficacy of combining preprocessing strategies with ensemble learning to fortify network security against sophisticated Wi-Fi attacks, offering a scalable and reliable solution for IoT environments. Future directions include real-time deployment and adversarial resilience testing to further enhance the model’s adaptability.
[LG-37] A Surrogate-Augmented Symbolic CFD-Driven Training Framework for Accelerating Multi-objective Physical Model Development
Link: https://arxiv.org/abs/2512.19031
Authors: Yuan Fang,Fabian Waschkowski,Maximilian Reissmann,Richard D. Sandberg,Takuo Oda,Koichi Tanimoto
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Computational Fluid Dynamics (CFD)-driven training combines machine learning (ML) with CFD solvers to develop physically consistent closure models with improved predictive accuracy. In the original framework, each ML-generated candidate model is embedded in a CFD solver and evaluated against reference data, requiring hundreds to thousands of high-fidelity simulations and resulting in prohibitive computational cost for complex flows. To overcome this limitation, we propose an extended framework that integrates surrogate modeling into symbolic CFD-driven training in real time to reduce training cost. The surrogate model learns to approximate the errors of ML-generated models based on previous CFD evaluations and is continuously refined during training. Newly generated models are first assessed using the surrogate, and only those predicted to yield small errors or high uncertainty are subsequently evaluated with full CFD simulations. Discrete expressions generated by symbolic regression are mapped into a continuous space using averaged input-symbol values as inputs to a probabilistic surrogate model. To support multi-objective model training, particularly when fixed weighting of competing quantities is challenging, the surrogate is extended to a multi-output formulation by generalizing the kernel to a matrix form, providing one mean and variance prediction per training objective. Selection metrics based on these probabilistic outputs are used to identify an optimal training setup. The proposed surrogate-augmented CFD-driven training framework is demonstrated across a range of statistically one- and two-dimensional flows, including both single- and multi-expression model optimization. In all cases, the framework substantially reduces training cost while maintaining predictive accuracy comparable to that of the original CFD-driven approach.
[LG-38] Optimizer Dynamics at the Edge of Stability with Differential Privacy
Link: https://arxiv.org/abs/2512.19019
Authors: Ayana Hussain,Ricky Fang
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 5 figures
Abstract:Deep learning models can reveal sensitive information about individual training examples, and while differential privacy (DP) provides guarantees restricting such leakage, it also alters optimization dynamics in poorly understood ways. We study the training dynamics of neural networks under DP by comparing Gradient Descent (GD), and Adam to their privacy-preserving variants. Prior work shows that these optimizers exhibit distinct stability dynamics: full-batch methods train at the Edge of Stability (EoS), while mini-batch and adaptive methods exhibit analogous edge-of-stability behavior. At these regimes, the training loss and the sharpness–the maximum eigenvalue of the training loss Hessian–exhibit certain characteristic behavior. In DP training, per-example gradient clipping and Gaussian noise modify the update rule, and it is unclear whether these stability patterns persist. We analyze how clipping and noise change sharpness and loss evolution and show that while DP generally reduces the sharpness and can prevent optimizers from fully reaching the classical stability thresholds, patterns from EoS and analogous adaptive methods stability regimes persist, with the largest learning rates and largest privacy budgets approaching, and sometimes exceeding, these thresholds. These findings highlight the unpredictability introduced by DP in neural network optimization.
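The two modifications to the update rule mentioned above, per-example gradient clipping and Gaussian noise, can be sketched as follows. This is the commonly used DP-SGD style aggregation, written with the standard library only; the abstract does not specify the authors' exact implementation, and the names `dp_gradient`, `clip_norm`, and `noise_multiplier` are illustrative.

```python
import math
import random


def dp_gradient(per_example_grads, clip_norm, noise_multiplier, rng=random):
    """Average of per-example gradients after L2 clipping plus Gaussian noise.

    Each gradient is rescaled so its L2 norm is at most clip_norm, the clipped
    gradients are summed, noise with standard deviation
    noise_multiplier * clip_norm is added per coordinate, and the result is
    divided by the batch size.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j, x in enumerate(g):
            total[j] += scale * x
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]
```

Clipping bounds the influence of any single example on the update, which is one intuition for why DP training may lower sharpness relative to the non-private optimizer.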
[LG-39] OPBO: Order-Preserving Bayesian Optimization
Link: https://arxiv.org/abs/2512.18980
Authors: Wei Peng,Jianchen Hu,Kang Liu,Qiaozhu Zhai
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 13 pages
Abstract:Bayesian optimization is an effective method for solving expensive black-box optimization problems. Most existing methods use Gaussian processes (GP) as the surrogate model for approximating the black-box objective function, it is well-known that it can fail in high-dimensional space (e.g., dimension over 500). We argue that the reliance of GP on precise numerical fitting is fundamentally ill-suited in high-dimensional space, where it leads to prohibitive computational complexity. In order to address this, we propose a simple order-preserving Bayesian optimization (OPBO) method, where the surrogate model preserves the order, instead of the value, of the black-box objective function. Then we can use a simple but effective OP neural network (NN) to replace GP as the surrogate model. Moreover, instead of searching for the best solution from the acquisition model, we select good-enough solutions in the ordinal set to reduce computational cost. The experimental results show that for high-dimensional (over 500) black-box optimization problems, the proposed OPBO significantly outperforms traditional BO methods based on regression NN and GP. The source code is available at this https URL.
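A surrogate that preserves order rather than value only has to rank candidate points correctly. One common way to train such a model is a pairwise hinge loss; whether OPBO uses this particular loss is not stated in the abstract, so the sketch below is an assumption, and `pairwise_hinge_loss` is a hypothetical name.

```python
def pairwise_hinge_loss(scores, targets, margin=1.0):
    """Mean hinge loss over all pairs whose target values differ.

    The loss is zero when the surrogate scores order every pair the same way
    as the true objective values, with a gap of at least `margin`.
    """
    total, pairs = 0.0, 0
    for i in range(len(targets)):
        for j in range(i + 1, len(targets)):
            if targets[i] == targets[j]:
                continue  # ties carry no ordering information
            sign = 1.0 if targets[i] > targets[j] else -1.0
            total += max(0.0, margin - sign * (scores[i] - scores[j]))
            pairs += 1
    return total / pairs if pairs else 0.0
```

Because the loss depends only on score differences within pairs, scores such as `[0, 5, 100]` fit targets `[1, 2, 3]` perfectly even though the values are far apart, which is the sense in which order is fitted instead of value.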
[LG-40] Outlier detection in mixed-attribute data: a semi-supervised approach with fuzzy approximations and relative entropy
Link: https://arxiv.org/abs/2512.18978
Authors: Baiyang Chen,Zhong Yuan,Zheng Liu,Dezhong Peng,Yongxiang Li,Chang Liu,Guiduo Duan
Subjects: Machine Learning (cs.LG)
Comments: Author’s accepted manuscript
Abstract:Outlier detection is a critical task in data mining, aimed at identifying objects that significantly deviate from the norm. Semi-supervised methods improve detection performance by leveraging partially labeled data but typically overlook the uncertainty and heterogeneity of real-world mixed-attribute data. This paper introduces a semi-supervised outlier detection method, namely fuzzy rough sets-based outlier detection (FROD), to effectively handle these challenges. Specifically, we first utilize a small subset of labeled data to construct fuzzy decision systems, through which we introduce the attribute classification accuracy based on fuzzy approximations to evaluate the contribution of attribute sets in outlier detection. Unlabeled data is then used to compute fuzzy relative entropy, which provides a characterization of outliers from the perspective of uncertainty. Finally, we develop the detection algorithm by combining attribute classification accuracy with fuzzy relative entropy. Experimental results on 16 public datasets show that FROD is comparable with or better than leading detection algorithms. All datasets and source codes are accessible at this https URL. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at this https URL
[LG-41] Consistency-guided semi-supervised outlier detection in heterogeneous data using fuzzy rough sets
Link: https://arxiv.org/abs/2512.18977
Authors: Baiyang Chen,Zhong Yuan,Dezhong Peng,Xiaoliang Chen,Hongmei Chen
Subjects: Machine Learning (cs.LG)
Comments: Author’s Accepted Manuscript
Abstract:Outlier detection aims to find samples that behave differently from the majority of the data. Semi-supervised detection methods can utilize the supervision of partial labels, thus reducing false positive rates. However, most of the current semi-supervised methods focus on numerical data and neglect the heterogeneity of data information. In this paper, we propose a consistency-guided outlier detection algorithm (COD) for heterogeneous data with the fuzzy rough set theory in a semi-supervised manner. First, a few labeled outliers are leveraged to construct label-informed fuzzy similarity relations. Next, the consistency of the fuzzy decision system is introduced to evaluate attributes’ contributions to knowledge classification. Subsequently, we define the outlier factor based on the fuzzy similarity class and predict outliers by integrating the classification consistency and the outlier factor. The proposed algorithm is extensively evaluated on 15 freshly proposed datasets. Experimental results demonstrate that COD is better than or comparable with the leading outlier detectors. This manuscript is the accepted author version of a paper published by Elsevier. The final published version is available at this https URL
[LG-42] Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling
Link: https://arxiv.org/abs/2512.18965
Authors: Sutashu Tomonaga,Kenji Doya,Noboru Murata
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Structured State Space Models (SSMs), which are at the heart of the recently popular Mamba architecture, are powerful tools for sequence modeling. However, their theoretical foundation relies on a complex, multi-stage process of continuous-time modeling and subsequent discretization, which can obscure intuition. We introduce a direct, first-principles framework for constructing discrete-time SSMs that is both flexible and modular. Our approach is based on a novel lag operator, which geometrically derives the discrete-time recurrence by measuring how the system’s basis functions “slide” and change from one timestep to the next. The resulting state matrices are computed via a single inner product involving this operator, offering a modular design space for creating novel SSMs by flexibly combining different basis functions and time-warping schemes. To validate our approach, we demonstrate that a specific instance exactly recovers the recurrence of the influential HiPPO model. Numerical simulations confirm our derivation, providing new theoretical tools for designing flexible and robust sequence models.
[LG-43] Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation
Link: https://arxiv.org/abs/2512.18957
Authors: Debamita Ghosh,George K. Atia,Yue Wang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge, such as access to a generative model or a large offline dataset, and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.
[LG-44] Learning Through Little Eyes: Attribute Discrimination Beyond Objects
Link: https://arxiv.org/abs/2512.18951
Authors: Patrick Batsell,Tsutsui Satoshi,Bihan Wen
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Infants learn to recognize not only object categories but also fine grained attributes such as color, size, and texture within their first two years of life. Prior work explores Childs View for Contrastive Learning (CVCL), a CLIP style model trained on infant egocentric video as a computational model of early infant learning, but it focuses only on class level recognition. This leaves it unclear whether infant scale learning also supports attribute discrimination. To address this, we introduce a benchmark that systematically varies color, size, and texture, allowing controlled tests of within class attribute recognition. Comparing CVCL with CLIP shows clear differences. CVCL is better at size discrimination, while CLIP achieves higher accuracy on color discrimination. Both models represent texture in image embeddings but fail to ground texture linguistically, suggesting a gap between visual and language spaces.
[LG-45] DPSR: Differentially Private Sparse Reconstruction via Multi-Stage Denoising for Recommender Systems
Link: https://arxiv.org/abs/2512.18932
Authors: Sarwan Ali
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Differential privacy (DP) has emerged as the gold standard for protecting user data in recommender systems, but existing privacy-preserving mechanisms face a fundamental challenge: the privacy-utility tradeoff inevitably degrades recommendation quality as privacy budgets tighten. We introduce DPSR (Differentially Private Sparse Reconstruction), a novel three-stage denoising framework that fundamentally addresses this limitation by exploiting the inherent structure of rating matrices – sparsity, low-rank properties, and collaborative patterns. DPSR consists of three synergistic stages: (1) information-theoretic noise calibration that adaptively reduces noise for high-information ratings, (2) collaborative filtering-based denoising that leverages item-item similarities to remove privacy noise, and (3) low-rank matrix completion that exploits latent structure for signal recovery. Critically, all denoising operations occur after noise injection, preserving differential privacy through the post-processing immunity theorem while removing both privacy-induced and inherent data noise. Through extensive experiments on synthetic datasets with controlled ground truth, we demonstrate that DPSR achieves 5.57% to 9.23% RMSE improvement over state-of-the-art Laplace and Gaussian mechanisms across privacy budgets ranging from ε = 0.1 to ε = 10.0 (all improvements statistically significant with p < 0.05, most p < 0.001). Remarkably, at ε = 1.0, DPSR achieves RMSE of 0.9823, outperforming even the non-private baseline (1.0983), demonstrating that our denoising pipeline acts as an effective regularizer that removes data noise in addition to privacy noise.
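The principle that denoising after noise injection keeps the privacy guarantee can be shown with the simplest possible post-processing step. The sketch below is not the DPSR pipeline: it applies the Laplace mechanism to each rating and then clamps the result to the valid range, which is a data-independent function of the noisy output. It assumes ratings in [1, 5] and spends ε on each rating separately; composition across a user's ratings is out of scope. The names `laplace_noise` and `privatize_ratings` are illustrative.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one Laplace(0, scale) sample as the difference of two
    exponential samples with rate 1 / scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_ratings(ratings, epsilon, low=1.0, high=5.0, rng=random):
    """Laplace mechanism on each rating followed by clamping to [low, high].

    The noise scale is sensitivity / epsilon, where the sensitivity of one
    rating is high - low. Clamping only touches the noisy values, so it is
    post-processing and does not weaken the guarantee of the noise step.
    """
    scale = (high - low) / epsilon
    noisy = [r + laplace_noise(scale, rng) for r in ratings]
    return [min(high, max(low, x)) for x in noisy]
```

Any further denoiser that reads only the noisy ratings, such as a similarity-based smoother or a low-rank fit, falls under the same post-processing argument.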
[LG-46] The Ensemble Schrödinger Bridge filter for Nonlinear Data Assimilation
Link: https://arxiv.org/abs/2512.18928
Authors: Feng Bao,Hui Sun
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:This work puts forward a novel nonlinear optimal filter, namely the Ensemble Schrödinger Bridge nonlinear filter. The proposed filter finds marriage of the standard prediction procedure and the diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach finds no structural model error, and it is derivative free, training free and highly parallelizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimension up to 40 or above under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the Particle filter in numerous tests given different levels of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.
[LG-47] Merging of Kolmogorov-Arnold networks trained on disjoint datasets
Link: https://arxiv.org/abs/2512.18921
Authors: Andrew Polar,Michael Poluektov
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Training on disjoint datasets can serve two primary goals: accelerating data processing and enabling federated learning. It has already been established that Kolmogorov-Arnold networks (KANs) are particularly well suited for federated learning and can be merged through simple parameter averaging. While the federated learning literature has mostly focused on achieving training convergence across distributed nodes, the present paper specifically targets acceleration of the training, which depends critically on the choice of an optimisation method and the type of the basis functions. To the best knowledge of the authors, the fastest currently-available combination is the Newton-Kaczmarz method and the piecewise-linear basis functions. Here, it is shown that training on disjoint datasets (or disjoint subsets of the training dataset) can further improve the performance. Experimental comparisons are provided, and all corresponding codes are publicly available.
[LG-48] Generative Modeling through Spectral Analysis of Koopman Operator
Link: https://arxiv.org/abs/2512.18837
Authors: Yuanchao Xu,Fengyi Li,Masahiro Fujisawa,Youssef Marzouk,Isao Ishikawa
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:
Abstract:We propose Koopman Spectral Wasserstein Gradient Descent (KSWGD), a generative modeling framework that combines operator-theoretic spectral analysis with optimal transport. The novel insight is that the spectral structure required for accelerated Wasserstein gradient descent can be directly estimated from trajectory data via Koopman operator approximation which can eliminate the need for explicit knowledge of the target potential or neural network training. We provide rigorous convergence analysis and establish connection to Feynman-Kac theory that clarifies the method’s probabilistic foundation. Experiments across diverse settings, including compact manifold sampling, metastable multi-well systems, image generation, and high dimensional stochastic partial differential equation, demonstrate that KSWGD consistently achieves faster convergence than other existing methods while maintaining high sample quality.
[LG-49] Label-Informed Outlier Detection Based on Granule Density
Link: https://arxiv.org/abs/2512.18774
Authors: Baiyang Chen,Zhong Yuan,Dezhong Peng,Hongmei Chen,Xiaomin Song,Huiming Zheng
Subjects: Machine Learning (cs.LG)
Comments: Author’s Accepted Manuscript
Abstract:Outlier detection, crucial for identifying unusual patterns with significant implications across numerous applications, has drawn considerable research interest. Existing semi-supervised methods typically treat data as purely numerical and in a deterministic manner, thereby neglecting the heterogeneity and uncertainty inherent in complex, real-world datasets. This paper introduces a label-informed outlier detection method for heterogeneous data based on Granular Computing and Fuzzy Sets, namely Granule Density-based Outlier Factor (GDOF). Specifically, GDOF first employs label-informed fuzzy granulation to effectively represent various data types and develops granule density for precise density estimation. Subsequently, granule densities from individual attributes are integrated for outlier scoring by assessing attribute relevance with a limited number of labeled outliers. Experimental results on various real-world datasets show that GDOF stands out in detecting outliers in heterogeneous data with a minimal number of labeled outliers. The integration of Fuzzy Sets and Granular Computing in GDOF offers a practical framework for outlier detection in complex and diverse data types. All relevant datasets and source codes are publicly available for further research. This is the author’s accepted manuscript of a paper published in IEEE Transactions on Fuzzy Systems. The final version is available at this https URL
[LG-50] Gaussian-Mixture-Model Q-Functions for Policy Iteration in Reinforcement Learning
Link: https://arxiv.org/abs/2512.18763
Authors: Minh Vu,Konstantinos Slavakis
Subjects: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Unlike their conventional use as estimators of probability density functions in reinforcement learning (RL), this paper introduces a novel function-approximation role for Gaussian mixture models (GMMs) as direct surrogates for Q-function losses. These parametric models, termed GMM-QFs, possess substantial representational capacity, as they are shown to be universal approximators over a broad class of functions. They are further embedded within Bellman residuals, where their learnable parameters – a fixed number of mixing weights, together with Gaussian mean vectors and covariance matrices – are inferred from data via optimization on a Riemannian manifold. This geometric perspective on the parameter space naturally incorporates Riemannian optimization into the policy-evaluation step of standard policy-iteration frameworks. Rigorous theoretical results are established, and supporting numerical tests show that, even without access to experience data, GMM-QFs deliver competitive performance and, in some cases, outperform state-of-the-art approaches across a range of benchmark RL tasks, all while maintaining a significantly smaller computational footprint than deep-learning methods that rely on experience data.
[LG-51] Is Your Conditional Diffusion Model Actually Denoising?
Link: https://arxiv.org/abs/2512.18736
Authors: Daniel Pfrommer,Zehao Dou,Christopher Scarvelis,Max Simchowitz,Ali Jadbabaie
Subjects: Machine Learning (cs.LG)
Comments: 41 pages, 14 figures, published in Neural Information Processing Systems 2025
Abstract:We study the inductive biases of diffusion models with a conditioning-variable, which have seen widespread application as both text-conditioned generative image models and observation-conditioned continuous control policies. We observe that when these models are queried conditionally, their generations consistently deviate from the idealized “denoising” process upon which diffusion models are formulated, inducing disagreement between popular sampling algorithms (e.g. DDPM, DDIM). We introduce Schedule Deviation, a rigorous measure which captures the rate of deviation from a standard denoising process, and provide a methodology to compute it. Crucially, we demonstrate that the deviation from an idealized denoising process occurs irrespective of the model capacity or amount of training data. We posit that this phenomenon occurs due to the difficulty of bridging distinct denoising flows across different parts of the conditioning space and show theoretically how such a phenomenon can arise through an inductive bias towards smoothness.
[LG-52] A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models
Link: https://arxiv.org/abs/2512.18730
Authors: Zhiquan Tan,Yinrong Hong
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) trained via KL-regularized reinforcement learning demonstrate strong instruction following, self-correction, and reasoning abilities. Yet their theoretical underpinnings remain limited. We exploit the closed-form energy-based model (EBM) structure of the optimal KL-regularized policy to provide a unified variational analysis of LLMs. For instruction-tuned models, under natural assumptions on reward potentials and pretraining symmetry, we prove that the transition kernel satisfies detailed balance with respect to a scalar potential encoding response quality. This yields monotonic KL convergence to a high-quality stationary distribution, bounded hitting times to superior states, and exponential mixing governed by the spectral gap. For reasoning models trained with verifiable rewards (RLVR), we show the objective is equivalent to expected KL minimization toward an optimal reasoning distribution, with the suboptimality gap reducing to the Bernoulli KL between target and current accuracies along the natural gradient flow. This helps explain empirical entropy-accuracy trade-offs.
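The closed-form structure referred to above is the well-known solution of the KL-regularized objective: maximizing expected reward minus β times KL(π || π_ref) gives π*(y) proportional to π_ref(y) · exp(r(y) / β). The sketch below evaluates this for a finite set of responses and adds a Bernoulli KL helper. The argument order of the Bernoulli KL used in the paper is not given in the abstract, so the order here is an assumption, and both function names are illustrative.

```python
import math


def kl_regularized_policy(ref_probs, rewards, beta):
    """Return pi*(y) proportional to ref_probs[y] * exp(rewards[y] / beta)."""
    m = max(rewards)  # shift rewards for numerical stability
    weights = [p * math.exp((r - m) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < q < 1."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)
```

Small β concentrates the policy on high-reward responses, while large β keeps it close to the reference policy, which is one way to read the entropy-accuracy trade-off.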
[LG-53] ML Inference Scheduling with Predictable Latency
Link: https://arxiv.org/abs/2512.18725
Authors: Haidong Zhao,Nikolaos Georgantas
Subjects: Machine Learning (cs.LG)
Comments: Accepted at MAIoT@Middleware 2025
Abstract:Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. To this end, we evaluate the potential limitations of existing interference prediction approaches and outline our ongoing work toward achieving efficient ML inference scheduling.
[LG-54] Generating Risky Samples with Conformity Constraints via Diffusion Models
Link: https://arxiv.org/abs/2512.18722
Authors: Han Yu,Hao Zou,Xingxuan Zhang,Zhengyi Wang,Yue He,Kehan Li,Peng Cui
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Although neural networks achieve promising performance in many tasks, they may still fail when encountering some examples and bring about risks to applications. To discover risky samples, previous literature attempts to search for patterns of risky samples within existing datasets or inject perturbation into them. Yet in this way the diversity of risky samples is limited by the coverage of existing datasets. To overcome this limitation, recent works adopt diffusion models to produce new risky samples beyond the coverage of existing datasets. However, these methods struggle to maintain conformity between generated samples and expected categories, which could introduce label noise and severely limit their effectiveness in applications. To address this issue, we propose RiskyDiff, which incorporates the embeddings of both texts and images as implicit constraints of category conformity. We also design a conformity score to further explicitly strengthen the category conformity, and introduce the mechanisms of embedding screening and risky gradient guidance to boost the risk of generated samples. Extensive experiments reveal that RiskyDiff greatly outperforms existing methods in terms of the degree of risk, generation quality, and conformity with conditioned categories. We also empirically show that the generalization ability of the models can be enhanced by augmenting training data with generated samples of high conformity.
[LG-55] Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis
链接: https://arxiv.org/abs/2512.18699
作者: Pengchao Feng,Yao Xiao,Ziyang Ma,Zhikang Niu,Shuai Fan,Yao Li,Sheng Wang,Xie Chen
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in text-to-speech (TTS) have yielded remarkable improvements in naturalness and intelligibility. Building on these achievements, research has increasingly shifted toward enhancing the expressiveness of generated speech, such as dialectal and emotional TTS. However, cross-style synthesis combining both dialect and emotion remains challenging and largely unexplored, mainly due to the scarcity of dialectal data with emotional labels. To address this, we propose Hierarchical Expressive Vector (HE-Vector), a two-stage method for Emotional Dialectal TTS. In the first stage, we construct different task vectors to model dialectal and emotional styles independently, and then enhance single-style synthesis by adjusting their weights, a method we refer to as Expressive Vector (E-Vector). For the second stage, we hierarchically integrate these vectors to achieve controllable emotionally expressive dialect synthesis without requiring jointly labeled data, corresponding to Hierarchical Expressive Vector (HE-Vector). Experimental results demonstrate that HE-Vectors achieve superior performance in dialect synthesis, and promising results in synthesizing emotionally expressive dialectal speech in a zero-shot setting.
[LG-56] Improving Pattern Recognition of Scheduling Anomalies through Structure-Aware and Semantically-Enhanced Graphs
链接: https://arxiv.org/abs/2512.18673
作者: Ning Lyu,Junjie Jiang,Lu Chang,Chihui Shao,Feng Chen,Chong Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes a structure-aware driven scheduling graph modeling method to improve the accuracy and representation capability of anomaly identification in scheduling behaviors of complex systems. The method first designs a structure-guided scheduling graph construction mechanism that integrates task execution stages, resource node states, and scheduling path information to build dynamically evolving scheduling behavior graphs, enhancing the model’s ability to capture global scheduling relationships. On this basis, a multi-scale graph semantic aggregation module is introduced to achieve semantic consistency modeling of scheduling features through local adjacency semantic integration and global topology alignment, thereby strengthening the model’s capability to capture abnormal features in complex scenarios such as multi-task concurrency, resource competition, and stage transitions. Experiments are conducted on a real scheduling dataset with multiple scheduling disturbance paths set to simulate different types of anomalies, including structural shifts, resource changes, and task delays. The proposed model demonstrates significant performance advantages across multiple metrics, showing a sensitive response to structural disturbances and semantic shifts. Further visualization analysis reveals that, under the combined effect of structure guidance and semantic aggregation, the scheduling behavior graph exhibits stronger anomaly separability and pattern representation, validating the effectiveness and adaptability of the method in scheduling anomaly detection tasks.
[LG-57] Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments
链接: https://arxiv.org/abs/2512.18670
作者: Xue Yang,Michael Schukat,Junlin Lu,Patrick Mannion,Karl Mason,Enda Howley
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) excels in various applications but struggles in dynamic environments where the underlying Markov decision process evolves. Continual reinforcement learning (CRL) enables RL agents to continually learn and adapt to new tasks, but balancing stability (preserving prior knowledge) and plasticity (acquiring new knowledge) remains challenging. Existing methods primarily address the stability-plasticity dilemma through mechanisms where past knowledge influences optimization but rarely affects the agent’s behavior directly, which may hinder effective knowledge reuse and efficient learning. In contrast, we propose demonstration-guided continual reinforcement learning (DGCRL), which stores prior knowledge in an external, self-evolving demonstration repository that directly guides RL exploration and adaptation. For each task, the agent dynamically selects the most relevant demonstration and follows a curriculum-based strategy to accelerate learning, gradually shifting from demonstration-guided exploration to fully self-exploration. Extensive experiments on 2D navigation and MuJoCo locomotion tasks demonstrate its superior average performance, enhanced knowledge transfer, mitigation of forgetting, and training efficiency. The additional sensitivity analysis and ablation study further validate its effectiveness.
[LG-58] From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers NEURIPS2025
链接: https://arxiv.org/abs/2512.18634
作者: Ryotaro Kawata,Yujin Song,Alberto Bietti,Naoki Nishikawa,Taiji Suzuki,Samuel Vaiter,Denny Wu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025
Abstract:Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work, we study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Focusing on a minimal trigger-output prediction task – copying the token immediately following a special trigger upon its second occurrence – we present a rigorous analysis of gradient-based training of a single-layer transformer. In both the infinite and finite sample regimes, we prove a transition in the learned mechanism: if input sequences exhibit sufficient diversity, measured by a low “max-sum” ratio of trigger-to-trigger distances, the trained model implements an induction head and generalizes to unseen contexts; by contrast, when this ratio is large, the model resorts to a positional shortcut and fails to generalize out-of-distribution (OOD). We also reveal a trade-off between the pretraining context length and OOD generalization, and derive the optimal pretraining distribution that minimizes computational cost per sample. Finally, we validate our theoretical predictions with controlled synthetic experiments, demonstrating that broadening context distributions robustly induces induction heads and enables OOD generalization. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.
[LG-59] The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss
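A minimal sketch of the trigger-output task described in the abstract above, assuming a hypothetical token id `TRIGGER = 0` and a layout in which the second trigger is the final token of the sequence. The names `make_example`, `induction_rule`, and `positional_shortcut` are illustrative and are not taken from the paper; the sketch only contrasts the two behaviors the abstract names and does not implement the "max-sum" ratio, which the abstract does not define.

```python
import random

TRIGGER = 0  # hypothetical id reserved for the special trigger token


def make_example(length, vocab_size, rng, first_pos=None):
    """Build one sequence whose last token is the second trigger.

    The target is the token that followed the first trigger.
    """
    if length < 3:
        raise ValueError("length must be at least 3")
    # Ordinary tokens are drawn from 1..vocab_size-1 so they never equal TRIGGER.
    tokens = [rng.randrange(1, vocab_size) for _ in range(length)]
    if first_pos is None:
        first_pos = rng.randrange(0, length - 2)  # leave room for the copied token
    tokens[first_pos] = TRIGGER
    tokens[-1] = TRIGGER
    return tokens, tokens[first_pos + 1]


def induction_rule(tokens):
    """Generalizable rule: find the earlier trigger and copy its successor."""
    return tokens[tokens.index(TRIGGER) + 1]


def positional_shortcut(tokens, memorized_pos):
    """Shortcut: always read a fixed position, ignoring where the trigger is."""
    return tokens[memorized_pos]


if __name__ == "__main__":
    rng = random.Random(0)
    # Low diversity: the first trigger always sits at position 2.
    fixed = [make_example(12, 50, rng, first_pos=2) for _ in range(200)]
    # High diversity: the first trigger position varies from example to example.
    varied = [make_example(12, 50, rng) for _ in range(200)]
    for name, data in (("fixed", fixed), ("varied", varied)):
        ind = sum(induction_rule(x) == y for x, y in data) / len(data)
        pos = sum(positional_shortcut(x, 3) == y for x, y in data) / len(data)
        print(name, "induction:", ind, "shortcut:", pos)
```

On the fixed-position data both rules recover the target, so the data alone cannot tell them apart; once the trigger position varies, only the induction rule stays correct on every example.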
链接: https://arxiv.org/abs/2512.18610
作者: Rongyao Cai,Yuxi Wan,Kexin Zhang,Ming Jin,Hao Wang,Zhiqiang Ge,Daoyi Dong,Yong Liu,Qingsong Wen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Optimizing time series models via point-wise loss functions (e.g., MSE) relies on a flawed point-wise independent and identically distributed (i.i.d.) assumption that disregards the causal temporal structure, an issue with growing awareness yet lacking formal theoretical grounding. Focusing on the core independence issue under covariance stationarity, this paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB), formalizing it information-theoretically as the discrepancy between the true joint distribution and its flawed i.i.d. counterpart. Our analysis reveals a fundamental paradigm paradox: the more deterministic and structured the time series, the more severe the bias induced by the point-wise loss function. We derive the first closed-form quantification for the non-deterministic EOB across linear and non-linear systems, and prove EOB is an intrinsic data property, governed exclusively by sequence length and our proposed Structural Signal-to-Noise Ratio (SSNR). This theoretical diagnosis motivates our principled debiasing program that eliminates the bias through sequence length reduction and structural orthogonalization. We present a concrete solution that simultaneously achieves both principles via DFT or DWT. Furthermore, a novel harmonized $\ell_p$ norm framework is proposed to rectify gradient pathologies of high-variance series. Extensive experiments validate EOB Theory’s generality and the superior performance of the debiasing program.
[LG-60] Trajectory Planning for UAV-Based Smart Farming Using Imitation-Based Triple Deep Q-Learning
链接: https://arxiv.org/abs/2512.18604
作者: Wencan Mao,Quanxi Zhou,Tomas Couso Coddou,Manabu Tsukada,Yunling Liu,Yusheng Ji
类目: Machine Learning (cs.LG)
*备注:
Abstract:Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, capable of simultaneously performing weed detection, recognition, and data collection from wireless sensors. However, trajectory planning for UAV-based smart agriculture is challenging due to the high uncertainty of the environment, partial observations, and limited battery capacity of UAVs. To address these issues, we formulate the trajectory planning problem as a Markov decision process (MDP) and leverage multi-agent reinforcement learning (MARL) to solve it. Furthermore, we propose a novel imitation-based triple deep Q-network (ITDQN) algorithm, which employs an elite imitation mechanism to reduce exploration costs and utilizes a mediator Q-network over a double deep Q-network (DDQN) to accelerate and stabilize training and improve performance. Experimental results in both simulated and real-world environments demonstrate the effectiveness of our solution. Moreover, our proposed ITDQN outperforms DDQN by 4.43% in weed recognition rate and 6.94% in data collection rate.
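For readers unfamiliar with the DDQN baseline named in the abstract above, the following is a minimal sketch of the standard double deep Q-network bootstrap target: the online network selects the next action and the target network evaluates it. It does not implement ITDQN's mediator Q-network or elite imitation mechanism, whose details are not given in the abstract; the function name and the list-based Q-value inputs are assumptions made for illustration.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double DQN target for a single transition.

    next_q_online: Q-values of the next state from the online network.
    next_q_target: Q-values of the next state from the target network.
    """
    if done:
        # Terminal transition: there is no future value to bootstrap from.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    # The online network prefers action 1, so the target network's value for
    # action 1 is used, even though its own maximum is at action 0.
    print(double_dqn_target(1.0, [0.1, 0.9], [0.5, 0.2], gamma=0.5))
```

Decoupling selection from evaluation is what distinguishes this target from the plain DQN target, which takes the maximum of the target network's own Q-values.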
[LG-61] EIA-SEC: Improved Actor-Critic Framework for Multi-UAV Collaborative Control in Smart Agriculture
链接: https://arxiv.org/abs/2512.18596
作者: Quanxi Zhou,Wencan Mao,Yilei Liang,Manabu Tsukada,Yunling Liu,Jon Crowcroft
类目: Machine Learning (cs.LG)
*备注:
Abstract:The widespread application of wireless communication technology has promoted the development of smart agriculture, where unmanned aerial vehicles (UAVs) play a multifunctional role. We target a multi-UAV smart agriculture system where UAVs cooperatively perform data collection, image acquisition, and communication tasks. In this context, we model a Markov decision process to solve the multi-UAV trajectory planning problem. Moreover, we propose a novel Elite Imitation Actor-Shared Ensemble Critic (EIA-SEC) framework, where agents adaptively learn from the elite agent to reduce trial-and-error costs, and a shared ensemble critic collaborates with each agent’s local critic to ensure unbiased objective value estimates and prevent overestimation. Experimental results demonstrate that EIA-SEC outperforms state-of-the-art baselines in terms of reward performance, training stability, and convergence speed.
[LG-62] Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows
链接: https://arxiv.org/abs/2512.18595
作者: Runze Mao,Rui Zhang,Xuan Bai,Tianhao Wu,Teng Zhang,Zhenyi Chen,Minqi Lin,Bocheng Zeng,Yangchen Xu,Yingxuan Xiang,Haoze Zhang,Shubham Goswami,Pierre A. Dawe,Yifan Xu,Zhenhua An,Mengtao Yan,Xiaoyi Lu,Yi Wang,Rongbo Bai,Haobu Gao,Xiaohang Fang,Han Li,Hao Sun,Zhi X. Chen
类目: Machine Learning (cs.LG)
*备注: 52 pages, 20 figures. Code and data available at this https URL. Companion website and leaderboard at this https URL
Abstract:Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an “illusion of mastery”, as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail to expose the models’ inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.
[LG-63] SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models
链接: https://arxiv.org/abs/2512.18583
作者: Pengcheng Li,Qiang Fang,Tong Zhao,Yixing Lan,Xin Xu
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at this https URL.
[LG-64] Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment
链接: https://arxiv.org/abs/2512.18566
作者: Ruiqi Chen(1),Giacomo Vedovati(2),Todd Braver(3),ShiNung Ching(2) ((1) Division of Biology and Biomedical Sciences, Washington University in St. Louis, (2) Department of Electrical and Systems Engineering, Washington University in St. Louis, (3) Department of Psychological and Brain Sciences, Washington University in St. Louis)
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Neurons and Cognition (q-bio.NC)
*备注: 57 pages, 18 figures. For associated code, see this https URL
Abstract:Dynamical systems models such as recurrent neural networks (RNNs) are increasingly popular in theoretical neuroscience for hypothesis-generation and data analysis. Evaluating the dynamics in such models is key to understanding their learned generative mechanisms. However, such evaluation is impeded by two major challenges: First, comparison of learned dynamics across models is difficult because there is no enforced equivalence of their coordinate systems. Second, identification of mechanistically important low-dimensional motifs (e.g., limit sets) is intractable in high-dimensional nonlinear models such as RNNs. Here, we propose a comprehensive framework to address these two issues, termed Diffeomorphic vector field alignment FOR learned Models (DFORM). DFORM learns a nonlinear coordinate transformation between the state spaces of two dynamical systems, which aligns their trajectories in a maximally one-to-one manner. In so doing, DFORM enables an assessment of whether two models exhibit topological equivalence, i.e., similar mechanisms despite differences in coordinate systems. A byproduct of this method is a means to locate dynamical motifs on low-dimensional manifolds embedded within higher-dimensional systems. We verified DFORM’s ability to identify linear and nonlinear coordinate transformations using canonical topologically equivalent systems, RNNs, and systems related by nonlinear flows. DFORM was also shown to provide a quantification of similarity between topologically distinct systems. We then demonstrated that DFORM can locate important dynamical motifs including invariant manifolds and saddle limit sets within high-dimensional models. Finally, using a set of RNN models trained on human functional MRI (fMRI) recordings, we illustrated that DFORM can identify limit cycles from high-dimensional data-driven models, which agreed well with prior numerical analysis.
[LG-65] Scaling up Stability: Reinforcement Learning for Distributed Control of Networked Systems in the Space of Stabilizing Policies
链接: https://arxiv.org/abs/2512.18540
作者: John Cao,Luca Furieri
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study distributed control of networked systems through reinforcement learning, where neural policies must be simultaneously scalable, expressive and stabilizing. We introduce a policy parameterization that embeds Graph Neural Networks (GNNs) into a Youla-like magnitude-direction parameterization, yielding distributed stochastic controllers that guarantee network-level closed-loop stability by design. The magnitude is implemented as a stable operator consisting of a GNN acting on disturbance feedback, while the direction is a GNN acting on local observations. We prove robustness of the closed loop to perturbations in both the graph topology and model parameters, and show how to integrate our parameterization with Proximal Policy Optimization. Experiments on a multi-agent navigation task show that policies trained on small networks transfer directly to larger networks and unseen topologies, and achieve higher returns and lower variance than a state-of-the-art MARL baseline while preserving stability.
[LG-66] Feature-Enhanced Graph Neural Networks for Classification of Synthetic Graph Generative Models: A Benchmarking Study
链接: https://arxiv.org/abs/2512.18524
作者: Janek Dyer,Jagdeep Ahluwalia,Javad Zarrin
类目: Machine Learning (cs.LG)
*备注: This is a preprint version of a manuscript currently under review at The Journal of Supercomputing (Springer)
Abstract:The ability to discriminate between generative graph models is critical to understanding complex structural patterns in both synthetic graphs and the real-world structures that they emulate. While Graph Neural Networks (GNNs) have seen increasing use to great effect in graph classification tasks, few studies explore their integration with interpretable graph theoretic features. This paper investigates the classification of synthetic graph families using a hybrid approach that combines GNNs with engineered graph-theoretic features. We generate a large and structurally diverse synthetic dataset comprising graphs from five representative generative families: Erdős-Rényi, Watts-Strogatz, Barabási-Albert, Holme-Kim, and Stochastic Block Model. These graphs range in size up to $1 \times 10^4$ nodes, containing up to $1.1 \times 10^5$ edges. A comprehensive range of node and graph level features is extracted for each graph and pruned using a Random Forest based feature selection pipeline. The features are integrated into six GNN architectures: GCN, GAT, GATv2, GIN, GraphSAGE and GTN. Each architecture is optimised for hyperparameter selection using Optuna. Finally, models were compared against a baseline Support Vector Machine (SVM) trained solely on the handcrafted features. Our evaluation demonstrates that GraphSAGE and GTN achieve the highest classification performance, with 98.5% accuracy, and strong class separation evidenced by t-SNE and UMAP visualisations. GCN and GIN also performed well, while GAT-based models lagged due to limitations in their ability to capture global structures. The SVM baseline confirmed the importance of the message passing functionality for performance gains and meaningful class separation.
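To make the idea of synthetic graph families and handcrafted graph-level features concrete, here is a small pure-Python sketch covering two of the five families named in the abstract above. The study's actual generation and feature pipeline is not specified in the abstract, so the function names, the construction of the preferential-attachment graph, and the choice of degree statistics as features are all assumptions made for illustration.

```python
import random
from statistics import mean, pvariance


def erdos_renyi(n, p, rng):
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


def barabasi_albert(n, m, rng):
    """Preferential attachment: each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    edges = []
    targets = list(range(m))  # the first new node connects to the m seed nodes
    endpoints = []            # every node appears here once per incident edge
    for new in range(m, n):
        edges.extend((t, new) for t in targets)
        endpoints.extend(targets)
        endpoints.extend([new] * m)
        chosen = set()
        while len(chosen) < m:  # degree-proportional sampling without repeats
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges


def degree_features(n, edges):
    """Simple graph-level features: mean, variance, and maximum of the degrees."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return {"mean": mean(deg), "var": pvariance(deg), "max": max(deg)}


if __name__ == "__main__":
    rng = random.Random(0)
    n, m = 500, 3
    ba = barabasi_albert(n, m, rng)
    # Match the expected edge density of the Erdős-Rényi graph to the BA graph.
    p = len(ba) / (n * (n - 1) / 2)
    er = erdos_renyi(n, p, rng)
    print("BA:", degree_features(n, ba))
    print("ER:", degree_features(n, er))
```

At matched density, the degree variance and maximum degree tend to be larger for the preferential-attachment graph than for the Erdős-Rényi graph, which is the kind of interpretable signal a feature-based classifier can exploit.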
[LG-67] APC-GNN: An Adaptive Patient-Centric GNN with Context-Aware Attention and Mini-Graph Explainability for Diabetes Classification
链接: https://arxiv.org/abs/2512.18473
作者: Khaled Berkani
类目: Machine Learning (cs.LG)
*备注: 17 pages, 2 figures, 5 tables
Abstract:We propose APC-GNN++, an adaptive patient-centric Graph Neural Network for diabetes classification. Our model integrates context-aware edge attention, confidence-guided blending of node features and graph representations, and neighborhood consistency regularization to better capture clinically meaningful relationships between patients. To handle unseen patients, we introduce a mini-graph approach that leverages the nearest neighbors of the new patient, enabling real-time explainable predictions without retraining the global model. We evaluate APC-GNN++ on a real-world diabetes dataset collected from a regional hospital in Algeria and show that it outperforms traditional machine learning models (MLP, Random Forest, XGBoost) and a vanilla GCN, achieving higher test accuracy and macro F1-score. The analysis of node-level confidence scores further reveals how the model balances self-information and graph-based evidence across different patient groups, providing interpretable patient-centric insights. The system is also embedded in a Tkinter-based graphical user interface (GUI) for interactive use by healthcare professionals.
[LG-68] The Geometry of Abstraction: Continual Learning via Recursive Quotienting
链接: https://arxiv.org/abs/2512.18471
作者: Xin Li
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Continual learning systems operating in fixed-dimensional spaces face a fundamental geometric barrier: the flat manifold problem. When experience is represented as a linear trajectory in Euclidean space, the geodesic distance between temporal events grows linearly with time, forcing the required covering number to diverge. In fixed-dimensional hardware, this volume expansion inevitably forces trajectory overlap, manifesting as catastrophic interference. In this work, we propose a geometric resolution to this paradox based on Recursive Metric Contraction. We formalize abstraction not as symbolic grouping, but as a topological deformation: a quotient map that collapses the metric tensor within validated temporal neighborhoods, effectively driving the diameter of local sub-manifolds to zero. We substantiate our framework with four rigorous results. First, the Bounded Capacity Theorem establishes that recursive quotient maps allow the embedding of arbitrarily long trajectories into bounded representational volumes, trading linear metric growth for logarithmic topological depth. Second, the Topological Collapse Separability Theorem, derived via Urysohn’s Lemma, proves that recursive quotienting renders non-linearly separable temporal sequences linearly separable in the limit, bypassing the need for infinite-dimensional kernel projections. Third, the Parity-Partitioned Stability Theorem solves the catastrophic forgetting problem by proving that if the state space is partitioned into orthogonal flow and scaffold manifolds, the metric deformations of active learning do not disturb the stability of stored memories. Our analysis reveals that tokens in neural architectures are physically realizable as singularities or wormholes, regions of extreme positive curvature that bridge distant points in the temporal manifold.
[LG-69] Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs
链接: https://arxiv.org/abs/2512.18454
作者: David Graber,Victor Armegioiu,Rebecca Buller,Siddhartha Mishra
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Predictive machine learning models generally excel on in-distribution data, but their performance degrades on out-of-distribution (OOD) inputs. Reliable deployment therefore requires robust OOD detection, yet this is particularly challenging for irregular 3D graphs that combine continuous geometry with categorical identities and are unordered by construction. Here, we present a probabilistic OOD detection framework for complex 3D graph data built on a diffusion model that learns a density of the training distribution in a fully unsupervised manner. A key ingredient we introduce is a unified continuous diffusion over both 3D coordinates and discrete features: categorical identities are embedded in a continuous space and trained with cross-entropy, while the corresponding diffusion score is obtained analytically via posterior-mean interpolation from predicted class probabilities. This yields a single self-consistent probability-flow ODE (PF-ODE) that produces per-sample log-likelihoods, providing a principled typicality score for distribution shift. We validate the approach on protein-ligand complexes and construct strict OOD datasets by withholding entire protein families from training. PF-ODE likelihoods identify held-out families as OOD and correlate strongly with prediction errors of an independent binding-affinity model (GEMS), enabling a priori reliability estimates on new complexes. Beyond scalar likelihoods, we show that multi-scale PF-ODE trajectory statistics - including path tortuosity, flow stiffness, and vector-field instability - provide complementary OOD information. Modeling the joint distribution of these trajectory features yields a practical, high-sensitivity detector that improves separation over likelihood-only baselines, offering a label-free OOD quantification workflow for geometric deep learning.
[LG-70] On the Universality of Transformer Architectures; How Much Attention Is Enough?
链接: https://arxiv.org/abs/2512.18445
作者: Amirreza Abbasi,Mohsen Hooshmand
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformers are crucial across many AI fields, such as large language models, computer vision, and reinforcement learning. This prominence stems from the architecture’s perceived universality and scalability compared to alternatives. This work examines the problem of universality in Transformers, reviews recent progress, including architectural refinements such as structural minimality and approximation rates, and surveys state-of-the-art advances that inform both theoretical and practical understanding. Our aim is to clarify what is currently known about Transformers’ expressiveness, separate robust guarantees from fragile ones, and identify key directions for future theoretical research.
[LG-71] MoE Pathfinder: Trajectory-driven Expert Pruning
链接: https://arxiv.org/abs/2512.18425
作者: Xican Yang,Yuanhe Tian,Yan Song
类目: Machine Learning (cs.LG)
*备注: 12 pages, 3 figures
Abstract:Mixture-of-experts (MoE) architectures used in large language models (LLMs) achieve state-of-the-art performance across diverse tasks yet face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has thus emerged as a promising solution to reduce computational overhead and simplify the deployment of MoE models. However, existing expert pruning approaches conventionally rely on local importance metrics and often apply uniform layer-wise pruning, leveraging only partial evaluation signals and overlooking the heterogeneous contributions of experts across layers. To address these limitations, we propose an expert pruning approach based on the trajectory of activated experts across layers, which treats MoE as a weighted computation graph and casts expert selection as a global optimal path planning problem. Within this framework, we integrate complementary importance signals from reconstruction error, routing probabilities, and activation strength at the trajectory level, which naturally yields non-uniform expert retention across layers. Experiments show that our approach achieves superior pruning performance on nearly all tasks compared with most existing approaches.
[LG-72] Why Most Optimism Bandit Algorithms Have the Same Regret Analysis: A Simple Unifying Theorem
链接: https://arxiv.org/abs/2512.18409
作者: Vikram Krishnamurthy
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:
Abstract:Several optimism-based stochastic bandit algorithms – including UCB, UCB-V, linear UCB, and finite-arm GP-UCB – achieve logarithmic regret using proofs that, despite superficial differences, follow essentially the same structure. This note isolates the minimal ingredients behind these analyses: a single high-probability concentration condition on the estimators, after which logarithmic regret follows from two short deterministic lemmas describing radius collapse and optimism-forced deviations. The framework yields unified, near-minimal proofs for these classical algorithms and extends naturally to many contemporary bandit variants.
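As a concrete instance of the optimism principle discussed in the abstract above, here is a minimal sketch of the classical UCB1 index policy on Bernoulli arms. It illustrates the "empirical mean plus confidence radius" structure that the listed algorithms share; it does not reproduce the note's unifying theorem or its lemmas. The function name `ucb1`, the Bernoulli reward model, and the pseudo-regret bookkeeping are assumptions made for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms.

    Returns the number of pulls per arm and the cumulative pseudo-regret,
    i.e. the sum over rounds of (best mean - mean of the arm pulled).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            # Optimistic index: empirical mean plus a confidence radius that
            # shrinks as an arm is pulled more often.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.8], horizon=5000)
    print("pulls per arm:", counts)
    print("pseudo-regret:", round(regret, 1))
```

The radius term is what the abstract calls the quantity that "collapses": once a suboptimal arm has been pulled often enough, its index drops below that of the best arm and it is rarely selected again.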
[LG-73] The Challenger: When Do New Data Sources Justify Switching Machine Learning Models?
链接: https://arxiv.org/abs/2512.18390
作者: Vassilis Digalakis Jr,Christophe Pérignon,Sébastien Saurin,Flore Sentenac
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the problem of deciding whether, and when, an organization should replace a trained incumbent model with a challenger relying on newly available features. We develop a unified economic and statistical framework that links learning-curve dynamics, data-acquisition and retraining costs, and discounting of future gains. First, we characterize the optimal switching time in stylized settings and derive closed-form expressions that quantify how horizon length, learning-curve curvature, and cost differentials shape the optimal decision. Second, we propose three practical algorithms: a one-shot baseline, a greedy sequential method, and a look-ahead sequential method. Using a real-world credit-scoring dataset with gradually arriving alternative data, we show that (i) optimal switching times vary systematically with cost parameters and learning-curve behavior, and (ii) the look-ahead sequential method outperforms other methods and is able to approach in value an oracle with full foresight. Finally, we establish finite-sample guarantees, including conditions under which the sequential look-ahead method achieves sublinear regret relative to that oracle. Our results provide an operational blueprint for economically sound model transitions as new data sources become available.
[LG-74] Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale
链接: https://arxiv.org/abs/2512.18373
作者: Ansh Nagwekar
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Master’s Thesis at the University of Pennsylvania
Abstract:Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models, order-of-magnitude reductions in training time, and improved interpretability into how networks learn. While stochastic gradient descent (SGD) and its variants have become the de facto standard for training deep networks, their success in these over-parameterized regimes often appears more empirical than principled. This thesis investigates this apparent paradox by tracing the evolution of optimization algorithms from classical first-order methods to modern higher-order techniques, revealing how principled algorithmic design can demystify the training process. Starting from first principles with SGD and adaptive gradient methods, the analysis progressively uncovers the limitations of these conventional approaches when confronted with anisotropy that is representative of real-world data. These breakdowns motivate the exploration of sophisticated alternatives rooted in curvature information: second-order approximation techniques, layer-wise preconditioning, adaptive learning rates, and more. Next, the interplay between these optimization algorithms and the broader neural network training toolkit, which includes prior and recent developments such as maximal update parametrization, learning rate schedules, and exponential moving averages, emerges as equally essential to empirical success. To bridge the gap between theoretical understanding and practical deployment, this paper offers practical prescriptions and implementation strategies for integrating these methods into modern deep learning workflows.
[LG-75] FedSUM Family: Efficient Federated Learning Methods under Arbitrary Client Participation
Link: https://arxiv.org/abs/2512.18275
Authors: Runze You, Shi Pu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Federated Learning (FL) methods are often designed for specific client participation patterns, limiting their applicability in practical deployments. We introduce the FedSUM family of algorithms, which supports arbitrary client participation without additional assumptions on data heterogeneity. Our framework models participation variability with two delay metrics, the maximum delay $\tau_{\max}$ and the average delay $\tau_{\text{avg}}$. The FedSUM family comprises three variants: FedSUM-B (basic version), FedSUM (standard version), and FedSUM-CR (communication-reduced version). We provide unified convergence guarantees demonstrating the effectiveness of our approach across diverse participation patterns, thereby broadening the applicability of FL in real-world scenarios.
[LG-76] LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform
Link: https://arxiv.org/abs/2512.18266
Authors: Lizhi Ma, Yi-Xiang Hu, Yuke Wang, Yifang Zhao, Yihui Ren, Jian-Xiang Liao, Feng Wu, Xiang-Yang Li
Subjects: Machine Learning (cs.LG)
Comments: The 11th International Conference on Big Data Computing and Communications
Abstract:With the rapid advancements in big data technologies, the Databricks platform has become a cornerstone for enterprises and research institutions, offering high computational efficiency and a robust ecosystem. However, managing the escalating operational costs associated with job execution remains a critical challenge. Existing solutions rely on static configurations or reactive adjustments, which fail to adapt to the dynamic nature of workloads. To address this, we introduce LeJOT, an intelligent job cost orchestration framework that leverages machine learning for execution time prediction and a solver-based optimization model for real-time resource allocation. Unlike conventional scheduling techniques, LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs while ensuring performance requirements are met. Experimental results on real-world Databricks workloads demonstrate that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe, outperforming traditional static allocation strategies. Our approach provides a scalable and adaptive solution for cost-efficient job scheduling in Data Lakehouse environments.
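The LeJOT abstract pairs a runtime predictor with an optimization step that picks resources at minimum cost under a performance requirement. The sketch below illustrates that idea only. The cluster configurations, prices, the Amdahl-style `predict_runtime_hours` stand-in for a learned predictor, and the brute-force search standing in for a solver are all assumptions made for illustration, not LeJOT's actual components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    name: str
    workers: int
    price_per_worker_hour: float  # hypothetical price in dollars


def predict_runtime_hours(serial_hours, parallel_hours, workers):
    """Toy stand-in for a learned execution-time predictor (Amdahl-style scaling)."""
    return serial_hours + parallel_hours / workers


def cheapest_config(configs, serial_hours, parallel_hours, deadline_hours):
    """Return (config, cost, runtime) with the lowest cost that meets the deadline, or None."""
    best = None
    for cfg in configs:
        runtime = predict_runtime_hours(serial_hours, parallel_hours, cfg.workers)
        if runtime > deadline_hours:
            continue  # violates the performance requirement
        cost = runtime * cfg.workers * cfg.price_per_worker_hour
        if best is None or cost < best[1]:
            best = (cfg, cost, runtime)
    return best


# Hypothetical configurations and job profile.
CONFIGS = [
    ClusterConfig("small", 2, 0.5),
    ClusterConfig("medium", 4, 0.5),
    ClusterConfig("large", 8, 0.5),
]
choice = cheapest_config(CONFIGS, serial_hours=0.5, parallel_hours=4.0, deadline_hours=2.0)
infeasible = cheapest_config(CONFIGS, serial_hours=0.5, parallel_hours=4.0, deadline_hours=0.4)
```

Here the small cluster misses the deadline, and the medium cluster is cheaper than the large one, so it is selected. When no configuration meets the deadline the function returns `None`.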
[LG-77] On the Convergence Rate of LoRA Gradient Descent
Link: https://arxiv.org/abs/2512.18248
Authors: Siqiao Mu, Diego Klabjan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two "adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the original LoRA gradient descent algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the "Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate $O\left(\frac{1}{\log T}\right)$, where $T$ is the number of iterations.
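To make the object of the analysis concrete, here is a minimal sketch of plain gradient descent on two LoRA adapter matrices for a toy least-squares objective. The objective, matrix sizes, deterministic initialization, and step size are assumptions chosen so the example is easy to follow. It does not reproduce the paper's setting or demonstrate its $O(1/\log T)$ rate.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def lora_gd(W0, target, B, A, step_size, num_steps):
    """Gradient descent on f(B, A) = 0.5 * ||W0 + B A - target||_F^2 with W0 frozen."""
    losses = []
    for _ in range(num_steps):
        BA = matmul(B, A)
        # Residual R = W0 + B A - target.
        R = [[w + d - t for w, d, t in zip(wr, dr, tr)]
             for wr, dr, tr in zip(W0, BA, target)]
        losses.append(0.5 * sum(r * r for row in R for r in row))
        grad_B = matmul(R, transpose(A))  # df/dB = R A^T
        grad_A = matmul(transpose(B), R)  # df/dA = B^T R
        B = [[b - step_size * g for b, g in zip(br, gr)] for br, gr in zip(B, grad_B)]
        A = [[a - step_size * g for a, g in zip(ar, gr)] for ar, gr in zip(A, grad_A)]
    return B, A, losses


# Toy problem (assumed): the target differs from W0 by a rank-1 update.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = [[2.0, 1.0, 1.0], [2.0, 3.0, 2.0], [0.0, 0.0, 1.0]]
B_init = [[0.0], [0.0], [0.0]]   # B starts at zero, as is common for LoRA
A_init = [[0.5, 0.5, 0.5]]       # deterministic stand-in for a random init
B, A, losses = lora_gd(W0, target, B_init, A_init, step_size=0.02, num_steps=500)
```

Only `B` and `A` are updated, so the trainable parameter count is `r * (m + n)` rather than `m * n`. The product `B A` makes the objective non-convex in the adapters, and its gradient is not globally Lipschitz, which is the difficulty the abstract points to.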
[LG-78] AutoSchA: Automatic Hierarchical Music Representations via Multi-Relational Node Isolation
Link: https://arxiv.org/abs/2512.18232
Authors: Stephen Ni-Hahn, Rico Zhu, Jerry Yin, Yue Jiang, Cynthia Rudin, Simon Mak
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments:
Abstract:Hierarchical representations provide powerful and principled approaches for analyzing many musical genres. Such representations have been broadly studied in music theory, for instance via Schenkerian analysis (SchA). Hierarchical music analyses, however, are highly cost-intensive; the analysis of a single piece of music requires a great deal of time and effort from trained experts. The representation of hierarchical analyses in a computer-readable format is a further challenge. Given recent developments in hierarchical deep learning and increasing quantities of computer-readable data, there is great promise in extending such work for an automatic hierarchical representation framework. This paper thus introduces a novel approach, AutoSchA, which extends recent developments in graph neural networks (GNNs) for hierarchical music analysis. AutoSchA features three key contributions: 1) a new graph learning framework for hierarchical music representation, 2) a new graph pooling mechanism based on node isolation that directly optimizes learned pooling assignments, and 3) a state-of-the-art architecture that integrates such developments for automatic hierarchical music analysis. We show, in a suite of experiments, that AutoSchA performs comparably to human experts when analyzing Baroque fugue subjects.
[LG-79] Dimensionality Reduction Considered Harmful (Some of the Time)
Link: https://arxiv.org/abs/2512.18230
Authors: Hyeon Jeon
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: PhD Dissertation
Abstract:Visual analytics now plays a central role in decision-making across diverse disciplines, but it can be unreliable: the knowledge or insights derived from the analysis may not accurately reflect the underlying data. In this dissertation, we improve the reliability of visual analytics with a focus on dimensionality reduction (DR). DR techniques enable visual analysis of high-dimensional data by reducing it to two or three dimensions, but they inherently introduce errors that can compromise the reliability of visual analytics. To this end, I investigate reliability challenges that practitioners face when using DR for visual analytics. Then, I propose technical solutions to address these challenges, including new evaluation metrics, optimization strategies, and interaction techniques. We conclude the thesis by discussing how our contributions lay the foundation for achieving more reliable visual analytics practices.
[LG-80] Toward Efficient Testing of Graph Neural Networks via Test Input Prioritization
Link: https://arxiv.org/abs/2512.18228
Authors: Lichen Yang, Qiang Wang, Zhonghao Yang, Daojing He, Yu Li
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: This is the author-accepted manuscript of a paper published in Automated Software Engineering Journal
Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable efficacy in handling graph-structured data; however, they exhibit failures after deployment, which can cause severe consequences. Hence, conducting thorough testing before deployment becomes imperative to ensure the reliability of GNNs. However, thorough testing requires numerous manually annotated test data. To mitigate the annotation cost, strategically prioritizing and labeling high-quality unlabeled inputs for testing becomes crucial, which facilitates uncovering more model failures with a limited labeling budget. Unfortunately, existing test input prioritization techniques either overlook the valuable information contained in graph structures or are overly reliant on attributes extracted from the target model, i.e., model-aware attributes, whose quality can vary significantly. To address these issues, we propose a novel test input prioritization framework, named GraphRank, for GNNs. GraphRank introduces model-agnostic attributes to compensate for the limitations of the model-aware ones. It also leverages the graph structure information to aggregate attributes from neighboring nodes, thereby enhancing the model-aware and model-agnostic attributes. Furthermore, GraphRank combines the above attributes with a binary classifier, using it as a ranking model to prioritize inputs. This classifier undergoes iterative training, which enables it to learn from each round’s feedback and improve its performance accordingly. Extensive experiments demonstrate GraphRank’s superiority over existing techniques.
[LG-81] FairExpand: Individual Fairness on Graphs with Partial Similarity Information
Link: https://arxiv.org/abs/2512.18180
Authors: Rebecca Salganik, Yibin Wang, Guillaume Salha-Galvan, Jian Kang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Individual fairness, which requires that similar individuals should be treated similarly by algorithmic systems, has become a central principle in fair machine learning. Individual fairness has garnered traction in graph representation learning due to its practical importance in high-stakes Web areas such as user modeling, recommender systems, and search. However, existing methods assume the existence of predefined similarity information over all node pairs, an often unrealistic requirement that prevents their operationalization in practice. In this paper, we assume the similarity information is only available for a limited subset of node pairs and introduce FairExpand, a flexible framework that promotes individual fairness in this more realistic partial information scenario. FairExpand follows a two-step pipeline that alternates between refining node representations using a backbone model (e.g., a graph neural network) and gradually propagating similarity information, which allows fairness enforcement to effectively expand to the entire graph. Extensive experiments show that FairExpand consistently enhances individual fairness while preserving performance, making it a practical solution for enabling graph-based individual fairness in real-world applications with partial similarity information.
[LG-82] Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation
Link: https://arxiv.org/abs/2512.18174
Authors: Lena Libon, Meghana Bhange, Rushabh Solanki, Elliot Creager, Ulrich Aïvodji
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments:
Abstract:The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that “reason” using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users’ personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.
[LG-83] Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs
Link: https://arxiv.org/abs/2512.18134
Authors: Rupanshu Soi, Rohan Yadav, Fredrik Kjolstad, Alex Aiken, Maryam Mehri Dehnavi, Michael Garland, Michael Bauer
Subjects: Programming Languages (cs.PL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:
Abstract:GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.
[LG-84] TraCeR: Transformer-Based Competing Risk Analysis with Longitudinal Covariates
Link: https://arxiv.org/abs/2512.18129
Authors: Maxmillan Ries, Sohan Seth
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Survival analysis is a critical tool for modeling time-to-event data. Recent deep learning-based models have reduced various modeling assumptions including proportional hazard and linearity. However, a persistent challenge remains in incorporating longitudinal covariates, with prior work largely focusing on cross-sectional features, and in assessing calibration of these models, with research primarily focusing on discrimination during evaluation. We introduce TraCeR, a transformer-based survival analysis framework for incorporating longitudinal covariates. Based on a factorized self-attention architecture, TraCeR estimates the hazard function from a sequence of measurements, naturally capturing temporal covariate interactions without assumptions about the underlying data-generating process. The framework is inherently designed to handle censored data and competing events. Experiments on multiple real-world datasets demonstrate that TraCeR achieves substantial and statistically significant performance improvements over state-of-the-art methods. Furthermore, our evaluation extends beyond discrimination metrics and assesses model calibration, addressing a key oversight in literature.
[LG-85] Learning Generalizable Neural Operators for Inverse Problems
Link: https://arxiv.org/abs/2512.18120
Authors: Adam J. Thorpe, Stepan Tretiakov, Dibakar Roy Sarkar, Krishna Kumar, Ufuk Topcu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Inverse problems challenge existing neural operator architectures because ill-posed inverse maps violate continuity, uniqueness, and stability assumptions. We introduce B2B$^{-1}$, an inverse basis-to-basis neural operator framework that addresses this limitation. Our key innovation is to decouple function representation from the inverse map. We learn neural basis functions for the input and output spaces, then train inverse models that operate on the resulting coefficient space. This structure allows us to learn deterministic, invertible, and probabilistic models within a single framework, and to choose models based on the degree of ill-posedness. We evaluate our approach on six inverse PDE benchmarks, including two novel datasets, and compare against existing invertible neural operator baselines. We learn probabilistic models that capture uncertainty and input variability, and remain robust to measurement noise due to implicit denoising in the coefficient calculation. Our results show consistent re-simulation performance across varying levels of ill-posedness. By separating representation from inversion, our framework enables scalable surrogate models for inverse problems that generalize across instances, domains, and degrees of ill-posedness.
[LG-86] Microstructure-based Variational Neural Networks for Robust Uncertainty Quantification in Materials Digital Twins
Link: https://arxiv.org/abs/2512.18104
Authors: Andreas E. Robertson, Samuel B. Inman, Ashley T. Lenau, Ricardo A. Lebensohn, Dongil Shin, Brad L. Boyce, Remi M. Dingreville
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 43 pages, 9 figures
Abstract:Aleatoric uncertainties - irremovable variability in microstructure morphology, constituent behavior, and processing conditions - pose a major challenge to developing uncertainty-robust digital twins. We introduce the Variational Deep Material Network (VDMN), a physics-informed surrogate model that enables efficient and probabilistic forward and inverse predictions of material behavior. The VDMN captures microstructure-induced variability by embedding variational distributions within its hierarchical, mechanistic architecture. Using an analytic propagation scheme based on Taylor-series expansion and automatic differentiation, the VDMN efficiently propagates uncertainty through the network during training and prediction. We demonstrate its capabilities in two digital-twin-driven applications: (1) as an uncertainty-aware materials digital twin, it predicts and experimentally validates the nonlinear mechanical variability in additively manufactured polymer composites; and (2) as an inverse calibration engine, it disentangles and quantitatively identifies overlapping sources of uncertainty in constituent properties. Together, these results establish the VDMN as a foundation for uncertainty-robust materials digital twins.
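The abstract mentions an analytic propagation scheme based on Taylor-series expansion and automatic differentiation but does not give its equations. As a generic illustration only (not the VDMN's actual formulation), first-order Taylor propagation of an input $x$ with mean $\mu$ and covariance $\Sigma$ through a differentiable map $f$ takes the following form. The symbols $f$, $\mu$, $\Sigma$, and $J$ are introduced here for illustration.

```latex
\mathbb{E}[f(x)] \approx f(\mu), \qquad
\operatorname{Cov}[f(x)] \approx J(\mu)\,\Sigma\,J(\mu)^{\top}, \qquad
J(\mu) = \left.\frac{\partial f}{\partial x}\right|_{x=\mu}.
```

In such a scheme, automatic differentiation would supply the Jacobian $J(\mu)$ at each layer, so uncertainty can be pushed through the network without sampling.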
[LG-87] From Coverage to Causes: Data-Centric Fuzzing for JavaScript Engines
Link: https://arxiv.org/abs/2512.18102
Authors: Kishan Kumar Ganguly, Tim Menzies
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Context: Exhaustive fuzzing of modern JavaScript engines is infeasible due to the vast number of program states and execution paths. Coverage-guided fuzzers waste effort on low-risk inputs, often ignoring vulnerability-triggering ones that do not increase coverage. Existing heuristics proposed to mitigate this require expert effort, are brittle, and hard to adapt. Objective: We propose a data-centric, LLM-boosted alternative that learns from historical vulnerabilities to automatically identify minimal static (code) and dynamic (runtime) features for detecting high-risk inputs. Method: Guided by historical V8 bugs, iterative prompting generated 115 static and 49 dynamic features, with the latter requiring only five trace flags, minimizing instrumentation cost. After feature selection, 41 features remained to train an XGBoost model to predict high-risk inputs during fuzzing. Results: Combining static and dynamic features yields over 85% precision and under 1% false alarms. Only 25% of these features are needed for comparable performance, showing that most of the search space is irrelevant. Conclusion: This work introduces feature-guided fuzzing, an automated data-driven approach that replaces coverage with data-directed inference, guiding fuzzers toward high-risk states for faster, targeted, and reproducible vulnerability discovery. To support open science, all scripts and data are available at this https URL . 
[LG-88] Graph-based Nearest Neighbors with Dynamic Updates via Random Walks
Link: https://arxiv.org/abs/2512.18060
Authors: Nina Mishra, Yonatan Naamad, Tal Wagner, Lichen Zhang
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 37 pages, 23 figures
Abstract:Approximate nearest neighbor search (ANN) is a common way to retrieve relevant search results, especially now in the context of large language models and retrieval augmented generation. One of the most widely used algorithms for ANN is based on constructing a multi-layer graph over the dataset, called the Hierarchical Navigable Small World (HNSW). While this algorithm supports insertion of new data, it does not support deletion of existing data. Moreover, deletion algorithms described by prior work come at the cost of increased query latency, decreased recall, or prolonged deletion time. In this paper, we propose a new theoretical framework for graph-based ANN based on random walks. We then utilize this framework to analyze a randomized deletion approach that preserves hitting time statistics compared to the graph before deleting the point. We then turn this theoretical framework into a deterministic deletion algorithm, and show that it provides better tradeoff between query latency, recall, deletion time, and memory usage through an extensive collection of experiments.
[LG-89] Approximation and learning with compositional tensor trains
Link: https://arxiv.org/abs/2512.18059
Authors: Martin Eigel, Charles Miranda, Anthony Nouy, David Sommer
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 37 pages, 6 figures
Abstract:We introduce compositional tensor trains (CTTs) for the approximation of multivariate functions, a class of models obtained by composing low-rank functions in the tensor-train format. This format can encode standard approximation tools, such as (sparse) polynomials, deep neural networks (DNNs) with fixed width, or tensor networks with arbitrary permutation of the inputs, or more general affine coordinate transformations, with similar complexities. This format can be viewed as a DNN with width exponential in the input dimension and structured weights matrices. Compared to DNNs, this format enables controlled compression at the layer level using efficient tensor algebra. On the optimization side, we derive a layerwise algorithm inspired by natural gradient descent, allowing to exploit efficient low-rank tensor algebra. This relies on low-rank estimations of Gram matrices, and tensor structured random sketching. Viewing the format as a discrete dynamical system, we also derive an optimization algorithm inspired by numerical methods in optimal control. Numerical experiments on regression tasks demonstrate the expressivity of the new format and the relevance of the proposed optimization algorithms. Overall, CTTs combine the expressivity of compositional models with the algorithmic efficiency of tensor algebra, offering a scalable alternative to standard deep neural networks.
[LG-90] Probabilistic Digital Twins of Users: Latent Representation Learning with Statistically Validated Semantics
Link: https://arxiv.org/abs/2512.18056
Authors: Daniel David
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 11 pages, 10 figures. Methodological paper on probabilistic user modeling and latent representation learning
Abstract:Understanding user identity and behavior is central to applications such as personalization, recommendation, and decision support. Most existing approaches rely on deterministic embeddings or black-box predictive models, offering limited uncertainty quantification and little insight into what latent representations encode. We propose a probabilistic digital twin framework in which each user is modeled as a latent stochastic state that generates observed behavioral data. The digital twin is learned via amortized variational inference, enabling scalable posterior estimation while retaining a fully probabilistic interpretation. We instantiate this framework using a variational autoencoder (VAE) applied to a user-response dataset designed to capture stable aspects of user identity. Beyond standard reconstruction-based evaluation, we introduce a statistically grounded interpretation pipeline that links latent dimensions to observable behavioral patterns. By analyzing users at the extremes of each latent dimension and validating differences using nonparametric hypothesis tests and effect sizes, we demonstrate that specific dimensions correspond to interpretable traits such as opinion strength and decisiveness. Empirically, we find that user structure is predominantly continuous rather than discretely clustered, with weak but meaningful structure emerging along a small number of dominant latent axes. These results suggest that probabilistic digital twins can provide interpretable, uncertainty-aware representations that go beyond deterministic user embeddings.
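The interpretation pipeline described above compares users at the extremes of a latent dimension using a nonparametric test and an effect size. A minimal sketch of that step follows. The choice of the Mann-Whitney U test with a normal approximation (no tie correction), Cliff's delta as the effect size, the 10% cutoff, and the synthetic data are all assumptions, since the abstract does not name the specific tests.

```python
import math


def extreme_groups(latent, behavior, fraction=0.1):
    """Split a behavioral score into the bottom and top `fraction` of users by latent value."""
    order = sorted(range(len(latent)), key=lambda i: latent[i])
    k = max(1, int(len(latent) * fraction))
    low = [behavior[i] for i in order[:k]]
    high = [behavior[i] for i in order[-k:]]
    return low, high


def mann_whitney_u(xs, ys):
    """U statistic of xs versus ys (ties count 1/2) and a two-sided normal-approximation p-value."""
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    n1, n2 = len(xs), len(ys)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


def cliffs_delta(u, n1, n2):
    """Effect size in [-1, 1]: P(x > y) - P(x < y)."""
    return 2.0 * u / (n1 * n2) - 1.0


# Synthetic data (assumed): a behavioral score that increases with the latent value.
latent = [i / 99.0 for i in range(100)]
behavior = [2.0 * z for z in latent]
low, high = extreme_groups(latent, behavior, fraction=0.1)
u_stat, p_value = mann_whitney_u(high, low)
delta = cliffs_delta(u_stat, len(high), len(low))
```

On real data the behavioral score would be an observed response statistic, and the test would be repeated per latent dimension, which calls for a multiple-comparison correction.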
[LG-91] Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models
Link: https://arxiv.org/abs/2512.18035
Authors: Wei Qian, Chenxu Zhao, Yangyi Li, Mengdi Huai
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:The rapid advancements in artificial intelligence (AI) have primarily focused on the process of learning from data to acquire knowledgeable learning systems. As these systems are increasingly deployed in critical areas, ensuring their privacy and alignment with human values is paramount. Recently, selective forgetting (also known as machine unlearning) has shown promise for privacy and data removal tasks, and has emerged as a transformative paradigm shift in the field of AI. It refers to the ability of a model to selectively erase the influence of previously seen data, which is especially important for compliance with modern data protection regulations and for aligning models with human values. Despite its promise, selective forgetting raises significant privacy concerns, especially when the data involved come from sensitive domains. While new unlearning-induced privacy attacks are continuously proposed, each is shown to outperform its predecessors using different experimental settings, which can lead to overly optimistic and potentially unfair assessments that may disproportionately favor one particular attack over the others. In this work, we present the first comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting. We extensively investigate privacy vulnerabilities of machine unlearning techniques and benchmark privacy leakage across a wide range of victim data, state-of-the-art unlearning privacy attacks, unlearning methods, and model architectures. We systematically evaluate and identify critical factors related to unlearning-induced privacy leakage. With our novel insights, we aim to provide a standardized tool for practitioners seeking to deploy customized unlearning applications with faithful privacy assessments.
[LG-92] FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability
Link: https://arxiv.org/abs/2512.17986
Authors: S M Ruhul Kabir Howlader, Xiao Chen, Yifei Xie, Lu Liu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Over the last few decades, machine learning (ML) and deep learning (DL) solutions have demonstrated their potential across many applications by leveraging large amounts of high-quality data. However, strict data-sharing regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) have prevented many data-driven applications from being realised. Federated Learning (FL), in which raw data never leaves local devices, has shown promise in overcoming these limitations. Although FL has grown rapidly in recent years, it still struggles with heterogeneity, which produces gradient noise, client-drift, and increased variance from partial client participation. In this paper, we propose FedOAED, a novel federated learning algorithm designed to mitigate client-drift arising from multiple local training updates and the variance induced by partial client participation. FedOAED incorporates an on-device autoencoder denoiser on the client side to mitigate client-drift and variance resulting from heterogeneous data under limited client availability. Experiments on multiple vision datasets under Non-IID settings demonstrate that FedOAED consistently outperforms state-of-the-art baselines.
[LG-93] MoE-TransMov: A Transformer-based Model for Next POI Prediction in Familiar & Unfamiliar Movements
Link: https://arxiv.org/abs/2512.17985
Authors: Ruichen Tan, Jiawei Xue, Kota Tsubouchi, Takahiro Yabe, Satish V. Ukkusuri
Subjects: Machine Learning (cs.LG)
Comments: 30 pages, 4 figures, 5 tables
Abstract:Accurate prediction of the next point of interest (POI) within human mobility trajectories is essential for location-based services, as it enables more timely and personalized recommendations. In particular, with the rise of these approaches, studies have shown that users exhibit different POI choices in their familiar and unfamiliar areas, highlighting the importance of incorporating user familiarity into predictive models. However, existing methods often fail to distinguish between the movements of users in familiar and unfamiliar regions. To address this, we propose MoE-TransMov, a Transformer-based model with a Mixture-of-Experts (MoE) architecture designed to use one framework to capture distinct mobility patterns across different moving contexts without requiring separate training for certain data. Using user-check-in data, we classify movements into familiar and unfamiliar categories and develop a specialized expert network to improve prediction accuracy. Our approach integrates self-attention mechanisms and adaptive gating networks to dynamically select the most relevant expert models for different mobility contexts. Experiments on two real-world datasets, including the widely used but small open-source Foursquare NYC dataset and the large-scale Kyoto dataset collected with LY Corporation (Yahoo Japan Corporation), show that MoE-TransMov outperforms state-of-the-art baselines with notable improvements in Top-1, Top-5, Top-10 accuracy, and mean reciprocal rank (MRR). Given the results, we find that by using this approach, we can efficiently improve mobility predictions under different moving contexts, thereby enhancing the personalization of recommendation systems and advancing various urban applications.
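The gating idea in the abstract, where a gating network selects the most relevant experts for a given mobility context, can be sketched generically as below. This is a plain softmax top-k gate, not MoE-TransMov's architecture. The single "familiarity" input feature, the gate weights, and the two fixed-output experts are hypothetical values chosen for illustration.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=1):
    """Route input x to the top_k experts and mix their outputs by renormalized gate weight."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    outputs = [experts[i](x) for i in chosen]
    mixed = [0.0] * len(outputs[0])
    for i, out in zip(chosen, outputs):
        weight = probs[i] / norm
        for j, value in enumerate(out):
            mixed[j] += weight * value
    return mixed, chosen


# Hypothetical experts returning scores over two candidate POIs.
def familiar_expert(x):
    return [0.9, 0.1]


def unfamiliar_expert(x):
    return [0.2, 0.8]


EXPERTS = [familiar_expert, unfamiliar_expert]
GATE_WEIGHTS = [[4.0], [-4.0]]  # one row per expert, one column per input feature

familiar_out, familiar_idx = moe_forward([1.0], GATE_WEIGHTS, EXPERTS, top_k=1)
unfamiliar_out, unfamiliar_idx = moe_forward([-1.0], GATE_WEIGHTS, EXPERTS, top_k=1)
blended_out, blended_idx = moe_forward([0.0], GATE_WEIGHTS, EXPERTS, top_k=2)
```

With `top_k=1` the gate acts as a hard switch between contexts. With `top_k=2` and an ambiguous input, both experts contribute equally.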
[LG-94] What's the Price of Monotonicity? A Multi-Dataset Benchmark of Monotone-Constrained Gradient Boosting for Credit PD
Link: https://arxiv.org/abs/2512.17945
Authors: Petr Koklev
Categories: Machine Learning (cs.LG); Risk Management (q-fin.RM); Statistical Finance (q-fin.ST)
*Comments: 56 pages. This version: December 2025. Includes multi-dataset benchmark results and diagnostic analyses; replication code and configuration files are available via the GitHub repository referenced in the paper
Abstract:Financial institutions face a trade-off between predictive accuracy and interpretability when deploying machine learning models for credit risk. Monotonicity constraints align model behavior with domain knowledge, but their performance cost - the price of monotonicity - is not well quantified. This paper benchmarks monotone-constrained versus unconstrained gradient boosting models for credit probability of default across five public datasets and three libraries. We define the Price of Monotonicity (PoM) as the relative change in standard performance metrics when moving from unconstrained to constrained models, estimated via paired comparisons with bootstrap uncertainty. In our experiments, PoM in AUC ranges from essentially zero to about 2.9 percent: constraints are almost costless on large datasets (typically less than 0.2 percent, often indistinguishable from zero) and most costly on smaller datasets with extensive constraint coverage (around 2-3 percent). Thus, appropriately specified monotonicity constraints can often deliver interpretability with small accuracy losses, particularly in large-scale credit portfolios.
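The Price of Monotonicity lends itself to a short worked sketch. The code below is an illustration under assumptions, not the paper's replication code: it takes PoM in AUC to be the relative AUC loss `(AUC_unconstrained - AUC_constrained) / AUC_unconstrained` (the sign convention is assumed), and it uses a plain percentile bootstrap over paired resamples. All function names are hypothetical.

```python
import random

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def price_of_monotonicity(labels, unconstrained, constrained):
    """Relative AUC loss when moving from the unconstrained to the constrained model."""
    a_u = auc(labels, unconstrained)
    a_c = auc(labels, constrained)
    return (a_u - a_c) / a_u  # assumes a_u > 0

def bootstrap_pom(labels, unconstrained, constrained, n_boot=200, seed=0):
    """Percentile bootstrap interval; both models are scored on the same resampled rows."""
    rng = random.Random(seed)
    n = len(labels)
    draws = []
    while len(draws) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        draws.append(price_of_monotonicity(
            ys, [unconstrained[i] for i in idx], [constrained[i] for i in idx]))
    draws.sort()
    return draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot) - 1]

# Toy data: the constrained model misranks one positive/negative pair.
y = [0, 0, 1, 1]
free_scores = [0.1, 0.2, 0.8, 0.9]
mono_scores = [0.1, 0.8, 0.2, 0.9]
print(price_of_monotonicity(y, free_scores, mono_scores))
print(bootstrap_pom(y, free_scores, mono_scores))
```

Pairing matters here: resampling the same rows for both models removes the between-sample noise that would otherwise swamp differences of a fraction of a percent.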
[LG-95] chatter: a Python library for applying information theory and AI/ML models to animal communication
Link: https://arxiv.org/abs/2512.17935
Authors: Mason Youngblood
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:
Abstract:The study of animal communication often involves categorizing units into types (e.g. syllables in songbirds, or notes in humpback whales). While this approach is useful in many cases, it necessarily flattens the complexity and nuance present in real communication systems. chatter is a new Python library for analyzing animal communication in continuous latent space using information theory and modern machine learning techniques. It is taxonomically agnostic, and has been tested with the vocalizations of birds, bats, whales, and primates. By leveraging a variety of different architectures, including variational autoencoders and vision transformers, chatter represents vocal sequences as trajectories in high-dimensional latent space, bypassing the need for manual or automatic categorization of units. The library provides an end-to-end workflow – from preprocessing and segmentation to model training and feature extraction – that enables researchers to quantify the complexity, predictability, similarity, and novelty of vocal sequences.
[LG-96] QAISim: A Toolkit for Modeling and Simulation of AI in Quantum Cloud Computing Environments
Link: https://arxiv.org/abs/2512.17918
Authors: Irwindeep Singh, Sukhpal Singh Gill, Jinzhao Sun, Jan Mol
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: Preprint Version Accepted for Publication in Springer Cluster Computing Journal, 2026
Abstract:Quantum computing offers new ways to explore the theory of computation via the laws of quantum mechanics. Due to the rising demand for quantum computing resources, there is growing interest in developing cloud-based quantum resource sharing platforms that enable researchers to test and execute their algorithms on real quantum hardware. These cloud-based systems face a fundamental challenge in efficiently allocating quantum hardware resources to fulfill the growing computational demand of modern Internet of Things (IoT) applications. So far, attempts at efficient resource allocation have ranged from heuristic-based solutions to machine learning. In this work, we employ quantum reinforcement learning based on parameterized quantum circuits to address the resource allocation problem to support large IoT networks. We propose a Python-based toolkit called QAISim for the simulation and modeling of Quantum Artificial Intelligence (QAI) models for designing resource management policies in quantum cloud environments. We have simulated policy gradient and Deep Q-Learning algorithms for reinforcement learning. QAISim exhibits a substantial reduction in model complexity compared to its classical counterparts with fewer trainable variables.
[LG-97] Active Convolved Illumination with Deep Transfer Learning for Complex Beam Transmission through Atmospheric Turbulence
Link: https://arxiv.org/abs/2512.19540
Authors: Adrian A. Moazzam, Anindya Ghoshroy, Breeanne Heusdens, Durdu O. Guney, Roohollah Askari
Categories: Optics (physics.optics); Machine Learning (cs.LG)
*Comments:
Abstract:Atmospheric turbulence imposes a fundamental limitation across a broad range of applications, including optical imaging, remote sensing, and free-space optical communication. Recent advances in adaptive optics, wavefront shaping, and machine learning, driven by synergistic progress in fundamental theories, optoelectronic hardware, and computational algorithms, have demonstrated substantial potential in mitigating turbulence-induced distortions. Recently, active convolved illumination (ACI) was proposed as a versatile and physics-driven technique for transmitting structured light beams with minimal distortion through highly challenging turbulent regimes. While distinct in its formulation, ACI shares conceptual similarities with other physics-driven distortion correction approaches and stands to benefit from complementary integration with data-driven deep learning (DL) models. Inspired by recent work coupling deep learning with traditional turbulence mitigation strategies, the present work investigates the feasibility of integrating ACI with neural network-based methods. We outline a conceptual framework for coupling ACI with data-driven models and identify conditions under which learned representations can meaningfully support ACI’s correlation-injection mechanism. As a representative example, we employ a convolutional neural network (CNN) together with a transfer-learning approach to examine how a learned model may operate in tandem with ACI. This exploratory study demonstrates feasible implementation pathways and establishes an early foundation for assessing the potential of future ACI-DL hybrid architectures, representing a step toward evaluating broader synergistic interactions between ACI and modern DL models.
[LG-98] Real-Time Streamable Generative Speech Restoration with Flow Matching
Link: https://arxiv.org/abs/2512.19442
Authors: Simon Welker, Bunlong Lay, Maris Hillemann, Tal Peer, Timo Gerkmann
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD)
*Comments: This work has been submitted to the IEEE for possible publication
Abstract:Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present this http URL, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. this http URL can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, this http URL establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency. 
[LG-99] A Critical Assessment of Pattern Comparisons Between POD and Autoencoders in Intraventricular Flows
Link: https://arxiv.org/abs/2512.19376
Authors: Eneko Lazpita, Andrés Bell-Navas, Jesús Garicano-Mena, Petros Koumoutsakos, Soledad Le Clainche
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*Comments: 27 pages, 9 figures
Abstract:Understanding intraventricular hemodynamics requires compact and physically interpretable representations of the underlying flow structures, as characteristic flow patterns are closely associated with cardiovascular conditions and can support early detection of cardiac deterioration. Conventional visualization of velocity or pressure fields, however, provides limited insight into the coherent mechanisms driving these dynamics. Reduced-order modeling techniques, like Proper Orthogonal Decomposition (POD) and Autoencoder (AE) architectures, offer powerful alternatives to extract dominant flow features from complex datasets. This study systematically compares POD with several AE variants (Linear, Nonlinear, Convolutional, and Variational) using left ventricular flow fields obtained from computational fluid dynamics simulations. We show that, for a suitably chosen latent dimension, AEs produce modes that become nearly orthogonal and qualitatively resemble POD modes that capture a given percentage of kinetic energy. As the number of latent modes increases, AE modes progressively lose orthogonality, leading to linear dependence, spatial redundancy, and the appearance of repeated modes with substantial high-frequency content. This degradation reduces interpretability and introduces noise-like components into AE-based reduced-order models, potentially complicating their integration with physics-based formulations or neural-network surrogates. The extent of interpretability loss varies across the AEs, with nonlinear, convolutional, and variational models exhibiting distinct behaviors in orthogonality preservation and feature localization. Overall, the results indicate that AEs can reproduce POD-like coherent structures under specific latent-space configurations, while highlighting the need for careful mode selection to ensure physically meaningful representations of cardiac flow dynamics.
[LG-100] Cluster-Based Generalized Additive Models Informed by Random Fourier Features
Link: https://arxiv.org/abs/2512.19373
Authors: Xin Huang, Jia Li, Jun Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 25 pages, 13 figures, 4 tables
Abstract:Explainable machine learning aims to strike a balance between prediction accuracy and model transparency, particularly in settings where black-box predictive models, such as deep neural networks or kernel-based methods, achieve strong empirical performance but remain difficult to interpret. This work introduces a mixture of generalized additive models (GAMs) in which random Fourier feature (RFF) representations are leveraged to uncover locally adaptive structure in the data. In the proposed method, an RFF-based embedding is first learned and then compressed via principal component analysis. The resulting low-dimensional representations are used to perform soft clustering of the data through a Gaussian mixture model. These cluster assignments are then applied to construct a mixture-of-GAMs framework, where each local GAM captures nonlinear effects through interpretable univariate smooth functions. Numerical experiments on real-world regression benchmarks, including the California Housing, NASA Airfoil Self-Noise, and Bike Sharing datasets, demonstrate improved predictive performance relative to classical interpretable models. Overall, this construction provides a principled approach for integrating representation learning with transparent statistical modeling.
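The random Fourier feature step mentioned in the abstract can be sketched on its own. The code below shows the standard RFF construction for the Gaussian kernel exp(-gamma * ||x - y||^2), where the inner product of two feature vectors approximates the kernel value. It is not the paper's pipeline (which, per the abstract, learns the embedding and then applies PCA, a Gaussian mixture, and local GAMs); the function names and the value of `gamma` are illustrative assumptions.

```python
import math
import random

def make_rff(dim, n_features, gamma, seed=0):
    """Return a feature map phi with phi(x) . phi(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 2*gamma*I), phases uniformly from [0, 2*pi).
    freqs = [[rng.gauss(0.0, math.sqrt(2.0 * gamma)) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return phi

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

phi = make_rff(dim=2, n_features=2000, gamma=0.5)
x, y = [0.0, 0.0], [1.0, 0.0]
approx = dot(phi(x), phi(y))
exact = math.exp(-0.5 * 1.0)  # gamma * squared distance = 0.5 * 1
print(approx, exact)
```

The approximation error shrinks roughly as one over the square root of the number of features, which is why a few thousand features are used in this toy comparison.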
[LG-101] Self-Consistent Probability Flow for High-Dimensional Fokker-Planck Equations
Link: https://arxiv.org/abs/2512.19196
Authors: Xiaolong Wu, Qifeng Liao
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:
Abstract:Solving high-dimensional Fokker-Planck (FP) equations is a challenge in computational physics and stochastic dynamics, due to the curse of dimensionality (CoD) and the bottleneck of evaluating second-order diffusion terms. Existing deep learning approaches, such as Physics-Informed Neural Networks (PINNs), face computational challenges as dimensionality increases, driven by the O(D^2) complexity of automatic differentiation for second-order derivatives. While recent probability flow approaches bypass this by learning score functions or matching velocity fields, they often involve serial computational operations or depend on sampling efficiency in complex distributions. To address these issues, we propose the Self-Consistent Probability Flow (SCPF) method. We reformulate the second-order FP equation into an equivalent first-order deterministic Probability Flow ODE (PF-ODE) constraint. Unlike score matching or velocity matching, SCPF solves this problem by minimizing the residual of the PF-ODE continuity equation, which avoids explicit Hessian computation. We leverage Continuous Normalizing Flows (CNF) combined with the Hutchinson Trace Estimator (HTE) to reduce the training complexity to linear scale O(D), achieving an effective O(1) wall-clock time on GPUs. To address data sparsity in high dimensions, we apply a generative adaptive sampling strategy and theoretically prove that dynamically aligning collocation points with the evolving probability mass is a necessary condition to bound the approximation error. Experiments on diverse benchmarks, ranging from anisotropic Ornstein-Uhlenbeck (OU) processes and high-dimensional Brownian motions with time-varying diffusion terms to Geometric OU processes featuring non-Gaussian solutions, demonstrate that SCPF effectively mitigates the CoD, maintaining high accuracy and constant computational cost for problems up to 100 dimensions.
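The Hutchinson Trace Estimator named in the abstract is easy to show in isolation. The sketch below estimates tr(A) as the average of v^T A v over random Rademacher vectors v, using only matrix-vector products. In a continuous normalizing flow the matrix would be the Jacobian of the velocity field and the product would come from automatic differentiation; here an explicit small matrix stands in for it, and the function names are illustrative.

```python
import random

def matvec(matrix, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

def hutchinson_trace(matvec_fn, dim, n_probes=1000, seed=0):
    """Estimate the trace of a linear operator from products v -> A v only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec_fn(v)
        total += sum(v_k * av_k for v_k, av_k in zip(v, av))
    return total / n_probes

A = [[2.0, 1.0], [1.0, 3.0]]  # true trace is 5
print(hutchinson_trace(lambda v: matvec(A, v), dim=2, n_probes=2000))
```

Each probe costs one operator application regardless of dimension, which is the source of the linear scaling: the exact trace of a D-by-D Jacobian would otherwise need D such products.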
[LG-102] Finite-sample guarantees for data-driven forward-backward operator methods
Link: https://arxiv.org/abs/2512.19172
Authors: Filippo Fabiani, Barbara Franci
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:
Abstract:We establish finite sample certificates on the quality of solutions produced by data-based forward-backward (FB) operator splitting schemes. As frequently happens in stochastic regimes, we consider the problem of finding a zero of the sum of two operators, where one is either unavailable in closed form or computationally expensive to evaluate, and shall therefore be approximated using a finite number of noisy oracle samples. Under the lens of algorithmic stability, we then derive probabilistic bounds on the distance between a true zero and the FB output without making specific assumptions about the underlying data distribution. We show that under weaker conditions ensuring the convergence of FB schemes, stability bounds grow proportionally to the number of iterations. Conversely, stronger assumptions yield stability guarantees that are independent of the iteration count. We then specialize our results to a popular FB stochastic Nash equilibrium seeking algorithm and validate our theoretical bounds on a control problem for smart grids, where the energy price uncertainty is approximated by means of historical data.
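A forward-backward iteration with a sample-based operator can be sketched on a one-dimensional toy problem. The example below is an assumption-laden illustration, not the paper's setting: it minimizes E[0.5 * (x - xi)^2] + lam * |x|, replaces the expectation with an average over finite samples of xi (the forward, gradient step), and uses soft-thresholding as the backward, proximal step. The names `forward_backward` and `soft_threshold` are hypothetical.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x| (the backward step)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def forward_backward(samples, lam, step=0.5, n_iter=100):
    """Iterate x <- prox_{step*lam*|.|}(x - step * A_hat(x)).

    A_hat(x) = x - mean(samples) is the sample-average approximation of the
    gradient of E[0.5 * (x - xi)^2].
    """
    sample_mean = sum(samples) / len(samples)
    x = 0.0
    for _ in range(n_iter):
        forward = x - step * (x - sample_mean)   # forward (explicit) step
        x = soft_threshold(forward, step * lam)  # backward (proximal) step
    return x

# With sample mean 2.0 and lam = 0.5 the zero of the approximated problem is 1.5.
print(forward_backward([1.0, 2.0, 3.0], lam=0.5))
```

The output depends on the samples only through their average, so the distance between this solution and the true zero is governed by how well the finite sample represents the underlying distribution, which is the quantity that finite-sample certificates bound.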
[LG-103] Explicit and Non-asymptotic Query Complexities of Rank-Based Zeroth-order Algorithm on Stochastic Smooth Functions
Link: https://arxiv.org/abs/2512.19104
Authors: Haishan Ye
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:
Abstract:Zeroth-order (ZO) optimization with ordinal feedback has emerged as a fundamental problem in modern machine learning systems, particularly in human-in-the-loop settings such as reinforcement learning from human feedback, preference learning, and evolutionary strategies. While rank-based ZO algorithms enjoy strong empirical success and robustness properties, their theoretical understanding, especially under stochastic objectives and standard smoothness assumptions, remains limited. In this paper, we study rank-based zeroth-order optimization for stochastic functions where only ordinal feedback of the stochastic function is available. We propose a simple and computationally efficient rank-based ZO algorithm. Under standard assumptions including smoothness, strong convexity, and bounded second moments of stochastic gradients, we establish explicit non-asymptotic query complexity bounds for both convex and nonconvex objectives. Notably, our results match the best-known query complexities of value-based ZO algorithms, demonstrating that ordinal information alone is sufficient for optimal query efficiency in stochastic settings. Our analysis departs from existing drift-based and information-geometric techniques, offering new tools for the study of rank-based optimization under noise. These findings narrow the gap between theory and practice and provide a principled foundation for optimization driven by human preferences.
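To make "ordinal feedback" concrete, here is a generic rank-based zeroth-order loop. It is not the algorithm proposed in the paper: the optimizer never sees function values, only the ordering of candidate points returned by a hypothetical `rank_oracle`, and it moves along a rank-weighted combination of random directions in the style of evolution strategies. Step sizes and the weighting scheme are arbitrary illustrative choices.

```python
import random

def rank_oracle(f, points):
    """Return candidate indices ordered from best (lowest f) to worst.

    The caller receives the ordering only, never the function values.
    """
    return sorted(range(len(points)), key=lambda i: f(points[i]))

def rank_based_zo(f, x0, n_iter=500, n_dirs=8, mu=0.1, eta=0.05, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    dim = len(x)
    for _ in range(n_iter):
        dirs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_dirs)]
        candidates = [[x_k + mu * u_k for x_k, u_k in zip(x, u)] for u in dirs]
        order = rank_oracle(f, candidates)
        for rank, i in enumerate(order):
            # Linear rank weights: +1 for the best candidate, -1 for the worst.
            weight = (n_dirs - 1 - 2 * rank) / (n_dirs - 1)
            for k in range(dim):
                x[k] += eta / n_dirs * weight * dirs[i][k]
    return x

def sphere(x):
    return sum(x_k * x_k for x_k in x)

x_final = rank_based_zo(sphere, [3.0, -2.0])
print(sphere([3.0, -2.0]), sphere(x_final))
```

Because only the ordering enters the update, the procedure is unchanged under any strictly increasing transformation of the objective, which is one reason rank-based methods suit preference-style feedback.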
[LG-104] On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension Reduction
Link: https://arxiv.org/abs/2512.18971
Authors: Shuntuo Xu, Zhou Yu, Jian Huang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Identifying low-dimensional sufficient structures in nonlinear sufficient dimension reduction (SDR) has long been a fundamental yet challenging problem. Most existing methods lack theoretical guarantees of exhaustiveness in identifying lower dimensional structures, either at the population level or at the sample level. We tackle this issue by proposing a new method, generative sufficient dimension reduction (GenSDR), which leverages modern generative models. We show that GenSDR is able to fully recover the information contained in the central σ-field at both the population and sample levels. In particular, at the sample level, we establish a consistency property for the GenSDR estimator from the perspective of conditional distributions, capitalizing on the distributional learning capabilities of deep generative models. Moreover, by incorporating an ensemble technique, we extend GenSDR to accommodate scenarios with non-Euclidean responses, thereby substantially broadening its applicability. Extensive numerical results demonstrate the outstanding empirical performance of GenSDR and highlight its strong potential for addressing a wide range of complex, real-world tasks.
[LG-105] RIS-Enabled Smart Wireless Environments: Fundamentals and Distributed Optimization
Link: https://arxiv.org/abs/2512.18788
Authors: George C. Alexandropoulos, Kostantinos D. Katsanos, George Stamatelis, Ioannis Gavras
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: 48 pages; 12 figures; book chapter
Abstract:This chapter overviews the concept of Smart Wireless Environments (SWEs) motivated by the emerging technology of Reconfigurable Intelligent Surfaces (RISs). The operating principles and state-of-the-art hardware architectures of programmable metasurfaces are first introduced. Subsequently, key performance objectives and use cases of RIS-enabled SWEs, including spectral and energy efficiency, physical-layer security, integrated sensing and communications, as well as the emerging paradigm of over-the-air computing, are discussed. Focusing on the recent trend of Beyond-Diagonal (BD) RISs, two distributed designs of respective SWEs are presented. The first deals with a multi-user Multiple-Input Single-Output (MISO) system operating within the area of influence of a SWE comprising multiple BD-RISs. A hybrid distributed and fusion machine learning framework based on multi-branch attention-based convolutional Neural Networks (NNs), NN parameter sharing, and neuroevolutionary training is presented, which enables online mapping of channel realizations to the BD-RIS configurations as well as the multi-user transmit precoder. Performance evaluation results showcase that the distributedly optimized RIS-enabled SWE achieves near-optimal sum-rate performance with low online computational complexity. The second design focuses on the wideband interference MISO broadcast channel, where each base station exclusively controls one BD-RIS to serve its assigned group of users. A cooperative optimization framework that jointly designs the base station transmit precoders as well as the tunable capacitances and switch matrices of all metasurfaces is presented. Numerical results demonstrating the superior sum-rate performance of the designed RIS-enabled SWE for multi-cell MISO networks over benchmark schemes, considering non-cooperative configuration and conventional diagonal metasurfaces, are presented.
[LG-106] Unsupervised Feature Selection via Robust Autoencoder and Adaptive Graph Learning
Link: https://arxiv.org/abs/2512.18720
Authors: Feng Yu, MD Saifur Rahman Mazumder, Ying Su, Oscar Contreras Velasco
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Effective feature selection is essential for high-dimensional data analysis and machine learning. Unsupervised feature selection (UFS) aims to simultaneously cluster data and identify the most discriminative features. Most existing UFS methods linearly project features into a pseudo-label space for clustering, but they suffer from two critical limitations: (1) an oversimplified linear mapping that fails to capture complex feature relationships, and (2) an assumption of uniform cluster distributions, ignoring outliers prevalent in real-world data. To address these issues, we propose the Robust Autoencoder-based Unsupervised Feature Selection (RAEUFS) model, which leverages a deep autoencoder to learn nonlinear feature representations while inherently improving robustness to outliers. We further develop an efficient optimization algorithm for RAEUFS. Extensive experiments demonstrate that our method outperforms state-of-the-art UFS approaches in both clean and outlier-contaminated data settings.
[LG-107] Pushing the limits of one-dimensional NMR spectroscopy for automated structure elucidation using artificial intelligence
Link: https://arxiv.org/abs/2512.18531
Authors: Frank Hu, Jonathan M. Tubb, Dimitris Argyropoulos, Sergey Golotvin, Mikhail Elyashberg, Grant M. Rotskoff, Matthew W. Kanan, Thomas E. Markland
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*Comments:
Abstract:One-dimensional NMR spectroscopy is one of the most widely used techniques for the characterization of organic compounds and natural products. For molecules with up to 36 non-hydrogen atoms, the number of possible structures has been estimated to range from 10^20 to 10^60. The task of determining the structure (formula and connectivity) of a molecule of this size using only its one-dimensional ^1H and/or ^13C NMR spectrum, i.e. de novo structure generation, thus appears completely intractable. Here we show how it is possible to achieve this task for systems with up to 40 non-hydrogen atoms across the full elemental coverage typically encountered in organic chemistry (C, N, O, H, P, S, Si, B, and the halogens) using a deep learning framework, thus covering a vast portion of the drug-like chemical space. Leveraging insights from natural language processing, we show that our transformer-based architecture predicts the correct molecule with 55.2% accuracy within the first 15 predictions using only the ^1H and ^13C NMR spectra, thus overcoming the combinatorial growth of the chemical space while also being extensible to experimental data via fine-tuning.
[LG-108] PSI3D: Plug-and-Play 3D Stochastic Inference with Slice-wise Latent Diffusion Prior
Link: https://arxiv.org/abs/2512.18367
Authors: Wenhan Guo, Jinglun Yu, Yaning Wang, Jin U. Kang, Yu Sun
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*Comments: 10 pages, 3 figures
Abstract:Diffusion models are highly expressive image priors for Bayesian inverse problems. However, most diffusion models cannot operate on large-scale, high-dimensional data due to high training and inference costs. In this work, we introduce a Plug-and-play algorithm for 3D stochastic inference with latent diffusion prior (PSI3D) to address massive (1024 × 1024 × 128) volumes. Specifically, we formulate a Markov chain Monte Carlo approach to reconstruct each two-dimensional (2D) slice by sampling from a 2D latent diffusion model. To enhance inter-slice consistency, we also incorporate total variation (TV) regularization stochastically along the concatenation axis. We evaluate our performance on optical coherence tomography (OCT) super-resolution. Our method significantly improves reconstruction quality for large-scale scientific imaging compared to traditional and learning-based baselines, while providing robust and credible reconstructions.
[LG-109] CrystalFormer-CSP: Thinking Fast and Slow for Crystal Structure Prediction
Link: https://arxiv.org/abs/2512.18251
Authors: Zhendong Cao, Shigang Ou, Lei Wang
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments: 11 pages, 4 figures
Abstract:Crystal structure prediction is a fundamental problem in materials science. We present CrystalFormer-CSP, an efficient framework that unifies data-driven heuristic and physics-driven optimization approaches to predict stable crystal structures for given chemical compositions. The approach combines pretrained generative models for space-group-informed structure generation and a universal machine learning force field for energy minimization. Reinforcement fine-tuning can be employed to further boost the accuracy of the framework. We demonstrate the effectiveness of CrystalFormer-CSP on benchmark problems and showcase its usage via web interface and language model integration.
[LG-110] Estimating Solvation Free Energies with Boltzmann Generators
Link: https://arxiv.org/abs/2512.18147
Authors: Maximilian Schebek, Nikolas M. Froböse, Bettina G. Keller, Jutta Rogal
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:
Abstract:Accurate calculations of solvation free energies remain a central challenge in molecular simulations, often requiring extensive sampling and numerous alchemical intermediates to ensure sufficient overlap between phase-space distributions of a solute in the gas phase and in solution. Here, we introduce a computational framework based on normalizing flows that directly maps solvent configurations between solutes of different sizes, and compare the accuracy and efficiency to conventional free energy estimates. For a Lennard-Jones solvent, we demonstrate that this approach yields acceptable accuracy in estimating free energy differences for challenging transformations, such as solute growth or increased solute-solute separation, which typically demand multiple intermediate simulation steps along the transformation. Analysis of radial distribution functions indicates that the flow generates physically meaningful solvent rearrangements, substantially enhancing configurational overlap between states in configuration space. These results suggest flow-based models as a promising alternative to traditional free energy estimation methods.
[LG-111] Exploring polymer classification with a hybrid single-photon quantum approach
Link: https://arxiv.org/abs/2512.18125
Authors: Alexandrina Stoyanova, Bogdan Penkovsky
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:
Abstract:Polymers exhibit complex architectures and diverse properties that place them at the center of contemporary research in chemistry and materials science. As conventional computational techniques, even multi-scale ones, struggle to capture this complexity, quantum computing offers a promising alternative framework for extracting structure-property relationships. Noisy Intermediate-Scale Quantum (NISQ) devices are commonly used to explore the implementation of algorithms, including quantum neural networks for classification tasks, despite ongoing debate regarding their practical impact. We present a hybrid classical-quantum formalism that couples a classical deep neural network for polymer featurization with a single-photon-based quantum classifier native to photonic quantum computing. This pipeline successfully classifies polymer species by their optical gap, with consistent performance between CPU-based noisy simulations and a proof-of-principle run on Quandela’s Ascella quantum processor. These findings demonstrate the effectiveness of the proposed computational workflow and indicate that chemistry-related classification tasks can already be tackled under the constraints of today’s NISQ devices.
[LG-112] Causal Inference as Distribution Adaptation: Optimizing ATE Risk under Propensity Uncertainty
Link: https://arxiv.org/abs/2512.18083
Authors: Ashley Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Standard approaches to causal inference, such as Outcome Regression and Inverse Probability Weighted Regression Adjustment (IPWRA), are typically derived through the lens of missing data imputation and identification theory. In this work, we unify these methods from a Machine Learning perspective, reframing ATE estimation as a domain adaptation problem under distribution shift. We demonstrate that the canonical Hajek estimator is a special case of IPWRA restricted to a constant hypothesis class, and that IPWRA itself is fundamentally Importance-Weighted Empirical Risk Minimization designed to correct for the covariate shift between the treated sub-population and the target population. Leveraging this unified framework, we critically examine the optimization objectives of Doubly Robust estimators. We argue that standard methods enforce sufficient but not necessary conditions for consistency by requiring outcome models to be individually unbiased. We define the true “ATE Risk Function” and show that minimizing it requires only that the biases of the treated and control models structurally cancel out. Exploiting this insight, we propose the Joint Robust Estimator (JRE). Instead of treating propensity estimation and outcome modeling as independent stages, JRE utilizes bootstrap-based uncertainty quantification of the propensity score to train outcome models jointly. By optimizing for the expected ATE risk over the distribution of propensity scores, JRE leverages model degrees of freedom to achieve robustness against propensity misspecification. Simulation studies demonstrate that JRE achieves up to a 15% reduction in MSE compared to standard IPWRA in finite-sample regimes with misspecified outcome models.
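The link between the Hajek estimator and importance-weighted risk minimization over constants can be checked numerically. The sketch below is an illustration with made-up toy data, not the paper's JRE: it computes the Hajek ATE as a difference of inverse-propensity-weighted means, and separately confirms that a weighted mean is the constant minimizing the weighted squared error. Function names and data are assumptions.

```python
def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def weighted_sq_loss(c, values, weights):
    """Importance-weighted squared error of the constant prediction c."""
    return sum(w * (v - c) ** 2 for w, v in zip(weights, values))

def hajek_ate(treatment, outcome, propensity):
    """Hajek (self-normalized IPW) estimate of the average treatment effect."""
    treated = [i for i, t in enumerate(treatment) if t == 1]
    control = [i for i, t in enumerate(treatment) if t == 0]
    mu1 = weighted_mean([outcome[i] for i in treated],
                        [1.0 / propensity[i] for i in treated])
    mu0 = weighted_mean([outcome[i] for i in control],
                        [1.0 / (1.0 - propensity[i]) for i in control])
    return mu1 - mu0

# Toy data (hypothetical): treatment flags, outcomes, and propensity scores.
T = [1, 1, 0, 0]
Y = [3.0, 5.0, 1.0, 2.0]
E = [0.5, 0.25, 0.5, 0.25]
print(hajek_ate(T, Y, E))

# The treated-arm weighted mean beats nearby constants on the weighted loss.
w_treated = [1.0 / 0.5, 1.0 / 0.25]
y_treated = [3.0, 5.0]
best = weighted_mean(y_treated, w_treated)
print(weighted_sq_loss(best, y_treated, w_treated),
      weighted_sq_loss(best + 0.1, y_treated, w_treated))
```

Replacing the constant with a regression model fitted under the same weights gives the regression-adjusted version, which is the sense in which the constant case is a restriction of the hypothesis class.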
[LG-113] Long-range electrostatics for machine learning interatomic potentials is easier than we thought
Link: https://arxiv.org/abs/2512.18029
Authors: Dongjin Kim, Bingqing Cheng
Subjects: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:
Abstract:The lack of long-range electrostatics is a key limitation of modern machine learning interatomic potentials (MLIPs), hindering reliable applications to interfaces, charge-transfer reactions, polar and ionic materials, and biomolecules. In this Perspective, we distill two design principles behind the Latent Ewald Summation (LES) framework, which can capture long-range interactions, charges, and electrical response just by learning from standard energy and force training data: (i) use a Coulomb functional form with environment-dependent charges to capture electrostatic interactions, and (ii) avoid explicit training on ambiguous density functional theory (DFT) partial charges. When both principles are satisfied, substantial flexibility remains: essentially any short-range MLIP can be augmented; charge equilibration schemes can be added when desired; dipoles and Born effective charges can be inferred or finetuned; and charge/spin-state embeddings or tensorial targets can be further incorporated. We also discuss current limitations and open challenges. Together, these minimal, physics-guided design rules suggest that incorporating long-range electrostatics into MLIPs is simpler and perhaps more broadly applicable than is commonly assumed.
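Design principle (i) in the abstract, a Coulomb functional form with environment-dependent charges, can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_latent_charges` is a hand-written stand-in for a learned charge model, the Coulomb constant is set to 1, and a direct pairwise sum over a finite cluster replaces the Ewald summation that LES uses for periodic systems.

```python
import math


def toy_latent_charges(positions, species, cutoff=3.0):
    """Hypothetical stand-in for a learned, environment-dependent charge model.

    Each charge starts from a per-species value and is scaled by the number
    of neighbors within `cutoff`; the result is shifted to be charge neutral.
    """
    base = {"Na": 1.0, "Cl": -1.0}  # illustrative values only
    charges = []
    for i, (p_i, s_i) in enumerate(zip(positions, species)):
        neighbors = sum(
            1
            for j, p_j in enumerate(positions)
            if j != i and math.dist(p_i, p_j) < cutoff
        )
        charges.append(base[s_i] * (1.0 - 0.05 * neighbors))
    mean_q = sum(charges) / len(charges)
    return [q - mean_q for q in charges]


def coulomb_energy(positions, charges):
    """Direct pairwise Coulomb sum E = sum_{i<j} q_i * q_j / r_ij (constant = 1)."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += charges[i] * charges[j] / math.dist(positions[i], positions[j])
    return energy


# Hypothetical two-ion cluster.
positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
species = ["Na", "Cl"]
charges = toy_latent_charges(positions, species)
energy = coulomb_energy(positions, charges)
print(charges, energy)
```

In a setup following principle (ii), such charges would not be fitted to reference partial charges; they would be learned only through their effect on the total energy and forces.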
[LG-114] Shuttling Compiler for Trapped-Ion Quantum Computers Based on Large Language Models
Link: https://arxiv.org/abs/2512.18021
Authors: Fabian Kreppel, Reza Salkhordeh, Ferdinand Schmidt-Kaler, André Brinkmann
Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 21 pages, 5 figures, 2 tables
Abstract:Trapped-ion quantum computers based on segmented traps rely on shuttling operations to establish connectivity between multiple sub-registers within a quantum processing unit. Several architectures of increasing complexity have already been realized, including linear arrays, racetrack loops, and junction-based layouts. As hardware capabilities advance, the need arises for flexible software layers within the control stack to manage qubit routing, the process of dynamically reconfiguring qubit positions so that all qubits involved in a gate operation are co-located within the same segment. Existing approaches typically employ architecture-specific heuristics, which become impractical as system complexity grows. To address this challenge, we propose a layout-independent compilation strategy based on large language models (LLMs). Specifically, we fine-tune pretrained LLMs to generate the required shuttling operations. We evaluate this approach on both linear and branched one-dimensional architectures, demonstrating that it provides a foundation for developing LLM-based shuttling compilers for trapped-ion quantum computers.
[LG-115] MEGState: Phoneme Decoding from Magnetoencephalography Signals NEURIPS2025
Link: https://arxiv.org/abs/2512.17978
Authors: Shuntaro Suzuki, Chia-Chun Dan Hsu, Yu Tsao, Komei Sugiura
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted for presentation at LibriBrain Competition, NeurIPS 2025
Abstract:Decoding linguistically meaningful representations from non-invasive neural recordings remains a central challenge in neural speech decoding. Among available neuroimaging modalities, magnetoencephalography (MEG) provides a safe and repeatable means of mapping speech-related cortical dynamics, yet its low signal-to-noise ratio and high temporal dimensionality continue to hinder robust decoding. In this work, we introduce MEGState, a novel architecture for phoneme decoding from MEG signals that captures fine-grained cortical responses evoked by auditory stimuli. Extensive experiments on the LibriBrain dataset demonstrate that MEGState consistently surpasses the baseline model across multiple evaluation metrics. These findings highlight the potential of MEG-based phoneme decoding as a scalable pathway toward non-invasive brain-computer interfaces for speech.
[LG-116] Sampling from multimodal distributions with warm starts: Non-asymptotic bounds for the Reweighted Annealed Leap-Point Sampler
Link: https://arxiv.org/abs/2512.17977
Authors: Holden Lee, Matheau Santana-Gijzen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO)
Comments:
Abstract:Sampling from multimodal distributions is a central challenge in Bayesian inference and machine learning. In light of hardness results for sampling – classical MCMC methods, even with tempering, can suffer from exponential mixing times – a natural question is how to leverage additional information, such as a warm start point for each mode, to enable faster mixing across modes. To address this, we introduce Reweighted ALPS (Re-ALPS), a modified version of the Annealed Leap-Point Sampler (ALPS) that dispenses with the Gaussian approximation assumption. We prove the first polynomial-time bound that works in a general setting, under a natural assumption that each component contains significant mass relative to the others when tilted towards the corresponding warm start point. Similarly to ALPS, we define distributions tilted towards a mixture centered at the warm start points, and at the coldest level, use teleportation between warm start points to enable efficient mixing across modes. In contrast to ALPS, our method does not require Hessian information at the modes, but instead estimates component partition functions via Monte Carlo. This additional estimation step is crucial in allowing the algorithm to handle target distributions with more complex geometries besides approximate Gaussian. For the proof, we show convergence results for Markov processes when only part of the stationary distribution is well-mixing and estimation for partition functions for individual components of a mixture. We numerically evaluate our algorithm’s mixing performance compared to ALPS on a mixture of heavy-tailed distributions.
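The abstract says component partition functions are estimated via Monte Carlo rather than from Hessian information. The sketch below shows one generic way to do that, plain importance sampling with a Gaussian proposal centered at a warm start point. It is not the Re-ALPS estimator, whose details the abstract does not give; the function name, the two toy components, and the proposal scales are assumptions chosen so the true values are known.

```python
import math
import random


def estimate_partition_function(log_unnorm, center, scale, n, rng):
    """Importance-sampling estimate of Z = integral of exp(log_unnorm(x)) dx.

    Samples come from a Gaussian proposal N(center, scale^2); each sample is
    weighted by the ratio of the unnormalized target to the proposal density.
    """
    log_norm_q = math.log(scale * math.sqrt(2.0 * math.pi))
    total = 0.0
    for _ in range(n):
        x = rng.gauss(center, scale)
        log_q = -0.5 * ((x - center) / scale) ** 2 - log_norm_q
        total += math.exp(log_unnorm(x) - log_q)
    return total / n


rng = random.Random(0)

# Two hypothetical components with known partition functions:
# a unit-variance bump at x = 3 (Z = sqrt(2*pi)) and a narrower bump
# at x = -4 with standard deviation 0.5 (Z = 0.5 * sqrt(2*pi)).
z_wide = estimate_partition_function(
    lambda x: -0.5 * (x - 3.0) ** 2, center=3.0, scale=1.5, n=20000, rng=rng
)
z_narrow = estimate_partition_function(
    lambda x: -0.5 * ((x + 4.0) / 0.5) ** 2, center=-4.0, scale=1.0, n=20000, rng=rng
)

# Estimated share of total mass in the first component (true value is 2/3).
relative_mass = z_wide / (z_wide + z_narrow)
print(z_wide, z_narrow, relative_mass)
```

The proposal is deliberately wider than each target component so the importance weights stay bounded; with a heavy-tailed target and a Gaussian proposal the weights can have infinite variance, which is one reason estimator design matters in this setting.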
[LG-117] Risk-Aware Financial Forecasting Enhanced by Machine Learning and Intuitionistic Fuzzy Multi-Criteria Decision-Making
Link: https://arxiv.org/abs/2512.17936
Authors: Safiye Turgay, Serkan Erdoğan, Željko Stević, Orhan Emre Elma, Tevfik Eren, Zhiyuan Wang, Mahmut Baydaş
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:In the face of increasing financial uncertainty and market complexity, this study presents a novel risk-aware financial forecasting framework that integrates advanced machine learning techniques with intuitionistic fuzzy multi-criteria decision-making (MCDM). Tailored to the BIST 100 index and validated through a case study of a major defense company in Türkiye, the framework fuses structured financial data, unstructured text data, and macroeconomic indicators to enhance predictive accuracy and robustness. It incorporates a hybrid suite of models, including extreme gradient boosting (XGBoost), long short-term memory (LSTM) network, and graph neural network (GNN), to deliver probabilistic forecasts with quantified uncertainty. The empirical results demonstrate high forecasting accuracy, with a net profit mean absolute percentage error (MAPE) of 3.03% and narrow 95% confidence intervals for key financial indicators. The risk-aware analysis indicates a favorable risk-return profile, with a Sharpe ratio of 1.25 and a higher Sortino ratio of 1.80, suggesting relatively low downside volatility and robust performance under market fluctuations. Sensitivity analysis shows that the key financial indicator predictions are highly sensitive to variations of inflation, interest rates, sentiment, and exchange rates. Additionally, using an intuitionistic fuzzy MCDM approach, combining entropy weighting, evaluation based on distance from the average solution (EDAS), and the measurement of alternatives and ranking according to compromise solution (MARCOS) methods, the tabular data learning network (TabNet) outperforms the other models and is identified as the most suitable candidate for deployment. Overall, the findings of this work highlight the importance of integrating advanced machine learning, risk quantification, and fuzzy MCDM methodologies in financial forecasting, particularly in emerging markets.
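The three metrics quoted in the abstract (MAPE, Sharpe ratio, Sortino ratio) have several conventions in practice. The sketch below uses one common set, stated here as assumptions: no annualization, sample standard deviation for Sharpe, and downside deviation measured against a zero target over all periods for Sortino. The numbers are hypothetical and unrelated to the paper's results.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(variance)


def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation.

    Only returns below `target` contribute to the denominator, so upside
    volatility is not penalized.
    """
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(x, 0.0) ** 2 for x in excess) / len(excess))
    return mean / downside


# Hypothetical per-period returns and forecasts.
returns = [0.02, -0.01, 0.03, -0.02, 0.04]
print(mape([100.0, 200.0], [97.0, 206.0]))
print(sharpe_ratio(returns), sortino_ratio(returns))
```

A Sortino ratio above the Sharpe ratio, as in the abstract, indicates that much of the measured volatility comes from gains rather than losses.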
Information Retrieval
[IR-0] Generative vector search to improve pathology foundation models across multimodal vision-language tasks
Link: https://arxiv.org/abs/2512.19360
Authors: Markus Ekvall, Ludvig Bergenstråhle, Patrick Truong, Ben Murrell, Joakim Lundeberg
Subjects: Information Retrieval (cs.IR)
Comments: 13 pages main (54 total), 2 main figures (9 total)
Abstract:Retrieval-augmented generation improves large language models by grounding outputs in external knowledge sources, reducing hallucinations and addressing knowledge cutoffs. However, standard embedding-based retrieval fails to capture the complexity of multi-concept queries, particularly in domains like biomedicine, where biological data are inherently high-dimensional. For example, omics datasets and clinical reports simultaneously exhibit numerous molecular, cellular, and physiological features. We present Stochastic Latent Matching (STHLM), a generative vector search method that samples query-conditioned embeddings from text or image inputs to enhance retrieval performance. Analogous to how Chain-of-Thought reasoning enables language models to “think longer” on complex problems, STHLM allows retrieval systems to “search wider” through iterative sampling. STHLM demonstrates critical improvements over classical vector retrieval across diverse benchmarks, including scientific literature, clinical notes, and tissue images, boosting retrieval performance by 10-30% through test-time compute (trading latency for accuracy), while enabling up to a 10-fold compression of embedding dimensions.
[IR-1] Modular Layout Synthesis (MLS): Front-end Code via Structure Normalization and Constrained Generation
Link: https://arxiv.org/abs/2512.18996
Authors: Chong Liu, Ming Zhang, Fei Li, Hao Zhou, Xiaoshuang Chen, Ye Yuan
Subjects: Information Retrieval (cs.IR); Software Engineering (cs.SE)
Comments:
Abstract:Automated front-end engineering drastically reduces development cycles and minimizes manual coding overhead. While Generative AI has shown promise in translating designs to code, current solutions often produce monolithic scripts, failing to natively support modern ecosystems like React, Vue, or Angular. Furthermore, the generated code frequently suffers from poor modularity, making it difficult to maintain. To bridge this gap, we introduce Modular Layout Synthesis (MLS), a hierarchical framework that merges visual understanding with structural normalization. Initially, a visual-semantic encoder maps the screen capture into a serialized tree topology, capturing the essential layout hierarchy. Instead of simple parsing, we apply heuristic deduplication and pattern recognition to isolate reusable blocks, creating a framework-agnostic schema. Finally, a constraint-based generation protocol guides the LLM to synthesize production-ready code with strict typing and component props. Evaluations show that MLS significantly outperforms existing baselines, ensuring superior code reusability and structural integrity across multiple frameworks.
[IR-2] CIRR: Causal-Invariant Retrieval-Augmented Recommendation with Faithful Explanations under Distribution Shift
Link: https://arxiv.org/abs/2512.18683
Authors: Sebastian Sun
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Recent advances in retrieval-augmented generation (RAG) have shown promise in enhancing recommendation systems with external knowledge. However, existing RAG-based recommenders face two critical challenges: (1) vulnerability to distribution shifts across different environments (e.g., time periods, user segments), leading to performance degradation in out-of-distribution (OOD) scenarios, and (2) lack of faithful explanations that can be verified against retrieved evidence. In this paper, we propose CIRR, a Causal-Invariant Retrieval-Augmented Recommendation framework that addresses both challenges simultaneously. CIRR learns environment-invariant user preference representations through causal inference, which guide a debiased retrieval process to select relevant evidence from multiple sources. Furthermore, we introduce consistency constraints that enforce faithfulness between retrieved evidence, generated explanations, and recommendation outputs. Extensive experiments on two real-world datasets demonstrate that CIRR achieves robust performance under distribution shifts, reducing performance degradation from 15.4% (baseline) to only 5.6% in OOD scenarios, while providing more faithful and interpretable explanations (26% improvement in faithfulness score) compared to state-of-the-art baselines.
[IR-3] Efficient Optimization of Hierarchical Identifiers for Generative Recommendation ECIR2026
Link: https://arxiv.org/abs/2512.18434
Authors: Federica Valeau, Odysseas Boufalis, Polytimi Gkotsi, Joshua Rosenthal, David Vos
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at ECIR 2026 Reproducibility Track (to appear)
Abstract:SEATER is a generative retrieval model that improves recommendation inference efficiency and retrieval quality by utilizing balanced tree-structured item identifiers and contrastive training objectives. We reproduce and validate SEATER’s reported improvements in retrieval quality over strong baselines across all datasets from the original work, and extend the evaluation to Yambda, a large-scale music recommendation dataset. Our experiments verify SEATER’s strong performance, but show that its tree construction step during training becomes a major bottleneck as the number of items grows. To address this, we implement and evaluate two alternative construction algorithms: a greedy method optimized for minimal build time, and a hybrid method that combines greedy clustering at high levels with more precise grouping at lower levels. The greedy method reduces tree construction time to less than 2% of the original with only a minor drop in quality on the dataset with the largest item collection. The hybrid method achieves retrieval quality on par with the original, and even improves on the largest dataset, while cutting construction time to just 5-8%. All data and code are publicly available for full reproducibility at this https URL.
[IR-4] Improving Data Reusability in Interactive Information Retrieval: Insights from the Community
Link: https://arxiv.org/abs/2512.18283
Authors: Tianji Jiang, Wenqi Li, Jiqun Liu
Subjects: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: Accepted by CHIIR 2025
Abstract:In this study, we conducted semi-structured interviews with 21 IIR researchers to investigate their data reuse practices. This study aims to expand upon current findings by exploring IIR researchers’ information-obtaining behaviors regarding data reuse. We identified the information about shared data characteristics that IIR researchers need when evaluating data reusability, as well as the sources they typically consult to obtain this information. We consider this work to be an initial step toward revealing IIR researchers’ data reuse practices and identifying what the community needs to do to promote data reuse. We hope that this study, as well as future research, will inspire more individuals to contribute to ongoing efforts aimed at designing standards, infrastructures, and policies, as well as fostering a sustainable culture of data sharing and reuse in this field.
[IR-5] Factorized Transport Alignment for Multimodal and Multiview E-commerce Representation Learning WSDM’26
Link: https://arxiv.org/abs/2512.18117
Authors: Xiwen Chen, Yen-Chieh Lien, Susan Liu, María Castaños, Abolfazl Razi, Xiaoting Zhao, Congzhe Su
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by WSDM’26
Abstract:The rapid growth of e-commerce requires robust multimodal representations that capture diverse signals from user-generated listings. Existing vision-language models (VLMs) typically align titles with primary images, i.e., single-view, but overlook non-primary images and auxiliary textual views that provide critical semantics in open marketplaces such as Etsy or Poshmark. To this end, we propose a framework that unifies multimodal and multi-view learning through Factorized Transport, a lightweight approximation of optimal transport, designed for scalability and deployment efficiency. During training, the method emphasizes primary views while stochastically sampling auxiliary ones, reducing training cost from quadratic in the number of views to constant per item. At inference, all views are fused into a single cached embedding, preserving the efficiency of two-tower retrieval with no additional online overhead. On an industrial dataset of 1M product listings and 0.3M interactions, our approach delivers consistent improvements in cross-view and query-to-item retrieval, achieving up to +7.9% Recall@500 over strong multimodal baselines. Overall, our framework bridges scalability with optimal transport-based learning, making multi-view pretraining practical for large-scale e-commerce search.
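The headline number in this abstract is reported as Recall@500. For readers unfamiliar with the metric, a minimal sketch of Recall@k for a single query follows. The function name and the toy item IDs are illustrative, and the paper's exact evaluation protocol (for example, how per-query values are averaged) is not specified in the abstract.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of a ranking."""
    relevant = set(relevant_items)
    if not relevant:
        raise ValueError("relevant_items must not be empty")
    top_k = set(ranked_items[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking returned for one query, and its ground-truth items.
ranking = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}

print(recall_at_k(ranking, relevant, k=2))
print(recall_at_k(ranking, relevant, k=4))
```

A large cutoff such as k = 500 is typical for the candidate-retrieval stage of a two-tower system, where a later ranking stage reorders the retrieved candidates.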

